CN118797642A - A method and system for automatically generating APK data sets based on genetic engineering - Google Patents
- Publication number
- CN118797642A (application number CN202410896182.9A)
- Authority
- CN
- China
- Prior art keywords
- apk
- file
- genome
- gene
- submodule
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/561—Virus type analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/53—Decompilation; Disassembly
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/123—DNA computing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/12—Computing arrangements based on biological models using genetic models
- G06N3/126—Evolutionary algorithms, e.g. genetic algorithms or genetic programming
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- General Physics & Mathematics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- General Health & Medical Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Virology (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Data Mining & Analysis (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Genetics & Genomics (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Physiology (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an APK data set automatic generation method and system based on genetic engineering. The APK data set automatic generation method comprises the following steps: performing static analysis on the decompressed APK file; selecting part of APK file characteristics and expressing the selected part of APK file characteristics by means of a gene sequence; randomly selecting gene sequences in a genome set; adding a random gene sequence into a genome set for gene characteristic evaluation, and then putting the genome set into a genetic algorithm for training; updating the genome collection with the new genome; and remapping the gene sequences in the updated genome set back to corresponding files, and repackaging the Android application. The automatic generation system comprises an APK static analysis module, an APK gene generation module, an APK characteristic genetic module and an APK repackaging module. The method and the device can overcome the difficulties of insufficient variety, insufficient quantity, long acquisition period and the like in the existing data set, and provide a large amount of APK sets for users conveniently.
Description
Technical Field
The invention belongs to the technical field of Android application program data set production, and particularly relates to an APK data set automatic generation method and system based on genetic engineering.
Background
By virtue of its strong compatibility, complete functionality and openness, the Android operating system currently holds the largest market share among mobile smart-terminal operating systems. Because of this openness, an attacker can easily insert malicious code into a normal application in order to launch attacks or perform unauthorized dangerous operations, and the large number of Android smartphones has become a main target of mobile malware.
When malicious APKs are classified and identified in practice by forensic methods based on automated APK analysis, the data set samples available for training are limited, manually collecting and preparing APK files is tedious and time-consuming, and a novel malicious APK appearing for the first time cannot be detected promptly because no sample of the corresponding type exists. A method for automatically generating new APK data sets is therefore critical.
Appy Pie can be used to build Android APK applications; it is quick and convenient, and an application can be created within a few minutes. However, it cannot create new malicious APKs for experiments in batches. Other publicly available application creation tools, such as Flutter and BeeWare, also exist, but these tools lack the following functionality:
1. batch, automated generation of APKs;
2. generation of novel APKs after analyzing the internal characteristics of existing APKs;
3. simple and efficient production of APKs for experiments.
Disclosure of Invention
The invention aims to: the invention aims to provide an automatic generation method of an APK data set based on genetic engineering, which solves the problems of insufficient variety, insufficient quantity, long acquisition period and the like in the existing data set. Another object of the present invention is to propose an APK dataset automatic generation system, solving the problem of how to execute the above method.
The technical scheme is as follows: the invention relates to an APK data set automatic generation method based on genetic engineering, which comprises the following steps:
Performing static analysis on the decompressed APK file to obtain APK file characteristics;
selecting part of APK file characteristics and expressing the selected part of APK file characteristics in a gene sequence mode to obtain a genome set;
randomly selecting gene sequences in the genome set to generate random gene sequences;
Adding a random gene sequence into a genome set for gene characteristic evaluation, and then putting the evaluated genome set into a genetic algorithm for training to obtain a new genome;
Updating the genome collection with the new genome;
And remapping the gene sequences in the updated genome set back to corresponding files, and repackaging the Android application.
Preferably, the static analysis of the decompressed APK file includes: decompiling the decompressed APK file and converting it to a jar package, extracting information from the APK file, and analyzing the components and structures in the APK file to obtain the APK file characteristics.
Preferably, the decompiling and the conversion to a jar package comprise: decompiling with the Apktool tool, and converting the obtained classes.dex file into a jar package file with the dex2jar tool.
Apktool is an open-source APK reverse-engineering tool that can decompile and recompile APKs and makes the XML files and AndroidManifest.xml viewable. In the decompilation sub-module, because Apktool runs on the command line, a new system process is started with the subprocess.Popen() function, the command "apktool d apkName" is executed in the new process through the exec() system call, and the process is closed after execution finishes.
dex2jar is a tool set for working with Android's Dalvik (.dex) and Java (.class) file formats. In the decompilation sub-module, because dex2jar runs on the command line, a new system process is started with the subprocess.Popen() function, the command "d2j-dex2jar --force classes.dex" is executed in the new process through the exec() system call, the process is closed after execution finishes, and the jar package is then unpacked to obtain the result.
Preferably, extracting information from the APK file includes: extracting the package name, SDK version, main activity, app name, permission information, detailed permission information, icon and Android version name of the APK file;
Analyzing the components and structures in the APK file comprises analyzing the Dex file, the Manifest file, the resource files, the code and the decompilation results with the Androguard tool.
Androguard is an open-source tool written in Python for analyzing Android applications and malware. It provides a series of tools that help security researchers analyze and detect security vulnerabilities and malicious behavior in Android applications, and help analysts examine the components and structures in APK files, such as Dex files, Manifest files, resource files, code and decompilation results.
Preferably, selecting part of the APK file features and expressing them as gene sequences comprises the following steps:
constructing a mapping database with a one-to-one correspondence between feature binary values and feature codes;
Selecting file types, file sizes, file contents and file paths in the APK file features as mapping features, and mapping each mapping feature to a binary sequence with a fixed length, namely a feature binary sequence, based on a mapping database;
prepending a binary priority label to each feature binary sequence to obtain a binary gene sequence;
a fixed number of binary gene sequences are combined into several genomes, which form a genome set.
Preferably, constructing the mapping database with a one-to-one correspondence between feature binary values and feature codes comprises:
placing identical file features in the same table, and distinguishing the tables by naming each one after its file feature;
The first column of each database table stores the file feature, the second column stores the underlying code when the file feature is 0, and the third column stores the underlying code when the file feature is 1; wherein 0 represents that the file feature is absent or not enabled, and 1 represents that the file feature is present and enabled;
Based on static analysis, APK files are scanned one by one, with the first APK file used as a template; after the template has been scanned, a new table is created each time a new file feature appears in a subsequent APK file, and so on, until the mapping database is constructed and populated.
Preferably, adding the random gene sequence to the genome set for gene feature evaluation and then putting the evaluated genome set into a genetic algorithm for training comprises the following steps:
Adding the random gene sequence to the genome set to obtain a data set to be evaluated, and performing fitness evaluation on each binary gene sequence in that data set based on the coverage, diversity and representativeness of the data set, so as to ensure that each genome in the data set is fully usable and that the genomes differ from one another;
the evaluated genomes are inherited through at least one of single-point crossover, multi-point crossover, uniform crossover and random mutation to obtain a new genome.
Preferably, updating the genome set with the new genome comprises:
The priority label of the new genome is updated, and the new genome is then added to the genome set to update it;
The termination condition is checked each time the genome set is updated: if the maximum number of iterations has been reached, the genetic process ends; if not, fitness evaluation is performed on the new genome and the next iteration begins.
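The update-and-terminate cycle described above can be sketched as a short loop. This is a minimal illustration, not the patented implementation: the function name `evolve`, the half-population survivor rule, and the `fitness` and `make_child` callables are all assumptions introduced here, and the priority label update step is omitted. Gene sequences are assumed to be strings of "0" and "1", and the population is assumed to hold at least two of them.

```python
import random


def evolve(population, fitness, make_child, max_iterations=50, seed=0):
    """Iterate evaluate -> breed -> update until max_iterations is reached.

    population : list of gene sequences (strings of '0'/'1'), at least two
    fitness    : callable scoring one gene sequence (hypothetical)
    make_child : callable (parent1, parent2, rng) -> new gene sequence (hypothetical)
    """
    rng = random.Random(seed)
    size = len(population)
    for _ in range(max_iterations):  # termination: maximum iteration count
        # Fitness evaluation: rank the current set, best first.
        ranked = sorted(population, key=fitness, reverse=True)
        # Assumed survivor rule: keep the better half (at least two parents).
        survivors = ranked[: max(2, size // 2)]
        # Breed new gene sequences until the set is back to its original size.
        children = []
        while len(survivors) + len(children) < size:
            parent1, parent2 = rng.sample(survivors, 2)
            children.append(make_child(parent1, parent2, rng))
        # Update the genome set with the new sequences.
        population = survivors + children
    return population
```

Because the best-ranked sequences always survive into the next round, the best fitness in the set never decreases under this particular survivor rule; other replacement strategies are equally compatible with the text.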
Preferably, remapping the gene sequences in the updated genome set back to the corresponding files comprises:
Remapping the binary gene sequences in the iteratively updated genome set back to the corresponding feature codes based on the mapping database, and locating the corresponding APK files according to the feature codes.
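As a rough illustration of the remapping step, the sketch below decodes one binary gene sequence back into feature codes. The names `decode_gene`, `feature_order` and `code_table` are hypothetical; the in-memory dictionary stands in for the mapping database, and the eight-bit priority label length is taken from the detailed description.

```python
def decode_gene(gene, feature_order, code_table, priority_bits=8):
    """Map the feature bits of a gene sequence back to feature codes.

    gene          : string of '0'/'1' (priority label followed by feature label)
    feature_order : feature names, one per feature bit, in bit order
    code_table    : {feature: (code_when_0, code_when_1)}, a stand-in for
                    the mapping database
    """
    feature_bits = gene[priority_bits:]
    if len(feature_bits) != len(feature_order):
        raise ValueError("feature label length does not match the feature list")
    # Bit 0 selects the first stored code, bit 1 selects the second.
    return {
        name: code_table[name][int(bit)]
        for name, bit in zip(feature_order, feature_bits)
    }
```

For example, with two features and the gene "1000000001", the first feature resolves to its "0" code and the second to its "1" code.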
Based on the above method, the invention further discloses an automatic generation system of an APK data set, comprising:
the APK static analysis module is used for statically analyzing the decompressed APK file to obtain relevant APK file characteristics;
The APK gene generation module is used for carrying out APK file feature selection and expressing the APK file features in a gene sequence mode;
The APK characteristic genetic module is used for carrying out gene characteristic evaluation on the genome set added with the random gene sequence, putting the evaluated genome set into a genetic algorithm for training, generating a new genome and updating the genome set;
and the APK repackaging module is used for remapping the gene sequences in the updated genome set back to corresponding files and repackaging the Android application.
In some embodiments, an APK decompression module is responsible for uploading and unpacking APK files: before a file is uploaded, the module determines whether the file to be tested is an APK file, and then unpacks it.
Preferably, the APK static analysis module comprises a decompilation sub-module and an information extraction sub-module. After unpacking, the APK file to be analyzed enters the decompilation sub-module, where the system decompiles it with Apktool, selects the classes.dex file and converts it to a jar package with dex2jar; once the conversion succeeds, the result is stored in the system and passed to the information extraction sub-module, where Androguard is used to analyze the data.
The information extraction sub-module extracts information from the APK file, including the package name, SDK version, main activity, app name, permission information, detailed permission information, icon and Android version name. If any of the Apktool decompilation, the jar package conversion or the data analysis of the information extraction sub-module fails, an error message is returned and the process is terminated.
The APK gene generation module comprises an APK file feature selection sub-module and a gene combination sub-module. The APK file feature selection sub-module selects APK file features and maps the features and attributes of each APK file to a fixed-length binary sequence in which each bit represents the presence or absence of a feature or attribute. Selectable APK file features include file types (such as AndroidManifest.xml, classes.dex and resource files), file sizes, file contents (such as permission lists and method calls), file paths and so on. For example, a simple feature may be the package name, version number or permission list of the application.
The gene combination submodule is used for combining a fixed number of binary gene sequences into one genome to obtain a genome set; that is, each individual's gene is represented as a set of file features, which is a collection of different file features.
The APK characteristic genetic module comprises a gene initialization sub-module, a fitness evaluation sub-module, a selection, crossover and mutation sub-module, a gene update sub-module and a termination condition detection sub-module.
The gene initialization submodule is used for generating random gene sequences, wherein each individual in an initial genome has a group of random gene sequences, and each individual represents one possible APK file characteristic data set and is generated by adopting a random selection method.
The fitness evaluation submodule is used for evaluating the quality of the APK file according to the file characteristics, and relates to evaluation of the safety, performance, functional integrity and the like of the file.
The selection, crossover and mutation submodules are used for carrying out selection operation, crossover operation and mutation operation on the evaluated genome; based on the selection, crossover and mutation of genetic algorithms, genetic operations operate on representations based on file characteristics. In the selection operation, individuals with better feature combinations can be selected for reproduction according to the fitness function, and the crossover and mutation operations can be adjusted according to file features. The crossover operation may be in the form of single-point crossover, multi-point crossover, or uniform crossover, etc., for exchanging partial information of the gene sequence; mutation operations are random alterations of individual genes to increase population diversity.
The gene update sub-module is used for updating the genes after the set number of iterations, replacing old individuals with new individuals to form a new genome, wherein the individuals are composed of file features;
The termination condition detection sub-module is used for determining whether to stop execution of the algorithm; the condition may be that the maximum number of iterations is reached, that the fitness reaches a preset threshold, or another preset condition;
the APK repacking module comprises an APK gene conversion submodule and an APK packing submodule, and the APK gene conversion submodule is used for remapping APK file characteristics back to corresponding files; the APK packaging submodule is used for repackaging the Android application by using Apktool.
The beneficial effects are that: compared with the prior art, the invention has the following remarkable advantages: based on an existing data set, the APK file features are systematically split, and the features are recombined in a genetic manner to generate APK files of different types. The design can overcome the difficulties of insufficient variety, insufficient quantity and long acquisition period in existing data sets, and conveniently provides users with a large number of APK sets. The advantages specifically include:
1. Automatic generation: an APK data set is generated automatically based on a genetic engineering method without manual intervention, which saves a large amount of time and labor cost and improves efficiency.
2. Diversity and coverage: through the optimization process of the genetic algorithm, an APK data set with diversity and higher coverage can be generated, and the performance and the function of the Android application can be more comprehensively evaluated and tested.
3. Personalized customization: the genetic engineering-based method can custom generate an APK dataset according to user requirements and specified requirements. Personalized customization may ensure that the generated data set meets specific test goals and scenario requirements.
4. Solves the problem of insufficient data: for certain fields or specific applications, existing APK datasets may not be sufficient to meet test requirements. The genetic engineering method can overcome the problem of insufficient data and generate a richer and more diverse data set.
5. Validity and extensibility: the genetic algorithm is an efficient optimization algorithm that can search the solution space and find the optimal solution. The method has higher expandability and can solve the problems of different scales and complexity.
6. Adaptability and flexibility: the genetic engineering method has strong adaptability and flexibility, can be applied to the generation of APK data sets of different types, and can be adjusted and improved according to specific situations.
Drawings
FIG. 1 is a diagram of the overall structure of the present invention;
FIG. 2 is a workflow diagram of an APK static analysis module;
FIG. 3 is a diagram showing the composition and workflow of an APK gene generation module;
FIG. 4 is a composition and workflow diagram of an APK signature genetic module;
FIG. 5 is a composition and workflow diagram of an APK repacking module.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, an APK dataset automatic generation system includes:
The APK decompression module is responsible for uploading and unpacking APK files; before a file is uploaded, it determines whether the file to be processed is an APK file, and then unpacks it.
The APK static analysis module is used for statically analyzing the decompressed APK file to obtain relevant APK file characteristics;
The APK gene generation module is used for carrying out APK file feature selection and expressing the APK file features in a gene sequence mode;
The APK characteristic genetic module is used for carrying out gene characteristic evaluation on the genome set added with the random gene sequence, putting the evaluated genome set into a genetic algorithm for training, generating a new genome and updating the genome set;
and the APK repackaging module is used for remapping the gene sequences in the updated genome set back to corresponding files and repackaging the Android application.
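The check-then-unpack behaviour of the APK decompression module can be sketched with the Python standard library, since an APK is a ZIP archive. The function names `looks_like_apk` and `unpack_apk` are hypothetical, and treating "is a ZIP archive containing AndroidManifest.xml" as the APK test is an assumption made here; the patent does not state which check is used.

```python
import zipfile


def looks_like_apk(source):
    """Return True if `source` (a path or binary file object) is a ZIP
    archive that contains an AndroidManifest.xml entry."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        return "AndroidManifest.xml" in archive.namelist()


def unpack_apk(source, out_dir):
    """Unpack the archive into out_dir and return the list of entry names.

    Raises ValueError when the input does not pass the APK check.
    """
    if not looks_like_apk(source):
        raise ValueError("input is not an APK file")
    with zipfile.ZipFile(source) as archive:
        archive.extractall(out_dir)
        return archive.namelist()
```

Plain unzipping leaves AndroidManifest.xml and the resources in their binary form, which is why the static analysis module still needs a decompilation step afterwards.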
As shown in fig. 2, the APK static analysis module is composed of two sub-modules, namely a decompilation sub-module and an information extraction sub-module.
The decompilation sub-module performs the basic work of unshelling (removing packer protection), unpacking and decompiling the APK file.
The information extraction sub-module extracts and analyzes data from the processed APK file.
After unpacking, the APK file to be analyzed enters the decompilation sub-module, where the system decompiles it with Apktool, selects the classes.dex file and converts it to a jar package with dex2jar; once the conversion succeeds, the result is stored in the system and passed to the information extraction sub-module, where Androguard is used to analyze the data. If any of the Apktool decompilation, the jar package conversion or the data analysis of the information extraction sub-module fails, an error message is returned and the process is terminated.
APK decompilation refers to restoring an APK file to readable source code, XML resource files and other components. The module integrates the open-source Apktool tool for decompilation; Apktool makes the XML files, AndroidManifest.xml and the images under the res directory viewable. Because Apktool runs on the command line, a new system process is started with the subprocess.Popen() function, the command "apktool d apkName" is executed in the new process through the exec() system call, and the process is closed after execution finishes.
The module also integrates the dex2jar tool to convert the obtained classes.dex file into a jar package file, that is, a package of the compiled Java classes, to facilitate subsequent analysis. Because dex2jar runs on the command line, a new system process is started with the subprocess.Popen() function, the command "d2j-dex2jar --force classes.dex" is executed in the new process through the exec() system call, the process is closed after execution finishes, and the jar package is then unpacked to obtain the result.
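The two command-line invocations can be wrapped in a small helper built on subprocess.Popen. This is a sketch under assumptions: the helper names are hypothetical, the extra -o and -f options are added here for a predictable output location, and the executables are assumed to be on PATH under the names apktool and d2j-dex2jar (some installations ship the latter as d2j-dex2jar.sh or d2j-dex2jar.bat).

```python
import shutil
import subprocess


def apktool_command(apk_path, out_dir):
    # "apktool d <apk>" decodes the APK; -o sets the output directory,
    # -f overwrites an existing one.
    return ["apktool", "d", apk_path, "-o", out_dir, "-f"]


def dex2jar_command(dex_path, jar_path):
    # --force overwrites an existing jar; -o sets the output file.
    return ["d2j-dex2jar", "--force", "-o", jar_path, dex_path]


def run_tool(command, timeout=600):
    """Run a command in a new process and return its standard output.

    Raises FileNotFoundError if the executable is missing and
    RuntimeError if the tool exits with a non-zero status.
    """
    if shutil.which(command[0]) is None:
        raise FileNotFoundError(f"{command[0]} was not found on PATH")
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True
    )
    try:
        out, err = process.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        process.kill()          # close the process if it hangs
        process.communicate()   # reap it before re-raising
        raise
    if process.returncode != 0:
        raise RuntimeError(f"{command[0]} failed: {err.strip()}")
    return out
```

A caller would run `run_tool(apktool_command("app.apk", "out"))` and then `run_tool(dex2jar_command("classes.dex", "classes.jar"))`, returning an error message and stopping the pipeline if either call raises, as the description requires.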
The information extraction sub-module extracts information from the APK file, including the package name, SDK version, main activity, app name, permission information, detailed permission information, icon and Android version name. The Androguard tool is integrated in the system to realize this function. Androguard is an open-source tool written in Python for analyzing Android applications and malware. It provides a series of tools that help security researchers analyze and detect security vulnerabilities and malicious behavior in Android applications, and help analysts examine the components and structures in APK files, such as Dex files, Manifest files, resource files, code and decompilation results.
The specific method comprises the following steps:
1. Androguard is imported into the module, and AnalyzeAPK is called with the location of the APK file. AnalyzeAPK is defined by Androguard and returns three objects: an APK object, which is an abstraction of the APK file from which information such as the version number, package name and activities can be obtained; a DalvikVMFormat array, in which each element corresponds to a classes.dex file and from which classes, methods and strings can be obtained; and an Analysis object, which contains special classes that link information about the classes.dex files. The three objects are named a, d and dx, respectively.
2. Through the APK object, the package name of the APK is obtained with the get_package method, the main activity with get_main_activity, the app name with get_app_name, the requested permissions with get_permissions, the details of the requested permissions with get_details_permissions, the icon with get_app_icon, the Android version name with get_androidversion_name, the effective target SDK version with get_effective_target_sdk_version, and the XML configuration file information with get_android_manifest_axml().get_xml.
3. The sensitive permissions used by the APK and the names of the classes that use them are obtained through the DalvikVMFormat objects. First, the apkPermissonMap resource module is loaded; it contains a predefined mapping from Android API methods to the permissions they require, and is used to match the methods in the APK. All methods in the APK are then traversed through the get_methods() method of the DalvikVMFormat object, and their class names and method names are obtained and matched against the AndroidAPI mapping loaded above; the methods that match system-sensitive permissions are retained.
4. The system methods called by the APK and the classes that call them are obtained through the analysis object dx: dx.classes['Ljava/io/File;'].get_methods() returns all Ljava/io/File; methods in the APK, the cross-reference information of each method is then obtained and formatted for output, and the system methods called by the APK are thereby obtained.
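Steps 1 to 4 can be sketched as a handful of small helpers around the three objects returned by AnalyzeAPK. The helper names and the contents of `permission_map` are hypothetical (the patent does not publish its apkPermissonMap), and the attribute names used on the cross-reference objects (`name`, `class_name`) follow the Androguard 3.x documentation and may differ in other versions. Androguard is a third-party package and is imported lazily, so the helpers themselves accept any objects that expose the same methods.

```python
def extract_apk_info(a):
    """Step 2: collect basic metadata from an Androguard APK object."""
    return {
        "package": a.get_package(),
        "main_activity": a.get_main_activity(),
        "app_name": a.get_app_name(),
        "permissions": a.get_permissions(),
        "permission_details": a.get_details_permissions(),
        "icon": a.get_app_icon(),
        "android_version_name": a.get_androidversion_name(),
        "effective_target_sdk": a.get_effective_target_sdk_version(),
    }


def find_sensitive_calls(dex_files, permission_map):
    """Step 3: match every method in the DEX files against a
    {(class_name, method_name): permission} map (hypothetical format)."""
    hits = []
    for d in dex_files:
        for method in d.get_methods():
            key = (method.get_class_name(), method.get_name())
            if key in permission_map:
                hits.append(key + (permission_map[key],))
    return hits


def callers_of_class(dx, class_name="Ljava/io/File;"):
    """Step 4: list (called method, caller class, caller method) triples."""
    if class_name not in dx.classes:
        return []
    rows = []
    for meth in dx.classes[class_name].get_methods():
        for _, call, _ in meth.get_xref_from():
            rows.append((meth.name, call.class_name, call.name))
    return rows


def analyze(apk_path, permission_map):
    """Step 1: run AnalyzeAPK and feed its three results to the helpers."""
    from androguard.misc import AnalyzeAPK  # third-party, imported lazily
    a, d, dx = AnalyzeAPK(apk_path)
    return (
        extract_apk_info(a),
        find_sensitive_calls(d, permission_map),
        callers_of_class(dx),
    )
```

A permission map entry might look like `{("Landroid/telephony/SmsManager;", "sendTextMessage"): "android.permission.SEND_SMS"}`; the manifest XML from step 2 is left out of the dictionary to keep the sketch short.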
As shown in FIG. 3, the APK gene generation module consists of a code-base library sub-module, an APK file feature selection sub-module and a gene combination sub-module.
The code-base library sub-module aims to form a mapping database with a one-to-one correspondence between feature binary values and feature codes; the mapping database contains the file features, the 0/1 values of the file features, the underlying code corresponding to the file features, and so on. Here 0 represents that the file feature is absent or not enabled, and 1 represents that the file feature is present and enabled.
The first column of each database table stores the file feature, the second column stores the underlying code when the file feature is 0, and the third column stores the underlying code when the file feature is 1. Based on the APK static analysis module, APK files are scanned one by one, and the database feature set is created with the first APK file as a template. Because the APK data set is large, identical file features are placed in the same table, and the tables are distinguished by naming them after the file feature. After the template APK file has been scanned, a new table is created each time a new file feature appears in a subsequent APK file, and so on, until the mapping database is constructed and populated. The number of tables in the mapping database determines the length of the feature label of the binary sequence.
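One way to picture the mapping database is a set of SQLite tables, one per file feature, each holding the feature name and its underlying code for the values 0 and 1. The sketch below is illustrative only: the choice of SQLite, the table naming scheme, the column names and the sample code strings are all assumptions, not details given in the patent.

```python
import re
import sqlite3


def table_name(feature):
    # One table per file feature, named after the feature
    # (non-word characters are replaced so the name is a safe identifier).
    return "feature_" + re.sub(r"\W", "_", feature)


def add_feature(conn, feature, code_0, code_1):
    """Create the table for a newly seen feature and store its two codes.

    Seeing the same feature again in a later APK changes nothing.
    """
    name = table_name(feature)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{name}" '
        "(feature TEXT PRIMARY KEY, code_0 TEXT, code_1 TEXT)"
    )
    conn.execute(
        f'INSERT OR IGNORE INTO "{name}" VALUES (?, ?, ?)',
        (feature, code_0, code_1),
    )


def lookup_code(conn, feature, bit):
    """Return the underlying code stored for the feature when its bit is 0 or 1."""
    column = "code_1" if bit else "code_0"
    row = conn.execute(
        f'SELECT {column} FROM "{table_name(feature)}" WHERE feature = ?',
        (feature,),
    ).fetchone()
    return row[0]


def feature_tag_length(conn):
    # The number of tables gives the length of the feature label.
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table'"
    ).fetchone()
    return row[0]
```

Scanning the template APK and then each subsequent APK amounts to calling `add_feature` for every feature found; only features not seen before add a table, so `feature_tag_length` grows exactly when a new feature appears.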
In the APK file feature selection sub-module, according to the information obtained by the APK static analysis module, a file type (such as android management. Xml, class. Dex, resource file, etc.), a file size, a file content (such as a permission list, a method call, etc.), a file path, etc. are selected. Wherein the characteristics and attributes of each APK file are mapped to a fixed length binary sequence, wherein each bit represents the presence or absence of a characteristic or attribute. 0 represents the absence of this attribute and 1 represents the presence of this attribute.
To increase the diversity of the selection, an eight-bit binary number is prepended to the feature binary sequence as a priority tag, which identifies whether the APK is malicious, whether it contains a specified feature, and so on. The first bit specifies whether the APK is malicious; a malicious APK can be identified by checking whether sensitive permissions among the existing file features are enabled.
Finally, the APK file feature selection sub-module generates a plurality of binary gene sequences, each consisting of a priority tag and a feature tag. The priority tag is eight bits long, and the length of the feature tag is determined by the number of tables in the database.
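A binary gene sequence of this shape can be sketched as follows. The function name encode_gene, the string representation of bits, and the treatment of the remaining seven priority bits as caller-supplied flags are assumptions for illustration; the text only fixes the tag lengths and the meaning of the first priority bit.

```python
PRIORITY_BITS = 8  # length of the priority tag stated in the text


def encode_gene(feature_names, present, malicious, extra_flags=(0,) * 7):
    """Return priority tag + feature tag as a string of '0' and '1'.

    feature_names: ordered feature names, one per table in the mapping database.
    present: set of feature names that are present and enabled in this APK.
    malicious: value of the first priority bit.
    extra_flags: the other seven priority bits (hypothetical; their meaning
    is not fixed by the text).
    """
    if len(extra_flags) != PRIORITY_BITS - 1:
        raise ValueError("expected seven extra priority bits")
    priority = [1 if malicious else 0, *extra_flags]
    features = [1 if name in present else 0 for name in feature_names]
    return "".join(str(bit) for bit in priority + features)
```

With three feature tables, every gene sequence produced this way is 11 bits long: eight priority bits followed by three feature bits.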
The gene combination sub-module then combines a fixed number of binary gene sequences into one genome, and n genomes are formed in this way to obtain the genome set.
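The grouping step can be sketched in a few lines. The function name and the decision to drop leftover sequences that do not fill a whole genome are assumptions; the text does not say how a remainder is handled.

```python
def combine_into_genomes(gene_sequences, genome_size):
    """Group consecutive gene sequences into genomes of a fixed size.

    Sequences left over at the end that cannot fill a complete genome are
    dropped (an assumption made for this sketch).
    """
    last_start = len(gene_sequences) - genome_size + 1
    return [
        gene_sequences[i:i + genome_size]
        for i in range(0, last_start, genome_size)
    ]
```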
As shown in FIG. 4, processing in the APK signature genetic module begins with the gene initialization sub-module, which imports the existing genome set. In addition, binary gene sequences are randomly selected from the genome set to generate some random gene sequences, simulating the uncertainty and possible bad values of genes.
The fitness evaluation sub-module then adds the generated random gene sequences to the existing genome set and evaluates the gene features. Each individual (one individual represents one binary gene sequence) undergoes fitness evaluation based on the coverage, diversity, representativeness, etc. of the data set, which ensures that each genome is fully usable and that the genomes differ from one another. Each individual in the initial genome set has a group of random gene sequences generated by random selection, and each individual represents a possible APK file feature data set.
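The text names coverage and diversity as evaluation criteria without giving formulas. The sketch below shows one possible reading, with all three definitions being assumptions: coverage as the fraction of bit positions enabled in at least one individual, diversity as the mean pairwise Hamming distance, and a per-individual score as the mean distance to the rest of the population.

```python
from itertools import combinations


def hamming(a, b):
    # Number of positions at which two equal-length bit strings differ.
    return sum(x != y for x, y in zip(a, b))


def coverage(population):
    """Fraction of bit positions set to '1' in at least one individual."""
    length = len(population[0])
    covered = sum(
        any(ind[i] == "1" for ind in population) for i in range(length)
    )
    return covered / length


def diversity(population):
    """Mean pairwise Hamming distance, normalised by sequence length."""
    pairs = list(combinations(population, 2))
    if not pairs:
        return 0.0
    length = len(population[0])
    return sum(hamming(a, b) for a, b in pairs) / (len(pairs) * length)


def individual_fitness(population, index):
    """Hypothetical score: mean normalised distance to all other individuals."""
    target = population[index]
    others = [p for i, p in enumerate(population) if i != index]
    if not others:
        return 0.0
    return sum(hamming(target, o) for o in others) / (len(others) * len(target))
```

Under these definitions an individual that repeats an existing gene sequence scores low, which matches the stated goal that the genomes should differ from one another.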
The evaluated genomes are fed into the genetic algorithm for training. Inheritance can proceed in several ways, such as single-point crossover, multi-point crossover, uniform crossover, and random mutation, and finally yields new genomes. The genetic operations (selection, crossover, and mutation) act on the file-feature representation. In the selection operation, individuals with better feature combinations are selected for reproduction according to the fitness function, and the crossover and mutation operations can be adjusted according to the file features. Crossover, whether single-point, multi-point, or uniform, exchanges part of the information of two gene sequences; mutation randomly alters individual genes to increase population diversity.
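The crossover and mutation operators named above can be sketched on bit strings as follows. These are the textbook forms of single-point crossover, uniform crossover, and bit-flip mutation; the function names, the 0.5 swap probability, and the choice to pass in a random.Random instance are assumptions, and multi-point crossover is omitted for brevity.

```python
import random


def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length sequences at a random cut point."""
    point = rng.randrange(1, len(a))  # requires len(a) >= 2
    return a[:point] + b[point:], b[:point] + a[point:]


def uniform_crossover(a, b, rng):
    """Swap each bit position independently with probability 0.5."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if rng.random() < 0.5:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return "".join(child1), "".join(child2)


def mutate(sequence, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in sequence
    )
```

Whether these operators should touch the eight priority bits or only the feature tag is not stated in the text; a caller can slice the sequence before applying them.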
The priority tags of the new genomes are then updated, and the new genomes are added to the genome set to update it.
Updating the genome set increases genetic diversity, opens up more possibilities in subsequent training, and yields more gene varieties. After each update of the genome set, the termination condition is checked, namely whether the maximum number of iterations has been reached. If it has, inheritance ends and the genomes are passed to the next module; if not, the newly generated genomes are returned to the fitness evaluation sub-module for evaluation, and the next iteration begins.
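The evaluate, breed, update, and check cycle can be sketched as a loop. The function name evolve, the callback parameters fitness and breed, and the choice of the two fittest individuals as parents are assumptions; updating the priority tag of each child is left to the breed callback.

```python
def evolve(genome_set, fitness, breed, max_iterations):
    """Iterate until the maximum number of iterations is reached.

    fitness: callable scoring one individual (higher is better).
    breed: callable taking two parents and returning a list of children.
    Requires at least two individuals and max_iterations >= 1.
    """
    population = list(genome_set)
    iteration = 0
    while True:
        # Fitness evaluation: rank the current population.
        ranked = sorted(population, key=fitness, reverse=True)
        # Genetic step: breed the two fittest individuals (an assumption).
        children = breed(ranked[0], ranked[1])
        # Update the genome set with the new individuals.
        population.extend(children)
        iteration += 1
        # Termination check after each update.
        if iteration >= max_iterations:
            return population
```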
As shown in FIG. 5, the APK repackaging module includes an APK gene conversion sub-module and an APK packaging sub-module.
The APK gene conversion sub-module receives the genomes for which the previous module has finished iterating, and maps the APK file features in the genomes back to the corresponding files.
This mapping process relies on the mapping database generated by the code-base library sub-module. Because the scan covers many APK files, the database holds several underlying codes for each feature; the corresponding feature table is looked up, and the column is chosen according to the 0 or 1 in the binary code. To reduce the complexity of searching for file feature codes, a sequential search is used: as soon as a matching file feature code is found in the searched column, that code is adopted and the search for this feature stops; if the whole table has been traversed without a match, the file feature code is set to empty.
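The sequential lookup can be sketched as follows, with each feature table modelled as a list of (code_0, code_1) rows. The function names, the in-memory representation of the tables, and the use of None for an empty code are assumptions for illustration.

```python
def lookup_code(table, bit):
    """Return the first non-empty code in the column selected by `bit`.

    table: list of (code_0, code_1) rows for one file feature.
    bit: 0 or 1, taken from the binary gene sequence.
    Returns None when the whole table is traversed without a match.
    """
    for row in table:
        code = row[bit]
        if code:
            return code  # adopt the first hit and stop searching
    return None


def genes_to_codes(feature_tag, feature_names, mapping):
    """Map each bit of a feature tag back to an underlying code."""
    return {
        name: lookup_code(mapping.get(name, []), int(bit))
        for name, bit in zip(feature_names, feature_tag)
    }
```

The search cost is linear in the number of rows of a single feature table, and it stops at the first usable code rather than comparing alternatives.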
The APK packaging sub-module then repackages the Android application using Apktool:
apktool b output_folder -o new_app.apk
Here, output_folder is the directory containing the modified files, and new_app.apk is the name of the APK file generated by repackaging.
If integrity must be guaranteed, the APK can be signed with a signing tool; here the jarsigner tool of Java is used:
jarsigner -verbose -sigalg SHA1withRSA -digestalg SHA1 -keystore my-release-key.keystore new_app.apk alias_name
Here, my-release-key.keystore is the keystore file, new_app.apk is the APK file to be signed, and alias_name is the alias of the key in the keystore.
If there is no keystore, a keytool command may be used to create one:
keytool -genkey -v -keystore my-release-key.keystore -alias alias_name -keyalg RSA -keysize 2048 -validity 10000
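The keystore, build, and signing commands above can be chained from a script. The sketch below only assembles the argument lists from the commands given in the text and runs them with subprocess; the function names and the run_keygen switch are hypothetical, and apktool, jarsigner, and keytool are assumed to be installed and on the PATH. Note that keytool -genkey normally prompts for passwords and identity details, so an unattended run would need further options.

```python
import subprocess


def build_commands(folder="output_folder", apk="new_app.apk",
                   keystore="my-release-key.keystore", alias="alias_name"):
    """Return the keytool, apktool and jarsigner argument lists."""
    keygen = ["keytool", "-genkey", "-v", "-keystore", keystore,
              "-alias", alias, "-keyalg", "RSA",
              "-keysize", "2048", "-validity", "10000"]
    build = ["apktool", "b", folder, "-o", apk]
    sign = ["jarsigner", "-verbose", "-sigalg", "SHA1withRSA",
            "-digestalg", "SHA1", "-keystore", keystore, apk, alias]
    return keygen, build, sign


def repackage(run_keygen=False, **names):
    """Run the build and signing steps, optionally creating a keystore first."""
    keygen, build, sign = build_commands(**names)
    steps = ([keygen] if run_keygen else []) + [build, sign]
    for argv in steps:
        # check=True stops the pipeline at the first failing tool.
        subprocess.run(argv, check=True)
```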
In summary, the invention follows the idea of biological inheritance: genomes are generated from file features, and the genome set is inherited as DNA. Gene combination, gene mutation, genetic variation, genetic drift, and the like can occur during inheritance. Gene combination refers to the combination of the parents' genes and their rearrangement in the offspring, which gives the offspring a unique genetic combination. Gene mutation refers to changes in the DNA sequence that may occur during inheritance, in forms such as point mutation, insertion, and deletion, and that affect the genetic characteristics and traits of individuals. Genetic variation refers to the genetic differences between individuals that arise from gene recombination and mutation and lead to variability among offspring. Genetic drift refers to random changes in gene frequency within a population caused by random events, which can affect the genetic structure and evolution of the population.
By drawing on the characteristics of biological inheritance, the invention combines features to form APK files with different characteristics and uses a program to generate a large number of diverse APKs automatically. No manual intervention is needed, which saves considerable time and labor and improves efficiency. In some cases, the method can replace the work of collecting large numbers of APK files from the network, which greatly facilitates research work and the early data collection for model training.
Through the optimization process of the genetic algorithm, the invention can generate an APK data set with greater diversity and higher coverage, so that the performance and functionality of Android applications can be evaluated and tested more fully. New APKs can also be generated, which to some extent mitigates the problem that some APK files cannot be collected.
The stages of the genetic algorithm can be modified as needed, so that APK data sets can be customized to user needs and specified requirements. Such customization helps ensure that the generated data set meets specific test goals and scenario requirements, giving the method high overall flexibility and variability.
Claims (11)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410896182.9A CN118797642A (en) | 2024-07-05 | 2024-07-05 | A method and system for automatically generating APK data sets based on genetic engineering |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118797642A (en) | 2024-10-18 |
Family
ID=93023387
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410896182.9A Pending CN118797642A (en) | 2024-07-05 | 2024-07-05 | A method and system for automatically generating APK data sets based on genetic engineering |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118797642A (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20100074439A1 (en) * | 2006-07-06 | 2010-03-25 | William Garreth James Howells | method and apparatus for the generation of code from pattern features |
| CN103106253A (en) * | 2013-01-16 | 2013-05-15 | 西安交通大学 | Data balance method based on genetic algorithm in MapReduce calculation module |
| CN109977227A (en) * | 2019-03-19 | 2019-07-05 | 中国科学院自动化研究所 | Text feature, system, device based on feature coding |
| CN116383816A (en) * | 2023-02-03 | 2023-07-04 | 北京工业大学 | Android malicious software detection feature selection method based on genetic algorithm |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN109361643B (en) | | A deep traceability method for malicious samples |
| CN105184160B (en) | | A kind of method of the Android phone platform application program malicious act detection based on API object reference relational graphs |
| US8010844B2 (en) | | File mutation method and system using file section information and mutation rules |
| CN111639337B (en) | | Unknown malicious code detection method and system for massive Windows software |
| RU91213U1 (en) | | SYSTEM OF AUTOMATIC COMPOSITION OF DESCRIPTION AND CLUSTERING OF VARIOUS, INCLUDING AND MALIMENTAL OBJECTS |
| CN109829312B (en) | | JAVA vulnerability detection method and detection system based on call chain |
| CN104123493A (en) | | Method and device for detecting safety performance of application program |
| CN111897742B (en) | | Method and device for generating intelligent contract test case |
| CN108536451B (en) | | Method and device for embedding embedded point of application program |
| CN109740347A (en) | | A method for identifying and cracking fragile hash functions for smart device firmware |
| JP7119096B2 (en) | | license verification device |
| CN113139192B (en) | | Third party library security risk analysis method and system based on knowledge graph |
| US20200226232A1 (en) | | Method of selecting software files |
| CN105740132B (en) | | Software package source automatic analysis method based on modification daily record |
| CN104866764B (en) | | A kind of Android phone malware detection method based on object reference figure |
| KR20190102456A (en) | | Method for clustering application and apparatus thereof |
| Liu et al. | | Enhancing malware detection for android apps: Detecting fine-granularity malicious components |
| CN113901463B (en) | | Concept drift-oriented interpretable Android malicious software detection method |
| CN105205398B (en) | | It is a kind of that shell side method is looked into based on APK shell adding software dynamic behaviours |
| CN117009972A (en) | | Vulnerability detection method, vulnerability detection device, computer equipment and storage medium |
| CN113254024A (en) | | Code inheritance relationship optimization method, device, equipment and storage medium |
| Feichtner et al. | | Obfuscation-resilient code recognition in Android apps |
| CN109508545B (en) | | An Android Malware Classification Method Based on Sparse Representation and Model Fusion |
| CN114817925B (en) | | Android malicious software detection method and system based on multi-modal graph features |
| CN118797642A (en) | | A method and system for automatically generating APK data sets based on genetic engineering |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20241018 |