
CN118797642A - A method and system for automatically generating APK data sets based on genetic engineering - Google Patents


Publication number
CN118797642A
CN118797642A CN202410896182.9A CN202410896182A CN118797642A CN 118797642 A CN118797642 A CN 118797642A CN 202410896182 A CN202410896182 A CN 202410896182A CN 118797642 A CN118797642 A CN 118797642A
Authority
CN
China
Prior art keywords
apk
file
genome
gene
submodule
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410896182.9A
Other languages
Chinese (zh)
Inventor
梁广俊
王群
印杰
夏玲玲
刘家银
诸葛程晨
郭向民
倪雪莉
徐杰
马卓
唐可言
陆优
陈孟轩
徐仲平
张扬
许懿铨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
JIANGSU POLICE INSTITUTE
Original Assignee
JIANGSU POLICE INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by JIANGSU POLICE INSTITUTE
Priority to CN202410896182.9A
Publication of CN118797642A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/561 Virus type analysis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562 Static detection
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/53 Decompilation; Disassembly
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/123 DNA computing
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models
    • G06N3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Virology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Genetics & Genomics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physiology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an APK data set automatic generation method and system based on genetic engineering. The APK data set automatic generation method comprises the following steps: performing static analysis on the decompressed APK file; selecting part of APK file characteristics and expressing the selected part of APK file characteristics by means of a gene sequence; randomly selecting gene sequences in a genome set; adding a random gene sequence into a genome set for gene characteristic evaluation, and then putting the genome set into a genetic algorithm for training; updating the genome collection with the new genome; and remapping the gene sequences in the updated genome set back to corresponding files, and repackaging the Android application. The automatic generation system comprises an APK static analysis module, an APK gene generation module, an APK characteristic genetic module and an APK repackaging module. The method and the device can overcome the difficulties of insufficient variety, insufficient quantity, long acquisition period and the like in the existing data set, and provide a large amount of APK sets for users conveniently.

Description

Automatic APK data set generation method and system based on genetic engineering
Technical Field
The invention belongs to the technical field of Android application program data set production, and particularly relates to an APK data set automatic generation method and system based on genetic engineering.
Background
By virtue of its strong compatibility, complete feature set, and openness, Android is currently the mobile operating system with the largest market share. Because of that openness, an attacker can easily insert malicious code into a legitimate application to mount attacks or perform unauthorized, dangerous actions, and Android smartphones have become a main target of mobile malware.
When malicious APKs are classified and identified with forensic methods based on automated APK analysis, the training data set is limited: collecting and preparing APK files by hand is tedious and time-consuming, and a new kind of malicious APK cannot be detected when it first appears because the data set contains no sample of its type. A method for automatically generating new APK data sets is therefore critical.
Appy Pie is a tool for building Android APK applications; it is quick and convenient, and an application can be created within a few minutes. However, it cannot create new malicious APKs in batches for experiments. Other published application-building tools, such as Flutter and BeeWare, also exist, but these tools lack the following functions:
1. automatic batch generation of APKs;
2. production of new APKs after analyzing the internal features of existing APKs;
3. simple and efficient production of APKs for experiments.
Disclosure of Invention
Purpose of the invention: the invention aims to provide a method for automatically generating APK data sets based on genetic engineering, which addresses the insufficient variety, insufficient quantity, and long acquisition period of existing data sets. A further object is to provide an automatic APK data set generation system that carries out this method.
Technical solution: the method for automatically generating an APK data set based on genetic engineering comprises the following steps:
performing static analysis on the decompressed APK file to obtain APK file features;
selecting some of the APK file features and expressing them as gene sequences to obtain a genome set;
randomly selecting gene sequences from the genome set to generate random gene sequences;
adding the random gene sequences to the genome set for gene feature evaluation, and then feeding the evaluated genome set into a genetic algorithm for training to obtain a new genome;
updating the genome set with the new genome;
remapping the gene sequences in the updated genome set back to the corresponding files, and repackaging the Android application.
Preferably, the static analysis of the decompressed APK file comprises: decompiling the decompressed APK file and converting it into a JAR package, extracting information from the APK file, and analyzing the components and structure of the APK file to obtain the APK file features.
Preferably, decompilation and JAR conversion are performed as follows: the APK is decompiled with Apktool, and the resulting classes.dex file is converted into a JAR file with the dex2jar tool.
Apktool is an open-source tool for reverse engineering APK files. It can decompile and recompile an APK and gives access to the XML files, AndroidManifest.xml, and the images under the res directory. Because Apktool runs from the command line, the decompilation submodule starts a new system process with the subprocess.Popen() function, executes the command "apktool d apkName" in the new process through the exec() system call, and closes the process when execution finishes.
dex2jar is a tool set for working with Android's Dalvik (.dex) and Java (.class) file formats. Because dex2jar also runs from the command line, the decompilation submodule starts a new system process with subprocess.Popen(), executes the command "d2j-dex2jar --force classes.dex" in the new process through the exec() system call, closes the process when execution finishes, and then unpacks the JAR file to obtain the result.
Preferably, extracting information from the APK file comprises: extracting the package name, SDK version, main activity, app name, permissions, permission details, icon, and Android version name of the APK file;
analyzing the components and structure of the APK file comprises analyzing the DEX file, the Manifest file, the resource files, the code, and the decompilation result with the Androguard tool.
Androguard is an open-source tool written in Python for analyzing Android applications and malware. It provides a set of tools that help security researchers detect vulnerabilities and malicious behavior in Android applications, and help analysts examine the components and structure of an APK file, such as the DEX file, the Manifest file, the resource files, the code, and the decompilation result.
Preferably, selecting some of the APK file features and expressing them as gene sequences comprises the following steps:
constructing a mapping database that holds a one-to-one correspondence between feature binary values and feature codes;
selecting the file type, file size, file content, and file path among the APK file features as mapping features, and, based on the mapping database, mapping each mapping feature to a fixed-length binary sequence, namely the feature binary sequence;
prepending a binary priority label to each feature binary sequence to obtain a binary gene sequence;
combining a fixed number of binary gene sequences into each of several genomes, which together form the genome set.
Preferably, constructing the mapping database of the one-to-one correspondence between feature binary values and feature codes comprises:
placing identical file features in the same table, and naming each table after its file feature to distinguish the tables;
storing the file feature in the first column of each table, the underlying code used when the file feature is 0 in the second column, and the underlying code used when the file feature is 1 in the third column, where 0 means the file feature is absent or not enabled and 1 means it is present and enabled;
scanning the APK files one by one on the basis of the static analysis, with the first APK file serving as a template; after the template has been scanned, creating a new table each time a new file feature appears in a subsequent APK file, and so on, until the mapping database is constructed and filled.
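The table layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: SQLite is used as the storage engine, and the table-naming scheme, the column names (feature, code_0, code_1), and the sample permission strings are hypothetical choices that the source does not specify.

```python
import re
import sqlite3

def table_name(feature):
    """Derive a safe SQL table name from a file-feature name (hypothetical scheme)."""
    return "f_" + re.sub(r"\W", "_", feature)

def add_feature(conn, feature, code_0, code_1):
    """Create the table for a feature the first time it appears, then store its codes."""
    name = table_name(feature)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{name}" '
        "(feature TEXT PRIMARY KEY, code_0 TEXT, code_1 TEXT)"
    )
    # A feature already seen in an earlier APK keeps its first entry.
    conn.execute(
        f'INSERT OR IGNORE INTO "{name}" VALUES (?, ?, ?)',
        (feature, code_0, code_1),
    )

def feature_tables(conn):
    """List the feature tables; their count is the feature-label length."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]

def build_mapping_db(scanned_apks):
    """scanned_apks: one dict per APK, {feature: (code_0, code_1)}; the first is the template."""
    conn = sqlite3.connect(":memory:")
    for features in scanned_apks:
        for feature, (code_0, code_1) in features.items():
            add_feature(conn, feature, code_0, code_1)
    conn.commit()
    return conn
```

Calling feature_tables() after the scan gives the number of distinct file features seen so far.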
Preferably, adding the random gene sequences to the genome set for gene feature evaluation and then feeding the evaluated genome set into a genetic algorithm for training comprises:
adding the random gene sequences to the genome set to obtain a data set to be evaluated, and evaluating the fitness of each binary gene sequence in that data set based on its coverage, diversity, and representativeness, so that every genome in the data set is fully usable and the genomes differ from one another;
applying at least one of single-point crossover, multi-point crossover, uniform crossover, and random mutation to the evaluated genomes to obtain a new genome.
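The four genetic operators named above can be illustrated on bit strings. The sketch below assumes that a gene sequence is a Python string of "0" and "1" characters and that both parents have the same length (at least two bits); the function names, the mutation rate, and the default of two cut points are illustrative choices, not values from the source.

```python
import random

def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length bit strings at one random cut."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def multi_point_crossover(a, b, rng, points=2):
    """Alternate segments between the parents at several random cuts."""
    cuts = sorted(rng.sample(range(1, len(a)), points)) + [len(a)]
    child_1, child_2, prev = "", "", 0
    for i, cut in enumerate(cuts):
        # Even-numbered segments keep their parent, odd-numbered ones are swapped.
        src_1, src_2 = (a, b) if i % 2 == 0 else (b, a)
        child_1 += src_1[prev:cut]
        child_2 += src_2[prev:cut]
        prev = cut
    return child_1, child_2

def uniform_crossover(a, b, rng):
    """Decide independently for every bit which child inherits from which parent."""
    pairs = [(x, y) if rng.random() < 0.5 else (y, x) for x, y in zip(a, b)]
    return "".join(p[0] for p in pairs), "".join(p[1] for p in pairs)

def mutate(gene, rng, rate=0.01):
    """Flip each bit with probability `rate` (random mutation)."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[bit] if rng.random() < rate else bit for bit in gene)
```

Passing an explicit random.Random instance keeps a run reproducible, which is convenient when the generated data sets need to be regenerated later.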
Preferably, updating the genome set with the new genome comprises:
updating the priority label of the new genome, and then adding the new genome to the genome set to update it;
checking the termination condition after each update of the genome set: if the maximum number of iterations has been reached, the genetic process ends; otherwise, the fitness of the new genome is evaluated and the next iteration begins.
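The update-and-terminate cycle can be written as a short control loop. In this sketch the three callbacks (breed, update_priority, evaluate) are hypothetical placeholders for the operations the text describes; only the order of the steps is taken from the source.

```python
def run_generations(genome_set, breed, update_priority, evaluate, max_iterations):
    """Repeat breed -> relabel -> update until the iteration limit is reached.

    breed(genome_set)      -> new genome             (hypothetical callback)
    update_priority(g)     -> genome with new label  (hypothetical callback)
    evaluate(g)            -> fitness score          (hypothetical callback)
    """
    genome_set = list(genome_set)  # work on a copy, leave the caller's list intact
    fitness_log = []
    for iteration in range(1, max_iterations + 1):
        new_genome = update_priority(breed(genome_set))
        genome_set.append(new_genome)             # update the genome set
        if iteration == max_iterations:           # termination condition
            break
        fitness_log.append(evaluate(new_genome))  # evaluate before the next round
    return genome_set, fitness_log
```

A fitness threshold could be added as a second stopping condition by testing the last entry of fitness_log inside the loop.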
Preferably, remapping the gene sequences in the updated genome set back to the corresponding files comprises:
remapping the binary gene sequences in the iteratively updated genome set back to the corresponding feature codes based on the mapping database, and locating the corresponding APK files from those feature codes.
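Remapping a gene sequence amounts to a table lookup per bit. The sketch below assumes an eight-bit priority label in front of the feature bits (the length given in the detailed description) and models the mapping database as a plain dictionary; the names decode_gene and code_table and the sample codes are hypothetical.

```python
def decode_gene(gene, feature_names, code_table, priority_bits=8):
    """Map a binary gene sequence back to the underlying code of every feature.

    code_table[name] is a (code_for_0, code_for_1) pair, standing in for one
    table of the mapping database (hypothetical in-memory representation).
    """
    feature_bits = gene[priority_bits:]  # skip the priority label
    if len(feature_bits) != len(feature_names):
        raise ValueError("gene length does not match the number of features")
    return {
        name: code_table[name][int(bit)]
        for name, bit in zip(feature_names, feature_bits)
    }
```

The resulting dictionary tells the repackaging step which code fragment to write for each feature before the application is rebuilt.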
Based on the above method, the invention further discloses an automatic APK data set generation system, comprising:
an APK static analysis module, used for statically analyzing the decompressed APK file to obtain the relevant APK file features;
an APK gene generation module, used for selecting APK file features and expressing them as gene sequences;
an APK characteristic genetic module, used for evaluating the gene features of the genome set to which the random gene sequences have been added, feeding the evaluated genome set into a genetic algorithm for training, generating a new genome, and updating the genome set;
an APK repackaging module, used for remapping the gene sequences in the updated genome set back to the corresponding files and repackaging the Android application.
In some embodiments, the system also includes an APK decompression module, which is responsible for uploading and unpacking the APK file: before upload it checks whether the file to be processed is an APK file, and then it unpacks the APK file.
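The source does not say how the decompression module recognizes an APK. One plausible check, sketched here as an assumption, relies on the fact that an APK is a ZIP archive containing an AndroidManifest.xml entry; the function names are hypothetical.

```python
import io
import zipfile

def looks_like_apk(source):
    """Return True if `source` (a path or a binary file object) is a ZIP
    archive that contains an AndroidManifest.xml entry."""
    if not zipfile.is_zipfile(source):
        return False
    with zipfile.ZipFile(source) as archive:
        return "AndroidManifest.xml" in archive.namelist()

def unpack_apk(source, dest_dir):
    """Unpack the archive into dest_dir after the check above has passed."""
    if not looks_like_apk(source):
        raise ValueError("not an APK file")
    with zipfile.ZipFile(source) as archive:
        archive.extractall(dest_dir)
```

Checking the file extension alone would be cheaper, but inspecting the archive contents also rejects renamed files.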
Preferably, the APK static analysis module comprises a decompilation submodule and an information extraction submodule. After unpacking, the APK file to be analyzed enters the decompilation submodule, where the system decompiles it with Apktool and, at the same time, selects the classes.dex file and converts it into a JAR package with dex2jar. When the conversion succeeds, the result is stored in the system and passed to the information extraction submodule, where Androguard is used to analyze the data.
The information extraction submodule extracts information from the APK file, including the package name, SDK version, main activity, app name, permissions, permission details, icon, and Android version name. If the Apktool decompilation, the JAR conversion, or the data analysis in the information extraction submodule fails, an error message is returned and the process ends.
The APK gene generation module comprises an APK file feature selection submodule and a gene combination submodule. The APK file feature selection submodule selects APK file features and maps the features and attributes of each APK file to a fixed-length binary sequence in which each bit represents the presence or absence of one feature or attribute. Selectable APK file features include the file type (such as AndroidManifest.xml, classes.dex, and resource files), file size, file content (such as the permission list and method calls), and file path. Simple examples of features are the application's package name, version number, and permission list.
The gene combination submodule combines a fixed number of binary gene sequences into one genome to obtain the genome set; that is, the genes of each individual are represented as a set of file features.
The APK characteristic genetic module comprises a gene initialization submodule, a fitness evaluation submodule, a selection, crossover, and mutation submodule, a gene update submodule, and a termination condition detection submodule.
The gene initialization submodule generates random gene sequences. Each individual in the initial genome has a group of random gene sequences and represents one possible APK file feature data set; individuals are generated by random selection.
The fitness evaluation submodule evaluates the quality of an APK file from its file features, which involves assessing the file's security, performance, functional completeness, and so on.
The selection, crossover, and mutation submodule performs the selection, crossover, and mutation operations of the genetic algorithm on the evaluated genomes; these operations act on the file-feature representation. In the selection operation, individuals with better feature combinations are chosen for reproduction according to the fitness function, and the crossover and mutation operations can be adjusted according to the file features. The crossover operation exchanges part of the information in two gene sequences and may take the form of single-point, multi-point, or uniform crossover; the mutation operation randomly alters individual genes to increase the diversity of the population.
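The source states that fitter individuals are selected for reproduction but does not name a selection scheme. Tournament selection is used below purely as an assumed example; the function name and the tournament size k are illustrative.

```python
import random

def tournament_select(population, fitness_scores, rng, k=3):
    """Pick k random individuals and return the fittest of them.

    Tournament selection is an assumption made for this sketch; the source
    only requires that selection follow the fitness function.
    """
    size = min(k, len(population))
    contenders = rng.sample(range(len(population)), size)
    best = max(contenders, key=lambda i: fitness_scores[i])
    return population[best]
```

A larger k increases selection pressure, because weak individuals then rarely win a tournament.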
The gene update submodule performs the gene update after the set number of iterations, replacing old individuals with new ones to form new genomes; each individual is composed of file features.
The termination condition detection submodule decides whether the algorithm should stop; the condition is based on reaching the maximum number of iterations, the fitness reaching a preset threshold, or other preset conditions.
The APK repackaging module comprises an APK gene conversion submodule and an APK packaging submodule. The APK gene conversion submodule remaps the APK file features back to the corresponding files, and the APK packaging submodule repackages the Android application with Apktool.
Beneficial effects: compared with the prior art, the invention has the following notable advantages. Starting from an existing data set, the APK file features are split systematically and recombined by genetic operations to generate APK files of different types. This design overcomes the insufficient variety, insufficient quantity, and long acquisition period of existing data sets and conveniently provides users with a large number of APKs. Specifically:
1. Automatic generation: APK data sets are generated automatically by the genetic engineering method without manual intervention, which saves considerable time and labor and improves efficiency.
2. Diversity and coverage: the optimization process of the genetic algorithm yields APK data sets with greater diversity and coverage, so the performance and functions of Android applications can be evaluated and tested more comprehensively.
3. Customization: the method can generate APK data sets tailored to user needs and specified requirements, so that the generated data set meets specific test goals and scenarios.
4. Relief of data shortage: in some fields or for specific applications, existing APK data sets may not be sufficient for testing. The genetic engineering method overcomes this shortage and generates richer and more varied data sets.
5. Efficiency and scalability: a genetic algorithm is an efficient optimization algorithm that searches the solution space for good solutions, although it does not guarantee a global optimum. The method scales well and can handle problems of different sizes and complexity.
6. Adaptability and flexibility: the method can be applied to generating different types of APK data sets and can be adjusted and improved for specific situations.
Drawings
FIG. 1 is a diagram of the overall structure of the invention;
FIG. 2 is a workflow diagram of the APK static analysis module;
FIG. 3 is a composition and workflow diagram of the APK gene generation module;
FIG. 4 is a composition and workflow diagram of the APK characteristic genetic module;
FIG. 5 is a composition and workflow diagram of the APK repackaging module.
Detailed Description
The technical scheme of the invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the automatic APK data set generation system comprises:
an APK decompression module, responsible for uploading and unpacking the APK file; before upload it checks whether the file to be processed is an APK file, and then it unpacks the APK file;
an APK static analysis module, used for statically analyzing the decompressed APK file to obtain the relevant APK file features;
an APK gene generation module, used for selecting APK file features and expressing them as gene sequences;
an APK characteristic genetic module, used for evaluating the gene features of the genome set to which the random gene sequences have been added, feeding the evaluated genome set into a genetic algorithm for training, generating a new genome, and updating the genome set;
an APK repackaging module, used for remapping the gene sequences in the updated genome set back to the corresponding files and repackaging the Android application.
As shown in FIG. 2, the APK static analysis module consists of two submodules: a decompilation submodule and an information extraction submodule.
The decompilation submodule performs the basic preparatory work on the APK file: removing any packer shell, unpacking, and decompiling.
The information extraction submodule extracts and analyzes data from the processed APK file.
After unpacking, the APK file to be analyzed enters the decompilation submodule, where the system decompiles it with Apktool and, at the same time, selects the classes.dex file and converts it into a JAR package with dex2jar. When the conversion succeeds, the result is stored in the system and passed to the information extraction submodule, where Androguard is used to analyze the data. If the Apktool decompilation, the JAR conversion, or the data analysis in the information extraction submodule fails, an error message is returned and the process ends.
APK decompilation restores an APK file to readable source code, XML resource files, and other components. The module integrates the open-source tool Apktool for decompilation; Apktool gives access to the XML files, AndroidManifest.xml, and the images under the res directory. Because Apktool runs from the command line, a new system process is started with the subprocess.Popen() function, the command "apktool d apkName" is executed in the new process through the exec() system call, and the process is closed when execution finishes.
The module also integrates dex2jar to convert the resulting classes.dex file into a JAR file, that is, a package of Java class files, to simplify subsequent analysis. Because dex2jar also runs from the command line, a new system process is started with subprocess.Popen(), the command "d2j-dex2jar --force classes.dex" is executed in the new process through the exec() system call, the process is closed when execution finishes, and the JAR file is then unpacked to obtain the result.
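A minimal sketch of how both tools might be launched with subprocess.Popen follows. It assumes that the executables are on the PATH under the names apktool and d2j-dex2jar (dex2jar distributions typically ship the launcher as d2j-dex2jar.sh or d2j-dex2jar.bat), and the helper names, the output paths, and the timeout are hypothetical.

```python
import subprocess

def apktool_command(apk_path, out_dir):
    """Command line for decoding an APK: -o sets the output directory,
    -f overwrites it if it already exists."""
    return ["apktool", "d", apk_path, "-o", out_dir, "-f"]

def dex2jar_command(dex_path, jar_path):
    """Command line for converting classes.dex to a JAR; the executable
    name is an assumption about the local installation."""
    return ["d2j-dex2jar", "--force", "-o", jar_path, dex_path]

def run_tool(command, timeout=600):
    """Start the tool in a new process, wait for it, and return (ok, output)."""
    try:
        proc = subprocess.Popen(
            command, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
    except OSError as exc:  # for example, the tool is not installed
        return False, str(exc)
    try:
        output, _ = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()  # stop a hung tool, then collect what it printed
        output, _ = proc.communicate()
        return False, output
    return proc.returncode == 0, output
```

Returning a success flag together with the captured output lets the caller report an error message and end the process when either step fails, as the text requires.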
The information extraction submodule extracts information from the APK file, including the package name, SDK version, main activity, app name, permissions, permission details, icon, and Android version name. The Androguard tool is integrated into the system for this purpose. Androguard is an open-source tool written in Python for analyzing Android applications and malware. It provides a set of tools that help security researchers detect vulnerabilities and malicious behavior in Android applications, and help analysts examine the components and structure of an APK file, such as the DEX file, the Manifest file, the resource files, the code, and the decompilation result.
The specific procedure is as follows:
1. Androguard is imported into the module and AnalyzeAPK is called with the location of the APK file. AnalyzeAPK is defined by Androguard and returns three objects: an APK object, an abstraction of the APK file from which information such as the version number, package name, and activities can be obtained; an array of DalvikVMFormat objects, each element of which corresponds to one classes.dex file and from which classes, methods, and strings can be obtained; and an Analysis object, which contains special classes that link information about the classes.dex files. The three objects are named a, d, and dx, respectively.
2. The following are obtained through methods of the APK object: the package name through get_package, the main activity through get_main_activity, the app name through get_app_name, the requested permissions through get_permissions, the details of the requested permissions through get_details_permissions, the icon through get_app_icon, the Android version name through get_androidversion_name, the effective target SDK version through get_effective_target_sdk_version, and the XML configuration file information through get_android_manifest_axml().get_xml().
3. The sensitive permissions used by the APK, and the names of the classes that use them, are obtained through the DalvikVMFormat objects. First, the apkPermissonMap resource module is loaded; it contains a predefined mapping from Android API methods to the permissions they require and is used to match the methods in the APK. Then all methods in the APK are traversed with the get_methods() method of the DalvikVMFormat object, their class names and method names are obtained and matched against the mapping loaded above, and every method that matches a sensitive system permission is recorded.
4. The system methods called by the APK, and the classes that call them, are obtained through the dx object: dx.classes['Ljava/io/File;'].get_methods() returns all Ljava/io/File; methods in the APK, the cross-reference information of each method is then retrieved, and the output is formatted to give the system methods called by the APK.
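Steps 1 to 3 can be sketched as follows. The sketch assumes that Androguard is installed and exposes AnalyzeAPK in androguard.misc; the import is placed inside load_apk so the remaining helpers can be exercised without the library. The dictionary keys and the layout of permission_map (a dictionary keyed by class name and method name) are hypothetical stand-ins for the apkPermissonMap resource.

```python
def load_apk(apk_path):
    """Return the (a, d, dx) triple from Androguard for the given APK path."""
    from androguard.misc import AnalyzeAPK  # assumed import path, loaded lazily
    return AnalyzeAPK(apk_path)

def extract_basic_info(a):
    """Collect the fields of step 2 from an Androguard APK object `a`."""
    return {
        "package": a.get_package(),
        "main_activity": a.get_main_activity(),
        "app_name": a.get_app_name(),
        "permissions": a.get_permissions(),
        "permission_details": a.get_details_permissions(),
        "icon": a.get_app_icon(),
        "version_name": a.get_androidversion_name(),
        "target_sdk": a.get_effective_target_sdk_version(),
    }

def sensitive_calls(dex_files, permission_map):
    """Match every method of the DEX objects against an API-to-permission map.

    permission_map: {(class_name, method_name): permission}, a hypothetical
    in-memory form of the predefined mapping described in step 3.
    """
    hits = []
    for dex in dex_files:
        for method in dex.get_methods():
            key = (method.get_class_name(), method.get_name())
            if key in permission_map:
                hits.append((key[0], key[1], permission_map[key]))
    return hits
```

With a real APK, the calls would be chained as a, d, dx = load_apk(path), then extract_basic_info(a) and sensitive_calls(d, permission_map).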
As shown in FIG. 3, the APK gene generation module consists of a code-base library sub-module, an APK file feature selection sub-module and a gene combination module.
The code-base library sub-module aims to form a mapping database of one-to-one correspondence between feature binary and feature codes, wherein the mapping database comprises file features, 0-1 pairs of file features, bottom layer codes corresponding to the file features and the like. Where 0 represents the absence or non-enablement of the file feature and 1 represents the presence and enablement of the file feature.
The first column in the database table stores the bottom code when the file feature is 0, the second column stores the bottom code when the file feature is 1. Based on the APK static analysis module, scanning APK files one by one, and creating a database feature set by taking the first APK file as a template. Because the APK dataset is large, the same file features are placed in the same table, with each table being distinguished by file feature naming. After the scanning of the template APK file is finished, a new table is created every time a new file feature appears in the subsequent APK file. And the like, completing the construction and filling of the mapping database. The number of tables in the mapping database represents the binary sequence feature tag length.
In the APK file feature selection submodule, features are selected from the information obtained by the APK static analysis module: file type (such as AndroidManifest.xml, classes.dex, resource files, etc.), file size, file content (such as the permission list, method calls, etc.), file path, and so on. The features and attributes of each APK file are mapped to a fixed-length binary sequence in which each bit represents the presence or absence of one feature or attribute: 0 means the attribute is absent and 1 means it is present.
To increase the diversity of selection, an eight-bit binary number is prepended to the feature binary sequence as a priority label, which identifies whether the APK is malicious, whether the APK contains a specified feature, and so on. The first bit specifies whether the APK is malicious; whether an APK is malicious can be judged by whether the sensitive permissions among the existing file features are enabled.
Finally, the APK file feature selection submodule generates a plurality of binary gene sequences, each consisting of a priority label and a feature label. The priority label has eight bits in total, and the length of the feature label is determined by the number of tables in the database.
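As a minimal sketch of this encoding, the following Python function builds one binary gene sequence from a set of features. The text only fixes the meaning of the first priority bit (malicious or not), so the remaining seven bits are left at 0 here; the function name encode_gene, the feature names and the feature ordering are assumptions for illustration.

```python
PRIORITY_BITS = 8  # length of the priority label stated in the text


def encode_gene(apk_features, feature_order, malicious=False):
    """Return priority label + feature label as a string of '0'/'1'.

    apk_features: set of feature names present and enabled in the APK.
    feature_order: one name per table in the mapping database, in a fixed order.
    """
    priority = [0] * PRIORITY_BITS
    priority[0] = 1 if malicious else 0  # first bit: malicious flag
    feature_label = [1 if name in apk_features else 0 for name in feature_order]
    return "".join(str(bit) for bit in priority + feature_label)


order = ["INTERNET", "SEND_SMS", "CAMERA"]
gene = encode_gene({"INTERNET", "CAMERA"}, order, malicious=True)
print(gene)  # 10000000101
```

The resulting sequence always has 8 + len(feature_order) bits, so every gene produced against the same database has the same length.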
The gene combination submodule then combines a fixed number of binary gene sequences into one genome, and n genomes are formed in total, giving the genome set.
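The grouping step can be sketched as simple chunking. The text does not say what happens to sequences left over when the total is not a multiple of the genome size; dropping them is an assumption made here, and the function name combine_into_genomes is hypothetical.

```python
def combine_into_genomes(genes, genes_per_genome):
    """Group gene sequences into genomes of a fixed size.

    Leftover sequences that cannot fill a whole genome are dropped
    (an assumption; the text does not specify this case).
    """
    last_start = len(genes) - genes_per_genome + 1
    return [genes[i:i + genes_per_genome]
            for i in range(0, last_start, genes_per_genome)]


genes = ["10000000101", "00000000110", "00000000011", "10000000111", "00000000001"]
genome_set = combine_into_genomes(genes, 2)
print(len(genome_set))  # n, the number of genomes formed
```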
As shown in FIG. 4, the APK feature inheritance module starts with the gene initialization submodule, which imports the existing genome set. In addition, binary gene sequences in the genome set are randomly selected to generate some random gene sequences, simulating the uncertainty and possible bad values of genes.
Next, the fitness evaluation submodule adds the generated random gene sequences to the existing genome set and evaluates the gene features. Fitness is evaluated for each individual (one individual is one binary gene sequence) based on the coverage, diversity, representativeness, etc. of the data set, ensuring that each genome is complete and usable and that the genomes differ from one another. Each individual in the initial genome has a set of random gene sequences generated by random selection, and each individual represents a possible APK file feature data set.
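The text names coverage, diversity and representativeness as evaluation criteria but gives no formulas. As one possible illustration of the diversity criterion only, the sketch below scores an individual by its mean normalized Hamming distance to the other individuals; this particular measure and the function names are assumptions, not the fitness function of the invention.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def diversity_fitness(index, population):
    """Mean normalized Hamming distance from one individual to all others.

    Returns a value in [0, 1]; higher means the individual differs more
    from the rest of the population.
    """
    individual = population[index]
    others = population[:index] + population[index + 1:]
    if not others:
        return 0.0
    total = sum(hamming(individual, other) for other in others)
    return total / (len(others) * len(individual))


population = ["0000", "1111", "0011"]
scores = [diversity_fitness(i, population) for i in range(len(population))]
print(scores)
```

A coverage or representativeness term could be combined with this score in the same way, for example as a weighted sum.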
The evaluated genomes are put into the genetic algorithm for training; inheritance can be carried out in several ways, such as single-point crossover, multi-point crossover, uniform crossover and random change, finally yielding a new genome. The selection, crossover and mutation operations of the genetic algorithm act on the file-feature-based representation. In the selection operation, individuals with better feature combinations are chosen for reproduction according to the fitness function, and the crossover and mutation operations can be adjusted according to the file features. The crossover operation, which may be single-point, multi-point or uniform, exchanges part of the information of the gene sequences; the mutation operation randomly alters individual genes to increase population diversity.
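The crossover and mutation operators named above are standard genetic algorithm operators and can be sketched on bit strings as follows. The function names, the 0.5 swap probability in the uniform crossover and the per-bit mutation rate parameter are illustrative choices, not values given in the text.

```python
import random


def single_point_crossover(a, b, rng):
    """Swap the tails of two equal-length bit strings at one random point."""
    point = rng.randrange(1, len(a))  # requires len(a) >= 2
    return a[:point] + b[point:], b[:point] + a[point:]


def uniform_crossover(a, b, rng):
    """Swap each bit position between the parents with probability 0.5."""
    pairs = [(x, y) if rng.random() < 0.5 else (y, x) for x, y in zip(a, b)]
    return "".join(p[0] for p in pairs), "".join(p[1] for p in pairs)


def mutate(gene, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return "".join(
        ("1" if bit == "0" else "0") if rng.random() < rate else bit
        for bit in gene
    )


rng = random.Random(0)  # seeded only to make the example repeatable
parent_a, parent_b = "00000000", "11111111"
child_1, child_2 = single_point_crossover(parent_a, parent_b, rng)
child_3, child_4 = uniform_crossover(parent_a, parent_b, rng)
print(child_1, child_2, child_3, child_4, mutate(parent_a, 0.1, rng))
```

Multi-point crossover follows the same pattern with several cut points. Whether the eight priority bits take part in crossover is not stated in the text; the sketch treats the whole sequence uniformly.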
The priority label of the new genome is then updated, and the new genome is added to the genome set to update it.
Updating the genome set increases gene diversity, opens up more possibilities in subsequent training and yields more gene varieties. After each update of the genome set, the termination condition is checked to determine whether the maximum number of iterations has been reached. If it has, gene inheritance ends and the genome is passed to the next module; if not, the newly generated genome is returned to the fitness evaluation submodule for evaluation and the next iteration begins.
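The evaluate, breed, update and terminate cycle can be sketched as a simple loop. The fitness and breeding functions are passed in as parameters because the text does not fix them; picking the two fittest individuals as parents, the name evolve and the toy fitness and breed functions in the usage lines are assumptions for illustration.

```python
def evolve(genome_set, fitness, breed, max_iterations):
    """Run the evaluate -> breed -> update loop for a fixed number of iterations.

    genome_set: list of bit-string individuals.
    fitness: function mapping an individual to a score (higher is better).
    breed: function mapping two parents to one new individual.
    """
    for _ in range(max_iterations):  # termination: maximum iteration count
        ranked = sorted(genome_set, key=fitness, reverse=True)
        parent_a, parent_b = ranked[0], ranked[1]  # assumed selection rule
        child = breed(parent_a, parent_b)
        genome_set = genome_set + [child]  # update the genome set
    return genome_set


initial = ["0011", "1100", "1111"]
result = evolve(
    initial,
    fitness=lambda g: g.count("1"),    # toy fitness: number of set bits
    breed=lambda a, b: a[:2] + b[2:],  # toy single-point crossover
    max_iterations=3,
)
print(len(result))  # 6
```

Each pass adds one new individual, so the set grows by max_iterations entries; a variant that replaces old individuals instead of appending would change only the update line.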
As shown in FIG. 5, the APK repackaging module includes an APK gene conversion submodule and an APK packing submodule.
The APK gene conversion submodule receives the genome whose iteration was completed by the previous module and remaps the APK file features in the genome back to the corresponding files.
The mapping process relies on the mapping database generated by the code-base library submodule. Because the scan covers many APK files, several underlying codes may exist for each feature in the database, so the corresponding feature table is searched according to the 0 or 1 in the binary code. To reduce the complexity of looking up file feature codes, a sequential search is used: if a corresponding file feature code is found in the searched column, it is adopted and the search for that feature stops; if the whole table is traversed without finding a corresponding result, the file feature code is set to empty.
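The sequential lookup can be illustrated as follows. The sketch assumes each table is a list of rows laid out as in claim 6 (file feature, code at 0, code at 1) and that the feature label has already been separated from the priority label; the function name decode_gene and the sample tables are hypothetical.

```python
def decode_gene(feature_bits, feature_order, database):
    """Map each bit of a feature label back to an underlying code.

    For every feature, the matching column (code-at-0 or code-at-1) of its
    table is scanned row by row; the first non-empty entry is used, and an
    empty string is returned if the whole table has no entry.
    """
    codes = {}
    for name, bit in zip(feature_order, feature_bits):
        column = 1 + int(bit)  # column 1 holds code for 0, column 2 for 1
        found = ""
        for row in database.get(name, []):
            if row[column]:
                found = row[column]
                break  # stop at the first match for this feature
        codes[name] = found
    return codes


database = {
    "INTERNET": [["INTERNET", None, "code_internet_on"]],
    "CAMERA": [["CAMERA", "code_camera_off", None],
               ["CAMERA", None, "code_camera_on"]],
}
order = ["INTERNET", "CAMERA"]
print(decode_gene("10", order, database))
print(decode_gene("01", order, database))
```

In the second call no code-at-0 entry exists for INTERNET, so its code is left empty, as described for the case where the whole table is traversed without a result.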
The APK packing submodule then repacks the Android application using Apktool:
apktool b output_folder -o new_app.apk
where output_folder is the directory containing the modified files and new_app.apk is the name of the APK file generated by repacking.
If integrity is required, the APK can be signed with a signing tool; here the jarsigner tool of Java is used:
jarsigner -verbose -sigalg SHA1withRSA -digestalg SHA1 -keystore my-release-key.keystore new_app.apk alias_name
where my-release-key.keystore is the keystore file, new_app.apk is the APK file to be signed, and alias_name is the alias of the key in the keystore.
If there is no keystore, the keytool command can be used to create one:
keytool -genkey -v -keystore my-release-key.keystore -alias alias_name -keyalg RSA -keysize 2048 -validity 10000
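When many APKs are generated, the repacking and signing commands would normally be driven from a script. The sketch below builds the same apktool and jarsigner invocations as argument lists; the helper names build_repack_commands and run_commands are hypothetical, and the commands are only executed if run_commands is called on a machine where both tools are installed.

```python
import subprocess


def build_repack_commands(output_folder, apk_name, keystore, alias):
    """Return the apktool build and jarsigner sign commands as argv lists."""
    build = ["apktool", "b", output_folder, "-o", apk_name]
    sign = [
        "jarsigner", "-verbose",
        "-sigalg", "SHA1withRSA",
        "-digestalg", "SHA1",
        "-keystore", keystore,
        apk_name, alias,
    ]
    return [build, sign]


def run_commands(commands):
    """Run each command in order and stop on the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_repack_commands(
    "output_folder", "new_app.apk", "my-release-key.keystore", "alias_name"
)
for command in commands:
    print(" ".join(command))
```

Passing argument lists rather than one shell string avoids quoting problems when folder or file names contain spaces.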
In summary, the invention follows the idea of biological inheritance: genomes are generated from file features, and the genome set serves as the DNA that is inherited. Gene combination, gene mutation, genetic variation, genetic drift and the like can occur during inheritance. Gene combination refers to the combination of the parents' genes and their rearrangement in the offspring, giving the offspring a unique genetic combination; gene mutation refers to changes in the DNA sequence that may occur during inheritance, in forms such as point mutation, insertion and deletion, which affect the genetic characteristics and traits of individuals; genetic variation refers to the genetic differences between individuals produced by gene recombination and mutation, which make offspring differ from one another; genetic drift refers to random changes in gene frequency within a population caused by random events, which can influence the genetic structure and evolution of the population.
By drawing on the characteristics of biological inheritance, the invention combines features to form APK files with different characteristics. A large number of diverse APKs are generated automatically by a program without manual intervention, which saves considerable time and labor cost and improves efficiency. In some cases the method can replace the work of collecting a large number of APK files from the network, which greatly facilitates research work and the early-stage data collection for model training.
Through the optimization process of the genetic algorithm, the invention can generate an APK data set with diversity and higher coverage, so that the performance and functionality of Android applications can be evaluated and tested more comprehensively. The method can also generate novel APKs, which to some extent compensates for APK files that could not be collected.
The stages of the genetic algorithm can be modified as needed, so that APK data sets can be customized according to user needs and specified requirements. Such customization ensures that the generated data set meets specific test goals and scenario requirements, giving the method high overall flexibility and variability.

Claims (11)

1. A method for automatically generating an APK data set based on genetic engineering, characterized in that it comprises the following steps: performing static analysis on a decompressed APK file to obtain APK file features; selecting some of the APK file features and representing them as gene sequences to obtain a genome set; randomly selecting gene sequences in the genome set to generate random gene sequences; adding the random gene sequences to the genome set for gene feature evaluation, and then putting the evaluated genome set into a genetic algorithm for training to obtain a new genome; updating the genome set with the new genome; and remapping the gene sequences in the updated genome set back to the corresponding files and repackaging the Android application.

2. The method for automatically generating an APK data set based on genetic engineering according to claim 1, characterized in that performing static analysis on the decompressed APK file comprises: decompiling the decompressed APK file and converting it into a jar package, then extracting information from the APK file, and analyzing the components and structure of the APK file to obtain the APK file features.

3. The method for automatically generating an APK data set based on genetic engineering according to claim 2, characterized in that the decompiling and conversion into a jar package are performed as follows: the ApkTool tool is used for decompilation, and the dex2jar tool is used to convert the obtained classes.dex file into a jar package file.

4. The method for automatically generating an APK data set based on genetic engineering according to claim 2, characterized in that extracting information from the APK file comprises: extracting the package name, SDK version, main activity, app name, permission information, detailed permission information, icon and Android version name of the APK file; and analyzing the components and structure of the APK file comprises using the Androguard tool to analyze the Dex file, the Manifest file, the resource files, the code and the decompilation results.

5. The method for automatically generating an APK data set based on genetic engineering according to claim 1, characterized in that selecting some of the APK file features and representing them as gene sequences comprises the following steps: constructing a mapping database of the one-to-one correspondence between feature binary values and feature codes; selecting the file type, file size, file content and file path among the APK file features as mapping features and, based on the mapping database, mapping each mapping feature to a fixed-length binary sequence, namely a feature binary sequence; adding a binary priority label to each feature binary sequence to obtain a binary gene sequence; and combining a fixed number of binary gene sequences into several genomes, the several genomes forming the genome set.

6. The method for automatically generating an APK data set based on genetic engineering according to claim 5, characterized in that constructing the mapping database of the one-to-one correspondence between feature binary values and feature codes comprises: placing the same file feature in the same table, each table being distinguished by file feature name; storing the file feature in the first column of the database table, the underlying code when the file feature is 0 in the second column, and the underlying code when the file feature is 1 in the third column, where 0 means that the file feature is absent or not enabled and 1 means that the file feature is present and enabled; and, based on the static analysis, scanning the APK files one by one with the first APK file as a template, and, after the template has been scanned, creating a new table each time a subsequent APK file exhibits a new file feature, and so on, to complete the construction and filling of the mapping database.

7. The method for automatically generating an APK data set based on genetic engineering according to claim 1, characterized in that adding the random gene sequences to the genome set for gene feature evaluation and then putting the evaluated genome set into a genetic algorithm for training comprises: adding the random gene sequences to the genome set to obtain a data set to be evaluated, and evaluating the fitness of each binary gene sequence in the data set to be evaluated based on the coverage, diversity and representativeness of that data set, ensuring that each genome in the data set to be evaluated is complete and usable and that the genomes differ from one another; and subjecting the evaluated genomes to inheritance through at least one of single-point crossover, multi-point crossover, uniform crossover and random change to obtain a new genome.

8. The method for automatically generating an APK data set based on genetic engineering according to claim 1, characterized in that updating the genome set with the new genome comprises: updating the priority label of the new genome, and then adding the new genome to the genome set to update the genome set; and, after each update of the genome set, checking the termination condition to determine whether the maximum number of iterations has been reached; if it has, ending gene inheritance; if not, evaluating the fitness of the new genome and performing the next iteration.

9. The method for automatically generating an APK data set based on genetic engineering according to claim 5, characterized in that remapping the gene sequences in the updated genome set back to the corresponding files comprises: based on the mapping database, remapping the binary gene sequences in the iteratively updated genome set back to the corresponding feature codes, and then finding the corresponding APK files according to the feature codes.

10. A system for automatically generating an APK data set, characterized in that it comprises: an APK static analysis module, used to statically analyze a decompressed APK file to obtain the relevant APK file features; an APK gene generation module, used to select APK file features and represent the APK file features as gene sequences; an APK feature inheritance module, used to evaluate the gene features of the genome set to which random gene sequences have been added, and then put the evaluated genome set into a genetic algorithm for training, generating a new genome and updating the genome set; and an APK repackaging module, used to remap the gene sequences in the updated genome set back to the corresponding files and repackage the Android application.

11. The system for automatically generating an APK data set according to claim 10, characterized in that the APK static analysis module comprises a decompilation submodule and an information extraction submodule; the APK gene generation module comprises an APK file feature selection submodule and a gene combination submodule, the APK file feature selection submodule being used to select APK file features and map the features and attributes of each APK file to a fixed-length binary sequence, and the gene combination submodule being used to combine a fixed number of binary gene sequences into one genome to obtain the genome set; the APK feature inheritance module comprises a gene initialization submodule, a fitness evaluation submodule, a selection, crossover and mutation submodule, a gene update submodule and a termination condition detection submodule, the gene initialization submodule being used to generate random gene sequences, the fitness evaluation submodule being used to evaluate the quality of APK files according to the file features, the selection, crossover and mutation submodule being used to perform selection, crossover and mutation operations on the evaluated genomes, the gene update submodule being used to perform a gene update after a predetermined number of iterations, replacing old individuals with newly generated individuals to form a new genome, and the termination condition detection submodule being used to decide whether to stop the execution of the algorithm; and the APK repackaging module comprises an APK gene conversion submodule and an APK packing submodule, the APK gene conversion submodule being used to remap the APK file features back to the corresponding files, and the APK packing submodule being used to repackage the Android application using Apktool.
CN202410896182.9A 2024-07-05 2024-07-05 A method and system for automatically generating APK data sets based on genetic engineering Pending CN118797642A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410896182.9A CN118797642A (en) 2024-07-05 2024-07-05 A method and system for automatically generating APK data sets based on genetic engineering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410896182.9A CN118797642A (en) 2024-07-05 2024-07-05 A method and system for automatically generating APK data sets based on genetic engineering

Publications (1)

Publication Number Publication Date
CN118797642A true CN118797642A (en) 2024-10-18

Family

ID=93023387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410896182.9A Pending CN118797642A (en) 2024-07-05 2024-07-05 A method and system for automatically generating APK data sets based on genetic engineering

Country Status (1)

Country Link
CN (1) CN118797642A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100074439A1 (en) * 2006-07-06 2010-03-25 William Garreth James Howells method and apparatus for the generation of code from pattern features
CN103106253A (en) * 2013-01-16 2013-05-15 西安交通大学 Data balance method based on genetic algorithm in MapReduce calculation module
CN109977227A (en) * 2019-03-19 2019-07-05 中国科学院自动化研究所 Text feature, system, device based on feature coding
CN116383816A (en) * 2023-02-03 2023-07-04 北京工业大学 Android malicious software detection feature selection method based on genetic algorithm

Similar Documents

Publication Publication Date Title
CN109361643B (en) A deep traceability method for malicious samples
CN105184160B (en) A kind of method of the Android phone platform application program malicious act detection based on API object reference relational graphs
US8010844B2 (en) File mutation method and system using file section information and mutation rules
CN111639337B (en) Unknown malicious code detection method and system for massive Windows software
RU91213U1 (en) SYSTEM OF AUTOMATIC COMPOSITION OF DESCRIPTION AND CLUSTERING OF VARIOUS, INCLUDING AND MALIMENTAL OBJECTS
CN109829312B (en) JAVA vulnerability detection method and detection system based on call chain
CN104123493A (en) Method and device for detecting safety performance of application program
CN111897742B (en) Method and device for generating intelligent contract test case
CN108536451B (en) Method and device for embedding embedded point of application program
CN109740347A (en) A method for identifying and cracking fragile hash functions for smart device firmware
JP7119096B2 (en) license verification device
CN113139192B (en) Third party library security risk analysis method and system based on knowledge graph
US20200226232A1 (en) Method of selecting software files
CN105740132B (en) Software package source automatic analysis method based on modification daily record
CN104866764B (en) A kind of Android phone malware detection method based on object reference figure
KR20190102456A (en) Method for clustering application and apparatus thereof
Liu et al. Enhancing malware detection for android apps: Detecting fine-granularity malicious components
CN113901463B (en) Concept drift-oriented interpretable Android malicious software detection method
CN105205398B (en) It is a kind of that shell side method is looked into based on APK shell adding software dynamic behaviours
CN117009972A (en) Vulnerability detection method, vulnerability detection device, computer equipment and storage medium
CN113254024A (en) Code inheritance relationship optimization method, device, equipment and storage medium
Feichtner et al. Obfuscation-resilient code recognition in Android apps
CN109508545B (en) An Android Malware Classification Method Based on Sparse Representation and Model Fusion
CN114817925B (en) Android malicious software detection method and system based on multi-modal graph features
CN118797642A (en) A method and system for automatically generating APK data sets based on genetic engineering

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20241018