More framework binary system similar codes detecting systems and method
Technical field
The present invention relates to a kind of technology of computer realm, specifically a kind of more framework binary system similar codes detections
System and method.
Background technology
With the popularization and application of smart machine, increasing program has been migrated to ARM, MIPS from desktop platform
Deng in the embedded device for framework, although the binary program analytical technology of desktop platform (such as x86) is more ripe,
Because different frameworks in instruction set, code offset and calling convention etc. suffer from huge difference, pass through these technologies point
The program information separated out is difficult to move among another framework (such as ARM) from a framework (such as x86).
The content of the invention
The present invention can not match the program code of API semantic informations shortage for prior art and can not realize across framework
The defects of matching, a kind of more framework binary system similar codes detecting systems and method are proposed, similar generation is positioned by automating
Code, so as to which the code information synchronous migration of completion will have been analyzed to different platforms, extra manual analysis cost is reduced,
Improve analysis efficiency.
The present invention is achieved by the following technical solutions:
The present invention relates to a kind of more framework binary system similar codes detecting systems, including:Pretreatment module, parameter identification mould
Block, Switch redirect identification module, semantic generation module and comparison module indirectly, wherein:Pretreatment module receives pending
Binary code simultaneously exports binary function and redirects identification module and semantic generation indirectly to parameter identification module, Switch respectively
Module, parameter identification module therefrom extract sequencing table and exported to semantic generation module, redirect identification module indirectly and therefrom carry
Take out the indirect skip instructions of Switch and export to semantic generation module, semantic generation module and generated according to parameter identification module
Parameter information, the simulation for carrying out binary function performs, and extracts semantic feature sequence therein and export to comparison module, than
The semantic feature sequence that receives is compared by the way of it using sequence pair compared with module and exports Similarity value.
The present invention relates to more framework binary system similar codes detection methods of said system, including:
Step 1) pre-processes to binary code, obtains the assembly code and controlling stream graph of binary function, tool
Body step includes:
1.1) dis-assembling:Use IDA Pro dis-assemblings code to be analyzed;
1.2) functional boundary identifies:The interface provided using IDAPython obtains the address of each function of code to be analyzed
Section;
1.3) control flow graph obtains:The interface provided using IDAPython obtains each basic block address area of function
Between and basic block between points relationship.
Step 2) function parameter identifies:The framework according to where binary function, entered according to corresponding calling convention record two
Register where the parameter of function processed or the skew relative to stack pointer, then to register using register number as index order,
Stack is offset using bias size as index order, specific steps include:
2.1):Travel through each paths in binary function controlling stream graph;
2.2):For every on the path instruction for including read operation, according to calling convention, if the target variable read is
Parameter register or be the variable in stack frame parameter area, and the target variable was not defined before this path, then
It is identified as register parameters or stack passes parameter, records the register number of register parameters or stack passes parameter and referred to respect to stack
The skew of pin;
2.3):To the register number of record using number size as index order, the relative skew of parameter is passed to the stack of record
Using bias size as index order.
The indirect jump target identifications of step 3) Switch:Redirected from the read-only data section identification of binary executable
Table, and the indirect skip instruction of correspondence for being mapped to correlation function, specific steps include:
3.1):The read-only data section of binary executable is traveled through, if the numerical value of an address size points to code
Section and its address quoted by code segment, then the numerical value is identified as the header element of jump list, afterwards the number of continuous address size
As long as the Same Function of value sensing code segment is considered as the element in the jump list;
3.2):Every paths among control flow graph are traveled through, if refer to some jump list in certain path, it
First redirects and is identified as the correspondence of the jump list and redirects indirectly indirectly afterwards, then records corresponding relation.
Step 4) binary code is translated:The binary code of different frameworks is shown as unified shape with intermediate language table
Formula, i.e., each binary function static conversion is expressed into VEX-IR using the interface that PyVEX is provided.
Step 5) semantic feature generates:Simulation performs the VEX-IR expression of each binary function, and in simulation process
In extract semantic feature in a manner of dynamic pitching pile.
Described semantic feature includes:Input and output value, condition compare numerical value, library function call record.
Described simulation performs, and firstly generates random number sequence, the function to compare shares.
Described simulation performs, and used parameter value is random number sequence, is identified according to the parameter obtained in step 2)
As a result order carries out assignment.For each function, the assignment all since the first element of random number sequence.
Described simulation performs, and when running into when redirecting indirectly of Switch, chooses constant offset among corresponding jump list
As jump target, guarantee continues simulation and performed for address.
Step 6) similarity system design, two for similarity system design are calculated by longest common subsequence (LCS) algorithm
The semantic feature similarity of binary function, returns to Similarity value, and specific steps include:
6.1):For the semantic feature sequence of two functions to compare, the longest common subsequence of the two is calculated;
6.2):The Similarity value of two characteristic sequences is calculated using Jaccard coefficients, wherein, two sequences common factor length
For previous step longest common subsequence length, union is two sequences length and subtracts longest common subsequence length.
Technique effect
Compared with prior art, the present invention is completely dependent on semantic feature, and has carefully considered the distribution of function parameter, subtracted
Few negative effect redirected indirectly to static analysis, so other relative more framework similar codes detection schemes can ensure to imitate
On the premise of rate, more accurately detect, orient the binary code of similar semantic.The present invention is in across framework binary code
Clone context of detection worked relatively accuracy rate raising, more compiling options compiling binary codes clone context of detection it is relative
The accuracy rate that worked improves, and is 10 seconds in average a pair of binary function match time orders of magnitude.
Brief description of the drawings
Fig. 1 is present system structural representation;
Fig. 2 is IA-32vsARM effect diagrams in embodiment;
Fig. 3 is IA-32vsMIPS effect diagrams in embodiment;
Fig. 4 is ARMvsMIPS effect diagrams in embodiment;
Fig. 5 is gccvsclang (IA-32) effect diagram in embodiment;
Fig. 6 is-O3vs-O0 (IA-32gcc) effect diagram in embodiment.
Embodiment
As shown in figure 1, the present embodiment includes:Pretreatment module, parameter identification module, identification module, semanteme are redirected indirectly
Generation module and comparison module, wherein:Pretreatment module receives pending binary code and exports binary function difference
To parameter identification module, indirectly identification module and semantic generation module are redirected, parameter identification module therefrom extracts sequencing table simultaneously
Output redirects identification module and therefrom extracts indirect skip instruction and export to semanteme generation mould indirectly to semantic generation module
Block, the simulation that semantic generation module carries out binary function by dynamic pitching pile performs, and it is defeated to extract semantic feature therein
Go out to comparison module, comparison module and the semantic feature received is compared by the way of longest common subsequence algorithm alignment
And Similarity value is exported by Jaccard coefficients.
The present embodiment is related to the detection method of said system, comprises the following steps:
1) assembly code of binary file to be detected is obtained using IDA Pro (disassemblers);
2) parameter information of each function of above binary code is obtained using IDAPython plug-in unit combinations automatized script
And Switch structure jump list information.Wherein, function parameter recognizer is as follows:
3) for every instruction on the every paths of function:
3.1) when parameter on stack is read in the instruction, then recording address is offset relative to stack pointer;
3.2) when the instruction uses parameter register value, and the parameter register is not local register, then record should
Register;
3.3) then it is local register by the register tagging when instruction rewriting parameter register value.
4) Switch jump targets identify:
4.1) the read-only data section of linear search binary executable, when one section of continuous address size numerical value is in generation
In code address section, then it is assumed that this one piece of data is jump list, records first numerical value of the table as jump list first address;
4.2) using distance read each jump list first address instruct the indirect jump instruction of beeline be used as it is corresponding between
Connect and redirect;
5) random number streams are generated, numerical value among random number streams is provided in order according to the parameter information of each function and simulation is held
OK, when going to when redirecting indirectly of Switch structures, the jump target of particular offset is chosen.In implementation procedure is simulated, note
The semantic feature of each function is recorded, including:Input and output value, condition compare numerical value and library function call record;
Similarity value and the output for participating in comparing binary function semantic feature are calculated using LCS algorithms.
Specific implementation sample includes:busybox v1.25.1,convert v6.9.2,curl v7.39,lua
V5.2.3, mutt v1.5.24, openssl v1.0.1p, puttygen v0.64, siege v3.0.1 and wget
v1.15.
Specific implementation environment is described as follows:
1) this is embodied on three main flow frameworks and realized:IA-32, ARM and MIPS, three frameworks are 32;
2) IA-32 binary files compile in the virtual machine of Ubuntu 12.04 (i386) position system, ARM and MIPS
Binary file compiles in Debian 7.0 for the QEMU simulators of system;
3) in three above framework, each sample program uses two compilers of gcc v4.7.3 and clang v3.0 respectively
With three compiling optimization option compilings of-O3 ,-O2 and-O0;
4) hardware configuration of analysis environments is:Intel Core i5-2320@3GHz (CPU), 8G DDR3-RAM (RAM)
Experimental data is as follows:
1) across framework binary code clone detection:Concrete outcome such as Fig. 2-Fig. 4, wherein Average Accuracy are 80.1%;
2) compile option compiling binary code clone's detection more:Concrete outcome such as Fig. 5-Fig. 6, wherein Fig. 5 average standard
The Average Accuracy that true rate is 78.2%, Fig. 6 is 82.6%;
3) the matching used time of above empirical average each pair binary function is 5.2 seconds, parameter detecting and Switch brief introductions
Jump target identification average 55.2 seconds used times.
Compared with the prior art:
1) compared with Multi-MH:According to document Jannik Pewny, Behrad Garmany, Robert Gawlik,
Christian Rossow,and Thorsten Holz.2015.Cross-Architecture Bug Search in
Binary Executables.In Proceedings of the 2015IEEE Symposium on Security and
Method in Privacy (SP'15) .IEEE Computer Society, Washington, DC, USA, 709-724.,
In busybox (ARM vsIA-32) experiment, Multi-MH accuracy rate is 32.4%, and the present invention is 83.4%;
In openssl (ARM vs MIPS) experiment, Multi-MH accuracy rate is 32.1%, and the present invention is 87.8%.
2) compared with BinGo:According to document Mahinthan Chandramohan, Yinxing Xue, Zhengzi Xu,
Yang Liu,Chia Yuan Cho,and Hee Beng Kuan Tan.2016.BinGo:cross-architecture
cross-OS binary search.In Proceedings of the 2016 24th ACM SIGSOFT
International Symposium on Foundations of Software Engineering(FSE 2016).ACM,
Method in New York, NY, USA, 678-689., in busybox (IA-32vs ARM) experiment, BinGo accuracy rate
Less than 40%, and the accuracy rate of the present invention is 79.2%.
Above-mentioned specific implementation can by those skilled in the art on the premise of without departing substantially from the principle of the invention and objective with difference
Mode local directed complete set is carried out to it, protection scope of the present invention is defined by claims and not by above-mentioned specific implementation institute
Limit, each implementation in the range of it is by the constraint of the present invention.