CN119903529A

CN119903529A - C/C++ source code third-party library reuse detection method based on function structure information

Info

Publication number: CN119903529A
Application number: CN202510105000.6A
Authority: CN
Inventors: 王俊峰; 贾昀峰; 吴鹏; 方智阳
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2025-01-23
Filing date: 2025-01-23
Publication date: 2025-04-29

Abstract

The present invention relates to the field of software development and software supply chain security technology, and discloses a C/C++ source code third-party library reuse detection method based on function structure information. The method constructs a TPL vector index library based on the AST structure of open source software functions; in the vector index library, an outlier detection method is used to filter public functions whose reuse frequency exceeds the normal frequency range; and a dynamic proportion screening strategy is used to effectively filter misidentified reuse TPL candidates to obtain more accurate reuse TPL results. The present invention significantly improves the robustness and accuracy of TPL reuse detection, effectively reducing missed reports and false positives; at the same time, compared with traditional methods, the method has higher detection efficiency and lower storage requirements.

Description

C/C++ source code third party library reuse detection method based on function structure information

Technical Field

The invention relates to the technical field of software development and software supply chain safety, in particular to a C/C++ source code third-party library reuse detection method based on function structure information.

Background

The number of Open Source Software (OSS) has been expanding dramatically with the popularity of remote code repositories such as GitHub to enable the use of Third-party libraries (TPL) by a variety of code reuse means such as install dependencies, API calls, project fork, file copies, code cloning, etc. Especially in the C/c++ project, to avoid duplication of labor, developers often introduce the required TPL with code clones and modify it to meet business needs. The situation brings brand new challenges for accurately identifying the TPL work, and expands the potential attack surface of software. It is critical to comprehensively track the reuse of software TPL, and the security of the software supply chain directly relates to the security and reliability of the software product. Therefore, the software composition analysis (Software composition analysis, SCA) technique of reusing TPL in detection software becomes a research hotspot as an effective means to discover software security risks in time and reduce hazards.

In SCA practice OSSPOLICE tries to build feature libraries using lightweight features such as directory structures, but problems with detection failure, insufficient robustness may occur when project structures are changed. The CENTRIS proposed by wo et al improves the accuracy of TPL detection in a reuse scenario with few modifications by segmenting OSS source code, eliminating redundant functions and using fuzzy hashing techniques, however, this approach may fail in the face of more text modifications (e.g., the tinyfiledialogs library functions in julius project differ from the original functions on Github), yielding false negative results. OSSFP introduces a filtering method for public functions and trivial functions, improves detection performance and accuracy by constructing hash fingerprint indexes for the core functions of each TPL, however, OSSFP assumes that some functions in reuse TPL are not changed, so that when large-area function modification occurs (for example, SQLITECPP item reuses sqlite library through sqlite3.C source file, wherein all functions are added with new access modifiers), the problem of insufficient robustness is also faced, resulting in detection failure.

Furthermore, it was observed that CENTRIS did not take into account the effect of the interference function, and was prone to more false positives. However, OSSFP filters the public functions through a fixed frequency threshold, and in practice, there is a problem of inaccurate filtering, for example, a Boost library contains a large number of public functions, while a xxhash library has few public functions, and OSSFP considers that the public functions of the two should be filtered in the same proportion.

The conventional static SCA technology is analyzed to find that the robustness of TPL features is insufficient in detection robustness and interference function processing, namely TypeII, typeIII code clones (such as variable names or code line modifications) are faced to work depending on function text hash features such as CENTRIS, OSSFP, false Negative (FN) is easy to generate, and (2) interference function filtering inaccuracy is caused by CENTRIS, interference functions are not fully considered, so that more False Positives (FP) are caused, and OSSFP filters public functions through a fixed frequency threshold, so that the problem of inaccurate filtering exists in practice.

Therefore, a method for detecting the reuse of the C/C++ source code third-party library based on the function structure information is provided to solve the problems.

Disclosure of Invention

Aiming at the problems, the invention aims to provide a C/C++ source code third party library reuse detection method based on function structure information, which is characterized in that an AST structure-based feature vector index library is constructed, and the robustness and accuracy of TPL reuse detection are obviously improved by combining an anomaly detection technology and a dynamic duty ratio screening strategy, so that missing report and false report are effectively reduced, and the technical scheme is as follows:

The method for detecting the reuse of the C/C++ source code third-party library based on the function structure information comprises the following steps:

s1, collecting source codes of different versions of large-scale open source software, and preprocessing the source codes, wherein the preprocessing comprises the following steps:

S11, performing function segmentation on the source code, and extracting a function AST structure of the source code;

S12, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector;

s2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code;

Step S3, using function feature vectors, constructing a function feature dictionary for each TPL through hash matching, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording all function clone pairs and storing the function clone pairs as a function clone pair set in the merging process of the vector index library;

Step S4, acquiring reuse frequency of each function based on the function cloning pair set, filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library, and acquiring a core vector index library;

And S5, performing TPL reuse detection on the software to be detected by adopting an approximate nearest neighbor algorithm based on the core vector index library to obtain reuse TPL candidates, and filtering the reuse TPL candidates which are identified by mistake according to the function characteristic duty ratio of the reuse TPL candidates so as to determine a reuse TPL detection result.

Further, the step S12 specifically includes:

According to the degree and sequence of node types in the function AST structure, the function AST structure is analyzed into vectors with two different structural features by using TreeCen algorithm, and feature stitching is carried out on the two vectors to generate a function feature vector, wherein the calculation formula is as follows:

v=βv₁+(1-β)v₂,β∈[0,1]

Where v ₁ represents a vector derived from the sum of degrees of each node type, v ₂ represents a vector derived from the sum of order of each node type, and β represents the weight that v ₁ occupies in the feature concatenation.

Further, the step S2 includes the following steps:

S21, defining the structural complexity C of the function as:

C=|{t_i|c_i≥1,i=1,2,...,n}|

wherein t _i is the ith structure type in the function AST structure, c _i is the occurrence number of the ith structure type in the function AST structure, and the total number of the structure type types with the occurrence number of more than or equal to 1 in the function AST structure is the structure complexity of the function;

S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function;

S23, eliminating the structural complexity C in the function set to be smaller than a preset threshold value Wherein a threshold is preset

Further, the step S3 includes the following steps:

S31, using a function text hash value as a key of a dictionary, using a function feature of the earliest submitting time as a value of the dictionary, and constructing a function feature dictionary for each TPL through hash matching, wherein the function feature is defined as a tuple for describing function attribute information, namely { the function text hash value, a function address, the submitting time and a function feature vector };

S32, extracting function feature vectors from function feature dictionaries of all TPLs, comparing and de-duplicating all the function feature vectors through an approximate nearest neighbor algorithm, preferentially reserving function feature vectors with earlier commit time, further merging the function feature dictionaries of all the TPLs to generate a vector index library, and storing all similar function pairs f _a、f_b as a function clone pair set in a form of function clone pairs (f _a,f_b) in the merging process of the vector index library, and ensuring that the function with earlier commit time is f _a and f _b is a clone thereof.

Further, the step S4 includes the following steps:

S41, constructing and checking a DSU (direct sequence unit) of the set by using function cloning, taking a function of the earliest submitting time as a root node of each set and taking the rest functions with cloning relation with the function of the earliest submitting time as child nodes of each set;

S42, calculating a standardized distance of the function reuse frequency by adopting a Z-score anomaly detection method based on statistical distribution, wherein the calculation formula is as follows:

Z(X_f)＝(X_f-μ)/σ

Wherein X _f is the reuse frequency value of the function f, Z (X _f) is the standardized distance, and mu and sigma are the mean value and standard deviation of the reuse frequency of the function in the TPL to which the function f belongs respectively;

The screening conditions for the feature set of the core function in the normal frequency range are as follows:

B^′＝{f∈B|Z(X_f)≤τ}

wherein B represents a function feature set of the TPL to which the function f belongs, B ^′ represents a function feature set in a normal frequency range in B, and the threshold tau is set to be 1;

In the vector index library, filtering and reusing a public function feature set with frequency exceeding a normal frequency range according to screening conditions, and reserving a core function feature set in the normal frequency range to obtain a core vector index library;

Further, the step S5 includes the following steps:

S51, preprocessing source codes of the software to be tested by using the same method of the step S1 to obtain feature vectors of functions to be tested;

s52, searching a function feature vector which is approximately matched with the function feature vector to be detected in a core vector index library by adopting an approximate nearest neighbor algorithm, and taking a third party library set which belongs to a function set corresponding to the function feature vector and is searched as a reuse TPL candidate;

S53, calculating function feature duty ratios corresponding to all reuse TPL candidates, wherein the calculation formula is as follows:

Wherein p (T _j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T _j is the jth reuse TPL candidate, For a set of functional features matching T _j in the software under test,The method comprises the steps of indexing a function feature set corresponding to T _j in a library for a core vector;

And sorting the reuse TPL candidates according to the order of the function characteristic proportion from small to large, and removing the TPL candidates which are ranked at the top 10% and have the function characteristic proportion lower than 2% in the sorting result to obtain a reuse TPL detection result.

Compared with the prior art, the invention has the following advantages and beneficial effects:

The method comprises the steps of obtaining function feature vectors through a function AST structure of source codes of open source software, constructing a function feature dictionary for each TPL by using the function feature vectors, merging feature dictionaries of all TPLs to generate a vector index library, recording a function clone pair set, obtaining reuse frequency of each function based on the function clone pair set, filtering common functions with abnormal reuse frequency to obtain a core vector index library, carrying out TPL reuse detection on software to be detected based on the core vector index library, and determining a reuse TPL function set according to the TPL candidate item of misjudgment of function feature ratio of the reuse TPL candidate item. The method is based on the structural characteristics of the function AST, the robustness of TPL detection in a modified reuse scene is remarkably improved, missing report is effectively reduced, the influence of interference functions is reduced by filtering common functions with frequency exceeding a normal range based on an abnormal detection technology, the accuracy of TPL reuse detection is improved, false report is effectively reduced, the reuse TPL candidate item which is mistakenly identified is effectively filtered by adopting a dynamic duty ratio screening strategy, the accuracy of TPL reuse detection is further improved, and meanwhile, compared with the traditional method, the method has higher detection efficiency and lower storage requirement.

Drawings

Fig. 1 is a general frame schematic of the present method.

Fig. 2 is a schematic diagram of structure vector coding.

FIG. 3 is feature dictionary building pseudocode.

FIG. 4 is a vector index merge pseudocode.

FIG. 5 is a flow chart of a common function filtering method based on outlier detection.

FIG. 6 is a graph showing the comparison of the accuracy of the detection results of different methods.

Detailed Description

The following will make clear and complete description of the technical solution according to the present invention, with reference to the accompanying drawings, so as to further understand the concept of the present invention, the technical problems to be solved, the technical features constituting the technical solution, and the technical effects brought thereby.

The general flow chart of the method is shown in fig. 1, and specifically comprises the following steps:

Step S1, extracting function characteristics, collecting source codes of large-scale open source software OSS (Open Source Software, OSS) and different versions thereof, preprocessing the source codes, and comprising the following steps:

s11, OSS data collection, a Git API or a crawler is utilized to collect an effective open source code repo warehouse, for example, a version tag list of the warehouse and the release time of the version tag list are obtained through a Git tag command, and file change conditions between adjacent versions are identified through a Git diff command.

S12, OSS data preprocessing, which comprises the following steps:

S121, for each source code file, unnecessary comments and blank lines are eliminated, and invalid information interference is avoided;

S122, based on a public source code analysis tool Tree-sitter, performing function segmentation on the source codes and extracting a function AST structure of the source codes;

S123, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector;

As shown in FIG. 2, the conversion from function AST structure information to fixed length vector is efficiently completed by TreeCen algorithm, wherein the dimension of the feature vector is determined by the number of necessary C/C++ AST node types and can be obtained through the configuration file node-types. Json of the analysis tool;

In order to cope with the limitation of node degree characteristics in TypeII, typeIII code cloning problems (such as variable name modification and code adjacent node position interchange), the method analyzes the degree (degree) and sequence (sequence) of node types in the function AST structure of the source code to obtain vectors of two different structural characteristics, and then generates a final function characteristic vector v through characteristic splicing, wherein the calculation formula is as follows:

v=βv₁+(1-β)v₂,β∈[0,1]

Wherein v ₁ represents a vector obtained from the sum of degrees of each node type, v ₂ represents a vector obtained from the sum of orders of each node type, and β represents a weight occupied by v ₁ in feature stitching;

the method solves the invalidation problem of TypeII, typeIII code cloning (such as variable name modification and code adjacent node position interchange) based on the function AST structural characteristics, remarkably improves the robustness of TPL detection under the modified reuse scene, and effectively reduces the missing report condition.

S2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code, so as to reduce the false alarm effect;

referring to maintainability index, the method designs a measurement filtering mechanism based on structural complexity, and comprises the following steps:

S21, for a function, the structure type set in an AST structure is A= { t ₁,t₂,…,t_n }, wherein the occurrence number of each structure type t _i is C _i(c_i ≡0), the structure complexity C of the function is defined as:

C=|{t_i|c_i≥1,i=1,2,...,n}|

Wherein C _i is equal to or greater than 1, which means that the structure type t _i appears at least once in AST of the function, if C _i =0, the structure type does not count in complexity calculation, so that when the structure type actually used by the function is less, the structure complexity C is lower, and otherwise, is higher.

S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function.

S23, setting a predefined threshold valueWhen the structural complexity of the functionThe function is labeled as a generic structure function or a simple function and is eliminated from the collection of functions, where,

The method filters the interference function (general structure function or simple function) based on the structural complexity of the function AST structure, improves the accuracy of TPL reuse detection, and effectively reduces false alarm.

Step S3, constructing a function feature dictionary for each TPL by hash matching by using the function feature vectors, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording and storing all function clone pairs as a function clone pair set in the merging process of the vector index library, wherein the method comprises the following steps:

S31, constructing a feature dictionary, namely quickly constructing an independent feature dictionary for each TPL by adopting hash matching to realize function deduplication due to the fact that a large number of repeated functions exist in different versions of TPL, wherein keys of the feature dictionary are function text hash values, and the values are function features of earliest submission time, and the function features are defined as tuples describing function attribute information, namely { the function text hash values, function addresses, submission time and function feature vectors };

As shown in fig. 3, a process of constructing a single TPL function feature dictionary is described. Wherein D represents a feature dictionary of a TPL, F represents a specific function feature, VF is a function feature set of a certain version, VFL is a list of function attribute information of all versions of the TPL, and in order to ensure that the stored function feature is the earliest submitted version, the algorithm performs operations by directly adding F to the dictionary D if the text hash value of the current function F is not recorded in the dictionary D, otherwise, comparing the submission time of the new function F, and if the submission time of F is earlier, updating the record in the dictionary D.

S32, vector index merging, namely introducing ANN search based on a vector database, extracting function feature vectors from function feature dictionaries of all TPLs to compare and remove duplicates, preferentially reserving function feature vectors with earlier submission time in the duplication removal process, simplifying indexes, merging to generate a vector index library, and effectively eliminating repeated functions and similar functions among different TPLs, wherein as shown in FIG. 4, DL is a list of the function feature dictionaries D, RL represents matching results of function feature vectors in the function feature dictionaries D in the IM, FT and FM respectively represent the compared function features and similar matching results thereof, and delta is a set vector similarity threshold;

In this process, the search logs of the function clone pairs are recorded for the subsequent common function filtering stage, for example, for two similar functions f ₁ and f ₂, if f ₁ commit time is earlier than f ₂, there may be two cases where f ₁ is detected first, or f ₂ is detected first, and the latter case also requires updating IM, in any case in sequence, all the function clone pairs (f ₁,f₂) need to be recorded and stored as a function clone pair set, and it needs to be ensured that the function with earlier commit time is f ₁, and f ₂ is its clone.

And S4, acquiring reuse frequency of each function based on the function cloning pair set, and filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library to acquire a core vector index library, wherein the method comprises the following steps of:

S41, frequency statistics based on path compression, constructing and searching a frequency statistics structure of a set (Union-find disjoint sets, DSU) based on a function cloning pair set as shown in FIG. 5, wherein a root node of each set is a function of earliest commit time and is called a root function, child nodes of each set are other functions which have cloning relations with the root function, the number of members (child nodes) of each set in the DSU is efficiently searched by using a path compression algorithm, so that reuse frequency of each root function is obtained, and then a root function frequency table is counted for each TPL, so that reuse frequency of each function is obtained.

S42, filtering a public function based on Z-score, namely, when the reuse frequency of a certain function deviates from a normal frequency range obviously, judging the reuse frequency as a public function according to the abnormal value, wherein the single-dimensional abnormal value is detected, a Z-score abnormal detection method based on statistical distribution is selected, and the standardized distance Z (X _f) of the reuse frequency of each function is calculated firstly, wherein the calculation formula is as follows:

Z(X_f)＝(X_f-μ)/σ

if Z (X _f) exceeds the specified distance threshold τ, it is considered as outlier and filtered, considering that common functions typically behave as high frequency features, only those with too high reuse frequency values are concerned, reuse frequency values below the mean μ are also considered as reasonable ranges, and therefore the screening conditions for the core function feature set B ^′ in the normal frequency range are as follows:

B^′＝{f∈B|Z(X_f)≤τ}

in the vector index library, filtering a common function feature set with reuse frequency exceeding a normal frequency range according to a screening condition, extracting a core function feature set of each TPL, and obtaining a core vector index library;

the method filters the public function with the frequency exceeding the normal range based on the anomaly detection technology, reduces the influence of an interference function (public function), improves the accuracy of TPL reuse detection, and effectively reduces false alarm conditions.

Step S5, the TPL reuse detection of the software to be detected, wherein the step aims at identifying an approximate nearest neighbor function set of function characteristics from the software items to be detected based on a vector index library so as to determine reuse TPL candidates, and the TPL candidates possibly contain inaccurate detection items due to the limitation of data scale, so that a dynamic duty ratio filtering strategy for reusing the TPL candidates is designed so as to obtain a more accurate reuse TPL detection result, and the method comprises the following steps:

the method comprises the steps that the source codes of software to be tested are integrally input into a system, preprocessing is carried out by using the same method in the step S1, each function of the source codes of the software to be tested is segmented, an AST structure of the function to be tested is extracted, feature extraction is carried out on the AST structure of the function to be tested, and feature vectors of the function to be tested are generated.

S52, in the core vector index library, a function feature vector which is approximately matched with the function feature vector to be detected is searched by adopting an approximate nearest neighbor algorithm, and a third party library set which belongs to a function set corresponding to the searched function feature vector is used as a reuse TPL candidate.

S53, filtering the misrecognized reuse TPL candidates according to the function feature duty ratio of the reuse TPL candidates, and further determining a final reuse TPL detection result;

The method adopts a dynamic duty ratio threshold strategy to optimize the screening process of reusing TPL candidates, firstly, the function feature duty ratio corresponding to all the reuse TPL candidates is calculated, and the calculation formula is as follows:

The TPL candidates ranked at the top 10% and with the function characteristic ratio lower than 2% in the ranking result are removed, so that a more accurate reuse TPL detection result is obtained, and a detailed TPL detection report is output;

Therefore, the strategy aims at considering the minimum first 10 percent of the function characteristic ratio, filtering the TPL candidates which are misidentified due to accidental similarity as much as possible, achieving a balance between filtering and correctness, reserving the reused TPL candidates with higher reliability and significance characteristics and further improving the accuracy of the TPL reuse detection.

Step S6, based on the verification data sets of 100 well-known open source items, evaluating the TPL reuse detection accuracy of TPLADD, CENTRIS and OSSFP methods, designing candidate result filtering modules for TPLADD and CENTRIS, respectively carrying out experiments under the condition of opening and unopened filtering modules, and recording two scene results, wherein the result of opening the filtering modules is represented by "(f)", for example, TPLADD (f) represents an opening filtering TPL candidate, and TPLADD represents an unopened filtering TPL candidate;

as shown in fig. 6, in the case of turning on the filter modules, compared with the existing method CENTRIS, the accuracy rate and the recall rate of the method TPLADD are respectively improved by 3.88% and 2.76%, and the accuracy of the TPL reuse detection is obviously improved by the method, wherein TP is true positive, PRE is the accuracy rate, REC is the recall rate, and F1 is the F1 value;

Compared with CENTRIS method, the method has the advantages that the speed is improved by 21.9% in time, the average time for detecting one TPL is only 0.42 seconds, the detection efficiency of TPL reuse detection is obviously improved, and in space, only 0.41% of the overall function characteristics are required to be reserved, and compared with CENTRIS and OSSFP, the vector index library has the size of only 20% of CENTRIS, so that the storage requirement of TPL reuse detection is obviously reduced.

The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. The method for detecting the reuse of the C/C++ source code third-party library based on the function structure information is characterized by comprising the following steps:

2. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S12 specifically comprises:

v=βv₁+(1-β)v₂,β∈[0,1]

3. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S2 comprises the steps of:

S21, defining the structural complexity C of the function as:

C=|{t_i|c_i≥1,i=1,2,...,n}|

4. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S3 comprises the steps of:

5. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S4 comprises the steps of:

Z(X_f)＝(X_f-μ)/σ

B^′＝{f∈B|Z(X_f)≤τ}

And filtering the common function feature set with reuse frequency exceeding the normal frequency range according to the screening condition in the vector index library, reserving the core function feature set in the normal frequency range, and obtaining the core vector index library.

6. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S5 comprises the steps of:

Wherein p (T _j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T _j is the jth reuse TPL candidate, For a set of functional features matching T _j in the software under test,And sorting the reuse TPL candidates according to the order of the function characteristic proportion from small to large, and removing the TPL candidates which are ranked in the top 10% and have the function characteristic proportion lower than 2% in the sorting result to obtain a reuse TPL detection result.