[go: up one dir, main page]

CN119903529A - C/C++ source code third-party library reuse detection method based on function structure information - Google Patents

C/C++ source code third-party library reuse detection method based on function structure information Download PDF

Info

Publication number
CN119903529A
CN119903529A CN202510105000.6A CN202510105000A CN119903529A CN 119903529 A CN119903529 A CN 119903529A CN 202510105000 A CN202510105000 A CN 202510105000A CN 119903529 A CN119903529 A CN 119903529A
Authority
CN
China
Prior art keywords
function
reuse
tpl
feature
source code
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510105000.6A
Other languages
Chinese (zh)
Inventor
王俊峰
贾昀峰
吴鹏
方智阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN202510105000.6A priority Critical patent/CN119903529A/en
Publication of CN119903529A publication Critical patent/CN119903529A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577Assessing vulnerabilities and evaluating computer system security
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/30Creation or generation of source code
    • G06F8/36Software reuse
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/70Software maintenance or management
    • G06F8/75Structural analysis for program understanding
    • G06F8/751Code clone detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Computing Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本发明涉及软件开发及软件供应链安全技术领域,公开了基于函数结构信息的C/C++源代码第三方库重用检测方法,本方法通过构建基于开源软件函数AST结构的TPL向量索引库;在向量索引库中,采用异常值检测方法过滤重用频率超过正常频率范围的公共函数;并使用动态占比筛选策略有效过滤误识别的重用TPL候选项,得到更为准确的重用TPL结果。本发明显著提升了TPL重用检测的鲁棒性和准确性,有效减少漏报和误报;同时,本方法相较于传统方法具有更高的检测效率以及更低的存储需求。

The present invention relates to the field of software development and software supply chain security technology, and discloses a C/C++ source code third-party library reuse detection method based on function structure information. The method constructs a TPL vector index library based on the AST structure of open source software functions; in the vector index library, an outlier detection method is used to filter public functions whose reuse frequency exceeds the normal frequency range; and a dynamic proportion screening strategy is used to effectively filter misidentified reuse TPL candidates to obtain more accurate reuse TPL results. The present invention significantly improves the robustness and accuracy of TPL reuse detection, effectively reducing missed reports and false positives; at the same time, compared with traditional methods, the method has higher detection efficiency and lower storage requirements.

Description

C/C++ source code third party library reuse detection method based on function structure information
Technical Field
The invention relates to the technical field of software development and software supply chain safety, in particular to a C/C++ source code third-party library reuse detection method based on function structure information.
Background
The number of Open Source Software (OSS) has been expanding dramatically with the popularity of remote code repositories such as GitHub to enable the use of Third-party libraries (TPL) by a variety of code reuse means such as install dependencies, API calls, project fork, file copies, code cloning, etc. Especially in the C/c++ project, to avoid duplication of labor, developers often introduce the required TPL with code clones and modify it to meet business needs. The situation brings brand new challenges for accurately identifying the TPL work, and expands the potential attack surface of software. It is critical to comprehensively track the reuse of software TPL, and the security of the software supply chain directly relates to the security and reliability of the software product. Therefore, the software composition analysis (Software composition analysis, SCA) technique of reusing TPL in detection software becomes a research hotspot as an effective means to discover software security risks in time and reduce hazards.
In SCA practice OSSPOLICE tries to build feature libraries using lightweight features such as directory structures, but problems with detection failure, insufficient robustness may occur when project structures are changed. The CENTRIS proposed by wo et al improves the accuracy of TPL detection in a reuse scenario with few modifications by segmenting OSS source code, eliminating redundant functions and using fuzzy hashing techniques, however, this approach may fail in the face of more text modifications (e.g., the tinyfiledialogs library functions in julius project differ from the original functions on Github), yielding false negative results. OSSFP introduces a filtering method for public functions and trivial functions, improves detection performance and accuracy by constructing hash fingerprint indexes for the core functions of each TPL, however, OSSFP assumes that some functions in reuse TPL are not changed, so that when large-area function modification occurs (for example, SQLITECPP item reuses sqlite library through sqlite3.C source file, wherein all functions are added with new access modifiers), the problem of insufficient robustness is also faced, resulting in detection failure.
Furthermore, it was observed that CENTRIS did not take into account the effect of the interference function, and was prone to more false positives. However, OSSFP filters the public functions through a fixed frequency threshold, and in practice, there is a problem of inaccurate filtering, for example, a Boost library contains a large number of public functions, while a xxhash library has few public functions, and OSSFP considers that the public functions of the two should be filtered in the same proportion.
The conventional static SCA technology is analyzed to find that the robustness of TPL features is insufficient in detection robustness and interference function processing, namely TypeII, typeIII code clones (such as variable names or code line modifications) are faced to work depending on function text hash features such as CENTRIS, OSSFP, false Negative (FN) is easy to generate, and (2) interference function filtering inaccuracy is caused by CENTRIS, interference functions are not fully considered, so that more False Positives (FP) are caused, and OSSFP filters public functions through a fixed frequency threshold, so that the problem of inaccurate filtering exists in practice.
Therefore, a method for detecting the reuse of the C/C++ source code third-party library based on the function structure information is provided to solve the problems.
Disclosure of Invention
Aiming at the problems, the invention aims to provide a C/C++ source code third party library reuse detection method based on function structure information, which is characterized in that an AST structure-based feature vector index library is constructed, and the robustness and accuracy of TPL reuse detection are obviously improved by combining an anomaly detection technology and a dynamic duty ratio screening strategy, so that missing report and false report are effectively reduced, and the technical scheme is as follows:
The method for detecting the reuse of the C/C++ source code third-party library based on the function structure information comprises the following steps:
s1, collecting source codes of different versions of large-scale open source software, and preprocessing the source codes, wherein the preprocessing comprises the following steps:
S11, performing function segmentation on the source code, and extracting a function AST structure of the source code;
S12, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector;
s2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code;
Step S3, using function feature vectors, constructing a function feature dictionary for each TPL through hash matching, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording all function clone pairs and storing the function clone pairs as a function clone pair set in the merging process of the vector index library;
Step S4, acquiring reuse frequency of each function based on the function cloning pair set, filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library, and acquiring a core vector index library;
And S5, performing TPL reuse detection on the software to be detected by adopting an approximate nearest neighbor algorithm based on the core vector index library to obtain reuse TPL candidates, and filtering the reuse TPL candidates which are identified by mistake according to the function characteristic duty ratio of the reuse TPL candidates so as to determine a reuse TPL detection result.
Further, the step S12 specifically includes:
According to the degree and sequence of node types in the function AST structure, the function AST structure is analyzed into vectors with two different structural features by using TreeCen algorithm, and feature stitching is carried out on the two vectors to generate a function feature vector, wherein the calculation formula is as follows:
v=βv1+(1-β)v2,β∈[0,1]
Where v 1 represents a vector derived from the sum of degrees of each node type, v 2 represents a vector derived from the sum of order of each node type, and β represents the weight that v 1 occupies in the feature concatenation.
Further, the step S2 includes the following steps:
S21, defining the structural complexity C of the function as:
C=|{ti|ci≥1,i=1,2,...,n}|
wherein t i is the ith structure type in the function AST structure, c i is the occurrence number of the ith structure type in the function AST structure, and the total number of the structure type types with the occurrence number of more than or equal to 1 in the function AST structure is the structure complexity of the function;
S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function;
S23, eliminating the structural complexity C in the function set to be smaller than a preset threshold value Wherein a threshold is preset
Further, the step S3 includes the following steps:
S31, using a function text hash value as a key of a dictionary, using a function feature of the earliest submitting time as a value of the dictionary, and constructing a function feature dictionary for each TPL through hash matching, wherein the function feature is defined as a tuple for describing function attribute information, namely { the function text hash value, a function address, the submitting time and a function feature vector };
S32, extracting function feature vectors from function feature dictionaries of all TPLs, comparing and de-duplicating all the function feature vectors through an approximate nearest neighbor algorithm, preferentially reserving function feature vectors with earlier commit time, further merging the function feature dictionaries of all the TPLs to generate a vector index library, and storing all similar function pairs f a、fb as a function clone pair set in a form of function clone pairs (f a,fb) in the merging process of the vector index library, and ensuring that the function with earlier commit time is f a and f b is a clone thereof.
Further, the step S4 includes the following steps:
S41, constructing and checking a DSU (direct sequence unit) of the set by using function cloning, taking a function of the earliest submitting time as a root node of each set and taking the rest functions with cloning relation with the function of the earliest submitting time as child nodes of each set;
S42, calculating a standardized distance of the function reuse frequency by adopting a Z-score anomaly detection method based on statistical distribution, wherein the calculation formula is as follows:
Z(Xf)=(Xf-μ)/σ
Wherein X f is the reuse frequency value of the function f, Z (X f) is the standardized distance, and mu and sigma are the mean value and standard deviation of the reuse frequency of the function in the TPL to which the function f belongs respectively;
The screening conditions for the feature set of the core function in the normal frequency range are as follows:
B={f∈B|Z(Xf)≤τ}
wherein B represents a function feature set of the TPL to which the function f belongs, B represents a function feature set in a normal frequency range in B, and the threshold tau is set to be 1;
In the vector index library, filtering and reusing a public function feature set with frequency exceeding a normal frequency range according to screening conditions, and reserving a core function feature set in the normal frequency range to obtain a core vector index library;
Further, the step S5 includes the following steps:
S51, preprocessing source codes of the software to be tested by using the same method of the step S1 to obtain feature vectors of functions to be tested;
s52, searching a function feature vector which is approximately matched with the function feature vector to be detected in a core vector index library by adopting an approximate nearest neighbor algorithm, and taking a third party library set which belongs to a function set corresponding to the function feature vector and is searched as a reuse TPL candidate;
S53, calculating function feature duty ratios corresponding to all reuse TPL candidates, wherein the calculation formula is as follows:
Wherein p (T j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T j is the jth reuse TPL candidate, For a set of functional features matching T j in the software under test,The method comprises the steps of indexing a function feature set corresponding to T j in a library for a core vector;
And sorting the reuse TPL candidates according to the order of the function characteristic proportion from small to large, and removing the TPL candidates which are ranked at the top 10% and have the function characteristic proportion lower than 2% in the sorting result to obtain a reuse TPL detection result.
Compared with the prior art, the invention has the following advantages and beneficial effects:
The method comprises the steps of obtaining function feature vectors through a function AST structure of source codes of open source software, constructing a function feature dictionary for each TPL by using the function feature vectors, merging feature dictionaries of all TPLs to generate a vector index library, recording a function clone pair set, obtaining reuse frequency of each function based on the function clone pair set, filtering common functions with abnormal reuse frequency to obtain a core vector index library, carrying out TPL reuse detection on software to be detected based on the core vector index library, and determining a reuse TPL function set according to the TPL candidate item of misjudgment of function feature ratio of the reuse TPL candidate item. The method is based on the structural characteristics of the function AST, the robustness of TPL detection in a modified reuse scene is remarkably improved, missing report is effectively reduced, the influence of interference functions is reduced by filtering common functions with frequency exceeding a normal range based on an abnormal detection technology, the accuracy of TPL reuse detection is improved, false report is effectively reduced, the reuse TPL candidate item which is mistakenly identified is effectively filtered by adopting a dynamic duty ratio screening strategy, the accuracy of TPL reuse detection is further improved, and meanwhile, compared with the traditional method, the method has higher detection efficiency and lower storage requirement.
Drawings
Fig. 1 is a general frame schematic of the present method.
Fig. 2 is a schematic diagram of structure vector coding.
FIG. 3 is feature dictionary building pseudocode.
FIG. 4 is a vector index merge pseudocode.
FIG. 5 is a flow chart of a common function filtering method based on outlier detection.
FIG. 6 is a graph showing the comparison of the accuracy of the detection results of different methods.
Detailed Description
The following will make clear and complete description of the technical solution according to the present invention, with reference to the accompanying drawings, so as to further understand the concept of the present invention, the technical problems to be solved, the technical features constituting the technical solution, and the technical effects brought thereby.
The general flow chart of the method is shown in fig. 1, and specifically comprises the following steps:
Step S1, extracting function characteristics, collecting source codes of large-scale open source software OSS (Open Source Software, OSS) and different versions thereof, preprocessing the source codes, and comprising the following steps:
s11, OSS data collection, a Git API or a crawler is utilized to collect an effective open source code repo warehouse, for example, a version tag list of the warehouse and the release time of the version tag list are obtained through a Git tag command, and file change conditions between adjacent versions are identified through a Git diff command.
S12, OSS data preprocessing, which comprises the following steps:
S121, for each source code file, unnecessary comments and blank lines are eliminated, and invalid information interference is avoided;
S122, based on a public source code analysis tool Tree-sitter, performing function segmentation on the source codes and extracting a function AST structure of the source codes;
S123, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector;
As shown in FIG. 2, the conversion from function AST structure information to fixed length vector is efficiently completed by TreeCen algorithm, wherein the dimension of the feature vector is determined by the number of necessary C/C++ AST node types and can be obtained through the configuration file node-types. Json of the analysis tool;
In order to cope with the limitation of node degree characteristics in TypeII, typeIII code cloning problems (such as variable name modification and code adjacent node position interchange), the method analyzes the degree (degree) and sequence (sequence) of node types in the function AST structure of the source code to obtain vectors of two different structural characteristics, and then generates a final function characteristic vector v through characteristic splicing, wherein the calculation formula is as follows:
v=βv1+(1-β)v2,β∈[0,1]
Wherein v 1 represents a vector obtained from the sum of degrees of each node type, v 2 represents a vector obtained from the sum of orders of each node type, and β represents a weight occupied by v 1 in feature stitching;
the method solves the invalidation problem of TypeII, typeIII code cloning (such as variable name modification and code adjacent node position interchange) based on the function AST structural characteristics, remarkably improves the robustness of TPL detection under the modified reuse scene, and effectively reduces the missing report condition.
S2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code, so as to reduce the false alarm effect;
referring to maintainability index, the method designs a measurement filtering mechanism based on structural complexity, and comprises the following steps:
S21, for a function, the structure type set in an AST structure is A= { t 1,t2,…,tn }, wherein the occurrence number of each structure type t i is C i(ci ≡0), the structure complexity C of the function is defined as:
C=|{ti|ci≥1,i=1,2,...,n}|
Wherein C i is equal to or greater than 1, which means that the structure type t i appears at least once in AST of the function, if C i =0, the structure type does not count in complexity calculation, so that when the structure type actually used by the function is less, the structure complexity C is lower, and otherwise, is higher.
S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function.
S23, setting a predefined threshold valueWhen the structural complexity of the functionThe function is labeled as a generic structure function or a simple function and is eliminated from the collection of functions, where,
The method filters the interference function (general structure function or simple function) based on the structural complexity of the function AST structure, improves the accuracy of TPL reuse detection, and effectively reduces false alarm.
Step S3, constructing a function feature dictionary for each TPL by hash matching by using the function feature vectors, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording and storing all function clone pairs as a function clone pair set in the merging process of the vector index library, wherein the method comprises the following steps:
S31, constructing a feature dictionary, namely quickly constructing an independent feature dictionary for each TPL by adopting hash matching to realize function deduplication due to the fact that a large number of repeated functions exist in different versions of TPL, wherein keys of the feature dictionary are function text hash values, and the values are function features of earliest submission time, and the function features are defined as tuples describing function attribute information, namely { the function text hash values, function addresses, submission time and function feature vectors };
As shown in fig. 3, a process of constructing a single TPL function feature dictionary is described. Wherein D represents a feature dictionary of a TPL, F represents a specific function feature, VF is a function feature set of a certain version, VFL is a list of function attribute information of all versions of the TPL, and in order to ensure that the stored function feature is the earliest submitted version, the algorithm performs operations by directly adding F to the dictionary D if the text hash value of the current function F is not recorded in the dictionary D, otherwise, comparing the submission time of the new function F, and if the submission time of F is earlier, updating the record in the dictionary D.
S32, vector index merging, namely introducing ANN search based on a vector database, extracting function feature vectors from function feature dictionaries of all TPLs to compare and remove duplicates, preferentially reserving function feature vectors with earlier submission time in the duplication removal process, simplifying indexes, merging to generate a vector index library, and effectively eliminating repeated functions and similar functions among different TPLs, wherein as shown in FIG. 4, DL is a list of the function feature dictionaries D, RL represents matching results of function feature vectors in the function feature dictionaries D in the IM, FT and FM respectively represent the compared function features and similar matching results thereof, and delta is a set vector similarity threshold;
In this process, the search logs of the function clone pairs are recorded for the subsequent common function filtering stage, for example, for two similar functions f 1 and f 2, if f 1 commit time is earlier than f 2, there may be two cases where f 1 is detected first, or f 2 is detected first, and the latter case also requires updating IM, in any case in sequence, all the function clone pairs (f 1,f2) need to be recorded and stored as a function clone pair set, and it needs to be ensured that the function with earlier commit time is f 1, and f 2 is its clone.
And S4, acquiring reuse frequency of each function based on the function cloning pair set, and filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library to acquire a core vector index library, wherein the method comprises the following steps of:
S41, frequency statistics based on path compression, constructing and searching a frequency statistics structure of a set (Union-find disjoint sets, DSU) based on a function cloning pair set as shown in FIG. 5, wherein a root node of each set is a function of earliest commit time and is called a root function, child nodes of each set are other functions which have cloning relations with the root function, the number of members (child nodes) of each set in the DSU is efficiently searched by using a path compression algorithm, so that reuse frequency of each root function is obtained, and then a root function frequency table is counted for each TPL, so that reuse frequency of each function is obtained.
S42, filtering a public function based on Z-score, namely, when the reuse frequency of a certain function deviates from a normal frequency range obviously, judging the reuse frequency as a public function according to the abnormal value, wherein the single-dimensional abnormal value is detected, a Z-score abnormal detection method based on statistical distribution is selected, and the standardized distance Z (X f) of the reuse frequency of each function is calculated firstly, wherein the calculation formula is as follows:
Z(Xf)=(Xf-μ)/σ
Wherein X f is the reuse frequency value of the function f, Z (X f) is the standardized distance, and mu and sigma are the mean value and standard deviation of the reuse frequency of the function in the TPL to which the function f belongs respectively;
if Z (X f) exceeds the specified distance threshold τ, it is considered as outlier and filtered, considering that common functions typically behave as high frequency features, only those with too high reuse frequency values are concerned, reuse frequency values below the mean μ are also considered as reasonable ranges, and therefore the screening conditions for the core function feature set B in the normal frequency range are as follows:
B={f∈B|Z(Xf)≤τ}
wherein B represents a function feature set of the TPL to which the function f belongs, B represents a function feature set in a normal frequency range in B, and the threshold tau is set to be 1;
in the vector index library, filtering a common function feature set with reuse frequency exceeding a normal frequency range according to a screening condition, extracting a core function feature set of each TPL, and obtaining a core vector index library;
the method filters the public function with the frequency exceeding the normal range based on the anomaly detection technology, reduces the influence of an interference function (public function), improves the accuracy of TPL reuse detection, and effectively reduces false alarm conditions.
Step S5, the TPL reuse detection of the software to be detected, wherein the step aims at identifying an approximate nearest neighbor function set of function characteristics from the software items to be detected based on a vector index library so as to determine reuse TPL candidates, and the TPL candidates possibly contain inaccurate detection items due to the limitation of data scale, so that a dynamic duty ratio filtering strategy for reusing the TPL candidates is designed so as to obtain a more accurate reuse TPL detection result, and the method comprises the following steps:
S51, preprocessing source codes of the software to be tested by using the same method of the step S1 to obtain feature vectors of functions to be tested;
the method comprises the steps that the source codes of software to be tested are integrally input into a system, preprocessing is carried out by using the same method in the step S1, each function of the source codes of the software to be tested is segmented, an AST structure of the function to be tested is extracted, feature extraction is carried out on the AST structure of the function to be tested, and feature vectors of the function to be tested are generated.
S52, in the core vector index library, a function feature vector which is approximately matched with the function feature vector to be detected is searched by adopting an approximate nearest neighbor algorithm, and a third party library set which belongs to a function set corresponding to the searched function feature vector is used as a reuse TPL candidate.
S53, filtering the misrecognized reuse TPL candidates according to the function feature duty ratio of the reuse TPL candidates, and further determining a final reuse TPL detection result;
The method adopts a dynamic duty ratio threshold strategy to optimize the screening process of reusing TPL candidates, firstly, the function feature duty ratio corresponding to all the reuse TPL candidates is calculated, and the calculation formula is as follows:
Wherein p (T j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T j is the jth reuse TPL candidate, For a set of functional features matching T j in the software under test,The method comprises the steps of indexing a function feature set corresponding to T j in a library for a core vector;
The TPL candidates ranked at the top 10% and with the function characteristic ratio lower than 2% in the ranking result are removed, so that a more accurate reuse TPL detection result is obtained, and a detailed TPL detection report is output;
Therefore, the strategy aims at considering the minimum first 10 percent of the function characteristic ratio, filtering the TPL candidates which are misidentified due to accidental similarity as much as possible, achieving a balance between filtering and correctness, reserving the reused TPL candidates with higher reliability and significance characteristics and further improving the accuracy of the TPL reuse detection.
Step S6, based on the verification data sets of 100 well-known open source items, evaluating the TPL reuse detection accuracy of TPLADD, CENTRIS and OSSFP methods, designing candidate result filtering modules for TPLADD and CENTRIS, respectively carrying out experiments under the condition of opening and unopened filtering modules, and recording two scene results, wherein the result of opening the filtering modules is represented by "(f)", for example, TPLADD (f) represents an opening filtering TPL candidate, and TPLADD represents an unopened filtering TPL candidate;
as shown in fig. 6, in the case of turning on the filter modules, compared with the existing method CENTRIS, the accuracy rate and the recall rate of the method TPLADD are respectively improved by 3.88% and 2.76%, and the accuracy of the TPL reuse detection is obviously improved by the method, wherein TP is true positive, PRE is the accuracy rate, REC is the recall rate, and F1 is the F1 value;
Compared with CENTRIS method, the method has the advantages that the speed is improved by 21.9% in time, the average time for detecting one TPL is only 0.42 seconds, the detection efficiency of TPL reuse detection is obviously improved, and in space, only 0.41% of the overall function characteristics are required to be reserved, and compared with CENTRIS and OSSFP, the vector index library has the size of only 20% of CENTRIS, so that the storage requirement of TPL reuse detection is obviously reduced.
The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (6)

1. The method for detecting the reuse of the C/C++ source code third-party library based on the function structure information is characterized by comprising the following steps:
s1, collecting source codes of different versions of large-scale open source software, and preprocessing the source codes, wherein the preprocessing comprises the following steps:
S11, performing function segmentation on the source code, and extracting a function AST structure of the source code;
S12, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector;
s2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code;
Step S3, using function feature vectors, constructing a function feature dictionary for each TPL through hash matching, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording all function clone pairs and storing the function clone pairs as a function clone pair set in the merging process of the vector index library;
Step S4, acquiring reuse frequency of each function based on the function cloning pair set, filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library, and acquiring a core vector index library;
And S5, performing TPL reuse detection on the software to be detected by adopting an approximate nearest neighbor algorithm based on the core vector index library to obtain reuse TPL candidates, and filtering the reuse TPL candidates which are identified by mistake according to the function characteristic duty ratio of the reuse TPL candidates so as to determine a reuse TPL detection result.
2. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S12 specifically comprises:
According to the degree and sequence of node types in the function AST structure, the function AST structure is analyzed into vectors with two different structural features by using TreeCen algorithm, and feature stitching is carried out on the two vectors to generate a function feature vector, wherein the calculation formula is as follows:
v=βv1+(1-β)v2,β∈[0,1]
Where v 1 represents a vector derived from the sum of degrees of each node type, v 2 represents a vector derived from the sum of order of each node type, and β represents the weight that v 1 occupies in the feature concatenation.
3. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S2 comprises the steps of:
S21, defining the structural complexity C of the function as:
C=|{ti|ci≥1,i=1,2,...,n}|
wherein t i is the ith structure type in the function AST structure, c i is the occurrence number of the ith structure type in the function AST structure, and the total number of the structure type types with the occurrence number of more than or equal to 1 in the function AST structure is the structure complexity of the function;
S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function;
S23, eliminating the structural complexity C in the function set to be smaller than a preset threshold value Wherein a threshold is preset
4. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S3 comprises the steps of:
S31, using a function text hash value as a key of a dictionary, using a function feature of the earliest submitting time as a value of the dictionary, and constructing a function feature dictionary for each TPL through hash matching, wherein the function feature is defined as a tuple for describing function attribute information, namely { the function text hash value, a function address, the submitting time and a function feature vector };
S32, extracting function feature vectors from function feature dictionaries of all TPLs, comparing and de-duplicating all the function feature vectors through an approximate nearest neighbor algorithm, preferentially reserving function feature vectors with earlier commit time, further merging the function feature dictionaries of all the TPLs to generate a vector index library, and storing all similar function pairs f a、fb as a function clone pair set in a form of function clone pairs (f a,fb) in the merging process of the vector index library, and ensuring that the function with earlier commit time is f a and f b is a clone thereof.
5. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S4 comprises the steps of:
S41, constructing and checking a DSU (direct sequence unit) of the set by using function cloning, taking a function of the earliest submitting time as a root node of each set and taking the rest functions with cloning relation with the function of the earliest submitting time as child nodes of each set;
S42, calculating a standardized distance of the function reuse frequency by adopting a Z-score anomaly detection method based on statistical distribution, wherein the calculation formula is as follows:
Z(Xf)=(Xf-μ)/σ
Wherein X f is the reuse frequency value of the function f, Z (X f) is the standardized distance, and mu and sigma are the mean value and standard deviation of the reuse frequency of the function in the TPL to which the function f belongs respectively;
The screening conditions for the feature set of the core function in the normal frequency range are as follows:
B={f∈B|Z(Xf)≤τ}
wherein B represents a function feature set of the TPL to which the function f belongs, B represents a function feature set in a normal frequency range in B, and the threshold tau is set to be 1;
And filtering the common function feature set with reuse frequency exceeding the normal frequency range according to the screening condition in the vector index library, reserving the core function feature set in the normal frequency range, and obtaining the core vector index library.
6. The method for detecting C/c++ source code third-party library reuse based on function structure information according to claim 1, wherein the step S5 comprises the steps of:
S51, preprocessing source codes of the software to be tested by using the same method of the step S1 to obtain feature vectors of functions to be tested;
s52, searching a function feature vector which is approximately matched with the function feature vector to be detected in a core vector index library by adopting an approximate nearest neighbor algorithm, and taking a third party library set which belongs to a function set corresponding to the function feature vector and is searched as a reuse TPL candidate;
S53, calculating function feature duty ratios corresponding to all reuse TPL candidates, wherein the calculation formula is as follows:
Wherein p (T j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T j is the jth reuse TPL candidate, For a set of functional features matching T j in the software under test,And sorting the reuse TPL candidates according to the order of the function characteristic proportion from small to large, and removing the TPL candidates which are ranked in the top 10% and have the function characteristic proportion lower than 2% in the sorting result to obtain a reuse TPL detection result.
CN202510105000.6A 2025-01-23 2025-01-23 C/C++ source code third-party library reuse detection method based on function structure information Pending CN119903529A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510105000.6A CN119903529A (en) 2025-01-23 2025-01-23 C/C++ source code third-party library reuse detection method based on function structure information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510105000.6A CN119903529A (en) 2025-01-23 2025-01-23 C/C++ source code third-party library reuse detection method based on function structure information

Publications (1)

Publication Number Publication Date
CN119903529A true CN119903529A (en) 2025-04-29

Family

ID=95464285

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202510105000.6A Pending CN119903529A (en) 2025-01-23 2025-01-23 C/C++ source code third-party library reuse detection method based on function structure information

Country Status (1)

Country Link
CN (1) CN119903529A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120159434A1 (en) * 2010-12-20 2012-06-21 Microsoft Corporation Code clone notification and architectural change visualization
CN115455417A (en) * 2022-09-22 2022-12-09 中国电子科技网络信息安全有限公司 Attribute diagram-based homology code detection method and system
CN117591172A (en) * 2023-11-20 2024-02-23 浙江大学 A feature fusion code clone detection method and device based on vector database
US20240411897A1 (en) * 2023-06-12 2024-12-12 Endor Labs Inc Identifying and addressing potential vulnerabilities in third-party code

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120159434A1 (en) * 2010-12-20 2012-06-21 Microsoft Corporation Code clone notification and architectural change visualization
CN115455417A (en) * 2022-09-22 2022-12-09 中国电子科技网络信息安全有限公司 Attribute diagram-based homology code detection method and system
US20240411897A1 (en) * 2023-06-12 2024-12-12 Endor Labs Inc Identifying and addressing potential vulnerabilities in third-party code
CN117591172A (en) * 2023-11-20 2024-02-23 浙江大学 A feature fusion code clone detection method and device based on vector database

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
贾昀等: "TPLADD:高鲁棒性与高精度的C_C++第三方库检测方法", 《计算机科学与探索》, vol. 17, no. 7, 7 January 2025 (2025-01-07), pages 1969 - 1980 *

Similar Documents

Publication Publication Date Title
CN110888849B (en) An online log parsing method, system and electronic terminal device thereof
CN112800113B (en) Bidding auditing method and system based on data mining analysis technology
US9442991B2 (en) Ascribing actionable attributes to data that describes a personal identity
US20190228085A1 (en) Log file pattern identifier
CN110659282A (en) Data route construction method and device, computer equipment and storage medium
CN117648214A (en) Exception log processing method and device
CN115953123A (en) Generation method, device, equipment and storage medium of robot automation process
CN118606520A (en) A software supply chain malicious open source project monitoring and early warning method and device
CN118093735A (en) Method, system and medium for realizing multi-source heterogeneous data stream data warehouse
CN113239365A (en) Vulnerability repairing method based on knowledge graph
CN116738979A (en) Power grid data search method, system and electronic equipment based on core data identification
CN119441410A (en) An operation and maintenance question-answering system and method based on large language model workflow
CN114969041A (en) Processing method for multi-source main and subsidiary entity identity discrimination and data self-complementing
CN119903529A (en) C/C++ source code third-party library reuse detection method based on function structure information
CN114968258B (en) Code tracing-oriented clone code inheritance relationship judging method, system and medium
Pan et al. An Intelligent Framework for Log Anomaly Detection Based on Log Template Extraction
Zhang et al. Research on data cleaning method based on SNM algorithm
CN114297729A (en) A configuration management database audit method, system and related device
CN118898068B (en) A software and process matching method based on similarity calculation
CN110781309A (en) Entity parallel relation similarity calculation method based on pattern matching
Chantaranimi¹ et al. Strategies in Entity Matching
LESTANTO et al. Development of a Data Cleaning System for Consumer Master Data using Sorted Neighborhood and N-Gram Methods
CN119128237A (en) Field classification method, device, equipment and storage medium based on dynamic rules
CN119515520A (en) Method and device for locating the root cause of event failure
CN117762774A (en) Defect positioning method based on implicit relation and quantized calling relation enhancement

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination