Disclosure of Invention
      Aiming at the problems, the invention aims to provide a C/C++ source code third party library reuse detection method based on function structure information, which is characterized in that an AST structure-based feature vector index library is constructed, and the robustness and accuracy of TPL reuse detection are obviously improved by combining an anomaly detection technology and a dynamic duty ratio screening strategy, so that missing report and false report are effectively reduced, and the technical scheme is as follows:
       The method for detecting the reuse of the C/C++ source code third-party library based on the function structure information comprises the following steps: 
       s1, collecting source codes of different versions of large-scale open source software, and preprocessing the source codes, wherein the preprocessing comprises the following steps: 
       S11, performing function segmentation on the source code, and extracting a function AST structure of the source code; 
       S12, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector; 
       s2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code; 
       Step S3, using function feature vectors, constructing a function feature dictionary for each TPL through hash matching, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording all function clone pairs and storing the function clone pairs as a function clone pair set in the merging process of the vector index library; 
       Step S4, acquiring reuse frequency of each function based on the function cloning pair set, filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library, and acquiring a core vector index library; 
       And S5, performing TPL reuse detection on the software to be detected by adopting an approximate nearest neighbor algorithm based on the core vector index library to obtain reuse TPL candidates, and filtering the reuse TPL candidates which are identified by mistake according to the function characteristic duty ratio of the reuse TPL candidates so as to determine a reuse TPL detection result. 
      Further, the step S12 specifically includes:
       According to the degree and sequence of node types in the function AST structure, the function AST structure is analyzed into vectors with two different structural features by using TreeCen algorithm, and feature stitching is carried out on the two vectors to generate a function feature vector, wherein the calculation formula is as follows: 
      v=βv1+(1-β)v2,β∈[0,1]
       Where v 1 represents a vector derived from the sum of degrees of each node type, v 2 represents a vector derived from the sum of order of each node type, and β represents the weight that v 1 occupies in the feature concatenation. 
      Further, the step S2 includes the following steps:
       S21, defining the structural complexity C of the function as: 
      C=|{ti|ci≥1,i=1,2,...,n}|
       wherein t i is the ith structure type in the function AST structure, c i is the occurrence number of the ith structure type in the function AST structure, and the total number of the structure type types with the occurrence number of more than or equal to 1 in the function AST structure is the structure complexity of the function; 
       S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function; 
       S23, eliminating the structural complexity C in the function set to be smaller than a preset threshold value Wherein a threshold is preset 
      Further, the step S3 includes the following steps:
       S31, using a function text hash value as a key of a dictionary, using a function feature of the earliest submitting time as a value of the dictionary, and constructing a function feature dictionary for each TPL through hash matching, wherein the function feature is defined as a tuple for describing function attribute information, namely { the function text hash value, a function address, the submitting time and a function feature vector }; 
       S32, extracting function feature vectors from function feature dictionaries of all TPLs, comparing and de-duplicating all the function feature vectors through an approximate nearest neighbor algorithm, preferentially reserving function feature vectors with earlier commit time, further merging the function feature dictionaries of all the TPLs to generate a vector index library, and storing all similar function pairs f a、fb as a function clone pair set in a form of function clone pairs (f a,fb) in the merging process of the vector index library, and ensuring that the function with earlier commit time is f a and f b is a clone thereof. 
      Further, the step S4 includes the following steps:
       S41, constructing and checking a DSU (direct sequence unit) of the set by using function cloning, taking a function of the earliest submitting time as a root node of each set and taking the rest functions with cloning relation with the function of the earliest submitting time as child nodes of each set; 
       S42, calculating a standardized distance of the function reuse frequency by adopting a Z-score anomaly detection method based on statistical distribution, wherein the calculation formula is as follows: 
      Z(Xf)=(Xf-μ)/σ
       Wherein X f is the reuse frequency value of the function f, Z (X f) is the standardized distance, and mu and sigma are the mean value and standard deviation of the reuse frequency of the function in the TPL to which the function f belongs respectively; 
       The screening conditions for the feature set of the core function in the normal frequency range are as follows: 
      B′={f∈B|Z(Xf)≤τ}
       wherein B represents a function feature set of the TPL to which the function f belongs, B ′ represents a function feature set in a normal frequency range in B, and the threshold tau is set to be 1; 
       In the vector index library, filtering and reusing a public function feature set with frequency exceeding a normal frequency range according to screening conditions, and reserving a core function feature set in the normal frequency range to obtain a core vector index library; 
       Further, the step S5 includes the following steps: 
       S51, preprocessing source codes of the software to be tested by using the same method of the step S1 to obtain feature vectors of functions to be tested; 
       s52, searching a function feature vector which is approximately matched with the function feature vector to be detected in a core vector index library by adopting an approximate nearest neighbor algorithm, and taking a third party library set which belongs to a function set corresponding to the function feature vector and is searched as a reuse TPL candidate; 
       S53, calculating function feature duty ratios corresponding to all reuse TPL candidates, wherein the calculation formula is as follows: 
      
        
      
       Wherein p (T j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T j is the jth reuse TPL candidate, For a set of functional features matching T j in the software under test,The method comprises the steps of indexing a function feature set corresponding to T j in a library for a core vector;
       And sorting the reuse TPL candidates according to the order of the function characteristic proportion from small to large, and removing the TPL candidates which are ranked at the top 10% and have the function characteristic proportion lower than 2% in the sorting result to obtain a reuse TPL detection result. 
      Compared with the prior art, the invention has the following advantages and beneficial effects:
       The method comprises the steps of obtaining function feature vectors through a function AST structure of source codes of open source software, constructing a function feature dictionary for each TPL by using the function feature vectors, merging feature dictionaries of all TPLs to generate a vector index library, recording a function clone pair set, obtaining reuse frequency of each function based on the function clone pair set, filtering common functions with abnormal reuse frequency to obtain a core vector index library, carrying out TPL reuse detection on software to be detected based on the core vector index library, and determining a reuse TPL function set according to the TPL candidate item of misjudgment of function feature ratio of the reuse TPL candidate item. The method is based on the structural characteristics of the function AST, the robustness of TPL detection in a modified reuse scene is remarkably improved, missing report is effectively reduced, the influence of interference functions is reduced by filtering common functions with frequency exceeding a normal range based on an abnormal detection technology, the accuracy of TPL reuse detection is improved, false report is effectively reduced, the reuse TPL candidate item which is mistakenly identified is effectively filtered by adopting a dynamic duty ratio screening strategy, the accuracy of TPL reuse detection is further improved, and meanwhile, compared with the traditional method, the method has higher detection efficiency and lower storage requirement. 
    
    
      Detailed Description
      The following will make clear and complete description of the technical solution according to the present invention, with reference to the accompanying drawings, so as to further understand the concept of the present invention, the technical problems to be solved, the technical features constituting the technical solution, and the technical effects brought thereby.
      The general flow chart of the method is shown in fig. 1, and specifically comprises the following steps:
       Step S1, extracting function characteristics, collecting source codes of large-scale open source software OSS (Open Source Software, OSS) and different versions thereof, preprocessing the source codes, and comprising the following steps: 
       s11, OSS data collection, a Git API or a crawler is utilized to collect an effective open source code repo warehouse, for example, a version tag list of the warehouse and the release time of the version tag list are obtained through a Git tag command, and file change conditions between adjacent versions are identified through a Git diff command. 
      S12, OSS data preprocessing, which comprises the following steps:
       S121, for each source code file, unnecessary comments and blank lines are eliminated, and invalid information interference is avoided; 
       S122, based on a public source code analysis tool Tree-sitter, performing function segmentation on the source codes and extracting a function AST structure of the source codes; 
       S123, analyzing the function AST structure into vectors with two different structural features according to the degree and the order of node types in the function AST structure, and performing feature stitching on the two vectors to generate a function feature vector; 
       As shown in FIG. 2, the conversion from function AST structure information to fixed length vector is efficiently completed by TreeCen algorithm, wherein the dimension of the feature vector is determined by the number of necessary C/C++ AST node types and can be obtained through the configuration file node-types. Json of the analysis tool; 
       In order to cope with the limitation of node degree characteristics in TypeII, typeIII code cloning problems (such as variable name modification and code adjacent node position interchange), the method analyzes the degree (degree) and sequence (sequence) of node types in the function AST structure of the source code to obtain vectors of two different structural characteristics, and then generates a final function characteristic vector v through characteristic splicing, wherein the calculation formula is as follows: 
      v=βv1+(1-β)v2,β∈[0,1]
       Wherein v 1 represents a vector obtained from the sum of degrees of each node type, v 2 represents a vector obtained from the sum of orders of each node type, and β represents a weight occupied by v 1 in feature stitching; 
       the method solves the invalidation problem of TypeII, typeIII code cloning (such as variable name modification and code adjacent node position interchange) based on the function AST structural characteristics, remarkably improves the robustness of TPL detection under the modified reuse scene, and effectively reduces the missing report condition. 
      S2, filtering a general structure function based on the structure complexity of a function AST structure in a function set obtained by cutting a source code, so as to reduce the false alarm effect;
       referring to maintainability index, the method designs a measurement filtering mechanism based on structural complexity, and comprises the following steps: 
       S21, for a function, the structure type set in an AST structure is A= { t 1,t2,…,tn }, wherein the occurrence number of each structure type t i is C i(ci ≡0), the structure complexity C of the function is defined as: 
      C=|{ti|ci≥1,i=1,2,...,n}|
       Wherein C i is equal to or greater than 1, which means that the structure type t i appears at least once in AST of the function, if C i =0, the structure type does not count in complexity calculation, so that when the structure type actually used by the function is less, the structure complexity C is lower, and otherwise, is higher. 
      S22, counting the occurrence times of different structure types in each function AST structure in a function set obtained by cutting the source code, and obtaining the structure complexity C of each function.
      S23, setting a predefined threshold valueWhen the structural complexity of the functionThe function is labeled as a generic structure function or a simple function and is eliminated from the collection of functions, where, 
      The method filters the interference function (general structure function or simple function) based on the structural complexity of the function AST structure, improves the accuracy of TPL reuse detection, and effectively reduces false alarm.
      Step S3, constructing a function feature dictionary for each TPL by hash matching by using the function feature vectors, merging the function feature dictionaries of all TPLs to generate a vector index library, and recording and storing all function clone pairs as a function clone pair set in the merging process of the vector index library, wherein the method comprises the following steps:
       S31, constructing a feature dictionary, namely quickly constructing an independent feature dictionary for each TPL by adopting hash matching to realize function deduplication due to the fact that a large number of repeated functions exist in different versions of TPL, wherein keys of the feature dictionary are function text hash values, and the values are function features of earliest submission time, and the function features are defined as tuples describing function attribute information, namely { the function text hash values, function addresses, submission time and function feature vectors }; 
       As shown in fig. 3, a process of constructing a single TPL function feature dictionary is described. Wherein D represents a feature dictionary of a TPL, F represents a specific function feature, VF is a function feature set of a certain version, VFL is a list of function attribute information of all versions of the TPL, and in order to ensure that the stored function feature is the earliest submitted version, the algorithm performs operations by directly adding F to the dictionary D if the text hash value of the current function F is not recorded in the dictionary D, otherwise, comparing the submission time of the new function F, and if the submission time of F is earlier, updating the record in the dictionary D. 
      S32, vector index merging, namely introducing ANN search based on a vector database, extracting function feature vectors from function feature dictionaries of all TPLs to compare and remove duplicates, preferentially reserving function feature vectors with earlier submission time in the duplication removal process, simplifying indexes, merging to generate a vector index library, and effectively eliminating repeated functions and similar functions among different TPLs, wherein as shown in FIG. 4, DL is a list of the function feature dictionaries D, RL represents matching results of function feature vectors in the function feature dictionaries D in the IM, FT and FM respectively represent the compared function features and similar matching results thereof, and delta is a set vector similarity threshold;
       In this process, the search logs of the function clone pairs are recorded for the subsequent common function filtering stage, for example, for two similar functions f 1 and f 2, if f 1 commit time is earlier than f 2, there may be two cases where f 1 is detected first, or f 2 is detected first, and the latter case also requires updating IM, in any case in sequence, all the function clone pairs (f 1,f2) need to be recorded and stored as a function clone pair set, and it needs to be ensured that the function with earlier commit time is f 1, and f 2 is its clone. 
      And S4, acquiring reuse frequency of each function based on the function cloning pair set, and filtering common functions with reuse frequency exceeding a normal frequency range in a vector index library to acquire a core vector index library, wherein the method comprises the following steps of:
       S41, frequency statistics based on path compression, constructing and searching a frequency statistics structure of a set (Union-find disjoint sets, DSU) based on a function cloning pair set as shown in FIG. 5, wherein a root node of each set is a function of earliest commit time and is called a root function, child nodes of each set are other functions which have cloning relations with the root function, the number of members (child nodes) of each set in the DSU is efficiently searched by using a path compression algorithm, so that reuse frequency of each root function is obtained, and then a root function frequency table is counted for each TPL, so that reuse frequency of each function is obtained. 
      S42, filtering a public function based on Z-score, namely, when the reuse frequency of a certain function deviates from a normal frequency range obviously, judging the reuse frequency as a public function according to the abnormal value, wherein the single-dimensional abnormal value is detected, a Z-score abnormal detection method based on statistical distribution is selected, and the standardized distance Z (X f) of the reuse frequency of each function is calculated firstly, wherein the calculation formula is as follows:
      Z(Xf)=(Xf-μ)/σ
       Wherein X f is the reuse frequency value of the function f, Z (X f) is the standardized distance, and mu and sigma are the mean value and standard deviation of the reuse frequency of the function in the TPL to which the function f belongs respectively; 
       if Z (X f) exceeds the specified distance threshold τ, it is considered as outlier and filtered, considering that common functions typically behave as high frequency features, only those with too high reuse frequency values are concerned, reuse frequency values below the mean μ are also considered as reasonable ranges, and therefore the screening conditions for the core function feature set B ′ in the normal frequency range are as follows: 
      B′={f∈B|Z(Xf)≤τ}
       wherein B represents a function feature set of the TPL to which the function f belongs, B ′ represents a function feature set in a normal frequency range in B, and the threshold tau is set to be 1; 
       in the vector index library, filtering a common function feature set with reuse frequency exceeding a normal frequency range according to a screening condition, extracting a core function feature set of each TPL, and obtaining a core vector index library; 
       the method filters the public function with the frequency exceeding the normal range based on the anomaly detection technology, reduces the influence of an interference function (public function), improves the accuracy of TPL reuse detection, and effectively reduces false alarm conditions. 
      Step S5, the TPL reuse detection of the software to be detected, wherein the step aims at identifying an approximate nearest neighbor function set of function characteristics from the software items to be detected based on a vector index library so as to determine reuse TPL candidates, and the TPL candidates possibly contain inaccurate detection items due to the limitation of data scale, so that a dynamic duty ratio filtering strategy for reusing the TPL candidates is designed so as to obtain a more accurate reuse TPL detection result, and the method comprises the following steps:
       S51, preprocessing source codes of the software to be tested by using the same method of the step S1 to obtain feature vectors of functions to be tested; 
       the method comprises the steps that the source codes of software to be tested are integrally input into a system, preprocessing is carried out by using the same method in the step S1, each function of the source codes of the software to be tested is segmented, an AST structure of the function to be tested is extracted, feature extraction is carried out on the AST structure of the function to be tested, and feature vectors of the function to be tested are generated. 
      S52, in the core vector index library, a function feature vector which is approximately matched with the function feature vector to be detected is searched by adopting an approximate nearest neighbor algorithm, and a third party library set which belongs to a function set corresponding to the searched function feature vector is used as a reuse TPL candidate.
      S53, filtering the misrecognized reuse TPL candidates according to the function feature duty ratio of the reuse TPL candidates, and further determining a final reuse TPL detection result;
       The method adopts a dynamic duty ratio threshold strategy to optimize the screening process of reusing TPL candidates, firstly, the function feature duty ratio corresponding to all the reuse TPL candidates is calculated, and the calculation formula is as follows: 
      
        
      
       Wherein p (T j) is the function feature duty cycle corresponding to the jth reuse TPL candidate, T j is the jth reuse TPL candidate, For a set of functional features matching T j in the software under test,The method comprises the steps of indexing a function feature set corresponding to T j in a library for a core vector;
       The TPL candidates ranked at the top 10% and with the function characteristic ratio lower than 2% in the ranking result are removed, so that a more accurate reuse TPL detection result is obtained, and a detailed TPL detection report is output; 
       Therefore, the strategy aims at considering the minimum first 10 percent of the function characteristic ratio, filtering the TPL candidates which are misidentified due to accidental similarity as much as possible, achieving a balance between filtering and correctness, reserving the reused TPL candidates with higher reliability and significance characteristics and further improving the accuracy of the TPL reuse detection. 
      Step S6, based on the verification data sets of 100 well-known open source items, evaluating the TPL reuse detection accuracy of TPLADD, CENTRIS and OSSFP methods, designing candidate result filtering modules for TPLADD and CENTRIS, respectively carrying out experiments under the condition of opening and unopened filtering modules, and recording two scene results, wherein the result of opening the filtering modules is represented by "(f)", for example, TPLADD (f) represents an opening filtering TPL candidate, and TPLADD represents an unopened filtering TPL candidate;
       as shown in fig. 6, in the case of turning on the filter modules, compared with the existing method CENTRIS, the accuracy rate and the recall rate of the method TPLADD are respectively improved by 3.88% and 2.76%, and the accuracy of the TPL reuse detection is obviously improved by the method, wherein TP is true positive, PRE is the accuracy rate, REC is the recall rate, and F1 is the F1 value; 
       Compared with CENTRIS method, the method has the advantages that the speed is improved by 21.9% in time, the average time for detecting one TPL is only 0.42 seconds, the detection efficiency of TPL reuse detection is obviously improved, and in space, only 0.41% of the overall function characteristics are required to be reserved, and compared with CENTRIS and OSSFP, the vector index library has the size of only 20% of CENTRIS, so that the storage requirement of TPL reuse detection is obviously reduced. 
      The foregoing description of the embodiments has been provided for the purpose of illustrating the general principles of the invention, and is not meant to limit the scope of the invention, but to limit the invention to the particular embodiments, and any modifications, equivalents, improvements, etc. that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.