CN112084439B

CN112084439B - A method, device, equipment and storage medium for identifying variables in a URL

Info

Publication number: CN112084439B
Application number: CN202010909457.XA
Authority: CN
Inventors: 尚侠; 张雪松; 罗清篮; 陈宁
Original assignee: Shanghai Mule Network Technology Co ltd
Current assignee: Shanghai Mule Network Technology Co ltd
Priority date: 2020-09-02
Filing date: 2020-09-02
Publication date: 2023-12-19
Anticipated expiration: 2040-09-02
Also published as: CN112084439A

Abstract

The invention relates to a method, device, equipment and storage medium for identifying variables in a URL. The method comprises: acquiring access path data of a website to be identified; preprocessing the access path data to obtain a path relationship data set and a hierarchical threshold set; identifying quantitative data (fixed, non-variable path segments) and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set; verifying the suspected variable data to obtain variable data; and integrating and outputting the quantitative data and the variable data. With this method, variables in the access path data can be identified automatically, which greatly improves the efficiency of variable identification.

Description

A method, device, equipment and storage medium for identifying variables in a URL

Technical field

The invention relates to the technical field of website testing and protection, and in particular to a method, device, equipment and storage medium for identifying variables in a URL.

Background

As websites have become widespread, more and more of them are used across all industries. Before a website is put into use, it needs to undergo penetration testing to ensure that it will operate as planned. Submitting attack code through parameters is a common technique for penetration testing or scanning a website.

Parameters are usually passed in one of three forms. The first, known as the query string, passes parameters through the URL. It is commonly used to retrieve data, for example to fetch the details of an article. The second passes parameters through a form: the front end assembles the content entered by the user as required and places it in the payload of the request packet. It is commonly used to send data to the back end, for example to send the content of an article when the article is created. The third embeds parameters in the URL itself. It is common with RESTful APIs and pseudo-static routing, and may occur both when retrieving data from the back end and when sending data to it. Existing approaches usually rely on manual marking, or treat the entire URL as a single variable. For example, when a website uses a RESTful API or pseudo-static addresses, parameters are sometimes submitted as part of the access path, and, unlike with a form, it is not clear which part is a parameter. Moreover, when identifying request parameters, the prior art usually either gives up on parsing the variable parts of the address or marks them through manually defined configuration.
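The difference between the first and third forms can be illustrated with a short sketch using Python's standard urllib.parse module. The example URLs and the article identifier 42 are made up for illustration and do not come from the patent.

```python
from urllib.parse import urlsplit, parse_qs

# Form 1: the parameter travels in the query string and is easy to isolate.
query_url = "https://example.com/article?id=42"
query_params = parse_qs(urlsplit(query_url).query)
print(query_params)  # {'id': ['42']}

# Form 3: the same parameter is embedded in the path (RESTful style).
path_url = "https://example.com/article/42"
segments = [s for s in urlsplit(path_url).path.split("/") if s]
print(segments)  # ['article', '42']; nothing marks '42' as a parameter
```

In the second URL the parser can only return path segments, which is the gap the method described here aims to close.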

Summary of the invention

In view of this, the purpose of the present invention is to overcome the shortcomings of the prior art and provide a method, device, equipment and storage medium for identifying variables in a URL.

To achieve the above purpose, the present invention adopts the following technical solutions:

A method for identifying variables in a URL, including:

Obtaining the access path data of the website to be identified;

Preprocessing the access path data to obtain a path relationship data set and a hierarchical threshold set;

Identifying quantitative data and suspected variable data within the access path data according to the path relationship data set and the hierarchical threshold set;

Verifying the suspected variable data to obtain variable data;

Integrating and outputting the quantitative data and the variable data.

Optionally, preprocessing the access path data to obtain the path relationship data set and the hierarchical threshold set includes:

Dividing the access path data according to a preset rule to obtain multiple path segment nodes;

Generating a node data structure from the path segment nodes combined with the path relationships, the path relationships being obtained from the path segment nodes according to the access path data;

Obtaining the path relationship data set and the hierarchical threshold set according to the node data structure.

Optionally, the path relationship data set includes: the number of child nodes, the number of references, and the number of sibling nodes;

The hierarchical threshold set includes: a child node number threshold, a reference number threshold and a back-reference coefficient threshold;

Obtaining the path relationship data set and the hierarchical threshold set according to the node data structure includes:

Counting the number of child nodes, parent nodes and sibling nodes of each path segment node;

Calculating the number of references and the mean number of references of the path segment node according to the number of parent nodes;

Calculating the child-node-count weighting coefficient of each level in the node data structure according to the number of child nodes;

Calculating the child node number threshold according to the child-node-count weighting coefficient;

Calculating the reference-count weighting coefficient of the nodes at each level according to the mean number of references;

Calculating the reference number threshold according to the reference-count weighting coefficient;

Calculating the back-reference coefficient according to the number of parent nodes and the number of sibling nodes;

Calculating the mean back-reference coefficient according to the back-reference coefficient;

Calculating the back-reference coefficient threshold using the mean back-reference coefficient.

Optionally, identifying quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set includes:

Determining whether the number of child nodes of any path segment node is greater than the child node number threshold;

If so, determining that the path segment node is quantitative data;

Otherwise, determining whether the number of references is greater than the reference number threshold;

Otherwise, determining whether the number of references is greater than the number of sibling nodes;

Otherwise, determining whether the back-reference coefficient threshold divided by the number of child nodes is greater than the child node number threshold;

Otherwise, determining that the path segment node is suspected variable data.

Optionally, the method also includes:

Marking the suspected variable data with a preset wildcard and combining it with the quantitative data to generate a tree path structure.

Optionally, verifying the suspected variable data to obtain variable data includes:

Traversing the lower-level nodes of the suspected variable data and the lower-level nodes of the quantitative data in the tree path structure;

Determining whether the lower-level nodes of the quantitative data contain a quantitative node that satisfies a first preset condition, namely that the quantitative node is the same as a lower-level node of the suspected variable data;

If so, determining that the suspected variable data is quantitative data;

Otherwise, determining whether the suspected variable data has a sibling node that satisfies a second preset condition, namely that the sibling node is an end node and is quantitative data;

Otherwise, determining that the suspected variable data is variable data.

Optionally, the preset rule is: use "/" as the dividing point.

A device for identifying variables in a URL, including:

An access path acquisition module, used to obtain the access path data of the website to be identified;

A preprocessing module, used to preprocess the access path data to obtain a path relationship data set and a hierarchical threshold set;

A quantitative identification module, used to identify quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set;

A suspected variable verification module, used to verify the suspected variable data to obtain variable data;

A result integration output module, used to integrate and output the quantitative data and the variable data.

Equipment for identifying variables in a URL, including:

A processor, and a memory connected to the processor;

The memory is used to store a computer program, and the computer program is at least used to perform the above method for identifying variables in a URL;

The processor is used to call and execute the computer program in the memory.

A storage medium storing a computer program which, when executed by a processor, implements each step of the above method for identifying variables in a URL.

The technical solution provided by this application can include the following beneficial effects:

This application discloses a method for identifying variables in a URL, which includes: obtaining the access path data of the website to be identified; preprocessing the access path data to obtain a path relationship data set and a hierarchical threshold set; identifying quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set; verifying the suspected variable data to obtain variable data; and integrating and outputting the quantitative data and the variable data. The method preprocesses the access path data of the website, identifies the quantitative data and suspected variable data in the paths, verifies the suspected variable data to determine the final variable data, and then integrates the quantitative data and variable data to obtain the final identification result. The method can automatically analyze the access data and work out the variable parts of the access paths, which greatly improves the efficiency of variable identification.

Brief description of the drawings

To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. Those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

Figure 1 is a flow chart of a method for identifying variables in a URL provided by an embodiment of the present invention;

Figure 2 is a flow chart of a preprocessing method provided by an embodiment of the present invention;

Figure 3 is a flow chart of a quantitative data identification method provided by an embodiment of the present invention;

Figure 4 is a flow chart of a suspected variable verification method provided by an embodiment of the present invention;

Figure 5 is a module diagram of a device for identifying variables in a URL provided by an embodiment of the present invention;

Figure 6 is a structural diagram of equipment for identifying variables in a URL provided by an embodiment of the present invention.

Detailed description

To make the purpose, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention are described in detail below. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other implementations obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Figure 1 is a flow chart of a method for identifying variables in a URL provided by an embodiment of the present invention. Referring to Figure 1, a method for identifying variables in a URL includes:

Step 101: Obtain the access path data of the website to be identified. In this application, log data is used as the access path data.

Step 102: Preprocess the access path data to obtain a path relationship data set and a hierarchical threshold set. In this step, after the access path data to be analyzed is loaded into the system corresponding to this embodiment, the data is preprocessed to obtain data on which the subsequent identification operations can be performed.

Step 103: Identify the quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set.

Step 104: Verify the suspected variable data to obtain the variable data.

Step 105: Integrate and output the quantitative data and the variable data.

By analyzing the access path data, the above method identifies the variable parts and the quantitative parts of the paths. This realizes automatic identification of the variable parts of network paths, which greatly improves the efficiency of variable identification and, in turn, the efficiency of penetration testing.
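Step 101 takes log data as its input but does not fix a log format. As a minimal sketch, assuming access-log lines that contain a quoted request line of the form "METHOD path HTTP/version", the access paths might be collected as follows. The regular expression, the function name access_paths, the choice to drop query strings and duplicates, and the sample lines are all illustrative assumptions.

```python
import re
from urllib.parse import urlsplit

# Matches the quoted request line, e.g. "GET /a/b/c HTTP/1.1" (assumed log layout).
REQUEST = re.compile(r'"(?:GET|POST|PUT|DELETE|PATCH|HEAD|OPTIONS) (\S+) HTTP/[\d.]+"')

def access_paths(log_lines):
    """Return the distinct request paths found in the log, without query strings."""
    paths = set()
    for line in log_lines:
        match = REQUEST.search(line)
        if match:
            paths.add(urlsplit(match.group(1)).path)
    return sorted(paths)

sample = [
    '127.0.0.1 - - [02/Sep/2020:10:00:00 +0800] "GET /a/b/c?x=1 HTTP/1.1" 200 512',
    '127.0.0.1 - - [02/Sep/2020:10:00:01 +0800] "GET /a/d/c HTTP/1.1" 200 498',
]
print(access_paths(sample))  # ['/a/b/c', '/a/d/c']
```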

In more detail, on the basis of the above embodiment, this application also discloses the implementation of step 102, that is, preprocessing the access path data to obtain the path relationship data set and the hierarchical threshold set, as follows:

Figure 2 is a flow chart of a preprocessing method provided by an embodiment of the present invention. Referring to Figure 2, preprocessing the access path data to obtain the path relationship data set and the hierarchical threshold set includes:

Step 201: Divide the access path data according to a preset rule to obtain multiple path segment nodes. In this application, following the characteristics of URLs, "/" is used to divide a path into multiple path segments; each individual path segment is referred to as a path segment node.
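The splitting rule of step 201 can be sketched in a few lines. The function name split_path is hypothetical, and discarding the empty pieces produced by leading or trailing slashes is an assumption, since the source only states that "/" is the dividing point.

```python
def split_path(path: str) -> list[str]:
    """Split a URL path on "/" and drop empty pieces (assumed behaviour)."""
    return [segment for segment in path.split("/") if segment]

print(split_path("/a/b/c"))   # ['a', 'b', 'c']
print(split_path("/a/1/c/"))  # ['a', '1', 'c']
```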

Step 202: Generate a node data structure from the path segment nodes combined with the path relationships; the path relationships are obtained from the path segment nodes according to the access path data. The path segment nodes are arranged into a node data structure that preserves their original path relationships. When the paths of the original data in the log are /a/b/c, /a/b/e and /a/d/c, the node data structure is as follows:

[Figure omitted: node data structure for the paths /a/b/c, /a/b/e and /a/d/c]
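Because the original figure is not available, the following is only a sketch of one way such a node data structure could be held in memory. It assumes that a node is identified by its depth and its segment text, which is consistent with the later statement that node c has the two parents b and d. The function name build_nodes and the dictionary layout are illustrative assumptions.

```python
from collections import defaultdict

def build_nodes(paths):
    """Map (depth, segment) -> {"parents": set, "children": set}."""
    nodes = defaultdict(lambda: {"parents": set(), "children": set()})
    for path in paths:
        segments = [s for s in path.split("/") if s]
        for depth, segment in enumerate(segments, start=1):
            node = nodes[(depth, segment)]  # creates the entry, leaf nodes included
            if depth > 1:
                parent = segments[depth - 2]
                node["parents"].add(parent)
                nodes[(depth - 1, parent)]["children"].add(segment)
    return dict(nodes)

nodes = build_nodes(["/a/b/c", "/a/b/e", "/a/d/c"])
print(sorted(nodes[(1, "a")]["children"]))  # ['b', 'd']
print(sorted(nodes[(3, "c")]["parents"]))   # ['b', 'd']
```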

Step 203: Obtain the path relationship data set and the hierarchical threshold set according to the node data structure. The path relationship data set includes: the number of child nodes, the number of references, and the number of sibling nodes. The hierarchical threshold set includes: the child node number threshold, the reference number threshold and the back-reference coefficient threshold.

The specific process of step 203 is as follows. Count the number of child nodes, parent nodes and sibling nodes of each path segment node. For example, in the node data structure above, the child nodes of b are c and e, so b has 2 child nodes. Node c has the two parent nodes b and d, so c has 2 parent nodes. The sibling node of e is c, so e has 1 sibling node; likewise, the sibling node of c is e, so c has 1 sibling node.
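The counts in this example can be reproduced with a short sketch. It assumes that nodes are keyed by depth and segment text and that the siblings of a node are the other children of any of its parents; both are readings of the worked example rather than definitions given in the source, and all names are hypothetical.

```python
paths = ["/a/b/c", "/a/b/e", "/a/d/c"]

children, parents = {}, {}
for path in paths:
    segs = [s for s in path.split("/") if s]
    for depth, seg in enumerate(segs, start=1):
        children.setdefault((depth, seg), set())
        parents.setdefault((depth, seg), set())
        if depth > 1:
            parents[(depth, seg)].add(segs[depth - 2])
            children[(depth - 1, segs[depth - 2])].add(seg)

def sibling_count(depth, seg):
    """Number of other nodes sharing at least one parent with (depth, seg)."""
    siblings = set()
    for parent in parents[(depth, seg)]:
        siblings |= children[(depth - 1, parent)]
    return len(siblings - {seg})

print(len(children[(2, "b")]))  # 2 child nodes of b
print(len(parents[(3, "c")]))   # 2 parent nodes of c
print(sibling_count(3, "e"))    # 1
print(sibling_count(3, "c"))    # 1
```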

Calculate the number of references and the mean number of references of the path segment node according to the number of parent nodes. Taking the node data structure above as an example, e is referenced once by b, so its number of references is 1; c is referenced once each by b and d, so its number of references is 2. At depth level 3, the mean number of references over the child level of b is (1+2)/2=1.5, and the mean over the child level of d is 2/1=2.
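The arithmetic of this example can be checked with a small sketch. It assumes that the number of references of a node equals the number of distinct parents pointing at it, which matches the figures quoted in the text; the variable names are illustrative.

```python
from statistics import mean

# Edges from depth 2 to depth 3 in the example (/a/b/c, /a/b/e, /a/d/c).
children = {"b": ["c", "e"], "d": ["c"]}

# Number of references of a node: how many parents point at it.
referenced = {}
for parent, kids in children.items():
    for kid in kids:
        referenced[kid] = referenced.get(kid, 0) + 1

# Mean number of references over each parent's child level.
child_level_mean = {p: mean(referenced[k] for k in kids) for p, kids in children.items()}

print(referenced)        # {'c': 2, 'e': 1}
print(child_level_mean)  # {'b': 1.5, 'd': 2}
```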

Calculate the child-node-count weighting coefficient of each level in the node data structure according to the number of child nodes, and calculate the child node number threshold according to that weighting coefficient. The child node number threshold is derived from the child-node-count weighting coefficient together with the degree of dispersion among the data.

Calculate the reference-count weighting coefficient of the nodes at each level according to the mean number of references, and calculate the reference number threshold according to that weighting coefficient. The reference-count weighting coefficient is calculated from the numbers of references of the path segment nodes at the level, combined with the dispersion among the data.

Calculate the back-reference coefficient according to the number of parent nodes and the number of sibling nodes; calculate the mean back-reference coefficient according to the back-reference coefficient; and calculate the back-reference coefficient threshold using the mean back-reference coefficient.

It should be noted that the thresholds in the above embodiment can be replaced by any value that splits the full data set, such as the mean, the median, or (maximum + minimum)/2; the specific form is not limited. The weighting coefficients can be replaced by any value that expresses the dispersion of the full data set, such as the standard deviation or the variance; the specific form of expression is not limited.
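The source names the kinds of values that may serve as thresholds and weighting coefficients but gives no formula for combining them. The sketch below therefore only computes the candidate values with Python's statistics module. The function names, the sample data and the use of the population (rather than sample) standard deviation and variance are assumptions.

```python
from statistics import mean, median, pstdev, pvariance

def threshold_candidates(values):
    """Values that split a level's data set; any of them may serve as a threshold."""
    return {
        "mean": mean(values),
        "median": median(values),
        "midrange": (max(values) + min(values)) / 2,
    }

def dispersion_candidates(values):
    """Dispersion measures that may stand in for a weighting coefficient."""
    return {"stdev": pstdev(values), "variance": pvariance(values)}

child_counts = [2, 1, 1, 9]  # hypothetical child counts of the nodes at one level
print(threshold_candidates(child_counts))
print(dispersion_candidates(child_counts))
```

A level in which one node has far more children than the rest, as in the sample data, yields a large dispersion, which is the kind of signal the weighting coefficient is meant to capture.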

In more detail, on the basis of the above embodiment, this application also discloses the implementation of step 103, that is, identifying the quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set, as follows:

Figure 3 is a flow chart of a quantitative data identification method provided by an embodiment of the present invention. Identifying quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set includes:

Step 301: Determine whether the number of child nodes of any path segment node is greater than the child node number threshold;

Step 302: If the number of child nodes is greater than the child node number threshold, determine that the path segment node is quantitative data;

Step 303: If the number of child nodes is not greater than the child node number threshold, determine whether the number of references is greater than the reference number threshold; if so, perform step 302;

Step 304: If the number of references is not greater than the reference number threshold, determine whether the number of references is greater than the number of sibling nodes; if so, perform step 302;

Step 305: If the number of references is not greater than the number of sibling nodes, determine whether the back-reference coefficient threshold divided by the number of child nodes is greater than the child node number threshold; if so, perform step 302;

Step 306: If the back-reference coefficient threshold divided by the number of child nodes is not greater than the child node number threshold, determine that the path segment node is suspected variable data.
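Steps 301 to 306 form a simple decision cascade, which can be sketched as a single function. The function and parameter names are hypothetical. The source does not say how step 305 treats an end node with no children, where the division is undefined; this sketch treats the test as failed in that case, which is an assumption.

```python
def classify(child_count, ref_count, sibling_count,
             child_threshold, ref_threshold, backref_threshold):
    """Return "quantitative" or "suspected_variable" following steps 301 to 306."""
    if child_count > child_threshold:   # steps 301 and 302
        return "quantitative"
    if ref_count > ref_threshold:       # step 303
        return "quantitative"
    if ref_count > sibling_count:       # step 304
        return "quantitative"
    # Step 305; skipping the division when child_count is 0 is an assumption.
    if child_count and backref_threshold / child_count > child_threshold:
        return "quantitative"
    return "suspected_variable"         # step 306

# Hypothetical node: few children, few references, many siblings.
print(classify(1, 1, 4, child_threshold=3, ref_threshold=2, backref_threshold=1))
```

With these made-up numbers the node fails every test and is reported as a suspected variable, the typical profile of an ID-like segment that sits among many siblings.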

Furthermore, on the basis of the above embodiment, the method also includes: marking the suspected variable data with a preset wildcard and combining it with the quantitative data to generate a tree path structure. According to the determination results, the variable nodes are generalized to "&", merged, and output into the data structure. In the example, if b and d are determined to be suspected variables and the other nodes are determined to be quantitative, the output tree path structure is:

[Figure omitted: tree path structure in which b and d are generalized to "&"]
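Since the original figure is missing, the generalization and merging described above can only be sketched under assumptions. Here the tree is a nested dictionary, suspected variables are given as (depth, segment) pairs, and the function name generalize is hypothetical.

```python
def generalize(paths, suspected, wildcard="&"):
    """Replace suspected (depth, segment) pairs with the wildcard and merge into a tree."""
    tree = {}
    for path in paths:
        level = tree
        segments = [s for s in path.split("/") if s]
        for depth, segment in enumerate(segments, start=1):
            key = wildcard if (depth, segment) in suspected else segment
            level = level.setdefault(key, {})
    return tree

tree = generalize(["/a/b/c", "/a/b/e", "/a/d/c"], {(2, "b"), (2, "d")})
print(tree)  # {'a': {'&': {'c': {}, 'e': {}}}}
```

The two branches under b and d collapse into a single "&" branch whose children are c and e.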

Furthermore, on the basis of the above embodiment, this application also discloses the implementation of step 104, as follows:

Figure 4 is a flow chart of a suspected variable verification method provided by an embodiment of the present invention. Referring to Figure 4, verifying the suspected variable data to obtain variable data includes:

Step 401: Traverse the lower-level nodes of the suspected variable data and the lower-level nodes of the quantitative data in the tree path structure.

Step 402: Determine whether the lower-level nodes of the quantitative data contain a quantitative node that satisfies the first preset condition, namely that the quantitative node is the same as a lower-level node of the suspected variable data. Take the following data structure as an example, in which the original parent node of node e is b. [Figure omitted: example data structure]

Step 403: If such a node exists, determine that the suspected variable data is quantitative data. Node f, which has been identified as quantitative, has the same child node e as b, so at this stage the variable b is restored to quantitative data. The corrected output is: [figure omitted].

Step 404: Otherwise, determine whether the suspected variable data has a sibling node that satisfies the second preset condition, namely that the sibling node is an end node and is quantitative data. If such a sibling exists, perform step 403.

For example, consider the following case: [figure omitted]. After the quantitative data identification, g is identified as a suspected variable, and the data structure becomes: [figure omitted].

Since g is an end node, and its sibling node h is also an end node and has been identified as quantitative, g is restored to quantitative data at this stage.

Step 405: Otherwise, determine that the suspected variable data is variable data.
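The verification in steps 401 to 405 can be sketched as follows. Because the example figures are not available, the data layout is an assumption: node names are taken to be unique, children maps each node to its set of child names, quantitative is the set of nodes already classified as quantitative, and siblings maps each node to its siblings. The function name verify and the sample structures are hypothetical.

```python
def verify(suspect, children, quantitative, siblings):
    """Return "quantitative" or "variable" for one suspected node (steps 401 to 405)."""
    suspect_children = children.get(suspect, set())
    # Steps 402 and 403: some quantitative node shares a child with the suspect.
    for name in quantitative:
        if name != suspect and children.get(name, set()) & suspect_children:
            return "quantitative"
    # Step 404: some sibling is an end node and is quantitative.
    for sib in siblings.get(suspect, set()):
        if sib in quantitative and not children.get(sib):
            return "quantitative"
    return "variable"  # step 405

# b and the quantitative node f share the child e, so b is restored.
print(verify("b", {"b": {"e"}, "f": {"e"}}, {"f", "e"}, {"b": {"f"}}))  # quantitative
# g is an end node whose sibling h is a quantitative end node, so g is restored.
print(verify("g", {"g": set(), "h": set()}, {"h"}, {"g": {"h"}}))       # quantitative
# Neither condition holds, so the suspect is confirmed as a variable.
print(verify("1", {"1": {"c"}, "x": {"y"}}, {"x", "c", "y"}, {"1": {"x"}}))  # variable
```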

The above embodiment can automatically analyze the log data and work out the variable parts of the paths. Compared with manual marking, this reduces the workload, requires no knowledge of the website and needs no assistance from its developers. Because the calculation is automated, the results can be updated automatically as the analyzed target changes, which makes them more timely. The method also solves the problem that variables in a URL could not be identified effectively, and provides accurate features for machine learning. Once the variables are identified, paths such as /a/1/c/ and /a/2/c can be merged into /a/?/c, so that weights that were originally scattered are combined, which helps improve machine learning accuracy.
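The merging of scattered weights mentioned above can be illustrated with a short sketch that sums per-path hit counts after the variable segment has been replaced by a wildcard. The function name merge_counts, the zero-based segment positions and the access counts are illustrative assumptions.

```python
from collections import Counter

def merge_counts(hits, variable_positions, wildcard="?"):
    """Sum per-path hit counts after replacing variable segments with the wildcard."""
    merged = Counter()
    for path, count in hits.items():
        segments = [s for s in path.split("/") if s]
        for i in variable_positions:
            if i < len(segments):
                segments[i] = wildcard
        merged["/" + "/".join(segments)] += count
    return merged

hits = {"/a/1/c/": 3, "/a/2/c": 5}  # hypothetical access counts
print(merge_counts(hits, {1}))      # Counter({'/a/?/c': 8})
```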

Corresponding to the method for identifying variables in a URL provided by the embodiments of the present invention, an embodiment of the present invention also provides a device for identifying variables in a URL, as described in the following embodiment.

Figure 5 is a module diagram of a device for identifying variables in a URL provided by an embodiment of the present invention. A device for identifying variables in a URL includes:

An access path acquisition module 501, used to obtain the access path data of the website to be identified;

A preprocessing module 502, used to preprocess the access path data to obtain a path relationship data set and a hierarchical threshold set;

A quantitative identification module 503, used to identify quantitative data and suspected variable data in the access path data according to the path relationship data set and the hierarchical threshold set;

A suspected variable verification module 504, used to verify the suspected variable data to obtain variable data;

A result integration output module 505, used to integrate and output the quantitative data and the variable data.

更详细地，预处理模块502具体用于：将所述访问路径数据根据预设规则进行分割，得到多个路径段节点；根据所述路径段节点结合路径关系生成节点数据结构；所述路径关系由所述路径段节点根据所述访问路径数据得到；根据所述节点数据结构得到所述路径关系数据集及所述层级阈值集。In more detail, the preprocessing module 502 is specifically configured to: segment the access path data according to preset rules to obtain multiple path segment nodes; generate a node data structure based on the path segment nodes combined with path relationships; the path relationships It is obtained from the path segment node according to the access path data; the path relationship data set and the hierarchical threshold set are obtained according to the node data structure.

定量识别模块503具体用于：判断任一所述路径段节点的所述子节点数是否大于所述子节点数阈值；若是，判定所述路径段节点为定量数据；否则，判断所述被引用数是否大于所述被引用数阈值；若是，判定所述路径段节点为定量数据；否则，判断所述被引用数是否大于所述兄弟节点数；若是，判定所述路径段节点为定量数据；否则，判断所述反向引用系数阈值除以所述子节点数是否大于所述子节点数阈值；若是，判定所述路径段节点为定量数据；否则判定所述路径段节点为所述疑似变量数据。The quantitative identification module 503 is specifically used to: determine whether the number of child nodes of any of the path segment nodes is greater than the child node number threshold; if so, determine that the path segment node is quantitative data; otherwise, determine that the referenced Whether the number is greater than the cited number threshold; if so, determine that the path segment node is quantitative data; otherwise, determine whether the cited number is greater than the number of sibling nodes; if so, determine that the path segment node is quantitative data; Otherwise, it is determined whether the back reference coefficient threshold divided by the number of child nodes is greater than the threshold of the number of child nodes; if so, it is determined that the path segment node is quantitative data; otherwise, it is determined that the path segment node is the suspected variable. data.

疑似变量校验模块504具体用于：遍历所述树型路径结构中的所述疑似变量数据的下级节点及所述定量数据的下级节点；判断所述定量数据的下级节点中是否存在满足第一预设条件的定量节点；所述定量节点与所述疑似变量数据的下级节点相同；若存在，判定所述疑似变量数据为定量数据；否则，判断所述疑似变量数据是否存在满足第二预设条件的兄弟节点；所述兄弟节点为末端节点，且所述兄弟节点为定量数据；若存在，判定所述疑似变量数据为定量数据；否则，判定所述疑似变量数据为变量数据。The suspected variable verification module 504 is specifically configured to: traverse the subordinate nodes of the suspected variable data and the subordinate nodes of the quantitative data in the tree path structure; determine whether there is a subordinate node of the quantitative data that satisfies the first requirement. A quantitative node with preset conditions; the quantitative node is the same as the subordinate node of the suspected variable data; if it exists, determine that the suspected variable data is quantitative data; otherwise, determine whether the suspected variable data exists and satisfies the second preset The sibling node of the condition; the sibling node is an end node, and the sibling node is quantitative data; if it exists, it is determined that the suspected variable data is quantitative data; otherwise, it is determined that the suspected variable data is variable data.

Furthermore, on the basis of the above embodiment, the device in this application further includes:

a wildcard marking module, configured to mark the suspected variable data with a preset wildcard and to generate a tree path structure in combination with the quantitative data.
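The wildcard marking step can be illustrated with the sketch below. It is not the implementation of the embodiment: the function name `build_wildcard_tree`, the use of `"*"` as the preset wildcard, the nested-dictionary tree, and the representation of suspected variable data as `(depth, segment)` pairs are all assumptions made for illustration. Paths are split at "/", as in the preset rule of the method.

```python
def build_wildcard_tree(paths, suspected, wildcard="*"):
    """Build a nested-dict tree of path segments.

    paths:     iterable of URL paths such as "/user/1001/profile"
    suspected: set of (depth, segment) pairs already classified as
               suspected variable data (hypothetical representation)
    wildcard:  placeholder written in place of each suspected segment
    """
    tree = {}
    for path in paths:
        node = tree
        # Split at "/" and drop empty pieces from leading or trailing slashes.
        segments = [seg for seg in path.split("/") if seg]
        for depth, seg in enumerate(segments):
            key = wildcard if (depth, seg) in suspected else seg
            # Reuse the existing branch if present, otherwise create it.
            node = node.setdefault(key, {})
    return tree


tree = build_wildcard_tree(
    ["/user/1001/profile", "/user/1002/profile", "/user/list"],
    suspected={(1, "1001"), (1, "1002")},
)
print(tree)
```

In this sketch the two paths that differ only in the suspected segment collapse into a single `"*"` branch, which is one way the scattered variables could be consolidated before verification.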

With the above identification device, variables in access paths can be identified automatically, which greatly improves the efficiency of variable identification. In addition, the scattered variables that are identified are consolidated, which helps improve the accuracy of machine learning.

To describe more clearly the hardware system that implements the embodiments of the present invention, and corresponding to the method for identifying variables in a URL provided by the embodiments of the present invention, an embodiment of the present invention further provides equipment for identifying variables in a URL. See the embodiment below.

Figure 6 is a structural diagram of equipment for identifying variables in a URL provided by an embodiment of the present invention. Referring to Figure 6, the equipment for identifying variables in a URL includes:

a processor 601, and a memory 602 connected to the processor 601;

the memory 602 is configured to store a computer program, the computer program being at least used to execute the method for identifying variables in a URL described above;

the processor 601 is configured to call and execute the computer program in the memory 602.

On the basis of the above embodiments, a storage medium is also disclosed. The storage medium stores a computer program which, when executed by a processor, implements each step of the method for identifying variables in a URL described above.

It can be understood that the same or similar parts of the above embodiments may be cross-referenced, and content not described in detail in one embodiment may be found in the same or similar content of other embodiments.

It should be noted that, in the description of the present invention, the terms "first", "second" and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. Furthermore, in the description of the present invention, unless otherwise stated, "a plurality of" means at least two.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process. The scope of the preferred embodiments of the present invention includes additional implementations in which functions may be performed out of the order shown or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present invention belong.

It should be understood that the various parts of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, they may be implemented by any one or a combination of the following technologies known in the art: discrete logic circuits having logic gate circuits for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gate circuits, programmable gate arrays (PGA), field-programmable gate arrays (FPGA), and the like.

Those of ordinary skill in the art can understand that all or part of the steps of the methods of the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.

In the description of this specification, a description referring to the terms "one embodiment", "some embodiments", "an example", "a specific example", "some examples" or the like means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present invention have been shown and described above, it can be understood that the above embodiments are exemplary and are not to be construed as limiting the present invention. Those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims

1. A method of identifying a variable in a URL, comprising:

acquiring access path data of a website to be identified;

preprocessing the access path data to obtain a path relation data set and a hierarchical threshold set;

identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the hierarchical threshold set;

verifying the suspected variable data to obtain variable data;

integrating and outputting the quantitative data and the variable data;

wherein preprocessing the access path data to obtain a path relation data set and a hierarchical threshold set comprises:

dividing the access path data according to a preset rule to obtain a plurality of path segment nodes;

generating a node data structure according to the path segment nodes and a path relation, the path relation being obtained for the path segment nodes from the access path data;

obtaining the path relation data set and the hierarchical threshold set according to the node data structure;

the path relation data set comprises: a number of child nodes, a referenced number, and a number of sibling nodes;

the hierarchical threshold set comprises: a child node number threshold, a referenced number threshold, and a reverse reference coefficient threshold;

wherein obtaining the path relation data set and the hierarchical threshold set according to the node data structure comprises:

counting the number of child nodes, the number of parent nodes, and the number of sibling nodes of each path segment node;

calculating the referenced number of the path segment nodes, and the mean of the referenced numbers, according to the number of parent nodes;

calculating a weighting coefficient of the number of child nodes for each level of the node data structure according to the number of child nodes;

calculating the child node number threshold according to the weighting coefficient of the number of child nodes;

calculating a weighting coefficient of the referenced number for the nodes of each level according to the mean of the referenced numbers;

calculating the referenced number threshold according to the weighting coefficient of the referenced number;

calculating a reverse reference coefficient according to the number of parent nodes and the number of sibling nodes;

calculating the mean of the reverse reference coefficients according to the reverse reference coefficients;

and calculating the reverse reference coefficient threshold using the mean of the reverse reference coefficients.

2. The method of claim 1, wherein identifying quantitative data and suspected variable data in the access path data according to the path relation data set and the hierarchical threshold set comprises:

determining whether the number of child nodes of any path segment node is greater than the child node number threshold;

if the number of child nodes is greater than the child node number threshold, determining that the path segment node is quantitative data;

otherwise, determining whether the referenced number is greater than the referenced number threshold;

if the referenced number is greater than the referenced number threshold, determining that the path segment node is quantitative data;

otherwise, determining whether the referenced number is greater than the number of sibling nodes;

if the referenced number is greater than the number of sibling nodes, determining that the path segment node is quantitative data;

otherwise, determining whether the reverse reference coefficient threshold divided by the number of child nodes is greater than the child node number threshold;

if the reverse reference coefficient threshold divided by the number of child nodes is greater than the child node number threshold, determining that the path segment node is quantitative data;

otherwise, determining that the path segment node is suspected variable data.

3. The method according to claim 1, further comprising:

marking the suspected variable data with a preset wildcard, and generating a tree path structure in combination with the quantitative data.

4. The method according to claim 3, wherein verifying the suspected variable data to obtain variable data comprises:

traversing the lower-level nodes of the suspected variable data and the lower-level nodes of the quantitative data in the tree path structure;

determining whether a quantitative node satisfying a first preset condition exists among the lower-level nodes of the quantitative data, where the quantitative node is the same as a lower-level node of the suspected variable data;

if a quantitative node satisfying the first preset condition exists among the lower-level nodes of the quantitative data, determining that the suspected variable data is quantitative data;

otherwise, determining whether the suspected variable data has a sibling node satisfying a second preset condition, where the sibling node is an end node and the sibling node is quantitative data;

if the suspected variable data has a sibling node satisfying the second preset condition, determining that the suspected variable data is quantitative data;

otherwise, determining that the suspected variable data is variable data.

5. The method according to claim 1, wherein the preset rule is: using "/" as the split point.

6. A device for identifying a variable in a URL, comprising:

an access path acquisition module, configured to acquire access path data of a website to be identified;

a preprocessing module, configured to preprocess the access path data to obtain a path relation data set and a hierarchical threshold set;

calculating the reverse reference coefficient threshold using the mean of the reverse reference coefficients;

a quantitative identification module, configured to identify quantitative data and suspected variable data in the access path data according to the path relation data set and the hierarchical threshold set;

a suspected variable verification module, configured to verify the suspected variable data to obtain variable data;

and a result integration output module, configured to integrate and output the quantitative data and the variable data.

7. Equipment for identifying a variable in a URL, comprising:

a processor, and a memory connected to the processor;

the memory is configured to store a computer program, the computer program being at least used to execute the method of identifying a variable in a URL according to any one of claims 1 to 5;

the processor is configured to call and execute the computer program in the memory.

8. A storage medium storing a computer program which, when executed by a processor, performs the steps of the method of identifying a variable in a URL as claimed in any one of claims 1 to 5.