WO2025190164A1

WO2025190164A1 - Mapping relationship acquiring method and apparatus

Info

Publication number: WO2025190164A1
Application number: PCT/CN2025/081208
Authority: WO
Inventors: 彭煦潭; 冯军; 朱家法
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2024-03-11
Filing date: 2025-03-07
Publication date: 2025-09-18
Anticipated expiration: 2026-09-11
Also published as: CN120639727A

Abstract

Provided in the embodiments of the present application are a mapping relationship acquiring method and an apparatus. The method comprises: on the basis of target data, acquiring an attribute of the port number of a first port, the target data comprising a port number field and a background field, the port number field comprising the port number of the first port, and the background field comprising the attribute of the port number; performing feature extraction on the attribute of the port number so as to obtain a feature tag set of the port number, the feature tag set comprising one or more feature tags of the port number; and, on the basis of the port number and the feature tag set, acquiring a mapping relationship, the mapping relationship comprising a key-value pair, the key in the mapping relationship comprising the port number, and the value in the mapping relationship comprising the feature tag set.

Description

Method and device for obtaining mapping relationship

This application claims priority to Chinese patent application No. 202410288365.2, filed with the China National Intellectual Property Administration on March 11, 2024 and entitled "Method and Device for Obtaining Mapping Relationship", the entire contents of which are incorporated herein by reference.

Technical Field

The present application relates to the field of computers, and in particular to a method and device for obtaining a mapping relationship.

Background Art

A port can be understood as the outlet through which a communication device communicates with the outside world. Ports include virtual ports and physical ports. A communication device can have multiple ports, each of which can be associated with a specific application or service on that device. A port number is data that uniquely identifies a port. In machine learning systems that operate on data communication data, the port number is an extremely important feature. By converting port numbers into feature tags, the feature tags can be used as input to machine learning algorithms that perform tasks such as application identification and anomaly detection.

To obtain a feature tag for a port number, the port number itself is usually encoded directly. For example, with one-hot encoding, if d distinct port numbers appear across all samples, a d-dimensional binary vector is used as the feature tag of each port number.
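
The conventional one-hot approach described above can be sketched as follows. This is a minimal illustration only; the function name `one_hot_encode` and the sample port numbers are assumptions introduced for the example and do not come from the application.

```python
def one_hot_encode(port_numbers):
    """Map each distinct port number to a d-dimensional binary vector,
    where d is the number of distinct port numbers in the samples."""
    distinct = sorted(set(port_numbers))
    d = len(distinct)
    encoding = {}
    for position, port in enumerate(distinct):
        vector = [0] * d
        vector[position] = 1  # exactly one component is set per port number
        encoding[port] = vector
    return encoding


# Hypothetical sample ports; four distinct values, so d = 4.
samples = [80, 443, 22, 80, 8080]
encoding = one_hot_encode(samples)
```

Any two distinct port numbers receive orthogonal vectors, so the encoding carries no indication that, for example, ports 80 and 8080 serve a related purpose.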

However, a feature tag obtained by directly encoding the port number itself cannot represent the attributes of the port number, so the feature tag loses information about the port number. In addition, the feature tags of different port numbers cannot represent the correlation between different ports.

Summary of the Invention

The present application provides a method and device for obtaining a mapping relationship, which can avoid information loss. The technical solution is as follows.

In a first aspect, a method for obtaining a mapping relationship is provided, the method comprising: obtaining an attribute of the port number of a first port based on target data, the target data comprising a port number field and a background field, the port number field comprising the port number of the first port, and the background field comprising the attribute of the port number; performing feature extraction on the attribute of the port number to obtain a feature tag set of the port number, the feature tag set comprising one or more feature tags of the port number; and obtaining a mapping relationship based on the port number and the feature tag set, the mapping relationship comprising a key-value pair, the key in the mapping relationship comprising the port number, and the value in the mapping relationship comprising the feature tag set.
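
The three steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (`port` and `background` keys), the comma-separated background text, and the helper `extract_feature_tags` are hypothetical and are not specified by the application.

```python
def extract_feature_tags(attribute_text):
    """Hypothetical feature extraction: split the background text on commas
    and normalise case, yielding a set of feature tags."""
    return {part.strip().lower() for part in attribute_text.split(",") if part.strip()}


def build_mapping(target_data):
    """Build the key-value mapping: port number -> feature tag set."""
    mapping = {}
    for record in target_data:
        port = record["port"]            # port number field
        attribute = record["background"]  # background field holding the attribute
        mapping.setdefault(port, set()).update(extract_feature_tags(attribute))
    return mapping


# Hypothetical target data.
target_data = [
    {"port": 80, "background": "HTTP, web"},
    {"port": 443, "background": "HTTPS, web"},
]
mapping = build_mapping(target_data)
```

In this sketch ports 80 and 443 share the tag `web`, which is the kind of overlap the method relies on to express correlation between ports.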

The method provided in the first aspect exploits the co-occurrence relationship between the port number and the background field: instead of directly encoding the port number itself, the attributes of the port number are obtained from the background field related to the port number, and the feature tag set of the port number (in other words, the encoding of the port number) is obtained based on those attributes. The feature tag set can therefore represent the attributes of the port number (in other words, its semantics or purpose), which avoids the information loss that arises when the feature tag set cannot represent those attributes. In addition, if two port numbers share attributes, their feature tag sets contain identical parts; in other words, the encodings of the two port numbers are partially identical or similar. The similarity between the feature tag sets of two port numbers can therefore represent how closely the two port numbers are associated, and thus more fully represents the correlation between different ports.

In some embodiments, the target data includes static data, where static data refers to data that is unrelated to the actual connection usage of the port. The attributes of the port number include usage information of the port number, and the usage information includes at least one of an identifier of an application that communicates through the port number, an identifier of a protocol that communicates through the port number, or an identifier of a service that communicates through the port number. Performing feature extraction on the attributes of the port number to obtain the feature tag set of the port number includes: performing feature extraction on the usage information of the port number to obtain a static tag set of the port number, the static tag set including one or more static tags of the port number.

The usage information of a port number is used as the static tags of the port number. The number of applications associated with one port number is usually small: in most cases a port number corresponds to only one application, and in a few cases to multiple applications, but even then the number of applications is generally in the single digits. Therefore, when the usage information of a port number is used as its static tag set, the size of the static tag set (in other words, the dimension of the feature vector) does not exceed 10. This is far smaller than the size in the tens of thousands that the tag set reaches with one-hot encoding or label encoding, and so the curse of dimensionality is avoided.

In some embodiments, the static data includes a port registry, and the port registry includes at least one of an IANA port registry or a private protocol port registry.

Because the static tags of a port number can be obtained from the port registry alone, the computational overhead is small, and the port registry does not need to be updated afterwards, which avoids the overhead that additional registry updates would incur.

In some embodiments, before the attribute of the port number of the first port is obtained based on the target data, the method further includes: performing redundancy removal on an original port registry to obtain the port registry.

Performing redundancy removal on the original port registry reduces the data volume of the port registry, which reduces the amount of data that must be processed when extracting feature tags from the port registry and thus improves the efficiency of feature tag extraction.

In some embodiments, performing redundancy removal on the original port registry includes: if it is detected that a background field in the original port registry includes a keyword indicating duplication, deleting the row containing that background field from the original port registry.

When a keyword indicating duplication appears in a row of the port registry, this usually means that the row is a redundant repeat of an earlier row. Deleting the rows that contain such keywords from the original port registry therefore avoids the needless increase in processed data that would result from extracting feature tags separately from several repeated rows, and further improves the efficiency of feature tag extraction.
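
The row-deletion step can be sketched as follows. This is a minimal illustration only; the keyword list `DUPLICATE_KEYWORDS`, the row layout, and the sample rows are assumptions made for the example, since the application does not enumerate the keywords that indicate duplication.

```python
# Hypothetical keywords that mark a registry row as a repeat of an earlier row.
DUPLICATE_KEYWORDS = ("duplicate", "alias for")


def drop_duplicate_rows(rows):
    """Return the registry rows whose background (description) field does not
    contain any keyword indicating duplication."""
    kept = []
    for row in rows:
        description = row["description"].lower()
        if any(keyword in description for keyword in DUPLICATE_KEYWORDS):
            continue  # the row repeats an earlier entry, so it is removed
        kept.append(row)
    return kept


# Hypothetical original port registry.
original_registry = [
    {"port": 80, "service": "http", "description": "World Wide Web HTTP"},
    {"port": 80, "service": "www", "description": "Duplicate of the http entry"},
    {"port": 22, "service": "ssh", "description": "Secure Shell"},
]
port_registry = drop_duplicate_rows(original_registry)
```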

In some embodiments, performing redundancy removal on the original port registry includes: if it is detected that multiple background fields in the original port registry contain terms that have the same meaning but are expressed differently, determining a target wording for the terms, and replacing the terms appearing in the multiple background fields with the target wording.

Unifying the wording of terms in the port registry that have the same meaning but are expressed differently avoids the needless increase in processed data that would result from extracting feature tags separately from rows that are worded inconsistently but mean the same thing, and further improves the efficiency of feature tag extraction.
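
Term unification can be sketched as follows. This is a minimal illustration only; the synonym table `SYNONYMS` and its entries are hypothetical, and the application does not specify how equivalent terms are detected or how the target wording is chosen.

```python
import re

# Hypothetical table: variant wording -> target wording.
SYNONYMS = {
    "hypertext transfer protocol": "http",
    "world wide web http": "http",
}


def normalize_terms(text, synonyms=SYNONYMS):
    """Replace every known variant in a background field with its target wording."""
    result = text.lower()
    # Longer variants are replaced first so that a shorter variant cannot
    # split a longer one.
    for variant in sorted(synonyms, key=len, reverse=True):
        pattern = r"\b" + re.escape(variant) + r"\b"
        result = re.sub(pattern, synonyms[variant], result)
    return result


unified_a = normalize_terms("World Wide Web HTTP")
unified_b = normalize_terms("Hypertext Transfer Protocol over TLS")
```

After unification both background fields contain the same token `http`, so a single feature tag is extracted for it.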

In some embodiments, the target data includes dynamic data, where dynamic data refers to data related to the actual connection usage of the first port. The attributes of the port number include an object identifier, which identifies an object that has an interaction relationship with the first port. Performing feature extraction on the attributes of the port number to obtain the feature tag set of the port number includes: performing feature extraction on the object identifier to obtain a dynamic tag set of the port number, the dynamic tag set including one or more dynamic tags of the port number.

Because the dynamic tags of a port number are obtained from the identifiers of the objects that interact with the port number, the feature tag sets of different ports that interact with the same object are similar. The tags of a port number can therefore more fully characterize the association between different port numbers and contain richer information. In addition, the number of objects that interact with one port number is usually limited. Even when that number is large, the dimension of the dynamic tag set can be reduced by retaining only those objects whose frequency of occurrence is higher than a predetermined frequency, so that the size of the dynamic tag set does not reach the tens of thousands seen with one-hot encoding or label encoding, and the curse of dimensionality is avoided.
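
Dynamic tag extraction with frequency filtering can be sketched as follows. This is a minimal illustration only; the log format (pairs of local port and remote address), the use of a raw interaction count as the frequency, and the threshold value are assumptions made for the example.

```python
from collections import Counter, defaultdict


def dynamic_tag_sets(log_records, frequency_threshold=1):
    """For each local port, keep the identifiers of the objects that interacted
    with it more often than the threshold."""
    counts = defaultdict(Counter)
    for local_port, remote_address in log_records:
        counts[local_port][remote_address] += 1
    return {
        port: {obj for obj, n in per_object.items() if n > frequency_threshold}
        for port, per_object in counts.items()
    }


# Hypothetical log records: (local port, remote network address).
log_records = [
    (443, "10.0.0.5"), (443, "10.0.0.5"), (443, "10.0.0.5"),
    (443, "10.0.0.9"),
    (8443, "10.0.0.5"), (8443, "10.0.0.5"),
]
dynamic_tags = dynamic_tag_sets(log_records)
```

Ports 443 and 8443 both retain the tag `10.0.0.5`, while the address seen only once on port 443 is dropped.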

In some embodiments, the dynamic data includes log records of a first communication device, where the log records record data generated by communication between the first port of the first communication device and a second communication device. The object identifier includes at least one of: the network address of the second communication device, the identifier of the area where the second communication device is located, the port in the second communication device that communicates with the first port, or the identifier of an application in the second communication device that communicates with the first port.

The address of the remote device, the identifier of the area where the remote device is located, the interface of the remote device, the port of the remote device, the identifier of an application on the remote device, and the like can all serve as feature tags of the port. This makes the data forms of the feature tags more diverse, suits a wider range of application scenarios, and improves flexibility.

In some embodiments, the attributes of the port number further include the number of interactions between the object and the first port, and performing feature extraction on the object identifier to obtain the dynamic tag set of the port number includes: concatenating or fusing the object identifier with the number of interactions to obtain the feature tag set of the port number.

Because the dynamic tags of a port retain the number of interactions between the port and objects such as remote IP addresses and remote applications, the dynamic tags can provide richer background information. For example, different ports that interact with the same object at similar frequencies have feature tags of similar value, so the similarity between the feature tags of two port numbers can more fully characterize how closely the two port numbers are associated.
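
One way to retain the interaction counts is to treat the dynamic tags of a port as a weighted vector keyed by object identifier. The sketch below is a minimal illustration only; the log format and the choice of cosine similarity as the comparison measure are assumptions, since the application does not prescribe a specific measure.

```python
import math
from collections import Counter


def interaction_counts(log_records, port):
    """Combine each object identifier with its interaction count for one port."""
    return Counter(remote for local, remote in log_records if local == port)


def cosine_similarity(a, b):
    """Cosine similarity between two count vectors keyed by object identifier."""
    dot = sum(a[key] * b[key] for key in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical log records: (local port, remote network address).
log_records = [
    (443, "10.0.0.5"), (443, "10.0.0.5"), (443, "10.0.0.9"),
    (8443, "10.0.0.5"), (8443, "10.0.0.5"), (8443, "10.0.0.9"),
    (22, "10.0.0.7"),
]
tags_443 = interaction_counts(log_records, 443)
tags_8443 = interaction_counts(log_records, 8443)
tags_22 = interaction_counts(log_records, 22)
```

Ports 443 and 8443 interact with the same objects at the same frequencies and therefore score close to 1, whereas port 22 shares no object with either and scores 0.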

In some embodiments, the dynamic data further includes threat records, and the background field in a threat record includes an attack type field. The attributes of the port number include the type of network attack that targets the first port and the number of times the first port has been attacked by that type of network attack. Performing feature extraction on the attributes of the port number to obtain the feature tag set of the port number includes: performing feature extraction on at least one of the type of network attack or the number of attacks to obtain a dynamic tag set of the port number, the dynamic tag set including one or more dynamic tags of the port number.

In this way, the feature tag sets of different ports attacked by the same type of network attack have a higher similarity, and the feature tag sets of different ports attacked with similar frequency have a higher similarity, so that the feature tag set of a port can more fully represent the semantics of the port.

In some embodiments, performing feature extraction on the attributes of the port number to obtain the feature tag set of the port number includes: performing feature extraction on the attributes of the port number in static data to obtain a static tag set of the port number; performing feature extraction on the attributes of the port number in dynamic data to obtain a dynamic tag set of the port number; and fusing the static tag set of the port number with the dynamic tag set of the port number to obtain the feature tag set of the port number.

Extracting feature tags by fusing a dynamic path with a static path not only avoids information loss and the curse of dimensionality. Because the feature tag set draws on two data sources, dynamic data and static data, the extraction process also takes both into account, which solves the problem that static and dynamic data are difficult to accommodate together and enables the feature tags to represent the semantics of a port more fully.
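
The fusion of the static and dynamic paths can be sketched as follows. This is a minimal illustration only; representing fusion as a union of source-prefixed tags is an assumption, and the application leaves the concrete fusion operation open.

```python
def fuse_tag_sets(static_tags, dynamic_tags):
    """Fuse the two tag sets into one feature tag set, keeping track of the
    source of each tag so that equal strings from different paths stay distinct."""
    fused = {("static", tag) for tag in static_tags}
    fused |= {("dynamic", tag) for tag in dynamic_tags}
    return fused


# Hypothetical tag sets for port 443.
static_tags = {"https", "web"}           # for example, from a port registry
dynamic_tags = {"10.0.0.5", "10.0.0.9"}  # for example, from device log records
feature_tags = fuse_tag_sets(static_tags, dynamic_tags)
```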

In some embodiments, obtaining the attribute of the port number of the first port based on the target data includes: determining the similarity between a background field in the target data and the port number; and, if the similarity is higher than a similarity threshold, extracting the attribute of the port number from the background field.

This amounts to checking the correlation between the background field and the port number. Background fields that are not closely related to the port number do not take part in feature tag extraction; only background fields that are closely related to the port number do. This reduces interference from background fields that are not closely related to the port number and also reduces the number of dimensions of the feature tags.

In some embodiments, obtaining the attribute of the port number of the first port based on the target data includes: determining a hash degree of a background field in the target data; and, if the hash degree is higher than a hash degree threshold, extracting the attribute of the port number from the background field.

This amounts to checking the hash degree of the background field. Background fields with an insufficient hash degree do not take part in feature tag extraction; only background fields with a sufficiently high hash degree do. This improves the distinguishability of the feature tags of different port numbers and lowers the probability that different port numbers receive identical feature tags.
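
The hash degree check can be sketched as follows. The application does not define how the hash degree is computed, so this sketch assumes, purely for illustration, that it is the ratio of distinct values to total values in a field; the record layout, field names, and threshold are likewise hypothetical.

```python
def hash_degree(values):
    """Assumed definition: fraction of distinct values among all values of a field."""
    values = list(values)
    return len(set(values)) / len(values) if values else 0.0


def select_background_fields(records, candidate_fields, threshold=0.5):
    """Keep only the background fields whose hash degree exceeds the threshold."""
    selected = []
    for field in candidate_fields:
        if hash_degree(record[field] for record in records) > threshold:
            selected.append(field)
    return selected


# Hypothetical records: "direction" barely varies, "remote_ip" varies a lot.
records = [
    {"port": 443, "remote_ip": "10.0.0.5", "direction": "inbound"},
    {"port": 443, "remote_ip": "10.0.0.9", "direction": "inbound"},
    {"port": 22, "remote_ip": "10.0.0.7", "direction": "inbound"},
    {"port": 22, "remote_ip": "10.0.0.7", "direction": "inbound"},
]
selected_fields = select_background_fields(records, ["remote_ip", "direction"])
```

Under this assumed definition the `direction` field takes the same value in every record, cannot distinguish one port from another, and is excluded.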

In some embodiments, the method further includes: comparing the feature tag set of the first port with the feature tag set of a second port to obtain the similarity between the two feature tag sets; and determining the similarity between the first port and the second port based on the similarity between the feature tag set of the first port and the feature tag set of the second port.

This supports a quantitative and relatively accurate evaluation of the similarity between different ports.
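
A quantitative comparison of two feature tag sets can be sketched as follows. The application does not name a similarity measure; the Jaccard index is used here as one common choice for comparing sets, and the mapping contents are hypothetical.

```python
def jaccard_similarity(tags_a, tags_b):
    """Size of the intersection divided by the size of the union of two tag sets."""
    union = tags_a | tags_b
    if not union:
        return 0.0
    return len(tags_a & tags_b) / len(union)


# Hypothetical mapping relationship: port number -> feature tag set.
mapping = {
    80: {"http", "web"},
    443: {"https", "web"},
    22: {"ssh"},
}
similarity_80_443 = jaccard_similarity(mapping[80], mapping[443])
similarity_80_22 = jaccard_similarity(mapping[80], mapping[22])
```

Ports 80 and 443 share one of three distinct tags, while ports 80 and 22 share none, so the first pair is rated as more similar.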

In some embodiments, the target data includes an alarm log set, and the alarm log set includes a first alarm log and a second alarm log. The method further includes: determining the similarity between the first alarm log and the second alarm log based on the similarity between the feature tag set of the first port and the feature tag set of the second port, where the feature tag set of the first port is determined based on the background field in the first alarm log and the feature tag set of the second port is determined based on the background field in the second alarm log; and clustering the alarm log set based on the similarity between the first alarm log and the second alarm log.

Because the similarity between the alarm logs in which ports appear is determined from the similarity between the feature tag sets of those ports, the similarity between alarm logs includes similarity in the port dimension and is therefore more accurate. Clustering the alarm logs based on this similarity can thus determine the types of the alarm logs more accurately.

In some embodiments, the target data includes network connection information of the first communication device in a normal state, and the method further includes: when it is detected that a second communication device establishes a connection with the first port of the first communication device, searching the mapping relationship based on the port number of the first port to obtain the feature tag set of the port number; if a feature of the second communication device hits the feature tag set of the port number, determining that the connection is a normal connection; and if the feature of the second communication device does not hit the feature tag set of the port number, determining that the connection is an abnormal connection.

The mapping relationship between ports and feature tag sets that is determined from network connection information in the normal state can serve as a traffic baseline. If the feature of the second communication device belongs to the feature tag set corresponding to the port, the connection that the second communication device is currently initiating to the port does not exceed the traffic baseline, and the second communication device has connected normally to the port of the first communication device during a historical time period. If the feature of the second communication device does not belong to the feature tag set corresponding to the port, the connection that the second communication device is currently initiating to the port deviates from the traffic baseline, and the connection between the second communication device and the first communication device is determined to be an abnormal connection. This helps to detect abnormal connections more accurately.
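
The baseline check can be sketched as follows. This is a minimal illustration only; the baseline contents, the use of the remote network address as the feature of the second communication device, and the function name `classify_connection` are assumptions made for the example.

```python
def classify_connection(mapping, port, remote_feature):
    """Look up the feature tag set of the port and report whether the feature
    of the connecting device hits it."""
    feature_tags = mapping.get(port, set())
    return "normal" if remote_feature in feature_tags else "abnormal"


# Hypothetical baseline built from connection information in the normal state:
# port number -> network addresses that connected to the port.
baseline = {
    443: {"10.0.0.5", "10.0.0.9"},
    22: {"10.0.0.7"},
}
known_peer = classify_connection(baseline, 443, "10.0.0.5")
unknown_peer = classify_connection(baseline, 22, "203.0.113.8")
```

In this sketch a port that is absent from the baseline has an empty feature tag set, so any connection to it is reported as abnormal; whether that is the desired policy is a design choice outside the text above.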

In some embodiments, the target data includes a training sample set, and each sample in the training sample set includes a source address, a destination address, a source port number, a destination port number, a protocol type, and a type label. The method further includes: searching the mapping relationship based on the source port number to obtain the feature tag set of the source port number; searching the mapping relationship based on the destination port number to obtain the feature tag set of the destination port number; concatenating or fusing the feature tag set of the source address, the feature tag set of the destination address, the feature tag set of the source port number, the feature tag set of the destination port number, and the feature tag set of the protocol type to obtain the feature tag set of each sample in the training sample set; and performing model training based on the feature tag set and the type label of each sample in the training sample set to obtain a classification model.

In a second aspect, a device for obtaining a mapping relationship is provided, the device comprising:

an acquiring unit, configured to acquire an attribute of the port number of a first port based on target data, wherein the target data includes a port number field and a background field, the port number field includes the port number of the first port, and the background field includes the attribute of the port number;

a processing unit, configured to perform feature extraction on the attribute of the port number to obtain a feature tag set of the port number, wherein the feature tag set includes one or more feature tags of the port number;

the acquiring unit being further configured to acquire a mapping relationship based on the port number and the feature tag set, wherein the mapping relationship includes a key-value pair, the key in the mapping relationship includes the port number, and the value in the mapping relationship includes the feature tag set.

In some embodiments, the target data includes static data, where static data refers to data that is unrelated to the actual connection usage of the port. The attributes of the port number include usage information of the port number, and the usage information includes at least one of an identifier of an application that communicates through the port number, an identifier of a protocol that communicates through the port number, or an identifier of a service that communicates through the port number. The processing unit is further configured to perform feature extraction on the usage information of the port number to obtain a static tag set of the port number, the static tag set including one or more static tags of the port number.

In some embodiments, the processing unit is further configured to perform redundancy removal on an original port registry to obtain the port registry.

In some embodiments, the processing unit is further configured to perform at least one of the following: if it is detected that a background field in the original port registry includes a keyword indicating duplication, deleting the row containing that background field from the original port registry; or, if it is detected that multiple background fields in the original port registry contain terms that have the same meaning but are expressed differently, determining a target wording for the terms and replacing the terms appearing in the multiple background fields with the target wording.

In some embodiments, the target data includes dynamic data, where dynamic data refers to data related to the actual connection usage of the first port. The attributes of the port number include an object identifier, which identifies an object that has an interaction relationship with the first port. The processing unit is configured to perform feature extraction on the object identifier to obtain a dynamic tag set of the port number, the dynamic tag set including one or more dynamic tags of the port number.

In some embodiments, the attributes of the port number further include the number of interactions between the object and the first port, and the processing unit is configured to concatenate or fuse the object identifier with the number of interactions to obtain the feature tag set of the port number.

In some embodiments, the dynamic data further includes threat records, and the background field in a threat record includes an attack type field. The attributes of the port number include the type of network attack that targets the first port and the number of times the first port has been attacked by that type of network attack. The processing unit is configured to perform feature extraction on at least one of the type of network attack or the number of attacks to obtain a dynamic tag set of the port number, the dynamic tag set including one or more dynamic tags of the port number.

In some embodiments, the processing unit is further configured to: perform feature extraction on the attributes of the port number in static data to obtain a static tag set of the port number; perform feature extraction on the attributes of the port number in dynamic data to obtain a dynamic tag set of the port number; and fuse the static tag set of the port number with the dynamic tag set of the port number to obtain the feature tag set of the port number.

In some implementations, the processing unit is further configured to determine a similarity between a background field in the target data and the port number, and, if the similarity is higher than a similarity threshold, extract the attribute of the port number from the background field.

In some implementations, the processing unit is further configured to determine a hash degree of a background field in the target data, and, if the hash degree is higher than a hash degree threshold, extract the attribute of the port number from the background field.

In some implementations, the apparatus further includes:

a comparing unit, configured to compare the feature tag set of the first port with the feature tag set of a second port to obtain a similarity between the feature tag set of the first port and the feature tag set of the second port;

the processing unit is further configured to determine a similarity between the first port and the second port based on the similarity between the feature tag set of the first port and the feature tag set of the second port.
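
The comparison of two feature tag sets can be sketched as follows. This application does not name a specific similarity measure; the Jaccard coefficient is used here as an assumption, and the mapping contents are hypothetical.

```python
def jaccard_similarity(tags_a, tags_b):
    """Return |A & B| / |A | B|, or 0.0 when both sets are empty."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical mapping from port number to feature tag set.
mapping = {
    80: {"www", "http"},
    8080: {"http-alt", "http"},
    79: {"finger"},
}
sim_80_8080 = jaccard_similarity(mapping[80], mapping[8080])  # one shared tag out of three
sim_80_79 = jaccard_similarity(mapping[80], mapping[79])      # no shared tag
```

With this measure, ports 80 and 8080 come out as more similar than ports 80 and 79, even though 79 is numerically much closer to 80.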

In some implementations, the target data includes an alarm log set, and the alarm log set includes a first alarm log and a second alarm log. The processing unit is further configured to determine a similarity between the first alarm log and the second alarm log based on the similarity between the feature tag set of the first port and the feature tag set of the second port, where the feature tag set of the first port is determined based on the background field in the first alarm log and the feature tag set of the second port is determined based on the background field in the second alarm log. The processing unit is further configured to cluster the alarm log set based on the similarity between the first alarm log and the second alarm log.
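
A minimal sketch of such clustering is given below. The clustering algorithm is not specified in this application; the example assumes a simple greedy scheme with a Jaccard similarity threshold, and assumes that the mapping from port number to feature tag set has already been built. The log layout, the threshold value, and the tag sets are hypothetical.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_alarm_logs(logs, mapping, threshold=0.5):
    """Greedy clustering: a log joins the first cluster whose first member has a
    sufficiently similar port tag set; otherwise it starts a new cluster."""
    clusters = []
    for log in logs:
        tags = mapping.get(log["port"], set())
        for cluster in clusters:
            rep_tags = mapping.get(cluster[0]["port"], set())
            if jaccard(tags, rep_tags) >= threshold:
                cluster.append(log)
                break
        else:  # no existing cluster was similar enough
            clusters.append([log])
    return clusters

# Hypothetical mapping and alarm logs.
mapping = {80: {"www", "http"}, 8080: {"http", "www"}, 79: {"finger"}}
logs = [{"id": 1, "port": 80}, {"id": 2, "port": 79}, {"id": 3, "port": 8080}]
clusters = cluster_alarm_logs(logs, mapping)
```

Here the logs for ports 80 and 8080 fall into one cluster because their tag sets coincide, while the log for port 79 forms its own cluster.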

In some implementations, the target data includes network connection information of a first communication device in a normal state. The processing unit is further configured to: when detecting that a second communication device has established a connection with the first port of the first communication device, search the mapping relationship based on the port number of the first port to obtain the feature tag set of the port number; if a feature of the second communication device hits the feature tag set of the port number, determine that the connection is a normal connection; and if the feature of the second communication device does not hit the feature tag set of the port number, determine that the connection is an abnormal connection.
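
The lookup-and-hit check can be sketched as follows. The example assumes that the feature of the second communication device is its IP address and that each tag is the IP address of a peer observed in the normal state; treating a port that is absent from the mapping as abnormal is a further assumption of this sketch.

```python
def classify_connection(mapping, port, peer_feature):
    """Return 'normal' if the peer's feature hits the port's tag set, else 'abnormal'."""
    tags = mapping.get(port, set())
    return "normal" if peer_feature in tags else "abnormal"

# Hypothetical mapping built from connections observed in the normal state.
mapping = {22: {"10.0.0.9"}, 80: {"10.0.0.5", "10.0.0.7"}}
result_known = classify_connection(mapping, 80, "10.0.0.5")
result_unknown = classify_connection(mapping, 22, "203.0.113.4")
```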

In some implementations, the target data includes a training sample set, each sample in the training sample set includes a source address, a destination address, a source port number, a destination port number, a protocol type, and a type label, and the apparatus further includes:

a searching unit, configured to search the mapping relationship based on the source port number to obtain the feature tag set of the source port number, and search the mapping relationship based on the destination port number to obtain the feature tag set of the destination port number;

the processing unit is further configured to concatenate or fuse the feature tag set of the source address, the feature tag set of the destination address, the feature tag set of the source port number, the feature tag set of the destination port number, and the feature tag set of the protocol type to obtain the feature tag set of each sample in the training sample set, and to perform model training based on the feature tag set and the type label of each sample in the training sample set to obtain a classification model.
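
The concatenation step for one sample can be sketched as below. For brevity the sketch covers only the two port numbers and the protocol type and omits the address tag sets and the model training; prefixing each tag with its field name, the sample layout, and the mapping contents are assumptions made for illustration.

```python
def sample_features(sample, port_mapping):
    """Concatenate per-field tag sets, prefixing each tag with its field name."""
    fields = {
        "src_port": port_mapping.get(sample["src_port"], set()),
        "dst_port": port_mapping.get(sample["dst_port"], set()),
        "protocol": {sample["protocol"]},
    }
    return sorted(f"{name}={tag}" for name, tags in fields.items() for tag in tags)

# Hypothetical mapping and training sample.
port_mapping = {80: {"www", "http"}, 51000: {"ephemeral"}}
sample = {"src_port": 51000, "dst_port": 80, "protocol": "tcp", "label": "benign"}
features = sample_features(sample, port_mapping)
```

The field prefix keeps a tag on the source port distinct from the same tag on the destination port; the resulting list, together with the sample's type label, could then be passed to any classifier.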

According to a third aspect, a communication device is provided. The communication device includes a processor coupled to a memory. The memory stores at least one computer program instruction, and the at least one computer program instruction is loaded and executed by the processor, so that the communication device implements the method according to the first aspect or any optional implementation of the first aspect.

According to a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, and when the instruction is run on a computer, the computer is caused to perform the method according to the first aspect or any optional implementation of the first aspect.

According to a fifth aspect, a computer program product is provided. The computer program product includes one or more computer program instructions, and when the computer program instructions are loaded and run by a computer, the computer is caused to perform the method according to the first aspect or any optional implementation of the first aspect.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flowchart of a method for obtaining a mapping relationship according to an embodiment of this application;

FIG. 2 is a schematic structural diagram of an apparatus for obtaining a mapping relationship according to an embodiment of this application;

FIG. 3 is a schematic structural diagram of a communication device according to an embodiment of this application.

DETAILED DESCRIPTION

To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.

Some terms and concepts involved in the embodiments of this application are explained below.

(1) Port number

A port number identifies the port of an application or service in a computer network. A port number is usually a 16-bit integer, so its range is usually 0 to 65535.

(2) Background field

A background field is a field, other than the port number, that is associated with the port number. The content of a background field depends on the port number. For example, a background field includes an Internet Protocol (IP) address that interacts with a port, where the IP address needs to use the port to establish a network connection. As another example, a background field includes the identifier of an application that uses the port or the identifier of a service that uses the port. Background fields are used to extract the feature tags of a port number. For example, when the dynamic path is used, the interaction relationship (co-occurrence relationship) between a background field and the port number can be used to extract the feature tags of the port number. A background field can come from static data or from dynamic data. For example, the background fields in static data include fields related to the purpose of the port number, such as an application identifier field or a service identifier field. The background fields in dynamic data include the field for an IP address that interacts with the port or the field for an attack type that corresponds to the port.

(3) Feature tag set of a port number

The feature tag set of a port number includes one or more feature tags of the port number. A feature tag identifies a characteristic of the port number. In some implementations, a feature tag identifies the purpose of the port number; for example, the feature tag is the identifier of the application that corresponds to the port number. A feature tag can take the form of a string or of a number. A feature tag can be a static tag or a dynamic tag. In some implementations, the hash degree of a feature tag is higher than a hash degree threshold.

(4) Static data

Static data is also called static knowledge. Static data is knowledge that is unrelated to the actual connection usage of a port. In the embodiments of this application, static data mainly serves as one data source for feature tags. Static data includes port numbers and the attributes of the port numbers. For example, static data includes a port number field and a background field, where the port number field is the key, the background field is the value, and the background field includes the attribute of the port number.

The attribute of a port number can also be called the description information, semantic information, or purpose information of the port number, or the category information of the port. For example, the attribute of a port number includes at least one of the purpose of the port number, the application name bound to the port number, the service name (Service Name) bound to the port number, and the protocol type (Protocol Name) bound to the port number.

In some implementations, the static data is a port registry. A port registry records the attributes of ports and includes description information about port numbers in multiple dimensions. For example, the key of the port registry includes the port number (Port Number), and the value fields of the port registry include the service name, the protocol type (Protocol Name), the description (Description), the assignment status (Assignment Status), and reference documents. Port registries can be further divided into two categories. One category comes from standardization organizations, for example, the IANA port registry. The other category comes from private protocols, for example, a private protocol port registry: services are assigned to port numbers that IANA has not yet assigned, and the private protocol port registry records the correspondence between those port numbers and services.

In some implementations, the static data is a file other than a port registry that can characterize the purpose of a port number, that is, a file that contains attributes of the port number. For example, the static data is port usage intelligence, the usage documentation of an operating system or application, or the vendor documentation of a communication device. The static data can be a formatted file (such as a table) or an unformatted file. This embodiment does not limit the type of static data used to extract the features of a port number.

(5) Static tag

A static tag is a feature tag of a port number that is extracted from static data. For example, the protocols that a port supports according to the IANA port registry, such as Secure Shell (SSH), Hypertext Transfer Protocol (HTTP), and File Transfer Protocol (FTP), or the service description information that corresponds to a port in the IANA port registry (such as File Transfer or Unassigned), can be processed into static tags.

(6) Dynamic data

Dynamic data is data related to the actual connection usage of a port. For example, dynamic data includes data related to the actual deployment scenario. Dynamic data is, for example, log records. For example, dynamic data is the traffic records that a communication device generates while forwarding data flows. As another example, dynamic data is the threat detection records that a security device generates while performing threat detection. In the embodiments of this application, dynamic data mainly serves as another data source for feature tags.

(7) Dynamic tag

A dynamic tag is a tag of a port number that is extracted from dynamic data. For example, the dynamic tags of a port include the types of network attacks that target the port. As another example, the dynamic tags of a port include the set of IP addresses that interacted with the port during a predetermined period within a historical time range. As another example, the dynamic tags of a port include the set of service types that the port provided during a predetermined period within a historical time range. For example, during the development of an application, port A is first used temporarily to provide the service of the application and is bound to the application; later, port A is no longer used for that service, and port B is instead used to provide the service of the application and is bound to the application. In this scenario, after the feature tag set of port A and the feature tag set of port B are extracted from the dynamic data, it can be determined that the two feature tag sets overlap. Based on this, it can be determined that the similarity between port A and port B is higher than the similarity between port A and any port other than port B.

(8) Internet Assigned Numbers Authority (IANA) port registry

IANA is the global authority for assigning Internet addresses. IANA manages the IP addresses, domain names, and many other parameters used on the Internet. The embodiments of this application involve a file provided by IANA that describes the relationship between Internet services and port numbers; this file is also called the IANA port registry.

The IANA port registry can be accessed at the following address.

https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml

The application scenarios of the embodiments of this application are described below by way of example.

The port number is one of the fields most frequently processed by all kinds of communication devices. In machine learning systems that process data-communication data, the port number is an extremely important feature. However, unlike common features with a continuous value range (such as traffic volume or number of connections), the port number inherently has a discrete value range. For example, three sample points A, B, and C have port numbers 79, 80, and 8080 respectively, and are identical in every feature dimension other than the port number. The numerical difference between port number 79 and port number 80 is 1 (80 - 79 = 1), and the numerical difference between port number 80 and port number 8080 is 8000 (8080 - 80 = 8000). Although the numerical difference for <79, 80> (that is, 1) is smaller than that for <80, 8080> (that is, 8000), this does not mean that sample points A and B are more similar than sample points B and C, because the numerical difference between port numbers does not represent how much the port numbers actually differ semantically.

Therefore, the popular feature engineering techniques designed for continuous values (feature engineering being the processing of raw features to transform them into features that better express the essence of the problem) cannot be applied directly to port numbers.

Port numbers are usually encoded by one-hot encoding or by label encoding.

With one-hot encoding, a port number is encoded as a binary vector. If d different port numbers appear across all samples, a d-dimensional feature vector is used to represent a port number, and the value in each dimension of the feature vector is either 0 or 1. The d port numbers are sorted according to a predetermined rule (for example, alphabetical order or numerical order), and each port number is assigned a dimension index in turn, with indices ranging from 1 to d. For the port number of each sample, the value of the dimension that corresponds to the port number is set to 1, and the values of all other dimensions are set to 0. For example, if the three port numbers {79, 80, 8080} appear across all samples, they can be encoded as [1, 0, 0], [0, 1, 0], and [0, 0, 1] respectively.

With label encoding, if d different port numbers appear across all samples, d different codes are used to represent the d port numbers. For example, if the three port numbers {79, 80, 8080} appear across all samples, they can be encoded as 0, 1, and 2 respectively.
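
The two conventional encodings described above can be sketched in Python as follows. The function names are illustrative, numerical order is assumed as the predetermined sorting rule, and dimension indices start at 0 rather than 1, as is usual in code.

```python
def label_encode(ports):
    """Assign each distinct port number an index in ascending numerical order."""
    return {port: index for index, port in enumerate(sorted(set(ports)))}

def one_hot_encode(ports):
    """Encode each distinct port number as a d-dimensional 0/1 vector."""
    labels = label_encode(ports)
    d = len(labels)
    return {
        port: [1 if i == index else 0 for i in range(d)]
        for port, index in labels.items()
    }

ports = [79, 80, 8080]
labels = label_encode(ports)     # {79: 0, 80: 1, 8080: 2}
one_hot = one_hot_encode(ports)  # {79: [1, 0, 0], 80: [0, 1, 0], 8080: [0, 0, 1]}
```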

However, both one-hot encoding and label encoding have the following three defects.

(1) Information loss

Both one-hot encoding and label encoding ignore the semantic correlations that may exist between different ports. For example, after port 79, port 80, and port 8080 are encoded with the methods above, the similarity between the encoding of port 79 and the encoding of port 80 is always the same as the similarity between the encoding of port 80 and the encoding of port 8080.
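
This effect can be checked numerically for the one-hot case. The sketch below assumes cosine similarity as the similarity measure, which is a choice made for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

one_hot = {79: [1, 0, 0], 80: [0, 1, 0], 8080: [0, 0, 1]}
sim_79_80 = cosine(one_hot[79], one_hot[80])
sim_80_8080 = cosine(one_hot[80], one_hot[8080])
```

Both similarities are zero because any two distinct one-hot vectors are orthogonal, so the encoding cannot express that some pairs of ports are semantically closer than others.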

(2) Curse of dimensionality

A typical system has 65,536 possible port numbers. One-hot encoding may therefore produce huge vectors with tens of thousands of dimensions, and label encoding may produce tens of thousands of distinct values, so both encodings sharply increase the computational complexity of the corresponding machine learning system. To mitigate this, usually only high-frequency port numbers are encoded individually, and all low-frequency port numbers are placed in an "Other" category and encoded as a whole. However, this inevitably loses the information of the low-frequency port numbers, and the defect is particularly pronounced when the skewness of the port number distribution is too large or too small.
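
The "Other" workaround can be sketched as follows. The frequency threshold `min_count`, the function name, and the observed port list are hypothetical.

```python
from collections import Counter

def bucket_rare_ports(ports, min_count=2):
    """Keep ports seen at least min_count times; replace the rest with 'other'."""
    counts = Counter(ports)
    return [p if counts[p] >= min_count else "other" for p in ports]

# Hypothetical observed port numbers.
observed = [80, 80, 443, 443, 443, 6667, 31337]
bucketed = bucket_rare_ports(observed)
```

After bucketing, ports 6667 and 31337 are indistinguishable from each other, which is the information loss described above.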

(3) Difficulty in taking both dynamic and static data into account

For example, either only static information is encoded (for example, port 80 is directly encoded as "80"), or only dynamic information is encoded (for example, a port is encoded according to how frequently it appears in the data). Neither approach draws on both static data and dynamic data as information sources, so the encoding of a port may fail to represent the semantics of the port fully and accurately.

In view of this, in some implementations of this application, the port number itself is not encoded directly. Instead, the co-occurrence relationship between the port number and the background fields is used to encode the background fields. For example, the attributes of the port number are obtained from the background fields other than the port number, and the feature tag set of the port number is obtained based on those attributes. The attributes of the port number are data related to the port number, such as the applications, services, and protocols related to the port number and the IP addresses that interact with it, and can also be called the semantic information of the port number.

The feature tag set of a port number represents the attributes of the port number. For example, if two port numbers share an attribute, their feature tags contain an identical part; in other words, the encodings of the two port numbers are partially identical or similar. The correlation (that is, the similarity) between the feature tags of two port numbers can therefore represent how closely the two port numbers are associated. The feature tags of a port number thus retain the semantic information of the port number, which avoids information loss.

In addition, the number of values that an attribute of a port number can take is usually far smaller than the number of values that the port number itself can take, which effectively avoids the curse of dimensionality.

Furthermore, to address the difficulty of taking both dynamic and static data into account, in some implementations of this application, a static tag set of the port number is extracted from static data through a static path, a dynamic tag set of the port number is extracted from dynamic data through a dynamic path, the static tag set and the dynamic tag set are fused, and the fused feature tag set is used as the feature tag set of the port number. Because the fused feature tag set is derived from both dynamic data and static data, both kinds of data are taken into account.

In the embodiments of this application, the feature tags of a port number are obtained through the static path, the dynamic path, or the fusion of the static and dynamic paths. Each of these approaches is described below by way of example.

A. Static path

The static path refers to extracting the static tags of a port number from static data. For the definition of static data, refer to the description above.

In some implementations, the static tags include the application identifier that corresponds to the port number. For example, the set of application identifiers that corresponds to the port number is obtained from static data and is used as the static tag set of the port number, or the set of application identifiers is processed according to a predetermined rule to obtain the static tag set of the port number. For example, when one port number is bound to one application identifier, the static tag set obtained from the static data can include the one application identifier that corresponds to the port number; when one port number is bound to multiple application identifiers, the static tag set obtained from the static data can include the multiple application identifiers that correspond to the port number. An application identifier identifies the application or service bound to the port number (that is, the purpose of the port number). For example, the application identifier is an application name or a service name. The predetermined rule is, for example, to truncate the application identifier and use the truncated application identifier as the static tag. As another example, the predetermined rule is to vectorize the application identifier and use the application identifier in vector form as the static tag. As yet another example, the predetermined rule is to extract a field of a predetermined number of bits from the application identifier as the static tag.

In some implementations, the static tags include the protocol identifier that corresponds to the port number. For example, the set of protocol identifiers that corresponds to the port number is obtained from static data and is used as the static tag set of the port number, or the set of protocol identifiers is processed according to a predetermined rule to obtain the static tag set of the port number. The static tag set of the port number includes one or more protocol identifiers that correspond to the port number. A protocol identifier identifies a protocol bound to the port number, for example, an application layer protocol, a transport layer protocol, or a network layer protocol bound to the port number. For example, the protocol identifier is a protocol name or a protocol number. For example, the protocol identifiers include TCP, User Datagram Protocol (UDP), Stream Control Transmission Protocol (SCTP), or SSH.

Taking a port registry as an example of static data, in some implementations, the port number is used as an index to look up the background field that corresponds to the port number in the port registry, where the background field includes at least one of the Service Name field and the Description field. One or more application identifiers are extracted from the Service Name field that corresponds to the port number, and one or more protocol identifiers are extracted from the Description field, to obtain the static tag set of the port number.

When a static tag set is extracted, the two kinds of static data, namely the IANA port registry and the private protocol port registry, can be used separately or in combination. In some implementations that combine the two, the content of the background field that corresponds to the port number is determined from the IANA port registry to obtain static tag A, the content of the background field that corresponds to the port number is determined from the private protocol port registry to obtain static tag B, and static tag A and static tag B are fused to obtain the static tag set of the port number. Static tag A and static tag B are fused, for example, by taking their union or by computing their weighted sum according to predetermined weights.

For example, for port 80, the port number (80) is used as an index to look up the Service Name that corresponds to port 80 in the IANA port registry, which yields www and http; the static tags of port 80 are therefore determined to include {www, http}. As another example, for port 750, the Service Name entries that correspond to port 750 in the IANA port registry include Rfile, loadav, and kerberos-iv; the static tags of port 750 are therefore determined to include {Rfile, loadav, kerberos-iv}. As another example, for port 79, the port number (79) is used as an index to look up the Service Name that corresponds to port 79 in the IANA port registry, which yields finger; the port number (79) is also used as an index to look up the application name that corresponds to port 79 in the private protocol document, which yields private_79; the union of finger and private_79 is taken, and the static tags of port 79 are thereby determined to include {finger, private_79}.
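
The three lookups above can be reproduced with the minimal sketch below. The two dictionaries are hypothetical in-memory excerpts that contain only the values quoted in the examples; a real system would parse the registry files. Union is used as the fusion.

```python
# Hypothetical in-memory excerpts of the two static data sources.
IANA_SERVICE_NAMES = {
    79: ["finger"],
    80: ["www", "http"],
    750: ["Rfile", "loadav", "kerberos-iv"],
}
PRIVATE_APP_NAMES = {79: ["private_79"]}

def static_tags(port):
    """Union of the tags found for the port in both static data sources."""
    return set(IANA_SERVICE_NAMES.get(port, [])) | set(PRIVATE_APP_NAMES.get(port, []))

# Mapping relationship: key = port number, value = static tag set.
mapping = {port: static_tags(port) for port in (79, 80, 750)}
```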

Because the application identifier or protocol identifier of a port number obtained through the static path characterizes the semantics of the port (the purpose of the port), using the application identifier or protocol identifier as the static tag set of the port number prevents the static tag set from losing the semantics of the port, and the similarity between the static tag sets of different port numbers characterizes the semantic correlation of those port numbers. For example, suppose port A and port B are bound to the same application, while port A and port C are bound to different applications. After the static tag sets of port A, port B and port C are determined based on the applications bound to the ports, the static tag sets of port A and port B contain the same application identifier (in other words, the two sets overlap), which characterizes the correlation between port A and port B. Similarly, the static tag sets of port A and port C contain no common application identifier (in other words, the two sets do not overlap), which characterizes the lack of correlation between port A and port C. When the similarity between port A and port B is determined from their static tag sets, and the similarity between port A and port C is determined from their static tag sets, the similarity between port A and port B is greater than the similarity between port A and port C.
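The comparison above does not fix a particular similarity measure. As one possible choice, and purely as an assumption for illustration, the Jaccard index of two tag sets can be used; the application identifiers `app_x`, `app_y` and `app_z` below are hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical static tag sets: ports A and B share app_x, port C shares nothing.
tags_a = {"app_x"}
tags_b = {"app_x", "app_y"}
tags_c = {"app_z"}

sim_ab = jaccard(tags_a, tags_b)  # 0.5, the sets overlap
sim_ac = jaccard(tags_a, tags_c)  # 0.0, the sets do not overlap
print(sim_ab > sim_ac)            # True
```

Any set-overlap measure with the same ordering property would serve equally well in this sketch.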

It can be seen that extracting the static tag set of a port number through the static path on the one hand solves the problem of information loss and, on the other hand, avoids dimensionality explosion, because the number of possible application identifier values and the number of possible protocol identifier values are both far smaller than the number of possible port number values.

In some embodiments, the description text of each port number is obtained and vectorized using term frequency-inverse document frequency (TF-IDF), generating a TF-IDF feature vector for each description text; this feature vector serves as the static tag of the port number.
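A minimal sketch of the TF-IDF vectorization is given below, using only the Python standard library. It assumes whitespace tokenization and the common variant tf = count / document length, idf = ln(N / df); the embodiment does not prescribe a specific TF-IDF formula. The description texts are hypothetical and not taken from any real registry.

```python
import math
from collections import Counter


def tfidf_vectors(docs: dict[int, str]) -> tuple[list[str], dict[int, list[float]]]:
    """Return (vocabulary, {port: TF-IDF vector}) for whitespace-tokenized texts."""
    tokens = {port: text.lower().split() for port, text in docs.items()}
    vocab = sorted({t for toks in tokens.values() for t in toks})
    n_docs = len(docs)
    # Document frequency: number of descriptions in which each term occurs.
    df = Counter(t for toks in tokens.values() for t in set(toks))
    vectors = {}
    for port, toks in tokens.items():
        counts = Counter(toks)
        vectors[port] = [
            (counts[t] / len(toks)) * math.log(n_docs / df[t]) if toks else 0.0
            for t in vocab
        ]
    return vocab, vectors


# Hypothetical description texts keyed by port number.
descriptions = {
    21: "file transfer control",
    80: "world wide web http",
    443: "http protocol over tls",
}
vocab, vectors = tfidf_vectors(descriptions)
```

In this sketch the term "http" receives a lower weight than a term that occurs in only one description, because it appears in two of the three texts.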

In some embodiments, the dispersion degree of a background field is additionally considered when extracting static tags, which improves the discriminative power of the static tags and helps to distinguish different port number uses through different static tags. For example, for each background field in the port registry that relates to the use of the port number, it is determined whether the dispersion degree of that background field is higher than a dispersion threshold; if the background field relates to the use of the port number and its dispersion degree is higher than the dispersion threshold, the static tag of the port number is extracted from that background field. The dispersion degree characterizes the number of distinct values that occur in the background field. If the dispersion degree of a background field in the port registry is higher than the dispersion threshold, that is, the field is rich or diverse and carries a large amount of information, the field helps to distinguish port numbers with different uses, and the static tag can be determined based on its content. Conversely, the dispersion degree of a background field may be too low. For example, the Transport Protocol field in the port registry usually takes only two values, UDP and TCP; the number of distinct values of the Transport Protocol field (2) is far smaller than the total number of port numbers, so its dispersion degree is too low and there is no need to use the value of the Transport Protocol field as a static tag. As another example, the Service Name field in the port registry is itself rich in values and relates to the use of the port number, so the static tag of the port number can be determined based on the content of the Service Name field.
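The field-selection rule above can be sketched as follows, taking the dispersion degree to be the count of distinct values in a field. The function names, the threshold value and the four registry rows are assumptions made only for this illustration.

```python
def dispersion(rows: list[dict[str, str]], field: str) -> int:
    """Number of distinct non-empty values the field takes across all rows."""
    return len({row[field] for row in rows if row.get(field)})


def fields_for_tagging(rows, candidate_fields, threshold):
    """Keep the candidate fields whose dispersion exceeds the threshold."""
    return [f for f in candidate_fields if dispersion(rows, f) > threshold]


# Hypothetical registry excerpt with two use-related candidate fields.
rows = [
    {"Service Name": "ftp", "Transport Protocol": "tcp"},
    {"Service Name": "finger", "Transport Protocol": "tcp"},
    {"Service Name": "http", "Transport Protocol": "tcp"},
    {"Service Name": "domain", "Transport Protocol": "udp"},
]
selected = fields_for_tagging(
    rows, ["Service Name", "Transport Protocol"], threshold=2
)
print(selected)  # ['Service Name']
```

Whether a field relates to the use of the port number is decided separately; this sketch only covers the dispersion check.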

In some embodiments, after the static tag set of each port number is extracted through the static path, a group of mapping relationships is generated based on the port numbers and the static tag sets. Each mapping relationship in the group is, for example, a key-value pair in which the key is the port number and the value is the static tag set corresponding to that port number. By way of example, for the four ports 79, 80, 8080 and 52297, the static tags obtained through the static path and the resulting mapping relationships are shown in Table 1 below.

Table 1

The static data shown in the table above comes from at least one of IANA and the private protocol. For ease of illustration, application names that the private protocol provides for a port number are given the prefix "private". For example, port number 52297 corresponds to two applications in the private protocol: the name of one is written private_52297a and the name of the other is written private_52297b.

Take as an example the case in which the application name corresponding to a port number is used as the static tag of the port and the IANA port registry is used as the static data. In the IANA port registry, the application corresponding to port 88 is kerberos; the application corresponding to port 749 is kerberos; the applications corresponding to port 750 are kerberos, rfile and loadav; the applications corresponding to port 2007 are raid and dectalk; the applications corresponding to port 2015 are cypress and raid; and the applications corresponding to port 2017 are bootclient and cypress. After the application names corresponding to each port are extracted from the IANA port registry, the resulting static tags of each port are as shown in Table 2 below.

Table 2

Once the mapping relationship between port numbers and static tag sets has been generated, a subsequent inference task that relies on the static tag set of a port number only needs to perform one hash lookup with the port number as the key to obtain the corresponding static tag set. The computational overhead is small: extracting the static tag sets only involves scanning the background fields, with complexity O(N), and the inference stage only involves one simple hash lookup, with complexity O(1). The approach is therefore largely independent of the specific application scenario and technical field, and can be widely deployed in communication products that use machine learning algorithms.
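The construction of the key-value mapping and the subsequent lookup can be sketched with a Python dictionary, which is hash-based. The rows below restate the port and application pairs listed above for the IANA port registry; the names `REGISTRY_ROWS`, `build_mapping` and `PORT_TO_TAGS` are hypothetical.

```python
# (port number, application name) pairs as listed in the text.
REGISTRY_ROWS = [
    (88, "kerberos"), (749, "kerberos"),
    (750, "kerberos"), (750, "rfile"), (750, "loadav"),
    (2007, "raid"), (2007, "dectalk"),
    (2015, "cypress"), (2015, "raid"),
    (2017, "bootclient"), (2017, "cypress"),
]


def build_mapping(rows) -> dict[int, set[str]]:
    """Single O(N) scan: key = port number, value = static tag set."""
    mapping: dict[int, set[str]] = {}
    for port, name in rows:
        mapping.setdefault(port, set()).add(name)
    return mapping


PORT_TO_TAGS = build_mapping(REGISTRY_ROWS)

# Inference-time access is one average-case O(1) dictionary lookup.
tags_750 = PORT_TO_TAGS.get(750, set())
print(sorted(tags_750))  # ['kerberos', 'loadav', 'rfile']
```

Ports 2007 and 2015 share the tag raid in this mapping, so their tag sets overlap, whereas ports 88 and 2007 share no tag.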

In some embodiments, considering that the port registry may contain redundant content, redundancy is first removed from the port registry, and the static tags of the ports are then extracted from the de-duplicated port registry. Removing redundant content reduces the data volume of the port registry, which in turn reduces the amount of data that must be processed when extracting feature tags and improves the efficiency of feature tag extraction.

Regarding how redundancy is removed, in some embodiments each row of the port registry is traversed and matched against predetermined rules; if a row hits a predetermined rule, that row is deleted. By way of example, semantic analysis is performed on the content of each background field in the port registry, and if the content of a background field is determined to carry the meaning of duplication, the row containing that background field is deleted. As another example, natural language processing or regular expressions are used to detect whether the content of each background field in the port registry contains a keyword that indicates duplication, such as "duplicate"; if such a keyword is detected in the content of a background field, the row containing that background field is deleted.

As an example, each row of the IANA port registry is traversed and the notes field of each row (field name Assignment Notes), that is, the content of the last column, is extracted. If the notes field of any row in the IANA port registry is found to contain the word "duplicate", that row is deleted from the IANA port registry. For instance, in the IANA port registry there are two rows for port 80 whose last column (Assignment Notes) reads: This is a duplicate of the "http" service and should not be used for discovery purposes. When a row for port 80 is detected to contain the word "duplicate", that row is deleted.
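The keyword-based rule can be sketched with a regular expression. The row layout (a dictionary per row with an "Assignment Notes" key) and the sample rows are assumptions for illustration only and are not a faithful copy of the registry.

```python
import re

# Matches the whole word "duplicate", case-insensitively.
DUPLICATE_RE = re.compile(r"\bduplicate\b", re.IGNORECASE)


def drop_duplicate_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Delete every row whose Assignment Notes field mentions 'duplicate'."""
    return [
        row for row in rows
        if not DUPLICATE_RE.search(row.get("Assignment Notes", ""))
    ]


NOTE = ('This is a duplicate of the "http" service and should not be '
        "used for discovery purposes")

# Hypothetical rows for port 80.
rows = [
    {"Port Number": "80", "Service Name": "http", "Assignment Notes": ""},
    {"Port Number": "80", "Service Name": "www", "Assignment Notes": NOTE},
    {"Port Number": "80", "Service Name": "www-http", "Assignment Notes": NOTE},
]
cleaned = drop_duplicate_rows(rows)
print([row["Service Name"] for row in cleaned])  # ['http']
```

A semantic-analysis rule, as also mentioned above, would replace the regular expression with a classifier; that variant is not sketched here.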

In some embodiments, considering that a term with a single meaning may be expressed in several ways in the port registry, if terms that have the same semantics but inconsistent wording are identified in multiple fields of the port registry, the wording in those fields is made consistent. For example, one target expression is selected from the several expressions, only the target expression is retained, and every other expression is replaced with the target expression. For instance, when both the full English name and the English abbreviation of the same term are found in the port registry, one of the two forms can be retained, such as keeping only the full English name and replacing the abbreviation with it, so that the term is expressed consistently.

For example, Table 3 below shows part of the content of the IANA port registry, in which the term FTP is expressed in two ways: the full English name File Transfer Protocol and the abbreviation FTP. When the IANA port registry is preprocessed, every occurrence of the abbreviation FTP in the table can be replaced with the full name File Transfer Protocol, so that the term is expressed consistently throughout the IANA port registry and content redundancy is reduced.

Table 3
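The replacement of an abbreviation with its full name can be sketched as a whole-word substitution. The synonym table `CANONICAL` is an assumption; how equivalent expressions are identified in the first place is outside the scope of this sketch.

```python
import re

# Assumed synonym table: each variant maps to the single expression to keep.
CANONICAL = {"FTP": "File Transfer Protocol"}


def normalize(text: str) -> str:
    """Replace each whole-word variant with its target expression."""
    for variant, target in CANONICAL.items():
        text = re.sub(rf"\b{re.escape(variant)}\b", target, text)
    return text


print(normalize("FTP over TLS"))  # File Transfer Protocol over TLS
print(normalize("TFTP service"))  # TFTP service (no whole-word match)
```

The match is case-sensitive and limited to whole words, so a lowercase service name such as ftp and longer tokens such as TFTP are left untouched in this sketch.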

Of course, extracting static tags from the port registry is only an example. In other embodiments, static tags are extracted from files other than the port registry that can characterize the use of a port number.

When the application identifier corresponding to a port number is used as the static tag of the port number, the number of applications associated with one port number is usually small. In most cases a port number corresponds to only one application; in a minority of cases a port number corresponds to several applications, but even then the number of applications is generally in the single digits. Therefore, when the application identifiers corresponding to a port number are used as its static tag set, the size of the static tag set, or equivalently the dimension of the feature vector, does not exceed 10. This is far smaller than the size in the tens of thousands that the representation of a port number reaches under one-hot encoding or label encoding, and the curse of dimensionality is thereby avoided. In addition, because the static tags of a port number can be obtained from the port registry alone, the computational overhead is relatively low and the port registry itself does not need to be updated, which avoids the overhead that additional updates to the port registry would incur.

In some embodiments, considering that different background fields in static data such as the port registry may be correlated with port numbers to different degrees, it is first determined, from the perspective of correlation, whether a background field in the port registry is suitable for extracting static tags of port numbers. If the similarity between a background field in the port registry and the port number is higher than a similarity threshold, the static tag of the port number is extracted based on that background field. If the similarity between the background field and the port number is lower than the similarity threshold, no static tag is extracted based on that background field.

In some embodiments, whenever an update to the port registry is detected, the feature tag set of the port number is regenerated based on the updated port registry, thereby updating the feature tag set of the port number.

B. Dynamic Path

The dynamic path refers to encoding port numbers based on dynamic data to obtain dynamic tags. For example, in the security field, an attacker carrying out a particular type of network attack usually interacts repeatedly with a few fixed ports. Refer to Table 4 below: each row of Table 4 represents an attack type, each column represents a port, and each element represents the number of times the corresponding port appears in the records of the corresponding attack type. For example, the value for port 53 and attack type 6 is 40, indicating that port 53 appears 40 times in the records of attack type 6. As can be seen from Table 4, port number 21 is related only to attack type 4, port number 53 appears in attack type 5 and attack type 9, and port number 80 appears in every attack type from attack type 5 to attack type 9.

Table 4

In view of this relationship between ports and network attacks, in some embodiments the types of network attack that target a port are determined from the threat detection records of a firewall, and the dynamic tag of the port is obtained based on the attack types corresponding to the port number. For example, the attack types corresponding to the port number are encoded to obtain a feature vector of the port number. Each dimension of the feature vector corresponds to one attack type, and the value of each dimension indicates whether the corresponding attack type has attacked the port.

For example, if the threat detection records contain a record of the i-th type of network attack having attacked a port, the value of the i-th dimension of the feature vector of that port is 1; if the threat detection records contain no such record, the value of the i-th dimension of the feature vector of that port is 0.

As another example, the encoding is based on both the attack types corresponding to the port number and the number of times the port has been attacked by each type, so that the value of each dimension of the resulting feature vector characterizes how frequently the port has been attacked by the corresponding type of network attack. The more frequently a port is attacked by a given type of attack, the higher the value of the corresponding dimension in the feature vector of its port number. For instance, if the threat detection records show that the i-th type of network attack has attacked a port and has done so k times in total, the value of the i-th dimension of the feature vector of that port is k.
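The binary encoding and the count encoding described above can be sketched together. The attack type names, the record format (one tuple of targeted port and attack type per detection) and the sample records are hypothetical.

```python
# Hypothetical attack types; list order fixes the dimension order.
ATTACK_TYPES = ["scan", "brute_force", "dos"]

# Hypothetical threat-detection records: (targeted port, attack type).
RECORDS = [
    (22, "brute_force"), (22, "brute_force"), (22, "scan"),
    (80, "dos"),
]


def encode(records, port: int, binary: bool = True) -> list[int]:
    """Encode a port as a vector with one dimension per attack type.

    binary=True  -> 1 if the i-th attack type ever hit the port, else 0.
    binary=False -> number of times k the i-th attack type hit the port.
    """
    counts = [0] * len(ATTACK_TYPES)
    for rec_port, attack in records:
        if rec_port == port and attack in ATTACK_TYPES:
            counts[ATTACK_TYPES.index(attack)] += 1
    return [1 if c else 0 for c in counts] if binary else counts


print(encode(RECORDS, 22))                # [1, 1, 0]
print(encode(RECORDS, 22, binary=False))  # [1, 2, 0]
```

A port with no records, such as port 443 in this data, is encoded as the all-zero vector under both variants.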

In other embodiments, the frequency with which a port appears in an attack type is also taken into account when encoding the port number. For example, ports that appear with high frequency and ports that appear with low frequency in a particular attack type are assigned different weights, and the feature vector of the port is determined by combining the weight corresponding to the frequency with the frequency of the port in the attack type.

The encoding based on network attacks may be, for example, one-hot encoding, TF-IDF encoding or embedding vector encoding; this embodiment does not limit which encoding is used.

By way of example, port number 21 can be encoded as a 7-dimensional feature vector in which each dimension corresponds to one attack type; the seven dimensions correspond, in order, to attack type 4, attack type 5, attack type 6, and so on up to attack type 10. With this encoding, the feature vector of port number 21 is 1000000, the feature vector of port number 53 is 0100001, and the feature vector of port number 80 is 1111111.

Because the feature vector of port number 21 and the feature vector of port number 53 have no dimension in which both are set, the similarity between the two feature vectors is small, which indicates that the similarity between port number 21 and port number 53 is small. Because the feature vector of port number 53 and the feature vector of port number 80 coincide in the second and seventh dimensions, the similarity between the two feature vectors is larger, which indicates that the similarity between port number 53 and port number 80 is larger. Encoding port numbers as feature vectors based on attack types allows the feature vectors to preserve the association between port numbers in terms of attack types, and ties each feature vector to the attacks that the corresponding port actually suffers in practice.
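The comparison of the three feature vectors can be checked numerically. The text does not name a similarity measure; cosine similarity is used here as an assumption, and the vectors are those given in the example for port numbers 21, 53 and 80.

```python
import math

# Feature vectors from the example: dimensions = attack types 4 to 10.
VECTORS = {
    21: [1, 0, 0, 0, 0, 0, 0],
    53: [0, 1, 0, 0, 0, 0, 1],
    80: [1, 1, 1, 1, 1, 1, 1],
}


def cosine(u: list[int], v: list[int]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


sim_21_53 = cosine(VECTORS[21], VECTORS[53])  # 0.0, no shared dimension
sim_53_80 = cosine(VECTORS[53], VECTORS[80])  # 2 / sqrt(14), about 0.53
print(sim_53_80 > sim_21_53)                  # True
```

The ordering matches the qualitative statement above: ports 53 and 80 share two attack types, whereas ports 21 and 53 share none.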

The use of a binary feature vector for the encoded port number is merely an example. In other embodiments, the elements of the feature vector obtained by encoding the port number may take values other than 0 or 1.

In some embodiments, it is taken into account that, while a device runs an application, the destination ports accessed by the application usually follow a pattern: the same application often accesses a fixed handful of ports on the server. For instance, the destination ports on a server accessed by the NetBIOS application usually include the three fixed ports 137, 139 and 138. It follows that if multiple ports interact with the same object (such as an application or a remote device), those ports may be associated with one another.

On this basis, the identifier of an object that interacts with a port of a first communication device can be obtained from the operation log records of the first communication device, and the dynamic tag of the port number can be obtained based on that identifier. The object that interacts with the port of the first communication device is, for example, a second communication device or an application on the second communication device. Accordingly, the dynamic tag of the port number is, for example, the IP address of the second communication device, some of the fields of the IP address of the second communication device, the media access control (MAC) address of the second communication device, the identifier of the region in which the second communication device is located, or the identifier of an application on the second communication device.

For example, the dynamic tag of a source port number is determined based on the destination IP addresses that interact with the source port number. As another example, the dynamic tag of a destination port number is determined based on the source IP addresses that interact with the destination port number.

As an example, a log fragment recorded in a communication device includes the content shown in Table 5 below. The source port column of Table 5 records the port numbers of the ports of the communication device, and the destination address column records the destination address accessed through the corresponding port.

Table 5 Log fragment

With reference to Table 5 above, the IP addresses that have interacted with a port can be selected as the background field, and the dynamic tag of the port number can be obtained from those IP addresses, so that each port is given its own dynamic tag. By way of example, the dynamic tag of port number 56551 is {172.31.65.123}; the dynamic tag of port number 443 is {172.31.65.124, 172.31.65.123}; the dynamic tag of port number 80 is {172.31.65.126}; the dynamic tag of port number 62746 is {172.31.65.124}; and the dynamic tag of port number 60628 is {172.31.65.124}.
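Grouping log records by port to obtain the dynamic tags can be sketched as follows. The (source port, destination address) pairs in `LOG` are assumed rows chosen to be consistent with the dynamic tags listed above; they are not a reproduction of the actual log fragment.

```python
from collections import defaultdict

# Assumed (source port, destination address) pairs, consistent with the
# dynamic tags stated in the text.
LOG = [
    (56551, "172.31.65.123"),
    (443, "172.31.65.124"),
    (443, "172.31.65.123"),
    (80, "172.31.65.126"),
    (62746, "172.31.65.124"),
    (60628, "172.31.65.124"),
]


def dynamic_tags(log) -> dict[int, set[str]]:
    """Map each port number to the set of IP addresses it interacted with."""
    tags: defaultdict[int, set[str]] = defaultdict(set)
    for port, address in log:
        tags[port].add(address)
    return dict(tags)


TAGS = dynamic_tags(LOG)
print(sorted(TAGS[443]))           # ['172.31.65.123', '172.31.65.124']
print(TAGS[62746] == TAGS[60628])  # True
```

The result is again a key-value mapping, so a later lookup of the dynamic tag set for a port number is a single dictionary access.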

Obtaining the dynamic tag of a port number from the IP addresses that interact with it has several benefits. On the one hand, the dynamic tag can represent the semantics of the port number, and the similarity between the dynamic tags of different port numbers can represent the correlation between their semantics, so information loss is avoided. For example, in the log fragment shown in Table 5 above, port numbers 443, 62746 and 60628 all interact with the same IP address 172.31.65.124, so it can be determined that these three port numbers are associated: their dynamic tags overlap, and each of them includes the IP address 172.31.65.124. The IP address that interacts with port number 62746 is the same as the IP address that interacts with port number 60628, so the dynamic tags of the two port numbers overlap completely, which reflects a strong association between port number 62746 and port number 60628. The IP addresses that interact with port number 443 differ from the IP address that interacts with port number 80, so the dynamic tags of port number 443 and port number 80 differ, which reflects a weak association between port number 443 and port number 80. It can be seen that encoding port numbers based on the IP addresses that interact with them reflects the similarity or correlation between port numbers.

On the other hand, the number of IP addresses that interact with one port number is usually limited. Even if many IP addresses interact with a port number, the dimension of the dynamic tag set can be reduced by retaining only those IP addresses whose frequency of occurrence is higher than a predetermined frequency. The size of the dynamic tag set therefore does not reach the tens of thousands seen with one-hot encoding or label encoding, and the curse of dimensionality is avoided.
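The frequency-based pruning can be sketched with a counter. The function name, the threshold and the addresses are hypothetical, and frequency is taken here to mean the raw number of occurrences in the log.

```python
from collections import Counter


def frequent_addresses(addresses: list[str], min_count: int) -> set[str]:
    """Keep only addresses that occur more than min_count times."""
    counts = Counter(addresses)
    return {ip for ip, n in counts.items() if n > min_count}


# Hypothetical addresses seen for one port: 5, 1 and 3 occurrences.
seen = ["10.0.0.1"] * 5 + ["10.0.0.2"] + ["10.0.0.3"] * 3
kept = frequent_addresses(seen, min_count=2)
print(sorted(kept))  # ['10.0.0.1', '10.0.0.3']
```

A relative frequency (occurrences divided by the total number of records for the port) could be used instead of a raw count without changing the structure of the sketch.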

Furthermore, the dynamic tag of a port number can be updated from dynamically updated service data. This keeps the dynamic tag of the port number current, allows it to change as the service data is updated, and makes the feature tag set of the port number match the real-time nature and the locality of the service data.

The dynamic tag set of a port number can be updated periodically or on a rolling basis. For example, at every predetermined time period, the log records generated by the communication device in the current period are obtained, the current dynamic tag set of the port number is derived from those log records, and the historical dynamic tag set of the port number is updated based on the current dynamic tag set. As another example, different service providers may use the same port differently: a game service provider may initially rent port A of the communication device to provide a game service, and a video service provider may later rent port A to provide a video service, so that port A serves different purposes at different times. Accordingly, when it is determined that the service provider corresponding to port A has changed from a first service provider to a second service provider, the log records generated while the second service provider uses port A are obtained, the current dynamic tag set of the port number is extracted from those log records, and the historical dynamic tag set of the port number is updated based on the current dynamic tag set.
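How the historical dynamic tag set is combined with the current one is not specified above. The sketch below assumes one plausible policy, stated here only for illustration: a periodic refresh merges the current tags into the history, whereas a change of service provider discards the history and keeps only the current tags. The function name and tag values are hypothetical.

```python
def update_tags(history: set[str], current: set[str],
                provider_changed: bool) -> set[str]:
    """Assumed update policy for the dynamic tag set of one port number.

    Periodic refresh: merge the current tags into the history.
    Provider change: the port has a new purpose, so keep only current tags.
    """
    if provider_changed:
        return set(current)
    return history | current


history = {"172.31.65.123"}
current = {"172.31.65.124"}
print(sorted(update_tags(history, current, provider_changed=False)))
# ['172.31.65.123', '172.31.65.124']
print(sorted(update_tags(history, current, provider_changed=True)))
# ['172.31.65.124']
```

Other policies, such as expiring tags that have not been observed for several periods, fit the same interface.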

In some embodiments, an IP address that has interacted with the port is used directly as a feature tag of the port. In other embodiments, the IP address is processed according to a predetermined rule, and the resulting field is used as the feature tag. For example, a predetermined number of leading fields is taken from the IP address as the feature tag; in particular, the first field alone may be taken.

Taking only the first field of an IP address that has interacted with the port shortens the feature tag and reduces its data volume, which in turn reduces the storage space the feature tag occupies and the amount of data to be processed when further inference tasks are performed on it. In addition, the first field of an IP address usually identifies the network segment to which the device belongs, so a feature tag built from the first field indicates the network segment in which the remote device interacting with the port is located.

Alternatively, the first two or the first three fields of the IP address may be taken as the feature tag of the port. As a further alternative, the number of fields to take is determined from the value of the first byte of the IP address, and that number of leading fields is taken from the IP address as the feature tag. As yet another alternative, a hash value of the IP address is computed and used as the feature tag of the port.
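The tag-derivation rules above can be illustrated with a short sketch for dotted IPv4 addresses. The function names are hypothetical, and two details are assumptions not given in the source: the rule that maps the first byte to a field count (a classful-style split is used here) and the choice of SHA-256 with truncation as the hash.

```python
import hashlib


def prefix_tag(ip: str, n_fields: int = 1) -> str:
    """Keep the first n_fields dot-separated fields of an IPv4 address."""
    return ".".join(ip.split(".")[:n_fields])


def adaptive_prefix_tag(ip: str) -> str:
    """Pick the field count from the first byte (assumed classful-style rule)."""
    first = int(ip.split(".")[0])
    n_fields = 1 if first < 128 else 2 if first < 192 else 3
    return prefix_tag(ip, n_fields)


def hash_tag(ip: str, length: int = 8) -> str:
    """Use a truncated SHA-256 hex digest of the address as the tag (assumed hash)."""
    return hashlib.sha256(ip.encode("ascii")).hexdigest()[:length]
```

For instance, `prefix_tag("128.8.8.8")` yields `"128"`, so every remote address in the same leading segment collapses to one tag.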

In some embodiments, the dynamic tag of a port number also represents the number of interactions between the port and an object such as a remote device or an application. For example, the dynamic tag is obtained by concatenating the identifier of the object with the number of interactions between the object and the port: the dynamic tag includes a first field containing the identifier of the object and a second field containing the number of interactions. As another example, the dynamic tag is obtained by fusing the identifier of the object with the number of interactions between the object and the port.

For example, the IP addresses that have interacted with the port, together with the number of interactions between each IP address and the port, are determined from dynamic data such as log records, and each IP address is concatenated with its interaction count to obtain a dynamic tag of the port. The dynamic tag thus includes both the IP address and its interaction count. For example, if port A has interacted with IP address 8.8.8.8 four times, the dynamic tag of port A is "8.8.8.8 4".
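A minimal sketch of the concatenation described above, assuming each log record has been reduced to a hypothetical `(port, remote_ip)` pair and that the two fields are joined with a single space as in the "8.8.8.8 4" example:

```python
from collections import Counter
from typing import Hashable, Iterable, List, Tuple


def count_tags(records: Iterable[Tuple[Hashable, str]], port: Hashable) -> List[str]:
    """Build '<ip> <count>' dynamic tags for one port from (port, remote_ip) records."""
    counts = Counter(ip for p, ip in records if p == port)
    # Sort by address so the output order is deterministic.
    return [f"{ip} {n}" for ip, n in sorted(counts.items())]
```

With four records of port A interacting with 8.8.8.8, `count_tags(records, "A")` returns `["8.8.8.8 4"]`.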

Because the dynamic tag retains the number of interactions between the port and objects such as remote IP addresses and remote applications, it provides richer background information and represents the semantics of the port more fully, which further improves the accuracy of inference tasks performed on the dynamic tags.

In some embodiments, after the dynamic tags of a port number are obtained, a set of mapping relationships is maintained based on the port numbers and the dynamic tags. As before, the key of each mapping relationship is a port number and the value is a dynamic tag set. The dynamic tags of a port number are obtained from the fields other than the port number (the "background fields") of the multiple records that contain the port number. For example, the dynamic tags of a port number include all IP addresses that have interacted with the port, or the names of all applications that have involved the port.

C. Fusion of dynamic and static paths

In some embodiments, feature tags are extracted by fusing the dynamic path with the static path. For example, feature tags of the port number are extracted along the dynamic path to obtain a dynamic tag set, feature tags of the port number are extracted along the static path to obtain a static tag set, and the static tag set and the dynamic tag set are fused to obtain the feature tag set of the port number.

The two tag sets can be fused by taking their union, by weighted summation, or by using the set from a single path. In the union approach, the union of the static tag set and the dynamic tag set is taken as the feature tag set of the port number. In the weighted-summation approach, a weight is assigned to each of the dynamic path and the static path, and the static tag set and the dynamic tag set are summed with those weights to obtain the feature tag set. In the single-path approach, either the static tag set or the dynamic tag set is selected as the feature tag set. Which approach is used can be decided according to the actual scenario.
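The three fusion approaches can be sketched as below. The source does not define what a weighted sum of two tag sets produces, so one possible reading is assumed here: each tag receives the sum of the weights of the paths that produced it, giving a weighted tag set. All function and parameter names are hypothetical.

```python
from typing import Dict, Set


def fuse_union(static: Set[str], dynamic: Set[str]) -> Set[str]:
    """Union approach: every tag from either path is kept."""
    return static | dynamic


def fuse_weighted(static: Set[str], dynamic: Set[str],
                  w_static: float = 0.5, w_dynamic: float = 0.5) -> Dict[str, float]:
    """Weighted approach (assumed reading): score each tag by the weights of its paths."""
    scores: Dict[str, float] = {}
    for tag in static:
        scores[tag] = scores.get(tag, 0.0) + w_static
    for tag in dynamic:
        scores[tag] = scores.get(tag, 0.0) + w_dynamic
    return scores


def fuse_single(static: Set[str], dynamic: Set[str], path: str = "static") -> Set[str]:
    """Single-path approach: use the tag set of exactly one path."""
    if path not in ("static", "dynamic"):
        raise ValueError("path must be 'static' or 'dynamic'")
    return set(static) if path == "static" else set(dynamic)
```

Under this reading, a tag found by both paths scores highest, which gives later similarity computations a way to favor tags that the registry and the observed traffic agree on.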

Extracting feature tags by fusing the dynamic path with the static path avoids information loss and the curse of dimensionality. Moreover, because the feature tag set draws on two data sources, dynamic data and static data, the extraction process takes both into account, so that the feature tags represent the semantics of the port more fully.

In any of the dynamic path, the static path, and the fused dynamic and static paths described above, before feature tags are extracted from a background field in the static data and/or the dynamic data, it is optionally determined, in terms of correlation, whether the background field is suitable for extracting feature tags of the port number. If the similarity between the background field and the port number is higher than a similarity threshold, feature tags of the port number are extracted from that background field. If the similarity is lower than the similarity threshold, no feature tags are extracted from that background field. This amounts to checking the correlation between each background field and the port number: only background fields closely related to the port number take part in feature tag extraction. The check both reduces interference from background fields that are only loosely related to the port number and reduces the number of dimensions of the feature tags.

For example, when static tags of port numbers are extracted from a port registry, the similarity between the background field of each dimension of the port registry and the port number is determined. If the similarity between the background field of a dimension and the port number is higher than the similarity threshold, static tags of the port number are extracted from that background field. If the similarity is lower than the similarity threshold, no static tags are extracted from that background field. As a result, fields of the port registry that bear little relation to the semantics of the port number, such as a person's name or the assignment status (Assignment Status), need not serve as a data source for static tags.

As another example, when dynamic tags of port numbers are extracted from traffic logs, the traffic logs usually record the region of the source IP address from which the traffic comes. If the similarity between the region of the source IP address and the destination port number is determined to be higher than the similarity threshold, the region of the source IP address can be used as a feature tag of the destination port number.

As yet another example, when dynamic tags of port numbers are extracted from threat logs, a similarity analysis can be performed on the destination port number and the attack type to obtain the similarity between them. If the similarity between the attack type and the destination port number is higher than the similarity threshold, the attack type is used as a feature tag of the destination port number.

The similarity between a background field and the port number is calculated, for example, with a method for measuring the similarity of two discrete variables, such as covariance, the chi-square test, a correlation coefficient, or the Jaccard coefficient.
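One way to turn such a measure into a field filter is sketched below. The source names the chi-square test among other options; Cramér's V, a normalization of the chi-square statistic to the range 0 to 1, is swapped in here so that the result can be compared directly with a threshold. The function names, the input shape (pairs of port number and field value), and the threshold of 0.5 are all assumptions.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, Iterable, Tuple


def cramers_v(pairs: Iterable[Tuple[Hashable, Hashable]]) -> float:
    """Cramér's V between two categorical variables given as (x, y) pairs."""
    pairs = list(pairs)
    n = len(pairs)
    xs = Counter(x for x, _ in pairs)
    ys = Counter(y for _, y in pairs)
    joint = Counter(pairs)
    k = min(len(xs), len(ys)) - 1
    if n == 0 or k <= 0:
        return 0.0  # association is undefined when a variable is constant
    chi2 = 0.0
    for x, nx in xs.items():
        for y, ny in ys.items():
            expected = nx * ny / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def field_is_relevant(pairs: Iterable[Tuple[Hashable, Hashable]],
                      threshold: float = 0.5) -> bool:
    """Keep a background field only if its association with the port is high enough."""
    return cramers_v(pairs) > threshold
```

A field whose value is fully determined by the port number scores 1, while a field distributed identically across all ports scores 0 and would be excluded from tag extraction.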

Several ways of extracting the feature tags of a port number have been described above. The following describes how the feature tags of a port number are used, with reference to some specific application scenarios.

In some embodiments, the similarity between different port numbers is determined based on the similarity between their feature tag sets. For example, if the similarity between the feature tag sets of two port numbers is higher than a similarity threshold, the two port numbers are determined to be correlated. If the similarity is lower than the similarity threshold, the two port numbers are determined to be uncorrelated.

Take the Jaccard coefficient as the measure of similarity between the feature tag sets of different port numbers. The Jaccard coefficient between the feature tag sets of two port numbers is determined. If it exceeds 50%, the two port numbers are determined to be correlated; if it does not exceed 50%, they are determined to be uncorrelated. The Jaccard coefficient is calculated as follows, where A is the feature tag set of one port number, B is the feature tag set of the other port number, and J is the similarity between the two feature tag sets:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

In other embodiments, the Jaccard coefficient is replaced with another measure that can represent the degree of similarity between sets of discrete variables, such as a correlation coefficient, cosine similarity, or Euclidean distance. This embodiment does not limit the specific algorithm used to determine the similarity between the feature tag sets of different port numbers.

For example, when the similarity between different port numbers is calculated from the feature tags of the ports shown in Table 2: the applications corresponding to port 88 and port 749 are exactly the same, so the similarity between port 88 and port 749 is 1; one application is common to port 88 and port 750, and port 88 and port 750 correspond to a total of four applications, so the similarity between port 88 and port 750 is 1/3; the applications corresponding to port 88 share nothing with those corresponding to port 2007, port 2015, or port 2017, so the similarity between port 88 and each of port 2007, port 2015, and port 2017 is 0.
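The three cases above (similarity 1, 1/3, and 0) can be reproduced with a few lines of set arithmetic. The sketch uses the standard Jaccard definition and the 50% threshold mentioned earlier; returning 0 for two empty sets is a convention assumed here, and the tag values in the usage note are placeholders rather than the contents of Table 2.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard coefficient |A intersect B| / |A union B|; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def ports_correlated(a: Set[str], b: Set[str], threshold: float = 0.5) -> bool:
    """Two ports are treated as correlated when the coefficient exceeds the threshold."""
    return jaccard(a, b) > threshold
```

With placeholder tags, `jaccard({"x", "y"}, {"x", "y"})` is 1.0, `jaccard({"x", "y"}, {"y", "z"})` is 1/3 (one shared tag out of three distinct tags), and `jaccard({"x"}, {"z"})` is 0.0.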

In some embodiments, when the similarity between different ports is determined based on their feature tags, the feature tags of a port may be weighted by the number of interactions between each IP address and the port. The weight represents the importance of a feature tag and is positively correlated with the number of interactions.

For example, port A has interacted with both IP address 8.8.8.8 and IP address 4.4.4.4, and has interacted with 8.8.8.8 more often than with 4.4.4.4. Port B has also interacted with both addresses, but has interacted with 8.8.8.8 less often than with 4.4.4.4. Port C has also interacted with both addresses, and has interacted with 8.8.8.8 more often than with 4.4.4.4. In this example, when the feature tag sets of ports A, B, and C are extracted and the similarity between the ports is determined from those sets, it can be determined that the similarity between port A and port C is greater than the similarity between port A and port B. This is because the feature tag sets retain the number of interactions between each IP address and the port, so the interaction counts can be taken into account when similarities are compared, with greater weight given to IP addresses that have more interactions.
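The effect described above can be illustrated with a count-weighted similarity. The source does not name a specific weighted measure; the weighted Jaccard similarity (sum of element-wise minima over sum of element-wise maxima) is used here as one possible choice, and the interaction counts in the usage note are invented for illustration.

```python
from typing import Dict


def weighted_jaccard(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Weighted Jaccard similarity between two tag-to-count mappings."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    denominator = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return numerator / denominator if denominator else 0.0


# Hypothetical interaction counts consistent with the example in the text.
port_a = {"8.8.8.8": 9, "4.4.4.4": 1}
port_b = {"8.8.8.8": 1, "4.4.4.4": 9}
port_c = {"8.8.8.8": 8, "4.4.4.4": 2}
```

Here `weighted_jaccard(port_a, port_c)` is 9/11 while `weighted_jaccard(port_a, port_b)` is 1/9, whereas an unweighted Jaccard coefficient over the bare address sets would be 1 for both pairs.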

In still other embodiments, the text contained in the background fields of the static data and/or the dynamic data is extracted and segmented into words, the resulting words are sorted in descending order of frequency, and the several most frequent words are selected as the feature tag set of the port number.
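A rough sketch of this frequency-based selection is given below. It assumes English-like text that can be split on non-alphanumeric characters; text in languages without word delimiters would need a dedicated segmenter. The function name and the default of three tags are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List


def top_word_tags(texts: Iterable[str], k: int = 3) -> List[str]:
    """Return the k most frequent words across the background-field texts."""
    words: List[str] = []
    for text in texts:
        # Simplistic tokenizer: lowercase runs of letters and digits.
        words.extend(re.findall(r"[a-z0-9]+", text.lower()))
    return [word for word, _ in Counter(words).most_common(k)]
```

For the texts "HTTP web server", "web proxy http", and "web cache", the two most frequent words are "web" and "http".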

In some embodiments, after the feature tag sets of different port numbers are obtained from multiple log records, the similarity between different log records can be determined based on the similarity between the feature tag sets of the port numbers, and the log records can be clustered based on the similarity between them. In this way, potential associations among the log records are discovered automatically and related log records are grouped together, which facilitates their storage and further processing. Taking alarm logs as the log records, the clustering of alarm logs is illustrated below in Example 1.

Example 1: Clustering

For example, attackers often attack asset 1.3.5.7 from random source ports (such as 52333, 49999, and 56666 in this example), which produces a set of firewall alarm logs as shown in Table 6 below.

Table 6

The alarm log correlation analysis module needs to identify related alarm logs among these isolated alarm logs and group them together automatically, so that operation and maintenance personnel can review and handle them.

For example, a network administrator sets the association rule that two alarm logs are considered associated when more than 50% of their fields (that is, more than two of the four fields) are correlated. Without the embodiments of the present application, only the association between alarm log 1 and alarm log 3 can be found; the other alarm logs are considered unrelated.

With the embodiments of the present application, the mapping relationship between ports and feature tags shown in Table 7 below can be generated along the dynamic path from the alarm logs shown in Table 6 above. For example, port 79 appears in alarm log 1 as the destination port, and the IP address that interacted with port 79 is the source IP address 192.1.1.1, so the feature tag set of port 79 is {192.1.1.1}. Port 80 appears in both alarm log 2 and alarm log 6. In alarm log 2, port 80 is the destination port and the IP address that interacted with it is the source IP address 215.8.8.8; in alarm log 6, port 80 is the destination port and the IP address that interacted with it is the source IP address 191.4.4.4. The feature tag set of port 80 is therefore {215.8.8.8, 191.4.4.4}. The mapping relationship between the remaining port numbers and their feature tags is determined in the same way, as shown in Table 7 below.
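The derivation of the mapping for ports 79 and 80 can be sketched as follows. Only the three alarm logs spelled out in the text above are reproduced; the remaining rows of Table 6 are not included, and the field names `id`, `src_ip`, and `dst_port` are hypothetical.

```python
from typing import Dict, List, Set

# Alarm logs 1, 2 and 6 as described in the text (other rows omitted).
alerts: List[dict] = [
    {"id": 1, "src_ip": "192.1.1.1", "dst_port": 79},
    {"id": 2, "src_ip": "215.8.8.8", "dst_port": 80},
    {"id": 6, "src_ip": "191.4.4.4", "dst_port": 80},
]


def build_port_tags(records: List[dict]) -> Dict[int, Set[str]]:
    """Map each destination port to the set of source IPs that interacted with it."""
    mapping: Dict[int, Set[str]] = {}
    for record in records:
        mapping.setdefault(record["dst_port"], set()).add(record["src_ip"])
    return mapping


port_tags = build_port_tags(alerts)
```

The resulting dictionary has the port number as key and the feature tag set as value, which is the key-value form of the mapping relationship: port 79 maps to {192.1.1.1} and port 80 maps to {215.8.8.8, 191.4.4.4}.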

Table 7

In this example, to fuse the dynamic path with the static path, for each port number that appears in the alarm logs, the feature tag set obtained along the static path and the feature tag set obtained along the dynamic path are merged. The results are shown in Table 8 below.

Table 8

In this embodiment, the feature tag sets of different ports can be compared to determine the similarity between them, and the similarity between the alarm logs corresponding to different ports is then determined based on the similarity between the feature tag sets.

Take the Jaccard coefficient as the measure of similarity between feature tag sets. If the Jaccard coefficient of the feature tag sets of two ports exceeds 50%, the two ports are determined to be correlated. For example, the Jaccard coefficient between the feature tag set of port 80 and the feature tag set of port 8080 is calculated to be 3/4, so port 80 and port 8080 are determined to be correlated. Proceeding in the same way, it can be determined that alarm logs 2, 4, 5, and 6 are all associated with one another, while none of them is associated with alarm log 1 or alarm log 3.
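A simplified sketch of this grouping step is shown below. It links two alarm logs when their destination ports are identical or have a Jaccard coefficient above 50%, and then takes connected components with a small union-find. This is a reduction of the four-field association rule to the port field only, and the tag sets and the assignment of alarm log 4 to port 8080 are placeholders chosen so that ports 80 and 8080 score 3/4; they are not the contents of Tables 6 and 8.

```python
from itertools import combinations
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_alerts(alert_ports: Dict[int, int],
                   port_tags: Dict[int, Set[str]],
                   threshold: float = 0.5) -> List[List[int]]:
    """Group alert ids whose ports are equal or have similar feature tag sets."""
    parent = {alert_id: alert_id for alert_id in alert_ports}

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(alert_ports, 2):
        pa, pb = alert_ports[a], alert_ports[b]
        if pa == pb or jaccard(port_tags[pa], port_tags[pb]) > threshold:
            parent[find(a)] = find(b)

    clusters: Dict[int, List[int]] = {}
    for alert_id in alert_ports:
        clusters.setdefault(find(alert_id), []).append(alert_id)
    return sorted(sorted(group) for group in clusters.values())


# Placeholder data: alert id -> destination port, port -> feature tag set.
example_alert_ports = {1: 79, 2: 80, 4: 8080, 6: 80}
example_port_tags = {
    79: {"t0"},
    80: {"t1", "t2", "t3", "t4"},
    8080: {"t1", "t2", "t3"},
}
```

With this placeholder data, alarm logs 2, 4, and 6 end up in one cluster and alarm log 1 in its own, mirroring the separation described above.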

Determining the similarity of alarm logs from the similarity between the feature tag sets of different ports, and then clustering the alarm logs accordingly, yields results that agree more closely with the manual judgment of security experts than results obtained without the method of this embodiment.

In the method of this embodiment, the feature tag set of a port is determined from the IP addresses that, according to the alarm logs, have interacted with the port, and the distance between two ports can be determined fairly accurately from their feature tag sets. The distance between samples is a fundamental input to clustering. The port-level distance provided by this embodiment therefore gives a basis for clustering alarm logs and reduces information loss; in addition, because the curse of dimensionality is avoided, relatively little memory is required.

Example 2: Abnormal connection detection

In some embodiments, the feature tag sets of multiple ports of a first communication device are obtained along the dynamic path, based on normal historical connection information of the first communication device.

Normal historical connection information is the network connection information contained in logs recorded while the first communication device was in a normal state during a historical period (for example, no network attack or abnormal behavior was detected, the CPU, memory, and other operating indicators were normal, and no security alert was triggered). The normal historical connection information includes a local port number of the first communication device and the identifier of the remote device that established a connection with the port corresponding to that local port number (such as an IP address or part of an IP address). The mapping relationship between ports and feature tag sets can serve as a traffic baseline.

For example, when the first communication device detects that a second communication device is establishing a connection with one of its ports, the first communication device can determine whether a feature of the second communication device hits the feature tag set corresponding to that port. If the feature of the second communication device belongs to the feature tag set corresponding to the port, the second communication device connected normally to that port during the historical period and the connection it is now initiating does not exceed the traffic baseline, so the first communication device can allow the connection. If the feature of the second communication device does not belong to the feature tag set corresponding to the port, the connection it is now initiating deviates from the traffic baseline; the connection between the second communication device and the first communication device is determined to be an abnormal connection, and the first communication device can refuse it.

Using the IP address that has interacted with a port directly as a feature tag of the port is merely an example. In other embodiments, the IP address is processed according to a predetermined rule, and the resulting field is used as the feature tag. For example, a predetermined number of leading fields is taken from the IP address as the feature tag; in particular, the first field alone may be taken. Because the first field of an IP address usually identifies the network segment to which the device belongs, using the first field as the feature tag merges the IP addresses that interact with the same port and belong to the same network segment, yielding the set of network segments corresponding to the port. When abnormal connections are detected based on this set, the set of network segments corresponding to the port can be treated as a set of trusted network segments. For any data flow whose destination port is the port corresponding to the set, the data flow is determined to be trusted traffic if it comes from any network segment in the set, and is determined to be abnormal traffic if it comes from a network segment outside the set.

For example, the normal historical connection information of the first communication device is shown in Table 9 below.

Table 9

Based on this traffic baseline, a mapping relationship between the port numbers of the first communication device and feature tag sets is established along the dynamic path. For example, the first field of each IPv4 address that connected to a port of the first communication device is selected as a feature tag of that port number, which yields the feature tag sets shown in Table 10 below.

Table 10

The first communication device can detect abnormal connections based on the mapping relationship shown in Table 10 above. For example, when the first communication device detects that IP address 128.8.8.8 is connecting to local port 80, it looks up the mapping relationship shown in Table 10 with port 80 as the key and obtains the traffic baseline {128, 255} corresponding to port 80. It then matches IP address 128.8.8.8 against this baseline; because the address does not exceed the baseline (128 ∈ {128, 255}), the connection between 128.8.8.8 and port 80 is not treated as an abnormal connection. As another example, when the first communication device detects that IP address 128.8.8.8 is connecting to local port 79, it looks up the mapping relationship with port 79 as the key and obtains the traffic baseline {8} corresponding to port 79. It then matches IP address 128.8.8.8 against this baseline; because the address deviates from the baseline (128 ∉ {8}), the connection between 128.8.8.8 and local port 79 is determined to be an abnormal connection.
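The two lookups above can be reproduced with a small sketch. The baseline values for ports 80 and 79 are taken from the text; the history records used to rebuild that baseline are invented for illustration (Table 9 is not reproduced here), and treating a port with no baseline entry as abnormal is an assumption. Function names are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def build_baseline(history: Iterable[Tuple[str, int]]) -> Dict[int, Set[str]]:
    """Map each local port to the first fields of the remote IPs seen on it."""
    baseline: Dict[int, Set[str]] = {}
    for remote_ip, local_port in history:
        baseline.setdefault(local_port, set()).add(remote_ip.split(".")[0])
    return baseline


def is_abnormal(remote_ip: str, local_port: int,
                baseline: Dict[int, Set[str]]) -> bool:
    """A connection is abnormal if the remote segment is not in the port's baseline."""
    segment = remote_ip.split(".")[0]
    # Assumption: a port with no baseline entry has no trusted segments.
    return segment not in baseline.get(local_port, set())


# Invented history that yields the baseline described in the text.
example_history = [("128.1.1.1", 80), ("255.2.2.2", 80), ("8.3.3.3", 79)]
example_baseline = build_baseline(example_history)
```

With this baseline, `is_abnormal("128.8.8.8", 80, example_baseline)` is `False` and `is_abnormal("128.8.8.8", 79, example_baseline)` is `True`, matching the two cases in the text.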

Example 3: Classification

In some embodiments, a communication device obtains a training sample set, where each sample in the training sample set includes a port number and a background field. For example, each sample includes fields such as a source address, a destination address, a source port, a destination port, and a protocol type. In addition, each sample in the training sample set is annotated with a type label, which indicates the type to which the sample belongs.

In the feature engineering step, the communication device converts each field contained in a sample into corresponding feature tags. For fields other than port numbers, such as IP addresses, the communication device can use a corresponding feature engineering algorithm to convert the field into feature tags. For the source port and the destination port, the communication device directly uses the mapping relationship generated via the static path to convert the port number into feature tags. The communication device then concatenates or fuses the feature tags of the port-number fields with the feature tags of the non-port-number fields to obtain the feature tags of the sample.
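The feature engineering step above can be sketched as follows. This is an illustrative sketch under stated assumptions: `PORT_TAGS` is a hypothetical static-path mapping, the first-octet rule in `address_tags` merely stands in for whatever IP feature engineering algorithm is actually used, and the tag prefixes are an invented naming convention.

```python
# Hypothetical port -> feature tag mapping produced via the static path.
PORT_TAGS = {443: ["https", "tls", "web"], 53: ["dns"]}


def address_tags(prefix, ip):
    # Assumed stand-in for an IP feature engineering algorithm: keep the first octet.
    return [f"{prefix}_octet1={ip.split('.')[0]}"]


def port_tags(prefix, port):
    # Port numbers are not encoded directly; they are looked up in the mapping.
    return [f"{prefix}:{tag}" for tag in PORT_TAGS.get(port, ["unknown"])]


def sample_to_tags(sample):
    """Concatenate the tags of all fields into one tag list for the sample."""
    return (
        address_tags("src", sample["src_ip"])
        + address_tags("dst", sample["dst_ip"])
        + port_tags("sport", sample["src_port"])
        + port_tags("dport", sample["dst_port"])
        + [f"proto={sample['proto']}"]
    )


sample = {"src_ip": "10.0.0.5", "dst_ip": "128.8.8.8",
          "src_port": 50000, "dst_port": 443, "proto": "tcp"}
print(sample_to_tags(sample))
```

The resulting tag list could then be vectorized (for example, one-hot encoded) and passed to a decision tree learner; that step is omitted here.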

During training, the communication device can follow the feature engineering process described above to convert each type-labeled sample into a series of feature tags, and then train a classification model using a decision tree algorithm. During inference, the communication device applies the same feature engineering scheme to convert each unlabeled sample to be predicted into a series of feature tags, and then uses the trained decision tree to classify that sample.

Take an application identification scenario as an example. The training sample set is obtained from a session table, and each sample in the training sample set is derived from the record of one data flow in the session table. Each sample includes fields such as the source address, destination address, source port, destination port, and protocol type of a data flow. In addition, each sample in the training sample set is annotated with a type label for the data flow. The type label indicates the application type of the data flow, such as a video application, a gaming application, or a voice application. During training, the communication device can follow the feature engineering process described above to convert each type-labeled sample into a series of feature tags, and then train an application identification model using a decision tree algorithm. During inference, the communication device applies the same feature engineering scheme to convert each unlabeled sample to be predicted into a series of feature tags, and then uses the trained decision tree to determine the application type of that sample.

In summary, the method provided in this embodiment constructs a mapping relationship between port numbers and feature tags via the dynamic path and/or the static path, thereby addressing problems such as information loss and dimensionality explosion.

In addition, the computation involved is relatively simple, so the method is suitable for communication devices deployed on live networks.

In addition, the dynamic path and the static path can be used separately or in combination.

In addition, the dynamic path and the static path draw on a variety of data sources, including the IANA port registry, private-protocol port registries, traffic logs, and alarm logs, which makes the method easy to adapt to different applications. For example, when the application type changes, both the information source used to convert port numbers into feature tags and the specific way that information is processed can be selected based on the application type.

The foregoing embodiments are summarized below with reference to the embodiment shown in FIG. 1.

FIG. 1 is a flowchart of a method for obtaining a mapping relationship according to an embodiment of the present application. The method shown in FIG. 1 is performed by a communication device. The communication device is, for example, a network device with a forwarding function, such as a switch, a router, or a firewall. As another example, the communication device is a device with data processing capability, such as a terminal or a server. The method shown in FIG. 1 includes the following steps S110 to S140.

Step S110: The communication device obtains target data.

The target data includes a port number field and a background field. The port number field includes the port number of a first port, and the background field includes an attribute of the port number. The target data can come from a forwarding table stored on the communication device, or from logs recorded on the communication device, such as traffic forwarding logs or security detection logs. The target data can also be sent to the communication device by another device, or entered by a user.

In some embodiments, the target data includes static data, which is data unrelated to how the port is actually used for connections. For example, the static data includes a port registry, and the port registry includes at least one of an IANA port registry or a private-protocol port registry.

In some embodiments, before obtaining the attribute of the port number of the first port based on the target data, the communication device performs redundancy removal on an original port registry to obtain the port registry.

In some embodiments, redundancy removal includes at least one of the following Method 1 or Method 2.

Method 1: If a background field in the original port registry is detected to contain a keyword indicating duplication, the row containing that background field is deleted from the original port registry.

Method 2: If multiple background fields in the original port registry are detected to contain terms that have the same meaning but are worded differently, a target wording is determined for the term, and each occurrence of the term in those background fields is replaced with the target wording.
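Methods 1 and 2 can be sketched together as a single cleaning pass. The sketch below is illustrative only: the keyword list `DUPLICATE_KEYWORDS`, the synonym table `SYNONYMS`, and the registry rows are hypothetical, and a practical system would need a more careful way to detect equivalent terms than literal substring replacement.

```python
# Assumed keyword list for Method 1 (rows flagged as duplicates are dropped).
DUPLICATE_KEYWORDS = ("duplicate",)

# Assumed synonym table for Method 2 (variant wording -> target wording).
SYNONYMS = {
    "world wide web http": "http",
    "hyper text transfer protocol": "http",
}


def remove_redundancy(rows):
    """rows: iterable of (port, background_field) pairs from the original registry."""
    cleaned = []
    for port, description in rows:
        text = description.lower()
        # Method 1: delete the row if its background field signals duplication.
        if any(keyword in text for keyword in DUPLICATE_KEYWORDS):
            continue
        # Method 2: replace differently worded but equivalent terms with one target wording.
        for variant, target in SYNONYMS.items():
            text = text.replace(variant, target)
        cleaned.append((port, text))
    return cleaned


original = [
    (80, "World Wide Web HTTP"),
    (8080, "Hyper Text Transfer Protocol alternate"),
    (3, "Duplicate entry"),
]
print(remove_redundancy(original))
```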

In some embodiments, the target data includes dynamic data, and the dynamic data includes log records of a first communication device. The log records capture data generated when the first port of the first communication device communicates with a second communication device. In this case, the object identifier, which identifies an object that interacts with the first port, includes at least one of: the network address of the second communication device, an identifier of the area where the second communication device is located, the port of the second communication device that communicates with the first port, or an identifier of the application on the second communication device that communicates with the first port.

Step S120: The communication device obtains the attribute of the port number of the first port based on the target data.

The first port is a port recorded in the target data. The first port is, for example, a port of the communication device. The communication device parses the background field in the target data and extracts the attribute of the port number from the background field.

In some embodiments, the attribute of the port number includes usage information of the port number. The usage information includes at least one of: an identifier of an application that communicates through the port number, an identifier of a protocol that communicates through the port number, or an identifier of a service that communicates through the port number.

In some embodiments, before performing step S120, the communication device first determines the similarity between the background field in the target data and the port number, and compares the similarity with a similarity threshold. If the similarity is higher than the similarity threshold, the communication device performs step S120 to extract the attribute of the port number from the background field. If the similarity is not higher than the similarity threshold, step S120 is not performed.

In some embodiments, before performing step S120, the communication device first determines the hash degree of the background field in the target data, and compares the hash degree with a hash degree threshold. If the hash degree is higher than the hash degree threshold, the communication device performs step S120 to extract the attribute of the port number from the background field. If the hash degree is not higher than the hash degree threshold, step S120 is not performed.

Step S130: The communication device performs feature extraction on the attribute of the port number to obtain a feature tag set of the port number, where the feature tag set includes one or more feature tags of the port number.

In some embodiments, the communication device performs feature extraction on the usage information of the port number to obtain a static tag set of the port number, where the static tag set includes one or more static tags of the port number.

In some embodiments, the attribute of the port number further includes the number of interactions between the first port and the object identified by the object identifier. The communication device concatenates or fuses the object identifier with the number of interactions to obtain the feature tag set of the port number.

In some embodiments, the target data includes dynamic data, which is data related to how the first port is actually used for connections. The attribute of the port number includes an object identifier, which identifies an object that interacts with the first port. The communication device performs feature extraction on the object identifier to obtain a dynamic tag set of the port number, where the dynamic tag set includes one or more dynamic tags of the port number.
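The dynamic path described in the two paragraphs above can be sketched as follows. This is a minimal sketch under assumptions: the object identifier is taken to be the first octet of the peer's IPv4 address (as in the Table 10 example), the `identifier:count` string is one invented way to concatenate the identifier with the interaction count, and the log records are hypothetical.

```python
from collections import Counter, defaultdict


def dynamic_tag_sets(log_records):
    """log_records: iterable of (local_port, peer_ipv4) pairs from connection logs.

    Returns {port: set of "object_id:interaction_count" dynamic tags}.
    """
    counts = defaultdict(Counter)
    for local_port, peer_ip in log_records:
        object_id = peer_ip.split(".")[0]   # assumed object identifier: first octet
        counts[local_port][object_id] += 1  # interaction count per object
    # Concatenate each object identifier with its interaction count.
    return {
        port: {f"{object_id}:{n}" for object_id, n in counter.items()}
        for port, counter in counts.items()
    }


logs = [(80, "128.8.8.8"), (80, "128.1.1.1"), (80, "255.0.0.1"), (79, "8.8.8.8")]
print(dynamic_tag_sets(logs))
```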

In some embodiments, the dynamic data further includes threat records, and the background field in a threat record includes an attack type field. The attribute of the port number includes the type of network attack that targets the first port and the number of times the first port has been attacked by that type of network attack. The communication device performs feature extraction on the type of network attack and/or the number of attacks to obtain a dynamic tag set of the port number, where the dynamic tag set includes one or more dynamic tags of the port number.

In some embodiments, the communication device performs feature extraction on the attribute of the port number in the static data to obtain a static tag set of the port number; performs feature extraction on the attribute of the port number in the dynamic data to obtain a dynamic tag set of the port number; and fuses the static tag set of the port number with the dynamic tag set of the port number to obtain the feature tag set of the port number.
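One simple way to fuse a static tag set with a dynamic tag set is a prefixed set union, sketched below. The text does not prescribe a fusion operator, so the union, the `static:`/`dynamic:` prefixes, and the example tags are all assumptions for illustration.

```python
def fuse_tag_sets(static_tags, dynamic_tags):
    """Fuse two tag sets into one feature tag set.

    The prefixes keep the origin of each tag visible after fusion (an assumed convention).
    """
    return ({f"static:{tag}" for tag in static_tags}
            | {f"dynamic:{tag}" for tag in dynamic_tags})


fused = fuse_tag_sets({"http", "web"}, {"128:2", "255:1"})
print(sorted(fused))
```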

Step S140: The communication device generates a mapping relationship based on the port number and the feature tag set. The mapping relationship includes a key-value pair, where the key includes the port number and the value includes the feature tag set.
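Steps S110 to S140 can be sketched end to end for the static path as follows. This is an illustrative sketch only: the record layout (`port`, `description`), the stop-word list, and the whitespace tokenizer used as "feature extraction" are assumptions, and the two registry-style rows are hypothetical examples.

```python
# Assumed stop words that carry little information about a port's purpose.
STOPWORDS = {"the", "of", "over", "protocol", "service"}


def extract_tags(usage_info):
    """S130 (sketch): turn a usage description into a set of feature tags."""
    tokens = usage_info.lower().replace("/", " ").replace("-", " ").split()
    return {token for token in tokens if token not in STOPWORDS}


def build_mapping(target_data):
    """S120 to S140 (sketch): key is the port number, value is its feature tag set."""
    mapping = {}
    for record in target_data:
        port = record["port"]              # port number field
        attribute = record["description"]  # background field holding the attribute
        mapping.setdefault(port, set()).update(extract_tags(attribute))
    return mapping


# S110 (sketch): hypothetical rows in the style of a port registry.
target_data = [
    {"port": 443, "description": "HTTP protocol over TLS/SSL"},
    {"port": 80, "description": "World Wide Web HTTP"},
]
print(build_mapping(target_data))
```

In this sketch ports 80 and 443 share the tag `http`, which is the kind of overlap that lets the tag sets reflect how closely two ports are related.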

The method provided in this embodiment exploits the co-occurrence relationship between a port number and the background field. Rather than encoding the port number itself directly, the method obtains the attribute of the port number from the background field associated with the port number, and then obtains the feature tag set of the port number (in other words, the encoding of the port number) based on that attribute. As a result, the feature tag set can represent the attribute of the port number (in other words, its semantics or purpose), which avoids the information loss that occurs when an encoding cannot represent the attribute of the port number. In addition, if two port numbers share an attribute, their feature tag sets contain a common part; in other words, the encodings of the two port numbers are partly identical or similar. The similarity between the feature tag sets of two port numbers therefore reflects how closely the two port numbers are related, which more fully captures the correlation between different ports.

In some embodiments, the method further includes:

comparing the feature tag set of the first port with the feature tag set of a second port to obtain the similarity between the two feature tag sets; and

determining the similarity between the first port and the second port based on the similarity between the feature tag set of the first port and the feature tag set of the second port.
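The text does not fix a particular similarity measure for feature tag sets. One common choice for comparing sets is the Jaccard index, sketched below; its use here, the function name, and the example tag sets are assumptions made for illustration.

```python
def jaccard(tags_a, tags_b):
    """Jaccard index: size of the intersection divided by size of the union."""
    if not tags_a and not tags_b:
        return 0.0  # assumed convention for two empty tag sets
    return len(tags_a & tags_b) / len(tags_a | tags_b)


# Hypothetical feature tag sets for two ports.
port_443 = {"http", "tls", "ssl"}
port_80 = {"http", "web"}
print(jaccard(port_443, port_80))  # 0.25: one shared tag out of four distinct tags
```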

In some embodiments, the target data includes an alarm log set, the alarm log set includes a first alarm log and a second alarm log, and the method further includes:

determining the similarity between the first alarm log and the second alarm log based on the similarity between the feature tag set of a first port and the feature tag set of a second port, where the feature tag set of the first port is determined based on the background field in the first alarm log, and the feature tag set of the second port is determined based on the background field in the second alarm log; and

clustering the alarm log set based on the similarity between the first alarm log and the second alarm log.
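The clustering step can be sketched with a simple greedy, threshold-based grouping. The text does not name a clustering algorithm, so the greedy strategy, the Jaccard similarity, the threshold of 0.5, and the example data are all assumptions; any similarity-based clustering method could be substituted.

```python
def jaccard(tags_a, tags_b):
    """Jaccard index between two tag sets (assumed similarity measure)."""
    if not tags_a and not tags_b:
        return 0.0
    return len(tags_a & tags_b) / len(tags_a | tags_b)


def cluster_alarms(alarms, port_tags, threshold=0.5):
    """Greedy clustering: each alarm joins the first cluster whose first member is similar enough.

    alarms: list of (alarm_id, port); port_tags: {port: feature tag set}.
    """
    clusters = []
    for alarm_id, port in alarms:
        tags = port_tags.get(port, set())
        for cluster in clusters:
            representative_port = cluster[0][1]
            if jaccard(tags, port_tags.get(representative_port, set())) >= threshold:
                cluster.append((alarm_id, port))
                break
        else:
            # No sufficiently similar cluster exists, so start a new one.
            clusters.append([(alarm_id, port)])
    return clusters


# Hypothetical tag sets and alarm logs.
port_tags = {80: {"http", "web"}, 8080: {"http", "web", "alt"}, 22: {"ssh"}}
alarms = [("a1", 80), ("a2", 8080), ("a3", 22)]
print(cluster_alarms(alarms, port_tags))
```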

In some embodiments, the target data includes network connection information of a first communication device in a normal state, and the method further includes:

when a connection between a second communication device and the first port of the first communication device is detected, looking up the mapping relationship based on the port number of the first port to obtain the feature tag set of the port number;

if a feature of the second communication device hits the feature tag set of the port number, determining that the connection is a normal connection; and

if the feature of the second communication device does not hit the feature tag set of the port number, determining that the connection is an abnormal connection.

In some embodiments, the target data includes a training sample set, each sample in the training sample set includes a source address, a destination address, a source port number, a destination port number, a protocol type, and a type label, and the method further includes:

looking up the mapping relationship based on the source port number to obtain the feature tag set of the source port number;

looking up the mapping relationship based on the destination port number to obtain the feature tag set of the destination port number;

concatenating or fusing the feature tag set of the source address, the feature tag set of the destination address, the feature tag set of the source port number, the feature tag set of the destination port number, and the feature tag set of the protocol type to obtain the feature tag set of each sample in the training sample set; and

performing model training based on the feature tag set and the type label of each sample in the training sample set to obtain a classification model.

FIG. 2 is a schematic structural diagram of an apparatus 200 for obtaining a mapping relationship according to an embodiment of the present application. The apparatus 200 includes:

an acquisition unit 210, configured to obtain an attribute of the port number of a first port based on target data, where the target data includes a port number field and a background field, the port number field includes the port number of the first port, and the background field includes the attribute of the port number; and

a processing unit 220, configured to perform feature extraction on the attribute of the port number to obtain a feature tag set of the port number, where the feature tag set includes one or more feature tags of the port number.

The acquisition unit 210 is further configured to obtain a mapping relationship based on the port number and the feature tag set, where the mapping relationship includes a key-value pair, the key in the mapping relationship includes the port number, and the value in the mapping relationship includes the feature tag set.

In some embodiments, the target data includes static data, which is data unrelated to how the port is actually used for connections. The attribute of the port number includes usage information of the port number, and the usage information includes at least one of: an identifier of an application that communicates through the port number, an identifier of a protocol that communicates through the port number, or an identifier of a service that communicates through the port number. The processing unit 220 is further configured to perform feature extraction on the usage information of the port number to obtain a static tag set of the port number, where the static tag set includes one or more static tags of the port number.

In some embodiments, the static data includes a port registry, and the port registry includes at least one of an IANA port registry or a private-protocol port registry.

In some embodiments, the processing unit 220 is further configured to perform redundancy removal on an original port registry to obtain the port registry.

In some embodiments, the processing unit 220 is further configured to perform at least one of the following: if a background field in the original port registry is detected to contain a keyword indicating duplication, deleting the row containing that background field from the original port registry; or, if multiple background fields in the original port registry are detected to contain terms that have the same meaning but are worded differently, determining a target wording for the term and replacing each occurrence of the term in those background fields with the target wording.

In some embodiments, the target data includes dynamic data, which is data related to how the first port is actually used for connections. The attribute of the port number includes an object identifier, which identifies an object that interacts with the first port. The processing unit 220 is configured to perform feature extraction on the object identifier to obtain a dynamic tag set of the port number, where the dynamic tag set includes one or more dynamic tags of the port number.

In some embodiments, the dynamic data includes log records of a first communication device. The log records capture data generated when the first port of the first communication device communicates with a second communication device. The object identifier includes at least one of: the network address of the second communication device, an identifier of the area where the second communication device is located, the port of the second communication device that communicates with the first port, or an identifier of the application on the second communication device that communicates with the first port.

In some embodiments, the attribute of the port number further includes the number of interactions between the object and the first port. The processing unit 220 is configured to concatenate or fuse the object identifier with the number of interactions to obtain the feature tag set of the port number.

In some embodiments, the dynamic data further includes threat records, and the background field in a threat record includes an attack type field. The attribute of the port number includes the type of network attack that targets the first port and the number of times the first port has been attacked by that type of network attack. The processing unit 220 is configured to perform feature extraction on the type of network attack and/or the number of attacks to obtain a dynamic tag set of the port number, where the dynamic tag set includes one or more dynamic tags of the port number.

In some embodiments, the processing unit 220 is further configured to: perform feature extraction on the attribute of the port number in the static data to obtain a static tag set of the port number; perform feature extraction on the attribute of the port number in the dynamic data to obtain a dynamic tag set of the port number; and fuse the static tag set of the port number with the dynamic tag set of the port number to obtain the feature tag set of the port number.

In some embodiments, the processing unit 220 is further configured to determine the similarity between the background field in the target data and the port number, and, if the similarity is higher than a similarity threshold, extract the attribute of the port number from the background field.

In some embodiments, the processing unit 220 is further configured to determine the hash degree of the background field in the target data, and, if the hash degree is higher than a hash degree threshold, extract the attribute of the port number from the background field.

In some embodiments, the apparatus further includes:

a comparison unit, configured to compare the feature tag set of the first port with the feature tag set of a second port to obtain the similarity between the two feature tag sets.

The processing unit 220 is further configured to determine the similarity between the first port and the second port based on the similarity between the feature tag set of the first port and the feature tag set of the second port.

In some embodiments, the target data includes an alarm log set, and the alarm log set includes a first alarm log and a second alarm log. The processing unit 220 is further configured to determine the similarity between the first alarm log and the second alarm log based on the similarity between the feature tag set of a first port and the feature tag set of a second port, where the feature tag set of the first port is determined based on the background field in the first alarm log, and the feature tag set of the second port is determined based on the background field in the second alarm log; and to cluster the alarm log set based on the similarity between the first alarm log and the second alarm log.

In some embodiments, the target data includes network connection information of a first communication device in a normal state. The processing unit 220 is further configured to: when a connection between a second communication device and the first port of the first communication device is detected, look up the mapping relationship based on the port number of the first port to obtain the feature tag set of the port number; if a feature of the second communication device hits the feature tag set of the port number, determine that the connection is a normal connection; and if the feature of the second communication device does not hit the feature tag set of the port number, determine that the connection is an abnormal connection.

In some embodiments, the target data includes a training sample set, each sample in the training sample set includes a source address, a destination address, a source port number, a destination port number, a protocol type, and a type label, and the apparatus further includes:

a lookup unit, configured to look up the mapping relationship based on the source port number to obtain the feature tag set of the source port number, and to look up the mapping relationship based on the destination port number to obtain the feature tag set of the destination port number.

The processing unit 220 is further configured to concatenate or fuse the feature tag set of the source address, the feature tag set of the destination address, the feature tag set of the source port number, the feature tag set of the destination port number, and the feature tag set of the protocol type to obtain the feature tag set of each sample in the training sample set; and to perform model training based on the feature tag set and the type label of each sample in the training sample set to obtain a classification model.

The apparatus embodiment described with reference to FIG. 2 is merely illustrative. For example, the division into the foregoing units is only a division by logical function; other divisions are possible in an actual implementation. For example, multiple units or components can be combined or integrated into another system, or some features can be omitted or not performed. The functional units in the embodiments of the present application can be integrated into one processing unit, each unit can exist physically on its own, or two or more units can be integrated into one unit.

Each unit in the apparatus 200 is implemented wholly or partly by software, hardware, firmware, or any combination thereof.

Some possible ways of implementing the functional units of the apparatus 200 in hardware or software are described below with reference to the communication device 300 described later.

In a software implementation, for example, the acquisition unit 210 and the processing unit 220 are implemented as software functional units generated after at least one processor 301 in FIG. 3 reads program code stored in the memory 302.

In a hardware implementation, for example, the foregoing units in FIG. 2 are implemented by different hardware in the communication device. For example, the processing unit 220 is implemented by part of the processing resources of at least one processor 301 in FIG. 3 (for example, one or two cores of a multi-core processor), or by a programmable device such as a field-programmable gate array (FPGA) or a coprocessor. The acquisition unit 210 is implemented by the network interface 303 in FIG. 3.

FIG. 3 is a schematic structural diagram of a communication device 300 provided in an embodiment of the present application.

The communication device 300 includes at least one processor 301, a memory 302, and at least one network interface 303.

The processor 301 is, for example, a general-purpose central processing unit (CPU), a network processor (NP), a graphics processing unit (GPU), a neural-network processing unit (NPU), a data processing unit (DPU), a microprocessor, or one or more integrated circuits for implementing the solution of the present application. For example, the processor 301 includes an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD is, for example, a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.

The memory 302 is, for example, a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions; a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions; an electrically erasable programmable read-only memory (EEPROM); a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, digital versatile discs, Blu-ray discs, etc.); a magnetic disk storage medium or other magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. Optionally, the memory 302 exists independently and is connected to the processor 301 via the internal connection 304. Alternatively, the memory 302 and the processor 301 are integrated together.

The network interface 303 uses any transceiver-type apparatus to communicate with other devices or communication networks. The network interface 303 includes, for example, at least one of a wired network interface or a wireless network interface. The wired network interface is, for example, an Ethernet interface. The Ethernet interface is, for example, an optical interface, an electrical interface, or a combination thereof. The wireless network interface is, for example, a wireless local area network (WLAN) interface, a cellular network interface, or a combination thereof.

In some embodiments, the processor 301 includes one or more CPUs, such as CPU0 and CPU1 shown in FIG. 3.

In some embodiments, the communication device 300 optionally includes multiple processors, such as the processor 301 and the processor 305 shown in FIG. 3. Each of these processors is, for example, a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here optionally refers to one or more devices, circuits, and/or processing cores for processing data (for example, computer program instructions).

In some embodiments, the communication device 300 further includes an internal connection 304. The processor 301, the memory 302, and the at least one network interface 303 are connected via the internal connection 304. The internal connection 304 includes a pathway that transfers information between these components. Optionally, the internal connection 304 is a board or a bus. Optionally, the internal connection 304 is divided into an address bus, a data bus, a control bus, and the like.

In some embodiments, the communication device 300 further includes an input/output interface 306, which is connected to the internal connection 304.

Optionally, the processor 301 implements the method in the above embodiments by reading program code stored in the memory 302, or the processor 301 implements the method in the above embodiments by using internally stored program code. Where the processor 301 implements the method by reading program code stored in the memory 302, the memory 302 stores program code 310 that implements the method provided in the embodiments of the present application.

The embodiments in this specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be read with reference to one another, and each embodiment focuses on its differences from the other embodiments.

A referring to B means that A is the same as B or that A is a simple variant of B.

The terms "first", "second", and the like in the description and claims of the embodiments of the present application are used to distinguish different objects, not to describe a specific order of objects, and should not be construed as indicating or implying relative importance. For example, "first port" and "second port" are used to distinguish different ports, not to describe a specific order of ports, and should not be construed as meaning that the first port is more important than the second port.

In the embodiments of the present application, unless otherwise specified, "at least one" means one or more, and "a plurality" means two or more. For example, a plurality of ports means two or more ports.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid-state drive (SSD)).

The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements for some of the technical features therein, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims

A method for obtaining a mapping relationship, characterized in that the method includes:

Acquiring an attribute of a port number of a first port based on target data, wherein the target data includes a port number field and a background field, the port number field includes the port number of the first port, and the background field includes the attribute of the port number;

Performing feature extraction on the attribute of the port number to obtain a feature tag set of the port number, wherein the feature tag set includes one or more feature tags of the port number;

Acquiring a mapping relationship based on the port number and the feature tag set, wherein the mapping relationship includes a key-value pair, the key in the mapping relationship includes the port number, and the value in the mapping relationship includes the feature tag set.
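As a non-limiting illustration, the three steps of this claim can be sketched in Python. The record layout (a port number paired with free-text background), the tokenizing rule in `extract_tags`, and all function names are assumptions made for this sketch; the claims do not prescribe any particular feature extraction technique.

```python
import re
from typing import Dict, Iterable, Set, Tuple


def extract_tags(attribute: str) -> Set[str]:
    """Hypothetical feature extraction: lowercase alphanumeric tokens of the text."""
    return set(re.findall(r"[a-z0-9]+", attribute.lower()))


def build_mapping(records: Iterable[Tuple[int, str]]) -> Dict[int, Set[str]]:
    """Build key-value pairs: key = port number, value = feature tag set."""
    mapping: Dict[int, Set[str]] = {}
    for port, background in records:
        # Tags from several records for the same port accumulate in one set.
        mapping.setdefault(port, set()).update(extract_tags(background))
    return mapping


# Illustrative target data: (port number field, background field).
records = [
    (22, "SSH Remote Login Protocol"),
    (443, "HTTP protocol over TLS/SSL"),
]
mapping = build_mapping(records)
```

A dictionary keyed by port number keeps lookups constant-time, which suits the later claims that search the mapping by port number.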

The method according to claim 1, characterized in that the target data includes static data, the static data being data unrelated to the actual connection usage of the port; the attributes of the port number include usage information of the port number, the usage information including at least one of an identifier of an application communicating through the port number, an identifier of a protocol communicating through the port number, or an identifier of a service communicating through the port number; and performing feature extraction on the attributes of the port number to obtain a feature tag set of the port number comprises:

Performing feature extraction on the usage information of the port number to obtain a static tag set of the port number, wherein the static tag set of the port number includes one or more static tags of the port number.

The method according to claim 2, characterized in that the static data includes a port registry, and the port registry includes at least one of an Internet Assigned Numbers Authority (IANA) port registry or a private protocol port registry.

The method according to claim 3, characterized in that before obtaining the attribute of the port number of the first port based on the target data, the method further comprises:

Performing de-redundancy processing on an original port registry to obtain the port registry.

The method according to claim 4, wherein the de-redundancy processing of the original port registry comprises at least one of the following:

If it is detected that a background field in the original port registry includes a keyword indicating duplication, deleting the row containing that background field from the original port registry;

If it is detected that multiple background fields in the original port registry contain terms that have the same meaning but different wording, determining a target word for those terms and replacing the terms appearing in the multiple background fields with the target word.
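The two de-redundancy operations above can be sketched as follows. The duplication keyword `"duplicate"` and the synonym table are hypothetical values chosen for illustration; the claims leave both the keyword list and the way a target word is determined open.

```python
from typing import Dict, List, Tuple

# Assumed keyword list and synonym table; neither is specified by the source.
DUPLICATE_KEYWORDS: Tuple[str, ...] = ("duplicate",)
SYNONYMS: Dict[str, str] = {
    "secure shell": "ssh",
    "hypertext transfer protocol": "http",
}


def deduplicate(rows: List[Tuple[int, str]]) -> List[Tuple[int, str]]:
    """Drop rows flagged as duplicates and normalize equivalent terms."""
    cleaned: List[Tuple[int, str]] = []
    for port, background in rows:
        text = background.lower()
        # Operation 1: delete the row if its background field signals duplication.
        if any(keyword in text for keyword in DUPLICATE_KEYWORDS):
            continue
        # Operation 2: replace each variant wording with its target word.
        for variant, target in SYNONYMS.items():
            text = text.replace(variant, target)
        cleaned.append((port, text))
    return cleaned


original_registry = [
    (22, "Secure Shell login"),
    (22, "Duplicate of entry above"),
    (80, "Hypertext Transfer Protocol"),
]
registry = deduplicate(original_registry)
```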

The method according to claim 1, wherein the target data includes dynamic data, the dynamic data refers to data related to the actual connection usage of the first port, the attributes of the port number include an object identifier, the object identifier is used to identify an object that has an interactive relationship with the first port, and the feature extraction of the attributes of the port number to obtain a feature tag set for the port number includes:

Feature extraction is performed on the object identifier to obtain a dynamic tag set of the port number, where the dynamic tag set of the port number includes one or more dynamic tags of the port number.

The method according to claim 1, characterized in that the dynamic data includes log records of a first communication device, the log records are used to record data generated by communication between the first port of the first communication device and a second communication device, and the object identifier includes at least one of a network address of the second communication device, an identifier of the area where the second communication device is located, a port of the second communication device that communicates with the first port, or an identifier of an application in the second communication device that communicates with the first port.

The method according to claim 1, wherein the attributes of the port number further include the number of interactions between the object and the first port, and performing feature extraction on the object identifier to obtain the dynamic tag set of the port number comprises:

Concatenating or fusing the object identifier and the number of interactions to obtain a feature tag set of the port number.
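One possible reading of the concatenation step is sketched below: each log record contributes a (local port, peer address) pair, the pairs are counted, and each dynamic tag joins the object identifier with its interaction count. The log layout and the `peer#count` tag format are assumptions of this sketch, not taken from the source.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def dynamic_tags(log_records: Iterable[Tuple[int, str]]) -> Dict[int, Set[str]]:
    """Map each port to tags of the form '<object identifier>#<interaction count>'."""
    counts = Counter(log_records)  # (port, peer) -> number of interactions
    tags: Dict[int, Set[str]] = {}
    for (port, peer), n in counts.items():
        # Concatenate the object identifier with its interaction count.
        tags.setdefault(port, set()).add(f"{peer}#{n}")
    return tags


# Illustrative log records: (first port, network address of the peer device).
logs = [
    (443, "10.0.0.5"),
    (443, "10.0.0.5"),
    (443, "10.0.0.9"),
    (22, "10.0.0.5"),
]
tags = dynamic_tags(logs)
```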

The method according to claim 1, wherein the dynamic data further includes a threat record, the background field in the threat record includes an attack type field, the attributes of the port number include a type of network attack targeting the first port and a number of times the first port has been attacked by the type of network attack, and the feature extraction of the attributes of the port number to obtain a feature tag set for the port number includes:

Performing feature extraction on the type of network attack and/or the number of attacks to obtain a dynamic tag set of the port number, wherein the dynamic tag set of the port number includes one or more dynamic tags of the port number.

The method according to claim 1, wherein extracting features from the attributes of the port number to obtain a feature tag set of the port number comprises:

Performing feature extraction on the attributes of the port number in the static data to obtain a static tag set of the port number;

Performing feature extraction on the attributes of the port number in the dynamic data to obtain a dynamic tag set of the port number;

Merging the static tag set of the port number with the dynamic tag set of the port number to obtain the feature tag set of the port number.
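A minimal sketch of the merging step, assuming that it combines the static tag set and the dynamic tag set produced by the two preceding steps and that merging means set union; other fusion strategies would fit the claim language equally well.

```python
from typing import Dict, Set


def merge_tag_sets(
    static: Dict[int, Set[str]], dynamic: Dict[int, Set[str]]
) -> Dict[int, Set[str]]:
    """Union the static and dynamic tag sets for every port seen in either source."""
    merged: Dict[int, Set[str]] = {}
    for port in static.keys() | dynamic.keys():
        merged[port] = static.get(port, set()) | dynamic.get(port, set())
    return merged


static_tags = {22: {"ssh", "login"}, 80: {"http"}}
dynamic_tags_by_port = {22: {"10.0.0.5#1"}, 8080: {"10.0.0.7#3"}}
feature_tags = merge_tag_sets(static_tags, dynamic_tags_by_port)
```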

The method according to claim 1, wherein obtaining the attribute of the port number of the first port based on the target data comprises:

Determining the similarity between the background field in the target data and the port number;

If the similarity is higher than a similarity threshold, the attribute of the port number is extracted from the background field.

Determining the hash degree of the background field in the target data;

If the hash degree is higher than a hash degree threshold, the attribute of the port number is extracted from the background field.

The method according to claim 1, further comprising:

comparing the feature tag set of the first port with the feature tag set of the second port to obtain a similarity between the feature tag set of the first port and the feature tag set of the second port;

Based on the similarity between the feature tag set of the first port and the feature tag set of the second port, the similarity between the first port and the second port is determined.
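The claim does not name a similarity measure. As one hedged example, the Jaccard index (size of the intersection divided by size of the union) is a common choice for comparing two sets; treating two empty sets as dissimilar is a convention chosen for this sketch.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two tag sets, in the range [0.0, 1.0]."""
    union = a | b
    if not union:
        # Convention for this sketch: no tags means no evidence of similarity.
        return 0.0
    return len(a & b) / len(union)


first_port_tags = {"ssh", "login"}
second_port_tags = {"ssh", "remote", "login"}
similarity = jaccard(first_port_tags, second_port_tags)
```

Under this sketch, the similarity between the two ports is taken directly as the similarity between their tag sets.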

The method according to claim 1, wherein the target data includes an alarm log set, the alarm log set includes a first alarm log and a second alarm log, and the method further comprises:

Determining similarity between the first alarm log and the second alarm log based on similarity between a feature tag set of a first port and a feature tag set of a second port, wherein the feature tag set of the first port is determined based on a background field in the first alarm log, and the feature tag set of the second port is determined based on the background field in the second alarm log;

The alarm log set is clustered based on the similarity between the first alarm log and the second alarm log.
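The clustering step might be sketched as below. Each alarm log is reduced to the port number it refers to, the similarity between two logs is taken as the Jaccard similarity of the ports' tag sets, and logs whose similarity reaches a threshold are grouped with a union-find structure. The threshold value, the single-linkage grouping, and the sample mapping are all assumptions of this sketch; the source does not specify a clustering algorithm.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_alarms(
    alarm_ports: List[int], mapping: Dict[int, Set[str]], threshold: float = 0.5
) -> List[List[int]]:
    """Group alarm log indices whose ports have similar feature tag sets."""
    n = len(alarm_ports)
    parent = list(range(n))

    def find(i: int) -> int:
        # Find the cluster representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            tags_i = mapping.get(alarm_ports[i], set())
            tags_j = mapping.get(alarm_ports[j], set())
            if jaccard(tags_i, tags_j) >= threshold:
                parent[find(j)] = find(i)  # merge the two clusters

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


mapping = {22: {"ssh", "remote", "login"}, 2222: {"ssh", "login"}, 53: {"dns"}}
alarm_ports = [22, 53, 2222, 53]  # port referenced by each alarm log
clusters = cluster_alarms(alarm_ports, mapping)
```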

The method according to claim 1, wherein the target data includes network connection information of the first communication device in a normal state, and the method further comprises:

When it is detected that a second communication device establishes a connection with the first port of the first communication device, searching the mapping relationship based on the port number of the first port to obtain a feature tag set of the port number;

If the feature of the second communication device matches the feature tag set of the port number, determining that the connection is a normal connection;

If the feature of the second communication device does not match the feature tag set of the port number, it is determined that the connection is an abnormal connection.
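The claim leaves open what it means for a feature to "match" a feature tag set. The sketch below assumes the simplest reading: the connection is normal if the peer's features share at least one tag with the set recorded for the port during normal operation. The tag vocabulary and the sample mapping are hypothetical.

```python
from typing import Dict, Set


def classify_connection(
    mapping: Dict[int, Set[str]], port: int, peer_features: Set[str]
) -> str:
    """Return 'normal' if the peer shares a tag with the port's tag set, else 'abnormal'."""
    expected = mapping.get(port, set())  # look up the mapping by port number
    return "normal" if expected & peer_features else "abnormal"


# Tag sets assumed to be learned from network connection information in a normal state.
normal_mapping = {3306: {"region:internal", "app:orders-service"}}

internal_peer = classify_connection(normal_mapping, 3306, {"region:internal"})
external_peer = classify_connection(normal_mapping, 3306, {"region:external"})
```

A port that is absent from the mapping has an empty expected set, so any connection to it is reported as abnormal under this reading.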

The method according to claim 1, wherein the target data comprises a training sample set, each sample in the training sample set comprises a source address, a destination address, a source port number, a destination port number, a protocol type, and a type label, and the method further comprises:

Searching the mapping relationship based on the source port number to obtain a feature tag set of the source port number;

Searching the mapping relationship based on the destination port number to obtain a feature tag set of the destination port number;

Concatenating or fusing the feature tag set of the source address, the feature tag set of the destination address, the feature tag set of the source port number, the feature tag set of the destination port number, and the feature tag set of the protocol type to obtain a feature tag set for each sample in the training sample set;

Performing model training based on the feature tag set of each sample in the training sample set and the type label of each sample in the training sample set to obtain a classification model.
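As a hedged end-to-end illustration, the sketch below concatenates prefixed tags for the five fields of each sample and trains a simple perceptron on the resulting tag sets. The perceptron, the binary type labels (1 = malicious, 0 = benign), the tag prefixes, and the sample data are all assumptions; the source does not name a model type.

```python
from typing import Dict, List, Set, Tuple

PORT_TAGS: Dict[int, Set[str]] = {443: {"https", "tls"}, 53: {"dns"}, 4444: {"backdoor"}}

Sample = Tuple[str, str, int, int, str]  # src addr, dst addr, src port, dst port, protocol


def sample_tags(sample: Sample, mapping: Dict[int, Set[str]]) -> Set[str]:
    """Concatenate per-field tags, prefixing each so that fields stay distinguishable."""
    src_addr, dst_addr, src_port, dst_port, proto = sample
    tags = {f"src_addr:{src_addr}", f"dst_addr:{dst_addr}", f"proto:{proto}"}
    tags |= {f"src:{t}" for t in mapping.get(src_port, set())}
    tags |= {f"dst:{t}" for t in mapping.get(dst_port, set())}
    return tags


def score(model: Tuple[Dict[str, float], float], tags: Set[str]) -> float:
    weights, bias = model
    return bias + sum(weights.get(t, 0.0) for t in tags)


def predict(model: Tuple[Dict[str, float], float], tags: Set[str]) -> int:
    return 1 if score(model, tags) > 0 else 0


def train(features: List[Set[str]], labels: List[int], epochs: int = 10):
    """Perceptron training over sparse tag sets (one weight per tag)."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(epochs):
        for tags, y in zip(features, labels):
            delta = y - predict((weights, bias), tags)
            if delta:  # update only on a misclassified sample
                bias += delta
                for t in tags:
                    weights[t] = weights.get(t, 0.0) + delta
    return weights, bias


samples: List[Sample] = [
    ("10.0.0.1", "10.0.0.2", 50000, 443, "tcp"),
    ("10.0.0.1", "10.0.0.3", 50001, 53, "udp"),
    ("10.0.0.1", "10.0.0.4", 50002, 4444, "tcp"),
]
type_labels = [0, 0, 1]
features = [sample_tags(s, PORT_TAGS) for s in samples]
model = train(features, type_labels)
unseen = sample_tags(("10.0.0.9", "10.0.0.8", 50003, 4444, "tcp"), PORT_TAGS)
```

Because the port tags come from the mapping rather than from raw port numbers, a sample with previously unseen addresses can still be scored through the tags its ports share with the training data.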

A device for obtaining a mapping relationship, characterized in that the device comprises:

an acquisition unit, configured to acquire an attribute of a port number of a first port based on target data, wherein the target data includes a port number field and a background field, the port number field includes the port number of the first port, and the background field includes the attribute of the port number;

a processing unit, configured to perform feature extraction on the attributes of the port number to obtain a feature tag set of the port number, wherein the feature tag set includes one or more feature tags of the port number;

The acquisition unit is further configured to acquire a mapping relationship based on the port number and the feature tag set, wherein the mapping relationship includes a key-value pair, the key in the mapping relationship includes the port number, and the value in the mapping relationship includes the feature tag set.

A communication device, characterized in that the communication device includes: a processor, the processor is coupled to a memory, the memory stores at least one computer program instruction, and the at least one computer program instruction is loaded and executed by the processor to enable the communication device to implement the method according to any one of claims 1 to 16.

A computer-readable storage medium, characterized in that at least one instruction is stored in the storage medium, and when the instruction is executed on a computer, the computer executes the method according to any one of claims 1 to 16.

A computer program product, characterized in that the computer program product comprises one or more computer program instructions, which, when loaded and executed by a computer, enable the computer to execute the method according to any one of claims 1 to 16.