CN116796170A - Data processing method, electronic device and medium

Info

Publication number: CN116796170A
Application number: CN202310730373.3A
Authority: CN
Inventors: 王淼军; 焦孟; 谢东; 刘晓佳
Original assignee: Hubei Xingji Meizu Technology Co Ltd
Current assignee: Hubei Xingji Meizu Technology Co Ltd
Priority date: 2023-06-19
Filing date: 2023-06-19
Publication date: 2023-09-22

Abstract

The invention provides a data processing method, an electronic device and a medium, and relates to technical fields such as big data and deep learning. A specific implementation includes: acquiring a plurality of object data; upon determining that the number of the plurality of object data is greater than a preset quantity threshold, for any candidate object data among the plurality of object data, determining neighboring object data associated with the candidate object data from the plurality of object data based on the feature vector of the candidate object data and a first distance threshold; in response to the presence of neighboring object data in the plurality of object data, generating first candidate representation data based on the candidate object data and the neighboring object data; processing the plurality of object data based on the candidate object data, the neighboring object data and the first candidate representation data to obtain a first representation data set; and, upon determining that the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, determining the first representation data set as a first sample data set.

Description

Data processing method, electronic device and medium

Technical field

The present invention relates to the technical field of data processing, specifically to technical fields such as big data, artificial intelligence and deep learning, and in particular to a data processing method and apparatus, a training method and apparatus for a target object detection model, a target object detection method and apparatus, an electronic device, a storage medium and a computer program product.

Background

In fields such as deep learning and machine learning, a large amount of sample data is usually needed for model training. However, when the amount of sample data is too large, it not only reduces the training efficiency of the model but also causes local noise points to cluster, which degrades the precision and accuracy of the model.

Summary of the invention

The present invention provides a data processing method and apparatus, a training method and apparatus for a target object detection model, a target object detection method and apparatus, an electronic device, a storage medium and a computer program product.

According to one aspect of the present invention, a data processing method is provided, including: acquiring a plurality of object data; upon determining that the number of the plurality of object data is greater than a preset quantity threshold, for any candidate object data among the plurality of object data, determining neighboring object data associated with the candidate object data from the plurality of object data according to the feature vector of the candidate object data and a first distance threshold; in response to the presence of neighboring object data in the plurality of object data, generating first candidate representation data according to the candidate object data and the neighboring object data, wherein the first candidate representation data is used to replace the candidate object data and the neighboring object data; processing the plurality of object data based on the candidate object data, the neighboring object data and the first candidate representation data to obtain a first representation data set; and, upon determining that the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, determining the first representation data set as a first sample data set.

According to an embodiment of the present invention, processing the plurality of object data based on the candidate object data, the neighboring object data and the first candidate representation data to obtain the first representation data set includes: deleting the candidate object data and the neighboring object data from the plurality of object data to obtain a candidate object data set; and determining the first representation data set according to the candidate object data set and the first candidate representation data.
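The first-stage reduction described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the function name `first_stage_reduce` and its parameters are hypothetical, Euclidean distance between feature vectors is assumed, and the representation of a group is assumed to be its centroid (the text above does not say how the first candidate representation data is generated). A candidate with no neighbors is kept as its own representation, as a later embodiment states.

```python
import math


def first_stage_reduce(objects, first_distance_threshold, quantity_threshold):
    """Merge each candidate with its neighbors into one representation.

    objects: list of feature vectors (tuples of floats).
    Assumption: Euclidean distance; representation = centroid of the group.
    """
    # Nothing to do unless the data exceeds the preset quantity threshold.
    if len(objects) <= quantity_threshold:
        return list(objects)

    remaining = list(objects)   # candidate object data set
    representations = []        # first representation data set
    while remaining:
        candidate = remaining.pop(0)
        near, far = [], []
        for other in remaining:
            if math.dist(candidate, other) <= first_distance_threshold:
                near.append(other)
            else:
                far.append(other)
        if near:
            group = [candidate] + near
            # Hypothetical choice: average each coordinate over the group.
            representation = tuple(sum(axis) / len(group) for axis in zip(*group))
        else:
            # No neighbors: the candidate represents itself.
            representation = tuple(candidate)
        representations.append(representation)
        # The candidate and its neighbors are deleted from the object data.
        remaining = far
    return representations
```

A caller would then compare `len(result)` with the quantity threshold to decide whether the result can already serve as the first sample data set.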

According to an embodiment of the present invention, the first representation data set includes a plurality of first representation data. In response to detecting that the candidate object data set no longer includes any object data and determining that the number of the plurality of first representation data is greater than the preset quantity threshold, the data processing method further includes: determining the point density of each first representation data; determining the first representation data with the largest point density among the plurality of first representation data as candidate first representation data; determining, based on the point density of the candidate first representation data, the number of second candidate representation data associated with the candidate first representation data; performing representation data extraction on the plurality of first representation data based on the number of second candidate representation data to obtain a second representation data set; and, upon determining that the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold, determining the second representation data set as a second sample data set.

According to an embodiment of the present invention, performing representation data extraction on the plurality of first representation data based on the number of second candidate representation data to obtain the second representation data set includes: determining, from the plurality of first representation data, neighboring representation data associated with the candidate first representation data according to the feature vector of the candidate first representation data and a second distance threshold; clustering the candidate first representation data and the neighboring representation data based on the number of second candidate representation data to obtain that number of second candidate representation data; deleting the candidate first representation data and the neighboring representation data from the plurality of first representation data to obtain a candidate representation data set; and determining the second representation data set according to the candidate representation data set and that number of second candidate representation data.
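One extraction step of this kind might look like the sketch below. The clustering algorithm is not named in the text, so a small deterministic k-means is swapped in purely for illustration; `kmeans`, `second_stage_step`, the seeding rule and the iteration count are all hypothetical, and Euclidean distance is assumed. The number `k` of second candidate representation data is taken as an input here.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of feature vectors."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(points, k, iterations=10):
    """Tiny deterministic k-means (an assumed stand-in clustering method)."""
    # Hypothetical seeding: evenly spaced points from the input order.
    centers = [points[i * len(points) // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [centroid(cl) if cl else centers[i] for i, cl in enumerate(clusters)]
    return centers


def second_stage_step(reps, candidate_index, second_distance_threshold, k):
    """Replace the densest representation and its neighbors by k cluster centers."""
    candidate = reps[candidate_index]
    group, rest = [], []
    for i, r in enumerate(reps):
        if i == candidate_index or math.dist(candidate, r) <= second_distance_threshold:
            group.append(r)   # candidate plus neighboring representation data
        else:
            rest.append(r)    # candidate representation data set
    k = max(1, min(k, len(group)))
    return rest + kmeans(group, k)
```

Because `k` is smaller than the size of the merged group whenever the group has more than `k` members, each step shrinks the set while keeping more representatives in denser regions.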

According to an embodiment of the present invention, in response to determining that the number of second representation data contained in the second representation data set is greater than the preset quantity threshold, the data processing method further includes: determining, among the plurality of first representation data, affected first representation data associated with the candidate first representation data; determining the point density of the affected first representation data and the point density of each of the second candidate representation data; determining, according to the point density of the affected first representation data, the point densities of the second candidate representation data, and the point densities of the representation data in the candidate representation data set other than the affected first representation data, the second representation data with the largest point density in the second representation data set as candidate second representation data; and updating the point density of the candidate first representation data based on the point density of the candidate second representation data, and repeating the operation of determining whether the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold.

According to an embodiment of the present invention, determining the first representation data with the largest point density among the plurality of first representation data as the candidate first representation data includes: sorting the plurality of first representation data according to their respective point densities to obtain a representation data sequence; splitting the representation data sequence into a plurality of first representation data sub-sequences based on available computing resources, wherein each first representation data sub-sequence includes a preset number of first representation data; for each first representation data sub-sequence, determining the first representation data with the largest point density among its preset number of first representation data as first sub-sequence representation data; and determining the first sub-sequence representation data with the largest point density among the plurality of first sub-sequence representation data as the candidate first representation data.
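The sort, split and two-level maximum described above can be illustrated with a short sketch. The function name `densest_index` and the parameter `chunk_size` (standing in for the preset number per sub-sequence) are hypothetical, and the sub-sequences are scanned sequentially here, although each could be handed to a separate worker.

```python
def densest_index(densities, chunk_size):
    """Return the index of the representation with the largest point density.

    densities: point density of each first representation data.
    chunk_size: assumed preset number of items per sub-sequence.
    """
    # Sort indices by point density to obtain the representation data sequence.
    order = sorted(range(len(densities)), key=lambda i: densities[i], reverse=True)
    # Split the sequence into sub-sequences of at most chunk_size items.
    chunks = [order[s:s + chunk_size] for s in range(0, len(order), chunk_size)]
    # Local maximum of every sub-sequence (independent work per chunk).
    local_best = [max(chunk, key=lambda i: densities[i]) for chunk in chunks]
    # Global maximum among the local maxima.
    return max(local_best, key=lambda i: densities[i])
```

Keeping per-chunk maxima means that, after a later update touches only a few items, only the affected sub-sequences need to be rescanned rather than the whole sequence.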

According to an embodiment of the present invention, in response to determining that the number of second representation data contained in the second representation data set is greater than the preset quantity threshold, the data processing method further includes: determining, among the plurality of first representation data, affected first representation data associated with the candidate first representation data; storing the affected first representation data and the second candidate representation data in a to-be-sorted representation data sequence; for each first representation data sub-sequence, deleting from it the representation data related to the candidate first representation data and the neighboring representation data, as well as the affected first representation data, to obtain a second representation data sub-sequence; determining the point density of the affected first representation data and the point density of each second candidate representation data in the to-be-sorted representation data sequence; allocating the affected first representation data and the second candidate representation data in the to-be-sorted representation data sequence to the plurality of second representation data sub-sequences, according to those point densities and the point densities of the representation data in the plurality of second representation data sub-sequences, to obtain a plurality of third representation data sub-sequences; for each third representation data sub-sequence, determining the representation data with the largest point density in it as second sub-sequence representation data; determining the second sub-sequence representation data with the largest point density among the plurality of second sub-sequence representation data as candidate second representation data; and updating the point density of the candidate first representation data based on the point density of the candidate second representation data, and repeating the operation of determining whether the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold.

According to an embodiment of the present invention, each object data includes an object data identifier. Performing representation data extraction on the plurality of first representation data based on the number of second candidate representation data to obtain the second representation data set further includes: for each second candidate representation data, determining the candidate first representation data and the neighboring representation data associated with that second candidate representation data; determining the object data identifier of the candidate first representation data and the object data identifier of the neighboring representation data according to the object data identifiers corresponding to the plurality of object data; determining the object data identifier of the second candidate representation data according to the object data identifier of the candidate first representation data and the object data identifier of the neighboring representation data; and associating the object data identifier of the second candidate representation data with the second candidate representation data.

According to an embodiment of the present invention, determining the object data identifier of the second candidate representation data according to the object data identifier of the candidate first representation data and the object data identifier of the neighboring representation data includes: determining, among the object data identifiers of the candidate first representation data and of the neighboring representation data, the number of object data identifiers belonging to each data identifier category; and taking the object data identifier with the largest number as the object data identifier of the second candidate representation data.

According to an embodiment of the present invention, the number of second candidate representation data is positively correlated with the point density of the candidate first representation data.

According to an embodiment of the present invention, determining the point density of each first representation data includes: for each first representation data, determining, from the plurality of first representation data, the number of neighboring representation data associated with that first representation data based on the second distance threshold and the feature vector of the first representation data; and determining the number of neighboring representation data as the point density.
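The point density defined above is a neighbor count, which the following sketch makes concrete. The function name `point_densities` is hypothetical; Euclidean distance is assumed, and a representation is assumed not to count itself as its own neighbor, since the text does not settle either point.

```python
import math


def point_densities(reps, second_distance_threshold):
    """Point density of each representation = number of neighbors in range.

    Assumptions: Euclidean distance between feature vectors; an item is not
    counted as its own neighbor.
    """
    densities = []
    for i, rep in enumerate(reps):
        count = sum(
            1
            for j, other in enumerate(reps)
            if j != i and math.dist(rep, other) <= second_distance_threshold
        )
        densities.append(count)
    return densities
```

This brute-force version compares every pair and is quadratic in the number of representations; it is meant only to show the definition.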

According to an embodiment of the present invention, the data processing method further includes: in response to determining that no neighboring object data associated with the candidate object data exists in the plurality of object data, determining the candidate object data as the first candidate representation data.

According to an embodiment of the present invention, each object data includes an object data identifier. Generating the first candidate representation data according to the candidate object data and the neighboring object data further includes: determining the object data identifier of the candidate object data and the object data identifier of the neighboring object data according to the object data identifiers corresponding to the plurality of object data; determining the object data identifier of the first candidate representation data according to the object data identifier of the candidate object data and the object data identifier of the neighboring object data; and associating the object data identifier of the first candidate representation data with the first candidate representation data.

According to an embodiment of the present invention, determining the object data identifier of the first candidate representation data according to the object data identifier of the candidate object data and the object data identifier of the neighboring object data includes: determining, among the object data identifiers of the candidate object data and of the neighboring object data, the number of object data identifiers belonging to each data identifier category; and taking the object data identifier with the largest number as the object data identifier of the first candidate representation data.
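The identifier rule above amounts to a majority vote over the identifiers of the merged items, and the same rule is stated for the second candidate representation data. A minimal sketch, with the hypothetical function name `majority_identifier` and string identifiers assumed for illustration:

```python
from collections import Counter


def majority_identifier(identifiers):
    """Return the identifier that occurs most often among the merged items.

    identifiers: identifiers of the candidate and its neighbors (non-empty).
    How ties are broken is not specified in the text; Counter.most_common
    returns the first-encountered identifier among equally frequent ones.
    """
    return Counter(identifiers).most_common(1)[0][0]
```

For example, merging a candidate labeled "cat" with neighbors labeled "dog" and "cat" would give the representation the identifier "cat".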

According to another aspect of the present invention, a training method for a target object detection model is provided, including: acquiring a sample data set; and iteratively training a deep learning model using the sample data set until the output of the deep learning model satisfies an iteration stop condition or the cumulative number of training iterations reaches a preset count threshold, to obtain the target object detection model, wherein the sample data set is obtained using the data processing method described above.
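The two stopping rules of the training method can be shown with a toy loop. This sketch does not train a detection model: a one-parameter linear model fitted by gradient descent stands in for the deep learning model, and the function name `train`, the loss tolerance and the learning rate are all hypothetical choices made only to show the control flow.

```python
def train(samples, max_iterations=1000, loss_tolerance=1e-6, learning_rate=0.1):
    """Fit y = w * x by gradient descent with two stopping rules.

    samples: non-empty list of (x, y) pairs standing in for the sample data set.
    Stops when the mean squared error is small enough (stop condition) or when
    the iteration count reaches max_iterations (preset count threshold).
    """
    w = 0.0
    iterations = 0
    loss = float("inf")
    for iterations in range(1, max_iterations + 1):
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss <= loss_tolerance:
            break  # iteration stop condition satisfied
    return w, iterations, loss
```

Whichever rule fires first ends training, so a reduced sample data set shortens each iteration without changing this control flow.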

According to another aspect of the present invention, a target object detection method is provided, including: inputting data to be processed into a target object detection model to obtain a detection result for the data to be processed, wherein the target object detection model is trained using the training method for a target object detection model described above.

According to another aspect of the present invention, a data processing apparatus is provided, including: a first acquisition module for acquiring a plurality of object data; a first determination module for, upon determining that the number of the plurality of object data is greater than a preset quantity threshold, determining, for any candidate object data among the plurality of object data, neighboring object data associated with the candidate object data from the plurality of object data according to the feature vector of the candidate object data and a first distance threshold; a generation module for, in response to the presence of neighboring object data in the plurality of object data, generating first candidate representation data according to the candidate object data and the neighboring object data, wherein the first candidate representation data is used to replace the candidate object data and the neighboring object data; a processing module for processing the plurality of object data based on the candidate object data, the neighboring object data and the first candidate representation data to obtain a first representation data set; and a second determination module for, upon determining that the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, determining the first representation data set as a first sample data set.

According to another aspect of the present invention, a training apparatus for a target object detection model is provided, including: a second acquisition module for acquiring a sample data set; and a training module for iteratively training a deep learning model using the sample data set until the output of the deep learning model satisfies an iteration stop condition or the cumulative number of training iterations reaches a preset count threshold, to obtain the target object detection model, wherein the sample data set is obtained using the data processing apparatus described above.

According to another aspect of the present invention, a target object detection apparatus is provided, including: an input module for inputting data to be processed into a target object detection model to obtain a detection result for the data to be processed, wherein the target object detection model is trained using the training apparatus for a target object detection model described above.

According to another aspect of the present invention, an electronic device is provided, including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.

According to another aspect of the present invention, a computer-readable storage medium is provided, on which executable instructions are stored; when executed by a processor, the instructions cause the processor to implement the method described above.

According to another aspect of the present invention, a computer program product is provided, including a computer program which, when executed by a processor, implements the method described above.

Description of the drawings

To further illustrate the technical content of the present invention, a detailed description is given below with reference to examples and the accompanying drawings, in which:

Figure 1 is a schematic diagram of an exemplary system architecture to which the data processing method and apparatus, the training method and apparatus for a target object detection model, and the target object detection method and apparatus can be applied according to an embodiment of the present invention;

Figure 2 is a flow chart of a data processing method according to an embodiment of the present invention;

Figure 3 is a schematic diagram of the process of generating a first sample data set according to an embodiment of the present invention;

Figure 4 is a schematic diagram of the process of generating a second representation data set according to an embodiment of the present invention;

Figure 5 is a schematic diagram of reordering the point densities of the second representation data in the second representation data set according to an embodiment of the present invention;

Figure 6A is a schematic diagram of the effect of data classification using the related art;

Figure 6B is a schematic diagram of the effect of data classification based on the technical solution of the present invention;

Figure 7 is a flow chart of a training method for a target object detection model according to an embodiment of the present invention;

Figure 8 is a flow chart of a target object detection method according to an embodiment of the present invention;

Figure 9 is a block diagram of a data processing apparatus according to an embodiment of the present invention;

Figure 10 is a block diagram of a training apparatus for a target object detection model according to an embodiment of the present invention;

Figure 11 is a block diagram of a target object detection apparatus according to an embodiment of the present invention; and

Figure 12 is a block diagram of an electronic device used to implement the data processing method, the training method for a target object detection model, and the target object detection method according to embodiments of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the embodiments and the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

Figure 1 is a schematic diagram of an exemplary system architecture to which the data processing method and apparatus, the training method and apparatus for a target object detection model, and the target object detection method and apparatus can be applied according to an embodiment of the present invention. It should be noted that Figure 1 shows only an example of a system architecture to which embodiments of the present invention can be applied, to help those skilled in the art understand the technical content of the present invention; it does not mean that the embodiments of the present invention cannot be used in other devices, systems, environments or scenarios.

As shown in Figure 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, 103, for example target detection applications, web browser applications, search applications, instant messaging tools, email clients or social platform software (examples only).

The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, laptop computers and desktop computers.

The server 105 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud computing, network services and middleware services.

The server 105 may be a server that provides various services, for example a background management server (example only) that supports the websites browsed by users with the terminal devices 101, 102, 103. The background management server may analyze and otherwise process received data such as user requests, and feed the processing results (for example web pages, information or data obtained or generated according to the user requests) back to the terminal devices.

For example, the server 105 may acquire a plurality of object data from the terminal devices 101, 102, 103 through the network 104. Upon determining that the number of the plurality of object data is greater than the preset quantity threshold, the server 105 may process the plurality of object data to generate a sample data set, wherein the sample data contained in the sample data set can replace the plurality of object data.

In some embodiments, the server 105 may also use the above sample data set to train a deep learning model. After completing the training, the server 105 may use the trained deep learning model (for example, a target object detection model) to detect target objects in data to be processed. In some examples, the server 105 may also send the trained deep learning model (for example, the target object detection model) to the terminal devices 101, 102, 103, so that users can perform target object detection with the target object detection model on the terminal device.

It should be noted that the data processing method provided by the embodiments of the present invention may generally be executed by the server 105. Correspondingly, the data processing apparatus provided by the embodiments of the present invention may generally be arranged in the server 105. The data processing method may also be executed by a server or server cluster that is different from the server 105 and is able to communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the data processing apparatus may also be arranged in such a server or server cluster.

It should be noted that the training method of the target object detection model provided by the embodiments of the present invention may generally be executed by the server 105. Correspondingly, the training apparatus for the target object detection model may generally be arranged in the server 105. The training method may also be executed by a server or server cluster that is different from the server 105 and is able to communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the training apparatus may also be arranged in such a server or server cluster.

It should be noted that the target object detection method provided by the embodiments of the present invention may generally be executed by the server 105. Correspondingly, the target object detection apparatus may generally be arranged in the server 105. The target object detection method may also be executed by a server or server cluster that is different from the server 105 and is able to communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the target object detection apparatus may also be arranged in such a server or server cluster.

Alternatively, the target object detection method provided by the embodiments of the present invention may also be executed by the terminal device 101, 102, or 103. Correspondingly, the target object detection apparatus may also be arranged in the terminal device 101, 102, or 103.

It should be understood that the numbers of terminal devices, networks, and servers in Figure 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, depending on implementation needs.

It should be noted that the sequence number of each operation in the following methods serves only to identify that operation for the purpose of description and should not be taken to indicate an execution order. Unless explicitly stated otherwise, the methods need not be performed in exactly the order shown.

Figure 2 is a flow chart of a data processing method according to an embodiment of the present invention.

As shown in Figure 2, the data processing method 200 includes operations S210 to S250.

In operation S210, a plurality of object data are obtained.

In operation S220, when it is determined that the number of the plurality of object data is greater than a preset quantity threshold, then for any candidate object data among the plurality of object data, neighboring object data associated with that candidate object data is determined from the plurality of object data according to the feature vector of the candidate object data and a first distance threshold.

In operation S230, in response to neighboring object data being present among the plurality of object data, first candidate representation data is generated according to the candidate object data and the neighboring object data.

In operation S240, the plurality of object data are processed based on the candidate object data, the neighboring object data, and the first candidate representation data to obtain a first representation data set.

In operation S250, when it is determined that the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, the first representation data set is determined as a first sample data set.

According to embodiments of the present invention, the object data includes, but is not limited to, data of types such as text, image, video, audio, or location information carrying identification data (where the identification data is associated with the location information). The type may be chosen according to actual needs and is not limited here. In one example, the location information carrying identification data may include latitude and longitude information carrying administrative area code information. Of course, the present invention is not limited to this.

The plurality of object data may be used, for example, to construct a sample data set for model training. Of course, the uses of the plurality of object data are not limited to this. For simplicity, the following description takes the construction of a sample data set from the plurality of object data as an example.

According to embodiments of the present invention, the preset quantity threshold indicates the expected amount of sample data in the sample data set to be generated. The preset quantity threshold may be set according to the actual situation, and the present invention does not limit it.

In embodiments of the present invention, the number of the plurality of object data may be compared with the preset quantity threshold to decide whether the plurality of object data need further processing in order to obtain a sample data set with the expected amount of data.

For example, if the number of the plurality of object data is greater than the preset quantity threshold, that is, greater than the expected amount of data, the plurality of object data contain redundant object data. In this case the plurality of object data need to be processed to obtain a sample data set with the expected amount of data. Conversely, if the number of the plurality of object data is less than or equal to the preset quantity threshold, the number is already suitable; in this case the plurality of object data need not be processed, and the sample data set can be obtained directly from them.

In embodiments of the present invention, when the number of the plurality of object data is determined to be greater than the preset quantity threshold, then for any candidate object data among the plurality of object data, it is determined, according to the feature vector of the candidate object data and the first distance threshold, whether the plurality of object data contain neighboring object data associated with the candidate object data. Neighboring object data refers to object data among the plurality of object data that has features similar to those of the candidate object data.

It can be understood that, for any two object data, the more similar their features are, the smaller the distance between them, and vice versa. The distance between object data can therefore be used to determine the neighboring object data associated with the candidate object data.

For example, for any candidate object data, the distance between the feature vector of that candidate object data and the feature vector of another candidate object data is determined, where the other candidate object data is any object data among the plurality of object data other than the candidate object data itself. If the distance between the two feature vectors is less than or equal to the first distance threshold, the other candidate object data is determined to be neighboring object data associated with the candidate object data. Conversely, if the distance is greater than the first distance threshold, the other candidate object data is not regarded as neighboring object data associated with the candidate object data. In this way it can be determined whether the plurality of object data contain neighboring object data associated with the candidate object data, and how many.
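The neighbor test described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes feature vectors are plain tuples of floats, uses Euclidean distance, and treats the first distance threshold as a single scalar. The names `find_neighbors`, `vectors`, and `first_distance_threshold` are hypothetical.

```python
import math

def find_neighbors(vectors, candidate_idx, first_distance_threshold):
    """Return indices of every other item whose Euclidean distance to the
    candidate's feature vector is <= the first distance threshold."""
    candidate = vectors[candidate_idx]
    neighbors = []
    for idx, other in enumerate(vectors):
        if idx == candidate_idx:
            continue  # the candidate is not its own neighbor
        if math.dist(candidate, other) <= first_distance_threshold:
            neighbors.append(idx)
    return neighbors

# Hypothetical 2-D feature vectors; index 0 is the candidate.
vectors = [(0.0, 0.0), (0.5, 0.0), (0.0, 1.0), (3.0, 4.0)]
neighbor_ids = find_neighbors(vectors, 0, 1.0)
```

An empty result means that no neighboring object data exists for the candidate; the length of the result gives the number of neighbors.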

It should be noted that the first distance threshold described above may be one or more preset distance values, or one or more preset distance ranges. It may be set according to the actual situation and is not limited here.

In addition, the distance between the feature vector of the candidate object data and the feature vector of another candidate object data may be computed with any one or more distance calculation methods from the related art, including but not limited to Euclidean distance, Mahalanobis distance, and Manhattan distance. The present invention does not limit this.
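As an illustration of swapping distance measures, the sketch below implements two of the named metrics in pure Python and selects between them by name. The `DISTANCES` registry and the `distance` helper are hypothetical conveniences; Mahalanobis distance is omitted here because it additionally requires an estimate of the covariance matrix of the feature vectors.

```python
import math

def euclidean(u, v):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    """City-block (L1) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

# Hypothetical registry so the metric can be chosen by configuration.
DISTANCES = {"euclidean": euclidean, "manhattan": manhattan}

def distance(u, v, metric="euclidean"):
    return DISTANCES[metric](u, v)

d_l2 = distance((0.0, 0.0), (3.0, 4.0))
d_l1 = distance((0.0, 0.0), (3.0, 4.0), metric="manhattan")
```

Because different metrics produce values on different scales, the first distance threshold would normally be chosen together with the metric.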

In embodiments of the present invention, if it is determined that the plurality of object data contain at least one neighboring object data associated with the candidate object data, the first candidate representation data may be generated according to the candidate object data and the at least one neighboring object data.

Because each neighboring object data has features similar to those of the candidate object data, the at least one neighboring object data and the candidate object data can be regarded as object data of the same class. The first candidate representation data can therefore be generated within the range that contains the candidate object data and the at least one neighboring object data. The first candidate representation data is used in place of the candidate object data and the at least one neighboring object data. In other words, the first candidate representation data can represent the candidate object data and the at least one neighboring object data, which facilitates subsequent processing of the plurality of object data.

In one example, a cluster center may be generated from the candidate object data and the at least one neighboring object data using a clustering algorithm, and that cluster center may be determined as the first candidate representation data. In embodiments of the present invention, a suitable clustering algorithm may be selected according to actual needs to generate the cluster center, which is not limited here.
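The text leaves the clustering algorithm open. One simple choice, assumed here purely for illustration, is to take the arithmetic mean of the feature vectors in the group (the centroid, as used for a single cluster in k-means). The function name `cluster_center` is hypothetical.

```python
def cluster_center(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors.
    Assumption: the centroid stands in for the 'cluster center'."""
    n = len(vectors)
    dim = len(vectors[0])
    return tuple(sum(v[i] for v in vectors) / n for i in range(dim))

# Candidate plus three neighbors (hypothetical coordinates).
group = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
center = cluster_center(group)
```

Other choices, such as the group member closest to the mean (a medoid), would keep the representation data equal to an actual object data item.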

Next, the plurality of object data are processed based on the candidate object data, the at least one neighboring object data associated with it, and the first candidate representation data, to obtain the first representation data set.

In embodiments of the present invention, the first candidate representation data can represent the candidate object data and the at least one neighboring object data. Therefore, when the plurality of object data are processed, the first candidate representation data can replace the candidate object data and the at least one neighboring object data among the plurality of object data to obtain the first representation data set. The first representation data set includes a plurality of first representation data, which together can represent the plurality of object data. Processing the plurality of object data in this way reduces the redundant data among them, which helps obtain sample data of the expected amount.
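The replacement step can be sketched as a small list operation: remove the candidate and its neighbors, then add the representation data. The function name and the index-based interface are assumptions made for this illustration.

```python
def replace_with_representative(items, candidate_idx, neighbor_idxs, representative):
    """Return a new list in which the candidate and its neighbors are
    removed and a single representative item is appended."""
    removed = {candidate_idx, *neighbor_idxs}
    remaining = [item for idx, item in enumerate(items) if idx not in removed]
    return remaining + [representative]

# Hypothetical labels standing in for object data.
items = ["P_a", "P_b", "P_c", "P_d", "P_e"]
first_representation_set = replace_with_representative(items, 3, [0, 1, 2], "ct1")
```

Each such step shrinks the set by the number of neighbors found, so a candidate without neighbors leaves the size unchanged.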

After that, it is determined whether the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, that is, whether it is less than or equal to the expected amount of data. If so, the first representation data set is determined as the first sample data set.

If the number of first representation data contained in the first representation data set is greater than the preset quantity threshold, another candidate object data may be determined from the plurality of object data and used to update the candidate object data, and the above operations are repeated until the number of first representation data contained in the first representation data set reaches the preset quantity threshold. Alternatively, if all object data have been processed and none of the first representation data sets obtained contains a number of first representation data that reaches the preset quantity threshold, execution of operations S220 to S250 stops.

According to embodiments of the present invention, when the number of the plurality of object data is determined to exceed the expected amount of data, the first candidate representation data is generated from the candidate object data and the neighboring object data associated with it. The first candidate representation data then replaces the candidate object data and the neighboring object data among the plurality of object data to obtain the first sample data set. This reduces the redundant data among the plurality of object data and yields a first sample data set with the expected amount of data.
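Putting operations S210 to S250 together, the overall loop might look like the sketch below. It is written under several assumptions that the text does not fix: candidates are taken in input order, neighbors are searched only among object data not yet replaced, the distance is Euclidean, and the representation data is the centroid of the group. The names `reduce_objects` and `centroid` are hypothetical.

```python
import math

def centroid(vectors):
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def reduce_objects(vectors, preset_quantity_threshold, first_distance_threshold):
    """Return (representation_set, reached), where `reached` tells whether
    the set size is <= the preset quantity threshold."""
    if len(vectors) <= preset_quantity_threshold:
        return list(vectors), True  # no redundancy to remove
    remaining = list(vectors)       # object data not yet replaced
    representatives = []            # representation data generated so far
    while remaining and len(remaining) + len(representatives) > preset_quantity_threshold:
        candidate = remaining[0]
        # The group holds the candidate itself (distance 0) and its neighbors.
        group = [v for v in remaining
                 if math.dist(candidate, v) <= first_distance_threshold]
        representatives.append(centroid(group) if len(group) > 1 else candidate)
        remaining = [v for v in remaining
                     if math.dist(candidate, v) > first_distance_threshold]
    result = representatives + remaining
    return result, len(result) <= preset_quantity_threshold

data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (20.0, 20.0)]
sample_set, reached = reduce_objects(data, 3, 1.5)
```

When `reached` is false, every object has been visited without meeting the expected amount, which corresponds to the stopping condition described above.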

Figure 3 is a schematic diagram of the process of generating the first sample data set according to an embodiment of the present invention. The process is illustrated below with reference to Figure 3. In embodiments of the present invention, each data item may be shown as a data point. For example, when the object data is video, image, or audio, the object data may be input into a trained neural network for feature extraction to obtain the corresponding feature vector (for example, ResNet, VGG, Transformer, or other networks that can process images; correspondingly, BLSTM-RNN, WaveNet, GPT, or other networks that can process audio). The feature vector may be represented as a data point or in other forms (for example, lines or surfaces). It should be understood that this representation only serves to aid understanding of the technical solutions described in the embodiments of the present invention, and the solutions of the present invention are not limited to it.

As shown at 310 in Figure 3, the plurality of object data include, for example, object data P_a, P_b, P_c, P_d, P_e, P_f, P_g, P_h, P_j, P_k, and P_m.

When the number of the plurality of object data is determined to be greater than the preset quantity threshold, any one of the plurality of object data may be taken as candidate object data; for example, object data P_d may be determined as a candidate object data. According to the feature vector of object data P_d and the first distance threshold, the neighboring object data associated with this candidate can be determined from the plurality of object data. For example, all object data within the range centered on object data P_d with the first distance threshold as its radius may be determined as neighboring object data associated with the candidate object data (that is, object data P_d). Here, object data P_a, P_b, and P_c are determined as the neighboring object data associated with object data P_d.

After that, the first candidate representation data is generated according to the candidate object data and all of its neighboring object data. As shown at 320 in Figure 3, object data P_a, P_b, P_c, and P_d may, for example, be clustered to generate a cluster center, and that cluster center is taken as the first candidate representation data (ct₁ in 320). The first candidate representation data ct₁ can be used in place of object data P_a, P_b, P_c, and P_d.

Next, the candidate object data and the neighboring object data are deleted from the plurality of object data to obtain a candidate object data set. The first representation data set is then determined according to the candidate object data set and the first candidate representation data.

As shown at 320 and 330 in Figure 3, for example, object data P_d and the neighboring object data associated with it (that is, object data P_a, P_b, and P_c) are deleted from the plurality of object data to obtain the candidate object data set. The first representation data set is then obtained from the candidate object data set and the first candidate representation data ct₁. The first representation data set includes the first candidate representation data ct₁ and object data P_e, P_f, P_g, P_h, P_j, P_k, and P_m. Each data item contained in the first representation data set is a first representation data.

Next, it may be determined whether the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold. If so, the first representation data set is determined as the first sample data set. If not, another candidate object data is determined from the plurality of object data and processed with the procedure described above to obtain another first representation data set.

Suppose the number of first representation data contained in the above first representation data set is greater than the preset quantity threshold. As shown at 330 in Figure 3, object data P_g (example only) may then be determined as another candidate object data, and the neighboring object data associated with it may be determined from the plurality of object data according to the feature vector of object data P_g and the first distance threshold. For example, the neighboring object data associated with object data P_g include object data P_f and P_h.

Next, another first candidate representation data is generated according to the other candidate object data and all of its neighboring object data. As shown at 330 in Figure 3, object data P_g, P_f, and P_h may, for example, be clustered to generate another cluster center, which is taken as another first candidate representation data (ct₂ in 330). The first candidate representation data ct₂ can be used in place of object data P_g, P_f, and P_h.

Next, the other candidate object data and its neighboring object data are deleted from the plurality of object data to obtain another candidate object data set. Another first representation data set is then determined according to the other candidate object data set and the other first candidate representation data.

As shown at 330 and 340 in Figure 3, for example, object data P_g and the neighboring object data associated with it (that is, object data P_f and P_h) are deleted from the plurality of object data to obtain another candidate object data set. Another first representation data set is then obtained from the other candidate object data set and the first candidate representation data ct₂. This other first representation data set includes the first candidate representation data ct₁ and ct₂ and object data P_e, P_j, P_k, and P_m. Each data item contained in it is an updated first representation data.

Next, it may be determined whether the number of updated first representation data is less than or equal to the preset quantity threshold. If so, the other first representation data set is determined as the first sample data set. If not, another candidate object data is determined from the plurality of object data and processed with the procedure described above to obtain another first representation data set.

Suppose the number of updated first representation data is still greater than the preset quantity threshold. Another candidate object data may then be determined from the remaining object data (object data P_e, P_j, P_k, and P_m). For example, object data P_e may be determined as another candidate object data. The neighboring object data associated with it can then be determined from the plurality of object data according to the feature vector of object data P_e and the first distance threshold. As shown at 340 in Figure 3, no neighboring object data associated with object data P_e exists within the range centered on P_e with the first distance threshold as its radius. In this case, the other candidate object data (that is, object data P_e) itself may be determined as another first candidate representation data, denoted ct₃.

Next, the other candidate object data (that is, object data P_e) is deleted from the plurality of object data to obtain another candidate object data set. Another first representation data set is then determined according to the other candidate object data set and the first candidate representation data ct₃. This other first representation data set includes the first candidate representation data ct₁, ct₂, and ct₃ and object data P_j, P_k, and P_m. Each data item contained in it is an updated first representation data.

Next, it is determined whether the number of updated first representation data contained in the other first representation data set is less than or equal to the preset quantity threshold. If so, the other first representation data set is determined as the first sample data set. If not, another candidate object data is determined from the remaining object data (object data P_j, P_k, and P_m) and processed with the procedure described above. These operations are repeated until the number of first representation data contained in the first representation data set reaches the preset quantity threshold, or until it is detected that the other candidate object data set no longer contains any object data, at which point the operations stop.
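The Figure 3 walk-through can be traced with a small script. All numeric values below are invented for illustration: the 2-D coordinates, the radius of 1.5, and the preset quantity threshold of 5 are hypothetical and chosen only so that the groups match the text (P_d with P_a, P_b, P_c; P_g with P_f, P_h; P_e alone). The fourth step, in which P_j and P_k are grouped, is likewise an assumed continuation that the text does not describe. The centroid is assumed as the cluster center.

```python
import math

# Hypothetical coordinates for the eleven object data of Figure 3.
points = {
    "a": (0, 1), "b": (1, 0), "c": (-1, 0), "d": (0, 0),
    "e": (5, 5),
    "f": (9, 0), "g": (10, 0), "h": (11, 0),
    "j": (0, 10), "k": (1, 10), "m": (10, 10),
}
RADIUS = 1.5      # first distance threshold (assumed value)
THRESHOLD = 5     # preset quantity threshold (assumed value)
candidate_order = ["d", "g", "e", "j"]  # order used in the walk-through

remaining = dict(points)  # object data not yet replaced
reps = {}                 # ct1, ct2, ... generated so far
trace = []                # (representative, group members, set size)
for step, name in enumerate(candidate_order, start=1):
    if len(remaining) + len(reps) <= THRESHOLD:
        break             # expected amount reached
    center = remaining[name]
    group = [n for n, p in remaining.items() if math.dist(center, p) <= RADIUS]
    members = [remaining[n] for n in group]
    reps[f"ct{step}"] = tuple(sum(c) / len(members) for c in zip(*members))
    for n in group:
        del remaining[n]
    trace.append((f"ct{step}", sorted(group), len(remaining) + len(reps)))
```

With these assumed values the set size goes from 11 to 8, then 6, stays at 6 when P_e becomes ct₃ on its own, and reaches 5 after the fourth step.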

In embodiments of the present invention, when the number of the plurality of object data is determined to exceed the expected amount of data, first candidate representation data is generated from each candidate object data and its corresponding neighboring object data. The first candidate representation data then replaces the candidate object data and the neighboring object data among the plurality of object data to obtain the first sample data set. This reduces the redundant data among the plurality of object data and yields a first sample data set with the expected amount of data.

According to embodiments of the present invention, each of the above object data may include an object data identifier. The object data identifier indicates descriptive information associated with the object data. The descriptive information includes, but is not limited to, the category to which the object data belongs, the target object contained in the object data and the category to which that target object belongs, or other information related to the object data.

In the above process of generating the first candidate representation data, the object data identifier corresponding to the first candidate representation data may also be determined from the object data identifiers of the individual object data.

In one example, when it is determined that neighboring object data associated with the candidate object data exists among the plurality of object data, the object data identifier of the candidate object data and the object data identifiers of the neighboring object data may be determined from the object data identifiers corresponding to the respective object data.

Next, the object data identifier of the first candidate representation data is determined from the object data identifier of the candidate object data and the object data identifiers of the neighboring object data. For example, among these identifiers, the number of object data identifiers belonging to each identifier category may be counted. The object data identifier with the largest count is then taken as the object data identifier of the first candidate representation data.

For example, the object data identifier of candidate object data P_d is object data identifier 1. The neighboring object data associated with candidate object data P_d includes object data P_a, object data P_b and object data P_c, whose object data identifiers are object data identifier 1, object data identifier 2 and object data identifier 1, respectively. Pooling the object data identifier of candidate object data P_d with those of the neighboring object data gives one object data identifier 2 and three object data identifiers 1. The identifier with the largest count, namely object data identifier 1, is taken as the object data identifier of the first candidate representation data.

Afterwards, the object data identifier of the first candidate representation data is associated with the first candidate representation data. For example, the first candidate representation data may be stored in association with object data identifier 1, so that object data identifier 1 can subsequently be used to describe the relevant information contained in the first candidate representation data.

In this embodiment of the present invention, the object data identifier of the first candidate representation data is determined from the object data identifier of the candidate object data and the object data identifiers of the neighboring object data. In this way, when the first candidate representation data replaces the candidate object data and the neighboring object data, its object data identifier also represents the object data identifiers of the data it replaces.

In another example, when it is determined that no neighboring object data associated with the candidate object data exists among the plurality of object data, the object data identifier of the candidate object data may be determined as the object data identifier of the first candidate representation data.

For example, the object data identifier of candidate object data P_e is object data identifier 3. If it is determined that no neighboring object data associated with candidate object data P_e exists among the plurality of object data, candidate object data P_e may be determined as the first candidate representation data, and object data identifier 3 may be determined as the object data identifier of the first candidate representation data.
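As a non-limiting illustration, the two cases above (neighboring object data present, and no neighboring object data) may be sketched in Python as follows. The function name `majority_identifier` is hypothetical and does not appear in this disclosure, and the handling of ties (the identifier encountered first wins) is an assumption, since the description does not specify it.

```python
from collections import Counter


def majority_identifier(candidate_id, neighbor_ids):
    """Return the identifier occurring most often among the candidate
    object data and its neighboring object data.

    With no neighbors, the only identifier counted is the candidate's own,
    so the candidate's identifier is returned unchanged.
    """
    counts = Counter([candidate_id, *neighbor_ids])
    # most_common(1) yields [(identifier, count)]; ties go to the
    # identifier seen first (an assumption, not stated in the source).
    return counts.most_common(1)[0][0]


# P_d has identifier 1; its neighbors P_a, P_b, P_c have identifiers 1, 2, 1.
print(majority_identifier(1, [1, 2, 1]))  # 1
# P_e has identifier 3 and no neighbors.
print(majority_identifier(3, []))  # 3
```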

In some embodiments, if it is detected that the candidate object data set contains no object data and it is determined that the number of first representation data contained in the first representation data set is greater than the preset quantity threshold, the plurality of first representation data in the first representation data set may further be processed by a clustering method to obtain a second sample data set with the expected data amount. This is illustrated below with reference to specific embodiments.

First, when it is detected that the candidate object data set contains no object data and it is determined that the number of first representation data contained in the first representation data set is greater than the preset quantity threshold, the point density of each first representation data may be determined.

For example, for each first representation data, the number of neighboring representation data associated with that first representation data is determined from the plurality of first representation data based on a second distance threshold and the feature vector of the first representation data. The number of neighboring representation data is then determined as the point density of that first representation data.

In this embodiment of the present invention, neighboring representation data refers to first representation data in the first representation data set that has features similar to those of the first representation data in question. The process of determining the neighboring representation data associated with a first representation data, based on the second distance threshold and the feature vector of that first representation data, is similar to the process of determining neighboring object data described above and is not repeated here.

It should be noted that the second distance threshold differs from the first distance threshold. The second distance threshold may be one or more preset distance values, or one or more preset distance ranges; it may be set according to the actual situation and is not limited here.

Next, the first representation data with the highest point density among the plurality of first representation data is determined as the candidate first representation data.

For example, after the point density of each of the plurality of first representation data has been determined, the point densities are sorted. Based on the sorting result, the first representation data with the highest point density is determined as the candidate first representation data.
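As a non-limiting illustration, the point-density computation and the selection of the candidate first representation data may be sketched in Python as follows. The function names are hypothetical. Euclidean distance between feature vectors is an assumption made only for this sketch, as is the convention that a first representation data does not count itself among its own neighbors.

```python
import math


def point_density(index, vectors, second_threshold):
    """Count the other feature vectors lying within second_threshold
    of vectors[index] (Euclidean distance is assumed here)."""
    center = vectors[index]
    return sum(
        1
        for i, v in enumerate(vectors)
        if i != index and math.dist(center, v) <= second_threshold
    )


def densest_index(vectors, second_threshold):
    """Return (index of the densest vector, list of all point densities)."""
    densities = [
        point_density(i, vectors, second_threshold)
        for i in range(len(vectors))
    ]
    # max() returns the first index when several densities are equal.
    best = max(range(len(vectors)), key=densities.__getitem__)
    return best, densities


vectors = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(densest_index(vectors, 1.0))  # (0, [2, 1, 1, 0])
```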

Next, based on the point density of the candidate first representation data, the number of second candidate representation data associated with the candidate first representation data is determined.

In this embodiment of the present invention, the second candidate representation data may be used to replace the candidate first representation data and the neighboring representation data associated with it. Replacing the candidate first representation data and its neighboring representation data with the second candidate representation data reduces redundant data among the plurality of first representation data, so that a second sample data set with the expected data amount is obtained.

To allow the second candidate representation data to better stand in for the candidate first representation data and its neighboring representation data, in this embodiment of the present invention the number of second candidate representation data associated with the candidate first representation data may be determined based on the point density of the candidate first representation data.

It can be understood that the higher the point density of a first representation data, the more neighboring representation data is associated with it, and correspondingly the more second candidate representation data is needed to represent the candidate first representation data and its neighboring representation data. Conversely, the lower the point density, the less neighboring representation data is associated with it, and fewer second candidate representation data suffice. In other words, the number of second candidate representation data is positively correlated with the point density of the candidate first representation data.

In one example, the following formula (1) may be used to determine the number of second candidate representation data.

In formula (1), N_cluster denotes the number of second candidate representation data, N_density denotes the point density of the candidate first representation data, N_e denotes the expected data amount of the sample data, and N_retain denotes the number of the plurality of first representation data.

Next, based on the number of second candidate representation data, representation data is extracted from the plurality of first representation data to obtain a second representation data set.

For example, the neighboring representation data associated with the candidate first representation data may be determined from the plurality of first representation data according to the feature vector of the candidate first representation data and the second distance threshold. This process is similar to the process of determining neighboring object data described above and is not repeated here.

Afterwards, based on the number of second candidate representation data, the candidate first representation data and the neighboring representation data are clustered to obtain that number of second candidate representation data. For example, a clustering algorithm from the related art, including but not limited to the k-means clustering algorithm, may be used to cluster the candidate first representation data and the neighboring representation data to obtain that number of second candidate representation data.

Afterwards, that number of second candidate representation data replaces the candidate first representation data and its neighboring representation data in the plurality of first representation data, thereby reducing redundant data among the plurality of first representation data. For example, the candidate first representation data and the neighboring representation data may be deleted from the plurality of first representation data to obtain a candidate representation data set. The second representation data set is then determined from the candidate representation data set and the second candidate representation data. The second representation data set includes the first representation data other than the candidate first representation data and the neighboring representation data, together with the second candidate representation data. Each item of data in the second representation data set is a second representation data.
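As a non-limiting illustration, the clustering and replacement steps may be sketched in Python as follows. The sketch makes several assumptions that are not taken from this disclosure: Euclidean distance between feature vectors, a minimal Lloyd-style k-means seeded with the first k points, and the use of each cluster centroid as one second candidate representation data. The names `kmeans`, `replace_with_clusters` and `n_cluster` are hypothetical; `n_cluster` is supplied by the caller rather than computed from formula (1).

```python
import math


def kmeans(points, k, iterations=20):
    """Minimal Lloyd-style k-means; the first k points seed the centroids
    (an arbitrary choice made for this sketch)."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        # Recompute each centroid as the mean of its group; an empty
        # group keeps its previous centroid.
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids


def replace_with_clusters(vectors, candidate_index, second_threshold, n_cluster):
    """Replace the candidate and its neighbors by n_cluster centroids."""
    center = vectors[candidate_index]
    region, rest = [], []
    for i, v in enumerate(vectors):
        inside = i == candidate_index or math.dist(center, v) <= second_threshold
        (region if inside else rest).append(v)
    k = min(n_cluster, len(region))  # cannot form more clusters than points
    # 'rest' plays the role of the candidate representation data set.
    return rest + kmeans(region, k)


vectors = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (10, 10)]
second_set = replace_with_clusters(vectors, 0, 1.0, 2)
print(len(second_set))  # 3
```

In this usage example, four nearby vectors are replaced by two centroids, while the distant vector (10, 10) is carried over unchanged.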

Next, it is determined whether the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold, that is, whether the number of second representation data is less than or equal to the expected data amount. If so, the second representation data set is determined as the second sample data set.

If it is determined that the number of second representation data contained in the second representation data set is still greater than the preset quantity threshold, another candidate first representation data may be determined from the second representation data set as candidate second representation data, and the point density of the candidate first representation data may be updated based on the point density of the candidate second representation data. The above operation of determining whether the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold is then repeated until that number is less than or equal to the preset quantity threshold, or until the number of repetitions reaches a predetermined number.

An example process of determining the candidate second representation data is described below with reference to specific embodiments.

It can be understood that the second representation data set is obtained by replacing the candidate first representation data and the neighboring representation data in the plurality of first representation data with the second candidate representation data. The point densities of the second representation data in the second representation data set therefore change accordingly and need to be recalculated in order to determine the second representation data with the highest point density in the second representation data set, namely the candidate second representation data.

FIG. 4 is a schematic diagram of a process of generating a second representation data set according to an embodiment of the present invention. The change in the point density of each second representation data in the second representation data set is illustrated below with reference to FIG. 4.

As shown at 440 in FIG. 4, the plurality of first representation data includes, for example, first representation data ct₁, first representation data ct₂, first representation data ct₃, first representation data ct₄, first representation data ct₅, first representation data ct₆, and so on.

For example, let the second distance threshold be lc. For each first representation data, the number of neighboring representation data associated with that first representation data is determined from the plurality of first representation data based on the second distance threshold lc and the feature vector of the first representation data, and this number is determined as the point density of that first representation data. Taking first representation data ct₁ as an example, it is determined, based on the second distance threshold lc and the feature vector of the first representation data, that nine neighboring representation data are associated with first representation data ct₁. These nine neighboring representation data lie within a density region centered on first representation data ct₁ with the second distance threshold lc as its radius, and include, for example, first representation data ct₂ and first representation data ct₆.

Afterwards, the point densities of the plurality of first representation data are sorted, and based on the sorting result the first representation data with the highest point density is determined as the candidate first representation data. For example, the first representation data with the highest point density is first representation data ct₁, so the candidate first representation data is first representation data ct₁.

Next, based on the point density of the candidate first representation data, the number of second candidate representation data associated with the candidate first representation data is determined. For example, based on formula (1) above, this number may be determined to be four.

Next, based on the number of second candidate representation data, the candidate first representation data and the neighboring representation data are clustered to obtain that number of second candidate representation data.

As shown at 450 in FIG. 4, for example, based on the number of second candidate representation data, first representation data ct₁ and its neighboring representation data are clustered to generate four second candidate representation data, for example second candidate representation data cp₁, second candidate representation data cp₂, second candidate representation data cp₃ and second candidate representation data cp₄.

Next, that number of second candidate representation data replaces the candidate first representation data and its neighboring representation data in the plurality of first representation data to obtain the second representation data set.

As shown at 460 in FIG. 4, for example, second candidate representation data cp₁ to cp₄ may replace first representation data ct₁ and the neighboring representation data of first representation data ct₁ to obtain the second representation data set. The second representation data set includes the four second candidate representation data and the first representation data other than first representation data ct₁ and its neighboring representation data. Each representation data contained in the second representation data set is a second representation data.

In the above process of generating the second representation data set, the first representation data whose distance from the candidate first representation data (that is, first representation data ct₁) is less than or equal to the second distance threshold lc are the neighboring representation data of first representation data ct₁ (for example, first representation data ct₂ shown at 440). These neighboring representation data and the candidate first representation data are replaced; that is, they do not exist in the second representation data set.

In addition, for first representation data whose distance from the candidate first representation data is greater than one times and less than or equal to two times the second distance threshold lc, for example first representation data ct₃ shown at 440, the density region of the candidate first representation data overlaps the density region of first representation data ct₃. In other words, the candidate first representation data and first representation data ct₃ share neighboring representation data (for example, first representation data ct₆). After the second candidate representation data replaces the candidate first representation data and the neighboring representation data in the plurality of first representation data, the point density of first representation data such as ct₃ in the second representation data set changes under the influence of the candidate first representation data (that is, first representation data ct₁). The point densities of these affected first representation data therefore need to be recalculated.

In addition, first representation data whose distance from the candidate first representation data is greater than two times the second distance threshold lc, for example the first representation data within the density region of first representation data ct₄ shown at 440, are independent of the candidate first representation data and its neighboring representation data. These data are therefore not affected by the candidate first representation data and the neighboring representation data, and their point densities do not need to be recalculated.

Furthermore, the second representation data set also includes the newly generated second candidate representation data. The distances among these second candidate representation data, and the distances between them and the first representation data in the second representation data set other than the candidate first representation data and the neighboring representation data, change accordingly (as shown at 470 in FIG. 4). The point densities of the newly generated second candidate representation data therefore also need to be calculated.

It follows from the above that, when determining the candidate second representation data, point densities need to be calculated only for the affected first representation data and the second candidate representation data, rather than for all second representation data in the second representation data set. Data processing efficiency is thereby improved and computing resources are saved.

According to an embodiment of the present invention, the candidate second representation data may be determined in the following manner.

First, the affected first representation data associated with the candidate first representation data is determined among the plurality of first representation data. For example, first representation data whose distance from the candidate first representation data is greater than one times and less than or equal to two times the second distance threshold may be determined as the affected first representation data.
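As a non-limiting illustration, the three distance bands described above (replaced, affected, unaffected) may be sketched in Python as follows. The function name `classify_by_distance` is hypothetical, and Euclidean distance between feature vectors is an assumption made only for this sketch.

```python
import math


def classify_by_distance(vectors, candidate_index, lc):
    """Split indices into three groups relative to the candidate:

    replaced   : the candidate itself and vectors at distance <= lc
    affected   : vectors at distance in (lc, 2 * lc]
    unaffected : vectors at distance > 2 * lc
    """
    center = vectors[candidate_index]
    replaced, affected, unaffected = [], [], []
    for i, v in enumerate(vectors):
        if i == candidate_index:
            replaced.append(i)
            continue
        d = math.dist(center, v)
        if d <= lc:
            replaced.append(i)
        elif d <= 2 * lc:
            affected.append(i)
        else:
            unaffected.append(i)
    return replaced, affected, unaffected


vectors = [(0, 0), (0.5, 0), (1.5, 0), (2, 0), (3, 0)]
print(classify_by_distance(vectors, 0, 1.0))  # ([0, 1], [2, 3], [4])
```

Only the indices in the second group need their point densities recomputed; those in the third group keep their previous values.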

Next, the point densities of the affected first representation data and the point densities of the second candidate representation data are determined respectively.

It should be noted that, in this embodiment of the present invention, when determining the point densities of the affected first representation data and of the second candidate representation data, for each representation data the number of neighboring representation data associated with that representation data is determined from the second representation data set based on the second distance threshold and the feature vector of that representation data. This number is then determined as the point density of that representation data.

Next, according to the point densities of the affected first representation data, the point densities of the second candidate representation data, and the point densities of the representation data in the candidate representation data set other than the affected first representation data, the second representation data with the highest point density in the second representation data set is determined as the candidate second representation data.

In some embodiments, the point densities of the second representation data in the second representation data set may be reordered using an ordered queue in order to obtain the candidate second representation data. The point density reordering process is illustrated below with reference to FIG. 5.

FIG. 5 is a schematic diagram of reordering the point densities of the second representation data in the second representation data set according to an embodiment of the present invention.

As shown at 501 in FIG. 5, for a plurality of first representation data, for example first representation data ct₁ to first representation data ct_n, where n is an integer greater than 1, after the point densities of first representation data ct₁ to ct_n have been determined, these first representation data may be sorted by point density to obtain representation data sequence 51. From representation data sequence 51, the first representation data with the highest point density among the plurality of first representation data, namely first representation data ct₁, can be obtained and taken as the candidate first representation data.

Next, based on the point density of the candidate first representation data, the number of second candidate representation data associated with the candidate first representation data is determined. For example, the number of second candidate representation data associated with first representation data ct₁ may be determined to be m, where m is a positive integer.

Next, based on the number of second candidate representation data, the candidate first representation data and its neighboring representation data are clustered to obtain that number of second candidate representation data. For example, clustering first representation data ct₁ and its neighboring representation data (including, for example, first representation data ct_g and first representation data ct_j, where g and j are positive integers less than n) yields m second candidate representation data, for example second candidate representation data cp₁, second candidate representation data cp₂, ..., second candidate representation data cp_m.

Next, the affected first representation data associated with the candidate first representation data is determined among the plurality of first representation data. For example, the affected first representation data associated with first representation data ct₁ is determined to include first representation data ct₂.

Next, as shown at 502 in Figure 5, first representation data ct₁ and its neighboring representation data (for example first representation data ct_g, first representation data ct_j, and so on) can be removed from representation data sequence 51.

In addition, as shown at 503 in Figure 5, the affected first representation data in representation data sequence 51, for example first representation data ct₂, can be moved into a to-be-sorted representation data sequence 52. The m second candidate representation data (for example cp₁ to cp_m) are also moved into sequence 52. Once the point density of each representation data in sequence 52 has been determined, each of them can then be reinserted into representation data sequence 51.
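The bookkeeping in steps 502 and 503 can be sketched as follows. This is only an illustration under stated assumptions: representation data are stood in for by string names, and the function `extract_for_reorder` and its parameters are hypothetical, not taken from the patent.

```python
def extract_for_reorder(sequence, candidate, neighbors, affected, new_candidates):
    """Split sequence 51 into what stays and what goes to the to-be-sorted sequence 52.

    All names here are hypothetical stand-ins for representation data.
    """
    removed = {candidate, *neighbors}      # step 502: dropped from sequence 51
    moved = set(affected)                  # step 503: must be re-scored and reinserted
    kept = [item for item in sequence if item not in removed and item not in moved]
    # Sequence 52 holds the affected items plus the newly generated cp items.
    pending = [item for item in sequence if item in moved] + list(new_candidates)
    return kept, pending


kept, pending = extract_for_reorder(
    ["ct1", "ct2", "ct3", "ct4", "ct5"],
    candidate="ct1",
    neighbors=["ct4"],
    affected=["ct2"],
    new_candidates=["cp1", "cp2"],
)
print(kept, pending)
```

The remaining order of sequence 51 is preserved, so only the items in sequence 52 need their point density recomputed.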

Referring further to 504 to 506 in Figure 5, after the point density of each representation data in the to-be-sorted representation data sequence 52 has been determined, the representation data can be inserted back into representation data sequence 51 one by one to obtain the reordered representation data sequence.

Take inserting first representation data ct₂ back into representation data sequence 51 as an example. The insertion position of ct₂ can be determined by comparing the point density of ct₂ with the point densities of the representation data remaining in sequence 51.

If the point density of first representation data ct₂ is greater than or equal to the maximum point density of the representation data remaining in sequence 51 (the point density of first representation data ct₃ in sequence 51), ct₂ is inserted to the left of ct₃.

If the point density of first representation data ct₂ is less than or equal to the minimum point density of the representation data remaining in sequence 51 (the point density of first representation data ct_n in sequence 51), ct₂ is inserted to the right of ct_n.

If the point density of first representation data ct₂ is greater than the minimum point density and less than the maximum point density, ct₂ can be inserted back into sequence 51 by, for example, a sliding comparison that starts from the position of maximum point density.

When representation data are inserted back into sequence 51 by sliding comparison, the point density of the representation data to be inserted satisfies the following condition.

$$N_{density\_i-1} \ge N_{density\_current} \ge N_{density\_i} \qquad (2)$$

In formula (2), $N_{density\_i-1}$ and $N_{density\_i}$ denote the point densities of two adjacent representation data in representation data sequence 51, and $N_{density\_current}$ denotes the point density of the representation data to be inserted, where i = 2, 3, ..., n.

For example, when the point density of first representation data ct₂ is determined to lie between the point densities of two adjacent representation data (for example first representation data ct₄ and first representation data ct_k), ct₂ can be inserted between those two, which completes the insertion of ct₂ back into representation data sequence 51.

It should be noted that, besides sliding comparison, other suitable methods such as binary search can be used to insert the representation data back into representation data sequence 51. The choice can be made according to actual needs and is not limited here.
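To make the two insertion strategies concrete, the sketch below keeps a list of point densities in descending order and inserts a new value first by sliding comparison, following the three cases and condition (2) above, and then by binary search. It is a minimal illustration: the sequence holds bare density values rather than representation data, and the function names are hypothetical.

```python
def insert_sliding(seq, density):
    """Insert into a descending list by scanning from the densest end."""
    if not seq or density >= seq[0]:
        seq.insert(0, density)             # left of the current maximum
        return
    if density <= seq[-1]:
        seq.append(density)                # right of the current minimum
        return
    for i in range(1, len(seq)):
        # Condition (2): seq[i-1] >= density >= seq[i]
        if seq[i - 1] >= density >= seq[i]:
            seq.insert(i, density)
            return


def insert_binary(seq, density):
    """Same ordering result, found with binary search on the descending list."""
    lo, hi = 0, len(seq)
    while lo < hi:
        mid = (lo + hi) // 2
        if seq[mid] > density:
            lo = mid + 1                   # insertion point is to the right
        else:
            hi = mid                       # insertion point is here or to the left
    seq.insert(lo, density)


a = [9, 7, 7, 4]
b = list(a)
insert_sliding(a, 5)
insert_binary(b, 5)
print(a, b)
```

Sliding comparison costs a linear scan per insertion, while binary search needs only a logarithmic number of comparisons; the list insertion itself is linear in both cases.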

In this way, the second representation data in the second representation data set can be reordered by point density, and the candidate second representation data can then be obtained from the reordered queue.

According to embodiments of the present invention, in the above process of determining the candidate first representation data, once the representation data sequence has been obtained it can also be split, according to the available computing resources, into multiple first representation data subsequences that are processed in parallel, so that the candidate first representation data is obtained more quickly. This approach is particularly suitable when the first representation data set contains a large amount of data.

For example, after the plurality of first representation data have been sorted by their respective point densities to obtain the representation data sequence, the sequence may be split into multiple first representation data subsequences based on the computing resources, where each first representation data subsequence includes a preset number of first representation data. In one example, every first representation data subsequence includes the same number of first representation data, which keeps the processing speed of the subsequences relatively consistent during parallel computation.

Next, for each first representation data subsequence, the first representation data with the highest point density among its preset number of first representation data is determined as the first subsequence representation data.

Next, the first subsequence representation data with the highest point density among the multiple first subsequence representation data is determined as the candidate first representation data.
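The split, per-subsequence maximum, and global maximum described above can be sketched as below. The example assumes each representation data is a `(name, point_density)` tuple and uses a thread pool purely for illustration; the patent does not specify the parallel mechanism, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_subsequences(items, size):
    """Cut the sequence into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def densest(items):
    """Return the (name, density) pair with the highest density."""
    return max(items, key=lambda item: item[1])


def candidate_by_parallel_max(items, size, workers=4):
    """Assumes `items` is non-empty."""
    chunks = split_into_subsequences(items, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One local winner per subsequence (the "first subsequence representation data").
        local_best = list(pool.map(densest, chunks))
    # The global winner among the local winners is the candidate.
    return densest(local_best)


data = [("ct1", 12), ("ct2", 9), ("ct3", 15), ("ct4", 3), ("ct5", 7)]
print(candidate_by_parallel_max(data, size=2))
```

Because the maximum of the local maxima equals the global maximum, the result does not depend on how the sequence is split.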

According to embodiments of the present invention, in the above process of using the ordered queue to reorder the second representation data in the second representation data set by point density in order to obtain the candidate second representation data, the representation data sequence can also be split, according to the available computing resources, into multiple first representation data subsequences so that the reordering is performed in parallel and the candidate second representation data is obtained more quickly. Each first representation data subsequence includes a preset number of first representation data.

For example, after the affected first representation data associated with the candidate first representation data has been determined among the plurality of first representation data, the affected first representation data and the determined number of second candidate representation data are stored in the to-be-sorted representation data sequence.

Next, for each first representation data subsequence, the representation data related to the candidate first representation data and its neighboring representation data, together with the affected first representation data, are deleted from the subsequence to obtain a second representation data subsequence.

Next, the point density of the affected first representation data and the point density of each of the second candidate representation data in the to-be-sorted representation data sequence are determined. For each such representation data, the number of neighboring representation data associated with it is determined from the second representation data set based on the second distance threshold and the feature vector of that representation data. That number of neighboring representation data is then taken as the point density of the representation data.
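The point density defined above is a neighbor count within a distance threshold. The following sketch shows one way to compute it, assuming Euclidean distance between feature vectors and an inclusive comparison with the threshold; neither choice is stated in the text, and `point_density` is a hypothetical helper.

```python
import math


def point_density(index, vectors, distance_threshold):
    """Count the other vectors lying within the threshold of vectors[index]."""
    target = vectors[index]
    return sum(
        1
        for j, other in enumerate(vectors)
        # Assumption: Euclidean distance, threshold inclusive.
        if j != index and math.dist(target, other) <= distance_threshold
    )


vectors = [(0, 0), (1, 0), (0, 1), (5, 5)]
print([point_density(i, vectors, 1.5) for i in range(len(vectors))])
```

An isolated vector gets density 0, while vectors in a tight group get a density equal to the number of group members within reach.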

Next, according to the point density of the affected first representation data and the point densities of the second candidate representation data in the to-be-sorted representation data sequence, and the point densities of the representation data in the multiple second representation data subsequences, the affected first representation data and the second candidate representation data are allocated to the multiple second representation data subsequences to obtain multiple third representation data subsequences.

In the embodiments of the present invention, the process of allocating the affected first representation data and the second candidate representation data from the to-be-sorted representation data sequence to the multiple second representation data subsequences is similar to the process, described above, of inserting the representation data back into the representation data sequence, and is not repeated here.

Next, for each third representation data subsequence, the representation data with the highest point density in that subsequence is determined as the second subsequence representation data. The second subsequence representation data with the highest point density among the multiple second subsequence representation data is then determined as the candidate second representation data.

According to embodiments of the present invention, in the above process of generating the second candidate representation data, the object data identifier corresponding to each second candidate representation data can also be determined from the object data identifiers of the individual object data.

For example, for each second candidate representation data, the candidate first representation data and the neighboring representation data associated with it are determined. Then, based on the object data identifiers corresponding to the plurality of object data, the object data identifier of the candidate first representation data and the object data identifiers of the neighboring representation data are determined.

After that, the object data identifier of the second candidate representation data is determined from the object data identifier of the candidate first representation data and the object data identifiers of the neighboring representation data, and is associated with the second candidate representation data.

For example, among the object data identifier of the candidate first representation data and the object data identifiers of the neighboring representation data, the number of object data identifiers belonging to each data identifier category can be determined. The object data identifier with the largest count is then used as the object data identifier of the second candidate representation data. The process of determining the object data identifier of the second candidate representation data is similar to the process, described above, of determining the object data identifier of the first candidate representation data, and is not repeated here.
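The identifier selection above amounts to a majority vote over the identifiers of the candidate and its neighbors. A minimal sketch follows; `majority_identifier` is a hypothetical name, and the tie-breaking behavior (the identifier seen first wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_identifier(identifiers):
    """Return the identifier that occurs most often in a non-empty list."""
    # Counter.most_common orders equal counts by first occurrence.
    return Counter(identifiers).most_common(1)[0][0]


# Identifier of the candidate followed by the identifiers of its neighbors.
print(majority_identifier([1, 2, 1, 1, 2]))
```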

Figure 6A is a schematic diagram of the effect of data classification using the related art, and Figure 6B is a schematic diagram of the effect of data classification based on the technical solution of the present invention. The advantages of the technical solution of the present invention are described below with reference to Figures 6A and 6B.

The plurality of object data include, for example, object data P_a1, object data P_a2, object data P_a3, and so on. The object data identifier of P_a1 is object data identifier 1, the object data identifier of P_a2 is object data identifier 2, and the object data identifier of P_a3 is object data identifier 1.

The object data identifier of each object data indicates, for example, the category to which that object data belongs. For example, object data such as P_a1 and P_a3 belong to category 1, and object data P_a2 belongs to category 2. It should be noted that, to illustrate the classification effect simply and clearly, object data with the same object data identifier are drawn with the same symbol in Figures 6A and 6B.

Take object data P_a3 as the object data to be classified. P_a3 is classified in two ways: with the related art, for example the k-nearest neighbor (KNN) classification algorithm, and with the technical solution of the present invention combined with the k-nearest neighbor classification algorithm.

Figure 6A is a schematic diagram of the effect of data classification based on the k-nearest neighbor classification algorithm. As shown at 610 and 611 in Figure 6A, when the k-nearest neighbor classification algorithm is used to classify object data P_a3 with k = 5 (an example value only), the 5 object data nearest to P_a3 are found to include 3 object data with object data identifier 2 and 2 object data with object data identifier 1. The classification result for P_a3 is therefore category 2, that is, the object data identifier assigned to P_a3 is object data identifier 2. This clearly does not match the actual object data identifier of P_a3 (object data identifier 1). Thus, if P_a3 is classified directly with the k-nearest neighbor classification algorithm, the object data are so dense that the result is affected by local noise and becomes biased.

Figure 6B is a schematic diagram of the effect of data classification based on the technical solution of the present invention. As shown at 610 to 630 in Figure 6B, after the plurality of object data are processed with the solution of the present invention, the first sample data set is obtained. As shown at 630, the first sample data set includes object data P_a3, a plurality of first candidate representation data ct₁₁, and first candidate representation data ct₂₁. The first candidate representation data ct₁₁ is generated from object data P_a1 and its neighboring object data. The first candidate representation data ct₂₁ is generated from object data P_a2 and its neighboring object data.

On the basis of the first sample data set, the k-nearest neighbor classification algorithm is used to classify object data P_a3 with k = 5 (an example value only). The 5 object data nearest to P_a3 are found to include 4 object data with object data identifier 1 and 1 object data with object data identifier 2. The classification result for P_a3 is therefore category 1, that is, the object data identifier assigned to P_a3 is object data identifier 1. In this case the classification result matches the actual object data identifier of P_a3 (object data identifier 1).

It can be understood that, when the plurality of object data are processed with the solution of the present invention, the first sample data set is obtained by using the first candidate representation data in place of the corresponding object data and neighboring object data. Redundant data among the plurality of object data are thereby condensed, which eliminates the influence of local noise to a certain extent and helps improve the accuracy of the classification result. The result shown in Figure 6B illustrates this.
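For reference, the k-nearest neighbor vote used in both comparisons can be sketched as below. The coordinates are invented toy values and do not reproduce Figures 6A or 6B; Euclidean distance and the function name `knn_classify` are assumptions made for illustration.

```python
import math
from collections import Counter


def knn_classify(query, samples, k=5):
    """Classify `query` by majority vote among its k nearest labeled samples.

    `samples` is a list of (feature_vector, identifier) pairs.
    """
    nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
    votes = Counter(identifier for _, identifier in nearest)
    return votes.most_common(1)[0][0]


# Toy data only: two well-separated groups with identifiers 1 and 2.
samples = [
    ((0, 0), 1), ((0, 1), 1), ((1, 0), 1),
    ((5, 5), 2), ((5, 6), 2), ((6, 5), 2),
]
print(knn_classify((0.2, 0.2), samples, k=3))
```

The same function can be run on either the raw object data or the condensed first sample data set; only the `samples` argument changes.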

Figure 7 is a flowchart of a training method for a target object detection model according to an embodiment of the present invention.

As shown in Figure 7, the training method 700 for the target object detection model includes operations S710 to S720.

In operation S710, a sample data set is obtained. In the embodiments of the present invention, the sample data set is obtained using the data processing method described in the above embodiments, and includes the desired amount of sample data.

In operation S720, a deep learning model is iteratively trained using the sample data set until the output of the deep learning model satisfies an iteration stop condition or the cumulative number of training iterations reaches a preset threshold, yielding the target object detection model.

In the embodiments of the present invention, the deep learning model can be trained over multiple iterations using the above sample data set until the output of the deep learning model satisfies the iteration stop condition or the cumulative number of training iterations reaches the preset threshold, yielding the target object detection model. The iteration stop condition may include, for example, that the difference between the output of the deep learning model and the label information of the sample data satisfies a preset convergence condition.
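The two stopping criteria can be illustrated with a deliberately tiny stand-in for the deep learning model: a one-parameter linear model fitted by gradient descent. Everything in this sketch (the model, the mean squared error, the learning rate, the tolerance, and the name `train`) is an assumption chosen for illustration and is not the patent's training procedure.

```python
def train(samples, learning_rate=0.1, tolerance=1e-6, max_iterations=1000):
    """Fit y = weight * x, stopping on convergence or on the iteration limit."""
    weight = 0.0
    for iteration in range(1, max_iterations + 1):
        # Difference between model output and labels (mean squared error).
        loss = sum((weight * x - y) ** 2 for x, y in samples) / len(samples)
        if loss <= tolerance:
            return weight, iteration, "converged"
        grad = sum(2 * (weight * x - y) * x for x, y in samples) / len(samples)
        weight -= learning_rate * grad
    return weight, max_iterations, "iteration limit"


samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
print(train(samples))
print(train(samples, max_iterations=2))
```

The first call stops because the loss falls below the tolerance; the second stops because the iteration count reaches its limit first.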

In the embodiments of the present invention, the sample data include, for example, at least one of text, images, and video to be detected. The target object detection model is used, for example, to process the sample data to obtain a detection result for the sample data.

In another example, the sample data include, for example, text, images, video, or audio to be classified, or location information carrying identification data. The sample data can also be used to train a deep learning model so that the trained model can be used to classify the sample data.

In another example, the sample data include, for example, text to be converted. The sample data can also be used to train a deep learning model so that the trained model can process the text to be converted and obtain conversion data for it. The text to be converted includes, for example, text to be translated, and the conversion data includes, for example, the translated text. In other words, the trained deep learning model is used to translate text.

It should be noted that the types of sample data and the uses of the trained deep learning model in the embodiments of the present invention are not limited to the above examples and can be determined according to the actual application scenario; details are not repeated here.

According to embodiments of the present invention, training the deep learning model on a sample data set with the desired amount of data not only improves training efficiency and saves computing resources, but also avoids the clustering of local noise caused by an excessively large amount of sample data, thereby improving the accuracy of the model output.

Figure 8 is a flowchart of a target object detection method according to an embodiment of the present invention.

As shown in Figure 8, the target object detection method 800 includes operation S810.

In operation S810, the data to be processed are input into the target object detection model to obtain a detection result for the data to be processed. The target object detection model is trained using the training method for the target object detection model described above.

According to embodiments of the present invention, the data to be processed include, for example, at least one of text, images, and video to be detected. The target object detection model is used, for example, to process the data to be processed to obtain a detection result for the target object in the data. The types of data to be processed and the uses of the target object detection model in the embodiments of the present invention are not limited to the above examples and can be determined according to the actual application scenario; details are not repeated here.

In the solution of the embodiments of the present invention, using the target object detection model trained in the above manner to detect the target object in the data to be processed can improve the accuracy of target object detection.

Figure 9 is a block diagram of a data processing apparatus according to an embodiment of the present invention.

As shown in Figure 9, the data processing apparatus 900 includes a first acquisition module 910, a first determination module 920, a generation module 930, a processing module 940, and a second determination module 950.

The first acquisition module 910 is configured to acquire a plurality of object data.

The first determination module 920 is configured to determine that the number of the plurality of object data is greater than a preset quantity threshold and, for any candidate object data among the plurality of object data, to determine neighboring object data associated with the candidate object data from the plurality of object data according to the feature vector of the candidate object data and a first distance threshold.

The generation module 930 is configured to generate, in response to neighboring object data being present among the plurality of object data, first candidate representation data according to the candidate object data and the neighboring object data, where the first candidate representation data is used to replace the candidate object data and the neighboring object data.

The processing module 940 is configured to process the plurality of object data based on the candidate object data, the neighboring object data, and the first candidate representation data to obtain a first representation data set.

The second determination module 950 is configured to determine that the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, and to determine the first representation data set as the first sample data set.
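The flow through modules 910 to 950 can be sketched in a few lines. This is a simplified illustration under several assumptions that the text above does not fix: feature vectors are tuples compared by Euclidean distance, the first candidate representation data is the coordinate-wise mean of a candidate and its neighbors, and candidates are visited in input order. The second-stage processing that applies when the result still exceeds the quantity threshold is omitted, and all function names are hypothetical.

```python
import math


def condense_once(vectors, distance_threshold):
    """Replace each candidate and its neighbors with one representative vector."""
    remaining = list(vectors)
    result = []
    while remaining:
        candidate = remaining.pop(0)
        neighbors = [v for v in remaining
                     if math.dist(candidate, v) <= distance_threshold]
        if neighbors:
            group = [candidate] + neighbors
            # Assumption: the representative is the coordinate-wise mean.
            result.append(tuple(sum(c) / len(group) for c in zip(*group)))
            remaining = [v for v in remaining if v not in neighbors]
        else:
            result.append(candidate)       # no neighbors: keep the data as is
    return result


def build_sample_set(vectors, count_threshold, distance_threshold):
    """Condense only when the data count exceeds the quantity threshold."""
    if len(vectors) <= count_threshold:
        return list(vectors)
    return condense_once(vectors, distance_threshold)


vectors = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0)]
print(build_sample_set(vectors, count_threshold=2, distance_threshold=3.0))
```

Here the two nearby vectors collapse into one representative and the isolated vector is kept, so the result meets the quantity threshold of 2.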

Figure 10 is a block diagram of a training apparatus for a target object detection model according to an embodiment of the present invention.

As shown in Figure 10, the training apparatus 1000 for the target object detection model includes a second acquisition module 1010 and a training module 1020.

The second acquisition module 1010 is configured to acquire a sample data set, where the sample data set is obtained using the data processing apparatus 900 described in the above embodiment.

The training module 1020 is configured to iteratively train a deep learning model using the sample data set until the output of the deep learning model satisfies the iteration stop condition or the cumulative number of training iterations reaches the preset threshold, yielding the target object detection model.

Figure 11 is a block diagram of a target object detection apparatus according to an embodiment of the present invention.

As shown in Figure 11, the target object detection apparatus 1100 includes an input module 1110.

The input module 1110 is configured to input the data to be processed into the target object detection model to obtain a detection result for the data to be processed, where the target object detection model is trained by the training apparatus 1000 for the target object detection model described in the above embodiment.

It should be noted that the implementation of each module, unit, or subunit in the apparatus embodiments, the technical problems solved, the functions realized, and the technical effects achieved are the same as or similar to those of the corresponding steps in the method embodiments, and are not repeated here.

In the technical solution of the present invention, the collection, storage, use, processing, transmission, provision, disclosure, and application of the data involved (including but not limited to users' personal information) comply with the relevant laws and regulations and do not violate public order and good morals.

In the technical solution of the present invention, the authorization or consent of the data owner is obtained before the relevant data are acquired or collected.

According to embodiments of the present invention, the present invention also provides an electronic device, a readable storage medium, and a computer program product.

根据本发明的实施例，一种电子设备，包括：至少一个处理器；以及与至少一个处理器通信连接的存储器；其中，存储器存储有可被至少一个处理器执行的指令，指令被至少一个处理器执行，以使至少一个处理器能够执行如上所述的方法。According to an embodiment of the present invention, an electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by at least one processor, and the instructions are processed by at least one processor. processor execution, so that at least one processor can execute the method as described above.

According to an embodiment of the present invention, a non-transitory computer-readable storage medium stores computer instructions, wherein the computer instructions are used to cause a computer to perform the method described above.

According to an embodiment of the present invention, a computer program product includes a computer program that, when executed by a processor, implements the method described above.

FIG. 12 schematically shows a block diagram of an electronic device suitable for implementing the data processing method, the training method for the target object detection model, and the target object detection method according to an embodiment of the present invention.

As shown in FIG. 12, an electronic device 1200 according to an embodiment of the present invention includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage section 1208 into a random access memory (RAM) 1203. The processor 1201 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), and the like. The processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flow according to embodiments of the present invention.

The RAM 1203 stores various programs and data required for the operation of the electronic device 1200. The processor 1201, the ROM 1202, and the RAM 1203 are connected to one another through a bus 1204. The processor 1201 performs the various operations of the method flow according to embodiments of the present invention by executing programs in the ROM 1202 and/or the RAM 1203. Note that the programs may also be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform the various operations of the method flow according to embodiments of the present invention by executing programs stored in the one or more memories.

According to an embodiment of the present invention, the electronic device 1200 may further include an input/output (I/O) interface 1205, which is also connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), a speaker, and the like; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read from it can be installed into the storage section 1208 as needed.

The present invention also provides a computer-readable storage medium. The computer-readable storage medium may be included in the device/apparatus/system described in the above embodiments, or it may exist separately without being assembled into that device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to embodiments of the present invention.

According to embodiments of the present invention, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, for example, but is not limited to: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present invention, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of the present invention, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 described above, and/or one or more memories other than the ROM 1202 and the RAM 1203.

Embodiments of the present invention also include a computer program product comprising a computer program that contains program code for performing the methods illustrated in the flowcharts. When the computer program product runs in a computer system, the program code causes the computer system to implement the data processing method, the training method for the target object detection model, and the target object detection method provided by embodiments of the present invention.

When the computer program is executed by the processor 1201, the above functions defined in the system/apparatus of embodiments of the present invention are performed. According to embodiments of the present invention, the systems, apparatuses, modules, units, and the like described above may be implemented by computer program modules.

In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed as a signal over a network medium, downloaded and installed through the communication section 1209, and/or installed from the removable medium 1211. The program code contained in the computer program may be transmitted over any suitable network medium, including but not limited to wireless, wired, or any suitable combination of the above.

In such an embodiment, the computer program may be downloaded from a network and installed through the communication section 1209, and/or installed from the removable medium 1211. When the computer program is executed by the processor 1201, the above functions defined in the system of embodiments of the present invention are performed. According to embodiments of the present invention, the systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules.

According to embodiments of the present invention, the program code for executing the computer program provided by embodiments of the present invention may be written in any combination of one or more programming languages. Specifically, these computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or in assembly/machine language. The programming languages include, but are not limited to, Java, C++, Python, the "C" language, or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device and partly on a remote computing device, or entirely on a remote computing device or server. Where a remote computing device is involved, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block in the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.

Those skilled in the art will understand that the features described in the various embodiments of the present invention may be combined in various ways, even if such combinations are not explicitly described in the present invention. In particular, the features described in the various embodiments of the present invention may be combined in various ways without departing from the spirit and teachings of the present invention. All such combinations fall within the scope of the present invention.

The embodiments of the present invention have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present invention. Although the embodiments are described separately above, this does not mean that the measures in the various embodiments cannot be used in combination to advantage. Those skilled in the art may make various substitutions and modifications without departing from the scope of the present invention, and all such substitutions and modifications shall fall within the scope of the present invention.

Claims

1. A data processing method, comprising:

acquiring a plurality of object data;

determining that the number of the plurality of object data is greater than a preset quantity threshold, and, for any candidate object data among the plurality of object data, determining, from the plurality of object data, adjacent object data associated with the candidate object data according to a feature vector of the candidate object data and a first distance threshold;

in response to the adjacent object data being present in the plurality of object data, generating first candidate representation data according to the candidate object data and the adjacent object data, wherein the first candidate representation data is used to replace the candidate object data and the adjacent object data;

processing the plurality of object data based on the candidate object data, the adjacent object data, and the first candidate representation data to obtain a first representation data set; and

determining that the number of first representation data contained in the first representation data set is less than or equal to the preset quantity threshold, and determining the first representation data set as a first sample data set.

2. The method according to claim 1, wherein processing the plurality of object data based on the candidate object data, the adjacent object data, and the first candidate representation data to obtain the first representation data set comprises:

deleting the candidate object data and the adjacent object data from the plurality of object data to obtain a candidate object data set; and

determining the first representation data set based on the candidate object data set and the first candidate representation data.
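By way of non-limiting illustration only, the merging pass recited in claims 1 and 2 (together with the fallback of claim 12 for a candidate that has no neighbor) might be sketched as follows. The function names `first_pass` and `build_first_sample_set`, the use of Euclidean distance, the inclusive threshold comparison, and the choice of the mean feature vector as the representation are all assumptions made for this sketch; the claims do not specify them.

```python
import math


def first_pass(objects, distance_threshold):
    """Replace each candidate and its neighbours with one representation."""
    remaining = list(objects)      # candidate object data set (shrinks each round)
    representations = []           # first representation data set (grows each round)
    while remaining:
        candidate = remaining.pop(0)
        near, far = [], []
        for other in remaining:
            # Assumption: Euclidean distance between feature vectors.
            if math.dist(candidate, other) <= distance_threshold:
                near.append(other)
            else:
                far.append(other)
        if near:
            group = [candidate] + near
            # Assumption: the representation is the mean feature vector.
            rep = tuple(sum(axis) / len(group) for axis in zip(*group))
        else:
            # No neighbour: the candidate represents itself (cf. claim 12).
            rep = tuple(candidate)
        representations.append(rep)
        remaining = far            # candidate and neighbours are deleted
    return representations


def build_first_sample_set(objects, max_count, distance_threshold):
    """Return (data set, True if it already fits within max_count)."""
    if len(objects) <= max_count:
        return list(objects), True
    reps = first_pass(objects, distance_threshold)
    return reps, len(reps) <= max_count


data = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.2), (10.0, 10.0)]
reps, fits = build_first_sample_set(data, max_count=3, distance_threshold=0.5)
```

In this sketch, when the returned flag is false the first representation data set is still larger than the preset quantity threshold, which corresponds to the situation addressed by claim 3.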

3. The method according to claim 2, wherein the first representation data set comprises a plurality of first representation data, and, in response to detecting that the candidate object data set contains no object data and determining that the number of the plurality of first representation data is greater than the preset quantity threshold, the method further comprises:

determining a point density of each of the first representation data;

determining the first representation data with the largest point density among the plurality of first representation data as candidate first representation data;

determining a number of second candidate representation data associated with the candidate first representation data based on the point density of the candidate first representation data;

performing representation data extraction on the plurality of first representation data based on the number of second candidate representation data to obtain a second representation data set; and

determining that the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold, and determining the second representation data set as a second sample data set.

4. The method according to claim 3, wherein performing representation data extraction on the plurality of first representation data based on the number of second candidate representation data to obtain the second representation data set comprises:

determining, from the plurality of first representation data, adjacent representation data associated with the candidate first representation data based on a feature vector of the candidate first representation data and a second distance threshold;

performing clustering processing on the candidate first representation data and the adjacent representation data based on the number of second candidate representation data to obtain that number of second candidate representation data;

deleting the candidate first representation data and the adjacent representation data from the plurality of first representation data to obtain a candidate representation data set; and

determining the second representation data set based on the candidate representation data set and that number of second candidate representation data.

5. The method according to claim 4, wherein, in response to determining that the number of second representation data contained in the second representation data set is greater than the preset quantity threshold, the method further comprises:

determining, among the plurality of first representation data, affected first representation data associated with the candidate first representation data;

determining the point density of the affected first representation data and the point densities of that number of second candidate representation data, respectively;

determining the second representation data with the largest point density in the second representation data set as candidate second representation data, according to the point density of the affected first representation data, the point densities of that number of second candidate representation data, and the point densities of the representation data in the candidate representation data set other than the affected first representation data; and

updating the point density of the candidate first representation data based on the point density of the candidate second representation data, and repeating the operation of determining whether the number of second representation data contained in the second representation data set is less than or equal to the preset quantity threshold.

6. The method according to claim 4, wherein determining the first representation data with the largest point density among the plurality of first representation data as the candidate first representation data comprises:

sorting the plurality of first representation data according to the point density of each of the plurality of first representation data to obtain a representation data sequence;

splitting the representation data sequence, based on computing resources, into a plurality of first representation data sub-sequences, wherein each of the first representation data sub-sequences includes a preset number of first representation data;

for each first representation data sub-sequence, determining the first representation data with the largest point density among the preset number of first representation data as first sub-sequence representation data; and

determining the first sub-sequence representation data with the largest point density among the plurality of first sub-sequence representation data as the candidate first representation data.
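As a non-limiting illustration, the chunked selection of claim 6 might be sketched as below. The function name `densest_by_chunks` and the parameter `chunk_size` (standing in for the preset number per sub-sequence, which the claim ties to available computing resources) are hypothetical names introduced for this sketch.

```python
def densest_by_chunks(items, densities, chunk_size):
    """Pick the item with the largest point density via per-chunk maxima."""
    if not items:
        raise ValueError("items must not be empty")
    # Sort indices by point density, densest first (the representation data sequence).
    order = sorted(range(len(items)), key=lambda i: densities[i], reverse=True)
    # Split the sequence into sub-sequences of at most chunk_size entries.
    chunks = [order[s:s + chunk_size] for s in range(0, len(order), chunk_size)]
    # Densest entry of each sub-sequence (first sub-sequence representation data).
    winners = [max(chunk, key=lambda i: densities[i]) for chunk in chunks]
    # Densest among the per-chunk winners (candidate first representation data).
    best = max(winners, key=lambda i: densities[i])
    return items[best]


picked = densest_by_chunks(["a", "b", "c", "d", "e"], [3, 7, 1, 5, 2], chunk_size=2)
```

In this sketch the initial sort already places the densest item first, so the per-chunk maxima are computed only to mirror the recited steps; each chunk could in principle be handed to a separate worker.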

7. The method according to claim 6, wherein, in response to determining that the number of second representation data contained in the second representation data set is greater than the preset quantity threshold, the method further comprises:

storing the affected first representation data and that number of second candidate representation data into a to-be-sorted representation data sequence;

for each first representation data sub-sequence, deleting, from the first representation data sub-sequence, the representation data related to the candidate first representation data, the adjacent representation data, and the affected first representation data, to obtain a second representation data sub-sequence;

determining, respectively, the point density of the affected first representation data and the point densities of that number of second candidate representation data in the to-be-sorted representation data sequence;

assigning the affected first representation data and that number of second candidate representation data in the to-be-sorted representation data sequence to the plurality of second representation data sub-sequences, according to their point densities and the point densities of the representation data in the plurality of second representation data sub-sequences, to obtain a plurality of third representation data sub-sequences;

for each third representation data sub-sequence, determining the representation data with the largest point density in the third representation data sub-sequence as second sub-sequence representation data;

determining the second sub-sequence representation data with the largest point density among the plurality of second sub-sequence representation data as the candidate second representation data; and

8. The method according to claim 4, wherein each of the object data includes an object data identifier, and performing representation data extraction on the plurality of first representation data based on the number of second candidate representation data to obtain the second representation data set further comprises:

for each of the second candidate representation data, determining the candidate first representation data and the adjacent representation data associated with the second candidate representation data;

determining the object data identifier of the candidate first representation data and the object data identifier of the adjacent representation data according to the object data identifiers corresponding to the plurality of object data;

determining the object data identifier of the second candidate representation data based on the object data identifier of the candidate first representation data and the object data identifier of the adjacent representation data; and

associating the object data identifier of the second candidate representation data with the second candidate representation data.

9. The method according to claim 8, wherein determining the object data identifier of the second candidate representation data based on the object data identifier of the candidate first representation data and the object data identifier of the adjacent representation data comprises:

determining, among the object data identifier of the candidate first representation data and the object data identifiers of the adjacent representation data, the number of object data identifiers belonging to each data identifier category; and

using the object data identifier with the largest number as the object data identifier of the second candidate representation data.

10. The method according to any one of claims 3 to 9, wherein the number of second candidate representation data is positively correlated with the point density of the candidate first representation data.

11. The method according to any one of claims 3 to 9, wherein determining the point density of each of the first representation data comprises:

for each of the first representation data, determining, from the plurality of first representation data, the number of adjacent representation data associated with the first representation data based on a second distance threshold and a feature vector of the first representation data; and

determining the number of adjacent representation data as the point density.
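By way of non-limiting illustration, the point density of claim 11 might be computed as the count of other representation data lying within the second distance threshold. The function name `point_density`, the Euclidean metric, the inclusive comparison, and the exclusion of a point from its own count are assumptions of this sketch rather than requirements of the claim.

```python
import math


def point_density(points, second_distance_threshold):
    """Return, for each point, how many other points lie within the threshold."""
    densities = []
    for i, p in enumerate(points):
        count = 0
        for j, q in enumerate(points):
            # Assumption: a point is not counted as its own neighbour.
            if j != i and math.dist(p, q) <= second_distance_threshold:
                count += 1
        densities.append(count)
    return densities


densities = point_density([(0, 0), (1, 0), (0, 1), (10, 10)], 1.5)
```

This brute-force version compares every pair of points, so its cost grows quadratically with the number of representation data; it is intended only to make the definition concrete.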

12. The method according to any one of claims 1 to 9, further comprising:

in response to determining that no adjacent object data associated with the candidate object data exists in the plurality of object data, determining the candidate object data as the first candidate representation data.

13. The method according to any one of claims 1 to 9, wherein each of the object data includes an object data identifier, and generating the first candidate representation data according to the candidate object data and the adjacent object data further comprises:

determining the object data identifier of the candidate object data and the object data identifier of the adjacent object data according to the object data identifiers corresponding to the plurality of object data;

determining the object data identifier of the first candidate representation data according to the object data identifier of the candidate object data and the object data identifier of the adjacent object data; and

associating the object data identifier of the first candidate representation data with the first candidate representation data.

14. The method according to claim 13, wherein determining the object data identifier of the first candidate representation data according to the object data identifier of the candidate object data and the object data identifier of the adjacent object data comprises:

determining, among the object data identifier of the candidate object data and the object data identifiers of the adjacent object data, the number of object data identifiers belonging to each data identifier category; and

using the object data identifier with the largest number as the object data identifier of the first candidate representation data.
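As a non-limiting illustration, the identifier selection recited in claim 14 (and, for the second candidate representation data, in claim 9) amounts to a majority vote over the identifiers of the merged data. The function name `merged_identifier` is hypothetical, and the tie-breaking behavior shown (the identifier encountered first wins) is an assumption, since the claims do not address ties.

```python
from collections import Counter


def merged_identifier(identifiers):
    """Return the most frequent identifier among the merged object data."""
    if not identifiers:
        raise ValueError("at least one identifier is required")
    counts = Counter(identifiers)
    # most_common(1) yields the (identifier, count) pair with the largest count;
    # on a tie, the identifier seen first is returned (an assumption of this sketch).
    label, _ = counts.most_common(1)[0]
    return label


label = merged_identifier(["cat", "dog", "cat"])
```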

15. A training method for a target object detection model, comprising:

acquiring a sample data set; and

iteratively training a deep learning model using the sample data set until an output result of the deep learning model satisfies an iteration stop condition or the cumulative number of training iterations reaches a preset threshold, to obtain the target object detection model;

wherein the sample data set is obtained using the method according to any one of claims 1 to 14.
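Purely as a non-limiting illustration of the two stopping criteria in claim 15, the loop below trains a one-parameter linear model by gradient descent and stops either when the loss falls below a tolerance or when a maximum iteration count is reached. The linear model is a stand-in for the deep learning model, and the names `train`, `loss_tol`, and `max_iters`, as well as the use of mean squared error, are assumptions of this sketch.

```python
def train(xs, ys, lr=0.05, loss_tol=1e-6, max_iters=1000):
    """Fit y = w * x; return (w, number of iterations actually run)."""
    w = 0.0
    n = len(xs)
    for step in range(1, max_iters + 1):
        # Gradient of the mean squared error with respect to w.
        grad = 2.0 / n * sum((w * x - y) * x for x, y in zip(xs, ys))
        w -= lr * grad
        loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / n
        if loss < loss_tol:
            # Iteration stop condition satisfied by the model output.
            return w, step
    # Cumulative number of iterations reached the preset threshold.
    return w, max_iters


w, steps = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```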

16. A target object detection method, comprising:

inputting data to be processed into a target object detection model to obtain a detection result for the data to be processed;

wherein the target object detection model is trained using the method according to claim 15.

17. An electronic device, comprising:

one or more processors; and

a memory for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method according to any one of claims 1 to 16.

18. A computer-readable storage medium having executable instructions stored thereon, wherein the instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 16.