
CN116701909A - Multimodal data fusion method, device, computer equipment and storage medium

Multimodal data fusion method, device, computer equipment and storage medium

Info

Publication number
CN116701909A
CN116701909A (application CN202310655794.4A)
Authority
CN
China
Prior art keywords
data
deep
processed
feature
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310655794.4A
Other languages
Chinese (zh)
Inventor
张振林
陈冰研
袁金伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Automotive Innovation Corp
Original Assignee
China Automotive Innovation Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Automotive Innovation Corp
Priority to CN202310655794.4A
Publication of CN116701909A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/762Arrangements for image or video recognition or understanding using pattern recognition or machine learning using clustering, e.g. of similar faces in social networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to a multimodal data fusion method, device, computer equipment, and storage medium. The method includes: obtaining multimodal data to be processed, where the multimodal data to be processed includes point cloud data to be processed and image data to be processed; calling a pre-built feature extraction model, where the feature extraction model is trained on deep-mining results obtained by performing deep mining on sample multimodal data; performing feature extraction on the point cloud data and the image data through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data; and fusing the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed. The method can improve the fusion accuracy of multimodal data.

Description

Multimodal data fusion method, device, computer equipment and storage medium

Technical Field

The present application relates to the technical field of autonomous driving, and in particular to a multimodal data fusion method, device, computer equipment, storage medium, and computer program product.

Background

The rapid development of sensor technologies such as cameras, lidar, and millimeter-wave radar has advanced the perception capabilities of autonomous driving. Data collected by different sensors correspond to different modalities and reflect how the autonomous driving system perceives the real world from different angles. Images captured by cameras contain richer texture information such as color, while point clouds captured by radar contain more comprehensive spatial position information. Fusion methods based on the point cloud modality and the image modality can further advance autonomous driving perception technology.

In the traditional approach, point cloud data and image data are fused by superimposing their features within the visible range.

However, the traditional approach cannot obtain target and scene information that is close to the real world, resulting in low fusion accuracy for multimodal data.

Summary of the Invention

In view of the above technical problems, it is necessary to provide a multimodal data fusion method, device, computer equipment, computer-readable storage medium, and computer program product capable of improving the fusion accuracy of multimodal data.

In a first aspect, the present application provides a multimodal data fusion method. The method includes:

obtaining multimodal data to be processed, where the multimodal data to be processed includes point cloud data to be processed and image data to be processed;

calling a pre-built feature extraction model;

performing feature extraction on the point cloud data to be processed and the image data to be processed through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data;

fusing the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed.

In one embodiment, before obtaining the multimodal data to be processed, the method further includes:

obtaining sample multimodal data, where the sample multimodal data includes sample point cloud data and sample image data;

inputting the sample multimodal data into a deep learning model to be trained, where the deep learning model includes a feature extraction layer, a deep mining layer, and a perception layer;

extracting, through the feature extraction layer, sample point cloud features corresponding to the sample point cloud data and sample image features corresponding to the sample image data, and fusing the sample point cloud features and the sample image features to obtain sample fusion features;

performing deep mining on the sample point cloud features and the sample image features through the deep mining layer, to obtain a deep mining result;

training, through the perception layer, the deep learning model according to the sample fusion features and the deep mining result, to obtain the pre-built feature extraction model.

In one embodiment, the deep mining result is a relationship loss value, and performing deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain the deep mining result includes:

performing deep mining on the sample point cloud features and the sample image features through the deep mining layer, to obtain mined feature data;

determining the relationship loss value according to the mined feature data.

In one embodiment, the perception layer includes a perception task layer and a training optimization layer, and training the deep learning model through the perception layer according to the sample fusion features and the deep mining result includes:

executing, through the perception task layer, a preset perception task according to the sample fusion features, to obtain a task execution result;

determining, through the training optimization layer, an overall loss value of the deep learning model according to the task execution result and the relationship loss value, and training the deep learning model according to the overall loss value.

In one embodiment, the preset perception task is a target detection task, and determining the overall loss value of the deep learning model through the training optimization layer according to the task execution result and the relationship loss value includes:

determining, through the training optimization layer, a position loss value, a direction loss value, and a category loss value of the deep learning model according to the task execution result;

determining, through the training optimization layer, the overall loss value of the deep learning model according to the position loss value, the direction loss value, the category loss value, and the relationship loss value.

In one embodiment, performing feature extraction on the point cloud data to be processed and the image data to be processed through the feature extraction model, to obtain the first deep feature data corresponding to the point cloud data and the second deep feature data corresponding to the image data, includes:

performing feature extraction on the point cloud data through the feature extraction model, to obtain the first deep feature data corresponding to the point cloud data, where the first deep feature data includes deep point cloud features and features of the association with the image data to be processed;

performing semantic segmentation on the image data through the feature extraction model to obtain a semantic segmentation result, and clustering the semantic segmentation result to obtain a clustering result;

upsampling the clustering result through the feature extraction model, to obtain the second deep feature data corresponding to the image data, where the second deep feature data includes deep image features and features of the association with the point cloud data to be processed.

In a second aspect, the present application further provides a multimodal data fusion device. The device includes:

a data acquisition module, configured to obtain multimodal data to be processed, where the multimodal data to be processed includes point cloud data to be processed and image data to be processed;

a model calling module, configured to call a pre-built feature extraction model;

a feature extraction module, configured to perform feature extraction on the point cloud data and the image data through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data;

a feature fusion module, configured to fuse the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed.

In a third aspect, the present application further provides a computer device. The computer device includes a memory and a processor, the memory stores a computer program, and the processor implements the following steps when executing the computer program:

obtaining multimodal data to be processed, where the multimodal data to be processed includes point cloud data to be processed and image data to be processed;

calling a pre-built feature extraction model;

performing feature extraction on the point cloud data and the image data through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data;

fusing the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed.

In a fourth aspect, the present application further provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the following steps are implemented:

obtaining multimodal data to be processed, where the multimodal data to be processed includes point cloud data to be processed and image data to be processed;

calling a pre-built feature extraction model;

performing feature extraction on the point cloud data and the image data through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data;

fusing the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed.

In a fifth aspect, the present application further provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the following steps:

obtaining multimodal data to be processed, where the multimodal data to be processed includes point cloud data to be processed and image data to be processed;

calling a pre-built feature extraction model;

performing feature extraction on the point cloud data and the image data through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data;

fusing the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed.

In the above multimodal data fusion method, device, computer equipment, storage medium, and computer program product, the feature extraction model is trained on deep-mining results obtained by performing deep mining on sample multimodal data, so it can fully mine and extract the non-intuitive deep feature data of the multimodal data and fuse it. The resulting deep fusion features are closer to real-world target and scene information, which greatly improves the fusion accuracy of multimodal data and thereby improves the ability of autonomous driving perception technology to perceive the surrounding environment.

Brief Description of the Drawings

FIG. 1 is an application environment diagram of a multimodal data fusion method in one embodiment;

FIG. 2 is a schematic flowchart of a multimodal data fusion method in one embodiment;

FIG. 3 is a schematic flowchart of the training steps of the feature extraction model in one embodiment;

FIG. 4 is a schematic diagram of the network structure of the deep learning model in one embodiment;

FIG. 5 is a structural block diagram of a multimodal data fusion device in one embodiment;

FIG. 6 is an internal structure diagram of a computer device in one embodiment.

Detailed Description

In order to make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.

The multimodal data fusion method provided in the embodiments of the present application can be applied in the application environment shown in FIG. 1. The method is mainly executed on an on-board computer device. Specifically, in an autonomous driving environment, the vehicle is pre-installed with an on-board sensor 102 and an on-board computer device 104 (referred to simply as the computer device). The on-board sensor 102 collects multimodal data to be processed and transmits it to the computer device 104. The computer device 104 calls a pre-built feature extraction model, which is trained on deep-mining results obtained by performing deep mining on sample multimodal data, performs feature extraction on the point cloud data to be processed and the image data to be processed through the feature extraction model to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data, and then fuses the first and second deep feature data through the feature extraction model to obtain deep fusion features corresponding to the multimodal data to be processed. The on-board sensor 102 may include various image and video acquisition devices for collecting the image data, and various radar sensors for collecting the point cloud data. The computer device 104 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a smart in-vehicle device.

In one embodiment, as shown in FIG. 2, a multimodal data fusion method is provided. Taking the method applied to the computer device in FIG. 1 as an example, it includes the following steps:

Step 202: obtain multimodal data to be processed; the multimodal data to be processed includes point cloud data to be processed and image data to be processed.

The multimodal data to be processed refers to multimodal perception data that needs to be fused.

Specifically, in an autonomous driving environment, various on-board sensors installed on the vehicle scan the surroundings of the vehicle to obtain the point cloud data and the image data to be processed. Since data collected by different types of sensors correspond to different modalities, the point cloud data and the image data may together be called the multimodal data to be processed. Optionally, the on-board sensors may include a first sensor and a second sensor. The first sensor may include various image and video acquisition devices for collecting the image data, such as cameras. The second sensor may include various radar sensors for collecting the point cloud data, such as lidar or millimeter-wave radar. The on-board sensors transmit the collected multimodal data to the computer device.

Step 204: call a pre-built feature extraction model.

Here, deep mining refers to mining deep feature data in the sample multimodal data that is not intuitively visible.

A pre-built feature extraction model is stored in the computer device and is used to extract the deep feature data of multimodal data. The feature extraction model is obtained by performing deep mining on a large amount of sample multimodal data to obtain deep mining results and then training on those results. The deep mining results may include the deep semantic features of the sample multimodal data and the intrinsic associations between modalities. The computer device calls the pre-built feature extraction model to perform deep feature extraction on the multimodal data to be processed.

Step 206: perform feature extraction on the point cloud data and the image data through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data and second deep feature data corresponding to the image data.

The first deep feature data refers to deep feature data in the point cloud data that is not intuitively visible, and the second deep feature data refers to deep feature data in the image data that is not intuitively visible. Both are high-dimensional features.

Specifically, the feature extraction model may include a feature extraction layer, which performs feature extraction on the point cloud data and the image data and fuses the extracted features. Further, the feature extraction layer may include two network branches: a point cloud branch and an image branch. The point cloud branch performs feature extraction on the point cloud data to obtain the first deep feature data, which may include the deep semantic features of the point cloud data and the intrinsic association between the point cloud data and the image data. The image branch performs feature extraction on the image data to obtain the second deep feature data, which may include the deep semantic features of the image data and the intrinsic association between the image data and the point cloud data.

In one embodiment, performing feature extraction on the point cloud data and the image data through the feature extraction model includes: performing feature extraction on the point cloud data through the feature extraction model, to obtain the first deep feature data, which includes deep point cloud features and features of the association with the image data; performing semantic segmentation on the image data through the feature extraction model to obtain a semantic segmentation result, and clustering the semantic segmentation result to obtain a clustering result; and upsampling the clustering result through the feature extraction model, to obtain the second deep feature data, which includes deep image features and features of the association with the point cloud data.

Here, the deep point cloud features are the deep semantic features of the point cloud data, and the features of the association with the image data refer to the intrinsic relationship between the point cloud data and the image data. Likewise, the deep image features are the deep semantic features of the image data, and the features of the association with the point cloud data refer to the intrinsic relationship between the image data and the point cloud data.

Specifically, the point cloud branch of the feature extraction layer performs feature extraction on the point cloud data, yielding the first deep feature data, which includes the deep point cloud features and the association features with the image data. For the image data, the image branch of the feature extraction layer performs semantic segmentation on the image to be processed, producing a semantic segmentation result that includes the category of each pixel. The segmentation result is then clustered, grouping the pixels by category. The clustering result is then upsampled so that each pixel of the image data has the same feature dimension as the points in the first deep feature data, yielding the second deep feature data, which includes the deep image features and the association features with the point cloud data. For example, the deep point cloud features and the deep image features may be more abstract, high-dimensional floating-point features, and key features may be similar or identical features.

Further, the point cloud branch and the image branch mainly use encoders (convolutional networks) to extract deep feature data from the point cloud data and the image data respectively. The point cloud branch may include one encoder, which raises the feature dimension of the point cloud data to a preset dimension, thereby extracting the high-dimensional, deep feature data of the point cloud, i.e., the first deep feature data. For example, the point cloud data may be represented as (P, [x, y, z, i]); the encoder in the point cloud branch raises the feature dimension of (P, [x, y, z, i]) from 4 to C and outputs the first deep feature data F_pt with shape (P, C), where P is the number of points in the point cloud and C is the preset dimension, i.e., the number of intermediate features, such as 10 or 64.

Unlike the point cloud branch, the image branch may include two encoders. The first encoder performs semantic segmentation on the image to be processed and the segmentation result is clustered; the second encoder then upsamples the clustering result to obtain the second deep feature data. For example, when the first deep feature data F_pt has shape (P, C), each pixel in the second deep feature data F_img has the same feature dimension C as the points in the first deep feature data, and F_img has shape (H, W, C), where H is the height, W is the width, and C is the feature dimension of the second deep feature data.
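As a rough illustration of the two branches and the tensor shapes above, the following PyTorch sketch is a minimal stand-in: the layer sizes, the MLP point encoder, and the downsample-then-upsample image backbone are illustrative assumptions, not the patent's actual architecture.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """Lifts raw points (P, 4) = (x, y, z, intensity) to deep features F_pt of shape (P, C)."""
    def __init__(self, in_dim: int = 4, feat_dim: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 32), nn.ReLU(),
            nn.Linear(32, feat_dim),
        )

    def forward(self, points: torch.Tensor) -> torch.Tensor:
        return self.mlp(points)  # (P, 4) -> (P, C)

class ImageEncoder(nn.Module):
    """Segmentation-style backbone followed by upsampling, so every pixel ends
    up with the same feature dimension C as the points: F_img is (H, W, C)."""
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.down = nn.Sequential(  # stand-in for the first (segmentation) encoder
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # stand-in for the second encoder's upsampling back to full resolution
        self.up = nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.up(self.down(image))         # (1, C, H, W)
        return feats.squeeze(0).permute(1, 2, 0)  # channels-last (H, W, C)

F_pt = PointCloudEncoder()(torch.randn(2048, 4))     # (2048, 64)
F_img = ImageEncoder()(torch.randn(1, 3, 256, 512))  # (256, 512, 64)
```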

Step 208: fuse the first deep feature data and the second deep feature data through the feature extraction model, to obtain deep fusion features corresponding to the multimodal data to be processed.

The first deep feature data and the second deep feature data are fused through the feature extraction layer of the feature extraction model.

Further, the feature extraction layer also includes a fusion layer (Fusion layer), such as a 1×1 convolution kernel, which is connected to the encoder outputs of the point cloud branch and the image branch respectively. The first deep feature data output by the point cloud branch encoder and the second deep feature data output by the second encoder of the image branch serve as the input of the fusion layer. According to a preset extrinsic parameter matrix, the fusion layer aligns each pixel in the second deep feature data with the points in the first deep feature data to obtain the aligned second deep feature data, and then fuses the aligned second deep feature data with the first deep feature data, yielding the deep fusion feature F_fuse corresponding to the multimodal data. The shape of F_fuse is (P, C1), where P is the number of point cloud points and C1 is the feature dimension.
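A minimal sketch of this alignment-and-fusion step follows, assuming the point-to-pixel projection through the extrinsic matrix has already been computed into integer pixel coordinates `uv`; the concatenation before the 1×1 convolution and the output width C1 = 128 are illustrative assumptions.

```python
import torch
import torch.nn as nn

def fuse_features(F_pt: torch.Tensor, F_img: torch.Tensor, uv: torch.Tensor,
                  fuse_conv: nn.Conv1d) -> torch.Tensor:
    """F_pt: (P, C) point features; F_img: (H, W, C) image features;
    uv: (P, 2) pixel coordinates of each point, assumed precomputed from the
    extrinsic parameter matrix. Returns F_fuse with shape (P, C1)."""
    pixel_feats = F_img[uv[:, 1], uv[:, 0]]           # (P, C): image features aligned to points
    stacked = torch.cat([F_pt, pixel_feats], dim=-1)  # (P, 2C)
    # the 1x1 "Fusion" convolution mixes the two modalities per point
    return fuse_conv(stacked.t().unsqueeze(0)).squeeze(0).t()

P, C, C1, H, W = 2048, 64, 128, 256, 512
fuse_conv = nn.Conv1d(2 * C, C1, kernel_size=1)
uv = torch.stack([torch.randint(0, W, (P,)), torch.randint(0, H, (P,))], dim=1)
F_fuse = fuse_features(torch.randn(P, C), torch.randn(H, W, C), uv, fuse_conv)  # (P, 128)
```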

In the traditional approach, models such as PointPainting (an image-lidar fusion model) or RoarNet (RegiOn Approximation Refinement Network, a target detection network) superimpose point cloud and image features within the visible range to achieve multimodal data fusion. They focus on superimposing shallow features, i.e., shapes, colors, and spatial positions that are visible to the naked eye and easy to understand intuitively, while ignoring the intrinsic deep semantic connections between the modalities, i.e., high-dimensional features that cannot be clearly represented as two- or three-dimensional features. Moreover, they perform only a one-way superposition of modalities, such as painting image features onto the point cloud, with no mutual supervision and mutual learning between modalities. The traditional approach therefore cannot fully mine the deep semantic features and intrinsic associations of multimodal data.

In the above multimodal data fusion method, by contrast, the pre-built feature extraction model is called to perform feature extraction on the point cloud data and the image data in the multimodal data to be processed, obtaining the first deep feature data corresponding to the point cloud data and the second deep feature data corresponding to the image data, and the first and second deep feature data are then fused to obtain the deep fusion features. Since the feature extraction model is trained on deep-mining results obtained by performing deep mining on sample multimodal data, it can fully mine and extract the non-intuitive deep feature data of the multimodal data and fuse it; the resulting deep fusion features are closer to real-world target and scene information, which greatly improves the fusion accuracy of multimodal data and thereby improves the ability of autonomous driving perception technology to perceive the surrounding environment.

In one embodiment, the above method further includes: performing feature extraction on the deep fusion features to obtain target features, and executing an autonomous driving task according to the target features.

The feature extraction model may also include a perception task layer, which is used to execute autonomous driving tasks.

After the feature extraction layer of the feature extraction model outputs the deep fusion features, they are input to the perception task layer, which may include an extraction layer (Extract layer) and a perception task head (3DTaskHead). For example, the extraction layer may be Voxel, Pillar, or similar. The extraction layer performs feature extraction on the deep fusion features to obtain the target features, i.e., the features obtained by point cloud feature extraction on the deep fusion features. The target features serve as the input of the perception task head, which executes the autonomous driving task according to them; the autonomous driving task may be, for example, a 3D target detection task or a target segmentation task. A rough sketch of this stage follows below.
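As an illustrative sketch only: one common pillar-style realization of such an extraction layer scatters the per-point fused features into a bird's-eye-view (BEV) grid and attaches a convolutional detection head. The grid size, pooling rule, and head layout here are assumptions, not details from the patent.

```python
import torch
import torch.nn as nn

def scatter_to_bev(F_fuse: torch.Tensor, cell_idx: torch.Tensor,
                   grid_hw: tuple = (128, 128)) -> torch.Tensor:
    """Max-pool per-point fused features (P, C1) into a BEV grid (C1, H, W);
    cell_idx: (P, 2) integer (x, y) grid cell of each point."""
    P, C1 = F_fuse.shape
    H, W = grid_hw
    bev = F_fuse.new_zeros(C1, H * W)
    flat = cell_idx[:, 1] * W + cell_idx[:, 0]  # (P,) linearized cell index
    bev.index_reduce_(1, flat, F_fuse.t(), reduce="amax", include_self=False)
    return bev.view(C1, H, W)

# toy 3D task head: per-cell class logits plus a 7-DoF box (x, y, z, w, l, h, theta)
task_head = nn.Conv2d(128, 3 + 7, kernel_size=1)

F_fuse = torch.randn(2048, 128)
cell_idx = torch.randint(0, 128, (2048, 2))
predictions = task_head(scatter_to_bev(F_fuse, cell_idx).unsqueeze(0))  # (1, 10, 128, 128)
```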

In this embodiment, since the deep fusion features are closer to real-world scenes, performing feature extraction on them to obtain target features provides the perception task head with data that meets its requirements, which facilitates the subsequent execution of autonomous driving tasks.

In one embodiment, as shown in FIG. 3, before obtaining the multimodal data to be processed, the above method further includes a training step for the feature extraction model, which includes:

Step 302: obtain sample multimodal data; the sample multimodal data includes sample point cloud data and sample image data.

Step 304: input the sample multimodal data into a deep learning model to be trained; the deep learning model includes a feature extraction layer, a deep mining layer, and a perception layer.

Here, the sample multimodal data refers to the multimodal data used to train the deep learning model.

During model training, the computer device may first obtain the sample multimodal data, which may include sample point cloud data and sample image data. The deep learning model to be trained is called and the sample multimodal data is input into it; the deep learning model may include a feature extraction layer, a deep mining layer, and a perception layer. Both the feature extraction layer and the deep mining layer include a point cloud branch and an image branch; the point cloud branch of the feature extraction layer is connected to the point cloud branch of the deep mining layer, and likewise for the image branches. The point cloud data and the image data are processed by the two different branches: the point cloud data by the point cloud branch and the image data by the image branch.

Step 306: extract, through the feature extraction layer, sample point cloud features corresponding to the sample point cloud data and sample image features corresponding to the sample image data, and fuse the sample point cloud features and the sample image features to obtain sample fusion features.

The point cloud branch of the feature extraction layer extracts the sample point cloud features, and the image branch extracts the sample image features; both are high-dimensional features. The feature extraction layer fuses the sample point cloud features and the sample image features to obtain the sample fusion features. Further, the structure of the feature extraction layer of the deep learning model is the same as that of the feature extraction model and is not repeated here.

Step 308: perform deep mining on the sample point cloud features and the sample image features through the deep mining layer, to obtain a deep mining result.

The point cloud branch of the deep mining layer performs deep mining on the sample point cloud features, and the image branch performs deep mining on the sample image features, yielding mined feature data. The mined feature data refers to the deep features of the sample point cloud data and the sample image data, and the intrinsic associations between the multimodal data, i.e., between the sample point cloud data and the sample image data. The deep mining layer then computes the deep mining result from the mined feature data.

In one embodiment, the deep mining result is a relationship loss value. Performing deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain the deep mining result includes: performing deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain mined feature data, and determining the relationship loss value according to the mined feature data.

The deep mining layer includes a point cloud branch and an image branch. The point cloud branch performs deep mining on the sample point cloud features to obtain first mined features, i.e., the deep features of the sample point cloud data and the intrinsic association with the sample image data. The image branch performs deep mining on the sample image features to obtain second mined features, i.e., the deep features of the sample image data and the intrinsic association with the sample point cloud data. The mined feature data is obtained from the first mined features and the second mined features, and the relationship loss value of the deep learning model is computed from the mined feature data and determined as the deep mining result.

Further, the point cloud branch and the image branch of the deep mining layer each include an encoder and a multilayer perceptron (MLP), which may also be called a projection head (Projection). The encoder performs deep mining on the sample point cloud features or sample image features. The MLP maps the mined feature data into the sample label space, i.e., integrates the mined feature data into a single value, which reduces the influence of feature position on the mined feature data and improves the robustness of the whole model. The deep mining layer computes the relationship loss value of the deep learning model from the first and second mined features, and this relationship loss drives the parameters of the encoders in the point cloud and image branches of the feature extraction layer to update in a direction that mines deep features and the intrinsic associations between the multimodal data.
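A projection head of this kind is often just a small MLP mapping each branch's mined features into a shared embedding space; the sketch below assumes illustrative layer widths.

```python
import torch.nn as nn

def make_projection(in_dim: int = 128, embed_dim: int = 64) -> nn.Module:
    """One projection head per modality; both map mined features into the same
    embedding space in which the relationship loss is computed."""
    return nn.Sequential(
        nn.Linear(in_dim, in_dim), nn.ReLU(),
        nn.Linear(in_dim, embed_dim),
    )

proj_pt, proj_img = make_projection(), make_projection()
```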

Further, the sample multimodal data may be multimodal data of multiple targets; the mined feature data therefore refers to the mined features of multiple targets. For example, the relationship loss value may be computed with the InfoNCE loss function:

$$L_{relation} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\big(\mathrm{Sim}(z_i,\bar z_i)/t\big)}{\sum_{j=1}^{N}\exp\big(\mathrm{Sim}(z_i,z_j)/t\big)}$$

where $L_{relation}$ is the relationship loss value, $N$ is the number of image or point cloud frames in a batch, $z_i$ is the deep feature of a target in the first or second mined features, $\bar z_i$ is the feature of that same target in the mined features of the other modality, $z_j$ is any target within the batch, $\mathrm{Sim}(z_i, z_j)$ is the cosine similarity of $z_i$ and $z_j$, and $t$ is a temperature hyperparameter.

In terms of the relationship loss function, minimizing the relationship loss means that the similarity between $z_i$ and $\bar z_i$ should be as large as possible, while the similarity between $z_i$ and other, different data should be as small as possible. The relationship loss function makes the regional features of the same target across different modalities more similar, and the regional features of different targets less similar. This enables the model to mine the deeper semantic features of the data and the feature connections between different modal data, which further benefits autonomous driving perception.
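The following sketch computes such an InfoNCE-style relationship loss between matched per-target embeddings from the two branches; treating the diagonal of the similarity matrix as the positives is a standard implementation trick, and the tensor shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def relation_loss(z_pt: torch.Tensor, z_img: torch.Tensor, t: float = 0.07) -> torch.Tensor:
    """InfoNCE relationship loss: row i of z_pt and row i of z_img are
    embeddings of the same target in the two modalities; shape (N, D)."""
    z_pt = F.normalize(z_pt, dim=-1)
    z_img = F.normalize(z_img, dim=-1)
    logits = z_pt @ z_img.t() / t          # (N, N) cosine similarities over temperature
    targets = torch.arange(z_pt.size(0))   # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss_relation = relation_loss(torch.randn(16, 64), torch.randn(16, 64))
```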

Step 310: train, through the perception layer, the deep learning model according to the sample fusion features and the deep mining result, to obtain the pre-built feature extraction model.

The perception layer executes a preset perception task according to the sample fusion features, obtaining a task execution result. The preset perception task refers to an autonomous driving task, such as a 3D target detection task or a target segmentation task. The deep learning model is then trained according to the task execution result and the deep mining result until a preset condition is met, yielding the pre-built feature extraction model. The preset condition may be that the loss value of the deep learning model no longer decreases or that a preset number of iterations is reached.

In one embodiment, the perception layer includes a perception task layer and a training optimization layer. Training the deep learning model through the perception layer according to the sample fusion features and the deep mining result includes: executing, through the perception task layer, the preset perception task according to the sample fusion features, obtaining a task execution result; and determining, through the training optimization layer, the overall loss value of the deep learning model according to the task execution result and the relationship loss value, and training the deep learning model according to the overall loss value.

The perception task layer of the perception layer executes the preset perception task according to the sample fusion features. The structure of the perception layer during training is the same as the structure of the perception layer in actual application. The perception task layer may include an extraction layer (Extract layer) and a perception task head (3DTaskHead); for example, the extraction layer may be Voxel, Pillar, or similar. The extraction layer performs feature extraction on the deep fusion features, and the perception task head executes the preset perception task, i.e., the autonomous driving task, according to the extracted features, obtaining the task execution result. For example, the preset perception task may be a 3D target detection task or a target segmentation task.

The training optimization layer of the perception layer computes the perception task loss value of the deep learning model according to the task execution result and the corresponding perception task loss function. Different preset perception tasks may have different loss functions: when the preset perception task is a 3D target detection task, the perception task loss function may include position loss, direction loss, and category loss; when it is a target segmentation task, the loss function may include category loss, bounding box loss, mask loss, and region focal loss. The training optimization layer then computes the overall loss value of the deep learning model from the perception task loss value and the relationship loss value, and the parameters of the deep learning model are trained with minimizing the overall loss value as the objective.

进一步地,预设感知任务为目标检测任务;通过训练优化层根据任务执行结果以及关系损失值确定深度学习模型的总体损失值包括:通过训练优化层根据任务执行结果确定深度学习模型的位置损失值、方向损失值以及类别损失值;通过训练优化层根据位置损失值、方向损失值、类别损失值以及关系损失值确定深度学习模型的总体损失值。Further, the preset perception task is a target detection task; determining the overall loss value of the deep learning model through the training optimization layer according to the task execution result and the relationship loss value includes: determining the position loss value of the deep learning model according to the task execution result through the training optimization layer , direction loss value and category loss value; determine the overall loss value of the deep learning model according to the position loss value, direction loss value, category loss value and relationship loss value by training the optimization layer.

Specifically, when the preset perception task is an object detection task, that is, a 3D object detection task, the corresponding perception task loss function may include a position loss, a direction loss, and a category loss. The training optimization layer computes the position loss value, direction loss value, and category loss value of the deep learning model from the task execution result and the corresponding loss functions: a position loss function, a direction loss function, and a category loss function. The position loss may use the SmoothL1 function, the direction loss may use the Softmax loss function, and the category loss may use the Focal Loss function. For example, the position loss function may be as follows:

L_{loc} = \sum_{b \in (x, y, z, w, h, l, \theta)} \mathrm{SmoothL1}(\Delta b) \quad (3)

Here, L_{loc} denotes the position loss value; b denotes the position of a target, expressed with seven degrees of freedom (the spatial coordinates of the target center (x, y, z), the target width, length, and height (w, l, h), and the heading angle θ); and Δb denotes the deviation between the detection box of target b predicted by the deep learning network and the ground-truth box.
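For illustration only, a minimal sketch of Eq. (3) in PyTorch might look as follows; the function name and tensor layout are assumptions, not part of the present application:

```python
import torch
import torch.nn.functional as F

def position_loss(pred_boxes: torch.Tensor, gt_boxes: torch.Tensor) -> torch.Tensor:
    # pred_boxes, gt_boxes: (N, 7) tensors of (x, y, z, w, h, l, theta) per box.
    # SmoothL1 is applied to the residual of each degree of freedom and summed
    # over the seven parameters and over all boxes, matching Eq. (3).
    return F.smooth_l1_loss(pred_boxes, gt_boxes, reduction="sum")
```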

The training optimization layer thus computes the overall loss value of the deep learning model from the position loss value, the direction loss value, the category loss value, the relation loss value, and a preset calculation relationship, with minimizing the overall loss value as the optimization objective. For example, the preset calculation relationship may be of the form:

L_{total} = \frac{1}{N_{positive}} \left( \beta_{loc} L_{loc} + \beta_{dir} L_{dir} + \beta_{cls} L_{cls} + \beta_{relation} L_{relation} \right) \quad (4)

Here, L_{total} denotes the overall loss value; N_{positive} denotes the total number of positive samples, positive samples being determined by threshold comparison (a sample of multimodal data is a positive sample when its prediction score exceeds the positive-sample threshold); L_{loc} denotes the position loss value and β_{loc} its position weight; L_{dir} denotes the direction loss value and β_{dir} its direction weight; L_{cls} denotes the category loss value and β_{cls} its category weight; and L_{relation} denotes the relation loss value and β_{relation} its relation weight.
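As a minimal, illustrative sketch of Eq. (4), assuming scalar loss terms; the β defaults below are placeholders chosen for demonstration, not weights disclosed in the present application:

```python
def total_loss(l_loc, l_dir, l_cls, l_relation, n_positive,
               beta_loc=2.0, beta_dir=0.2, beta_cls=1.0, beta_relation=1.0):
    # Weighted sum of the four loss terms, normalized by the number of positive
    # samples as in Eq. (4). The beta values are illustrative placeholders.
    return (beta_loc * l_loc + beta_dir * l_dir
            + beta_cls * l_cls + beta_relation * l_relation) / max(n_positive, 1)
```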

By designing a new, comprehensive optimization objective, both the performance of the autonomous driving perception task and the degree of multimodal fusion can be characterized jointly. This optimization objective drives the deep learning network to learn the deep features of, and associations between, the multimodal data, improving autonomous driving perception performance.

Illustratively, when the preset perception task is an object detection task, the network structure of the deep learning model may be as shown in FIG. 4. The deep learning model includes a feature extraction layer, a deep mining layer, a perception task layer, and a training optimization layer. When training the deep learning model, all four layers participate in computing the various loss values and in updating and optimizing the model parameters, so as to obtain a high-quality feature extraction model. In actual application, no optimization or loss computation is needed: all loss-related structures can be removed, and only the feature extraction layer and the perception task layer are used. Both the feature extraction layer and the deep mining layer include a point cloud branch and an image branch; the point cloud branch of the feature extraction layer is connected to the point cloud branch of the deep mining layer, and the image branch of the feature extraction layer is connected to the image branch of the deep mining layer.

Specifically, the image branch of the feature extraction layer includes two Encoders. The first Encoder performs semantic segmentation on the image to be processed and the segmentation result is clustered; the second Encoder then upsamples the clustering result to obtain F_img (the second deep feature data). The point cloud branch of the feature extraction layer includes one Encoder, which raises the feature dimension of the point cloud data to be processed to a preset dimension, thereby extracting high-dimensional, deep feature data from the point cloud, namely F_pt (the first deep feature data). The feature extraction layer also includes a Fusion layer, which aligns each pixel of F_img with the points of F_pt according to a preset extrinsic parameter matrix to obtain the aligned F_img, and then fuses the aligned F_img with F_pt to obtain F_fuse (the deep fusion feature) corresponding to the multimodal data to be processed.
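A sketch of one common way to realize this point-pixel alignment and fusion, assuming PyTorch, a combined 3x4 lidar-to-image projection matrix, and feature concatenation as the fusion operator; none of these specifics are mandated by the present application:

```python
import torch

def fuse_features(f_pt, pts_xyz, f_img, proj_mat):
    # f_pt:     (N, C_pt) deep point features F_pt
    # pts_xyz:  (N, 3)    point coordinates in the lidar frame
    # f_img:    (C_img, H, W) deep image features F_img
    # proj_mat: (3, 4)    preset lidar-to-image projection matrix
    n = pts_xyz.shape[0]
    homo = torch.cat([pts_xyz, torch.ones(n, 1)], dim=1)    # homogeneous coords (N, 4)
    uvw = homo @ proj_mat.T                                  # projected coords (N, 3)
    uv = (uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-6)).long()   # pixel indices
    h, w = f_img.shape[1:]
    uv[:, 0] = uv[:, 0].clamp(0, w - 1)
    uv[:, 1] = uv[:, 1].clamp(0, h - 1)
    f_img_aligned = f_img[:, uv[:, 1], uv[:, 0]].T           # aligned F_img: (N, C_img)
    return torch.cat([f_pt, f_img_aligned], dim=1)           # F_fuse: (N, C_pt + C_img)
```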

The point cloud branch and the image branch of the deep mining layer each include an Encoder and a Projection (a multilayer perceptron). The Encoder of the image branch deeply mines the sample image features, and the mined image features are fed into the Projection to obtain the second mined features. The Encoder of the point cloud branch deeply mines the sample point cloud features, and the mined point cloud features are fed into the Projection to obtain the first mined features. The deep mining layer then computes the Relation (relation loss value) of the deep learning model from the first mined features and the second mined features.
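The concrete form of the relation loss follows the definition given earlier in this application. Purely as an illustrative sketch, one plausible instantiation that rewards agreement between matched point and image embeddings could look like this, assuming PyTorch and that matched rows of z_pt and z_img describe the same spatial location:

```python
import torch
import torch.nn.functional as F

def relation_loss(z_pt: torch.Tensor, z_img: torch.Tensor) -> torch.Tensor:
    # z_pt:  (B, D) first mined features, i.e. point cloud embeddings after Projection
    # z_img: (B, D) second mined features, i.e. image embeddings after Projection
    # Minimizing 1 - cosine similarity pulls matched cross-modal embeddings together.
    z_pt = F.normalize(z_pt, dim=1)
    z_img = F.normalize(z_img, dim=1)
    return (1.0 - (z_pt * z_img).sum(dim=1)).mean()
```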

The perception task layer includes an Extract layer (extraction layer) and a 3DTaskHead (3D perception task head). F_fuse is fed into the Extract layer for feature extraction, and the 3DTaskHead then performs the object detection task on the extracted features to obtain the task execution result, from which Loc (position loss), Dir (direction loss), and Cls (category loss) are computed.

The training optimization layer computes the TotalLoss (overall loss value) of the deep learning model from Loc, Dir, Cls, and Relation. The parameters of the deep learning model are optimized against TotalLoss until the loss value no longer decreases or a preset number of iterations is reached, yielding the pre-built feature extraction model.
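As a minimal sketch of this stopping rule, assuming PyTorch-style training and an assumed model.total_loss(batch) helper that combines Loc, Dir, Cls, and Relation; all names here are illustrative:

```python
import itertools

def fit(model, loader, optimizer, max_iters=100_000, patience=10):
    # Optimize against TotalLoss until it stops decreasing or the preset
    # iteration budget is exhausted.
    best, stale = float("inf"), 0
    for step, batch in enumerate(itertools.cycle(loader)):
        loss = model.total_loss(batch)   # Loc + Dir + Cls + Relation combined
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < best - 1e-6:
            best, stale = loss.item(), 0  # loss still improving
        else:
            stale += 1                    # loss plateaued for another step
        if stale >= patience or step + 1 >= max_iters:
            break
    return model
```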

In this embodiment, the deep learning model to be trained includes a feature extraction layer, a deep mining layer, and a perception layer. The feature extraction layer extracts the sample point cloud features corresponding to the sample point cloud data and the sample image features corresponding to the sample image data, and fuses them to obtain the sample fusion features. The deep mining layer of the deep learning model then deeply mines the sample point cloud features and the sample image features to obtain the deep mining result, and the perception layer trains the deep learning model according to the sample fusion features and the deep mining result to obtain the pre-built feature extraction model. The trained feature extraction model can thus fully extract the intrinsic relationships between the multimodal data, fuse the features of like targets, and extract more effective scene information, providing more accurate information for autonomous driving planning and decision-making.

In another embodiment, a multimodal data fusion method is provided, including the following steps:

Obtain sample multimodal data; the sample multimodal data includes sample point cloud data and sample image data.

Input the sample multimodal data into a deep learning model to be trained; the deep learning model includes a feature extraction layer, a deep mining layer, and a perception layer, the perception layer including a perception task layer and a training optimization layer.

Extract, through the feature extraction layer, the sample point cloud features corresponding to the sample point cloud data and the sample image features corresponding to the sample image data, and fuse the sample point cloud features and the sample image features to obtain sample fusion features.

Deeply mine, through the deep mining layer, the sample point cloud features and the sample image features to obtain mined feature data.

Determine the relation loss value of the deep learning model according to the mined feature data.

Perform, through the perception task layer, the preset perception task according to the sample fusion features to obtain a task execution result.

When the preset perception task is an object detection task, determine, through the training optimization layer, the position loss value, direction loss value, and category loss value of the deep learning model according to the task execution result.

Determine, through the training optimization layer, the overall loss value of the deep learning model according to the position loss value, direction loss value, category loss value, and relation loss value, and train the deep learning model according to the overall loss value to obtain a pre-built feature extraction model.
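Tying the training steps above together, a hypothetical end-to-end training step could be organized as follows; feature_extraction, deep_mining, task_head, detection_losses, and count_positives are assumed components and helpers (relation_loss and total_loss reuse the sketches given earlier), not names from the present application:

```python
def training_step(model, points, image, labels, optimizer):
    # Feature extraction layer: per-modality deep features plus their fusion.
    f_pt, f_img, f_fuse = model.feature_extraction(points, image)
    # Deep mining layer: mined embeddings feeding the relation loss.
    z_pt, z_img = model.deep_mining(f_pt, f_img)
    l_relation = relation_loss(z_pt, z_img)
    # Perception task layer: 3D detection on the fused features.
    preds = model.task_head(f_fuse)
    l_loc, l_dir, l_cls = detection_losses(preds, labels)
    # Training optimization layer: weighted overall loss and parameter update.
    loss = total_loss(l_loc, l_dir, l_cls, l_relation, count_positives(preds))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```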

Obtain multimodal data to be processed; the multimodal data to be processed includes point cloud data to be processed and image data to be processed.

Call the pre-built feature extraction model.

Perform feature extraction on the point cloud data to be processed through the feature extraction model to obtain first deep feature data corresponding to the point cloud data to be processed; the first deep feature data includes deep point cloud features and features of the association with the image data to be processed.

Perform semantic segmentation on the image data to be processed through the feature extraction model to obtain a semantic segmentation result, and cluster the semantic segmentation result to obtain a clustering result. Upsample the clustering result through the feature extraction model to obtain second deep feature data corresponding to the image data to be processed; the second deep feature data includes deep image features and features of the association with the point cloud data to be processed. Fuse the first deep feature data and the second deep feature data through the feature extraction model to obtain the deep fusion feature corresponding to the multimodal data to be processed.
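As a minimal sketch of this image-side pipeline, assuming PyTorch and approximating the clustering step by grouping pixels of the same predicted class; the encoders and the concrete clustering scheme are placeholders, not the specific modules of the present application:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageBranch(nn.Module):
    # Encoder 1 produces a semantic segmentation map, the map is grouped into
    # class clusters, and Encoder 2 upsamples the grouped map back to a dense
    # deep feature map F_img.
    def __init__(self, encoder1: nn.Module, encoder2: nn.Module, num_classes: int):
        super().__init__()
        self.encoder1, self.encoder2 = encoder1, encoder2
        self.num_classes = num_classes

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        logits = self.encoder1(image)                     # (B, num_classes, h, w)
        seg = logits.argmax(dim=1)                        # per-pixel class map
        clusters = F.one_hot(seg, self.num_classes)       # group pixels by class
        clusters = clusters.permute(0, 3, 1, 2).float()   # (B, num_classes, h, w)
        return self.encoder2(clusters)                    # upsampled deep features F_img
```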

In this embodiment, the multimodal data to be processed is fused through the pre-built feature extraction model. Because the feature extraction model is trained by deeply mining sample multimodal data and using the deep mining result, it can deeply mine the deep semantic features of multimodal data and the intrinsic associations between modalities, so that the deep fusion features are closer to the real world. This improves the fusion accuracy of multimodal data and provides more accurate information for autonomous driving planning and decision-making.

In one embodiment, the present method was compared with the current mainstream methods PointPainting and RoarNet on the vehicle category of the KITTI dataset, as shown in Table 1 below, with positive and negative sample thresholds of 0.6 and 0.45, respectively. The experiments use the KITTI evaluation metric AP (Average Precision) as the indicator. For vehicle targets of all difficulty levels, the detection accuracy of the present method is higher than that of the two current mainstream multimodal fusion methods. Vehicle targets in the hard setting are inherently difficult to detect, yet on such targets the detection accuracy of the present method exceeds that of PointPainting by 0.24%. The comparison quantitatively shows that the present method has an advantage over the two mainstream methods in improving perception capability; since improved perception also depends on the model's learning of the data, the experiment further verifies that the present method mines multimodal data features more deeply.

Table 1 Comparative experiment

It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, there is no strict ordering restriction on their execution, and they may be executed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which need not be completed at the same moment but may be executed at different times, and whose execution order need not be sequential; they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

Based on the same inventive concept, an embodiment of the present application further provides a multimodal data fusion apparatus for implementing the multimodal data fusion method described above. The solution provided by this apparatus is similar to that described in the method above; therefore, for specific limitations in the one or more apparatus embodiments below, reference may be made to the limitations of the multimodal data fusion method above, which are not repeated here.

In one embodiment, as shown in FIG. 5, a multimodal data fusion apparatus is provided, including a data acquisition module 502, a model calling module 504, a feature extraction module 506, and a feature fusion module 508, wherein:

The data acquisition module 502 is configured to acquire multimodal data to be processed; the multimodal data to be processed includes point cloud data to be processed and image data to be processed.

The model calling module 504 is configured to call a pre-built feature extraction model.

The feature extraction module 506 is configured to perform feature extraction on the point cloud data to be processed and the image data to be processed, respectively, through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data to be processed and second deep feature data corresponding to the image data to be processed.

The feature fusion module 508 is configured to fuse the first deep feature data and the second deep feature data through the feature extraction model to obtain the deep fusion feature corresponding to the multimodal data to be processed.

In one embodiment, the apparatus further includes:

A sample data acquisition module, configured to acquire sample multimodal data; the sample multimodal data includes sample point cloud data and sample image data.

A to-be-trained model calling module, configured to input the sample multimodal data into a deep learning model to be trained; the deep learning model includes a feature extraction layer, a deep mining layer, and a perception layer.

A feature processing module, configured to extract, through the feature extraction layer, the sample point cloud features corresponding to the sample point cloud data and the sample image features corresponding to the sample image data, and to fuse the sample point cloud features and the sample image features to obtain sample fusion features.

A feature mining module, configured to deeply mine, through the deep mining layer, the sample point cloud features and the sample image features to obtain a deep mining result.

A model optimization module, configured to train, through the perception layer, the deep learning model according to the sample fusion features and the deep mining result to obtain the pre-built feature extraction model.

In one embodiment, the deep mining result is a relation loss value; the feature mining module is further configured to deeply mine the sample point cloud features and the sample image features through the deep mining layer to obtain mined feature data, and to determine the relation loss value according to the mined feature data.

In one embodiment, the perception layer includes a perception task layer and a training optimization layer; the model optimization module is further configured to perform, through the perception task layer, a preset perception task according to the sample fusion features to obtain a task execution result, and to determine, through the training optimization layer, the overall loss value of the deep learning model according to the task execution result and the relation loss value and train the deep learning model according to the overall loss value.

In one embodiment, the preset perception task is an object detection task; the model optimization module is further configured to determine, through the training optimization layer, the position loss value, direction loss value, and category loss value of the deep learning model according to the task execution result, and to determine the overall loss value of the deep learning model according to the position loss value, direction loss value, category loss value, and relation loss value.

In one embodiment, the feature extraction module 506 is further configured to: perform feature extraction on the point cloud data to be processed through the feature extraction model to obtain the first deep feature data corresponding to the point cloud data to be processed, the first deep feature data including deep point cloud features and features of the association with the image data to be processed; perform semantic segmentation on the image data to be processed through the feature extraction model to obtain a semantic segmentation result, and cluster the semantic segmentation result to obtain a clustering result; and upsample the clustering result through the feature extraction model to obtain the second deep feature data corresponding to the image data to be processed, the second deep feature data including deep image features and features of the association with the point cloud data to be processed.

Each of the modules in the above multimodal data fusion apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded, in hardware form, in a processor of a computer device or be independent of it, or may be stored, in software form, in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 6. The computer device includes a processor, a memory, an input/output interface (I/O), and a communication interface. The processor, the memory, and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory; the non-volatile storage medium stores an operating system, a computer program, and a database, and the internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The database of the computer device is used to store the multimodal data to be processed and the like. The input/output interface of the computer device is used to exchange information between the processor and external devices, and the communication interface is used to communicate with external terminals over a network connection. The computer program, when executed by the processor, implements a multimodal data fusion method.

Those skilled in the art will understand that the structure shown in FIG. 6 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.

In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program, and the processor implementing the steps of the above method embodiments when executing the computer program.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, the computer program implementing the steps of the above method embodiments when executed by a processor.

In one embodiment, a computer program product is provided, including a computer program that implements the steps of the above method embodiments when executed by a processor.

It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, stored data, displayed data, etc.) involved in the present application are all information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the relevant data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, database, or other media used in the embodiments provided in the present application may include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive RAM (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory may include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM may take various forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in the present application may include at least one of a relational database and a non-relational database; non-relational databases may include, without limitation, blockchain-based distributed databases and the like. The processors involved in the embodiments provided in the present application may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, quantum-computing-based data processing logic devices, and the like.

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.

The above embodiments express only several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of the patent application. It should be noted that those of ordinary skill in the art can make several modifications and improvements without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A multimodal data fusion method, characterized in that the method comprises:
acquiring multimodal data to be processed, the multimodal data to be processed comprising point cloud data to be processed and image data to be processed;
calling a pre-built feature extraction model;
performing feature extraction on the point cloud data to be processed and the image data to be processed respectively through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data to be processed and second deep feature data corresponding to the image data to be processed; and
fusing the first deep feature data and the second deep feature data through the feature extraction model, to obtain a deep fusion feature corresponding to the multimodal data to be processed.

2. The method according to claim 1, characterized in that, before the acquiring of the multimodal data to be processed, the method further comprises:
acquiring sample multimodal data, the sample multimodal data comprising sample point cloud data and sample image data;
inputting the sample multimodal data into a deep learning model to be trained, the deep learning model comprising a feature extraction layer, a deep mining layer, and a perception layer;
extracting, through the feature extraction layer, sample point cloud features corresponding to the sample point cloud data and sample image features corresponding to the sample image data, and fusing the sample point cloud features and the sample image features to obtain sample fusion features;
performing deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain a deep mining result; and
training the deep learning model through the perception layer according to the sample fusion features and the deep mining result to obtain the pre-built feature extraction model.

3. The method according to claim 2, characterized in that the deep mining result is a relation loss value, and the performing of deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain the deep mining result comprises:
performing deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain mined feature data; and
determining the relation loss value according to the mined feature data.

4. The method according to claim 3, characterized in that the perception layer comprises a perception task layer and a training optimization layer, and the training of the deep learning model through the perception layer according to the sample fusion features and the deep mining result comprises:
performing a preset perception task through the perception task layer according to the sample fusion features to obtain a task execution result; and
determining an overall loss value of the deep learning model through the training optimization layer according to the task execution result and the relation loss value, and training the deep learning model according to the overall loss value.

5. The method according to claim 4, characterized in that the preset perception task is an object detection task, and the determining of the overall loss value of the deep learning model through the training optimization layer according to the task execution result and the relation loss value comprises:
determining a position loss value, a direction loss value, and a category loss value of the deep learning model through the training optimization layer according to the task execution result; and
determining the overall loss value of the deep learning model through the training optimization layer according to the position loss value, the direction loss value, the category loss value, and the relation loss value.

6. The method according to claim 1, characterized in that the performing of feature extraction on the point cloud data to be processed and the image data to be processed respectively through the feature extraction model, to obtain the first deep feature data corresponding to the point cloud data to be processed and the second deep feature data corresponding to the image data to be processed, comprises:
performing feature extraction on the point cloud data to be processed through the feature extraction model to obtain the first deep feature data corresponding to the point cloud data to be processed, the first deep feature data comprising deep point cloud features and features of the association with the image data to be processed;
performing semantic segmentation on the image data to be processed through the feature extraction model to obtain a semantic segmentation result, and clustering the semantic segmentation result to obtain a clustering result; and
upsampling the clustering result through the feature extraction model to obtain the second deep feature data corresponding to the image data to be processed, the second deep feature data comprising deep image features and features of the association with the point cloud data to be processed.

7. A multimodal data fusion apparatus, characterized in that the apparatus comprises:
a data acquisition module, configured to acquire multimodal data to be processed, the multimodal data to be processed comprising point cloud data to be processed and image data to be processed;
a model calling module, configured to call a pre-built feature extraction model, the feature extraction model being obtained by performing deep mining on sample multimodal data and training according to the deep mining result;
a feature extraction module, configured to perform feature extraction on the point cloud data to be processed and the image data to be processed respectively through the feature extraction model, to obtain first deep feature data corresponding to the point cloud data to be processed and second deep feature data corresponding to the image data to be processed; and
a feature fusion module, configured to fuse the first deep feature data and the second deep feature data through the feature extraction model to obtain a deep fusion feature corresponding to the multimodal data to be processed.

8. The apparatus according to claim 7, characterized in that the apparatus further comprises:
a sample data acquisition module, configured to acquire sample multimodal data, the sample multimodal data comprising sample point cloud data and sample image data;
a to-be-trained model calling module, configured to input the sample multimodal data into a deep learning model to be trained, the deep learning model comprising a feature extraction layer, a deep mining layer, and a perception layer;
a feature processing module, configured to extract, through the feature extraction layer, sample point cloud features corresponding to the sample point cloud data and sample image features corresponding to the sample image data, and fuse the sample point cloud features and the sample image features to obtain sample fusion features;
a feature mining module, configured to perform deep mining on the sample point cloud features and the sample image features through the deep mining layer to obtain a deep mining result; and
a model optimization module, configured to train the deep learning model through the perception layer according to the sample fusion features and the deep mining result to obtain the pre-built feature extraction model.

9. A computer device, comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method according to any one of claims 1 to 6 when executing the computer program.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.



Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination