Automobile auxiliary driving monocular distance measurement method and device based on lightweight occupancy prediction network
Technical Field
The invention relates to the field of computer vision, in particular to automatic driving, and more particularly to a monocular distance measuring method and device for automobile assisted driving based on a lightweight occupancy prediction network (Occupancy Network).
Background
Advances in autonomous driving technology have brought a profound revolution to human transportation. In recent years, with the rapid development of artificial intelligence, particularly deep learning and sensor technology, autonomous driving systems promise more efficient, safer and more environmentally friendly transportation through real-time perception and intelligent decision making. The Advanced Driving Assistance Systems (ADAS) carried by automobiles currently on the market mainly provide front/rear collision warning, inter-vehicle distance detection and warning, blind-spot detection, lane-change assistance, adaptive cruise control and the like, and none of these functions can do without the vehicle's perception of the surrounding 3D space. The 3D space is often perceived from a Bird's Eye View (BEV) perspective, i.e. information acquired by sensors installed at different positions on the vehicle is, after feature extraction, uniformly converted into a bird's eye view from above the vehicle, which serves as a unified view for subsequent modules (3D object detection, path planning, etc.). A disadvantage of BEV is that it compresses the height dimension of the three-dimensional space; although the height dimension carries relatively little information in autonomous driving, it still contains some, so height information cannot be perceived under the BEV representation. The occupancy prediction network technology proposed in 2022 solves this problem of missing height information in BEV; its basic idea is to divide the 3D space around the automobile into voxels and to model the 3D environment around the automobile by predicting the semantic information of each voxel.
Distance information plays an important role in intelligent driving applications. Purely visual ranging schemes using cameras are widely adopted because they are relatively cheap compared with expensive radar equipment. Purely visual ranging includes binocular ranging and monocular ranging. Binocular ranging performs depth estimation from the parallax between left and right views; it achieves higher depth accuracy, but stereo rectification and matching bring a large computational burden, the two cameras must be strictly calibrated and registered, and the quality of this calibration and registration directly affects ranging accuracy. Traditional monocular ranging mainly performs the ranging task using object detection bounding boxes, estimating distance from the width of the target's bounding box in the image and the camera matrix. The emergence of the occupancy prediction network provides a new idea for ranging: it divides the 3D space into voxels of preset size, which makes distance calculation exceptionally simple. Meanwhile, predicting voxel occupancy avoids explicit prediction of object categories; that is, when encountering an object that never appears in the data set, the occupancy prediction approach predicts whether the current voxel is occupied, avoiding collision accidents caused by failing to recognize abnormal objects. However, occupancy prediction networks in the current autonomous driving field often contain complex depth estimation, context prediction, or Transformer-based 2D-to-3D conversion modules; their high computational requirements limit deployment on vehicle-mounted chips.
Knowledge distillation (Knowledge Distillation) is a lightweight compression method for deep learning models that aims to transfer the knowledge of a large model (teacher model) into a small model (student model), so as to improve the performance and generalization ability of the small model. The core idea of knowledge distillation is to convert the knowledge of a complex model into a more compact and efficient representation, reducing computational complexity and resource requirements while maintaining high performance. To address the problem that occupancy prediction networks have high computational requirements and are difficult to deploy, knowledge distillation can be used: a pre-trained teacher model with good performance but high computational cost guides the learning of a student model with weaker performance but low computational cost, improving the student model's performance without adding extra computational requirements.
Disclosure of Invention
Aiming at the defects of the traditional monocular distance measurement technology, the invention provides an automobile auxiliary driving monocular distance measurement method and device based on a lightweight occupation prediction network, and the specific technical scheme is as follows:
A monocular distance measuring method for automobile auxiliary driving based on a lightweight occupation prediction network specifically comprises the following steps:
S1, constructing a 3D occupancy prediction data set.
S2, constructing a monocular occupancy prediction teacher model based on multi-frame temporal fusion and temporal stereo depth estimation.
S3, constructing a light-weight monocular occupation prediction student model based on single-frame input.
S4, training the occupancy prediction teacher model to obtain a well-performing teacher model, and evaluating the performance of the teacher model.
S5, performing distillation training on the student model using the intermediate-layer features and the output of the teacher network, to obtain a lightweight occupancy prediction student model with improved performance, and evaluating the performance of the student model.
S6, forward reasoning is carried out on the student model, and monocular distance measurement is completed according to the reasoning result.
Further, the step S1 is specifically implemented by the following steps:
S1.1, collecting sensor data, namely collecting original data in a road environment by using a collecting vehicle provided with a plurality of sensors (laser radar and cameras). Such data should include a point cloud of multiple time stamps, image data, etc.
And S1.2, marking point cloud data, namely manually marking the semantics of the 3D point cloud of the key frame in the collected data. The values in the tag are encoded with integers starting from 0, one class for each integer.
S1.3, generating an occupancy prediction truth value label, namely voxelizing the marked point cloud data, and generating an accurate 3D truth value.
Further, the step S1.3 is specifically implemented by the following steps:
S1.3.1 aggregating multiple frames of point clouds to generate dense point cloud data.
S1.3.2, assigning semantic tags to the unlabeled frames using the K-nearest-neighbor algorithm (KNN): for each point in an unlabeled frame, its K nearest points in the labeled frames are found, and a semantic tag is assigned to it by majority vote (see the sketch following these sub-steps).
S1.3.3 grid reconstruction (Mesh reconstruction) of the aggregated dense point cloud using VDBFusion method.
S1.3.4, carrying out dense point sampling on the reconstructed grid, and continuously carrying out semantic marking on sampling points by adopting KNN to obtain an accurate 3D true value.
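As a non-limiting illustration of the KNN labeling in S1.3.2, the following Python sketch assigns pseudo labels by majority vote over the K nearest labeled points; the array names (labeled_xyz, labeled_sem, unlabeled_xyz) are hypothetical and the use of scipy's cKDTree is an implementation assumption.

```python
# Minimal sketch of S1.3.2 (hypothetical array names; not the exact implementation).
import numpy as np
from scipy.spatial import cKDTree

def knn_semantic_transfer(labeled_xyz, labeled_sem, unlabeled_xyz, k=5):
    """Assign a semantic label to every unlabeled point by majority vote
    over its k nearest neighbours in the labeled (key-frame) point cloud."""
    tree = cKDTree(labeled_xyz)                  # spatial index over labeled points
    _, idx = tree.query(unlabeled_xyz, k=k)      # indices of the k nearest labeled points
    neighbour_labels = labeled_sem[idx]          # (N, k) integer class indices
    # majority vote along the neighbour axis
    votes = np.apply_along_axis(np.bincount, 1, neighbour_labels,
                                minlength=labeled_sem.max() + 1)
    return votes.argmax(axis=1)                  # (N,) pseudo semantic labels

# Usage with random data, purely for shape checking:
labeled_xyz = np.random.rand(1000, 3) * 50.0
labeled_sem = np.random.randint(0, 17, size=1000)
unlabeled_xyz = np.random.rand(200, 3) * 50.0
print(knn_semantic_transfer(labeled_xyz, labeled_sem, unlabeled_xyz).shape)  # (200,)
```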
S1.4, dividing a training set, a verification set and a test set.
S1.5, storing the data information in an easily readable form, such as a .pkl file, so that it can be conveniently read during training and validation.
The pkl file contains:
(1) The data identification comprises a token of the current frame and the previous and subsequent frames thereof, whether the current frame is a key frame identification, a scene token where the frame data is located, and the like.
(2) The conversion matrix comprises an external parameter matrix of the laser radar, an internal and external parameter matrix of the camera, a conversion matrix from a camera coordinate system to a vehicle coordinate system and a conversion matrix from the vehicle coordinate system to a global coordinate system.
(3) The storage path comprises a point cloud data storage path, a camera picture storage path, a 2D semantic pseudo tag storage path and an occupancy (voxel) true value storage path.
Further, the vehicle coordinate system, the global coordinate system, the 2D semantic pseudo tag, the occupation true value and the like in the step S1.5 are specifically as follows:
S1.5.1 the vehicle coordinate system is a local coordinate system with the vehicle (vehicle) as a reference point, and is used for describing the sensor data and the position of the object relative to the vehicle. The origin is located in the center of the vehicle, i.e., the geometric center of the vehicle chassis. The X-axis is directed forward of the vehicle. The Y-axis is directed to the left of the vehicle. The Z axis is directed toward the top of the vehicle.
S1.5.2 the global coordinate system is a fixed world coordinate system that describes the absolute position of an object throughout the environment. Is not changed with the movement of the vehicle. The origin is located at the point in the GPS where the longitude and latitude are both 0 °. The X-axis is directed eastward. The Y-axis is directed north. The Z-axis points to the sky.
S1.5.3 the 2D semantic pseudo tag is obtained by projecting the 3D point cloud subjected to semantic annotation in the S1.2 to a 2D plane, and then performing morphological processing and smoothing operation.
S1.5.4 the occupancy (voxel) ground truth covers all scene key frames and has shape (D_x, D_y, D_z), representing the predicted semantic class index of each voxel in a three-dimensional space of size D_x × D_y × D_z. Referring to step S1.3, it adds one air (free) class compared with the point cloud semantic indices of S1.2, representing voxels in three-dimensional space that do not contain any object. D_x, D_y and D_z are the numbers of voxels in the front-back, left-right and up-down directions, respectively.
Further, the step S2 is specifically implemented by the following steps:
S2.1, setting an input camera, an input picture size, a picture enhancement matrix, a BEV space grid size, a depth prediction range, a multi-frame input index, a BEV characteristic enhancement matrix, an optimizer and a learning rate.
S2.2, constructing a 2D feature extraction module, namely adopting a ResNet network as the backbone to extract multi-scale features. Features of different levels are fed into an FPN (feature pyramid network) to integrate features of different scales.
S2.3, constructing a 2D-to-3D feature conversion module, namely predicting depth and context features simultaneously. When predicting depth, monocular depth prediction and multi-view depth prediction based on MVS (temporal multi-view stereo) are combined. For multi-view depth prediction, the depth center (μ) and depth range (σ) are first predicted, then μ and σ are used to generate a depth distribution, and finally the monocular depth prediction and the weighted multi-view depth are combined to obtain the final depth. BEV features are obtained using the BEV Pooling operation. The BEV features of n consecutive, temporally ordered input frames are extracted separately, the BEV features of the frames other than the current key frame are uniformly projected into the BEV coordinate system of the current key frame, and the multi-frame BEV features are fused by concatenation along the channel dimension.
Further, the BEV Pooling operation in step S2.3 projects the 2D features into the BEV space according to the camera's intrinsic and extrinsic parameters and a pre-computed frustum pseudo point cloud, specifically through the following sub-steps:
S2.3.1, constructing a frustum pseudo point cloud according to the preset depth distribution and picture size; the frustum pseudo point cloud represents, for each pixel of the picture, its pixel coordinates together with the candidate depth values of that pixel.
S2.3.2, subtracting the picture-augmentation translation matrix from the pseudo point cloud pixel coordinates and multiplying by the inverse of the picture-augmentation rotation matrix to compensate for the picture augmentation. The compensated pixel coordinates are converted to vehicle coordinates using the inverse of the camera intrinsics and the camera-to-vehicle transformation. The vehicle coordinates are multiplied by the BEV feature augmentation matrix to keep them synchronized with the BEV-space data augmentation. The minimum values of the preset BEV space in the three directions are subtracted from the vehicle coordinates and the result is divided by the edge length of each voxel, giving the voxel coordinates of the frustum pseudo point cloud in the BEV space; frustum points falling outside the preset range are excluded, and the indices of the frustum pseudo point cloud in the BEV space are generated.
S2.3.3, screening out the context features and depth value indices to be converted according to the index values, and sorting the frustum pseudo points by voxel index so that pseudo points belonging to the same voxel are adjacent. Using multi-threaded parallel processing, the features of the multiple pseudo points falling into the same voxel are weighted by the predicted depth and summed, finally yielding the BEV features.
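A minimal sketch of the BEV Pooling idea described in S2.3.1-S2.3.3 is given below (PyTorch, single camera). It assumes the frustum pseudo points have already been transformed to vehicle coordinates; the augmentation compensation of S2.3.2 and the multi-threaded processing of S2.3.3 are omitted, and all tensor names are illustrative.

```python
# Simplified BEV Pooling sketch (PyTorch), not the exact patented implementation.
import torch

def bev_pooling(context, depth_prob, frustum_xyz, bev_min, voxel_size, bev_shape):
    """
    context:     (D, H, W, C)  per-pixel context features broadcast over depth bins
    depth_prob:  (D, H, W)     predicted depth distribution per pixel
    frustum_xyz: (D, H, W, 3)  frustum pseudo points already in vehicle coordinates
    bev_min:     (3,) tensor   minimum x/y/z of the BEV volume
    voxel_size:  (3,) tensor   edge length of a voxel in each direction
    bev_shape:   (X, Y, Z)     number of voxels per direction
    returns:     (X*Y*Z, C)    features scattered into the flattened voxel grid
    """
    C = context.shape[-1]
    vox = ((frustum_xyz - bev_min) / voxel_size).long()            # voxel indices
    keep = ((vox >= 0) & (vox < torch.tensor(bev_shape))).all(-1)  # inside the grid
    vox, ctx, w = vox[keep], context[keep], depth_prob[keep]
    flat = vox[:, 0] * bev_shape[1] * bev_shape[2] + vox[:, 1] * bev_shape[2] + vox[:, 2]
    bev = torch.zeros(bev_shape[0] * bev_shape[1] * bev_shape[2], C)
    # depth-weighted sum of all pseudo points falling into the same voxel
    bev.index_add_(0, flat, ctx * w.unsqueeze(-1))
    return bev
```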
S2.4, constructing a BEV feature processing module, namely using a ResNet network as the backbone, feeding features of different levels into an FPN (feature pyramid network), and integrating features of different scales.
S2.5, constructing an occupancy prediction head. The input feature map is first processed by a convolution layer, followed by a channel-to-height operation that decodes the height-direction voxels from the channel dimension of the feature map to obtain predictions for all classes.
Further, the channel-to-height operation in step S2.5 is specifically implemented by the following sub-steps:
S2.5.1, the feature map with C_f input channels is passed through a fully connected layer, an activation function and another fully connected layer, producing an output with D_z × C channels, where D_z and C are the number of voxels in the height direction and the total number of voxel categories in S1.5.4, respectively.
S2.5.2, the output channels are split to obtain a feature map of size (D_x, D_y, D_z, C), representing the model's prediction scores over the C categories for all voxels in the three-dimensional voxel space.
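A small sketch of the channel-to-height decoding in S2.5.1-S2.5.2, under the assumption that the head consists of two fully connected layers acting on the channel dimension; the layer widths below are illustrative and follow the shapes used later in Example 1.

```python
# Channel-to-height sketch (PyTorch): decode height-direction voxels from channels.
import torch
import torch.nn as nn

class ChannelToHeightHead(nn.Module):
    def __init__(self, c_in, hidden, d_z, num_classes):
        super().__init__()
        self.d_z, self.num_classes = d_z, num_classes
        self.mlp = nn.Sequential(
            nn.Linear(c_in, hidden),
            nn.Softplus(),
            nn.Linear(hidden, d_z * num_classes),  # D_z * C output channels
        )

    def forward(self, x):           # x: (B, D_x, D_y, C_f) channel-last BEV features
        b, dx, dy, _ = x.shape
        x = self.mlp(x)             # (B, D_x, D_y, D_z * C)
        return x.view(b, dx, dy, self.d_z, self.num_classes)  # split height and class

# Example with the kind of shapes described in the text (values are illustrative):
head = ChannelToHeightHead(c_in=256, hidden=512, d_z=16, num_classes=18)
print(head(torch.randn(8, 120, 160, 256)).shape)  # torch.Size([8, 120, 160, 16, 18])
```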
Further, the step S3 is specifically implemented by the following steps:
S3.1, setting the input camera, the input picture size, the picture augmentation matrix, the BEV-space grid size, the depth prediction range, the BEV feature augmentation matrix, the optimizer and the learning rate.
S3.2, constructing an improved 2D feature extraction module, namely adopting a ResNet network as the backbone to extract multi-scale features. Features of different levels are fed into an FPN (feature pyramid network) to integrate features of different scales. The FPN output features are further fed into two residual blocks and then up-sampled; the output of the two residual blocks is supervised by the 2D semantic pseudo labels. The FPN features and the features after the two residual blocks are concatenated along the channel dimension and passed through a convolution layer to strengthen the 2D features.
S3.3, constructing a 2D-to-3D feature conversion module, namely feeding the features obtained in step S3.2 into a convolution layer, which outputs a coupled feature of context features and depth estimation. The context features and the depth estimate are then separated from the coupled feature. BEV features are obtained using the same BEV Pooling operation as in S2.3.
S3.4, constructing an improved BEV feature processing module, with a ResNet network as the backbone. A Spatial Attention (SA) module is added after each residual block of the ResNet network to spatially enhance the feature extraction of the BEV feature map. Spatial attention is computed as:
M_spatial = σ(Conv_7×7([AvgPool(F); MaxPool(F)]))  (3.1)
F_out = F × M_spatial  (3.2)
where F is the input feature map, AvgPool(·) and MaxPool(·) are the average pooling and maximum pooling operations along the channel dimension, respectively, Conv_7×7 is a 7 × 7 convolution with 2 input channels and 1 output channel, σ is the Sigmoid function, and M_spatial is the generated spatial attention map. The output feature map F_out is the element-wise product of the input feature map F and the spatial attention map M_spatial.
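The spatial attention of equations (3.1)-(3.2) can be sketched as the following PyTorch module; this is a generic CBAM-style spatial attention consistent with the formulas above, not necessarily identical to the patented implementation.

```python
# Spatial attention sketch implementing equations (3.1)-(3.2).
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        # 2-channel input (avg + max pooled maps), 1-channel attention output
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, f):                       # f: (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)       # AvgPool along the channel dimension
        mx, _ = f.max(dim=1, keepdim=True)      # MaxPool along the channel dimension
        m = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # M_spatial
        return f * m                            # F_out = F * M_spatial

# Example: enhance a BEV feature map after a residual block
feat = torch.randn(8, 256, 160, 120)
print(SpatialAttention()(feat).shape)  # torch.Size([8, 256, 160, 120])
```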
S3.5, constructing an occupancy prediction head. The input feature map is first processed through a convolution layer, followed by a channel-to-height operation that decodes the height-direction voxels from the channel dimension of the feature map to obtain predictions for all classes.
Further, the step S4 is specifically implemented by the following steps:
S4.1, reading, from the training set in one pass, the temporally ordered front-camera pictures, the camera-to-vehicle coordinate transformation matrix, the vehicle-to-global coordinate transformation matrix, the camera intrinsics, the rotation and translation matrices used for picture augmentation, and the BEV-space augmentation matrix, which together form the input data of the teacher model during training.
S4.2, executing the training forward pass, which returns the prediction of the occupancy network and the depth values predicted by the 2D-to-3D feature conversion module.
And S4.3, reading the real occupation of the voxels according to the storage path of the voxel true value contained in the S1.5 pkl file.
S4.4, reading the point cloud data according to the point cloud storage path contained in the S1.5 .pkl file, and generating a depth ground truth from the point cloud to supervise the depth values predicted by the 2D-to-3D feature conversion module.
Further, said step S4.4 is realized by the sub-steps of:
S4.4.1, the three-dimensional coordinates (x, y, z) of the point cloud data are read according to the .pkl file.
S4.4.2 converting the read point cloud data into a camera coordinate system by using a transformation matrix from the vehicle coordinate system to the camera coordinate system.
S4.4.3, the point cloud data converted into the camera coordinate system are further projected onto the image plane using the camera intrinsic matrix: points in three-dimensional space are projected to pixel coordinates (u, v) on the two-dimensional image plane while the depth information d (i.e. the distance) is retained.
S4.4.4, generating a depth map: for each point projected onto the image plane, its corresponding depth value d is recorded at the pixel coordinates (u, v); when multiple points fall on the same pixel, the point nearest to the camera is taken as the depth value of that pixel in the depth map.
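As an illustration of S4.4.1-S4.4.4, the sketch below projects a LiDAR point cloud into a sparse depth map and keeps the nearest depth per pixel; the function signature, the homogeneous 4 × 4 extrinsic matrix and the 3 × 3 intrinsic matrix are assumptions.

```python
# Sketch of depth ground-truth generation (S4.4): project LiDAR points to the image.
import numpy as np

def lidar_to_depth_map(points_xyz, ego2cam, intrinsics, h, w):
    """
    points_xyz: (N, 3) point cloud in the vehicle (ego) coordinate system
    ego2cam:    (4, 4) transform from vehicle to camera coordinates
    intrinsics: (3, 3) camera intrinsic matrix
    returns:    (h, w) sparse depth map, 0 where no point projects
    """
    pts = np.concatenate([points_xyz, np.ones((len(points_xyz), 1))], axis=1)
    cam = (ego2cam @ pts.T)[:3]                  # (3, N) camera coordinates
    depth = cam[2]
    valid = depth > 1e-3                         # keep points in front of the camera
    uv = intrinsics @ cam[:, valid]              # perspective projection
    uv = (uv[:2] / uv[2]).round().astype(int)    # pixel coordinates (u, v)
    d = depth[valid]
    inside = (uv[0] >= 0) & (uv[0] < w) & (uv[1] >= 0) & (uv[1] < h)
    uv, d = uv[:, inside], d[inside]
    depth_map = np.zeros((h, w), dtype=np.float32)
    # keep the nearest point when several fall on the same pixel
    order = np.argsort(-d)                       # far points first, near points overwrite
    depth_map[uv[1][order], uv[0][order]] = d[order]
    return depth_map
```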
S4.5, creating a mask for eliminating the occupation true value outside the field of view range of the front camera.
S4.6, calculating the binary cross entropy loss between the predicted depth and the real depth:
L_depth = -Σ_i [ d_i·log(d̂_i) + (1 - d_i)·log(1 - d̂_i) ]  (4.1)
where d and d̂ are the true depth label and the predicted depth, respectively.
S4.7, calculating the cross entropy loss between the predicted occupancy and the real occupancy:
L_occ = -Σ_{i=1}^{C} o_i·log(ô_i)  (4.2)
where C is the total number of voxel categories in S1.5.4, o_i is the one-hot real label of the i-th class obtained from the voxel occupancy ground truth in S4.3, and ô_i is the probability of the i-th class predicted by the model. Losses outside the front camera field of view are excluded by the mask when the loss is computed.
S4.8, the final loss L_total is:
L_total = λ_1·L_depth + λ_2·L_occ  (4.3)
where λ_1 and λ_2 are the weights of the depth loss and the occupancy loss. The final loss is returned and a back-propagation pass is performed to update the gradients.
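A compact sketch of the supervision in S4.6-S4.8 is given below: the depth binary cross entropy, the mask-restricted occupancy cross entropy, and their weighted sum as in equation (4.3). Tensor names and the default weights are illustrative.

```python
# Sketch of the teacher training losses (S4.6-S4.8).
import torch
import torch.nn.functional as F

def teacher_loss(depth_pred, depth_gt, occ_logits, occ_gt, fov_mask,
                 lambda_depth=0.05, lambda_occ=1.0):
    """
    depth_pred: (B, D, H, W)    predicted depth-bin probabilities
    depth_gt:   (B, D, H, W)    one-hot depth-bin ground truth
    occ_logits: (B, X, Y, Z, C) per-voxel class scores
    occ_gt:     (B, X, Y, Z)    integer voxel labels
    fov_mask:   (B, X, Y, Z)    True for voxels inside the front-camera field of view
    """
    l_depth = F.binary_cross_entropy(depth_pred.clamp(1e-6, 1 - 1e-6), depth_gt)
    logits = occ_logits[fov_mask]                # (M, C) valid voxels only
    labels = occ_gt[fov_mask].long()             # (M,)
    l_occ = F.cross_entropy(logits, labels)
    return lambda_depth * l_depth + lambda_occ * l_occ
```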
S4.9, putting the trained teacher model into evaluation (eval) mode, performing forward inference, calculating the mIoU (mean Intersection over Union) between the inference result and the occupancy prediction ground truth, and evaluating the performance of the teacher model through the mIoU value. The mIoU is calculated as follows:
For each class c, its IoU (Intersection over Union) value is calculated. IoU is defined as the intersection of the predicted region and the real region divided by their union:
IoU(c) = TP(c) / (TP(c) + FP(c) + FN(c))  (4.4)
where TP(c) is the number of true positives (True Positives) of class c, FP(c) the number of false positives (False Positives), and FN(c) the number of false negatives (False Negatives).
mIoU is the average of IoU over all classes:
mIoU = (1/C)·Σ_{c=1}^{C} IoU(c)  (4.5)
where C is the number of classes.
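The mIoU evaluation of equations (4.4)-(4.5) can be sketched as follows; how classes that are absent from both the prediction and the ground truth are handled is an implementation choice, and here they are simply skipped.

```python
# Sketch of the mIoU metric (equations 4.4-4.5).
import numpy as np

def miou(pred, gt, num_classes):
    """pred, gt: integer arrays of the same shape containing voxel class indices."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:                      # skip classes missing from this sample
            ious.append(tp / denom)
    return float(np.mean(ious))
```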
Further, the step S5 is specifically implemented by the following steps:
S5.1, reading and freezing the weights of the trained teacher model, putting the teacher model into evaluation (eval) mode and the student model into training (train) mode.
S5.2, in one training iteration, reading the front-camera pictures of three adjacent frames from the training set in one pass and feeding them into the teacher model for forward inference, obtaining the feature map T_feat of the last convolution layer in the occupancy prediction head and the output T_out of the occupancy prediction head.
S5.3, in the same training iteration, feeding the picture of the current frame into the student model and executing the training forward pass, obtaining the feature map S_feat of the last convolution layer in the occupancy prediction head and the output S_out of the occupancy prediction head.
S5.4, for S_feat after passing through an MLP layer, and T_feat, calculating the L2 loss:
L_feat = (1/N)·Σ ‖ MLP(S_feat) - T_feat ‖²  (5.1)
where the MLP layer consists of a 1×1 convolution + ReLU + 1×1 convolution, and N is the batch size during training. The mask is used so that the L2 loss is computed only over valid voxels within the front camera field of view.
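A minimal sketch of the feature distillation step S5.4: the student feature passes through a 1 × 1-convolution MLP adapter before a masked L2 loss against the (frozen) teacher feature. Channel sizes and tensor layouts are assumptions.

```python
# Sketch of feature distillation (S5.4): MLP adapter + masked L2 loss.
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """1x1 conv + ReLU + 1x1 conv, mapping student channels to teacher channels."""
    def __init__(self, c_student, c_teacher):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(c_student, c_teacher, 1),
            nn.ReLU(),
            nn.Conv2d(c_teacher, c_teacher, 1),
        )

    def forward(self, s):
        return self.mlp(s)

def feature_distill_loss(s_feat, t_feat, adapter, fov_mask):
    """
    s_feat, t_feat: (B, C, X, Y) BEV feature maps from the prediction heads
    fov_mask:       (B, X, Y)    True inside the front-camera field of view
    """
    diff = (adapter(s_feat) - t_feat.detach()) ** 2      # teacher is frozen
    diff = diff.sum(dim=1)                               # (B, X, Y) squared L2 per cell
    return diff[fov_mask].mean()
```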
S5.5, decoding the height-direction occupancy predictions from T_out and S_out, respectively, to obtain the prediction scores y_T and y_S of the teacher model and the student model over all voxels and all object classes in the three-dimensional voxel space.
S5.6, for the predictions of the teacher model and the student model in a training batch, using the mask to select the valid voxels within the front camera field of view, and calculating, for the prediction scores of the voxel spaces corresponding to all categories of the teacher model and the student model, the KL divergence as the distillation loss:
L_cw3d = (T²/C)·Σ_{c=1}^{C} Σ_{i=1}^{W·H·Z} φ(y_T)_{c,i}·log( φ(y_T)_{c,i} / φ(y_S)_{c,i} )  (5.2)
where φ(·) converts the prediction scores into a probability distribution:
φ(y)_{c,i} = exp(y_{c,i}/T) / Σ_{j=1}^{W·H·Z} exp(y_{c,j}/T)  (5.3)
y_T and y_S are the outputs of the teacher model and the student model, respectively, representing the prediction scores of the valid voxels for the C categories in the three-dimensional voxel space of size W·H·Z. T is the distillation temperature used to smooth the prediction distribution of the student model; a larger T corresponds to attending to a wider region of the three-dimensional space.
The distillation temperature varies over the course of training according to the following formula, following a high-temperature-first, low-temperature-later distillation strategy; the temperature T for each training epoch is computed as:
where T_0 is the starting temperature, T_end is the ending temperature, E_total is the total number of training epochs, and e is the current epoch.
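The output distillation of S5.6 can be sketched as below: for each class channel, the scores of the valid voxels are turned into a spatial probability distribution with a temperature-scaled softmax, and the teacher-student KL divergence is accumulated over classes. The per-class averaging, the T² scaling and the linear temperature decay are assumptions, since the exact formulas are not reproduced here.

```python
# Sketch of output (channel-wise) distillation with an annealed temperature (S5.6).
import torch
import torch.nn.functional as F

def temperature(epoch, t0=4.0, t_end=2.0, total_epochs=24):
    # high-to-low schedule; a linear decay is assumed here for illustration
    return t_end + (t0 - t_end) * (1.0 - epoch / total_epochs)

def cw3d_distill_loss(y_s, y_t, fov_mask, T):
    """
    y_s, y_t: (B, X, Y, Z, C) student / teacher voxel class scores
    fov_mask: (B, X, Y, Z)    valid voxels inside the front-camera field of view
    """
    s = y_s[fov_mask].t() / T                  # (C, M): valid voxels per class channel
    t = y_t[fov_mask].detach().t() / T
    log_p_s = F.log_softmax(s, dim=1)          # softmax over the voxel dimension
    p_t = F.softmax(t, dim=1)
    kl = (p_t * (torch.log(p_t + 1e-12) - log_p_s)).sum(dim=1)   # per-class KL
    return (T * T) * kl.mean()                 # average over class channels
```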
S5.7, for the voxel prediction categories in S1.5.4, the voxels are grouped into three classes according to practical requirements: foreground objects, background objects and air. Foreground objects may include vehicles, pedestrians and other road users that deserve the most attention, while background objects may include trees, roads, buildings, etc. When calculating the distillation loss, different weights are given to these three groups.
S5.8, total loss in the distillation process is as follows:
L_total = λ_seg·L_seg + λ_occ·L_occ + λ_feat·L_feat + λ_cw3d·L_cw3d  (5.5)
where L_seg is the 2D semantic loss of the student model itself, i.e. the cross entropy loss between the 2D semantic map and the 2D semantic pseudo labels:
L_seg = -(1/(H·W))·Σ_{i=1}^{H} Σ_{j=1}^{W} log p̂_{i,j,y_{i,j}}  (5.6)
where p̂_{i,j,y_{i,j}} is the probability predicted by the model for the true class y_{i,j} at pixel (i, j), and H and W are the height and width of the feature map, respectively.
L_occ is the occupancy prediction loss of the student model itself, calculated identically to equation (4.2) in S4.7.
L_feat is the L2 distillation loss of the feature map.
L_cw3d is the distillation loss of the final output.
λ_seg, λ_occ, λ_feat and λ_cw3d are the respective weights of the four losses. For the distillation loss, a cosine annealing learning rate strategy is set. The final loss is returned and a back-propagation pass is performed to update the gradients.
S5.9, putting the distillation-trained student model into evaluation (eval) mode, performing forward inference, calculating the mIoU (mean Intersection over Union) between the inference result and the occupancy prediction ground truth, and evaluating the performance of the student model through the mIoU value. The mIoU is calculated as in equation (4.5) in S4.9.
Further, the step S6 is specifically implemented by the following steps:
S6.1, inputting the vehicle-mounted front-view camera picture, the camera-to-vehicle coordinate transformation matrix, the vehicle-to-global coordinate transformation matrix and the camera intrinsics into the student model, which is in evaluation (eval) mode, and performing forward inference to obtain a prediction with shape (D_x, D_y, D_z).
S6.2, initializing the data structures, including an array labeled_array with the same shape as the prediction for storing connected-component identifiers, a boolean array visited for recording which voxels have been accessed, and a list region_props for storing the attributes of each connected component, such as the number of voxels, the class, the voxel closest to the own vehicle, the distance, etc.
S6.3, using a breadth-first search strategy to find the connected component of each object of interest and the voxel coordinates nearest to the own vehicle, and recording the connected components that meet the requirements. For each connected component, all of its voxels are traversed and the Euclidean distance from each voxel to the own vehicle is calculated.
Further, the step S6.3 is implemented by the following sub-steps:
S6.3.1, traversing all voxels; a voxel meets the requirements if it satisfies the following conditions: 1. the voxel lies within the field of view of the front camera; 2. the voxel belongs to a class of interest; 3. the voxel has not been accessed.
S6.3.2 for satisfactory voxels, initializing a breadth first search queue and adding the current voxel coordinates to the queue while marking it as accessed. Initializing a voxel counter of the current connected domain to be 1, and recording the current voxel as the nearest voxel to the own vehicle, wherein the distance from the current voxel to the own vehicle is the minimum distance.
S6.3.3, performing breadth-first search until the queue is empty. Each time a voxel coordinate is taken out of the queue, its 6 neighboring voxels are checked against the requirements of S6.3.1. Neighbors meeting the requirements are marked as accessed and added to the queue, the voxel counter of the connected component is updated, the distance from each such voxel to the own vehicle is calculated, and the minimum distance is updated if this distance is smaller than the previously recorded one.
S6.3.4, after the search finishes, if the voxel counter of the current connected component is greater than 1, the connected component is recorded in labeled_array and its attributes, including the number of voxels, the category, the voxel closest to the own vehicle, the distance and other information, are recorded in region_props; traversal then continues with the next voxel.
S6.4, for each connected component, finding the minimum among all the distances and taking it as the minimum distance of the current connected component, i.e. the distance from the own vehicle to the current instance of interest.
S6.5, if this distance is smaller than a preset safety threshold, generating alarm information.
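A simplified sketch of the ranging procedure in S6.2-S6.5: breadth-first search over 6-connected voxels of the classes of interest, recording the component size and the minimum Euclidean distance to the own vehicle. The field-of-view test of S6.3.1 is omitted, and restricting each component to a single class, the function name and the 0.4 m voxel size default are assumptions.

```python
# Sketch of BFS connected-component analysis for monocular ranging (S6.2-S6.5).
from collections import deque
import numpy as np

def measure_distances(occ, interest_classes, ego_idx, voxel_size=0.4, min_voxels=2):
    """
    occ:      (X, Y, Z) predicted voxel class indices
    ego_idx:  (3,) voxel coordinates of the own vehicle
    returns:  list of dicts with size, class, nearest voxel and distance per component
    """
    visited = np.zeros(occ.shape, dtype=bool)
    regions = []
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    for seed in map(tuple, np.argwhere(np.isin(occ, list(interest_classes)))):
        if visited[seed]:
            continue
        cls, queue = occ[seed], deque([seed])
        visited[seed] = True
        count, best_vox = 0, seed
        best_dist = np.linalg.norm((np.array(seed) - ego_idx) * voxel_size)
        while queue:                                  # breadth-first search
            v = queue.popleft()
            count += 1
            d = np.linalg.norm((np.array(v) - ego_idx) * voxel_size)
            if d < best_dist:
                best_dist, best_vox = d, v
            for dx, dy, dz in neighbours:             # 6-connected neighbours
                n = (v[0] + dx, v[1] + dy, v[2] + dz)
                if all(0 <= n[i] < occ.shape[i] for i in range(3)) \
                        and not visited[n] and occ[n] == cls:
                    visited[n] = True
                    queue.append(n)
        if count >= min_voxels:                       # keep components larger than 1 voxel
            regions.append({"voxels": count, "class": int(cls),
                            "nearest_voxel": tuple(int(x) for x in best_vox),
                            "distance_m": float(best_dist)})
    return regions
```

If the smallest distance_m among the returned components is below the safety threshold of S6.5, alarm information would be generated.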
A second aspect of the present invention relates to a monocular distance measuring device for driving assistance of a vehicle based on a lightweight occupancy prediction network, comprising a memory and one or more processors, wherein executable codes are stored in the memory, and the one or more processors are used for implementing the monocular distance measuring method for driving assistance of a vehicle based on a lightweight occupancy prediction network according to the present invention when executing the executable codes.
A third aspect of the present invention relates to a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements a method for monocular ranging for driving assistance for a vehicle based on a lightweight occupancy prediction network according to the present invention.
The innovation point of the invention is that:
(1) The method introduces occupancy prediction, currently a popular technique in the autonomous driving field, into monocular ranging. Modeling the 3D space through occupancy prediction overcomes drawbacks inherent to conventional monocular ranging.
(2) The model used is lightweight and easy to deploy: the improved 2D feature processing part adds 2D semantic supervision, and the improved BEV feature processing part adds a spatial attention mechanism. The added modules require only a small amount of computation, yet improve the model's ability to extract 2D features and BEV features.
(3) A knowledge distillation technique combining feature distillation and output distillation is used to enhance model performance. Feature distillation uses an MLP layer to alleviate the feature differences between the teacher model and the student model, so that the student model can learn the features better. Output distillation extends channel-wise distillation from 2D semantic segmentation to 3D space, which significantly improves performance on the dense prediction task of occupancy prediction.
Compared with the prior art, the technical scheme provided by the implementation of the invention has the following beneficial effects:
Traditional monocular ranging methods mainly rely on the width of the detected object's bounding box and known camera intrinsics, estimating the object distance through geometric calculation. Such methods depend strongly on accurate object detection; if detection is inaccurate or the object is occluded, ranging accuracy degrades greatly. They also usually require the actual size or appearance characteristics of the target object to be known in advance, so a database containing information about various objects must be maintained. This is a challenge for application scenarios where object types are numerous or change frequently, and traditional methods may fail to provide accurate ranging information when objects not in the database appear. Traditional methods are also susceptible to camera parameter errors and object shape; in particular, ranging errors increase remarkably for irregular objects. When the input image quality is unstable (e.g. lighting changes, motion blur), the size and position of the detection box may vary (jitter) between frames even if the target object is relatively static; this jitter directly affects the distance computed from the detection box width and leads to unstable ranging. In contrast, the monocular ranging method based on a lightweight occupancy prediction network does not rely on prior information about objects, but obtains distance information and semantic information directly from the output of an occupancy prediction network that panoramically perceives the 3D space. The method requires no object database and can still provide reasonable distance estimates for newly appearing or irregularly shaped objects, improving the universality and adaptability of the system and reducing maintenance cost. Meanwhile, the invention generates a stable three-dimensional spatial representation through occupancy prediction, overcoming the ranging errors caused by detection box jitter in traditional methods.
Regarding real-time performance and computational efficiency, a lightweight student model with added 2D semantic supervision and spatial attention enhances the perception of the monocular input image and the BEV features. Meanwhile, by learning from a complex teacher model through knowledge distillation, high-precision features are retained while the complexity and computation of the model are significantly reduced. After model quantization, the method is suitable for vehicle-mounted systems with high real-time requirements.
The technical scheme of the invention is not only suitable for monocular ranging, but also can be expanded to other application scenes related to occupation prediction, such as path planning, obstacle detection and the like in automatic driving.
In summary, the invention ensures the model performance, simultaneously considers the real-time performance and the high-efficiency utilization of the computing resources, and provides an efficient and practical solution for monocular ranging and related applications.
Drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 is a diagram of a monocular occupancy prediction teacher model based on multi-frame time sequence fusion and time sequence stereoscopic depth estimation of the present invention.
FIG. 3 is a diagram of a lightweight monocular occupancy prediction student model based on single frame input in accordance with the present invention.
FIG. 4 is a range of the front camera field of view of the present invention in BEV space.
Fig. 5 is a schematic diagram of the present invention using a teacher model to distill knowledge of a student model.
Fig. 6 is a flow chart of a monocular ranging method of a lightweight monocular occupancy prediction student model based on single frame input of the present invention.
FIG. 7 is a graph showing the effect of the student model occupancy prediction reasoning after distillation training in the invention.
Fig. 8 is an effect diagram of monocular ranging using the occupation prediction reasoning result of the student model according to the present invention.
Detailed Description
The objects and effects of the present invention will become more apparent from the following detailed description of the present invention with reference to the accompanying drawings and preferred examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Example 1
The principle of the invention is that a 3D occupation prediction data set is firstly constructed, and a monocular occupation prediction teacher model based on multi-frame time sequence fusion and time sequence three-dimensional depth estimation and a light monocular occupation prediction student model based on single-frame input are respectively constructed. The light-weight monocular occupation prediction student model is added with 2D semantic pseudo tag supervision after the FPN layer of 2D feature processing, and a spatial attention module is added after each residual block of the BEV feature backbone network, so that the extraction capability of the 2D features and the BEV features is respectively enhanced. Training the teacher model by using the time sequence input pictures to obtain the occupied prediction teacher model with good performance, and evaluating the performance of the teacher model. The teacher model is in an evaluation (eval) mode during distillation training, time sequence input is adopted, the student model is in a training mode, only key frames are accepted as input, and characteristics and final output of an intermediate layer of the teacher model are used for guiding the study of the student model. And obtaining a light-weight occupancy prediction student model with improved performance after distillation training, and evaluating the performance of the student model. And finally, reasoning by using a student model, finding out the connected domain of all interested examples (such as vehicles, pedestrians and the like) and the Euclidean distance from each voxel to the own vehicle by carrying out connected domain analysis based on breadth-first search on the output result, and selecting the minimum Euclidean distance to finish monocular distance measurement.
As one implementation method, the method for monocular ranging of automobile assisted driving based on lightweight occupancy prediction network of the present invention, as shown in fig. 1, specifically comprises the following steps:
the step S1 of constructing the 3D occupancy prediction data set specifically comprises the following steps:
S1.1, collecting sensor data, namely collecting original data in a road environment by using a collecting vehicle provided with a plurality of sensors (laser radar and cameras). Such data should include a point cloud of multiple time stamps, image data, etc.
And S1.2, marking point cloud data, namely manually marking the semantics of the 3D point cloud of the key frame in the collected data. As shown in table 1:
TABLE 1 Point cloud semantic indexing
S1.3, generating an occupancy prediction truth value label, namely voxelizing the marked point cloud data, and generating an accurate 3D truth value.
Further, the step S1.3 is specifically implemented by the following steps:
S1.3.1 aggregating multiple frames of point clouds to generate dense point cloud data.
S1.3.2, assigning semantic tags to the unlabeled frames using the K-nearest-neighbor algorithm (KNN): for each point in an unlabeled frame, its K nearest points in the labeled frames are found, and a semantic tag is assigned to it by majority vote.
S1.3.3 grid reconstruction (Mesh reconstruction) of the aggregated dense point cloud using VDBFusion method.
S1.3.4, carrying out dense point sampling on the reconstructed mesh, and again using KNN to semantically label the sampled points, obtaining an accurate 3D ground truth.
S1.4, dividing the training set, validation set and test set. The training set and validation set together comprise 850 scenes with 34149 frames of data in total; the test set contains 150 scenes with 6000 frames of data in total.
S1.5, storing the data information in an easily readable form, such as a .pkl file, so that it can be conveniently read during training and validation.
The pkl file contains:
(1) The data identification comprises a token of the current frame and the previous and subsequent frames thereof, whether the current frame is a key frame identification, a scene token where the frame data is located, and the like.
(2) The conversion matrix comprises an external parameter matrix of the laser radar, an internal and external parameter matrix of the camera, a conversion matrix from a camera coordinate system to a vehicle coordinate system and a conversion matrix from the vehicle coordinate system to a global coordinate system.
(3) The storage path comprises a point cloud data storage path, a camera picture storage path, a 2D semantic pseudo tag storage path and an occupancy (voxel) true value storage path.
Further, the vehicle coordinate system, the global coordinate system and the occupancy (voxel) ground truth in step S1.5 are specifically as follows:
S1.5.1 the vehicle coordinate system is a local coordinate system with the vehicle (vehicle) as a reference point, and is used for describing the sensor data and the position of the object relative to the vehicle. The origin is located in the center of the vehicle, i.e., the geometric center of the vehicle chassis. The X-axis is directed forward of the vehicle. The Y-axis is directed to the left of the vehicle. The Z axis is directed toward the top of the vehicle.
S1.5.2 the global coordinate system is a fixed world coordinate system that describes the absolute position of an object throughout the environment. Is not changed with the movement of the vehicle. The origin is located at the point in the GPS where the longitude and latitude are both 0 °. The X-axis is directed eastward. The Y-axis is directed north. The Z-axis points to the sky.
S1.5.3 the 2D semantic pseudo label is obtained by projecting the semantically annotated 3D point cloud of S1.2 onto the 2D plane and then applying morphological processing and smoothing; its shape matches the size of the input picture.
S1.5.4 the occupancy (voxel) ground truth covers all scene key frames and consists of voxel labels of a 3D grid with shape (200, 200, 16), where each value is the predicted semantic class index of the corresponding voxel in 3D space; compared with Table 1, an additional class 17 for air (free) is added.
The construction of the monocular occupancy prediction teacher model based on multi-frame time sequence fusion and time sequence three-dimensional depth estimation in step S2, as shown in fig. 2, specifically includes the following steps:
S2.1, setting the input camera as the front camera, the input picture size as 256 × 704 pixels, the picture augmentation scale as (-0.06, 0.11), and the BEV-space X, Y and Z ranges as (-8 m, 40 m), (-32 m, 32 m) and (-1 m, 5.4 m), respectively. The depth prediction range is set to predict, at a resolution of 0.5 m over the range from 1 m to 45 m from the host vehicle, a distribution of length (45-1)/0.5 = 88. Two frames plus 1 additional reference frame are used as temporal inputs, and the BEV-space augmentation flips the BEV features along the X axis with 50% probability. The optimizer is Adam with the learning rate set to 1e-4, and the batch size during training is 8.
S2.2, constructing a 2D feature extraction module, namely adopting a ResNet network as the backbone to extract the features of layer 0, layer 2 and layer 3. Features of different levels are fed into an FPN (feature pyramid network) to integrate features of different scales. The resulting FPN output feature has size (8, 256, 16, 44).
S2.3, constructing a 2D-to-3D feature conversion module, namely predicting depth and context features simultaneously. When predicting depth, monocular depth prediction and multi-view depth prediction based on MVS (temporal multi-view stereo) are combined. For multi-view depth prediction, the depth center (μ) and depth range (σ) are first predicted, then μ and σ are used to generate a depth distribution, and finally the monocular depth prediction and the weighted multi-view depth are combined to obtain the final depth. BEV features are obtained using the BEV Pooling operation. For the 3 consecutive, temporally ordered input frames, the BEV features of the current key frame and the previous frame are extracted separately, the BEV features of the previous frame are projected into the BEV coordinate system of the current key frame, and the multi-frame BEV features are fused by concatenation along the channel dimension.
Further, the BEV Pooling operation in step S2.3 projects the 2D features into the BEV space according to the camera's intrinsic and extrinsic parameters and a pre-computed frustum pseudo point cloud, specifically through the following sub-steps:
S2.3.1, constructing a frustum pseudo point cloud according to the preset depth distribution and picture size; the frustum pseudo point cloud represents, for each pixel of the picture, its pixel coordinates together with the candidate depth values of that pixel.
S2.3.2, subtracting the picture-augmentation translation matrix from the pseudo point cloud pixel coordinates and multiplying by the inverse of the picture-augmentation rotation matrix to compensate for the picture augmentation. The compensated pixel coordinates are converted to vehicle coordinates using the inverse of the camera intrinsics and the camera-to-vehicle transformation. The vehicle coordinates are multiplied by the BEV feature augmentation matrix to keep them synchronized with the BEV-space data augmentation. The minimum values of the preset BEV space in the three directions are subtracted from the vehicle coordinates and the result is divided by the edge length of each voxel, giving the voxel coordinates of the frustum pseudo point cloud in the BEV space; frustum points falling outside the preset range are excluded, and the indices of the frustum pseudo point cloud in the BEV space are generated.
S2.3.3, screening out the context features and depth value indices to be converted according to the index values, and sorting the frustum pseudo points by voxel index so that pseudo points belonging to the same voxel are adjacent. Using multi-threaded parallel processing, the features of the multiple pseudo points falling into the same voxel are weighted by the predicted depth and summed, finally yielding BEV features of size (8, 64, 160, 120).
S2.4, constructing a BEV feature processing module, namely using a ResNet network as the backbone, feeding features of different levels into an FPN (feature pyramid network), and integrating features of different scales. The final FPN output is a feature map of size (8, 256, 160, 120).
S2.5, constructing an occupancy prediction head. The input feature map of size (8, 256, 160, 120) is first processed by a convolution layer to obtain a feature map of size (8, 120, 160, 256), and a channel-to-height operation then outputs a feature map of size (8, 120, 160, 16, 18), representing the model's prediction scores over 18 categories for all voxels in the 120 × 160 × 16 three-dimensional voxel space.
Further, the channel-to-height operation in step S2.5 is specifically implemented by the following sub-steps:
S2.5.1, the feature map of size (8, 120, 160, 256) is passed through a fully connected layer with input dimension 256 and output dimension 512, a Softplus activation, and a fully connected layer with input dimension 512 and output dimension 288, yielding a feature map of size (8, 120, 160, 288).
S2.5.2, the last dimension of 288 is split into 16 × 18, giving a feature map of size (8, 120, 160, 16, 18) that represents the model's prediction scores over 18 classes for all voxels in the 120 × 160 × 16 three-dimensional voxel space.
The step S3 of constructing a light-weight monocular occupation prediction student model based on single-frame input, as shown in FIG. 3, specifically comprises the following steps:
S3.1, setting the input camera as the front camera, the input picture size as 256 × 704, the picture augmentation scale as (-0.06, 0.11), and the BEV-space X, Y and Z ranges as (-8 m, 40 m), (-32 m, 32 m) and (-1 m, 5.4 m), respectively. The depth prediction range is set to predict, at a resolution of 0.5 m over the range from 1 m to 45 m from the host vehicle, a distribution of length 88; the BEV-space augmentation flips the BEV features along the X axis with 50% probability. The optimizer is Adam with the learning rate set to 1e-4, and the batch size during training is 8.
S3.2, constructing an improved 2D feature extraction module, namely adopting a ResNet network as the backbone to extract the features of layer 0, layer 2 and layer 3. Features of different levels are fed into an FPN (feature pyramid network) to integrate features of different scales. The FPN feature of size (8, 256, 16, 44) is fed into two residual blocks, the number of channels is then reduced to 18 (corresponding to the 18 classes) by a convolution layer, and after 16× up-sampling the result is supervised by the 2D semantic pseudo labels. The FPN features and the features after the two residual blocks are concatenated along the channel dimension and passed through a convolution layer to strengthen the 2D features.
S3.3, constructing a 2D-to-3D feature conversion module, namely feeding the enhanced 2D features obtained in step S3.2 into a convolution layer, producing a coupled feature of context features and depth estimation with size (8, 152, 16, 44). The first 88 channels and the last 64 channels of the coupled feature are separated as the context features and the depth estimate, respectively, and a Softmax function is applied to the depth estimate. BEV features are obtained using the same BEV Pooling operation as in S2.3.
S3.4, constructing an improved BEV feature processing module, with a ResNet network as the backbone. A Spatial Attention (SA) module is added after each residual block of the ResNet network to spatially enhance the feature extraction of the BEV feature map. The spatial attention module computes the average and the maximum of the input features along the channel dimension, concatenates the two results along the channel dimension, passes the concatenated tensor through a convolution layer, and then generates a spatial attention map through a Sigmoid function. The attention map is multiplied with the input features, spatially emphasizing or attenuating certain regions of the original features. The specific calculation of spatial attention is as follows:
M_spatial = σ(Conv_7×7([AvgPool(F); MaxPool(F)]))  (3.1)
F_out = F × M_spatial  (3.2)
where F is the input feature map, AvgPool(·) and MaxPool(·) are the average pooling and maximum pooling operations along the channel dimension, respectively, Conv_7×7 is a 7 × 7 convolution with 2 input channels and 1 output channel, σ is the Sigmoid function, and M_spatial is the generated spatial attention map. The output feature map F_out is the element-wise product of the input feature map F and the spatial attention map M_spatial.
S3.5, constructing an occupancy prediction head. The input feature map of size (8, 256, 160, 120) is first processed by a convolution layer to obtain a feature map of size (8, 120, 160, 256), and a channel-to-height operation then outputs prediction scores of size (8, 120, 160, 16, 18), representing the model's predictions over 18 categories for all voxels in the 120 × 160 × 16 three-dimensional voxel space.
The training of the occupancy prediction teacher model in the step S4 obtains an occupancy prediction teacher model with good performance, and evaluates the performance of the teacher model, which specifically includes the following steps:
S4.1, reading, from the training set in one pass, the front-camera pictures of three adjacent frames, the camera-to-vehicle coordinate transformation matrix, the vehicle-to-global coordinate transformation matrix, the camera intrinsics, the rotation and translation matrices used for picture augmentation, and the BEV-space augmentation matrix, which together form the input data of the teacher model during training.
S4.2, executing the training forward pass, which returns the prediction of the occupancy network and the depth values predicted by the 2D-to-3D feature conversion module, with sizes (8, 120, 160, 16) and (8, 88, 16, 44), respectively.
S4.3, reading the real voxel occupancy according to the voxel ground-truth storage path contained in the S1.5 .pkl file. Its size is (8, 200, 200, 16); (200, 200, 16) denotes a three-dimensional space centered on the own vehicle with 200, 200 and 16 voxels in the front-back, left-right and up-down directions, respectively. The actual edge length of each voxel is 0.4 m. The ground truth contains labels 0-17, representing the 18 categories into which the voxel space is classified.
S4.4, reading the point cloud data according to the point cloud storage path contained in the S1.5 .pkl file, and generating a depth ground truth from the point cloud to supervise the depth values predicted by the 2D-to-3D feature conversion module.
Further, said step S4.4 is realized by the sub-steps of:
S4.4.1, the three-dimensional coordinates (x, y, z) of the point cloud data are read according to the .pkl file.
S4.4.2 converting the read point cloud data into a camera coordinate system by using a transformation matrix from the vehicle coordinate system to the camera coordinate system.
S4.4.3, the point cloud data converted into the camera coordinate system are further projected onto the image plane using the camera intrinsic matrix: points in three-dimensional space are projected to pixel coordinates (u, v) on the two-dimensional image plane while the depth information d (i.e. the distance) is retained.
S4.4.4, generating a depth map: for each point projected onto the image plane, its corresponding depth value d is recorded at the pixel coordinates (u, v); when multiple points fall on the same pixel, the point nearest to the camera is taken as the depth value of that pixel in the depth map.
S4.5, creating a mask of size (8, 160, 120, 16) for excluding the occupancy ground truth outside the field of view of the front camera. The range of the front camera field of view in the BEV space is shown in fig. 4, where the BEV space is 120 voxels long in the front-back direction and 160 voxels wide in the left-right direction. The own vehicle is positioned on the BEV-space centerline, 20 voxels from the rear edge, and the two vertices at the bottom of the triangular field-of-view region are each 10 voxels away from the side edges.
S4.6, calculating the binary cross entropy loss between the predicted depth and the real depth:
L_depth = -Σ_i [ d_i·log(d̂_i) + (1 - d_i)·log(1 - d̂_i) ]  (4.1)
where d and d̂ are the true depth label and the predicted depth, respectively.
S4.7, calculating the cross entropy loss between the predicted occupancy and the real occupancy:
L_occ = -Σ_{i=1}^{C} o_i·log(ô_i)  (4.2)
where C is the total number of categories, 18; o_i is the one-hot real label of the i-th class obtained from the voxel occupancy ground truth in S4.3, and ô_i is the probability of the i-th class predicted by the model. Losses outside the front camera field of view are excluded by the mask when the loss is computed.
S4.8, the final loss L_total is:
L_total = λ_depth·L_depth + λ_occ·L_occ  (4.3)
where λ_depth and λ_occ are the weights of the depth loss and the occupancy loss, taken as 0.05 and 1. The final loss is returned and a back-propagation pass is performed to update the gradients.
S4.9, putting the trained teacher model into evaluation (eval) mode, performing forward inference, calculating the mIoU (mean Intersection over Union) between the inference result and the occupancy prediction ground truth, and evaluating the performance of the teacher model through the mIoU value. The mIoU is calculated as follows:
for each class c, its IoU (cross-over ratio) value is calculated. IoU is defined as the intersection of the predicted region and the real region divided by the union of the predicted region and the real region. The formula is as follows:
TP (c) is a true instance (True Positives) of category c. FP (c) is a false positive (False Positives) of category c. FN (c) is a false negative example of category c (FALSE NEGATIVES).
MIoU is the average of IoU for all classes. The formula is as follows:
where C is the number of categories.
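The computation of (4.4) and (4.5) can be sketched as follows; this is a NumPy illustration, and skipping classes that appear in neither the prediction nor the ground truth is an assumption:

```python
import numpy as np

def compute_miou(pred, gt, num_classes=18):
    """Per-class IoU (4.4) and their mean (4.5) over voxel label arrays of the same shape."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        union = tp + fp + fn
        if union == 0:          # class absent from both prediction and ground truth
            continue
        ious.append(tp / union)
    return float(np.mean(ious)) if ious else 0.0
```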
The distillation training of the student model is performed in step S5 by using the intermediate-layer features and the output of the teacher network, so as to obtain a lightweight occupancy prediction student model with improved performance, and the performance of the student model is evaluated, as shown in fig. 5, specifically including the following steps:
S5.1, reading and freezing the weights of the trained teacher model, putting the teacher model in evaluation (eval) mode and putting the student model in training (train) mode.
And S5.2, in one training iteration, reading the front camera pictures of three adjacent frames from the training set at once and feeding them into the teacher model for forward inference, obtaining the feature map T_feat of the last convolution layer in the occupancy prediction head and the output T_out of the occupancy prediction head, with sizes (8,256,160,120) and (8,120,160,288), respectively.
And S5.3, in the same training iteration, feeding the picture of the current frame into the student model to execute a training forward pass, obtaining the feature map S_feat of the last convolution layer in the occupancy prediction head and the output S_out of the occupancy prediction head, with sizes (8,256,160,120) and (8,120,160,288), respectively.
S5.4, passing S_feat through an MLP layer and calculating the L2 loss against T_feat:
L_feat = (1/N) · Σ_{n=1}^{N} Σ_{(x,y)∈M} || MLP(S_feat)_{n,x,y} - T_feat_{n,x,y} ||_2^2  (5.1)
where the MLP layer is composed of a 1×1 convolution + ReLU + 1×1 convolution, N is the batch size during training (8), and M is the set of BEV positions selected by the mask, so that the L2 loss is calculated only for valid voxels within the front camera field of view.
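A sketch of the adapter and the masked feature-distillation loss (5.1) is given below; the module name FeatAdapter and the exact masking layout are assumptions for illustration:

```python
import torch
import torch.nn as nn

class FeatAdapter(nn.Module):
    """1x1 conv + ReLU + 1x1 conv adapter applied to the student feature map (S5.4)."""
    def __init__(self, channels=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, s_feat):
        return self.mlp(s_feat)

def feat_distill_loss(s_feat, t_feat, adapter, bev_mask):
    """Masked L2 loss (5.1) between adapted student features and frozen teacher features.

    s_feat, t_feat: (N, 256, 160, 120) BEV feature maps of the student and the teacher
    bev_mask:       (160, 120) boolean mask of BEV cells inside the front camera field of view
    """
    diff = adapter(s_feat) - t_feat.detach()          # teacher weights are frozen
    sq = (diff ** 2).sum(dim=1)                       # (N, 160, 120): squared L2 norm per BEV cell
    n = s_feat.shape[0]
    return sq[:, bev_mask].sum() / n                  # normalize by the batch size N
```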
And S5.5, decoding the occupancy predictions in the height direction from T_out and S_out, respectively, to obtain the prediction scores y_T and y_S of the teacher model and the student model for all voxels and all object categories in the three-dimensional voxel space. The size is (8,120,160,16,18).
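Since 288 = 16 × 18, the decoding in S5.5 amounts to splitting the channel axis into height levels and classes. The sketch below assumes a height-major channel ordering, which is not stated in the original:

```python
import torch

# Hypothetical occupancy-head outputs of shape (8, 120, 160, 288); the 288 channels
# are assumed to be ordered as 16 height levels x 18 classes.
t_out = torch.randn(8, 120, 160, 288)
s_out = torch.randn(8, 120, 160, 288)
y_t = t_out.view(8, 120, 160, 16, 18)   # teacher scores per voxel and class
y_s = s_out.view(8, 120, 160, 16, 18)   # student scores per voxel and class
```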
S5.6, for the predictions of the teacher model and the student model in a training batch, using the mask to select the valid voxels within the field of view of the front camera, and calculating the KL divergence between the prediction scores of the teacher model and the student model over the 18 categories of those voxels as the distillation loss. The distillation loss is calculated by the following formula:
L_cw3d = (1/|M|) · Σ_{v∈M} Σ_{c=1}^{C} σ(y_T/T)_{v,c} · log( σ(y_T/T)_{v,c} / σ(y_S/T)_{v,c} )  (5.2)
where σ(·) is the softmax function that converts the prediction scores into probability distributions:
σ(y/T)_c = exp(y_c/T) / Σ_{c'=1}^{C} exp(y_{c'}/T)  (5.3)
y_T and y_S are the outputs of the teacher model and the student model, respectively, representing the prediction scores of the valid voxels over the C = 18 classes in the three-dimensional voxel space of size W·H·Z, and M is the set of valid voxels. T is the distillation temperature used to smooth the predicted distribution of the student model; a larger T corresponds to attending to a wider region of the three-dimensional space.
The distillation temperature is varied over the whole training process according to a high-to-low annealing strategy: the temperature T used in each training epoch decays from the starting temperature T_0 = 4 to the ending temperature T_end = 2 as the current epoch number e runs over the E_total = 24 training epochs.
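A PyTorch sketch of the masked KL distillation loss (5.2)-(5.3) follows; the helper names and the linear form of the temperature decay are assumptions (the original only specifies a high-to-low schedule between T_0 and T_end):

```python
import torch
import torch.nn.functional as F

def output_distill_loss(y_t, y_s, voxel_mask, temperature):
    """Masked KL-divergence distillation (5.2)/(5.3) between teacher and student scores.

    y_t, y_s:   (B, X, Y, Z, C) per-voxel class scores of teacher and student (C = 18)
    voxel_mask: (B, X, Y, Z) boolean mask of valid voxels inside the front camera field of view
    """
    t = y_t[voxel_mask]                                  # (M, C) valid teacher scores
    s = y_s[voxel_mask]                                  # (M, C) valid student scores
    p_t = F.softmax(t / temperature, dim=-1)             # (5.3) teacher probabilities
    log_p_s = F.log_softmax(s / temperature, dim=-1)     # student log-probabilities
    # KL(p_t || p_s) averaged over the valid voxels, as in (5.2)
    return F.kl_div(log_p_s, p_t, reduction='batchmean')

def anneal_temperature(epoch, t0=4.0, t_end=2.0, total_epochs=24):
    """High-to-low temperature schedule; a linear decay is assumed here for illustration."""
    return t0 - (t0 - t_end) * min(epoch, total_epochs) / total_epochs
```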
S5.7, the 18 prediction categories are grouped into three classes, namely foreground objects, background objects and air, and different weights are given to the three classes of objects when calculating the distillation loss. The specific division and weights of the foreground, background and air classes are shown in table 2:
Table 2 foreground/background/air partitioning and weighting
S5.8, the total loss in the distillation process is:
L_total = λ_seg·L_seg + λ_occ·L_occ + λ_feat·L_feat + λ_cw3d·L_cw3d  (5.5)
where L_seg is the 2D semantic loss of the student model itself, i.e. the cross-entropy loss between the 2D semantic map and the 2D semantic pseudo-labels:
L_seg = -(1/(H·W)) · Σ_{i=1}^{H} Σ_{j=1}^{W} log p̂_{i,j}(y_{i,j})  (5.6)
where p̂_{i,j}(y_{i,j}) is the probability predicted by the model for the true class y_{i,j} at pixel (i, j), and H and W are the height and width of the feature map, respectively.
L_occ is the occupancy prediction loss of the student model itself, calculated identically to equation (4.2) in S4.7.
L_feat is the L2 distillation loss of the feature map.
L_cw3d is the distillation loss of the final output.
λ_seg, λ_occ, λ_feat and λ_cw3d are the weights of the four losses, set to 0.5, 1, 5e-7 and 1, respectively. A cosine annealing learning rate strategy is used for the distillation training: the learning rate decreases from an initial value of 0.0004 to 1% of the initial value following a cosine schedule over the entire 24-epoch training period. The final loss is returned and back propagation is performed to update the gradients.
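The loss combination (5.5) and the learning-rate schedule can be sketched as follows; the optimizer type (AdamW) and the stand-in parameter source are assumptions, only the weights and the cosine schedule come from the text:

```python
import torch
import torch.nn as nn

# Weights of the four losses from S5.8.
LAMBDA_SEG, LAMBDA_OCC, LAMBDA_FEAT, LAMBDA_CW3D = 0.5, 1.0, 5e-7, 1.0

def total_student_loss(l_seg, l_occ, l_feat, l_cw3d):
    """Total distillation loss (5.5)."""
    return (LAMBDA_SEG * l_seg + LAMBDA_OCC * l_occ
            + LAMBDA_FEAT * l_feat + LAMBDA_CW3D * l_cw3d)

# Cosine-annealing learning rate over the 24 training epochs, decaying from 0.0004
# to 1% of the initial value (eta_min = 4e-6). The optimizer choice and the placeholder
# parameters below are illustrative assumptions.
params = nn.Conv2d(3, 3, 1).parameters()            # stand-in for the student model parameters
optimizer = torch.optim.AdamW(params, lr=4e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=24, eta_min=4e-6)
```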
And S5.9, putting the student model subjected to distillation training in evaluation (eval) mode, performing forward inference, calculating the mIoU (mean Intersection over Union) between the inference result and the occupancy ground truth, and evaluating the performance of the student model through the mIoU value. The mIoU value is calculated as in equation (4.5) in S4.9.
Forward inference is performed on the student model in step S6, and monocular ranging is completed according to the inference result, as shown in fig. 6, specifically including the following steps:
S6.1, inputting the picture of the vehicle-mounted front-view camera, the transformation matrix from the camera coordinate system to the vehicle coordinate system, the transformation matrix from the vehicle coordinate system to the global coordinate system and the camera intrinsic parameters into the student model, which is in evaluation (eval) mode, and performing forward inference to obtain a prediction of shape (120,160,16).
S6.2, initializing the data structures, including an array labeled_array with the same shape as the prediction for storing connected-domain identifiers, a boolean array visited for recording which voxels have been visited, and a list region_props for storing the attributes of each connected region, such as the number of voxels, the category, the voxel closest to the own vehicle and the distance.
And S6.3, for the area within the field of view of the front camera, using a breadth-first search strategy to find the connected domain of each object of interest (a code sketch covering S6.2-S6.5 is given after step S6.5). The connected domains use 6-connectivity, i.e. a voxel is considered connected to the 6 voxels directly above, below, to its left, right, front and back. For each category of interest, all voxels are traversed; when a voxel meeting the requirements is found, all connected voxels are collected by breadth-first search, i.e. a queue is used to store the voxel coordinates of the current connected domain, all voxels of the connected domain are traversed, and the Euclidean distance from each voxel to the own vehicle is calculated.
Further, the step S6.3 is implemented by the following sub-steps:
S6.3.1, traversing all voxels, where a voxel meets the requirements if it satisfies the following conditions: 1. the voxel lies within the field of view of the front camera; 2. the voxel belongs to a category of interest; 3. the voxel has not been visited.
S6.3.2, for a voxel meeting the requirements, initializing a breadth-first search queue, adding the current voxel coordinates to the queue and marking it as visited. The voxel counter (voxel count) of the current connected domain is initialized to 1, and the current voxel is recorded as the voxel closest to the own vehicle, with its distance to the own vehicle as the minimum distance.
S6.3.3, executing the breadth-first search until the queue is empty. Each time a voxel coordinate is taken out of the queue, its 6 neighboring voxels are checked against the requirements of S6.3.1. Voxels meeting the requirements are marked as visited and added to the queue; at the same time the voxel counter of the connected domain is updated, the distance from the voxel to the own vehicle is calculated, and the minimum distance is updated if this distance is smaller than the previously recorded value.
S6.3.4, after the search is finished, if the voxel counter of the current connected domain is greater than 1, recording the connected domain in labeled_array, recording the attributes of the connected domain (including the number of voxels, the category, the voxel closest to the own vehicle, the distance and other information) in region_props, and continuing the traversal with the next voxel.
And S6.4, for each connected domain, taking the minimum distance recorded in region_props as the distance from the current instance of interest to the own vehicle.
And S6.5, if the distance is smaller than a preset safety threshold value, generating alarm information.
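The flow of S6.2-S6.5 can be sketched as follows. The class identifiers, the ego-voxel index, the 5 m safety threshold and the restriction of connectivity to voxels of the same class are illustrative assumptions:

```python
import numpy as np
from collections import deque

VOXEL_SIZE = 0.4                      # metres per voxel edge (from S4.3)
NEIGHBOURS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]  # 6-connectivity

def range_objects(pred, fov_mask, classes_of_interest, ego_voxel, safety_threshold=5.0):
    """Breadth-first flood fill over the predicted voxel grid, returning per-region attributes.

    pred:      (X, Y, Z) predicted class labels, e.g. of shape (120, 160, 16)
    fov_mask:  (X, Y, Z) boolean mask of voxels inside the front camera field of view
    ego_voxel: (x, y, z) index of the own vehicle in the voxel grid (assumed known)
    """
    visited = np.zeros(pred.shape, dtype=bool)
    labeled_array = np.zeros(pred.shape, dtype=np.int32)
    region_props, next_label = [], 1

    def dist(v):                      # Euclidean distance (metres) from a voxel to the own vehicle
        return VOXEL_SIZE * np.linalg.norm(np.array(v, dtype=float) - np.array(ego_voxel, dtype=float))

    for x, y, z in zip(*np.nonzero(fov_mask)):            # S6.3.1: traverse voxels inside the FOV
        if visited[x, y, z] or pred[x, y, z] not in classes_of_interest:
            continue
        cls = pred[x, y, z]
        queue = deque([(x, y, z)])                        # S6.3.2: seed the BFS queue
        visited[x, y, z] = True
        voxels, min_dist = [(x, y, z)], dist((x, y, z))
        while queue:                                      # S6.3.3: expand over 6-connected neighbours
            cx, cy, cz = queue.popleft()
            for dx, dy, dz in NEIGHBOURS:
                n = (cx + dx, cy + dy, cz + dz)
                if not all(0 <= n[i] < pred.shape[i] for i in range(3)):
                    continue
                if visited[n] or not fov_mask[n] or pred[n] != cls:
                    continue
                visited[n] = True
                queue.append(n)
                voxels.append(n)
                min_dist = min(min_dist, dist(n))
        if len(voxels) > 1:                               # S6.3.4: keep regions larger than one voxel
            for v in voxels:
                labeled_array[v] = next_label
            region_props.append({'label': next_label, 'class': int(cls),
                                 'num_voxels': len(voxels), 'distance': min_dist,
                                 'alarm': min_dist < safety_threshold})   # S6.5: alarm flag
            next_label += 1
    return region_props
```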
Through experiments, the computation of the teacher model is 132.5 GFLOPS, its parameter count is 58.67M, and its mIoU value is 33.37. The computation of the original student model is 90.14 GFLOPS, its parameter count is 44.47M, and its mIoU value is 27.88. After training with the modified 2D feature processing section and the modified BEV feature processing section and after distillation, the computation of the student model is 92.64 GFLOPS, its parameter count is 48.33M, and its mIoU value is 29.82.
Fig. 7 shows the occupancy prediction inference results of the student model after distillation training with the invention; structures such as houses, roads, trees and vehicles can be clearly distinguished. Fig. 8 is an effect diagram of monocular ranging using the occupancy prediction inference results of the student model, ranging vehicles and pedestrians within the field of view of the front camera.
Example 2
The embodiment relates to an automobile auxiliary driving monocular distance measuring device based on a lightweight occupancy prediction network, which comprises a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors, when executing the executable code, implement the automobile auxiliary driving monocular distance measuring method based on a lightweight occupancy prediction network of embodiment 1.
Example 3
The present embodiment relates to a computer-readable storage medium having stored thereon a program which, when executed by a processor, implements the automobile auxiliary driving monocular distance measuring method based on a lightweight occupancy prediction network of embodiment 1.
The invention is expected to solve the technical bottlenecks faced by existing monocular ranging systems and to provide powerful support for the intelligent development of the field of automobile auxiliary driving. It not only helps to improve the safety and accuracy of the vehicle in complex traffic environments, but also reduces the hardware requirements and cost of the system and enlarges the application range of intelligent driving technology. In addition, through more accurate environment perception and real-time response capability, the invention can provide a better user experience and stronger safety guarantee for intelligent driving.
It will be appreciated by persons skilled in the art that the foregoing is a preferred embodiment of the invention and is not intended to limit the invention. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in the foregoing embodiments or substitute equivalents for some of their features. Any modifications, equivalent replacements and improvements made within the spirit and principles of the invention are intended to be included within the scope of protection of the invention.