CN116052047A

CN116052047A - Moving object detection method and related equipment

Info

Publication number: CN116052047A
Application number: CN202310043284.1A
Authority: CN
Inventors: 田贤浩; 王瑞星
Original assignee: Honor Device Co Ltd
Current assignee: Honor Device Co Ltd
Priority date: 2023-01-29
Filing date: 2023-01-29
Publication date: 2023-05-02
Anticipated expiration: 2043-01-29
Also published as: CN116052047B

Abstract

The present application provides a moving object detection method and related equipment, which relate to the field of image processing. The method includes: acquiring video code stream data, and extracting compressed domain syntax elements, which are used to indicate variables in the video code stream data Information; according to the compressed domain syntax elements, use the motion detection network to detect and determine the target moving object. The present application uses the network model to detect moving objects by combining the compressed domain syntax elements, so as to achieve the purpose of ensuring real-time performance and adapting to complex scenes.

Description

Moving object detection method and related equipment

技术领域Technical Field

本申请涉及图像处理领域，具体涉及一种运动物体检测方法及其相关设备。The present application relates to the field of image processing, and in particular to a moving object detection method and related equipment.

背景技术Background Art

运动物体检测是计算机视觉的一个研究热点，它可以为视频分析、视频检索等提供支持，在人机交互、医疗诊断等领域都有着越来越重要的应用前景。Moving object detection is a research hotspot in computer vision. It can provide support for video analysis, video retrieval, etc., and has increasingly important application prospects in human-computer interaction, medical diagnosis and other fields.

现有提供的检测方法大部分是在像素域通过算法对视频像素数据进行计算，估计出视频中的运动对象，然而，随着视频分辨率越来越高，需要处理的视频像素数据越来越庞大，如此操作将要耗费大量的计算资源，计算速度也相应变慢。对此，亟需一种新的运动物体检测方法。Most of the existing detection methods use algorithms in the pixel domain to calculate the video pixel data and estimate the moving objects in the video. However, as the video resolution becomes higher and higher, the video pixel data that needs to be processed becomes larger and larger. Such operations will consume a lot of computing resources and the computing speed will also become slower. Therefore, a new moving object detection method is urgently needed.

发明内容Summary of the invention

本申请提供一种运动物体检测方法及其相关设备，通过结合压缩域语法元素，利用网络模型进行运动物体检测，从而可以实现保证实时性、适应复杂场景的目的。The present application provides a moving object detection method and related equipment, which combines compressed domain syntax elements and uses a network model to perform moving object detection, thereby achieving the purpose of ensuring real-time performance and adapting to complex scenarios.

第一方面，提供了一种运动物体检测方法，该方法包括：In a first aspect, a moving object detection method is provided, the method comprising:

获取视频码流数据，并提取压缩域语法元素，所述压缩域语法元素用于指示所述视频码流数据中的变量信息；Acquire video code stream data, and extract compression domain syntax elements, where the compression domain syntax elements are used to indicate variable information in the video code stream data;

根据所述压缩域语法元素，利用运动检测网络进行检测，确定目标运动物体。According to the compression domain syntax elements, a motion detection network is used to perform detection to determine the target moving object.

本申请实施例可以直接利用视频编码过程中产生的运动信息，节省了运动信息的计算步骤；另外，又结合了运动检测网络进行检测，从而可以实现高效、快速的视频运动目标检测任务，在保证实时性的基础下，解决相关方法中在复杂场景下的鲁棒性问题。The embodiment of the present application can directly utilize the motion information generated in the video encoding process, saving the steps of calculating the motion information; in addition, it is combined with a motion detection network for detection, thereby realizing efficient and fast video motion target detection tasks, and solving the robustness problem of related methods in complex scenarios while ensuring real-time performance.

结合第一方面，在第一方面的某些实现方式中，根据所述压缩域语法元素，利用运动检测网络进行检测，确定目标运动物体，包括：In combination with the first aspect, in some implementations of the first aspect, detecting using a motion detection network according to the compressed domain syntax element to determine the target moving object includes:

根据所述压缩域语法元素，确定运动特征；Determining motion features according to the compressed domain syntax elements;

根据所述运动特征，生成二维矩阵；generating a two-dimensional matrix according to the motion characteristics;

将所述二维矩阵输入所述运动检测网络进行检测，确定所述目标运动物体。The two-dimensional matrix is input into the motion detection network for detection to determine the target moving object.

在本申请实施例中，本申请无需解码帧图像数据，可直接从压缩域中提取可靠的压缩域语法元素进行运动分析，因此处理速度容易达到实时性。又结合了运动检测网络进行检测，从而可以实现高效、快速的视频运动目标检测任务，在保证实时性的基础下，解决相关方法中在复杂场景下的鲁棒性弱、性能差的问题。In the embodiment of the present application, the present application does not need to decode the frame image data, and can directly extract reliable compressed domain syntax elements from the compressed domain for motion analysis, so the processing speed can easily reach real-time performance. It is also combined with a motion detection network for detection, so that efficient and fast video motion target detection tasks can be achieved, and the problems of weak robustness and poor performance in complex scenes in related methods can be solved while ensuring real-time performance.

结合第一方面，在第一方面的某些实现方式中，根据所述压缩域语法元素，确定运动特征，包括：In combination with the first aspect, in some implementations of the first aspect, determining the motion feature according to the compressed domain syntax element includes:

根据P帧的压缩域语法元素，确定P帧对应的运动特征，所述视频码流数据包括I帧、P帧和B帧；Determining motion features corresponding to the P frame according to compression domain syntax elements of the P frame, wherein the video code stream data includes an I frame, a P frame, and a B frame;

根据B帧的压缩域语法元素，确定B帧对应的运动特征；Determine motion features corresponding to the B frame according to the compression domain syntax elements of the B frame;

根据所述I帧前后相邻的P帧对应的运动特征和/或B帧对应的运动特征，利用插值方法，确定所述I帧对应的运动特征。The motion features corresponding to the I frame are determined by using an interpolation method according to the motion features corresponding to the P frames adjacent to and before the I frame and/or the motion features corresponding to the B frame.

在本申请实施例中，结合P帧和B帧的运动信息，基于运动物体的时空连贯性，对运动特征进行插值处理，从而可以得到I帧的运动信息，这样就可以确定出每帧对应的运动信息。In an embodiment of the present application, the motion information of the P frame and the B frame is combined, and the motion features are interpolated based on the spatiotemporal coherence of the moving object, so that the motion information of the I frame can be obtained, so that the motion information corresponding to each frame can be determined.

结合第一方面，在第一方面的某些实现方式中，所述方法还包括：In combination with the first aspect, in some implementations of the first aspect, the method further includes:

对所述I帧、所述P帧和所述B帧对应的运动特征，进行平滑处理。The motion features corresponding to the I frame, the P frame and the B frame are smoothed.

在本申请实施例中，进行平滑处理后，相邻视频帧之间的运动信息过渡更加自然，可以去除误检区域的噪声，还可以避免出现个别数据异常、差异较大的情况。In the embodiment of the present application, after smoothing, the transition of motion information between adjacent video frames is more natural, the noise in the false detection area can be removed, and the occurrence of individual data anomalies and large differences can be avoided.

结合第一方面，在第一方面的某些实现方式中，所述压缩域语法元素包括：编码比特量、运动矢量和残差系数。In combination with the first aspect, in certain implementations of the first aspect, the compression domain syntax elements include: coding bit amount, motion vector and residual coefficient.

在本申请实施例中，本申请无需解码帧图像数据，可直接从压缩域中提取可靠的编码比特量、运动矢量和残差系数这三种运动信息进行运动分析，因此处理速度容易达到实时性。In the embodiment of the present application, the present application does not need to decode the frame image data, and can directly extract three kinds of motion information, namely, reliable coding bit amount, motion vector and residual coefficient, from the compressed domain for motion analysis, so the processing speed can easily reach real-time performance.

结合第一方面，在第一方面的某些实现方式中，所述运动特征包括：运动信息量、运动矢量强度、残差系数密度；In combination with the first aspect, in some implementations of the first aspect, the motion features include: motion information amount, motion vector intensity, and residual coefficient density;

所述运动信息量与所述编码比特量对应，所述运动矢量强度与所述运动矢量对应，所述残差系数密度与所述残差系数对应。The amount of motion information corresponds to the amount of coded bits, the motion vector strength corresponds to the motion vector, and the residual coefficient density corresponds to the residual coefficient.

在本申请实施例中，由于压缩域语法元素具有三种，因此，基于三种压缩域语法元素分别设计了运动信息量、运动矢量强度、残差系数密度三种运动特征来表征视频画面的运动情况。In the embodiment of the present application, since there are three types of compression domain syntax elements, three motion features, namely, motion information amount, motion vector intensity, and residual coefficient density, are designed based on the three compression domain syntax elements to characterize the motion of the video picture.

结合第一方面，在第一方面的某些实现方式中，所述运动检测网络包括：Darknet神经网络模型和YOLOv3目标检测模型；In combination with the first aspect, in some implementations of the first aspect, the motion detection network includes: a Darknet neural network model and a YOLOv3 target detection model;

将所述二维矩阵输入所述运动检测网络进行检测，确定所述目标运动物体，包括：Inputting the two-dimensional matrix into the motion detection network for detection to determine the target moving object includes:

将所述二维矩阵输入所述Darknet神经网络模型，得到多尺度的卷积特征层；Inputting the two-dimensional matrix into the Darknet neural network model to obtain a multi-scale convolutional feature layer;

将所述多尺度的卷积特征层输入所述YOLOv3目标检测模型，确定所述目标运动物体对应的边框定位。The multi-scale convolutional feature layer is input into the YOLOv3 target detection model to determine the frame location corresponding to the target moving object.

在本申请实施例中，二维矩阵输入Darknet神经网络模型，可以融合不同的运动特征，学习有效且全面的运动语义表征；基于YOLO-v3目标检测模型对神经网络模型输出的多尺度特征层进行运动检测，可以预测视频画面中的运动目标。In an embodiment of the present application, a two-dimensional matrix is input into a Darknet neural network model to fuse different motion features and learn an effective and comprehensive motion semantic representation; motion detection is performed on the multi-scale feature layer output by the neural network model based on the YOLO-v3 target detection model to predict moving targets in the video screen.

第二方面，提供了一种电子设备，所述电子设备包括：一个或多个处理器、存储器和显示屏；所述存储器与所述一个或多个处理器耦合，所述存储器用于存储计算机程序代码，所述计算机程序代码包括计算机指令，所述一个或多个处理器调用所述计算机指令以使得所述电子设备执行：In a second aspect, an electronic device is provided, the electronic device comprising: one or more processors, a memory, and a display screen; the memory is coupled to the one or more processors, the memory is used to store computer program code, the computer program code comprises computer instructions, and the one or more processors call the computer instructions to enable the electronic device to execute:

结合第二方面，在第二方面的某些实现方式中，根据所述压缩域语法元素，利用运动检测网络进行检测，确定目标运动物体，包括：In conjunction with the second aspect, in some implementations of the second aspect, detecting using a motion detection network according to the compressed domain syntax element to determine the target moving object includes:

结合第二方面，在第二方面的某些实现方式中，根据所述压缩域语法元素，确定运动特征，包括：In conjunction with the second aspect, in some implementations of the second aspect, determining the motion feature according to the compressed domain syntax element includes:

结合第二方面，在第二方面的某些实现方式中，所述方法还包括：In conjunction with the second aspect, in some implementations of the second aspect, the method further includes:

结合第二方面，在第二方面的某些实现方式中，所述压缩域语法元素包括：编码比特量、运动矢量和残差系数。In combination with the second aspect, in certain implementations of the second aspect, the compression domain syntax elements include: coding bit amount, motion vector and residual coefficient.

结合第二方面，在第二方面的某些实现方式中，所述运动特征包括：运动信息量、运动矢量强度、残差系数密度；In conjunction with the second aspect, in some implementations of the second aspect, the motion features include: motion information amount, motion vector intensity, and residual coefficient density;

结合第二方面，在第二方面的某些实现方式中，所述运动检测网络包括：Darknet神经网络模型和YOLOv3目标检测模型；In conjunction with the second aspect, in some implementations of the second aspect, the motion detection network includes: a Darknet neural network model and a YOLOv3 target detection model;

应理解，在上述第一方面中对相关内容的扩展、限定、解释和说明也适用于第二方面中相同的内容。It should be understood that the expansion, limitation, explanation and description of the relevant contents in the above-mentioned first aspect also apply to the same contents in the second aspect.

第三方面，提供了一种运动物体检测装置，包括用于执行第一方面中任一种运动物体检测方法的单元。In a third aspect, a moving object detection device is provided, comprising a unit for executing any one of the moving object detection methods in the first aspect.

在一种可能的实现方式中，当该运动物体检测装置是电子设备时，该处理单元可以是处理器，该输入单元可以是通信接口；该电子设备还可以包括存储器，该存储器用于存储计算机程序代码，当该处理器执行该存储器所存储的计算机程序代码时，使得该电子设备执行第一方面中的任一种方法。In one possible implementation, when the motion object detection device is an electronic device, the processing unit may be a processor, and the input unit may be a communication interface; the electronic device may also include a memory, which is used to store computer program code, and when the processor executes the computer program code stored in the memory, the electronic device executes any one of the methods in the first aspect.

第四方面，提供了一种芯片系统，所述芯片系统应用于电子设备，所述芯片系统包括一个或多个处理器，所述处理器用于调用计算机指令以使得所述电子设备执行第一方面中的任一种运动物体检测方法。In a fourth aspect, a chip system is provided, which is applied to an electronic device, and the chip system includes one or more processors, and the processor is used to call computer instructions so that the electronic device executes any one of the moving object detection methods in the first aspect.

第五方面，提供了一种计算机可读存储介质，所述计算机可读存储介质存储有计算机程序代码，当所述计算机程序代码被电子设备运行时，使得该电子设备执行第一方面中的任一种运动物体检测方法。In a fifth aspect, a computer-readable storage medium is provided, wherein the computer-readable storage medium stores a computer program code. When the computer program code is executed by an electronic device, the electronic device executes any one of the moving object detection methods in the first aspect.

第六方面，提供了一种计算机程序产品，所述计算机程序产品包括：计算机程序代码，当所述计算机程序代码被电子设备运行时，使得该电子设备执行第一方面中的任一种运动物体检测方法。In a sixth aspect, a computer program product is provided, the computer program product comprising: a computer program code, when the computer program code is executed by an electronic device, the electronic device executes any one of the moving object detection methods in the first aspect.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是本申请实施例提供的一种应用场景的示意图；FIG1 is a schematic diagram of an application scenario provided by an embodiment of the present application;

图2是本申请实施例提供的一种视频编解码的流程示意图；FIG2 is a schematic diagram of a video encoding and decoding process provided by an embodiment of the present application;

图3是编解码过程中涉及的三种类型的视频帧；Figure 3 shows three types of video frames involved in the encoding and decoding process;

图4是编解码过程中涉及的序列的示意图；FIG4 is a schematic diagram of a sequence involved in the encoding and decoding process;

图5是本申请实施例提供的一种运动物体检测方法的流程示意图；FIG5 is a schematic diagram of a flow chart of a moving object detection method provided in an embodiment of the present application;

图6是本申请实施例提供的运动矢量的示意图；FIG6 is a schematic diagram of a motion vector provided in an embodiment of the present application;

图7是本申请实施例提供的一种视频帧划分成像素块的示意图；FIG7 is a schematic diagram of dividing a video frame into pixel blocks provided by an embodiment of the present application;

图8是本申请实施例提供的二维矩阵的可视化图像；FIG8 is a visualization image of a two-dimensional matrix provided in an embodiment of the present application;

图9是一种适用于本申请的电子设备的硬件系统的示意图；FIG9 is a schematic diagram of a hardware system of an electronic device applicable to the present application;

图10是一种适用于本申请的电子设备的软件系统的示意图；FIG10 is a schematic diagram of a software system of an electronic device applicable to the present application;

图11是本申请提供的一种运动物体检测装置的结构示意图；FIG11 is a schematic structural diagram of a moving object detection device provided by the present application;

图12是本申请提供的一种电子设备的结构示意图。FIG12 is a schematic diagram of the structure of an electronic device provided by the present application.

具体实施方式DETAILED DESCRIPTION

下面将结合附图，对本申请实施例中的技术方案进行描述。The technical solutions in the embodiments of the present application will be described below in conjunction with the accompanying drawings.

首先，对本申请实施例中的部分用语进行解释说明，以便于本领域技术人员理解。First, some terms in the embodiments of the present application are explained to facilitate understanding by those skilled in the art.

1、编码、解码，编码就是将信息从一种形式或格式转换为另一种形式的过程，解码是编码的反向。1. Encoding and decoding. Encoding is the process of converting information from one form or format to another, and decoding is the reverse of encoding.

2、H264、H265，均为视频编解码器标准，H265是H264的升级版，H265标准保留H264原来的某些技术，同时对一些相关的技术加以改进。2. H264 and H265 are both video codec standards. H265 is an upgraded version of H264. The H265 standard retains some of the original technologies of H264 and improves some related technologies.

3、分辨率，即一个平面内像素的数量。通常表示成宽（W）×高（H）。比如，常见的1080P的图像，即表示一个平面内有1920×1080个像素。3. Resolution, which is the number of pixels in a plane. Usually expressed as width (W) × height (H). For example, a common 1080P image means that there are 1920 × 1080 pixels in a plane.

4、码流（data rate），是指视频文件在单位时间内使用的数据流量，也叫码率或码流率，单位是Kb/s或者Mb/s。码流越大，单位时间传送的数据就越多，所包括的信息量也越多。4. Data rate refers to the data flow rate used by a video file per unit time, also called bit rate or bit rate, and the unit is Kb/s or Mb/s. The larger the bit rate, the more data is transmitted per unit time, and the more information is included.

5、宏块（macroblock），宏块是视频编码技术中的一个基本概念。通过将画面分成一个个大小不同的块来在不同位置实行不同的压缩策略。5. Macroblock: Macroblock is a basic concept in video coding technology. It divides the picture into blocks of different sizes to implement different compression strategies at different locations.

在视频编码中，一个编码图像通常划分成若干宏块组成，一个宏块由一个亮度像素块和附加的两个色度像素块组成。例如，亮度块为16x16大小的像素块，而两个色度图像像素块的大小依据其图像的采样格式而定，如：对于YUV420采样图像，色度块为8x8大小的像素块。每个图像中，若干宏块被排列成片的形式，视频编码算法以宏块为单位，逐个宏块进行编码，组织成连续的视频码流。In video coding, a coded image is usually divided into several macroblocks. A macroblock consists of a luminance pixel block and two additional chrominance pixel blocks. For example, the luminance block is a 16x16 pixel block, and the size of the two chrominance image pixel blocks depends on the sampling format of the image. For example, for a YUV420 sampled image, the chrominance block is an 8x8 pixel block. In each image, several macroblocks are arranged in the form of slices. The video coding algorithm encodes each macroblock one by one in units of macroblocks and organizes them into a continuous video code stream.

6、光流（optical flow）算法，光流是指空间运动物体在观察成像平面上的像素运动的瞬时速度。光流算法是利用图像序列中像素在时间域上的变化以及相邻帧之间的相关性，来找到上一帧与当前帧之间存在的对应关系，从而计算出相邻帧之间物体的运动信息的一种方法。6. Optical flow algorithm. Optical flow refers to the instantaneous speed of the pixel movement of a moving object in space on the observation imaging plane. The optical flow algorithm uses the change of pixels in the time domain in the image sequence and the correlation between adjacent frames to find the corresponding relationship between the previous frame and the current frame, thereby calculating the motion information of the object between adjacent frames.

以上是对本申请实施例所涉及的名词的简单介绍，以下不再赘述。The above is a brief introduction to the terms involved in the embodiments of the present application, which will not be repeated below.

为了对场景进行说明，图1示出了三种应用场景示意图，下面对三种应用场景分别进行介绍说明。To illustrate the scenarios, FIG1 shows schematic diagrams of three application scenarios, and the three application scenarios are introduced and explained respectively below.

场景一：安防领域Scenario 1: Security field

在家庭、学校、工厂等地方，为了保护人身安全以及财产安全，通常会安装监控设备，并利用监控设备采集和显示监控画面。监控画面指示监控设备针对拍摄场景进行拍摄后，在显示屏上显示的画面。In order to protect personal safety and property safety, monitoring equipment is usually installed in homes, schools, factories, etc., and monitoring images are collected and displayed using the monitoring equipment. The monitoring image refers to the image displayed on the display screen after the monitoring equipment captures the shooting scene.

例如，以在工厂周围安装监控设备为例，如图1中的（a）所示，为显示的监控画面。当监控设备运行运动物体检测方法后，可以对监控画面中的运动情况进行分析，并将识别出的运动物体（比如为运动的人像）进行标识和突出显示，提醒观看监控画面的安防人员有可疑人员出现，注意防范小偷入室盗窃等。For example, taking the installation of monitoring equipment around a factory as an example, as shown in (a) of Figure 1, the monitoring screen is displayed. After the monitoring equipment runs the moving object detection method, the motion in the monitoring screen can be analyzed, and the identified moving objects (such as moving human figures) can be marked and highlighted, reminding the security personnel watching the monitoring screen that there are suspicious persons and to guard against thieves breaking into the house and theft.

场景二：公共交通领域Scenario 2: Public transportation

在一些道路或十字路口上，为了对来往车辆流量进行监控和调整，通常会安装路面监控设备，并利用路面监控设备采集和显示监控画面。On some roads or intersections, in order to monitor and adjust the flow of vehicles, road surface monitoring equipment is usually installed, and the road surface monitoring equipment is used to collect and display monitoring images.

例如，以在某一条道路上安装路面监控设备为例，如图1中的（b）所示，为显示的监控画面。当路面监控设备运行运动物体检测方法后，可以对监控画面中出现的车辆的运动情况进行分析，并将识别出的运动物体（比如为车辆）进行标识和显示。当检测到某一方向上行驶的车辆特别多，比如大于预设车辆数量阈值时，为了避免车辆拥堵造成交通瘫痪，可以提示相关人员针对红绿灯进行控制和调整，加快车辆通行速度。For example, taking the installation of road monitoring equipment on a certain road as an example, as shown in (b) of Figure 1, it is a displayed monitoring screen. After the road monitoring equipment runs the moving object detection method, it can analyze the movement of the vehicles appearing in the monitoring screen, and identify and display the identified moving objects (such as vehicles). When it is detected that there are a lot of vehicles traveling in a certain direction, such as more than the preset vehicle number threshold, in order to avoid traffic congestion and paralysis, relevant personnel can be prompted to control and adjust the traffic lights to speed up the vehicle passage.

场景三：拍摄领域Scene 3: Shooting area

以电子设备为手机进行举例，如图1中的（c）所示，当用户利用手机进行拍摄时，为了提高拍摄效果，手机可以运行运动物体检测方法，对拍摄画面中出现的运动情况进行分析，并将识别出的运动物体（比如运动的小狗）进行标识和显示；还可以借助运动物体检测方法更准确地定位出运动区域，该运动区域可以用于指示运动物体所在区域；然后，根据运动区域对应的拍摄场景中的物理区域，手机还可以借助运动物体检测方法进行智能化聚焦，比如在物理区域距离镜头较近时减小焦距，在物理区域距离镜头较远时增大焦距。Taking a mobile phone as an example of an electronic device, as shown in (c) in FIG1 , when a user uses a mobile phone to take a photo, in order to improve the shooting effect, the mobile phone can run a moving object detection method to analyze the movement in the shot image, and identify and display the identified moving object (such as a moving puppy); the moving object detection method can also be used to more accurately locate the moving area, which can be used to indicate the area where the moving object is located; then, according to the physical area in the shooting scene corresponding to the moving area, the mobile phone can also use the moving object detection method to perform intelligent focusing, such as reducing the focal length when the physical area is close to the lens, and increasing the focal length when the physical area is far from the lens.

在上述三个应用示例中，监控设备、路面监控设备或手机所运行的运动物体检测方法可以为相关技术提供的第一种检测方法或第二种检测方法，下面分别对该两种检测方法进行介绍说明。In the above three application examples, the moving object detection method run by the monitoring equipment, road monitoring equipment or mobile phone can be the first detection method or the second detection method provided by the relevant technology. The two detection methods are introduced and explained respectively below.

第一种检测方法通常是在像素域上进行的方法，先通过算法来对多帧视频像素数据进行计算，然后，基于计算结果估计出视频中的运动对象。例如，先利用光流算法，针对多帧视频像素数据中同一位置处的像素值进行计算，确定同一位置处的像素值的变化情况；再根据多帧视频像素数据每个同一位置处的像素变化情况来捕捉视频的变化。同一位置处的像素指的是在不同视频像素数据中位于同一行同一列位置处的像素。The first detection method is usually performed in the pixel domain. First, an algorithm is used to calculate the pixel data of multiple frames of video, and then the moving objects in the video are estimated based on the calculation results. For example, the optical flow algorithm is first used to calculate the pixel values at the same position in the pixel data of multiple frames of video to determine the change of the pixel values at the same position; then the changes in the video are captured based on the pixel changes at each same position in the pixel data of multiple frames of video. Pixels at the same position refer to pixels located in the same row and column in different video pixel data.

然而，这种基于像素域进行检测的检测方法，在视频帧分辨率越大时，需要处理的数据越多。当分辨率大到一定程度时，需要处理的视频像素数据非常庞大，计算时则可能出现耗时非常长的情况，这样，针对监控设备这些实时性要求比较高的设备来说，可能会产生非常严重的延迟，进而导致无法及时触发入室盗窃的警报或无法及时进行红绿灯切换，产生严重的后果。However, this detection method based on pixel domain detection requires more data to be processed when the video frame resolution is larger. When the resolution is large enough, the video pixel data that needs to be processed is very large, and the calculation may take a very long time. In this way, for monitoring equipment with high real-time requirements, it may cause very serious delays, which may lead to the failure to trigger the burglary alarm in time or the failure to switch the traffic light in time, resulting in serious consequences.

第二种检测方法是从压缩域的角度进行的方法，先直接从压缩域码流中取出运动信息，然后，基于运动信息结合传统时空分析算法，比如马尔科夫随机场、条件随机场等算法确定出运动区域和/或运动物体。虽然第二种检测方法由于可以直接获取运动信息，相对于第一种检测节约了计算步骤，实现了较快地检测速度，但是第二种方法只能考虑简单的运动情况、对象稀疏的运动场景，无法适用于复杂场景。The second detection method is a method from the perspective of the compressed domain. First, the motion information is directly extracted from the compressed domain code stream. Then, based on the motion information, the motion area and/or moving object are determined in combination with traditional spatiotemporal analysis algorithms, such as Markov random fields, conditional random fields, etc. Although the second detection method can directly obtain motion information, it saves calculation steps compared to the first detection method and achieves a faster detection speed, but the second method can only consider simple motion situations and motion scenes with sparse objects, and cannot be applied to complex scenes.

示例性地，当视频拍摄的场景中出现多个运动物体之间交错重叠的情况，比如在密集的人流中，多个人像交错重叠站立在一起时，此时，利用第二种检测方法针对该视频进行运动物体检测时，常常会把重叠的目标合并成一个目标物体，导致出现漏检。For example, when there are multiple overlapping moving objects in the scene captured by the video, such as multiple people standing together in a dense crowd, when the second detection method is used to detect moving objects in the video, the overlapping targets are often merged into one target object, resulting in missed detections.

示例性地，当视频拍摄的场景中出现明显的动态背景噪声时候，比如突然刮起的风吹动树枝，使得树枝不停地摇晃时，此时，利用第二种检测方法针对该视频进行运动物体检测时，常常会将噪声错误地认定为运动物体，进而导致误检。For example, when there is obvious dynamic background noise in the scene of the video, such as when a sudden wind blows the branches, causing them to shake continuously, at this time, when the second detection method is used to detect moving objects in the video, the noise is often mistakenly identified as a moving object, resulting in false detection.

除了上述所示的漏检、误检的情况之外，第二种检测方法的参数往往也是固定的，无法适应性调整，从而第二种检测方法只能适用于特定的场景，无法对各种不同场景做出自适应反应。总之，第二种检测方法虽然可以解决实时性的问题，但是由于算法鲁棒性较弱，容易出现漏检、误检等问题，导致影响用户体验。In addition to the missed detection and false detection cases shown above, the parameters of the second detection method are often fixed and cannot be adjusted adaptively, so the second detection method can only be applied to specific scenarios and cannot make adaptive responses to various scenarios. In short, although the second detection method can solve the real-time problem, due to the weak robustness of the algorithm, it is prone to missed detection, false detection and other problems, which affects the user experience.

有鉴于此，本申请实施例提供一种运动物体检测方法，该方法可以直接利用视频编码过程中产生的运动信息进行运动分析；结合该运动信息，基于运动物体的时空连贯性，对运动特征进行插值处理和滤波平滑；再重建成二维矩阵形式后输入Darknet神经网络模型，以融合不同的运动特征，学习有效且全面的运动语义表征；最后基于YOLO-v3目标检测模型对神经网络模型输出的多尺度特征层进行运动检测，以预测视频画面中的运动目标。In view of this, an embodiment of the present application provides a moving object detection method, which can directly use the motion information generated in the video encoding process to perform motion analysis; combined with the motion information, based on the spatiotemporal coherence of the moving object, the motion features are interpolated and filtered; then reconstructed into a two-dimensional matrix form and input into a Darknet neural network model to fuse different motion features and learn effective and comprehensive motion semantic representation; finally, based on the YOLO-v3 target detection model, motion detection is performed on the multi-scale feature layer output by the neural network model to predict the moving target in the video screen.

由于本申请提供的方法可以直接利用视频编码过程中产生的运动信息，节省了运动信息的计算步骤；另外，又结合了Darknet神经网络模型、YOLO-v3目标检测模型，从而可以实现高效、快速的视频运动目标检测任务，在保证实时性的基础下，解决相关方法中在复杂场景下的鲁棒性问题。Since the method provided in this application can directly utilize the motion information generated in the video encoding process, the calculation steps of the motion information are saved; in addition, it is combined with the Darknet neural network model and the YOLO-v3 target detection model, thereby realizing efficient and fast video motion target detection tasks, and solving the robustness problems of related methods in complex scenarios while ensuring real-time performance.

本申请实施例提供的运动物体检测方法可以用于上述图1所示的三种场景中，还可以应用但不限于以下场景中：The moving object detection method provided in the embodiment of the present application can be used in the three scenarios shown in FIG. 1 above, and can also be applied to but not limited to the following scenarios:

可选地，智能人机交互场景，比如骨骼跟踪、手势识别、空间测绘、三维重建、地图重建、自主导航、增强现实（augmented reality，AR）场景等。Optionally, intelligent human-computer interaction scenarios, such as skeleton tracking, gesture recognition, spatial mapping, 3D reconstruction, map reconstruction, autonomous navigation, augmented reality (AR) scenarios, etc.

可选地，视频播放场景，比如视频通话、视频会议应用、长短视频应用、视频直播类应用、视频网课应用以及智能猫眼等场景等。Optionally, the video playback scenarios include video calls, video conferencing applications, long and short video applications, video live broadcast applications, video online course applications, and smart cat's eye scenarios.

还可以应用于远程医疗诊断等，应理解，上述为对应用场景的举例说明，并不对本申请的应用场景作任何限定。It can also be applied to remote medical diagnosis, etc. It should be understood that the above is an example of the application scenario and does not impose any limitation on the application scenario of this application.

由于本申请涉及视频编解码过程，下面先结合图2和图3对视频编解码过程进行简单介绍，然后，再结合图4至图5对本申请实施例提供的运动物体检测方法进行介绍。Since the present application involves a video encoding and decoding process, the video encoding and decoding process is briefly introduced below in conjunction with Figures 2 and 3, and then the moving object detection method provided in an embodiment of the present application is introduced in conjunction with Figures 4 to 5.

参考图2，图2为本申请实施例提供的一种视频编解码的流程示意图。Refer to Figure 2, which is a schematic diagram of a video encoding and decoding process provided in an embodiment of the present application.

如图2所示，伴随着用户对高清视频的需求量的增加，视频数据量也在不断加大。如果未经压缩，这些视频很难应用于实际的传输和存储。比如，图像的每个像素的三个颜色分量RGB各需要一个字节表示，那么每一个像素至少需要3字节，分辨率1280×720的图像就需要2.76M字节。为了减小视频数据的数据量，相关技术一般会将视频数据使用视频编码技术（如H264、H265）进行压缩后，再进行传输和存储。As shown in Figure 2, with the increasing demand for high-definition videos, the amount of video data is also increasing. If not compressed, these videos are difficult to apply to actual transmission and storage. For example, the three color components RGB of each pixel of an image require one byte each, so each pixel requires at least 3 bytes, and an image with a resolution of 1280×720 requires 2.76M bytes. In order to reduce the amount of video data, related technologies generally compress video data using video encoding technology (such as H264, H265) before transmission and storage.

后续，当用户触发电子设备播放视频时，电子设备首先获取到视频文件，将视频文件根据解协议，解析为标准的相应地封装格式数据。这些封装格式数据可以包括视频码流和音频码流。封装格式例如可以为AVI、MKV或者MP4等格式。Subsequently, when the user triggers the electronic device to play a video, the electronic device first obtains the video file and parses the video file into standard corresponding encapsulation format data according to the decoding protocol. These encapsulation format data may include video code streams and audio code streams. The encapsulation format may be, for example, AVI, MKV, or MP4.

在获取到封装格式数据后，电子设备对封装格式数据进行解封装格式处理，将视频码流和音频码流进行分离，得到音频码流数据和视频码流数据（也称为视频压缩数据）。进一步地，针对视频码流数据，电子设备可以利用视频解码技术，对视频码流数据进行解码得到视频像素数据；然后，电子设备再将视频像素数据通过显示屏进行播放。After obtaining the packaged format data, the electronic device performs decapsulation processing on the packaged format data, separates the video code stream and the audio code stream, and obtains the audio code stream data and the video code stream data (also called video compression data). Furthermore, for the video code stream data, the electronic device can use video decoding technology to decode the video code stream data to obtain video pixel data; then, the electronic device plays the video pixel data through the display screen.

在视频传输的过程中，相关技术提供的第一种检测方法相当于针对解码得到的视频像素数据进行运动物体检测处理，第二种检测方法相当于针对解封装格式处理后的视频码流数据进行运动物体检测处理，本申请提供的检测方法也相当于针对解封装格式处理后的视频码流数据进行运动物体检测处理。During the video transmission process, the first detection method provided by the related technology is equivalent to performing moving object detection processing on the decoded video pixel data, and the second detection method is equivalent to performing moving object detection processing on the video bitstream data after decapsulation format processing. The detection method provided by the present application is also equivalent to performing moving object detection processing on the video bitstream data after decapsulation format processing.

由于本申请提供的检测方法处于解封装阶段，需要基于编码产生的视频文件进行处理，所以，在此对编解码的逻辑也进行一些简单说明。Since the detection method provided in this application is in the decapsulation stage and needs to be processed based on the video file generated by encoding, some simple explanations of the encoding and decoding logic are also given here.

参考图3，图3为编解码过程中涉及的三种类型的视频帧。Refer to FIG. 3 , which shows three types of video frames involved in the encoding and decoding process.

在视频画面中，通常会发现一帧视频帧与一帧视频帧之间会有很多重复的内容，比如背景。类似背景这类内容可能会出几帧画面甚至几秒的画面中都没有变化。因此，为了解决这些冗余数据问题，可以基于之前的画面去渲染下一帧的画面，从而减少带宽。例如，如图3中的（a）所示，背景和小球在3帧画面中都没有变化，仅有大树和矩形发生了变换；由此，可以将背景和小球、与大树和矩形分离；基于之前的画面中的背景和小球，去渲染下一帧的画面中的背景和小球，由此来实现减少带宽的目的。In video images, it is often found that there are a lot of repeated contents between video frames, such as background. Content such as background may not change in several frames or even several seconds. Therefore, in order to solve these redundant data problems, the next frame can be rendered based on the previous frame to reduce bandwidth. For example, as shown in (a) in Figure 3, the background and the ball have not changed in the three frames, only the big tree and the rectangle have changed; thus, the background and the ball can be separated from the big tree and the rectangle; based on the background and the ball in the previous frame, the background and the ball in the next frame are rendered, thereby achieving the purpose of reducing bandwidth.

根据这个逻辑，在进行编码时，如图3中的（b）所示，可以将视频帧分为三种类型。第一种类型：I帧，也称为关键帧，是一个完整数据的帧，不需要依赖其他数据进行渲染，类似静态图像，如图3中的（b）所示的左侧图像。第二种类型：P帧，也称为预测帧，基于I帧或前一个P帧进行渲染的帧；解码时，需要用之前缓存的I帧画面叠加上本帧定义的I帧到P帧之间的差值，从而可以构建出当前P帧的画面，P帧如图3中的（b）所示的右侧图像。第三种类型：B帧，也成为双向预测帧，基于I帧和P帧进行渲染的帧；在解码时，通过将缓存的I帧画面，解码的P帧画面与本帧数据叠加取得最终的画面；B帧压缩率更好，如图3中的（b）所示的中间图像。应理解，I帧去掉的是视频帧在空间维度上的冗余信息，P帧和B帧去掉的是视频帧在时间纬度上的冗余信息。还应理解，在解码时，需要以I帧、P帧、B帧的顺序进行解码。According to this logic, when encoding, as shown in (b) in Figure 3, video frames can be divided into three types. The first type: I frame, also known as key frame, is a frame of complete data that does not rely on other data for rendering, similar to a static image, as shown in the left image in (b) in Figure 3. The second type: P frame, also known as predicted frame, is a frame rendered based on I frame or the previous P frame; when decoding, it is necessary to superimpose the difference between the I frame and the P frame defined by the current frame with the previously cached I frame picture, so that the picture of the current P frame can be constructed. The P frame is shown in the right image in (b) in Figure 3. The third type: B frame, also known as bidirectional predicted frame, is a frame rendered based on I frame and P frame; when decoding, the final picture is obtained by superimposing the cached I frame picture, the decoded P frame picture and the current frame data; B frame has a better compression rate, as shown in the middle image in (b) in Figure 3. It should be understood that I frame removes redundant information of video frame in spatial dimension, and P frame and B frame remove redundant information of video frame in time dimension. It should also be understood that when decoding, it is necessary to decode in the order of I frame, P frame, and B frame.

基于上述介绍的三种类型的视频帧，如图4所示，可以将从I帧开始，到下一个I帧结束的多帧图像称为一个序列，一个序列为一段连续的视频码流数据。其中，每帧视频帧由多个宏块构成。需要说明的是，宏块除了包括像素数据之外还可以包括其他信息，比如该宏块在对应视频帧中的位置信息、对应视频帧的类型（I帧、P帧或B帧类型）等。Based on the three types of video frames introduced above, as shown in Figure 4, multiple frames starting from an I frame and ending at the next I frame can be called a sequence, and a sequence is a continuous video code stream data. Among them, each video frame is composed of multiple macroblocks. It should be noted that in addition to pixel data, a macroblock can also include other information, such as the position information of the macroblock in the corresponding video frame, the type of the corresponding video frame (I frame, P frame or B frame type), etc.

参考图5，图5为本申请实施例提供的一种运动物体检测方法的流程示意图，如图5所示，该方法10可以包括以下S11至S16，下面针对S11至S16进行详细的介绍。Refer to Figure 5, which is a flow chart of a moving object detection method provided in an embodiment of the present application. As shown in Figure 5, the method 10 may include the following S11 to S16, and S11 to S16 are introduced in detail below.

S11、获取视频码流数据，并提取压缩域语法元素。S11. Obtain video code stream data and extract compression domain syntax elements.

上述S11也可以描述为：获取视频码流数据，根据视频的压缩格式对视频码流数据进行解析，提取P帧对应的压缩域语法元素和B帧对应的压缩域语法元素。The above S11 can also be described as: acquiring video code stream data, parsing the video code stream data according to the video compression format, and extracting compression domain syntax elements corresponding to P frames and compression domain syntax elements corresponding to B frames.

视频码流数据可以为电子设备存储的数据或从其他电子设备、云端服务器接收到的数据，本申请对此不进行任何限定。视频码流数据为编码后、解码前传输的数据，或者说，视频码流数据即为待解码的有损失的视频压缩数据。例如，视频码流数据可以为图1所示的三种应用场景中监控设备、路面监控设备或手机拍摄到的监控视频。The video code stream data may be data stored in an electronic device or data received from other electronic devices or cloud servers, and this application does not impose any restrictions on this. The video code stream data is data transmitted after encoding and before decoding, or in other words, the video code stream data is lossy video compression data to be decoded. For example, the video code stream data may be surveillance video captured by a monitoring device, a road monitoring device, or a mobile phone in the three application scenarios shown in FIG1.

压缩域语法元素为视频码流数据中变量信息的总称。在H264、H265视频压缩编码过程中会产生一些描述视频运动的语法元素，也就是说，编码后进行传输的视频码流数据中本身包括有压缩域语法元素，由此，可以直接从视频码流数据中进行提取，并用于进行运动分析，无需进行其他计算和处理。需要说明的是，由于I帧为关键帧，是一个完整数据的帧，不包含其他帧的信息，因此，I帧没有对应的压缩域语法元素；而P帧、B帧均需要基于其他帧进行渲染，携带有各自对应的前后帧的信息，因此，P帧、B帧均有对应的压缩域语法元素。The compression domain syntax elements are a general term for variable information in video bitstream data. In the process of H264 and H265 video compression encoding, some syntax elements that describe video motion are generated. That is to say, the video bitstream data that is transmitted after encoding itself includes compression domain syntax elements. Therefore, they can be directly extracted from the video bitstream data and used for motion analysis without other calculations and processing. It should be noted that since the I frame is a key frame, it is a frame of complete data and does not contain information about other frames. Therefore, the I frame has no corresponding compression domain syntax elements; while P frames and B frames need to be rendered based on other frames and carry information about their corresponding previous and next frames. Therefore, P frames and B frames have corresponding compression domain syntax elements.

压缩域语法元素可以包括：编码比特量、运动矢量和残差系数，这三种压缩域语法元素是视频编码过程中产生的，具有一定的运动表征能力。下面分别对其原理进行说明。The compression domain syntax elements may include: coding bit amount, motion vector and residual coefficient. These three compression domain syntax elements are generated during the video encoding process and have certain motion representation capabilities. The principles are described below.

1）编码比特量，从视频编码的角度来说，编码过程中所消耗比特量越大的区域，其可预测的冗余信息量越少。对于图1所示场景中拍摄到的监控视频来说，目标的出现及其复杂的运动都是难以预料的，视频帧中的运动区域会消耗更多的比特量来编码生成不可预测的运动信息。因此，编码所需比特量在一定程度上可以等效于针对目标编码所需的信息量。有信息量的运动区域大概率存在运动物体。1) Encoding bit quantity. From the perspective of video encoding, the larger the bit quantity consumed in the encoding process, the less the amount of predictable redundant information. For the surveillance video captured in the scene shown in Figure 1, the appearance of the target and its complex movement are unpredictable. The moving area in the video frame will consume more bits to encode and generate unpredictable motion information. Therefore, the amount of bits required for encoding can be equivalent to the amount of information required for encoding the target to a certain extent. There is a high probability that there are moving objects in the moving area with information.

2）运动矢量，是视频编码过程中用于运动估计的一种语法元素，表示视频像素点相对于参考像素点的位移，该位移数据包括水平分量和竖直分量。运动矢量可以理解为一种类似光流的运动表征。参考图6，图6为本申请实施例提供的一种运动矢量的示意图，如图6中的（b）中所示的箭头可以用于表示图6中的（a）所示的汽车对应的运动矢量。2) Motion vector is a syntactic element used for motion estimation in the video encoding process, which indicates the displacement of a video pixel relative to a reference pixel. The displacement data includes a horizontal component and a vertical component. A motion vector can be understood as a motion representation similar to optical flow. Referring to FIG. 6 , FIG. 6 is a schematic diagram of a motion vector provided in an embodiment of the present application. For example, the arrow shown in (b) of FIG. 6 can be used to represent the motion vector corresponding to the car shown in (a) of FIG. 6 .

3）残差系数，由于物体的运动使得运动区域内的像素前后帧之间存在明显的差异，因而处于运动区域内的宏块相比背景区域会携带更多的残差信息。经过视频压缩过程中的离散余弦变换（DiscreteCosine Transform，DCT）后，这些区域的DCT块会存在较多的非零残差系数，因此，残差系数可以用于反映视频画面中的运动情况。3) Residual coefficients: Due to the movement of objects, there are obvious differences between the previous and next frames of pixels in the moving area, so the macroblocks in the moving area will carry more residual information than the background area. After the discrete cosine transform (DCT) in the video compression process, the DCT blocks in these areas will have more non-zero residual coefficients. Therefore, the residual coefficients can be used to reflect the motion in the video picture.

DCT变换属于傅里叶变换的一种，用于对图像或视频进行有损数据压缩。DCT将图像分成由不同频率组成的小块，该小块即可称为DCT块，然后，将DCT块进行量化。在量化过程中，舍弃高频分量，将剩下的低频分量保存下来用于进行图像重建。DCT is a type of Fourier transform used for lossy data compression of images or videos. DCT divides an image into small blocks of different frequencies, which are called DCT blocks. Then, the DCT blocks are quantized. During the quantization process, high-frequency components are discarded and the remaining low-frequency components are saved for image reconstruction.

基于上述原理，可以通过提取P帧对应的三种压缩域语法元素，来表征P帧所包含的运动情况；通过提取B帧对应的三种压缩域语法元素，来表征B帧所包含的运动情况。应理解，针对视频码流数据中包括的每个P帧和每个B帧都可以提取到各自对应的三种压缩域语法元素。Based on the above principle, the motion contained in the P frame can be represented by extracting the three compression domain syntax elements corresponding to the P frame; the motion contained in the B frame can be represented by extracting the three compression domain syntax elements corresponding to the B frame. It should be understood that the three compression domain syntax elements corresponding to each P frame and each B frame included in the video bitstream data can be extracted.

S12、根据压缩域语法元素，确定P帧和B帧对应的运动特征。S12. Determine motion features corresponding to the P frame and the B frame according to the compression domain syntax elements.

上述S12也可以描述为：根据P帧对应的压缩域语法元素，确定对应的运动特征；根据B帧对应的压缩域语法元素，确定对应的运动特征。该运动特征可以包括：运动信息量b、运动矢量强度m、残差系数密度d。The above S12 can also be described as: determining the corresponding motion feature according to the compression domain syntax element corresponding to the P frame; determining the corresponding motion feature according to the compression domain syntax element corresponding to the B frame. The motion feature may include: motion information amount b, motion vector strength m, and residual coefficient density d.

应理解，P帧对应的压缩域语法元素有三种，分别为编码比特量、运动矢量和残差系数，基于此，可以根据编码比特量确定出对应的运动信息量b，根据运动矢量确定出对应的运动矢量强度m，以及根据残差系数确定出对应的残差系数密度d。It should be understood that there are three compression domain syntax elements corresponding to the P frame, namely, the coding bit amount, motion vector and residual coefficient. Based on this, the corresponding motion information amount b can be determined according to the coding bit amount, the corresponding motion vector strength m can be determined according to the motion vector, and the corresponding residual coefficient density d can be determined according to the residual coefficient.

同理，B帧对应的压缩域语法元素也有三种，分别为编码比特量、运动矢量和残差系数，基于此，可以根据编码比特量确定出对应的运动信息量b，根据运动矢量确定出对应的运动矢量强度m，以及根据残差密度确定出对应的残差系数密度d。下面分别对运动特征的确定过程进行介绍。Similarly, there are three types of compression domain syntax elements corresponding to B frames, namely, coding bit amount, motion vector and residual coefficient. Based on this, the corresponding motion information amount b can be determined according to the coding bit amount, the corresponding motion vector strength m can be determined according to the motion vector, and the corresponding residual coefficient density d can be determined according to the residual density. The following introduces the process of determining motion features.

示例性地，参考图7，图7为本申请实施例提供的一种视频帧划分成块的示意图。如图7中的（a）所示，假设第t个视频帧U（t）被划分成不重叠的4×4的像素块，按照光栅扫描顺序进行块地址分配，那么第i个像素块可以表示为

，每个像素块

可以基于提取的三种压缩域语法元素，计算出对应的三个运动特征。应理解，光栅扫描顺序指的是从左往右、从上往下，先扫描一行再移动至下一行起始位置进行扫描的顺序。For example, refer to FIG. 7 , which is a schematic diagram of dividing a video frame into blocks provided by an embodiment of the present application. As shown in (a) of FIG. 7 , assuming that the t-th video frame U(t) is divided into non-overlapping 4×4 pixel blocks, and the block addresses are allocated in a raster scanning order, then the i-th pixel block can be represented as

, each pixel block

The corresponding three motion features can be calculated based on the three extracted compression domain syntax elements. It should be understood that the raster scanning order refers to the order of scanning from left to right and from top to bottom, first scanning a line and then moving to the starting position of the next line for scanning.

1）编码单元(codingunit，CU)是视频编码器的基本单元，如图7中的（b）所示，假设在H.264编码过程中，将视频帧划分成4×4的编码单元块（也可以称为宏块），每一个编码单元块

包含了

个4×4的子块

，其中任一个子块

对应的运动信息量

可以基于以下公式（一）确定：1) Coding unit (CU) is the basic unit of the video encoder. As shown in (b) of Figure 7, it is assumed that in the H.264 encoding process, the video frame is divided into 4×4 coding unit blocks (also called macroblocks). Each coding unit block

Included

4×4 sub-blocks

, any of the sub-blocks

The corresponding motion information

It can be determined based on the following formula (I):

其中，

为向下取整运算，

指示一个编码单元块所对应的编码比特量，

指示子块的个数，

用于指示编码单元块

所包含的4×4子块的块地址集合。in,

To perform floor operation,

Indicates the amount of coded bits corresponding to a coding unit block,

Indicates the number of sub-blocks,

Used to indicate a coding unit block

The set of block addresses of the included 4×4 sub-blocks.

需要说明的是，块地址例如可以使用1、2、3等数字序列来表示，当然，也可以适用其他方式进行表示，本申请实施例对此不进行任何限制。另外，上述编码单元块的划分是以H.264为例进行举例，在H.265中编码单元块为尺寸不一的块，其尺寸范围是16x16块至128x128块，计算方法类似，在此不再赘述。It should be noted that the block address can be represented by a digital sequence such as 1, 2, 3, etc., of course, other methods can also be used to represent it, and the embodiments of the present application do not impose any restrictions on this. In addition, the above-mentioned division of the coding unit block is based on H.264 as an example. In H.265, the coding unit block is a block of different sizes, and its size range is 16x16 blocks to 128x128 blocks. The calculation method is similar and will not be repeated here.

2）如图7中的（c）所示，假设将视频帧到下一帧的位移量划分成多个运动矢量块，每一个运动矢量块

包含了

个4×4子块，

是一个运动矢量块

所包含的4×4子块的块地址集合，运动矢量块

提取的横竖两个方向的运动矢量值为

和

，则其中的任一个子块

的运动矢量强度

可以基于以下公式（二）确定：2) As shown in (c) of Figure 7, assume that the displacement from one video frame to the next frame is divided into multiple motion vector blocks, each of which is

Included

4×4 sub-blocks,

is a motion vector block

The block address set of the contained 4×4 sub-blocks, motion vector block

The extracted motion vector values in both horizontal and vertical directions are

and

, then any of the sub-blocks

The motion vector strength

It can be determined based on the following formula (II):

其中，

为向下取整运算。in,

This is a floor operation.

3）为了统一块的大小，也即统一计算的精度，可以给定一个DCT块

，其中，DCT块

包含了

个4×4子块，这些4×4子块的块地址集合表示为

，那么对于其中的任一个子块

的残差系数密度

可以基于以下公式（三）确定：3) In order to unify the block size, that is, the accuracy of the calculation, a DCT block can be given

, where DCT block

Included

4×4 sub-blocks, the block address set of these 4×4 sub-blocks is expressed as

, then for any of the sub-blocks

The residual coefficient density of

It can be determined based on the following formula (III):

其中，

为向下取整运算，

表示计算一个DCT块中的非零残差系数密度。in,

To perform floor operation,

It means calculating the density of non-zero residual coefficients in a DCT block.

需要说明的是，在H.264中，DCT变换有4×4、8×8的尺寸，视频帧划分成不重叠的像素块的个数需与DCT变换的尺寸匹配。It should be noted that in H.264, DCT transform has 4×4 and 8×8 sizes, and the number of non-overlapping pixel blocks into which the video frame is divided must match the size of the DCT transform.

基于上述三个公式，可以分别确定出P帧包括的每个块所对应的三个运动特征，以及确定出B帧包括的每个块所对应的三个运动特征。Based on the above three formulas, three motion features corresponding to each block included in the P frame and three motion features corresponding to each block included in the B frame can be determined respectively.

应理解，由于压缩域语法元素在背景区域和前景区域的统计特性是完全不同的，由此可以利用这种差异化特点，来粗略地作为估计视频画面中的运动区域的考量标准。It should be understood that since the statistical characteristics of the compressed domain syntax elements in the background area and the foreground area are completely different, this difference can be used as a rough consideration criterion for estimating the motion area in the video picture.

S13、根据I帧前后相邻B帧和/或P帧对应的运动特征，确定I帧对应的运动特征。S13: Determine the motion features corresponding to the I frame according to the motion features corresponding to the adjacent B frames and/or P frames before and after the I frame.

该I帧的运动特征也包括：运动信息量b、运动矢量强度m、残差系数密度d。The motion characteristics of the I frame also include: motion information amount b, motion vector strength m, and residual coefficient density d.

其中，I帧前后相邻多帧的数量可以为前后相邻的1帧、2帧、3帧，具体数量可以根据需要进行设定和调整，本申请实施例对此不进行任何限制。Among them, the number of adjacent multiple frames before and after the I frame can be 1 frame, 2 frames, or 3 frames before and after. The specific number can be set and adjusted as needed, and the embodiment of the present application does not impose any restrictions on this.

例如，若设定基于I帧前后相邻的1帧视频帧的运动特征，来确定该I帧的运动特征。结合图4所示，针对序列中的I帧来说，前面相邻的1帧视频帧为P帧，后面相邻的1帧视频帧为B帧，则可以基于该前后相邻的P帧和B帧的运动特征，来确定该I帧的运动特征。For example, if the motion features of the I frame are determined based on the motion features of the 1 video frames that are adjacent to and before the I frame, as shown in FIG4 , for an I frame in a sequence, the 1 video frame that is adjacent to the front is a P frame, and the 1 video frame that is adjacent to the back is a B frame, then the motion features of the I frame can be determined based on the motion features of the 1 P frame and the 1 B frame that are adjacent to the front and back.

又例如，若设定基于I帧前后相邻的2帧视频帧的运动特征，来确定该I帧的运动特征。结合图4所示，针对序列中的I帧来说，前面相邻的2帧视频帧均为P帧，后面相邻的2帧视频帧均为B帧，则可以基于该前后相邻的2帧P帧和2帧B帧的运动特征，来确定该I帧的运动特征。其他数量的计算方法依次类推，在此不再赘述。For another example, if the motion features of the I frame are determined based on the motion features of the two adjacent video frames before and after the I frame. As shown in FIG4 , for an I frame in a sequence, the two adjacent video frames in front are P frames, and the two adjacent video frames in the back are B frames. Then, the motion features of the I frame can be determined based on the motion features of the two adjacent P frames and the two B frames. The calculation methods of other quantities are similar and will not be described in detail here.

应理解，上述仅为两种举例，I帧前后相邻的视频帧的类型可以为其他组合情况，但是，I帧前后相邻帧只可能是P帧或B帧，因此确定I帧的运动特征时，基于已经确定出的B帧和P帧的运动特征即可进行确定。It should be understood that the above are only two examples, and the types of adjacent video frames before and after the I frame can be other combinations, but the adjacent frames before and after the I frame can only be P frames or B frames. Therefore, when determining the motion features of the I frame, the determination can be made based on the motion features of the B frame and the P frame that have been determined.

需要说明的是，由于I帧是一帧完整帧，在编码过程没有产生对应的压缩域语法元素，因而无法确定出对应的运动特征。但是，考虑到视频中运动对象具有时空连续性，所进行的运动是连续的，因此，可以使用时域插值的方法来确定I帧对应的运动特征。It should be noted that since the I frame is a complete frame, no corresponding compression domain syntax elements are generated during the encoding process, and therefore the corresponding motion features cannot be determined. However, considering that the moving objects in the video have spatiotemporal continuity and the motion is continuous, the time domain interpolation method can be used to determine the motion features corresponding to the I frame.

可选地，I帧的运动特征，可以利用以下公式（四）来确定：Optionally, the motion feature of the I frame can be determined using the following formula (IV):

其中，t—φ为起始时间，t+φ为结束时间，τ为起始时间至结束时间中的任意时间，τ时间对应的视频帧U(τ)中第i个子块对应的运动信息量为

、运动矢量强度为

、残差系数密度为

，t时间对应的I帧视频帧中第i个子块对应的运动信息量为

、运动矢量强度为

、残差系数密度为

。Among them, t-φ is the start time, t+φ is the end time, τ is any time between the start time and the end time, and the amount of motion information corresponding to the i-th sub-block in the video frame U(τ) corresponding to the τ time is

, the motion vector strength is

, the residual coefficient density is

, the amount of motion information corresponding to the i-th sub-block in the I-frame video frame corresponding to time t is

, the motion vector strength is

, the residual coefficient density is

.

需要说明的是，根据I帧前后相邻的视频帧的运动特征，利用以上公式（四）可以插值确定出该I帧的运动特征，从而可以对漏检区域进行填补。其插值窗口可以更改，具体可以根据需要进行设置。It should be noted that, according to the motion features of the adjacent video frames before and after the I frame, the above formula (IV) can be used to interpolate and determine the motion features of the I frame, so that the missed detection area can be filled. The interpolation window can be changed and can be set according to needs.

S14、对I帧、P帧和B帧分别对应的运动特征，进行平滑处理。S14, performing smoothing processing on the motion features corresponding to the I frame, the P frame and the B frame respectively.

可选地，可以利用时域中值滤波算法进行平滑处理，当然，还可以利用其他滤波算法或多种算法的组合来进行平滑处理，本申请实施例对此不进行任何限制。Optionally, a time domain median filtering algorithm may be used for smoothing. Of course, other filtering algorithms or a combination of multiple algorithms may also be used for smoothing, and the embodiments of the present application do not impose any limitation on this.

应理解，进行平滑处理后，相邻视频帧之间的运动信息过渡更加自然，可以去除误检区域的噪声，还可以避免出现个别数据异常、差异较大的情况。It should be understood that after smoothing, the transition of motion information between adjacent video frames is more natural, the noise in the false detection area can be removed, and the occurrence of individual data anomalies and large differences can be avoided.

S15、根据每帧视频帧对应的运动特征，重构生成二维矩阵。S15. Reconstruct and generate a two-dimensional matrix according to the motion features corresponding to each video frame.

可选地，根据S13中没有进行平滑处理的运动特征，重构生成二维矩阵，或者还可以根据S14中平滑处理后的运动特征，重构生成二维矩阵。Optionally, the two-dimensional matrix may be reconstructed and generated according to the motion features that have not been smoothed in S13, or the two-dimensional matrix may be reconstructed and generated according to the motion features that have been smoothed in S14.

由于每个视频帧对应的运动特征包括运动信息量、运动矢量强度和残差系数密度三个运动特征，因此，重构生成二维矩阵后，可以得到三个二维矩阵。Since the motion features corresponding to each video frame include three motion features, namely, the amount of motion information, the intensity of motion vectors, and the density of residual coefficients, three two-dimensional matrices can be obtained after reconstructing and generating a two-dimensional matrix.

参考图8，图8为三种运动特征数据重构生成的二维矩阵的可视化图像。图8中的（a）为运动信息量对应的二维矩阵的可视化图像，图8中的（b）为运动矢量强度对应的二维矩阵的可视化图像，图8中的（c）为残差系数密度对应的二维矩阵的可视化图像。Refer to Figure 8, which is a visualization image of the two-dimensional matrix generated by reconstructing three types of motion feature data. (a) in Figure 8 is a visualization image of the two-dimensional matrix corresponding to the amount of motion information, (b) in Figure 8 is a visualization image of the two-dimensional matrix corresponding to the motion vector intensity, and (c) in Figure 8 is a visualization image of the two-dimensional matrix corresponding to the residual coefficient density.

需要说明的是，由于二维矩阵中的每个点的特征值源自一个4×4的块，也就是说，4×4的块转换到二维矩阵时是一个1×1的点，因此，图8所示的二维矩阵的可视化图像尺寸为视频帧的分辨率的四分之一。It should be noted that since the eigenvalue of each point in the two-dimensional matrix originates from a 4×4 block, that is, the 4×4 block is a 1×1 point when converted to a two-dimensional matrix, the visualization image size of the two-dimensional matrix shown in FIG8 is one quarter of the resolution of the video frame.

S16、将二维矩阵进行通道合并后，输入运动检测网络进行检测，确定目标运动物体。S16, after merging the channels of the two-dimensional matrix, input the matrix into a motion detection network for detection to determine the target moving object.

参考图8可知，每帧视频帧对应有三个二维矩阵，由此，可以将该三个二维矩阵进行通道合并（concat），然后再输入运动检测网络进行检测处理。Referring to FIG. 8 , it can be seen that each video frame corresponds to three two-dimensional matrices. Therefore, the three two-dimensional matrices can be channel-merged (concat) and then input into the motion detection network for detection processing.

需要说明的是，深度学习的运动检测网络的输入通常为二维图片形式，因此，本申请中需要将运动特征重构后以二维矩阵的形式输入运动检测网络，由此，可以便于运动检测网络进行检测处理。二维矩阵携带有空间位置信息。It should be noted that the input of the deep learning motion detection network is usually in the form of a two-dimensional image. Therefore, in this application, the motion features need to be reconstructed and input into the motion detection network in the form of a two-dimensional matrix, thereby facilitating the motion detection network to perform detection processing. The two-dimensional matrix carries spatial position information.

可选地，运动检测网络可以包括：Darknet神经网络模型和YOLOv3目标检测模型。按照处理顺序，二维矩阵进行通道合并处理后先输入Darknet神经网络模型进行处理，然后，将处理结果再输入YOLOv3目标检测模型进行处理。Optionally, the motion detection network may include: a Darknet neural network model and a YOLOv3 target detection model. According to the processing order, the two-dimensional matrix is first input into the Darknet neural network model for processing after the channel merging processing, and then the processing result is input into the YOLOv3 target detection model for processing.

其中，Darknet神经网络用于根据通道合并后的二维矩阵生成多尺度的卷积特征层；YOLOv3目标检测模型用于根据多尺度的卷积特征层生成目标运动物体的边框定位。多尺度的卷积特征层用于指示二维矩阵的不同尺度的融合结果；边框定位包括目标运动物体所对应的边框的一个顶点的二维坐标、以及边框的宽和高。Among them, the Darknet neural network is used to generate a multi-scale convolutional feature layer based on the two-dimensional matrix after channel merging; the YOLOv3 target detection model is used to generate the border positioning of the target moving object based on the multi-scale convolutional feature layer. The multi-scale convolutional feature layer is used to indicate the fusion results of different scales of the two-dimensional matrix; the border positioning includes the two-dimensional coordinates of a vertex of the border corresponding to the target moving object, as well as the width and height of the border.

应理解，上述包括Darknet神经网络模型和YOLOv3目标检测模型的运动检测模型为根据训练样本数据已训练好的运动检测模型，具有优异且稳定的检测能力。运动检测网络还可以为Faster-RCNN，RetinaNet、SSD等网络模型，当然，还可以为其他多个模型的组合，本申请实施例对此不进行任何限制。It should be understood that the motion detection model including the Darknet neural network model and the YOLOv3 target detection model is a motion detection model that has been trained based on the training sample data and has excellent and stable detection capabilities. The motion detection network can also be a network model such as Faster-RCNN, RetinaNet, SSD, and of course, it can also be a combination of multiple other models, and the embodiments of the present application do not impose any restrictions on this.

还应理解，确定出的目标运动物体的边框定位可以在显示屏上进行显示。It should also be understood that the determined frame location of the target moving object may be displayed on a display screen.

本申请实施例提供的运动物体检测方法，该方法可以直接利用视频编码过程中产生的运动信息进行运动分析；结合该运动信息，基于运动物体的时空连贯性，对运动特征进行插值处理和滤波平滑；再重建成二维矩阵形式后输入Darknet神经网络模型，以融合不同的运动特征，学习有效且全面的运动语义表征；最后基于YOLO-v3目标检测模型对神经网络模型输出的多尺度特征层进行运动检测，以预测视频画面中的运动目标。The moving object detection method provided in the embodiment of the present application can directly use the motion information generated in the video encoding process to perform motion analysis; combine the motion information, and interpolate and filter the motion features based on the spatiotemporal coherence of the moving object; then reconstruct them into a two-dimensional matrix form and input them into the Darknet neural network model to fuse different motion features and learn effective and comprehensive motion semantic representation; finally, based on the YOLO-v3 target detection model, motion detection is performed on the multi-scale feature layer output by the neural network model to predict the moving target in the video picture.

相比于现有的基于压缩域的运动检测方案，本申请无需解码帧图像数据，可直接从压缩域中提取可靠的运动信息进行运动分析，因此处理速度容易达到实时性。在分析时，针对编码比特量、运动矢量、残差系数三种压缩域运动信息分别设计了运动信息量、运动矢量强度、残差系数密度三种运动特征来表征视频画面的运动情况。另外，又结合了Darknet神经网络模型、YOLO-v3目标检测模型，从而可以实现高效、快速的视频运动目标检测任务，在保证实时性的基础下，解决相关方法中在复杂场景下的鲁棒性弱、性能差的问题。Compared with the existing motion detection scheme based on compressed domain, this application does not need to decode frame image data, and can directly extract reliable motion information from the compressed domain for motion analysis, so the processing speed can easily reach real-time. During the analysis, three motion features, namely motion information amount, motion vector intensity, and residual coefficient density, are designed for the three types of compressed domain motion information, namely, coding bit amount, motion vector, and residual coefficient, to characterize the motion of the video screen. In addition, the Darknet neural network model and the YOLO-v3 target detection model are combined to achieve efficient and fast video motion target detection tasks, and solve the problems of weak robustness and poor performance in complex scenarios in related methods while ensuring real-time performance.

上文结合图1至图8，描述了本申请实施例适用的多种场景以及运动物体检测方法。下面将结合图9至图12，详细描述本申请适用的电子设备的软件系统、硬件系统、装置以及芯片系统。应理解，本申请实施例中的软件系统、硬件系统、装置以及芯片系统可以执行前述本申请实施例的各种方法，即以下各种产品的具体工作过程，可以参考前述方法实施例中的对应过程。The above text, in conjunction with Figures 1 to 8, describes a variety of scenarios and moving object detection methods applicable to embodiments of the present application. The following will describe in detail the software system, hardware system, device, and chip system of the electronic device applicable to the present application in conjunction with Figures 9 to 12. It should be understood that the software system, hardware system, device, and chip system in the embodiments of the present application can execute the various methods of the aforementioned embodiments of the present application, that is, the specific working processes of the following various products can refer to the corresponding processes in the aforementioned method embodiments.

本申请实施例提供的运动物体检测方法可以适用于各种电子设备。在本申请实施例中，电子设备可以是手机、智慧屏、平板电脑、可穿戴电子设备、车载电子设备、增强现实设备、虚拟现实（virtual reality，VR）设备、笔记本电脑、超级移动个人计算机（ultra-mobile personal computer，UMPC）、上网本、个人数字助理（personal digitalassistant，PDA）、投影仪等等，本申请实施例对电子设备的具体类型不作任何限制。The moving object detection method provided in the embodiment of the present application can be applied to various electronic devices. In the embodiment of the present application, the electronic device can be a mobile phone, a smart screen, a tablet computer, a wearable electronic device, an in-vehicle electronic device, an augmented reality device, a virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), a projector, etc. The embodiment of the present application does not impose any restrictions on the specific type of the electronic device.

图9示出了一种适用于本申请的电子设备的硬件系统。电子设备100可用于实现上述方法实施例中描述的运动物体检测方法。Fig. 9 shows a hardware system of an electronic device applicable to the present application. The electronic device 100 can be used to implement the moving object detection method described in the above method embodiment.

电子设备100可以包括处理器110，外部存储器接口120，内部存储器121，通用串行总线（universal serial bus，USB）接口130，充电管理模块140，电源管理模块141，电池142，天线1，天线2，移动通信模块150，无线通信模块160，音频模块170，扬声器170A，受话器170B，麦克风170C，耳机接口170D，传感器模块180，按键190，马达191，指示器192，摄像头193，显示屏194，以及用户标识模块（subscriberidentification module，SIM）卡接口195等。其中传感器模块180可以包括压力传感器180A，陀螺仪传感器180B，气压传感器180C，磁传感器180D，加速度传感器180E，距离传感器180F，接近光传感器180G，指纹传感器180H，温度传感器180J，触摸传感器180K，环境光传感器180L，骨传导传感器180M等。The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a subscriber identification module (SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.

需要说明的是，图9所示的结构并不构成对电子设备100的具体限定。It should be noted that the structure shown in FIG. 9 does not constitute a specific limitation on the electronic device 100 .

在本申请另一些实施例中，电子设备100可以包括比图9所示的部件更多或更少的部件，或者，电子设备100可以包括图9所示的部件中某些部件的组合，或者，电子设备100可以包括图9所示的部件中某些部件的子部件。图9所示的部件可以以硬件、软件、或软件和硬件的组合实现。In other embodiments of the present application, the electronic device 100 may include more or fewer components than those shown in FIG9 , or the electronic device 100 may include a combination of some of the components shown in FIG9 , or the electronic device 100 may include subcomponents of some of the components shown in FIG9 . The components shown in FIG9 may be implemented in hardware, software, or a combination of software and hardware.

处理器110可以包括一个或多个处理单元。例如，处理器110可以包括以下处理单元中的至少一个：应用处理器（application processor，AP）、调制解调处理器、图形处理器（graphics processing unit，GPU）、图像信号处理器（image signal processor，ISP）、控制器、视频编解码器、数字信号处理器（digital signal processor，DSP）、基带处理器、神经网络处理器（neural-network processing unit，NPU）。其中，不同的处理单元可以是独立的器件，也可以是集成的器件。控制器可以根据指令操作码和时序信号，产生操作控制信号，完成取指令和执行指令的控制。The processor 110 may include one or more processing units. For example, the processor 110 may include at least one of the following processing units: an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a video codec, a digital signal processor (DSP), a baseband processor, and a neural-network processing unit (NPU). Different processing units may be independent devices or integrated devices. The controller may generate an operation control signal according to the instruction opcode and the timing signal to complete the control of fetching and executing instructions.

处理器110中还可以设置存储器，用于存储指令和数据。在一些实施例中，处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据，可从所述存储器中直接调用。避免了重复存取，减少了处理器110的等待时间，因而提高了系统的效率。The processor 110 may also be provided with a memory for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may store instructions or data that the processor 110 has just used or cyclically used. If the processor 110 needs to use the instruction or data again, it may be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.

在一些实施例中，处理器110可以包括一个或多个接口。例如，处理器110可以包括以下接口中的至少一个：内部集成电路（inter-integrated circuit，I2C）接口、内部集成电路音频（inter-integrated circuit sound，I2S）接口、脉冲编码调制（pulse codemodulation，PCM）接口、通用异步接收传输器（universal asynchronousreceiver/transmitter，UART）接口、移动产业处理器接口（mobile industry processorinterface，MIPI）、通用输入输出（general-purpose input/output，GPIO）接口、SIM接口、USB接口。In some embodiments, the processor 110 may include one or more interfaces. For example, the processor 110 may include at least one of the following interfaces: an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a SIM interface, and a USB interface.

示例性地，在本申请的实施例中，处理器110可以用于执行本申请实施例提供的运动物体检测方法；例如，获取第一图像和法向图，所述法向图用于表示被摄物对应的法线方向，所述第一图像和所述法向图来源于不同的摄像头；将所述第一图像和所述法向图进行拼接，得到第一拼接图像；利用光照主方向估计模型对所述第一拼接图像进行处理，确定光照主方向，所述光照主方向包括光源的方位角和高度角。Exemplarily, in an embodiment of the present application, the processor 110 can be used to execute the moving object detection method provided in the embodiment of the present application; for example, obtain a first image and a normal map, the normal map is used to represent the normal direction corresponding to the subject, and the first image and the normal map come from different cameras; splice the first image and the normal map to obtain a first spliced image; use the main illumination direction estimation model to process the first spliced image to determine the main illumination direction, and the main illumination direction includes the azimuth and altitude of the light source.

图9所示的各模块间的连接关系只是示意性说明，并不构成对电子设备100的各模块间的连接关系的限定。可选地，电子设备100的各模块也可以采用上述实施例中多种连接方式的组合。The connection relationship between the modules shown in Fig. 9 is only a schematic illustration and does not constitute a limitation on the connection relationship between the modules of the electronic device 100. Optionally, the modules of the electronic device 100 may also adopt a combination of multiple connection modes in the above embodiments.

电子设备100的无线通信功能可以通过天线1、天线2、移动通信模块150、无线通信模块160、调制解调处理器以及基带处理器等器件实现。The wireless communication function of the electronic device 100 can be implemented through components such as the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.

电子设备100可以通过GPU、显示屏194以及应用处理器实现显示功能。GPU为图像处理的微处理器，连接显示屏194和应用处理器。GPU用于执行数学和几何计算，用于图形渲染。处理器110可包括一个或多个GPU，其执行程序指令以生成或改变显示信息。The electronic device 100 can realize the display function through the GPU, the display screen 194 and the application processor. The GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs, which execute program instructions to generate or change display information.

显示屏194可以用于显示图像或视频。Display screen 194 may be used to display images or videos.

电子设备100可以通过ISP、摄像头193、视频编解码器、GPU、显示屏194以及应用处理器等实现拍摄功能。The electronic device 100 can realize the shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194 and the application processor.

ISP 用于处理摄像头193反馈的数据。例如，拍照时，打开快门，光线通过镜头被传递到摄像头感光元件上，光信号转换为电信号，摄像头感光元件将所述电信号传递给ISP处理，转化为肉眼可见的图像。ISP可以对图像的噪点、亮度和色彩进行算法优化，ISP还可以优化拍摄场景的曝光和色温等参数。在一些实施例中，ISP可以设置在摄像头193中。The ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converts it into an image visible to the naked eye. The ISP can perform algorithm optimization on the noise, brightness and color of the image. The ISP can also optimize the exposure and color temperature of the shooting scene. In some embodiments, the ISP can be set in the camera 193.

摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件（charge coupled device，CCD）或互补金属氧化物半导体（complementary metal-oxide-semiconductor，CMOS）光电晶体管。感光元件把光信号转换成电信号，之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的红绿蓝（red green blue，RGB），YUV等格式的图像信号。在一些实施例中，电子设备100可以包括1个或N个摄像头193，N为大于1的正整数。The camera 193 is used to capture still images or videos. The object generates an optical image through the lens and projects it onto the photosensitive element. The photosensitive element can be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP for conversion into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard red green blue (RGB), YUV or other format. In some embodiments, the electronic device 100 may include 1 or N cameras 193, where N is a positive integer greater than 1.

数字信号处理器用于处理数字信号，除了可以处理数字图像信号，还可以处理其他数字信号。例如，当电子设备100在频点选择时，数字信号处理器用于对频点能量进行傅里叶变换等。The digital signal processor is used to process digital signals, and can process not only digital image signals but also other digital signals. For example, when the electronic device 100 is selecting a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.

视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样，电子设备100可以播放或录制多种编码格式的视频，例如：动态图像专家组（moving picture experts group，MPEG）1、MPEG2、MPEG3和MPEG4。Video codecs are used to compress or decompress digital videos. The electronic device 100 may support one or more video codecs. Thus, the electronic device 100 may play or record videos in a variety of coding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, and MPEG4.

外部存储器接口120可以用于连接外部存储卡，例如安全数码（secure digital，SD）卡，实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信，实现数据存储功能。例如将音乐，视频等文件保存在外部存储卡中。The external memory interface 120 can be used to connect an external memory card, such as a secure digital (SD) card, to expand the storage capacity of the electronic device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, such as storing music, video and other files in the external memory card.

内部存储器121可以用于存储计算机可执行程序代码，所述可执行程序代码包括指令。内部存储器121可以包括存储程序区和存储数据区。The internal memory 121 may be used to store computer executable program codes, which include instructions. The internal memory 121 may include a program storage area and a data storage area.

电子设备100可以通过音频模块170、扬声器170A、受话器170B、麦克风170C、耳机接口170D以及应用处理器等实现音频功能，例如，音乐播放和录音。The electronic device 100 can implement audio functions, such as music playing and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor.

音频模块170用于将数字音频信息转换成模拟音频信号输出，也可以用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。The audio module 170 is used to convert digital audio information into analog audio signal output, and can also be used to convert analog audio input into digital audio signals. The audio module 170 can also be used to encode and decode audio signals.

扬声器170A，也称为喇叭，用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐或免提通话。受话器170B，也称为听筒，用于将音频电信号转换成声音信号。The speaker 170A, also known as a horn, is used to convert an audio electrical signal into a sound signal. The electronic device 100 can listen to music or make a hands-free call through the speaker 170A. The receiver 170B, also known as an earpiece, is used to convert an audio electrical signal into a sound signal.

在一些实施例中，压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多，例如可以是电阻式压力传感器、电感式压力传感器或电容式压力传感器。电容式压力传感器可以是包括至少两个具有导电材料的平行板，当力作用于压力传感器180A，电极之间的电容改变，电子设备100根据电容的变化确定压力的强度。当触摸操作作用于显示屏194时，电子设备100根据压力传感器180A检测所述触摸操作。电子设备100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中，作用于相同触摸位置，但不同触摸操作强度的触摸操作，可以对应不同的操作指令。例如：当触摸操作强度小于第一压力阈值的触摸操作作用于短消息应用图标时，执行查看短消息的指令；当触摸操作强度大于或等于第一压力阈值的触摸操作作用于短消息应用图标时，执行新建短消息的指令。In some embodiments, the pressure sensor 180A may be provided on the display screen 194. There are many types of pressure sensors 180A, for example, they may be resistive pressure sensors, inductive pressure sensors or capacitive pressure sensors. The capacitive pressure sensor may include at least two parallel plates having conductive materials. When force acts on the pressure sensor 180A, the capacitance between the electrodes changes, and the electronic device 100 determines the intensity of the pressure according to the change in capacitance. When a touch operation acts on the display screen 194, the electronic device 100 detects the touch operation according to the pressure sensor 180A. The electronic device 100 may also calculate the position of the touch according to the detection signal of the pressure sensor 180A. In some embodiments, touch operations acting on the same touch position but with different touch operation intensities may correspond to different operation instructions. For example: when a touch operation with a touch operation intensity less than the first pressure threshold acts on a short message application icon, an instruction to view a short message is executed; when a touch operation with a touch operation intensity greater than or equal to the first pressure threshold acts on a short message application icon, an instruction to create a new short message is executed.

上文详细描述了电子设备100的硬件系统，下面介绍电子设备100的软件系统。软件系统可以采用分层架构、事件驱动架构、微核架构、微服务架构或云架构，本申请实施例以分层架构为例，示例性地描述电子设备100的软件系统。The above describes in detail the hardware system of the electronic device 100, and the following describes the software system of the electronic device 100. The software system may adopt a layered architecture, an event-driven architecture, a micro-core architecture, a micro-service architecture, or a cloud architecture. The embodiment of the present application takes the layered architecture as an example to exemplarily describe the software system of the electronic device 100.

如图10所示，采用分层架构的软件系统分成若干个层，每一层都有清晰的角色和分工。层与层之间通过软件接口通信。As shown in Figure 10, a software system using a layered architecture is divided into several layers, each with clear roles and division of labor. The layers communicate with each other through software interfaces.

在一些实施例中，软件系统可以分为四层，从上至下分别为应用程序层、应用程序框架层、安卓运行时（Android Runtime）和系统库、以及内核层。In some embodiments, the software system can be divided into four layers, from top to bottom, namely, an application layer, an application framework layer, an Android runtime (Android Runtime) and a system library, and a kernel layer.

应用程序层可以包括相机、图库、日历、通话、地图、导航、WLAN、蓝牙、音乐、视频、短信息等应用程序。The application layer may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.

本申请实施例的运动物体检测方法可以应用于相机应用、视频应用、AR应用等。The moving object detection method of the embodiment of the present application can be applied to camera applications, video applications, AR applications, etc.

应用程序框架层为应用程序层的应用程序提供应用程序编程接口（applicationprogramming interface，API）和编程框架。应用程序框架层可以包括一些预定义的函数。The application framework layer provides an application programming interface (API) and a programming framework for applications in the application layer. The application framework layer may include some predefined functions.

例如，应用程序框架层包括窗口管理器、内容提供器、视图系统、电话管理器、资源管理器和通知管理器。For example, the application framework layer includes the window manager, content provider, view system, telephony manager, resource manager, and notification manager.

窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小，判断是否有状态栏、锁定屏幕和截取屏幕。The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, and capture the screen.

内容提供器用来存放和获取数据，并使这些数据可以被应用程序访问。所述数据可以包括视频、图像、音频、拨打和接听的电话、浏览历史和书签、以及电话簿。Content providers are used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, and phone books.

视图系统包括可视控件，例如显示文字的控件和显示图片的控件。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成，例如，包括短信通知图标的显示界面，可以包括显示文字的视图以及显示图片的视图。The view system includes visual controls, such as controls for displaying text and controls for displaying images. The view system can be used to build applications. A display interface can be composed of one or more views. For example, a display interface including a text notification icon can include a view for displaying text and a view for displaying images.

电话管理器用于提供电子设备100的通信功能，例如通话状态（接通或挂断）的管理。The phone manager is used to provide communication functions of the electronic device 100, such as management of call status (connected or hung up).

资源管理器为应用程序提供各种资源，比如本地化字符串、图标、图片、布局文件和视频文件。The resource manager provides various resources for applications, such as localized strings, icons, images, layout files, and video files.

通知管理器使应用程序可以在状态栏中显示通知信息，可以用于传达告知类型的消息，可以短暂停留后自动消失，无需用户交互。The notification manager enables applications to display notification information in the status bar, which can be used to convey notification-type messages and disappear automatically after a short stay without user interaction.

Android Runtime包括核心库和虚拟机。Android Runtime负责安卓系统的调度和管理。核心库包含两部分：一部分是java语言需要调用的功能函数，另一部分是安卓的核心库。Android Runtime includes core libraries and virtual machines. Android Runtime is responsible for scheduling and management of the Android system. The core library contains two parts: one is the function that the Java language needs to call, and the other is the Android core library.

应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理、堆栈管理、线程管理、安全和异常的管理、以及垃圾回收等功能。The application layer and the application framework layer run in the virtual machine. The virtual machine executes the Java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.

系统库可以包括多个功能模块，例如：表面管理器（surface manager），媒体库（Media Libraries），三维图形处理库（例如：针对嵌入式系统的开放图形库（opengraphics library forembedded systems，OpenGL ES）和2D图形引擎（例如：skia图形库（skiagraphics library，SGL））。The system library can include multiple functional modules, such as: surface manager, media library, 3D graphics processing library (such as open graphics library for embedded systems (OpenGL ES) and 2D graphics engine (such as skia graphics library (SGL)).

表面管理器用于对显示子系统进行管理，并且为多个应用程序提供了2D图层和3D图层的融合。The surface manager is used to manage the display subsystem and provide the fusion of 2D layers and 3D layers for multiple applications.

媒体库支持多种音频格式的回放和录制、多种视频格式回放和录制以及静态图像文件。媒体库可以支持多种音视频编码格式，例如: MPEG4、H.264、动态图像专家组音频层面3（moving picture experts group audio layer III，MP3）、高级音频编码（advancedaudio coding，AAC）、自适应多码率（adaptive multi-rate，AMR）、联合图像专家组（jointphotographic experts group，JPG）和便携式网络图形（portable network graphics，PNG）。The media library supports playback and recording of multiple audio formats, playback and recording of multiple video formats, and still image files. The media library can support multiple audio and video coding formats, such as: MPEG4, H.264, moving picture experts group audio layer III (MP3), advanced audio coding (AAC), adaptive multi-rate (AMR), joint photographic experts group (JPG) and portable network graphics (PNG).

三维图形处理库可以用于实现三维图形绘图、图像渲染、合成和图层处理。The 3D graphics processing library can be used to implement 3D graphics drawing, image rendering, compositing and layer processing.

二维图形引擎是2D绘图的绘图引擎。A 2D graphics engine is a drawing engine for 2D drawings.

内核层是硬件和软件之间的层。内核层可以包括显示驱动、摄像头驱动、音频驱动和传感器驱动等驱动模块。The kernel layer is the layer between hardware and software. The kernel layer can include driver modules such as display driver, camera driver, audio driver and sensor driver.

图11是本申请实施例提供的运动物体检测装置的结构示意图。该运动物体检测装置200包括获取单元210和处理单元220。11 is a schematic diagram of the structure of a moving object detection device provided in an embodiment of the present application. The moving object detection device 200 includes an acquisition unit 210 and a processing unit 220 .

获取单元210用于获取视频码流数据，并提取压缩域语法元素，压缩域语法元素用于指示视频码流数据中的变量信息。The acquisition unit 210 is used to acquire video code stream data and extract compression domain syntax elements, where the compression domain syntax elements are used to indicate variable information in the video code stream data.

处理单元220用于根据压缩域语法元素，利用运动检测网络进行检测，确定目标运动物体。The processing unit 220 is used to perform detection using a motion detection network according to the compressed domain syntax elements to determine the target moving object.

可选地，作为一种实施例，处理单元220还用于根据压缩域语法元素，确定运动特征；根据运动特征，生成二维矩阵；将二维矩阵输入运动检测网络进行检测，确定目标运动物体。Optionally, as an embodiment, the processing unit 220 is further used to determine motion features according to compression domain syntax elements; generate a two-dimensional matrix according to the motion features; and input the two-dimensional matrix into a motion detection network for detection to determine the target moving object.

可选地，作为一种实施例，处理单元220还用于根据P帧的压缩域语法元素，确定P帧对应的运动特征，视频码流数据包括I帧、P帧和B帧；根据B帧的压缩域语法元素，确定B帧对应的运动特征；根据I帧前后相邻的P帧对应的运动特征和/或B帧对应的运动特征，利用插值方法，确定I帧对应的运动特征。Optionally, as an embodiment, the processing unit 220 is further used to determine the motion features corresponding to the P frame according to the compression domain syntax elements of the P frame, and the video code stream data includes I frames, P frames and B frames; determine the motion features corresponding to the B frame according to the compression domain syntax elements of the B frame; and determine the motion features corresponding to the I frame using an interpolation method based on the motion features corresponding to the P frames before and after the I frame and/or the motion features corresponding to the B frame.

可选地，作为一种实施例，处理单元220还用于对I帧、P帧和B帧对应的运动特征，进行平滑处理。Optionally, as an embodiment, the processing unit 220 is further configured to perform smoothing processing on motion features corresponding to the I frame, the P frame and the B frame.

可选地，作为一种实施例，压缩域语法元素包括：编码比特量、运动矢量和残差系数。Optionally, as an embodiment, the compression domain syntax elements include: coding bit amount, motion vector and residual coefficient.

可选地，作为一种实施例，运动特征包括：运动信息量、运动矢量强度、残差系数密度；运动信息量与编码比特量对应，运动矢量强度与所述运动矢量对应，残差系数密度与残差系数对应。Optionally, as an embodiment, the motion characteristics include: motion information amount, motion vector strength, and residual coefficient density; the motion information amount corresponds to the coding bit amount, the motion vector strength corresponds to the motion vector, and the residual coefficient density corresponds to the residual coefficient.

可选地，作为一种实施例，运动检测网络包括：Darknet神经网络模型和YOLOv3目标检测模型；处理单元220还用于将二维矩阵输入Darknet神经网络模型，得到多尺度的卷积特征层；将多尺度的卷积特征层输入YOLOv3目标检测模型，确定目标运动物体对应的边框定位。Optionally, as an embodiment, the motion detection network includes: a Darknet neural network model and a YOLOv3 target detection model; the processing unit 220 is also used to input the two-dimensional matrix into the Darknet neural network model to obtain a multi-scale convolutional feature layer; the multi-scale convolutional feature layer is input into the YOLOv3 target detection model to determine the border positioning corresponding to the target moving object.

应理解，Darknet神经网络模型和YOLOv3目标检测模型可以部署于运动物体检测装置200中。It should be understood that the Darknet neural network model and the YOLOv3 target detection model can be deployed in the moving object detection device 200.

需要说明的是，上述运动物体检测装置200以功能单元的形式体现。这里的术语“单元”可以通过软件和/或硬件形式实现，对此不作具体限定。It should be noted that the moving object detection device 200 is implemented in the form of a functional unit. The term "unit" here can be implemented in the form of software and/or hardware, and is not specifically limited to this.

例如，“单元”可以是实现上述功能的软件程序、硬件电路或二者结合。所述硬件电路可能包括应用特有集成电路（application specific integrated circuit，ASIC）、电子电路、用于执行一个或多个软件或固件程序的处理器（例如共享处理器、专有处理器或组处理器等）和存储器、合并逻辑电路和/或其它支持所描述的功能的合适组件。For example, a "unit" may be a software program, a hardware circuit, or a combination of the two that implements the above functions. The hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor, or a group processor, etc.) and a memory for executing one or more software or firmware programs, a combined logic circuit, and/or other suitable components that support the described functions.

因此，在本申请的实施例中描述的各示例的单元，能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。Therefore, the units of each example described in the embodiments of the present application can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional and technical personnel can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of the present application.

图12示出了本申请提供的一种电子设备的结构示意图。图12中的虚线表示该单元或该模块为可选的，电子设备300可用于实现上述方法实施例中描述的运动物体检测方法。Fig. 12 shows a schematic diagram of the structure of an electronic device provided by the present application. The dotted line in Fig. 12 indicates that the unit or the module is optional, and the electronic device 300 can be used to implement the moving object detection method described in the above method embodiment.

电子设备300包括一个或多个处理器301，该一个或多个处理器301可支持电子设备300实现方法实施例中的方法。处理器301可以是通用处理器或者专用处理器。例如，处理器301可以是中央处理器（central processing unit，CPU）、数字信号处理器（digitalsignal processor，DSP）、专用集成电路（application specific integrated circuit，ASIC）、现场可编程门阵列（field programmable gate array，FPGA）或者其它可编程逻辑器件，如分立门、晶体管逻辑器件或分立硬件组件。The electronic device 300 includes one or more processors 301, which can support the electronic device 300 to implement the method in the method embodiment. The processor 301 can be a general-purpose processor or a special-purpose processor. For example, the processor 301 can be a central processing unit (CPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA), or other programmable logic devices, such as discrete gates, transistor logic devices, or discrete hardware components.

处理器301可以用于对电子设备300进行控制，执行软件程序，处理软件程序的数据。The processor 301 can be used to control the electronic device 300, execute software programs, and process data of the software programs.

电子设备300还可以包括通信单元305，用以实现信号的输入（接收）和输出（发送）。The electronic device 300 may further include a communication unit 305 for implementing input (reception) and output (transmission) of signals.

例如，电子设备300可以是芯片，通信单元305可以是该芯片的输入和/或输出电路，或者，通信单元305可以是该芯片的通信接口，该芯片可以作为终端设备或其它电子设备的组成部分。For example, the electronic device 300 may be a chip, the communication unit 305 may be an input and/or output circuit of the chip, or the communication unit 305 may be a communication interface of the chip, and the chip may be a component of a terminal device or other electronic devices.

又例如，电子设备300可以是终端设备，通信单元305可以是该终端设备的收发器，或者，通信单元305可以是该终端设备的收发电路。For another example, the electronic device 300 may be a terminal device, and the communication unit 305 may be a transceiver of the terminal device, or the communication unit 305 may be a transceiver circuit of the terminal device.

电子设备300中可以包括一个或多个存储器302，其上存有程序304，程序304可被处理器301运行，生成指令303，使得处理器301根据指令303执行上述方法实施例中描述的运动物体检测方法。The electronic device 300 may include one or more memories 302 on which a program 304 is stored. The program 304 can be executed by the processor 301 to generate instructions 303, so that the processor 301 executes the moving object detection method described in the above method embodiment according to the instructions 303.

可选地，存储器302中还可以存储有数据。可选地，处理器301还可以读取存储器302中存储的数据，该数据可以与程序304存储在相同的存储地址，该数据也可以与程序304存储在不同的存储地址。Optionally, data may be stored in the memory 302. Optionally, the processor 301 may read data stored in the memory 302. The data may be stored at the same storage address as the program 304, or may be stored at a different storage address than the program 304.

处理器301和存储器302可以单独设置，也可以集成在一起；例如，集成在终端设备的系统级芯片（system on chip，SOC）上。The processor 301 and the memory 302 may be provided separately or integrated together; for example, integrated on a system on chip (SOC) of a terminal device.

示例性地，存储器302可以用于存储本申请实施例中提供的运动物体检测方法的相关程序304，处理器301可以用于在视频处理时调用存储器302中存储的运动物体检测方法的相关程序304，执行本申请实施例的运动物体检测方法；例如，获取视频码流数据，并提取压缩域语法元素，压缩域语法元素用于指示视频码流数据中的变量信息；根据压缩域语法元素，利用运动检测网络进行检测，确定目标运动物体。Exemplarily, the memory 302 can be used to store the related program 304 of the moving object detection method provided in the embodiment of the present application, and the processor 301 can be used to call the related program 304 of the moving object detection method stored in the memory 302 during video processing to execute the moving object detection method of the embodiment of the present application; for example, obtain video bitstream data and extract compressed domain syntax elements, the compressed domain syntax elements are used to indicate variable information in the video bitstream data; according to the compressed domain syntax elements, use the motion detection network to perform detection and determine the target moving object.

本申请还提供了一种计算机程序产品，该计算机程序产品被处理器执行时实现本申请中任一方法实施例所述的运动物体检测方法。The present application also provides a computer program product, which, when executed by a processor, implements the moving object detection method described in any method embodiment of the present application.

该计算机程序产品可以存储在存储器302中，例如是程序经过预处理、编译、汇编和链接等处理过程最终被转换为能够被处理器执行的可执行目标文件。The computer program product may be stored in the memory 302 , for example, the program is converted into an executable target file that can be executed by a processor after preprocessing, compiling, assembling and linking.

本申请还提供了一种计算机可读存储介质，其上存储有计算机程序，该计算机程序被计算机执行时实现本申请中任一方法实施例所述的运动物体检测方法。该计算机程序可以是高级语言程序，也可以是可执行目标程序。The present application also provides a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a computer, the moving object detection method described in any method embodiment of the present application is implemented. The computer program can be a high-level language program or an executable target program.

可选地，该计算机可读存储介质例如是存储器302。存储器302可以是易失性存储器或非易失性存储器，或者，存储器302可以同时包括易失性存储器和非易失性存储器。其中，非易失性存储器可以是只读存储器（read-only memory，ROM）、可编程只读存储器（programmableROM，PROM）、可擦除可编程只读存储器（erasable PROM，EPROM）、电可擦除可编程只读存储器（electrically EPROM，EEPROM）或闪存。易失性存储器可以是随机存取存储器（random access memory，RAM），其用作外部高速缓存。通过示例性但不是限制性说明，许多形式的RAM可用，例如静态随机存取存储器（static RAM，SRAM）、动态随机存取存储器（dynamic RAM，DRAM）、同步动态随机存取存储器（synchronous DRAM，SDRAM）、双倍数据速率同步动态随机存取存储器（doubledata rate SDRAM，DDR SDRAM）、增强型同步动态随机存取存储器（enhanced SDRAM，ESDRAM）、同步连接动态随机存取存储器（synchlink DRAM，SLDRAM）和直接内存总线随机存取存储器（directrambus RAM，DR RAM）。Optionally, the computer-readable storage medium is, for example, a memory 302. The memory 302 may be a volatile memory or a non-volatile memory, or the memory 302 may include both a volatile memory and a non-volatile memory. Among them, the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link DRAM (SLDRAM), and direct memory bus random access memory (DR RAM).

本领域的技术人员可以清楚地了解到，为了描述的方便和简洁，上述描述的装置和设备的具体工作过程以及产生的技术效果，可以参考前述方法实施例中对应的过程和技术效果，在此不再赘述。Those skilled in the art can clearly understand that, for the convenience and brevity of description, the specific working process of the above-described devices and equipment and the technical effects produced can refer to the corresponding processes and technical effects in the aforementioned method embodiments, and will not be repeated here.

在本申请所提供的几个实施例中，所揭露的系统、装置和方法，可以通过其它的方式实现。例如，以上所描述的方法实施例的一些特征可以忽略，或不执行。以上所描述的装置实施例仅仅是示意性的，单元的划分，仅仅为一种逻辑功能划分，实际实现时可以有另外的划分方式，多个单元或组件可以结合或者可以集成到另一个系统。另外，各单元之间的耦合或各个组件之间的耦合可以是直接耦合，也可以是间接耦合，上述耦合包括电的、机械的或其它形式的连接。In several embodiments provided in the present application, the disclosed systems, devices and methods can be implemented in other ways. For example, some features of the method embodiments described above can be ignored or not performed. The device embodiments described above are merely schematic, and the division of units is only a logical function division. There may be other division methods in actual implementation, and multiple units or components may be combined or integrated into another system. In addition, the coupling between the units or the coupling between the components may be direct coupling or indirect coupling, and the above coupling includes electrical, mechanical or other forms of connection.

应理解，在本申请的各种实施例中，各过程的序号的大小并不意味着执行顺序的先后，各过程的执行顺序应以其功能和内在逻辑确定，而不应对本申请的实施例的实施过程构成任何限定。It should be understood that in the various embodiments of the present application, the size of the serial number of each process does not mean the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

另外，本文中术语“系统”和“网络”在本文中常被可互换使用。本文中的术语“和/或”，仅仅是一种描述关联对象的关联关系，表示可以存在三种关系，例如，A和/或B，可以表示：单独存在A，同时存在A和B，单独存在B这三种情况。另外，本文中字符“/”，一般表示前后关联对象是一种“或”的关系。In addition, the terms "system" and "network" are often used interchangeably in this article. The term "and/or" in this article is only a description of the association relationship of the associated objects, indicating that there can be three relationships. For example, A and/or B can mean: A exists alone, A and B exist at the same time, and B exists alone. In addition, the character "/" in this article generally indicates that the associated objects before and after are in an "or" relationship.

总之，以上所述仅为本申请技术方案的较佳实施例而已，并非用于限定本申请的保护范围。凡在本申请的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本申请的保护范围之内。In short, the above is only a preferred embodiment of the technical solution of this application, and is not intended to limit the protection scope of this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application should be included in the protection scope of this application.

Claims

1. A moving object detection method, characterized in that the method comprises:

Acquire video code stream data, and extract compression domain syntax elements, where the compression domain syntax elements are used to indicate variable information in the video code stream data;

According to the compression domain syntax elements, a motion detection network is used to perform detection to determine the target moving object.

2. The moving object detection method according to claim 1, characterized in that the step of detecting the target moving object by using a motion detection network according to the compressed domain syntax element comprises:

Determining motion features according to the compressed domain syntax elements;

generating a two-dimensional matrix according to the motion characteristics;

The two-dimensional matrix is input into the motion detection network for detection to determine the target moving object.

3. The moving object detection method according to claim 2, characterized in that determining the motion feature according to the compressed domain syntax element comprises:

Determining motion features corresponding to the P frame according to compression domain syntax elements of the P frame, wherein the video code stream data includes an I frame, a P frame, and a B frame;

Determine motion features corresponding to the B frame according to the compression domain syntax elements of the B frame;

The motion features corresponding to the I frame are determined by using an interpolation method according to the motion features corresponding to the P frames adjacent to and before the I frame and/or the motion features corresponding to the B frame.

4. The moving object detection method according to claim 3, characterized in that the method further comprises:

The motion features corresponding to the I frame, the P frame and the B frame are smoothed.

5. The moving object detection method according to claim 4 is characterized in that the compressed domain syntax elements include: coding bit amount, motion vector and residual coefficient.

6. The moving object detection method according to claim 5, characterized in that the motion features include: motion information amount, motion vector intensity, and residual coefficient density;

The amount of motion information corresponds to the amount of coded bits, the motion vector strength corresponds to the motion vector, and the residual coefficient density corresponds to the residual coefficient.

7. The moving object detection method according to claim 6, characterized in that the motion detection network comprises: a Darknet neural network model and a YOLOv3 target detection model;

Inputting the two-dimensional matrix into the motion detection network for detection to determine the target moving object includes:

Inputting the two-dimensional matrix into the Darknet neural network model to obtain a multi-scale convolutional feature layer;

The multi-scale convolutional feature layer is input into the YOLOv3 target detection model to determine the frame location corresponding to the target moving object.

8. An electronic device, comprising a processor and a memory;

The memory is used to store a computer program executable on the processor;

The processor is used to execute the moving object detection method according to any one of claims 1 to 7.

9. A chip system, characterized in that it comprises: a processor, used to call and run a computer program from a memory, so that a device equipped with the chip system executes the moving object detection method as described in any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, wherein the computer program includes program instructions, and when the program instructions are executed by a processor, the processor executes the moving object detection method according to any one of claims 1 to 7.