WO2025180090A1 - Data processing method and apparatus - Google Patents
Data processing method and apparatus
- Publication number
- WO2025180090A1 (PCT/CN2025/071236)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- feature
- feature maps
- dimension
- video
- processing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/96—Management of image or video recognition tasks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Definitions
- the present application relates to the field of artificial intelligence, and in particular to a data processing method and apparatus.
- artificial intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, to perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results.
- AI is a branch of computer science that seeks to understand the essence of intelligence and produce new intelligent machines that can respond in a manner similar to human intelligence.
- AI also involves studying the design principles and implementation methods of various intelligent machines, enabling them to perceive, reason, and make decisions.
- the present application provides a data processing method, which can be executed by a video processing system. The method includes: the video processing system obtains features of a video, where the features include multiple feature maps distributed along the time dimension, and the dimensions of each feature map include a channel dimension and a spatial dimension; that is, the features of the video are at least four-dimensional, and such high dimensionality results in excessive computational overhead for subsequent operations. The video processing system converts the multiple feature maps from a distribution along the time dimension to a distribution along the channel dimension or the spatial dimension, to obtain one or more first feature maps, and obtains the task processing result of the video based on the one or more first feature maps.
- changing the distribution of the multiple feature maps from the time dimension to the channel dimension or the spatial dimension can be understood as fusing the feature maps along the time dimension into the channel dimension or the spatial dimension (fusion can also be described as compression; for example, fusion into the channel dimension, into the spatial dimension, or into both at the same time).
- the idea of this application is to reduce the dimensionality of video features. Specifically, the features in the time dimension are compressed into the channel dimension or the spatial dimension, so that the compressed features (that is, the first feature maps in the embodiments of this application) no longer include the time dimension, while the amount of data in the channel and spatial dimensions increases.
- because the feature dimensionality is reduced, the amount of subsequent computation can be greatly reduced; for example, 3D convolution can be replaced with 2D convolution.
- the first feature map does not include the time dimension.
- the first feature map when obtaining the task processing result of the video based on the one or more first feature maps, can be processed by a feature extraction network to obtain one or more second feature maps, and the task processing result of the video can be obtained through the task network based on the one or more second feature maps.
- the task processing result is a processing result of a video understanding task, a video-based generation task, or a video enhancement task.
- the first feature map and the feature have the same size in spatial dimension.
- the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of frames in the time dimension, h and w are the height and width in the spatial dimension, respectively, and * denotes the product.
- the feature maps in different time dimensions among the multiple feature maps may be stacked in the channel dimension to obtain the one or more first feature maps, and the stacking order may be based on the order in the time dimension.
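As an illustrative sketch (not part of the patent itself), this time-to-channel stacking can be expressed with NumPy; the shapes follow the (x, t, h, w) size convention above, and the time-major ordering of the stacked channels is an assumed choice:

```python
import numpy as np

# A video feature of size (x, t, h, w): x channels, t frames, spatial h x w.
x, t, h, w = 4, 8, 16, 16
feature = np.random.rand(x, t, h, w).astype(np.float32)

# Stack the feature maps of different time steps along the channel
# dimension, preserving temporal order: (x, t, h, w) -> (x*t, h, w).
first_feature_map = feature.transpose(1, 0, 2, 3).reshape(t * x, h, w)

# The spatial size is unchanged; the time dimension is gone.
print(first_feature_map.shape)  # (32, 16, 16)
```

Because the t axis is moved ahead of the channel axis before reshaping, the x channels of each frame stay contiguous, which matches a stacking order based on the order in the time dimension.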
- after compression, the temporal information is mixed together along the feature channels, and the relationships between objects within a frame or across different frames become more difficult to measure. Therefore, time-related recovery and enhancement need to be performed on the compressed features.
- processing the first feature map through the feature extraction network to obtain one or more second feature maps includes: determining the weight corresponding to the channel dimension of the input feature map based on the input feature map, and the input feature map is the intermediate output obtained by the feature extraction network processing the one or more first feature maps; performing a convolution operation on the input feature map to obtain convolution operation results of multiple channel dimensions, and fusing the convolution operation results of the multiple channel dimensions according to the weights to obtain a processing result.
- the weights obtained by the first weight determination module are used to restore the time-series information in the features, thereby enhancing the time-series information and improving the processing performance of the network.
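As a hedged sketch of this weighted channel fusion, one plausible design is a squeeze-and-excitation-style weight module: derive one weight per channel from the input itself, then fuse per-channel convolution results according to those weights. The pooling, the sigmoid, and the scalar stand-in for the convolution kernels are all illustrative assumptions, not details from the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Input feature map after time-to-channel compression: (channels, h, w).
c, h, w = 8, 16, 16
rng = np.random.default_rng(0)
inp = rng.standard_normal((c, h, w)).astype(np.float32)

# Weight determination: one weight per channel, derived from the input
# itself (here: global average pooling + sigmoid -- an assumed design).
weights = sigmoid(inp.mean(axis=(1, 2)))           # shape (c,)

# Per-channel "convolution": a learned scalar per channel stands in for
# a real convolution kernel, yielding one result per channel dimension.
kernels = rng.standard_normal(c).astype(np.float32)
conv_results = inp * kernels[:, None, None]        # (c, h, w)

# Fuse the per-channel convolution results according to the weights.
fused = (conv_results * weights[:, None, None]).sum(axis=0)  # (h, w)
print(fused.shape)  # (16, 16)
```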
- processing the first feature map through a feature extraction network to obtain one or more second feature maps includes: expanding an input feature map into multiple feature maps distributed along a time dimension, where the input feature map is an intermediate output obtained by the feature extraction network processing the one or more first feature maps; and performing feature interaction between the multiple feature maps obtained along the time dimension.
- interaction between features, referred to as feature interaction, refers to obtaining an interaction result by computing the relationship between different features through a certain mapping (such as convolution or an attention mechanism).
- the interaction includes: interaction based on an attention mechanism, or interaction achieved through large kernel convolution.
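The expansion and interaction described above can be sketched as follows; attention over frames is one of the two interaction options mentioned, and treating each frame as a single flattened token is an illustrative simplification:

```python
import numpy as np

x, t, h, w = 4, 3, 8, 8
rng = np.random.default_rng(1)

# Intermediate output with time folded into channels: (x*t, h, w).
inp = rng.standard_normal((x * t, h, w)).astype(np.float32)

# Expand back into t feature maps distributed along the time dimension.
frames = inp.reshape(t, x, h, w)

# Feature interaction across time: treat each frame as one flattened
# token and apply scaled dot-product attention between frames.
tokens = frames.reshape(t, -1)                     # (t, x*h*w)
scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)            # softmax over frames
interacted = (attn @ tokens).reshape(t, x, h, w)
print(interacted.shape)  # (3, 4, 8, 8)
```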
- timing information can be added through time coding.
- the processing of the first feature map through a feature extraction network to obtain one or more second feature maps includes: determining a time code that is consistent with a size of the multiple feature maps; fusing the time code with the multiple feature maps to obtain multiple feature maps in a fused time dimension; and performing feature interaction between the multiple feature maps in the fused time dimension.
- the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature.
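A minimal sketch of the time coding plus the residual fusion with the input feature; the sinusoidal code and the additive fusion are assumptions for illustration:

```python
import numpy as np

x, t, h, w = 4, 6, 8, 8
rng = np.random.default_rng(2)
frames = rng.standard_normal((t, x, h, w)).astype(np.float32)

# Time code consistent with the size of the t feature maps: one
# sinusoidal value per time step, broadcast over channels and space.
steps = np.arange(t, dtype=np.float32)
time_code = np.sin(steps)[:, None, None, None] * np.ones((t, x, h, w), np.float32)

# Fuse the time code with the feature maps (additive fusion assumed).
coded = frames + time_code

# After interaction between the coded frames (omitted here, identity as
# a placeholder), fuse the interaction result with the input feature.
interaction_result = coded
output = interaction_result + frames               # residual fusion
print(output.shape)  # (6, 4, 8, 8)
```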
- the present application provides a data processing device, comprising:
- a compression module configured to obtain features of a video, the features comprising a plurality of feature maps distributed in a time dimension, the dimensions of the feature maps comprising a channel dimension and a spatial dimension, and convert the plurality of feature maps from being distributed in the time dimension to being distributed in the channel dimension or the spatial dimension, to obtain the one or more first feature maps; the first feature maps do not include the time dimension;
- a processing module configured to process the first feature map through a feature extraction network to obtain one or more second feature maps, and to obtain a task processing result of the video through a task network based on the one or more second feature maps.
- the task processing result is a processing result of a video understanding task, a video-based generation task, or a video enhancement task.
- the first feature map and the feature have the same size in spatial dimension.
- the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of frames in the time dimension, h and w are the height and width in the spatial dimension, respectively, and * denotes the product.
- the compression module is specifically configured to:
- stack the feature maps at different time steps among the multiple feature maps in the channel dimension to obtain the one or more first feature maps.
- the feature extraction network includes a first weight determination module and a convolution module
- the first weight determination module is used to determine the weight corresponding to the channel dimension of the input feature map based on the input feature map;
- the convolution module is used to perform a convolution operation on the input feature map to obtain convolution operation results of multiple channel dimensions, and fuse the convolution operation results of the multiple channel dimensions according to the weights to obtain a processing result.
- the feature extraction network includes a transformation module and an interaction module
- the transformation module is used to expand the input feature map into multiple feature maps distributed in the time dimension;
- the interaction module is used to perform feature interaction between the multiple feature maps distributed in the time dimension.
- the interaction includes: interaction based on an attention mechanism, or interaction achieved through large-kernel convolution.
- the transformation module is further configured to: determine a time code consistent with the size of the multiple feature maps, and fuse the time code with the multiple feature maps to obtain multiple feature maps in the fused time dimension.
- the interaction module is specifically used to perform feature interaction between the multiple feature maps in the fused time dimension.
- the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature.
- an embodiment of the present application provides a data processing device, which may include a memory, a processor, and a bus system, wherein the memory is used to store programs, and the processor is used to execute the programs in the memory to perform the first aspect and any optional method thereof.
- an embodiment of the present application provides a computer-readable storage medium, in which a computer program is stored.
- when the computer program is run on a computer, the computer executes the above-mentioned first aspect and any optional method thereof.
- an embodiment of the present application provides a computer program, which, when executed on a computer, enables the computer to execute the above-mentioned first aspect and any optional method thereof.
- the present application provides a chip system comprising a processor configured to support a data processing device in implementing the functions described in the aforementioned aspects, for example, transmitting or processing the data or information involved in the aforementioned methods.
- the chip system further comprises a memory configured to store program instructions and data necessary for the execution device or the training device.
- the chip system may consist of a single chip or may include a chip and other discrete components.
- FIG1A is a schematic diagram of a structure of an artificial intelligence main framework
- FIGS. 1B and 1C are schematic diagrams of the application system framework of the present application.
- FIG1D is a schematic diagram of an optional hardware structure of a terminal
- FIG2 is a schematic diagram of the structure of a server
- FIG3 is a schematic diagram of a system architecture of the present application.
- FIG4 is a schematic diagram of a cloud service process.
- FIG5 is a flowchart of a data processing method provided in an embodiment of the present application.
- FIG6 is a schematic diagram of a data processing method provided in an embodiment of the present application.
- FIG7 is a schematic diagram of an effect provided by an embodiment of the present application.
- FIG8 is a schematic structural diagram of a data processing device provided in an embodiment of the present application.
- FIG9 is a schematic diagram of a structure of an execution device provided in an embodiment of the present application.
- FIG10 is a schematic diagram of a structure of a training device provided in an embodiment of the present application.
- FIG11 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
- the terms “substantially,” “about,” and similar terms are used as terms of approximation, not as terms of degree, and are intended to take into account the inherent variations in measurements or calculations that one of ordinary skill in the art would recognize.
- the use of “may” when describing embodiments of the present application refers to “one or more possible embodiments.”
- the terms “use,” “using,” and “used” may be considered synonymous with the terms “utilize,” “utilizing,” and “utilized,” respectively.
- the term “exemplary” is intended to refer to an example or illustration.
- FIG. 1A shows a schematic diagram of the main AI framework.
- This AI framework will be explained from two perspectives: the “intelligent information chain” (horizontal axis) and the “IT value chain” (vertical axis).
- the “intelligent information chain” reflects the entire process from data acquisition to processing; it covers the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output.
- Infrastructure provides computing power for AI systems, enabling communication with the outside world and supporting this through a foundational platform. External communication occurs through sensors; computing power is provided by intelligent chips (CPUs, NPUs, GPUs, ASICs, FPGAs, and other hardware accelerators).
- the foundational platform includes a distributed computing framework and network-related platform guarantees and support, including cloud storage and computing, and interconnected networks. For example, sensors communicate with the outside world to acquire data, which is then fed into the intelligent chips within the distributed computing system provided by the foundational platform for computation.
- Data above the infrastructure layer represents data sources for AI.
- This data includes graphics, images, voice, and text, as well as IoT data from traditional devices.
- Data processing generally includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
- machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.
- Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
- Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
- some general capabilities can be further formed based on the results of the data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
- Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical application. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, smart cities, etc.
- video processing applications: applications with video processing functions
- cloud services: services provided by cloud-side servers
- Video processing can be, but is not limited to, video understanding, video enhancement, and video-based generation tasks.
- video understanding can include but is not limited to: action recognition, temporal action localization, video summarization, video detection, video segmentation, multimodal video understanding, pedestrian re-identification and other tasks.
- the product form of the embodiment of the present application can be a video processing application.
- the video processing application can be run on a terminal device or a cloud-side server.
- the video processing task in the embodiment of the present application can be: obtaining a task processing result of the video based on the video input by the user.
- a video processing application may implement a video processing task based on a video input by a user and obtain a task processing result of the video.
- the user can open a video processing application installed on the terminal device and input a video.
- the video processing application can process the video input by the user through a model trained by the method provided in the embodiment of the present application, or through the method provided in the embodiment of the present application, and present the task processing results of the video to the user (the presentation method can be but is not limited to display, playback, saving, uploading to the cloud side, etc.).
- a user can open a video processing application installed on a terminal device and input a video.
- the video processing application can send the video to a cloud-side server.
- the cloud-side server processes the video using a model trained using the method provided in an embodiment of the present application, and transmits the task processing results of the video back to the terminal device.
- the terminal device can present the task processing results of the video to the user (the presentation method can be, but is not limited to, display, playback, saving, uploading to the cloud side, etc.).
- FIG. 1B is a schematic diagram of the functional architecture of a video processing application in an embodiment of the present application:
- a video processing application 102 may receive input parameters 101 (e.g., including a video) and generate a video task processing result 103.
- the video processing application 102 may be executed on, for example, at least one computer system and include computer code that, when executed by one or more computers, causes the computers to execute a model trained using the method provided in the embodiments of the present application.
- FIG. 1C is a schematic diagram of the physical architecture for running a video processing application in an embodiment of the present application:
- FIG1C shows a schematic diagram of a system architecture.
- the system may include a terminal 100 and a server 200.
- the server 200 may include one or more servers (FIG1C illustrates one server as an example), and the server 200 may provide video processing or natural language generation functions for one or more terminals.
- the terminal 100 can be installed with a video processing application, or a web page related to the video processing or natural language generation function can be opened.
- the above application and web page can provide an interface.
- the terminal 100 can receive the relevant parameters entered by the user on the video processing or natural language generation function interface, and send the above parameters to the server 200.
- the server 200 can obtain the task processing results of the video based on the received parameters, and return the task processing results of the video to the terminal 100.
- the terminal 100 can also complete the action of obtaining the task processing result of the video based on the received parameters by itself without the need for cooperation from the server, and the embodiments of the present application are not limited thereto.
- the terminal 100 in the embodiment of the present application can be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc., and the embodiment of the present application does not impose any restrictions on this.
- FIG1D shows a schematic diagram of an optional hardware structure of the terminal 100 .
- the terminal 100 may include components such as a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, and a power supply 190.
- FIG1D is merely an example of a terminal or multi-function device and does not limit the terminal or multi-function device.
- the terminal or multi-function device may include more or fewer components than shown, or may combine certain components or have different components.
- the input unit 130 can be used to receive input digital or character information and generate key signal input related to user settings and function control of the portable multifunction device.
- the input unit 130 may include a touch screen 131 (optional) and/or other input devices 132.
- the touch screen 131 can detect user touch operations on or near it (for example, operations performed on or near the touch screen using a finger, joint, stylus, or any other suitable object) and drive corresponding connected devices according to pre-set programs.
- the touch screen can detect user touch actions on the touch screen, convert the touch actions into touch signals and transmit them to the processor 170. It can also receive and execute commands sent by the processor 170; the touch signals include at least touch point coordinate information.
- the touch screen 131 provides an input interface and an output interface between the terminal 100 and the user.
- Touch screens can be implemented using various types, including resistive, capacitive, infrared, and surface acoustic wave.
- the input unit 130 may also include other input devices.
- the other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse, a joystick, and the like.
- other input devices 132 can receive input video.
- the display unit 140 may be used to display information input by the user or provided to the user, various menus of the terminal 100, interactive interfaces, file display, and/or playback of any multimedia file. In an embodiment of the present application, the display unit 140 may be used to display the interface of a generated application, processing results, etc.
- Memory 120 can be used to store instructions and data. It primarily includes an instruction storage area and a data storage area.
- the data storage area can store various data, such as multimedia files and text.
- the instruction storage area can store software units such as the operating system, applications, and instructions required for at least one function, or subsets or extensions thereof. It may also include non-volatile random access memory (NVRAM). It provides processor 170 with management functions for the hardware, software, and data resources within the computing and processing device, supporting control software and applications. It is also used to store multimedia files and running programs and applications.
- the processor 170 is the control center of the terminal 100. It connects all components of the terminal 100 using various interfaces and circuits. By executing instructions stored in the memory 120 and accessing data stored therein, it executes various functions of the terminal 100 and processes data, thereby providing overall control of the terminal device.
- the processor 170 may include one or more processing units.
- the processor 170 may integrate an application processor and a modem processor, with the application processor primarily processing the operating system, user interface, and application programs, while the modem processor primarily handles wireless communications. It is understood that the modem processor may not be integrated into the processor 170.
- the processor and memory may be implemented on a single chip; in other embodiments, they may be implemented on separate chips.
- the processor 170 may also generate corresponding operational control signals and send them to the corresponding components of the computing and processing device. It may also read and process data in the software, particularly the data and programs in the memory 120, to enable the various functional modules therein to perform their corresponding functions, thereby controlling the corresponding components to operate as instructed.
- the memory 120 can be used to store software codes related to the data processing method
- the processor 170 can execute the steps of the chip's data processing method, and can also schedule other units (such as the above-mentioned input unit 130 and display unit 140) to achieve corresponding functions.
- the RF unit 110 (optional) can be used to send and receive information or receive and send signals during a call. For example, it receives downlink information from the base station and sends it to the processor 170 for processing; in addition, it sends uplink data to the base station.
- the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, etc.
- the RF unit 110 can also communicate with network devices and other devices via wireless communication.
- This wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.
- the RF unit 110 can send the video to the server 200 and receive the processing result sent by the server 200.
- radio frequency unit 110 is optional and can be replaced by other communication interfaces, such as a network port.
- the terminal 100 also includes a power supply 190 (such as a battery) for supplying power to various components.
- the power supply can be logically connected to the processor 170 through a power management system, thereby managing functions such as charging, discharging, and power consumption through the power management system.
- the terminal 100 also includes an external interface 180, which can be a standard Micro USB interface or a multi-pin connector. It can be used to connect the terminal 100 to communicate with other devices, and can also be used to connect a charger to charge the terminal 100.
- terminal 100 may also include a flashlight, a wireless fidelity (WiFi) module, a Bluetooth module, sensors with different functions, etc., which are not described in detail here. Some or all of the methods described below may be applied to terminal 100 as shown in FIG. 1D .
- FIG2 provides a schematic diagram of the structure of a server 200.
- the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204.
- the processor 202, the memory 204, and the communication interface 203 communicate with each other via the bus 201.
- Bus 201 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. Buses can be classified as address buses, data buses, control buses, and the like. For ease of illustration, FIG2 shows only one thick line, but this does not imply that there is only one bus or only one type of bus.
- the processor 202 can be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
- the memory 204 may include volatile memory, such as random access memory (RAM).
- the memory 204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
- the memory 204 may be used to store software codes related to the data processing method, and the processor 202 may execute the steps of the data processing method of the chip, and may also schedule other units to implement corresponding functions.
- the above-mentioned terminal 100 and server 200 can be centralized or distributed devices, and the processors in the above-mentioned terminal 100 and server 200 (such as processor 170 and processor 202) can be hardware circuits (such as application specific integrated circuit (ASIC), field-programmable gate array (FPGA), general-purpose processor, digital signal processor (DSP), microprocessor or microcontroller, etc.), or a combination of these hardware circuits.
- the processor can be a hardware system with the function of executing instructions, such as CPU, DSP, etc., or a hardware system without the function of executing instructions, such as ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems without the function of executing instructions and hardware systems with the function of executing instructions.
- the steps related to the model reasoning process in the embodiments of this application involve AI-related operations.
- the instruction execution architecture of the terminal device and server is not limited to the processor-memory architecture described above.
- the system architecture provided in the embodiments of this application is described in detail below with reference to Figure 3.
- FIG3 is a schematic diagram of the system architecture provided by an embodiment of the present application.
- the system architecture 500 includes an execution device 510 , a training device 520 , a database 530 , a client device 540 , a data storage system 550 , and a data acquisition system 560 .
- the execution device 510 includes a calculation module 511, an I/O interface 512, a pre-processing module 513, and a post-processing module 514.
- the calculation module 511 may include the target model/rule 501, and the pre-processing module 513 and the post-processing module 514 are optional.
- the execution device 510 may be a terminal device or a server that runs the aforementioned generated application program.
- the data acquisition device 560 is used to collect training samples.
- the training samples can be image data, etc. After collecting the training samples, the data acquisition device 560 stores these training samples in the database 530.
- the training device 520 can train the neural network to be trained (such as the feature extraction network, task network and other neural networks in the embodiments of the present application) based on the training samples maintained in the database 530 to obtain the target model/rule 501.
- the training device 520 can perform a pre-training process on the neural network to be trained based on the training samples maintained in the database 530, or fine-tune the model based on the pre-training.
- the training samples maintained in the database 530 may not all be collected by the data acquisition device 560, but may also be received from other devices. It should also be noted that the training device 520 may not train the target model/rule 501 entirely based on the training samples maintained in the database 530, but may also obtain training samples from the cloud or other places for model training. The above description should not be used as a limitation on the embodiments of the present application.
- the target model/rule 501 obtained through training with the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG3 .
- the execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, a vehicle-mounted terminal, etc., or a server, etc.
- the training device 520 may transfer the trained model to the execution device 510 .
- the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices.
- the user can input data (such as video in the embodiment of the present application) to the I/O interface 512 through the client device 540.
- the pre-processing module 513 and the post-processing module 514 are used to preprocess the input data received by the I/O interface 512. It should be understood that the pre-processing module 513 and the post-processing module 514 may be absent, or only one of them may be present. If both are absent, the computing module 511 may be used directly to process the input data.
- When the execution device 510 preprocesses the input data, or when the computing module 511 of the execution device 510 performs calculations and other related processing, the execution device 510 can call the data, code, etc. in the data storage system 550 for the corresponding processing, and can also store the data, instructions, etc. obtained from the corresponding processing in the data storage system 550.
- the I/O interface 512 provides the processed results to the client device 540 and thus to the user.
- client device 540 can automatically send input data to I/O interface 512. If user authorization is required for client device 540 to automatically send input data, the user can set the corresponding permissions in client device 540. The user can view the results output by execution device 510 on client device 540, and the specific presentation form can be a display, sound, action, or other specific method.
- Client device 540 can also serve as a data acquisition terminal, collecting input data input into I/O interface 512 and output results output from I/O interface 512 as new sample data, and storing them in database 530. Of course, collection can also be performed without client device 540, and instead the I/O interface 512 directly stores the input data input into I/O interface 512 and output results output from I/O interface 512 as new sample data in database 530.
- FIG3 is merely a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation.
- the data storage system 550 is an external memory relative to the execution device 510. In other cases, the data storage system 550 can also be placed in the execution device 510. It should be understood that the execution device 510 can be deployed in the client device 540.
- the computing module 511 of the above-mentioned execution device 510 can obtain the code stored in the data storage system 550 to implement the steps related to the model reasoning process in the embodiment of the present application.
- the computing module 511 of the execution device 510 may include a hardware circuit (such as an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits.
- the training device 520 may be a hardware system with an instruction execution function, such as a CPU, DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems without an instruction execution function and hardware systems with an instruction execution function.
- the computing module 511 of the execution device 510 can be a hardware system with an execution instruction function, and the steps related to the model reasoning process provided in the embodiment of the present application can be software codes stored in the memory.
- the computing module 511 of the execution device 510 can obtain the software code from the memory and execute the obtained software code to implement the steps related to the model reasoning process provided in the embodiment of the present application.
- the computing module 511 of the execution device 510 can be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions. Some of the steps related to the model reasoning process provided in the embodiment of the present application can also be implemented by the hardware system that does not have the function of executing instructions in the computing module 511 of the execution device 510, which is not limited here.
- the above-mentioned training device 520 can obtain the code stored in the memory (not shown in Figure 3, which can be integrated into the training device 520 or deployed separately from the training device 520) to implement the steps related to model training in the embodiment of the present application.
- the training device 520 may include a hardware circuit (such as an application specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits.
- the training device 520 may be a hardware system with an instruction execution function, such as a CPU, DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems without an instruction execution function and hardware systems with an instruction execution function.
- the training device 520 can be a combination of a hardware system that does not have the function of executing instructions and a hardware system that has the function of executing instructions. Some of the steps related to model training provided in the embodiments of the present application can also be implemented by the hardware system in the training device 520 that does not have the function of executing instructions, which is not limited here.
- Video processing cloud services provided by the server:
- the server can provide video processing service to the end side through an application programming interface (API).
- the terminal device can send relevant parameters (such as video) to the server through the API provided by the cloud.
- the server can obtain processing results based on the received parameters, etc., and return the processing results to the terminal.
- FIG4 shows a process of using a video processing function cloud service provided by a cloud platform.
- the cloud platform provides multiple development versions of the software development kit (SDK) for users to choose from according to the requirements of the development environment, such as a JAVA version SDK, a Python version SDK, a PHP version SDK, an Android version SDK, etc.
- the local development environment can also be used to develop other functions, forming an application that integrates video processing functional capabilities.
- When a video processing application is used and needs to perform video processing, it can trigger an API call for that function.
- When the application triggers the video processing function, it initiates an API request to the running instance of the video processing service in the cloud environment.
- the API request includes image data, and the running instance in the cloud environment processes the image to obtain the processing result.
- the cloud environment returns the processing results to the application, thus completing a video processing function call.
- a neural network can be composed of neural units.
- a neural unit can refer to an operation unit that takes x_s (i.e., input data) and an intercept of 1 as inputs.
- the output of the operation unit can be: h_{W,b}(x) = f(∑_s W_s·x_s + b), where W_s is the weight of input x_s and b is the bias corresponding to the intercept.
- f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into the output signal.
- the output signal of this activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
- a neural network is a network formed by connecting multiple single neural units mentioned above, that is, the output of one neural unit can be the input of another neural unit.
- the input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of the local receptive field.
- the local receptive field can be an area composed of several neural units.
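- The neural unit described above can be sketched in a few lines of Python (a minimal illustration; the function names are ours, not from the source):

```python
import math

def sigmoid(z):
    # sigmoid activation: maps the weighted sum to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def neural_unit(xs, weights, bias):
    # weighted sum of the inputs plus the intercept (bias), then the activation
    z = sum(w * x for w, x in zip(weights, xs)) + bias
    return sigmoid(z)

out = neural_unit([1.0, 2.0], [0.5, -0.25], 0.1)  # z = 0.5 - 0.5 + 0.1 = 0.1
```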
- Convolutional neural network is a deep neural network with a convolutional structure.
- a convolutional neural network contains a feature extractor consisting of a convolution layer and a subsampling layer, which can be regarded as a filter.
- a convolution layer refers to a neuron layer in a convolutional neural network that performs convolution processing on the input signal. In the convolution layer of a convolutional neural network, a neuron can only be connected to some neurons in the adjacent layers.
- a convolution layer usually contains several feature planes, and each feature plane can be composed of a number of rectangularly arranged neural units. The neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
- Shared weights can be understood as the way of extracting features is independent of position.
- the convolution kernel can be formalized as a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
- the direct benefit of shared weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
- CNN is a very common neural network. The following will focus on a detailed introduction to its structure.
- a convolutional neural network is a deep neural network with a convolutional structure. It is a deep learning architecture, which uses machine learning algorithms to perform multiple levels of learning at different levels of abstraction.
- CNN is a feed-forward artificial neural network, in which each neuron responds to an image input.
- a deep neural network (DNN), also known as a multi-layer neural network, contains layers that can be divided into three categories: the input layer, hidden layers, and the output layer.
- the first layer is the input layer
- the last layer is the output layer
- the layers in between are all hidden layers.
- the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
- the coefficient from the k-th neuron in the (L−1)-th layer to the j-th neuron in the L-th layer is defined as W_jk^L. It is important to note that the input layer has no W parameter.
- more hidden layers allow the network to better capture complex real-world situations. Theoretically, a model with more parameters has higher complexity and greater "capacity," meaning it can handle more complex learning tasks.
- Training a deep neural network is essentially the process of learning the weight matrix, with the ultimate goal of obtaining the weight matrices for all layers of a trained deep neural network (a weight matrix formed by the vectors W across many layers).
- When training a deep neural network, we want the network's output to be as close as possible to the desired predicted value. This is done by comparing the network's prediction with the desired target value and then updating the weight vectors of each layer based on the difference between the two. (Of course, before the first update, there is usually an initialization process that pre-configures the parameters for each layer of the deep neural network.) For example, if the network's prediction is too high, the weight vectors are adjusted to predict a lower value. This adjustment is repeated until the deep neural network can predict the desired target value or a value very close to it. Therefore, it is necessary to predefine how to compare the difference between the predicted and target values. This is the role of the loss function, or objective function: a crucial equation used to measure the difference between the predicted and target values. For example, a higher loss value indicates a greater difference, so training a deep neural network becomes a process of minimizing this loss.
- the back propagation (BP) algorithm can be used to correct the size of the initial model parameters during training, reducing the model's error loss. Specifically, forward propagation of the input signal to the output generates error loss. This error loss information is then backpropagated to update the parameters in the initial model, thereby converging the error loss.
- the BP algorithm is a backpropagation algorithm driven by error loss, aiming to obtain optimal model parameters, such as the weight matrix.
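- The loss-driven update described above can be illustrated with a one-parameter toy example (a sketch of gradient descent, not the patent's training procedure; all names are ours):

```python
# A single scalar weight w is repeatedly adjusted by the gradient of a
# squared-error loss so that the prediction w * x approaches the target.
def train_step(w, x, target, lr=0.1):
    pred = w * x
    loss = (pred - target) ** 2
    grad = 2 * (pred - target) * x   # dL/dw via the chain rule (backpropagation)
    return w - lr * grad, loss

w = 0.0
for _ in range(100):
    w, loss = train_step(w, x=1.0, target=3.0)
# w converges toward the target value 3.0 as the loss is minimized
```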
- the idea of this application is to reduce the dimension of video features. Specifically, the features in the time dimension are compressed to the channel dimension or the spatial dimension, so that the compressed features (that is, the first feature map in the embodiment of this application) do not include the time dimension, and the amount of data in the channel dimension and the spatial dimension increases.
- Because the feature dimension is reduced, the amount of subsequent computation required can be greatly reduced; for example, 3D convolution can be replaced with 2D convolution.
- the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of times, h and w are the height and width in the spatial dimension, respectively, and * is the product.
- the first feature map and the feature have the same size in the spatial dimension, while the size of the first feature map in the channel dimension is larger than the size of the video feature in the channel dimension.
- the feature maps in different time dimensions among the multiple feature maps may be stacked in the channel dimension to obtain the one or more first feature maps, and the stacking order may be based on the order in the time dimension.
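- The channel-dimension stacking described above can be sketched with NumPy (a sketch under hypothetical sizes x=3, t=4, h=w=8; the source does not prescribe an implementation):

```python
import numpy as np

# Hypothetical sizes: x = 3 channels, t = 4 frames, h = w = 8.
x, t, h, w = 3, 4, 8, 8
feat = np.random.rand(x, t, h, w)

# Stack the t per-frame maps along the channel axis, ordered by time:
# move the time axis in front of the channel axis, then merge the two,
# giving a first feature map of shape (x*t, h, w) with no time dimension.
first_feature_map = feat.transpose(1, 0, 2, 3).reshape(t * x, h, w)
```

Each consecutive block of x channels in the result holds one frame's feature maps, so the stacking order follows the order in the time dimension.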
- the feature maps in different time dimensions among the multiple feature maps can be stacked in the spatial dimension to obtain the one or more first feature maps; the first feature map and the feature have the same size in the channel dimension, while the size of the first feature map in the spatial dimension is larger than the size of the video feature in the spatial dimension.
- the time dimension in the video features can be compressed to the channel dimension and the spatial dimension at the same time.
- in this case, the size of the first feature map in the channel dimension is larger than the size of the video feature in the channel dimension, and the size of the first feature map in the spatial dimension is larger than the size of the video feature in the spatial dimension.
- FIG6 is an overall block diagram of a lightweight video understanding model based on time axis compression and time feature enhancement proposed in an embodiment of the present application.
- the overall network process is as follows: (1) For an input video sequence X, whose vector shape is (3, t, h, w), its time axis is compressed to the channel axis through a vector deformation operation (which can reduce the large amount of computing resources consumed in subsequent feature processing), that is, an input with a shape of (3t, h, w) is obtained.
- the first feature map can be processed by a feature extraction network to obtain one or more second feature maps
- the task processing result of the video can be obtained through a task network based on the one or more second feature maps.
- the temporal information on the feature channels will be mixed together, and the object relationships within a frame or between different frames will be more difficult to measure. Therefore, it is necessary to perform time-related recovery and enhancement on the compressed features.
- the feature extraction network may include, but is not limited to, a convolutional or transformer network.
- the feature extraction network may include multiple feature extraction units, each of which is connected via a downsampling operation, and each of which is used to process features at different scales.
- the feature extraction network may include a first weight determination module and a convolution module.
- the first weight determination module and the convolution module may belong to the at least one feature extraction unit.
- the first weight determination module is used to determine the weight corresponding to the channel dimension of the input feature map based on the input feature map.
- the weight can represent the temporal importance of the channel, and the weights corresponding to different channels can be different.
- the convolution module is used to perform a convolution operation on the input feature map to obtain convolution operation results of multiple channel dimensions, and fuse the convolution operation results of the multiple channel dimensions according to the weight to obtain a processing result. In this way, the weights obtained by the first weight determination module are used to restore the temporal information in the feature, thereby enhancing the temporal information and improving the processing performance of the network.
- the above-mentioned first weight determination module and convolution module can be the time importance branch in the time channel learning unit.
- the time channel learning unit consists of two branches, one branch is the time importance branch, and the other branch is the cross-time object interaction module.
- the outputs of the two branches are fused together through a summation operation to obtain the final output.
- the temporal importance learning branch is used to capture the temporal importance of the channel. It consists of a 1 ⁇ 1 temporal attention convolution.
- the formula of the temporal attention convolution is as follows: f(x, y) = ∑_{m=1}^{c} w_m ∑_{i=1}^{k} ∑_{j=1}^{k} g_m(i, j) · h_m(x − i, y − j), where:
- f(x, y) represents the output value of 2D convolution at the point (x, y)
- k is the size of the convolution kernel
- c is the number of channels
- g represents the convolution kernel
- h represents the feature map
- the convolution operation can be implemented based on the convolution module introduced in the above embodiment
- wm is the input adaptive weight calculated according to the input feature (that is, the weight corresponding to the channel dimension of the input feature map introduced above)
- wm can be obtained in various ways, and can be obtained through a multi-layer perceptron, a global attention module, etc. (which can be implemented based on the first weight determination module introduced in the above embodiment).
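- One way the weighted per-channel fusion could look in code (a minimal NumPy sketch, not the patent's implementation; the naive valid-mode convolution loop and the function name are illustrative assumptions, and the convolution is written as cross-correlation, as is common in deep learning frameworks):

```python
import numpy as np

def temporal_attention_conv(h_maps, kernels, wm):
    """Weighted fusion of per-channel 2D convolutions.

    h_maps:  (c, H, W) input feature map (time already folded into channels)
    kernels: (c, k, k) one 2D kernel per channel
    wm:      (c,)      input-adaptive temporal-importance weight per channel
    """
    c, H, W = h_maps.shape
    k = kernels.shape[-1]
    out = np.zeros((H - k + 1, W - k + 1))
    for m in range(c):
        # valid-mode 2D convolution of channel m, scaled by its weight w_m
        for i in range(H - k + 1):
            for j in range(W - k + 1):
                out[i, j] += wm[m] * np.sum(kernels[m] * h_maps[m, i:i+k, j:j+k])
    return out
```

Setting a channel's weight to zero removes that channel's contribution entirely, which is how the adaptive weights express per-channel temporal importance.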
- the feature extraction network may include a transformation module and an interaction module.
- the transformation module and the interaction module may belong to at least one of the aforementioned feature extraction units.
- the transformation module and the interaction module may belong to the same feature extraction unit as the first weight determination module and the convolution module described above.
- the result obtained by the interaction module (or the result obtained by performing other processing based on the result obtained by the interaction module) can be further integrated with the processing result obtained by the first weight determination module.
- the transformation module is used to expand the input feature map into multiple feature maps distributed in the time dimension.
- the input feature map can be expanded into multiple feature maps distributed in the time dimension through a channel conversion function.
- the number of expanded features in the time dimension can be consistent with the number of video features in the time dimension.
- the interaction module is used to perform feature interaction between the multiple feature maps distributed in the time dimension.
- the interaction includes: interaction based on an attention mechanism, or interaction achieved through large kernel convolution.
- the transformation module is further configured to determine a time code that is consistent in size with the multiple feature maps obtained by the transformation module.
- the time code may be a four-dimensional tensor.
- Each eigenvalue of the multiple feature maps obtained by the transformation module may correspond to a code value in the time code.
- the time code and the multiple feature maps obtained by the transformation module may be fused to obtain multiple feature maps in a fused time dimension.
- the fusion may be an addition operation at corresponding positions.
- the interaction module is specifically configured to perform feature interaction between the multiple feature maps in the fused time dimension.
- the interaction module is also used to fuse the interaction results obtained through the interaction and the input features.
- the interaction results obtained through the interaction and the input features can be first mapped to information of the same size, and then the corresponding positions are added to perform fusion.
- the transformation module and interaction module described above can be referred to as the cross-temporal object interaction module.
- the cross-temporal object interaction module can be used to restore temporal information and capture object relationships between different frames.
- Figure 6 (c) shows a schematic diagram of the cross-temporal object interaction module, and its functionality is described by the following formula:
- Fb represents the input feature
- FT is the time code
- σ represents the sigmoid activation function, and ⊙ represents element-wise multiplication.
- the channel conversion function can be implemented by 2D convolution; the time code FT is a learnable or non-learnable position code; the cross-time function can be implemented by various modules that capture global positional relationships, for example, by large-kernel 2D convolution or by a windowed global attention mechanism.
- the cross-temporal object interaction module is used to restore temporal information and capture object relationships within or across frames. This module converts channels into frame numbers, adds temporal information through temporal position encoding, captures object relationships across frames using a cross-temporal function, and finally converts the channel number into the output channel number.
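- A purely shape-level sketch of these steps is shown below (NumPy; the softmax attention over the time axis is a stand-in assumption for the cross-time function, which the source says may instead be a large-kernel convolution or windowed attention; all names are hypothetical):

```python
import numpy as np

def softmax(a, axis):
    # numerically stable softmax along the given axis
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_temporal_interaction(fb, t, time_code):
    """fb: (c*t, h, w) feature with time folded into channels;
    time_code: (t, c, h, w) temporal position encoding."""
    ct, h, w = fb.shape
    c = ct // t
    frames = fb.reshape(t, c, h, w)        # convert channels into frame numbers
    frames = frames + time_code            # add temporal position information
    # stand-in cross-time function: attention over the time axis lets each
    # spatial position re-weight its t temporal copies
    attn = softmax(frames, axis=0)
    out = (attn * frames).sum(axis=0, keepdims=True)
    out = np.broadcast_to(out, (t, c, h, w))
    return out.reshape(ct, h, w)           # convert back to the output channels
```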
- the input can be processed by a 5 ⁇ 5 convolution with a stride of 2, and the input features are then obtained through a max pooling module; the input features are fed into a subsequent network of four stages, each stage containing a different number of temporal channel learning (CTL) blocks and processing features at a different scale, with a downsampling operation used to obtain the downsampled features for each subsequent stage; the output features of the last stage are passed through an average pooling layer and a fully connected layer to obtain the final output of the network.
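- The resulting spatial resolutions can be traced with the standard output-size formula (a hedged walkthrough: the 224 ⁇ 224 input resolution and padding values are our assumptions, not specified by the source):

```python
def conv_out(size, k, stride, pad):
    # standard output-size formula for convolution / pooling layers
    return (size + 2 * pad - k) // stride + 1

size = 224
size = conv_out(size, k=5, stride=2, pad=2)   # 5x5 stem conv, stride 2 -> 112
size = conv_out(size, k=3, stride=2, pad=1)   # max pooling -> 56
stage_sizes = [size]
for _ in range(3):                            # downsampling between the 4 stages
    size = conv_out(size, k=3, stride=2, pad=1)
    stage_sizes.append(size)
# stage_sizes is [56, 28, 14, 7]; global average pooling then reduces to 1x1
```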
- the core module is the CTL block, which performs the main function of temporal feature enhancement.
- the temporal channel learning unit block can be a residual block, which includes a 1 ⁇ 1 convolution that reduces the number of input channels to 1/4 of the original, a core temporal channel learning (CTL) unit, and another 1 ⁇ 1 convolution that expands the number of channels by 4 times, back to the original number of input channels.
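- The channel bookkeeping of such a residual block can be sketched as follows (NumPy; a 1 ⁇ 1 convolution is treated as a per-pixel linear map over channels, and the core CTL unit is passed in as a placeholder callable, since its internals are described separately above):

```python
import numpy as np

def one_by_one_conv(feat, weight):
    # a 1x1 convolution is a per-pixel linear map over channels:
    # (c_in, h, w) combined with (c_out, c_in) -> (c_out, h, w)
    return np.einsum('oc,chw->ohw', weight, feat)

def ctl_block(feat, w_reduce, w_expand, ctl_unit):
    # residual block: 1x1 reduce to C/4, core CTL unit, 1x1 expand back to C
    out = one_by_one_conv(feat, w_reduce)   # (C, h, w)   -> (C/4, h, w)
    out = ctl_unit(out)                     # temporal feature enhancement
    out = one_by_one_conv(out, w_expand)    # (C/4, h, w) -> (C, h, w)
    return feat + out                       # residual connection
```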
- this application simultaneously proposes a temporal feature enhancement strategy of temporal importance learning and cross-temporal object interaction to enhance compressed feature learning and improve the performance of the overall network.
- temporal channel learning unit and cross-temporal object interaction module proposed in the present invention can also be embedded in ordinary 2D convolutional networks. They are not limited to video understanding tasks. For example, they can be applied to ordinary 2D backbone networks to improve network performance.
- the device 800 includes:
- An acquisition module 801 is configured to acquire features of a video, wherein the features include multiple feature maps distributed along a time dimension, wherein the dimensions of the feature maps include a channel dimension and a spatial dimension, and to convert the multiple feature maps from being distributed along the time dimension to being distributed along the channel dimension or the spatial dimension, thereby obtaining one or more first feature maps; wherein the first feature maps do not include the time dimension.
- the processing module 802 is used to process the first feature map through a feature extraction network to obtain one or more second feature maps, and obtain a task processing result of the video through a task network based on the one or more second feature maps.
- For a detailed description of the processing module 802, reference may be made to the description of step 503 in the above embodiment, which will not be repeated here.
- the task processing result is a processing result of a video understanding task, a video-based generation task, or a video enhancement task.
- the first feature map and the feature have the same size in spatial dimension.
- the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of times, h and w are the height and width in the spatial dimension, respectively, and * is the product.
- the compression module is specifically configured to:
- the feature maps at different time dimensions in the multiple feature maps are stacked in the channel dimension to obtain the one or more first feature maps.
- the feature extraction network includes a first weight determination module and a convolution module
- the first weight determination module is used to determine the weight corresponding to the channel dimension of the input feature map based on the input feature map;
- the convolution module is used to perform a convolution operation on the input feature map to obtain convolution operation results of multiple channel dimensions, and fuse the convolution operation results of the multiple channel dimensions according to the weights to obtain a processing result.
- the feature extraction network includes a transformation module and an interaction module
- the transformation module is used to expand the input feature map into multiple feature maps distributed in the time dimension;
- the interaction module is used to perform feature interaction between the multiple feature maps distributed in the time dimension.
- the interaction includes:
- the transformation module is further configured to:
- the interaction module is specifically used to perform feature interaction between the multiple feature maps in the fused time dimension.
- the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature.
- Figure 9 is a structural diagram of a terminal device provided in an embodiment of the present application.
- the terminal device 900 can be specifically manifested as a virtual reality VR device, a mobile phone, a tablet, a laptop computer, a smart wearable device, etc., which is not limited here.
- the terminal device 900 includes: a receiver 901, a transmitter 902, a processor 903 and a memory 904 (wherein the number of processors 903 in the terminal device 900 can be one or more, and Figure 9 takes one processor as an example), wherein the processor 903 may include an application processor 9031 and a communication processor 9032.
- the receiver 901, the transmitter 902, the processor 903 and the memory 904 may be connected via a bus or other means.
- the memory 904 may include read-only memory and random access memory, and provides instructions and data to the processor 903. A portion of the memory 904 may also include non-volatile random access memory (NVRAM).
- the memory 904 stores processor-executable operation instructions, executable modules, or data structures, or subsets or extensions thereof.
- the operation instructions may include various operation instructions for implementing various operations.
- Processor 903 controls the operation of the execution device.
- the various components of the execution device are coupled together via a bus system.
- the bus system may also include a power bus, a control bus, and a status signal bus.
- for clarity, the various buses are collectively referred to as the bus system in the figure.
- the methods disclosed in the above embodiments of the present application can be applied to the processor 903 or implemented by the processor 903.
- the processor 903 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the hardware integrated logic circuit in the processor 903 or by instructions in the form of software.
- the above processor 903 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
- the processor 903 can implement or execute the various methods, steps and logic block diagrams disclosed in the embodiments of the present application.
- the general-purpose processor may be a microprocessor, or may be any conventional processor, etc.
- the steps of the method disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware decoding processor, or can be executed by a combination of hardware and software modules in the decoding processor.
- the software module can be located in a storage medium well-known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 904.
- Processor 903 reads information from memory 904 and, in conjunction with its hardware, completes the steps involved in the model training or model inference process in the above method.
- Receiver 901 can be used to receive input digital or character information and generate signal input related to executing device-related settings and function control.
- Transmitter 902 can be used to output digital or character information through the first interface.
- Transmitter 902 can also be used to send instructions to the disk pack through the first interface to modify data in the disk pack.
- Transmitter 902 can also include a display device such as a display screen.
- FIG. 10 is a structural diagram of a server provided by an embodiment of the present application.
- the server 1000 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 1010 (for example, one or more processors), a memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044.
- the memory 1032 and the storage medium 1030 may be transient storage or persistent storage.
- the program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the server.
- the central processing unit 1010 can be configured to communicate with the storage medium 1030 to execute a series of instruction operations in the storage medium 1030 on the server 1000.
- the server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input and output interfaces 1058, and one or more operating systems 1041, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, etc.
- the central processing unit 1010 is used to execute actions related to model training or model inference in the above embodiments.
- An embodiment of the present application also provides a computer program product, which, when running on a computer, enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
- a computer-readable storage medium is also provided in an embodiment of the present application, which stores a program for signal processing.
- when the program stored in the computer-readable storage medium is run on a computer, it enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
- the execution device, training device or terminal device provided in the embodiments of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc.
- the processing unit may execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment.
- the storage unit is a storage unit in the chip, such as a register, a cache, etc.
- the storage unit may also be a storage unit located outside the chip in the wireless access device end, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
- the chip may be a neural network processor (NPU) 1100.
- NPU 1100 is mounted on a host CPU as a coprocessor, with tasks assigned by the host CPU.
- the core of the NPU is arithmetic circuit 1103, which is controlled by controller 1104 to retrieve matrix data from memory and perform multiplication operations.
- arithmetic circuit 1103 includes multiple processing elements (PEs). In some implementations, arithmetic circuit 1103 is a two-dimensional systolic array. Arithmetic circuit 1103 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1103 is a general-purpose matrix processor.
- the arithmetic circuit retrieves the corresponding data of matrix B from weight memory 1102 and caches it on each PE in the arithmetic circuit.
- the arithmetic circuit retrieves the data of matrix A from input memory 1101 and performs a matrix operation with matrix B.
- the partial or final matrix result is stored in accumulator 1108.
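The data flow described above (matrix B held stationary, tiles of matrix A streamed in, partial results accumulated) can be sketched in software; the function name, the tiling over the inner dimension, and the tile size are illustrative assumptions, not the actual hardware behavior:

```python
import numpy as np

def tiled_matmul(a, b, tile=2):
    """Software analogue of the flow above: matrix B is held stationary (as
    in weight memory 1102), tiles of matrix A are streamed in, and partial
    matrix results accumulate (as in accumulator 1108). The tile size is an
    illustrative assumption."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    acc = np.zeros((m, n))                        # the accumulator
    for start in range(0, k, tile):               # stream A tile by tile
        end = min(start + tile, k)
        acc += a[:, start:end] @ b[start:end, :]  # partial matrix result
    return acc
```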
- Unified memory 1106 is used to store input and output data. Weight data is directly transferred to weight memory 1102 via the Direct Memory Access Controller (DMAC) 1105. Input data is also transferred to unified memory 1106 via the DMAC.
- DMAC Direct Memory Access Controller
- the bus interface unit (BIU) 1110 is used for the interaction among the AXI bus, the DMAC 1105, and the instruction fetch buffer (IFB) 1109.
- the bus interface unit 1110 (BIU) is used for the instruction fetch memory 1109 to obtain instructions from the external memory, and is also used for the storage unit access controller 1105 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
- DMAC is mainly used to move input data in the external memory DDR to the unified memory 1106 or to move weight data to the weight memory 1102 or to move input data to the input memory 1101.
- the vector calculation unit 1107 includes multiple processing units. When necessary, it further processes the output of the calculation circuit 1103, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
- the vector calculation unit 1107 can store the processed output vector in the unified memory 1106.
- the vector calculation unit 1107 can apply a linear function or a nonlinear function to the output of the operation circuit 1103, such as linear interpolation of the feature plane extracted by the convolution layer, or accumulate a vector of values to generate an activation value.
- the vector calculation unit 1107 generates a normalized value, a pixel-level summed value, or both.
- the processed output vector can be used as an activation input to the operation circuit 1103, for example, for use in subsequent layers in a neural network.
- An instruction fetch buffer 1109 connected to the controller 1104 is used to store instructions used by the controller 1104;
- Unified memory 1106, input memory 1101, weight memory 1102, and instruction fetch memory 1109 are all on-chip memories; the external memory is private to the NPU hardware architecture.
- the processor mentioned in any of the above places can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
- the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the present embodiment.
- the connection relationship between the modules indicates that there is a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines.
- the technical solution of the present application, or the part thereof that contributes to the prior art, may in essence be embodied in the form of a software product; the software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk or optical disk, and includes a number of instructions to enable a computer device (which can be a personal computer, training equipment, or network equipment, etc.) to execute the methods described in each embodiment of the present application.
- all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof.
- all or part of the embodiments may be implemented in the form of a computer program product.
- the computer program product includes one or more computer instructions.
- the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device.
- the computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
- the computer instructions can be transmitted from one website, computer, training equipment or data center to another website, computer, training equipment or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means.
- the computer-readable storage medium can be any available medium that a computer can access, or a data storage device, such as a training device or a data center, that integrates one or more available media.
- the available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).
Description
This application claims priority to the Chinese patent application filed with the State Intellectual Property Office on March 1, 2024, with application number 202410239097.5 and entitled "A data processing method and device thereof", the entire contents of which are incorporated herein by reference.

The present application relates to the field of artificial intelligence, and in particular to a data processing method and device thereof.

Artificial Intelligence (AI) is the theory, methods, techniques, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, to perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a branch of computer science that seeks to understand the essence of intelligence and produce new intelligent machines that can respond in a manner similar to human intelligence. AI also involves studying the design principles and implementation methods of various intelligent machines, enabling them to perceive, reason, and make decisions.

Unlike image recognition, video processing tasks (such as video understanding) require more computing resources. It is therefore important to research lightweight video understanding models that reduce resource consumption. Lightweight video understanding models have many practical applications, such as autonomous driving, robotics, and industrial control. However, existing video understanding models focus more on achieving higher accuracy, resulting in relatively large model designs, while paying little attention to improving the performance of small models in on-device applications.

Therefore, there is an urgent need for a lightweight model design for video processing.
In a first aspect, the present application provides a data processing method, which can be executed by a video processing system. The method includes: the video processing system obtains features of a video, where the features include multiple feature maps distributed in a time dimension, and the dimensions of the feature maps include a channel dimension and a spatial dimension; that is, the features of the video are at least four-dimensional, and too high a dimensionality results in excessive computational overhead for subsequent operations. The video processing system converts the multiple feature maps from a distribution in the time dimension to a distribution in the channel dimension or the spatial dimension to obtain the one or more first feature maps, and obtains the task processing result of the video based on the one or more first feature maps.

Changing the distribution of the multiple feature maps from the time dimension to the channel dimension or the spatial dimension can be understood as fusing the multiple feature maps in the time dimension into the channel dimension or the spatial dimension (fusion can be described as compression, for example, fusion into the channel dimension, fusion into the spatial dimension, or fusion into both the channel and spatial dimensions at the same time).

Existing video understanding models either perform poorly on mobile terminals, or perform well but consume too many resources to be deployed on end-side devices; the resource consumption and performance of mobile models are difficult to balance. The idea of this application is to reduce the dimensionality of the video features. Specifically, the features in the time dimension are compressed into the channel dimension or the spatial dimension, so that the compressed features (that is, the first feature maps in the embodiments of this application) no longer include a time dimension, while the amount of data in the channel and spatial dimensions increases. With the feature dimensionality reduced, the amount of subsequent computation can be greatly reduced; for example, 3D convolution can be replaced by 2D convolution.

The first feature map does not include the time dimension.

When obtaining the task processing result of the video based on the one or more first feature maps, the first feature maps can be processed by a feature extraction network to obtain one or more second feature maps, and the task processing result of the video can be obtained through a task network based on the one or more second feature maps.

In a possible implementation, the task processing result is a processing result of a video understanding task, a video-based generation task, or a video enhancement task.

In a possible implementation, the first feature map and the feature have the same size in the spatial dimension.

In one possible implementation, the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of time steps, h and w are the height and width in the spatial dimension, respectively, and * denotes the product.

In a possible implementation, the feature maps at different positions in the time dimension among the multiple feature maps may be stacked in the channel dimension to obtain the one or more first feature maps; the stacking order may follow the order in the time dimension.
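The compression step above can be sketched in a few lines; the following is a minimal NumPy illustration with hypothetical sizes (the patent does not prescribe an implementation or a framework):

```python
import numpy as np

# Hypothetical sizes: x=3 channels, t=4 frames, 2x2 spatial resolution,
# i.e. a video feature of size (x, t, h, w).
x, t, h, w = 3, 4, 2, 2
feature = np.arange(x * t * h * w, dtype=np.float32).reshape(x, t, h, w)

# Stack the t frames along the channel axis in temporal order: bring the
# time axis in front of the channel axis, then merge the two. The result
# is a first feature map of size (x*t, h, w) with no time dimension, and
# the spatial size (h, w) is unchanged.
first_feature_map = feature.transpose(1, 0, 2, 3).reshape(t * x, h, w)
```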
After the time axis of the video sequence is compressed into the channel axis, the temporal information is mixed together along the feature channels, and object relationships within a frame or across different frames become harder to measure. It is therefore necessary to perform time-related recovery and enhancement on the compressed features.
In one possible implementation, processing the first feature map through the feature extraction network to obtain one or more second feature maps includes: determining, based on an input feature map, the weights corresponding to the channel dimension of the input feature map, where the input feature map is an intermediate output obtained by the feature extraction network processing the one or more first feature maps; performing a convolution operation on the input feature map to obtain convolution results for multiple channels; and fusing the convolution results of the multiple channels according to the weights to obtain a processing result.

In this way, the weights obtained by the first weight determination module are used to restore the temporal information in the features, thereby enhancing the temporal information and improving the processing performance of the network.
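A minimal sketch of this implementation follows; the squeeze-and-excitation-style weight predictor, the 1x1 convolution, and all names are assumed design choices used only to illustrate per-channel weights fusing per-channel convolution results, not the actual modules of the patent:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_channel_fusion(feat, kernels, fc):
    """feat: input feature map (c, h, w); kernels: (c, c, 1, 1) 1x1 conv
    kernels; fc: (c, c) matrix predicting one weight per channel. The
    pooling-based predictor is an illustrative assumption."""
    c, h, w = feat.shape
    # 1) Determine per-channel weights from the input feature map itself.
    pooled = feat.mean(axis=(1, 2))            # (c,) global statistics
    channel_w = sigmoid(fc @ pooled)           # (c,) one weight per channel
    # 2) Convolution producing a result per output channel (1x1 kernels).
    conv_out = np.einsum('oi,ihw->ohw', kernels[:, :, 0, 0], feat)
    # 3) Fuse the per-channel convolution results according to the weights.
    return conv_out * channel_w[:, None, None]
```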
In one possible implementation, processing the first feature map through the feature extraction network to obtain one or more second feature maps includes: expanding an input feature map into multiple feature maps distributed along the time dimension, where the input feature map is an intermediate output obtained by the feature extraction network processing the one or more first feature maps; and performing feature interaction among the resulting multiple feature maps distributed along the time dimension. Interaction between features, referred to as feature interaction for short, means obtaining an interaction result by computing the relationships between different features through a certain mapping (such as convolution or an attention mechanism).

In this way, by converting channels back into frames and performing interaction along the time dimension, the temporal information is restored and object relationships within the same frame or between different frames are captured.

In a possible implementation, the interaction includes: interaction based on an attention mechanism, or interaction achieved through large kernel convolution.
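The expansion and interaction steps can be sketched as follows; dot-product attention over frames is one assumed instantiation (large-kernel convolution would be another), and all names and shapes are illustrative:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_interaction(fused, t):
    """fused: (t*x, h, w) feature map whose channels stack the t frames.
    Expand the channels back into a time dimension, then let the frames
    interact through dot-product attention over time."""
    ct, h, w = fused.shape
    x = ct // t
    frames = fused.reshape(t, x, h, w)          # channels -> time expansion
    tokens = frames.reshape(t, -1)              # one token per frame
    scores = tokens @ tokens.T / np.sqrt(tokens.shape[1])
    attn = softmax(scores)                      # (t, t) frame-to-frame weights
    return (attn @ tokens).reshape(t, x, h, w)  # each frame mixes all frames
```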
In addition, temporal information can also be added through time encoding. In one possible implementation, processing the first feature map through the feature extraction network to obtain one or more second feature maps includes: determining a time encoding whose size is consistent with that of the multiple feature maps; fusing the time encoding with the multiple feature maps to obtain multiple fused feature maps in the time dimension; and performing feature interaction among the multiple fused feature maps in the time dimension.
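A minimal sketch of this implementation is given below, assuming a simple sinusoidal per-frame code fused by addition; the patent does not fix the form of the time encoding, so this particular code is an illustrative assumption:

```python
import numpy as np

def add_time_encoding(frames):
    """frames: (t, x, h, w). Build a time encoding with the same size as
    the feature maps (here a sinusoidal scalar per frame, broadcast over
    channels and space) and fuse it with the feature maps by addition."""
    t = frames.shape[0]
    pos = np.arange(t, dtype=np.float64)
    code = np.sin(pos / max(t - 1, 1) * np.pi / 2)   # one scalar per frame
    return frames + code[:, None, None, None]        # broadcast to (t,x,h,w)
```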
In a possible implementation, the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature.

In a second aspect, the present application provides a data processing device, comprising:

a compression module, configured to obtain features of a video, the features comprising a plurality of feature maps distributed in a time dimension, the dimensions of the feature maps comprising a channel dimension and a spatial dimension, and convert the plurality of feature maps from being distributed in the time dimension to being distributed in the channel dimension or the spatial dimension, to obtain the one or more first feature maps; the first feature maps do not include the time dimension;

a processing module, configured to process the first feature map through a feature extraction network to obtain one or more second feature maps, and obtain a task processing result of the video through a task network based on the one or more second feature maps.

In a possible implementation, the task processing result is a processing result of a video understanding task, a video-based generation task, or a video enhancement task.

In a possible implementation, the first feature map and the feature have the same size in the spatial dimension.

In one possible implementation, the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of time steps, h and w are the height and width in the spatial dimension, respectively, and * denotes the product.

In a possible implementation, the compression module is specifically configured to:

stack the feature maps at different positions in the time dimension among the multiple feature maps in the channel dimension, to obtain the one or more first feature maps.

In one possible implementation, the feature extraction network includes a first weight determination module and a convolution module;

the first weight determination module is configured to determine, based on an input feature map, the weights corresponding to the channel dimension of the input feature map;

the convolution module is configured to perform a convolution operation on the input feature map to obtain convolution results for multiple channels, and fuse the convolution results of the multiple channels according to the weights to obtain a processing result.

In one possible implementation, the feature extraction network includes a transformation module and an interaction module;

the transformation module is configured to expand the input feature map into multiple feature maps distributed in the time dimension;

the interaction module is configured to perform feature interaction among the multiple feature maps distributed in the time dimension.

In a possible implementation, the interaction includes:

interaction based on an attention mechanism, or interaction achieved through large kernel convolution.

In a possible implementation, the transformation module is further configured to:

determine a time encoding whose size is consistent with that of the multiple feature maps obtained by the transformation module;

fuse the time encoding with the multiple feature maps obtained by the transformation module, to obtain multiple fused feature maps in the time dimension;

the interaction module is specifically configured to perform feature interaction among the multiple fused feature maps in the time dimension.

In a possible implementation, the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature.

In a third aspect, an embodiment of the present application provides a data processing device, which may include a memory, a processor, and a bus system, wherein the memory is used to store a program, and the processor is used to execute the program in the memory, so as to perform the method of the first aspect and any optional implementation thereof.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when run on a computer, causes the computer to execute the method of the first aspect and any optional implementation thereof.

In a fifth aspect, an embodiment of the present application provides a computer program which, when run on a computer, causes the computer to execute the method of the first aspect and any optional implementation thereof.

In a sixth aspect, the present application provides a chip system comprising a processor configured to support a data processing device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods. In one possible design, the chip system further comprises a memory configured to store the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete components.
FIG. 1A is a schematic diagram of a structure of an artificial intelligence main framework;

FIG. 1B and FIG. 1C are schematic diagrams of the application system framework of the present invention;

FIG. 1D is a schematic diagram of an optional hardware structure of a terminal;

FIG. 2 is a schematic diagram of the structure of a server;

FIG. 3 is a schematic diagram of a system architecture of the present application;

FIG. 4 shows a process of a cloud service;

FIG. 5 is a flowchart of a data processing method provided in an embodiment of the present application;

FIG. 6 is a processing diagram of a data processing method provided in an embodiment of the present application;

FIG. 7 is a schematic diagram of an effect provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a data processing device provided in an embodiment of the present application;

FIG. 9 is a schematic diagram of a structure of an execution device provided in an embodiment of the present application;

FIG. 10 is a schematic diagram of a structure of a training device provided in an embodiment of the present application;

FIG. 11 is a schematic diagram of the structure of a chip provided in an embodiment of the present application.
The following describes the embodiments of the present invention in conjunction with the accompanying drawings. The terms used in the embodiments of the present invention are only used to explain the specific embodiments of the present invention, and are not intended to limit the present invention.

The embodiments of the present application are described below in conjunction with the accompanying drawings. Those skilled in the art will appreciate that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.

The terms "first", "second", etc. in the specification and claims of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the way objects with the same attributes are distinguished when described in the embodiments of the present application. In addition, the terms "including" and "having" and any variations thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units need not be limited to those units, but may include other units that are not clearly listed or that are inherent to the process, method, product, or device.

As used herein, the terms "substantially", "about", and similar terms are used as terms of approximation, not as terms of degree, and are intended to take into account the inherent variations in measured or calculated values that one of ordinary skill in the art would recognize. Furthermore, the use of "may" when describing embodiments of the present invention refers to "one or more possible embodiments". As used herein, the terms "use", "using", and "used" may be considered synonymous with the terms "utilize", "utilizing", and "utilized", respectively. Additionally, the term "exemplary" is intended to refer to an example or illustration.
首先对人工智能系统总体工作流程进行描述，请参见图1A，图1A示出的为人工智能主体框架的一种结构示意图，下面从“智能信息链”（水平轴）和“IT价值链”（垂直轴）两个维度对上述人工智能主体框架进行阐述。其中，“智能信息链”反映从数据的获取到处理的一系列过程。举例来说，可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中，数据经历了“数据—信息—知识—智慧”的凝练过程。“IT价值链”从人工智能的底层基础设施、信息（提供和处理技术实现）到系统的产业生态过程，反映人工智能为信息技术产业带来的价值。First, the overall workflow of an AI system is described. See FIG1A, which is a schematic diagram of the main framework of artificial intelligence. This framework is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes from data acquisition to data processing, for example, the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. Throughout this process, data undergoes a condensation from "data" to "information" to "knowledge" to "wisdom". The "IT value chain", spanning the underlying infrastructure of artificial intelligence, information (provided and processed by technical implementations), and the system's industrial ecosystem, reflects the value that artificial intelligence brings to the information technology industry.
(1)基础设施(1) Infrastructure
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。通过传感器与外部沟通;计算能力由智能芯片(CPU、NPU、GPU、ASIC、FPGA等硬件加速芯片)提供;基础平台包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。举例来说,传感器和外部沟通获取数据,这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。Infrastructure provides computing power for AI systems, enabling communication with the outside world and supporting this through a foundational platform. External communication occurs through sensors; computing power is provided by intelligent chips (CPUs, NPUs, GPUs, ASICs, FPGAs, and other hardware accelerators). The foundational platform includes a distributed computing framework and network-related platform guarantees and support, including cloud storage and computing, and interconnected networks. For example, sensors communicate with the outside world to acquire data, which is then fed into the intelligent chips within the distributed computing system provided by the foundational platform for computation.
(2)数据(2) Data
基础设施的上一层的数据用于表示人工智能领域的数据来源。数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。Data above the infrastructure layer represents data sources for AI. This data includes graphics, images, voice, and text, as well as IoT data from traditional devices. This includes business data from existing systems and sensor data such as force, displacement, liquid level, temperature, and humidity.
(3)数据处理(3) Data processing
数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等方式。Data processing generally includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。Among them, machine learning and deep learning can symbolize and formalize data for intelligent information modeling, extraction, preprocessing, and training.
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。Reasoning refers to the process of simulating human intelligent reasoning in computers or intelligent systems, using formalized information to perform machine thinking and solve problems based on reasoning control strategies. Typical functions are search and matching.
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。Decision-making refers to the process of making decisions after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
(4)通用能力(4) General ability
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。After the data has undergone the data processing mentioned above, some general capabilities can be further formed based on the results of the data processing, such as algorithms or a general system, for example, translation, text analysis, computer vision processing, speech recognition, image recognition, etc.
(5)智能产品及行业应用(5) Smart products and industry applications
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能终端、智能交通、智能医疗、自动驾驶、智慧城市等。Smart products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, which productizes intelligent information decision-making and realizes practical application. Its application areas mainly include: smart terminals, smart transportation, smart medical care, autonomous driving, smart cities, etc.
首先介绍本申请的应用场景，本申请可以但不限于应用在具有视频处理功能的应用程序(以下可以简称为视频处理类应用程序)或者云侧服务器提供的云服务等，接下来分别进行介绍：First, the application scenarios of the present application are introduced. The present application can be applied to, but is not limited to, applications with video processing functions (hereinafter referred to as video processing applications) or cloud services provided by cloud-side servers, each of which is introduced below:
视频处理可以但不限于是视频理解、视频增强、基于视频的生成任务等。Video processing can be, but is not limited to, video understanding, video enhancement, and video-based generation tasks.
其中,视频理解可以包括但不限于:动作识别,时序动作定位,视频摘要,视频检测,视频分割,多模态视频理解,行人重识别等任务。Among them, video understanding can include but is not limited to: action recognition, temporal action localization, video summarization, video detection, video segmentation, multimodal video understanding, pedestrian re-identification and other tasks.
一、视频处理类应用程序1. Video processing applications
本申请实施例的产品形态可以为视频处理类应用程序。视频处理类应用程序可以运行在终端设备或者云侧的服务器上。The product form of the embodiment of the present application can be a video processing application. The video processing application can be run on a terminal device or a cloud-side server.
其中,本申请实施例中的视频处理任务可以为:基于用户输入的视频得到视频的任务处理结果。Among them, the video processing task in the embodiment of the present application can be: obtaining a task processing result of the video based on the video input by the user.
在一种可能的实现中,视频处理类应用程序可以基于用户输入的视频,实现视频处理任务,得到视频的任务处理结果。In a possible implementation, a video processing application may implement a video processing task based on a video input by a user and obtain a task processing result of the video.
在一种可能的实现中,用户可以打开终端设备上安装的视频处理类应用程序,并输入视频,视频处理类应用程序可以通过本申请实施例提供的方法训练得到的模型、或者是通过本申请实施例提供的方法对用户输入的视频进行处理,并将视频的任务处理结果呈现给用户(呈现方式可以但不限于是显示、播放、保存、上传到云侧等)。In one possible implementation, the user can open a video processing application installed on the terminal device and input a video. The video processing application can process the video input by the user through a model trained by the method provided in the embodiment of the present application, or through the method provided in the embodiment of the present application, and present the task processing results of the video to the user (the presentation method can be but is not limited to display, playback, saving, uploading to the cloud side, etc.).
在一种可能的实现中,用户可以打开终端设备上安装的视频处理类应用程序,并输入视频,视频处理类应用程序可以将视频发送至云侧的服务器,云侧的服务器通过本申请实施例提供的方法训练得到的模型对视频进行处理,并将视频的任务处理结果回传至终端设备,终端设备可以将视频的任务处理结果呈现给用户(呈现方式可以但不限于是显示、播放、保存、上传到云侧等)。In one possible implementation, a user can open a video processing application installed on a terminal device and input a video. The video processing application can send the video to a cloud-side server. The cloud-side server processes the video using a model trained using the method provided in an embodiment of the present application, and transmits the task processing results of the video back to the terminal device. The terminal device can present the task processing results of the video to the user (the presentation method can be, but is not limited to, display, playback, saving, uploading to the cloud side, etc.).
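The cloud-side round trip described above can be sketched minimally as follows. This is an illustrative, assumption-laden sketch and not the implementation of this application: all names (`server_process`, `terminal_submit`, the result fields) are hypothetical, and the network upload is replaced by a direct function call so the sketch runs standalone.

```python
def server_process(video_frames):
    # Cloud-side stand-in for the trained model (hypothetical):
    # returns a dummy task processing result for the uploaded video.
    return {"task": "video_understanding", "num_frames": len(video_frames)}

def terminal_submit(video_frames, send=server_process):
    # In a real deployment, `send` would be an HTTP/RPC call that uploads
    # the video to the cloud-side server; here it is a direct call.
    result = send(video_frames)
    # The terminal then presents the result (display, play, save, upload).
    return result

result = terminal_submit(["frame%d" % i for i in range(8)])
print(result["num_frames"])  # → 8
```

The design choice illustrated is only the division of labor: the terminal collects input and presents output, while the heavy processing runs behind the `send` boundary, so the same terminal code works whether processing is local or cloud-side.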
接下来分别从功能架构以及实现功能的产品架构介绍本申请实施例中的视频处理类应用程序。Next, the video processing application in the embodiment of this application is introduced from the perspective of functional architecture and product architecture that implements the functions.
参照图1B,图1B为本申请实施例中视频处理类应用程序的功能架构示意:Referring to FIG. 1B , FIG. 1B is a schematic diagram of the functional architecture of a video processing application in an embodiment of the present application:
在一种可能的实现中,如图1B所示,视频处理类应用程序102可接收输入的参数101(例如包含视频)且产生视频的任务处理结果103。视频处理类应用程序102可在(举例来说)至少一个计算机系统上执行,且包括计算机代码,所述计算机代码在由一或多个计算机执行时致使所述计算机执行用于执行通过本申请实施例提供的方法训练得到的模型。In one possible implementation, as shown in FIG1B , a video processing application 102 may receive input parameters 101 (e.g., including a video) and generate a video task processing result 103. The video processing application 102 may be executed on, for example, at least one computer system and include computer code that, when executed by one or more computers, causes the computers to execute a model trained using the method provided in the embodiments of the present application.
参照图1C,图1C为本申请实施例中运行视频处理类应用程序的实体架构示意:Referring to FIG. 1C , FIG. 1C is a schematic diagram of the physical architecture for running a video processing application in an embodiment of the present application:
参见图1C,图1C示出了一种系统架构示意图。该系统可以包括终端100、以及服务器200。其中,服务器200可以包括一个或者多个服务器(图1C中以包括一个服务器作为示例进行说明),服务器200可以为一个或者多个终端提供视频处理或者自然语言生成功能。Referring to FIG1C , FIG1C shows a schematic diagram of a system architecture. The system may include a terminal 100 and a server 200. The server 200 may include one or more servers (FIG1C illustrates one server as an example), and the server 200 may provide video processing or natural language generation functions for one or more terminals.
其中,终端100上可以安装有视频处理类应用程序,或者打开与视频处理或者自然语言生成功能相关的网页,上述应用程序和网页可以提供一个界面,终端100可以接收用户在视频处理或者自然语言生成功能界面上输入的相关参数,并将上述参数发送至服务器200,服务器200可以基于接收到的参数,得到视频的任务处理结果,并将视频的任务处理结果返回至终端100。Among them, the terminal 100 can be installed with a video processing application, or a web page related to the video processing or natural language generation function can be opened. The above application and web page can provide an interface. The terminal 100 can receive the relevant parameters entered by the user on the video processing or natural language generation function interface, and send the above parameters to the server 200. The server 200 can obtain the task processing results of the video based on the received parameters, and return the task processing results of the video to the terminal 100.
应理解,在一些可选的实现中,终端100也可以由自身完成基于接收到的参数,得到视频的任务处理结果的动作,而不需要服务器配合实现,本申请实施例并不限定。It should be understood that in some optional implementations, the terminal 100 can also complete the action of obtaining the task processing result of the video based on the received parameters by itself without the need for cooperation from the server, and the embodiments of the present application are not limited thereto.
接下来描述图1C中终端100的产品形态;Next, the product form of the terminal 100 in FIG1C is described;
本申请实施例中的终端100可以为手机、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等,本申请实施例对此不作任何限制。The terminal 100 in the embodiment of the present application can be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), etc., and the embodiment of the present application does not impose any restrictions on this.
图1D示出了终端100的一种可选的硬件结构示意图。FIG1D shows a schematic diagram of an optional hardware structure of the terminal 100 .
参考图1D所示,终端100可以包括射频单元110、存储器120、输入单元130、显示单元140、摄像头150(可选的)、音频电路160(可选的)、扬声器161(可选的)、麦克风162(可选的)、处理器170、外部接口180、电源190等部件。本领域技术人员可以理解,图1D仅仅是终端或多功能设备的举例,并不构成对终端或多功能设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件。1D , the terminal 100 may include components such as a radio frequency unit 110, a memory 120, an input unit 130, a display unit 140, a camera 150 (optional), an audio circuit 160 (optional), a speaker 161 (optional), a microphone 162 (optional), a processor 170, an external interface 180, and a power supply 190. Those skilled in the art will appreciate that FIG1D is merely an example of a terminal or multi-function device and does not limit the terminal or multi-function device. The terminal or multi-function device may include more or fewer components than shown, or may combine certain components or have different components.
输入单元130可用于接收输入的数字或字符信息,以及产生与该便携式多功能装置的用户设置以及功能控制有关的键信号输入。具体地,输入单元130可包括触摸屏131(可选的)和/或其他输入设备132。该触摸屏131可收集用户在其上或附近的触摸操作(比如用户使用手指、关节、触笔等任何适合的物体在触摸屏上或在触摸屏附近的操作),并根据预先设定的程序驱动相应的连接装置。触摸屏可以检测用户对触摸屏的触摸动作,将该触摸动作转换为触摸信号发送给该处理器170,并能接收该处理器170发来的命令并加以执行;该触摸信号至少包括触点坐标信息。该触摸屏131可以提供该终端100和用户之间的输入界面和输出界面。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触摸屏。除了触摸屏131,输入单元130还可以包括其他输入设备。具体地,其他输入设备132可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。The input unit 130 can be used to receive input digital or character information and generate key signal input related to user settings and function control of the portable multifunction device. Specifically, the input unit 130 may include a touch screen 131 (optional) and/or other input devices 132. The touch screen 131 can detect user touch operations on or near it (for example, operations performed on or near the touch screen using a finger, joint, stylus, or any other suitable object) and drive corresponding connected devices according to pre-set programs. The touch screen can detect user touch actions on the touch screen, convert the touch actions into touch signals and transmit them to the processor 170. It can also receive and execute commands sent by the processor 170; the touch signals include at least touch point coordinate information. The touch screen 131 provides an input interface and an output interface between the terminal 100 and the user. Touch screens can be implemented using various types, including resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch screen 131, the input unit 130 may also include other input devices. Specifically, the other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (such as a volume control key, a switch key, etc.), a trackball, a mouse, a joystick, and the like.
其中,其他输入设备132可以接收到输入的视频。Among them, other input devices 132 can receive input video.
该显示单元140可用于显示由用户输入的信息或提供给用户的信息、终端100的各种菜单、交互界面、文件显示和/或任意一种多媒体文件的播放。在本申请实施例中,显示单元140可用于显示生成类应用程序的界面、处理结果等。The display unit 140 may be used to display information input by the user or provided to the user, various menus of the terminal 100, interactive interfaces, file display, and/or playback of any multimedia file. In an embodiment of the present application, the display unit 140 may be used to display the interface of a generated application, processing results, etc.
该存储器120可用于存储指令和数据，存储器120可主要包括存储指令区和存储数据区，存储数据区可存储各种数据，如多媒体文件、文本等；存储指令区可存储操作系统、应用、至少一个功能所需的指令等软件单元，或者他们的子集、扩展集。存储器120还可以包括非易失性随机存储器，向处理器170提供管理计算处理设备中的硬件、软件以及数据资源的功能，支持控制软件和应用，还用于多媒体文件的存储，以及运行程序和应用的存储。Memory 120 can be used to store instructions and data. It primarily includes an instruction storage area and a data storage area. The data storage area can store various data, such as multimedia files and text. The instruction storage area can store software units such as the operating system, applications, and instructions required for at least one function, or subsets or extensions thereof. Memory 120 may also include non-volatile random access memory (NVRAM); it provides processor 170 with management of the hardware, software, and data resources within the computing and processing device, supporting control software and applications. It is also used to store multimedia files, as well as running programs and applications.
处理器170是终端100的控制中心，利用各种接口和线路连接整个终端100的各个部分，通过运行或执行存储在存储器120内的指令以及调用存储在存储器120内的数据，执行终端100的各种功能和处理数据，从而对终端设备进行整体控制。可选的，处理器170可包括一个或多个处理单元；优选的，处理器170可集成应用处理器和调制解调处理器，其中，应用处理器主要处理操作系统、用户界面和应用程序等，调制解调处理器主要处理无线通信。可以理解的是，上述调制解调处理器也可以不集成到处理器170中。在一些实施例中，处理器、存储器可以在单一芯片上实现，在一些实施例中，他们也可以在独立的芯片上分别实现。处理器170还可以用于产生相应的操作控制信号，发给计算处理设备相应的部件，读取以及处理软件中的数据，尤其是读取和处理存储器120中的数据和程序，以使其中的各个功能模块执行相应的功能，从而控制相应的部件按指令的要求进行动作。The processor 170 is the control center of the terminal 100. It connects all components of the terminal 100 using various interfaces and circuits. By executing instructions stored in the memory 120 and accessing data stored therein, it executes various functions of the terminal 100 and processes data, thereby providing overall control of the terminal device. Optionally, the processor 170 may include one or more processing units. Preferably, the processor 170 may integrate an application processor and a modem processor, with the application processor primarily processing the operating system, user interface, and application programs, while the modem processor primarily handles wireless communications. It is understood that the modem processor may not be integrated into the processor 170. In some embodiments, the processor and memory may be implemented on a single chip; in other embodiments, they may be implemented on separate chips. The processor 170 may also generate corresponding operational control signals and send them to the corresponding components of the computing and processing device. It may also read and process data in the software, particularly the data and programs in the memory 120, to enable the various functional modules therein to perform their corresponding functions, thereby controlling the corresponding components to operate as instructed.
其中,存储器120可以用于存储数据处理方法相关的软件代码,处理器170可以执行芯片的数据处理方法的步骤,也可以调度其他单元(例如上述输入单元130以及显示单元140)以实现相应的功能。Among them, the memory 120 can be used to store software codes related to the data processing method, the processor 170 can execute the steps of the chip's data processing method, and can also schedule other units (such as the above-mentioned input unit 130 and display unit 140) to achieve corresponding functions.
该射频单元110(可选的)可用于收发信息或通话过程中信号的接收和发送，例如，将基站的下行信息接收后，给处理器170处理；另外，将涉及上行的数据发送给基站。通常，RF电路包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外，射频单元110还可以通过无线通信与网络设备和其他设备通信。该无线通信可以使用任一通信标准或协议，包括但不限于全球移动通讯系统(Global System for Mobile communications,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等。The RF unit 110 (optional) can be used to send and receive information or receive and send signals during a call. For example, it receives downlink information from the base station and sends it to the processor 170 for processing; in addition, it sends uplink data to the base station. Typically, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low-noise amplifier (LNA), a duplexer, etc. In addition, the RF unit 110 can also communicate with network devices and other devices via wireless communication. This wireless communication can use any communication standard or protocol, including but not limited to Global System for Mobile Communications (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), etc.
其中,在本申请实施例中,该射频单元110可以将视频发送至服务器200,并接收到服务器200发送的处理结果。In this embodiment of the present application, the RF unit 110 can send the video to the server 200 and receive the processing result sent by the server 200.
应理解,该射频单元110为可选的,其可以被替换为其他通信接口,例如可以是网口。It should be understood that the radio frequency unit 110 is optional and can be replaced by other communication interfaces, such as a network port.
终端100还包括给各个部件供电的电源190(比如电池),优选的,电源可以通过电源管理系统与处理器170逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。The terminal 100 also includes a power supply 190 (such as a battery) for supplying power to various components. Preferably, the power supply can be logically connected to the processor 170 through a power management system, thereby managing functions such as charging, discharging, and power consumption through the power management system.
终端100还包括外部接口180，该外部接口可以是标准的Micro USB接口，也可以是多针连接器，可以用于连接终端100与其他装置进行通信，也可以用于连接充电器为终端100充电。The terminal 100 also includes an external interface 180, which can be a standard Micro USB interface or a multi-pin connector. It can be used to connect the terminal 100 to other devices for communication, and can also be used to connect a charger to charge the terminal 100.
尽管未示出,终端100还可以包括闪光灯、无线保真(wireless fidelity,WiFi)模块、蓝牙模块、不同功能的传感器等,在此不再赘述。下文中描述的部分或全部方法均可以应用在如图1D所示的终端100中。Although not shown, terminal 100 may also include a flashlight, a wireless fidelity (WiFi) module, a Bluetooth module, sensors with different functions, etc., which are not described in detail here. Some or all of the methods described below may be applied to terminal 100 as shown in FIG. 1D .
接下来描述图1C中服务器200的产品形态;Next, the product form of the server 200 in FIG1C is described;
图2提供了一种服务器200的结构示意图,如图2所示,服务器200包括总线201、处理器202、通信接口203和存储器204。处理器202、存储器204和通信接口203之间通过总线201通信。FIG2 provides a schematic diagram of the structure of a server 200. As shown in FIG2, the server 200 includes a bus 201, a processor 202, a communication interface 203, and a memory 204. The processor 202, the memory 204, and the communication interface 203 communicate with each other via the bus 201.
总线201可以是外设部件互连标准(peripheral component interconnect,PCI)总线或扩展工业标准结构(extended industry standard architecture,EISA)总线等。总线可以分为地址总线、数据总线、控制总线等。为便于表示,图2中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。Bus 201 can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus. Buses can be classified as address buses, data buses, control buses, and the like. For ease of illustration, FIG2 shows only one thick line, but this does not imply that there is only one bus or only one type of bus.
处理器202可以为中央处理器(central processing unit,CPU)、图形处理器(graphics processing unit,GPU)、微处理器(micro processor,MP)或者数字信号处理器(digital signal processor,DSP)等处理器中的任意一种或多种。The processor 202 can be any one or more of a central processing unit (CPU), a graphics processing unit (GPU), a microprocessor (MP), or a digital signal processor (DSP).
存储器204可以包括易失性存储器(volatile memory)，例如随机存取存储器(random access memory,RAM)。存储器204还可以包括非易失性存储器(non-volatile memory)，例如只读存储器(read-only memory,ROM)、快闪存储器、机械硬盘(hard disk drive,HDD)或固态硬盘(solid state drive,SSD)。The memory 204 may include volatile memory, such as random access memory (RAM). The memory 204 may also include non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid state drive (SSD).
其中,存储器204可以用于存储数据处理方法相关的软件代码,处理器202可以执行芯片的数据处理方法的步骤,也可以调度其他单元以实现相应的功能。The memory 204 may be used to store software codes related to the data processing method, and the processor 202 may execute the steps of the data processing method of the chip, and may also schedule other units to implement corresponding functions.
应理解,上述终端100和服务器200可以为集中式或者是分布式的设备,上述终端100和服务器200中的处理器(例如处理器170以及处理器202)可以为硬件电路(如专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)、通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器等等)、或这些硬件电路的组合,例如,处理器可以为具有执行指令功能的硬件系统,如CPU、DSP等,或者为不具有执行指令功能的硬件系统,如ASIC、FPGA等,或者为上述不具有执行指令功能的硬件系统以及具有执行指令功能的硬件系统的组合。It should be understood that the above-mentioned terminal 100 and server 200 can be centralized or distributed devices, and the processors in the above-mentioned terminal 100 and server 200 (such as processor 170 and processor 202) can be hardware circuits (such as application specific integrated circuit (ASIC), field-programmable gate array (FPGA), general-purpose processor, digital signal processor (DSP), microprocessor or microcontroller, etc.), or a combination of these hardware circuits. For example, the processor can be a hardware system with the function of executing instructions, such as CPU, DSP, etc., or a hardware system without the function of executing instructions, such as ASIC, FPGA, etc., or a combination of the above-mentioned hardware systems without the function of executing instructions and hardware systems with the function of executing instructions.
应理解,本申请实施例中的和模型推理过程相关的步骤涉及AI相关的运算,在执行AI运算时,终端设备和服务器的指令执行架构不仅仅局限在上述介绍的处理器结合存储器的架构。下面结合图3对本申请实施例提供的系统架构进行详细的介绍。It should be understood that the steps related to the model reasoning process in the embodiments of this application involve AI-related operations. When performing AI operations, the instruction execution architecture of the terminal device and server is not limited to the processor-memory architecture described above. The system architecture provided in the embodiments of this application is described in detail below with reference to Figure 3.
图3为本申请实施例提供的系统架构示意图。如图3所示,系统架构500包括执行设备510、训练设备520、数据库530、客户设备540、数据存储系统550以及数据采集系统560。FIG3 is a schematic diagram of the system architecture provided by an embodiment of the present application. As shown in FIG3 , the system architecture 500 includes an execution device 510 , a training device 520 , a database 530 , a client device 540 , a data storage system 550 , and a data acquisition system 560 .
执行设备510包括计算模块511、I/O接口512、预处理模块513和预处理模块514。计算模块511中可以包括目标模型/规则501，预处理模块513和预处理模块514是可选的。The execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. The computing module 511 may include the target model/rule 501; the preprocessing module 513 and the preprocessing module 514 are optional.
其中,执行设备510可以为上述运行生成类应用程序的终端设备或者服务器。The execution device 510 may be a terminal device or a server that runs the aforementioned generated application program.
数据采集设备560用于采集训练样本。训练样本可以为图像数据等。在采集到训练样本之后,数据采集设备560将这些训练样本存入数据库530。The data acquisition device 560 is used to collect training samples. The training samples can be image data, etc. After collecting the training samples, the data acquisition device 560 stores these training samples in the database 530.
训练设备520可以基于数据库530中维护的训练样本，对待训练的神经网络(例如本申请实施例中的特征提取网络、任务网络以及其他神经网络)进行训练，以得到目标模型/规则501。The training device 520 can train the neural network to be trained (such as the feature extraction network, the task network, and other neural networks in the embodiments of the present application) based on the training samples maintained in the database 530, so as to obtain the target model/rule 501.
应理解，训练设备520可以基于数据库530中维护的训练样本，对待训练的神经网络进行预训练过程，或者是在预训练的基础上进行模型的微调。It should be understood that the training device 520 can perform a pre-training process on the neural network to be trained based on the training samples maintained in the database 530, or fine-tune the model on the basis of the pre-training.
需要说明的是,在实际应用中,数据库530中维护的训练样本不一定都来自于数据采集设备560的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备520也不一定完全基于数据库530维护的训练样本进行目标模型/规则501的训练,也有可能从云端或其他地方获取训练样本进行模型训练,上述描述不应该作为对本申请实施例的限定。It should be noted that, in actual applications, the training samples maintained in the database 530 may not all be collected by the data acquisition device 560, but may also be received from other devices. It should also be noted that the training device 520 may not train the target model/rule 501 entirely based on the training samples maintained in the database 530, but may also obtain training samples from the cloud or other places for model training. The above description should not be used as a limitation on the embodiments of the present application.
根据训练设备520训练得到的目标模型/规则501可以应用于不同的系统或设备中,如应用于图3所示的执行设备510,该执行设备510可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备,车载终端等,还可以是服务器等。The target model/rule 501 obtained through training with the training device 520 can be applied to different systems or devices, such as the execution device 510 shown in FIG3 . The execution device 510 can be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an augmented reality (AR)/virtual reality (VR) device, a vehicle-mounted terminal, etc., or a server, etc.
具体的,训练设备520可以将训练后的模型传递至执行设备510。Specifically, the training device 520 may transfer the trained model to the execution device 510 .
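The train-then-deploy handoff described above (training device 520 fits a model on samples from the database, then transfers it to execution device 510 for inference) can be illustrated with a deliberately tiny sketch. This is an assumption-laden toy: the "model" is just a scalar decision threshold, standing in for the neural networks trained in the embodiments, and all names are hypothetical.

```python
def train(samples):
    # "Training device": derive a decision threshold from labeled
    # scalar samples (x, label) maintained in the database.
    pos = [x for x, y in samples if y == 1]
    neg = [x for x, y in samples if y == 0]
    return {"threshold": (min(pos) + max(neg)) / 2}

def execute(model, x):
    # "Execution device": apply the transferred model to new input.
    return 1 if x >= model["threshold"] else 0

database = [(0.1, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
model = train(database)        # performed on the training device
print(execute(model, 0.8))     # performed on the execution device → 1
```

The point of the sketch is the separation: training and execution only share the serialized model object, which is what lets the trained model be transferred to, and deployed on, a different device.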
在图3中,执行设备510配置输入/输出(input/output,I/O)接口512,用于与外部设备进行数据交互,用户可以通过客户设备540向I/O接口512输入数据(例如本申请实施例中的视频等)。In Figure 3, the execution device 510 is configured with an input/output (I/O) interface 512 for data interaction with external devices. The user can input data (such as video in the embodiment of the present application) to the I/O interface 512 through the client device 540.
预处理模块513和预处理模块514用于根据I/O接口512接收到的输入数据进行预处理。应理解，可以没有预处理模块513和预处理模块514，或者只有一个预处理模块。当不存在预处理模块513和预处理模块514时，可以直接采用计算模块511对输入数据进行处理。Preprocessing module 513 and preprocessing module 514 are used to preprocess the input data received by the I/O interface 512. It should be understood that there may be no preprocessing module 513 and preprocessing module 514, or only one preprocessing module. When preprocessing module 513 and preprocessing module 514 are absent, the computing module 511 may process the input data directly.
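The optionality of the preprocessing stages (zero, one, or two preprocessing modules in front of computing module 511) can be sketched generically. The names and the trivial arithmetic below are hypothetical placeholders; the actual preprocessing and model computation of the embodiments are not specified here.

```python
def computing_module(data):
    # Stand-in for computing module 511 (e.g. applying target model/rule 501).
    return [v * 2 for v in data]

def run(data, preprocessing_modules=()):
    # If no preprocessing module is configured, the computing module
    # processes the input data directly, as described above.
    for pre in preprocessing_modules:
        data = pre(data)
    return computing_module(data)

normalize = lambda xs: [x + 1 for x in xs]   # hypothetical preprocessing step

print(run([1, 2, 3]))                # → [2, 4, 6]  (no preprocessing)
print(run([1, 2, 3], (normalize,)))  # → [4, 6, 8]  (one preprocessing module)
```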
When the execution device 510 preprocesses the input data, or when the computing module 511 of the execution device 510 performs computation or other related processing, the execution device 510 may call data, code, and the like stored in the data storage system 550 for the corresponding processing, and may also store the data, instructions, and the like obtained from that processing into the data storage system 550.
Finally, the I/O interface 512 provides the processing result to the client device 540, and thus to the user.
In the scenario shown in FIG. 3, the user may manually specify the input data, and this manual specification may be operated through an interface provided by the I/O interface 512. In another scenario, the client device 540 may automatically send input data to the I/O interface 512; if the user's authorization is required before the client device 540 may automatically send input data, the user may set the corresponding permission in the client device 540. The user may view, on the client device 540, the result output by the execution device 510, and the specific presentation form may be display, sound, action, or another specific manner. The client device 540 may also serve as a data collection terminal, collecting the input data fed into the I/O interface 512 and the output results produced by the I/O interface 512, as shown in the figure, as new sample data and storing them in the database 530. Alternatively, the collection may bypass the client device 540, and the I/O interface 512 may directly store the input data fed into the I/O interface 512 and the output results produced by the I/O interface 512 as new sample data in the database 530.
It should be noted that FIG. 3 is merely a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationships among the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 3 the data storage system 550 is external memory relative to the execution device 510, whereas in other cases the data storage system 550 may also be placed inside the execution device 510. It should be understood that the execution device 510 may be deployed in the client device 540.
From the inference side of the model:
In an embodiment of the present application, the computing module 511 of the execution device 510 may obtain the code stored in the data storage system 550 to implement the steps related to the model inference process in the embodiments of the present application.
In an embodiment of the present application, the computing module 511 of the execution device 510 may include a hardware circuit (such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, a microcontroller, or the like) or a combination of such hardware circuits. For example, the computing module 511 may be a hardware system with an instruction execution function, such as a CPU or DSP; a hardware system without an instruction execution function, such as an ASIC or FPGA; or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
Specifically, the computing module 511 of the execution device 510 may be a hardware system with an instruction execution function, and the steps related to the model inference process provided in the embodiments of the present application may be software code stored in a memory. The computing module 511 of the execution device 510 may obtain the software code from the memory and execute it to implement the steps related to the model inference process provided in the embodiments of the present application.
It should be understood that the computing module 511 of the execution device 510 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function. Some of the steps related to the model inference process provided in the embodiments of the present application may also be implemented by the hardware system without an instruction execution function in the computing module 511 of the execution device 510, which is not limited here.
From the training side of the model:
In an embodiment of the present application, the training device 520 may obtain code stored in a memory (not shown in FIG. 3; it may be integrated into the training device 520 or deployed separately from it) to implement the steps related to model training in the embodiments of the present application.
In an embodiment of the present application, the training device 520 may include a hardware circuit (such as an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, a microcontroller, or the like) or a combination of such hardware circuits. For example, the training device 520 may be a hardware system with an instruction execution function, such as a CPU or DSP; a hardware system without an instruction execution function, such as an ASIC or FPGA; or a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function.
It should be understood that the training device 520 may be a combination of a hardware system without an instruction execution function and a hardware system with an instruction execution function. Some of the steps related to model training provided in the embodiments of the present application may also be implemented by the hardware system without an instruction execution function in the training device 520, which is not limited here.
3. Video processing cloud services provided by a server:
In one possible implementation, the server may provide a video processing service to the device side through an application programming interface (API).
The terminal device may send relevant parameters (for example, a video) to the server through the API provided by the cloud, and the server may obtain a processing result based on the received parameters and return the processing result to the terminal.
For the description of the terminal and the server, reference may be made to the description in the above embodiments, which is not repeated here.
FIG. 4 shows the process of using a video processing cloud service provided by a cloud platform.
1. Activate and purchase the video processing service.
2. The user may download the software development kit (SDK) corresponding to the video processing service. Usually, the cloud platform provides SDKs for multiple development environments for the user to choose from according to the requirements of the development environment, such as a Java SDK, a Python SDK, a PHP SDK, and an Android SDK.
3. After downloading the SDK of the corresponding version to the local machine as needed, the user imports the SDK project into the local development environment and configures and debugs it there. Other functions may also be developed in the local development environment, so as to form an application that integrates the video processing capability.
4. While the video processing application is in use, when the video processing function is needed, an API call for that function may be triggered. When the application triggers the video processing function, it initiates an API request to the running instance of the video processing service in the cloud environment, where the API request carries image data, and the running instance in the cloud environment processes the image to obtain a processing result.
5. The cloud environment returns the processing result to the application, thereby completing one invocation of the video processing function.
Since the embodiments of the present application involve extensive application of neural networks, for ease of understanding, the relevant terms and concepts such as neural networks involved in the embodiments of the present application are first introduced below.
(1) Neural network
A neural network may be composed of neural units. A neural unit may be an operation unit that takes x_s (i.e., input data) and an intercept of 1 as inputs, and the output of the operation unit may be:

h_{W,b}(x) = f(W^T x) = f( Σ_{s=1}^{n} W_s x_s + b )

where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, which introduces nonlinearity into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such single neural units, that is, the output of one neural unit may be the input of another. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of that local receptive field, where a local receptive field may be a region composed of several neural units.
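As a minimal illustration of the formula above, the following sketch computes the output of a single neural unit in plain NumPy. The input, weight, and bias values are illustrative choices, not values from the application:

```python
import numpy as np

def sigmoid(z):
    # The sigmoid activation f mentioned in the text.
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(xs, ws, b):
    # Output of a single neural unit: f(sum_s W_s * x_s + b).
    return sigmoid(np.dot(ws, xs) + b)

# Hypothetical inputs/weights; W.x + b = 0.5 - 0.5 + 0 = 0, so the output is sigmoid(0) = 0.5.
out = neural_unit(np.array([1.0, 2.0]), np.array([0.5, -0.25]), b=0.0)
```
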
(2) A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers, and this feature extractor can be regarded as a filter. A convolutional layer is a neuron layer in a convolutional neural network that performs convolution on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons in adjacent layers. A convolutional layer usually contains several feature planes, and each feature plane may be composed of neural units arranged in a rectangle. Neural units in the same feature plane share weights, and the shared weights are the convolution kernel. Weight sharing can be understood as meaning that the way features are extracted is independent of position. A convolution kernel can be initialized as a matrix of random size, and during training of the convolutional neural network the kernel can learn reasonable weights. In addition, a direct benefit of weight sharing is that it reduces the connections between layers of the convolutional neural network while also lowering the risk of overfitting.
CNNs are very common neural networks, so the structure of the CNN is introduced in detail below. As described in the basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.
(3) Deep neural network
A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no special metric for "many" here. Divided by the position of the layers, the layers of a DNN fall into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is actually simple: it is the linear relation y = α(W x + b), where x is the input vector, y is the output vector, b is the offset (bias) vector, W is the weight matrix (also called the coefficients), and α(·) is the activation function. Each layer simply applies this operation to its input vector x to obtain its output vector y.
Since a DNN has many layers, there are correspondingly many coefficient matrices W and offset vectors b. These parameters are defined in the DNN as follows. Taking the coefficient W as an example: in a three-layer DNN, the linear coefficient from the 4th neuron in the second layer to the 2nd neuron in the third layer is defined as W^3_{24}, where the superscript 3 indicates the layer of the coefficient, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as W^L_{jk}. Note that the input layer has no W parameters. In a deep neural network, more hidden layers allow the network to better capture complex real-world situations. Theoretically, a model with more parameters has higher complexity and greater "capacity", meaning it can handle more complex learning tasks. Training a deep neural network is essentially the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors W of many layers).
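The per-layer relation y = α(Wx + b) described above can be sketched as a minimal forward pass. The layer sizes, weight values, and ReLU activation below are hypothetical choices for illustration only:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def forward(x, layers):
    # Each layer applies the linear relation y = alpha(W @ x + b) from the text.
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Hypothetical 3-layer DNN: 4 inputs -> 3 hidden units -> 2 outputs.
layers = [
    (np.ones((3, 4)), np.zeros(3)),  # W^2 (input -> hidden), b^2
    (np.ones((2, 3)), np.zeros(2)),  # W^3 (hidden -> output), b^3
]
y = forward(np.ones(4), layers)
```
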
(4) Loss function
When training a deep neural network, because we want the output of the network to be as close as possible to the value it is actually meant to predict, we can compare the current network's prediction with the truly desired target value and then update the weight vectors of each layer according to the difference between the two (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer of the deep neural network). For example, if the network's prediction is too high, the weight vectors are adjusted so that it predicts lower, and the adjustment continues until the deep neural network can predict the truly desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function (or objective function): an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a greater difference, so training a deep neural network becomes the process of reducing this loss as much as possible.
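For example, one common choice of loss function is the mean squared error; a minimal sketch (an illustrative choice, not necessarily the loss used in the application):

```python
import numpy as np

def mse_loss(pred, target):
    # Mean squared error: the average of the squared differences between
    # the predicted values and the target values.
    return float(np.mean((pred - target) ** 2))

# One prediction is off by 1, the other is exact: loss = (1 + 0) / 2 = 0.5.
loss = mse_loss(np.array([1.0, 2.0]), np.array([0.0, 2.0]))
```
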
(5) Backpropagation algorithm
The error back propagation (BP) algorithm can be used to correct the parameters of the initial model during training so that the model's error loss becomes smaller and smaller. Specifically, forward propagation of the input signal through to the output produces an error loss, and the parameters of the initial model are updated by propagating the error loss information backward, so that the error loss converges. The backpropagation algorithm is a backward pass dominated by the error loss, aimed at obtaining optimal model parameters, such as the weight matrices.
Unlike image recognition, video processing tasks (such as video understanding) require more computing resources. It is therefore important to research lightweight video understanding models that reduce resource consumption. Lightweight video understanding models have many practical applications, such as autonomous driving, robotics, and industrial control. However, existing video understanding models focus more on achieving higher accuracy, leading to rather large model designs, and pay little attention to improving the performance of small models in on-device applications.
Therefore, a lightweight model design for video processing is urgently needed.
In the prior art, both convolution-based and transformer-based methods treat the time axis of the video sequence as a separate dimension during video processing. They require a large amount of extra computation and memory to process information along the time dimension, consume substantial resources, and offer no clear speed advantage, which makes them ill-suited for deployment on mobile devices. An embodiment of the present application proposes a lightweight video understanding model based on time-axis compression and temporal feature recovery, which can be applied to video understanding tasks in mobile and on-device applications with superior performance.
An embodiment of the present application provides a data processing method. The data processing method of the embodiments of the present application is described in detail below with reference to the accompanying drawings.
Refer to FIG. 5, which is a schematic flowchart of a data processing method provided by an embodiment of the present application. As shown in FIG. 5, the data processing method provided by an embodiment of the present application may include steps 501 to 503, which are described in detail below.
501. Obtain a feature of a video, where the feature includes multiple feature maps distributed along a time dimension, and the dimensions of the feature maps include a channel dimension and a spatial dimension.
The feature of the video may be obtained by performing feature extraction on the video. Since the video itself includes data distributed along the time dimension, that is, multiple time frames, each corresponding to one frame of image data, the feature extracted from the video may include multiple feature maps distributed along the time axis, and each feature map may include a channel dimension and a spatial dimension. In other words, the feature of a video is at least four-dimensional, and such high dimensionality leads to excessive computational overhead in subsequent operations.
502. Change the distribution of the multiple feature maps from the time dimension to the channel dimension or the spatial dimension to obtain one or more first feature maps.
Existing video understanding models perform poorly on mobile terminals, while better-performing models consume too many resources to be deployed on device-side equipment; resource consumption and performance are hard to balance for mobile models. The idea of the present application is to reduce the dimensionality of the video feature. Specifically, the feature along the time dimension is compressed into the channel dimension or the spatial dimension, so that the compressed feature (that is, the first feature map in the embodiments of the present application) no longer includes a time dimension, while the amount of data in the channel and spatial dimensions increases. With the feature dimensionality reduced, the computation required subsequently can be greatly reduced; for example, 3D convolution can be replaced by 2D convolution.
Taking compression into the channel dimension as an example, in one possible implementation, the size of the feature is (x, t, h, w) and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of time steps, h and w are the height and width in the spatial dimension, respectively, and * denotes the product.
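The (x, t, h, w) → (x*t, h, w) compression above amounts to a tensor reshape that folds the time axis into the channel axis; a minimal NumPy sketch with hypothetical sizes:

```python
import numpy as np

# Hypothetical sizes: x=3 channels, t=4 frames, spatial size 8x8.
x, t, h, w = 3, 4, 8, 8
feature = np.arange(x * t * h * w, dtype=float).reshape(x, t, h, w)

# Fold the time axis into the channel axis: (x, t, h, w) -> (x*t, h, w).
# With C-order reshape, the t frames of each channel end up consecutive
# along the new channel axis, preserving temporal order within each channel.
first_feature_map = feature.reshape(x * t, h, w)
```
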
In one possible implementation, the first feature map and the feature have the same size in the spatial dimension, while the size of the first feature map in the channel dimension is larger than the size of the video feature in the channel dimension.
In one possible implementation, the feature maps at different time steps among the multiple feature maps may be stacked along the channel dimension to obtain the one or more first feature maps, and the stacking order may follow the order along the time dimension.
Taking compression into the spatial dimension as an example, in one possible implementation, the feature maps at different time steps among the multiple feature maps may be stacked along the spatial dimension to obtain the one or more first feature maps. In this case the first feature map and the feature have the same size in the channel dimension, while the size of the first feature map in the spatial dimension is larger than the size of the video feature in the spatial dimension.
Alternatively, the time dimension of the video feature may be compressed into the channel dimension and the spatial dimension simultaneously. In this case, the size of the first feature map in the channel dimension is larger than the size of the video feature in the channel dimension, and the size of the first feature map in the spatial dimension is larger than the size of the video feature in the spatial dimension.
As shown in FIG. 6, FIG. 6 is an overall block diagram of a lightweight video understanding model based on time-axis compression and temporal feature enhancement proposed in an embodiment of the present application. The overall network pipeline is as follows: (1) For an input video sequence X with tensor shape (3, t, h, w), its time axis is compressed into the channel axis through a tensor reshaping operation (which greatly reduces the computing resources consumed by subsequent feature processing), yielding an input of shape (3t, h, w).
503. Obtain a task processing result of the video according to the one or more first feature maps.
In one possible implementation, the first feature map may be processed by a feature extraction network to obtain one or more second feature maps, and the task processing result of the video may be obtained through a task network based on the one or more second feature maps.
After the time axis of the video sequence is compressed into the channel axis, the temporal information in the feature channels is mixed together, and object relationships within a frame or across frames become harder to measure. Therefore, time-related recovery and enhancement need to be performed on the compressed feature.
In one possible implementation, the feature extraction network may include, but is not limited to, a convolution-based or transformer-based network. For example, the feature extraction network may include multiple feature extraction units, with different feature extraction units connected by downsampling operations and used to process features at different scales.
In the embodiments of the present application, in order to perform time-related recovery and enhancement on the compressed feature (the first feature map), relevant operations are added to the feature extraction network, which are introduced next:
In one possible implementation, the feature extraction network may include a first weight determination module and a convolution module. The first weight determination module and the convolution module may belong to at least one of the above feature extraction units.
The first weight determination module is configured to determine, based on an input feature map, the weights corresponding to the channel dimension of the input feature map. The weights can express the temporal importance carried by each channel, and different channels may correspond to different weights. The convolution module is configured to perform a convolution operation on the input feature map to obtain convolution results for the multiple channel dimensions, and to fuse those convolution results according to the weights to obtain a processing result. In this way, the weights obtained by the first weight determination module are used to recover the temporal information in the feature, thereby enhancing the temporal information and improving the processing performance of the network.
Exemplarily, as shown in FIG. 6, the above first weight determination module and convolution module may form the temporal importance branch in the temporal channel learning unit. The temporal channel learning unit consists of two branches: one is the temporal importance branch, and the other is the cross-temporal object interaction module. The outputs of the two branches are fused together by a summation operation to obtain the final output.
The temporal importance learning branch is used to capture the temporal importance on the channels. It consists of a 1×1 temporal attention convolution, whose formula is as follows:

f(x, y) = Σ_{m=1}^{c} w_m · Σ_{i=1}^{k} Σ_{j=1}^{k} g_m(i, j) · h_m(x + i, y + j)

where f(x, y) is the output value of the 2D convolution at the point (x, y), k is the size of the convolution kernel, c is the number of channels, g denotes the convolution kernel, and h denotes the feature map. The convolution operation may be implemented based on the convolution module introduced in the above embodiment. w_m is an input-adaptive weight computed from the input feature (that is, the weight corresponding to the channel dimension of the input feature map introduced above). w_m can be obtained in various ways, for example through a multi-layer perceptron or a global attention module (which may be implemented based on the first weight determination module introduced in the above embodiment).
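The temporal importance branch can be sketched as follows. This is a simplified, single-output-channel stand-in: the input-adaptive weights w_m are derived here from global average pooling followed by a softmax, whereas the application allows a multi-layer perceptron or a global attention module instead; none of the function names below come from the application:

```python
import numpy as np

def temporal_importance_branch(feat):
    # feat: (C, H, W), the time-compressed feature map (time folded into channels).
    # Input-adaptive per-channel weights w_m: global average pooling of each
    # channel followed by a softmax (a hypothetical choice of weight generator).
    stats = feat.mean(axis=(1, 2))                 # (C,) global average pooling
    wm = np.exp(stats) / np.exp(stats).sum()       # softmax -> one weight per channel
    # 1x1 "temporal attention convolution", reduced here to a single output
    # channel: the channel-wise convolution results are fused by the weights.
    out = np.tensordot(wm, feat, axes=(0, 0))      # (H, W)
    return out, wm

feat = np.ones((6, 4, 4))
out, wm = temporal_importance_branch(feat)
```
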
In one possible implementation, the feature extraction network may include a transformation module and an interaction module. The transformation module and the interaction module may belong to at least one of the above feature extraction units. In particular, the transformation module and the interaction module may belong to the same feature extraction unit as the first weight determination module and the convolution module introduced above.
For example, the result obtained by the interaction module (or a result obtained by further processing the interaction module's output) may be further fused with the processing result obtained by the first weight determination module.
In one possible implementation, the transformation module is configured to expand the input feature map into multiple feature maps distributed along the time dimension, for example through a channel conversion function. The number of expanded feature maps along the time dimension may be equal to the number of time steps of the video feature.
The interaction module is configured to perform feature interaction among the obtained multiple feature maps distributed along the time dimension.
For example, in one possible implementation, the interaction includes interaction based on an attention mechanism, or interaction implemented through large-kernel convolution.
In one possible implementation, the transformation module is further configured to determine a temporal encoding whose size is consistent with that of the multiple feature maps obtained by the transformation module. The temporal encoding may be a four-dimensional tensor, and each feature value of the multiple feature maps obtained by the transformation module may correspond to one encoding value in the temporal encoding. The temporal encoding and the multiple feature maps obtained by the transformation module may be fused to obtain fused multiple feature maps along the time dimension, where the fusion may be an element-wise addition at corresponding positions. The interaction module is then specifically configured to perform feature interaction among the fused multiple feature maps along the time dimension.
In addition, a residual connection may be added. In one possible implementation, the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature, for example by first mapping the interaction result and the input feature to information of the same size and then performing element-wise addition at corresponding positions.
The transformation module and interaction module described above can be referred to as the cross-temporal object interaction module, which can be used to restore temporal information and capture object relationships between different frames. Figure 6 (c) is a schematic diagram of the cross-temporal object interaction module; the function it depicts is expressed by the following formula:
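The formula itself appears to have been lost in extraction. From the symbol definitions that follow (channel conversion, temporal encoding, cross-time function, sigmoid, element-wise product), a plausible reconstruction is the following, where the names $\Phi$ for the channel conversion function and $\Gamma$ for the cross-time function, and the exact placement of the sigmoid gate, are assumptions introduced here:

```latex
F_o \;=\; \varphi\!\Big(\Phi_{T \to C_o}\big(\Gamma(\Phi_{C_i \to T}(F_b) + F_T)\big)\Big) \odot F_b
```

That is, the input feature $F_b$ is converted from $C_i$ channels to $T$ channels, the temporal encoding $F_T$ is added, the cross-time function captures object relationships across frames, the channel count is converted to the output count, and the sigmoid output gates the input feature element-wise.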
Here, F_b denotes the input feature and F_T the temporal encoding; the channel conversion function converts the number of channels from C_i to T; the cross-time function is used to learn object relationships between different frames; φ denotes the sigmoid activation function; and element-wise multiplication combines the results. The channel conversion function can be implemented by a 2D convolution; the temporal encoding F_T is a learnable or non-learnable positional encoding; and the cross-time function can be implemented by any module that captures global positional relationships, for example a large-kernel 2D convolution or a windowed global attention mechanism.
The cross-temporal object interaction module restores temporal information and captures object relationships within a frame or between different frames. The module converts the channel count to the number of frames, adds temporal information through a temporal position encoding, captures object relationships across frames through the cross-time function, and finally converts the channel count to the output channel count.
Referring to Figure 6, the input can be processed by a 5×5 convolution with a stride of 2 and then passed through a max-pooling module to obtain the input features. The input features are fed into a subsequent network of four stages, each containing a different number of temporal channel learning (CTL) blocks and each processing features at a different scale; the downsampled features for each next stage are obtained through a downsampling operation. The output features of the last stage are fed into an average pooling and a fully connected layer to obtain the final output of the network.
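The stage layout above can be sketched at the shape level. The stage widths and the 224×224 input below are illustrative assumptions, not values from the patent; each tuple is the (channels, height, width) a stage would emit:

```python
# Shape-level sketch of the backbone: a stride-2 5x5 conv stem, a max-pool,
# four stages of CTL blocks with downsampling between scales, then global
# average pooling and a fully connected head.

def downsample(c_out, h, w, stride=2):
    # ceil-division models a stride-2 convolution or pooling layer
    return c_out, (h + stride - 1) // stride, (w + stride - 1) // stride

def backbone_shapes(h=224, w=224, widths=(64, 128, 256, 512)):
    c, h, w = downsample(64, h, w)         # 5x5 stride-2 convolution stem
    c, h, w = downsample(c, h, w)          # max-pooling module
    shapes = []
    for c_out in widths:                   # four stages of CTL blocks
        c, h, w = downsample(c_out, h, w)  # downsampling to the next scale
        shapes.append((c, h, w))
    return shapes                          # last stage feeds avg pool + FC

print(backbone_shapes())  # [(64, 28, 28), (128, 14, 14), (256, 7, 7), (512, 4, 4)]
```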
This framework needs only 2D convolutions to process video. Its core module is the CTL block, which performs the main work of temporal feature enhancement. The temporal learning unit block can be a residual block containing a 1×1 convolution that reduces the number of input channels to 1/4 of the original, a core temporal channel learning (CTL) unit, and another 1×1 convolution that expands the channel count by a factor of 4, restoring the original number of input channels.
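The bottleneck structure of the block can be sketched as follows; the core CTL unit is stubbed with an identity placeholder here, and all names and sizes are illustrative assumptions:

```python
import numpy as np

# Bottleneck sketch of the temporal learning unit block: a 1x1 conv
# reduces channels to C/4, a core CTL unit (placeholder identity here)
# processes the reduced maps, a second 1x1 conv expands back to C
# channels, and a residual connection adds the input. All 2D operations.

def conv1x1(x, w):
    """x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum('oc,chw->ohw', w, x)

def ctl_block(x, w_reduce, w_expand, ctl=lambda f: f):
    reduced = conv1x1(x, w_reduce)           # C -> C/4
    enhanced = ctl(reduced)                  # core temporal channel learning unit
    expanded = conv1x1(enhanced, w_expand)   # C/4 -> C
    return x + expanded                      # residual connection

rng = np.random.default_rng(2)
C, H, W = 16, 8, 8
x = rng.standard_normal((C, H, W))
out = ctl_block(x, rng.standard_normal((C // 4, C)), rng.standard_normal((C, C // 4)))
print(out.shape)  # (16, 8, 8): the expansion conv restores the channel count
```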
通过上述方式,本申请同时提出时间重要性学习和跨时间物体交互的时间特征增强策略来增强压缩后的特征学习,提升整体网络的性能。Through the above methods, this application simultaneously proposes a temporal feature enhancement strategy of temporal importance learning and cross-temporal object interaction to enhance compressed feature learning and improve the performance of the overall network.
Next, the beneficial effects of the embodiments of the present application are described in combination with experiments. Extensive experiments were carried out on the Kinetics-400 dataset to verify the effectiveness of the present invention. As shown in Figure 7, the two-branch structure of the temporal channel learning unit in the embodiments of the present application performs best.
此外,本发明的所提出的时间通道学习单元以及跨时间物体交互模块,还可以嵌入到普通2D卷积网络中去,不限于视频理解任务中,例如,可应用于普通2D骨干网络中提升网络的性能。In addition, the temporal channel learning unit and cross-temporal object interaction module proposed in the present invention can also be embedded in ordinary 2D convolutional networks. They are not limited to video understanding tasks. For example, they can be applied to ordinary 2D backbone networks to improve network performance.
Referring to Figure 8, which is a schematic diagram of the structure of a data processing apparatus provided in an embodiment of the present application. As shown in Figure 8, the apparatus 800 includes:
获取模块801,用于获取视频的特征,所述特征包括在时间维度上分布的多个特征图,所述特征图的维度包括通道维度和空间维度,将所述多个特征图由在所述时间维度上的分布变为在所述通道维度或所述空间维度上的分布,得到所述一个或多个第一特征图;所述第一特征图不包括所述时间维度;An acquisition module 801 is configured to acquire features of a video, wherein the features include multiple feature maps distributed along a time dimension, wherein the dimensions of the feature maps include a channel dimension and a spatial dimension, and to convert the multiple feature maps from being distributed along the time dimension to being distributed along the channel dimension or the spatial dimension, thereby obtaining one or more first feature maps; wherein the first feature maps do not include the time dimension.
关于获取模块801的具体描述可以参照上述实施例中步骤501和502的描述,这里不再赘述。For a detailed description of the acquisition module 801 , reference may be made to the description of steps 501 and 502 in the above embodiment, which will not be repeated here.
处理模块802,用于通过特征提取网络处理所述第一特征图,得到一个或多个第二特征图,并根据所述一个或多个第二特征图,通过任务网络,得到所述视频的任务处理结果。The processing module 802 is used to process the first feature map through a feature extraction network to obtain one or more second feature maps, and obtain a task processing result of the video through a task network based on the one or more second feature maps.
关于处理模块802的具体描述可以参照上述实施例中步骤503的描述,这里不再赘述。For a detailed description of the processing module 802 , reference may be made to the description of step 503 in the above embodiment, which will not be repeated here.
在一种可能的实现中,所述任务处理结果为视频理解任务、基于视频的生成任务或者视频增强任务的处理结果。In a possible implementation, the task processing result is a processing result of a video understanding task, a video-based generation task, or a video enhancement task.
在一种可能的实现中,所述第一特征图和所述特征在空间维度上的尺寸相同。In a possible implementation, the first feature map and the feature have the same size in spatial dimension.
在一种可能的实现中,所述特征的尺寸为(x,t,h,w),所述第一特征图的尺寸为(x*t,h,w),其中,x为通道数量,t为时间数量,h和w分别为空间维度上的高度和宽度,*为乘积。In one possible implementation, the size of the feature is (x, t, h, w), and the size of the first feature map is (x*t, h, w), where x is the number of channels, t is the number of times, h and w are the height and width in the spatial dimension, respectively, and * is the product.
在一种可能的实现中,所述压缩模块,具体用于:In a possible implementation, the compression module is specifically configured to:
将所述多个特征图中在不同时间维度上的特征图在通道维度上进行堆叠,得到所述一个或多个第一特征图。The feature maps at different time dimensions in the multiple feature maps are stacked in the channel dimension to obtain the one or more first feature maps.
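The stacking described above, taking a feature of size (x, t, h, w) to a first feature map of size (x*t, h, w), can be sketched directly with a transpose and reshape; the sizes below are illustrative:

```python
import numpy as np

# Sketch of the compression step: T feature maps of shape (C, H, W)
# distributed along the time dimension are stacked along the channel
# axis, giving a single time-free map of shape (C*T, H, W).
rng = np.random.default_rng(3)
C, T, H, W = 3, 4, 5, 5

video_feature = rng.standard_normal((C, T, H, W))   # size (x, t, h, w)
stacked = video_feature.transpose(1, 0, 2, 3).reshape(C * T, H, W)

print(stacked.shape)  # (12, 5, 5): the first feature map has no time dimension
```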
在一种可能的实现中,所述特征提取网络包括第一权重确定模块和卷积模块;In one possible implementation, the feature extraction network includes a first weight determination module and a convolution module;
所述第一权重确定模块,用于基于输入的特征图,确定所述输入的特征图的通道维度对应的权重;The first weight determination module is used to determine the weight corresponding to the channel dimension of the input feature map based on the input feature map;
所述卷积模块,用于对所述输入的特征图进行卷积运算,得到多个通道维度的卷积运算结果,并根据所述权重将所述多个通道维度的卷积运算结果进行融合,得到处理结果。The convolution module is used to perform a convolution operation on the input feature map to obtain convolution operation results of multiple channel dimensions, and fuse the convolution operation results of the multiple channel dimensions according to the weights to obtain a processing result.
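The weighted fusion described above can be sketched as follows. The squeeze-style weight branch (global average pool followed by a sigmoid) is an assumption introduced here for illustration; the patent text only states that the weights are derived from the input feature map:

```python
import numpy as np

# Hedged sketch: a weight branch derives one weight per channel from the
# input map (assumed: global average pool + sigmoid), the convolution
# branch produces per-channel results (here a 1x1 conv), and fusion is a
# weighted sum over the channel dimension.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def weighted_channel_fusion(x, w_conv):
    """x: (C, H, W) input feature map; w_conv: (C, C) 1x1 conv weights."""
    weights = sigmoid(x.mean(axis=(1, 2)))               # one weight per channel
    per_channel = np.einsum('oc,chw->ohw', w_conv, x)    # conv results, C channels
    return np.einsum('c,chw->hw', weights, per_channel)  # weighted fusion

rng = np.random.default_rng(4)
x = rng.standard_normal((6, 7, 7))
fused = weighted_channel_fusion(x, rng.standard_normal((6, 6)))
print(fused.shape)  # (7, 7)
```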
在一种可能的实现中,所述特征提取网络包括变换模块和交互模块;In one possible implementation, the feature extraction network includes a transformation module and an interaction module;
所述变换模块,用于将输入的特征图扩展为时间维度上分布的多个特征图;The transformation module is used to expand the input feature map into multiple feature maps distributed in the time dimension;
所述交互模块,用于对得到的所述时间维度上分布的多个特征图之间进行特征交互。The interaction module is used to perform feature interaction between the multiple feature maps distributed in the time dimension.
在一种可能的实现中,所述交互,包括:In a possible implementation, the interaction includes:
基于注意力机制的交互、或者通过大核卷积实现的交互。Interaction based on attention mechanism or interaction achieved through large kernel convolution.
在一种可能的实现中,所述变换模块,还用于:In a possible implementation, the transformation module is further configured to:
确定和所述变换模块得到的所述多个特征图的尺寸一致的时间编码;Determine a temporal encoding having a size consistent with the plurality of feature maps obtained by the transform module;
Fuse the temporal encoding with the multiple feature maps obtained by the transformation module to obtain multiple fused feature maps in the time dimension;
所述交互模块,具体用于对所述融合后的时间维度上的多个特征图之间进行特征交互。The interaction module is specifically used to perform feature interaction between the multiple feature maps in the fused time dimension.
在一种可能的实现中,所述交互模块,还用于将通过所述交互得到的交互结果和所述输入的特征进行融合。In a possible implementation, the interaction module is further configured to fuse the interaction result obtained through the interaction with the input feature.
接下来介绍本申请实施例提供的一种终端设备,请参阅图9,图9为本申请实施例提供的终端设备的一种结构示意图,终端设备900具体可以表现为虚拟现实VR设备、手机、平板、笔记本电脑、智能穿戴设备等,此处不做限定。具体的,终端设备900包括:接收器901、发射器902、处理器903和存储器904(其中终端设备900中的处理器903的数量可以一个或多个,图9中以一个处理器为例),其中,处理器903可以包括应用处理器9031和通信处理器9032。在本申请的一些实施例中,接收器901、发射器902、处理器903和存储器904可通过总线或其它方式连接。Next, a terminal device provided in an embodiment of the present application is introduced. Please refer to Figure 9. Figure 9 is a structural diagram of a terminal device provided in an embodiment of the present application. The terminal device 900 can be specifically manifested as a virtual reality VR device, a mobile phone, a tablet, a laptop computer, a smart wearable device, etc., which is not limited here. Specifically, the terminal device 900 includes: a receiver 901, a transmitter 902, a processor 903 and a memory 904 (wherein the number of processors 903 in the terminal device 900 can be one or more, and Figure 9 takes one processor as an example), wherein the processor 903 may include an application processor 9031 and a communication processor 9032. In some embodiments of the present application, the receiver 901, the transmitter 902, the processor 903 and the memory 904 may be connected via a bus or other means.
The memory 904 may include read-only memory and random access memory, and provides instructions and data to the processor 903. A portion of the memory 904 may also include non-volatile random access memory (NVRAM). The memory 904 stores operation instructions executable by the processor, executable modules, or data structures, or subsets or extensions thereof; the operation instructions may include various operation instructions for implementing various operations.
处理器903控制执行设备的操作。具体的应用中,执行设备的各个组件通过总线系统耦合在一起,其中总线系统除包括数据总线之外,还可以包括电源总线、控制总线和状态信号总线等。但是为了清楚说明起见,在图中将各种总线都称为总线系统。Processor 903 controls the operation of the execution device. In specific applications, the various components of the execution device are coupled together via a bus system. In addition to a data bus, the bus system may also include a power bus, a control bus, and a status signal bus. However, for clarity, all bus systems are referred to as a bus system in the figure.
上述本申请实施例揭示的方法可以应用于处理器903中,或者由处理器903实现。处理器903可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器903中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器903可以是通用处理器、数字信号处理器(digital signal processing,DSP)、微处理器或微控制器,还可进一步包括专用集成电路(application specific integrated circuit,ASIC)、现场可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。该处理器903可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器904,处理器903读取存储器904中的信息,结合其硬件完成上述方法中涉及模型训练或者模型推理过程的步骤。The methods disclosed in the above embodiments of the present application can be applied to the processor 903 or implemented by the processor 903. The processor 903 can be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above method can be completed by the hardware integrated logic circuit in the processor 903 or by instructions in the form of software. The above processor 903 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and can further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components. The processor 903 can implement or execute the various methods, steps and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor can be a microprocessor or the processor can also be any conventional processor, etc. The steps of the method disclosed in conjunction with the embodiments of the present application can be directly embodied as being executed by a hardware decoding processor, or can be executed by a combination of hardware and software modules in the decoding processor. 
The software module can be located in a storage medium well-known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 904. Processor 903 reads information from memory 904 and, in conjunction with its hardware, completes the steps involved in the model training or model inference process in the above method.
接收器901可用于接收输入的数字或字符信息,以及产生与执行设备的相关设置以及功能控制有关的信号输入。发射器902可用于通过第一接口输出数字或字符信息;发射器902还可用于通过第一接口向磁盘组发送指令,以修改磁盘组中的数据;发射器902还可以包括显示屏等显示设备。Receiver 901 can be used to receive input digital or character information and generate signal input related to executing device-related settings and function control. Transmitter 902 can be used to output digital or character information through the first interface. Transmitter 902 can also be used to send instructions to the disk pack through the first interface to modify data in the disk pack. Transmitter 902 can also include a display device such as a display screen.
本申请实施例还提供了一种服务器,请参阅图10,图10是本申请实施例提供的服务器一种结构示意图,服务器1000可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上中央处理器(central processing units,CPU)1010(例如,一个或一个以上处理器)和存储器1032,一个或一个以上存储应用程序1042或数据1044的存储介质1030(例如一个或一个以上海量存储设备)。其中,存储器1032和存储介质1030可以是短暂存储或持久存储。存储在存储介质1030的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对服务器中的一系列指令操作。更进一步地,中央处理器1010可以设置为与存储介质1030通信,在服务器1000上执行存储介质1030中的一系列指令操作。The embodiment of the present application also provides a server. Please refer to Figure 10. Figure 10 is a structural diagram of a server provided by an embodiment of the present application. The server 1000 may have relatively large differences due to different configurations or performances, and may include one or more central processing units (CPU) 1010 (for example, one or more processors) and a memory 1032, and one or more storage media 1030 (for example, one or more mass storage devices) storing application programs 1042 or data 1044. Among them, the memory 1032 and the storage medium 1030 can be temporary storage or permanent storage. The program stored in the storage medium 1030 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations in the server. Furthermore, the central processing unit 1010 can be configured to communicate with the storage medium 1030 to execute a series of instruction operations in the storage medium 1030 on the server 1000.
服务器1000还可以包括一个或一个以上电源1026,一个或一个以上有线或无线网络接口1050,一个或一个以上输入输出接口1058;或,一个或一个以上操作系统1041,例如Windows ServerTM,Mac OS XTM,UnixTM,LinuxTM,FreeBSDTM等等。The server 1000 may also include one or more power supplies 1026, one or more wired or wireless network interfaces 1050, one or more input and output interfaces 1058; or, one or more operating systems 1041, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
本申请实施例中,中央处理器1010,用于执行上述实施例中和模型训练或者模型推理相关的动作。In an embodiment of the present application, the central processing unit 1010 is used to execute actions related to model training or model reasoning in the above embodiments.
本申请实施例中还提供一种包括计算机程序产品,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。An embodiment of the present application also provides a computer program product, which, when running on a computer, enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
本申请实施例中还提供一种计算机可读存储介质,该计算机可读存储介质中存储有用于进行信号处理的程序,当其在计算机上运行时,使得计算机执行如前述执行设备所执行的步骤,或者,使得计算机执行如前述训练设备所执行的步骤。A computer-readable storage medium is also provided in an embodiment of the present application, which stores a program for signal processing. When the computer-readable storage medium is run on a computer, it enables the computer to execute the steps executed by the aforementioned execution device, or enables the computer to execute the steps executed by the aforementioned training device.
本申请实施例提供的执行设备、训练设备或终端设备具体可以为芯片,芯片包括:处理单元和通信单元,所述处理单元例如可以是处理器,所述通信单元例如可以是输入/输出接口、管脚或电路等。该处理单元可执行存储单元存储的计算机执行指令,以使执行设备内的芯片执行上述实施例描述的数据处理方法,或者,以使训练设备内的芯片执行上述实施例描述的数据处理方法。可选地,所述存储单元为所述芯片内的存储单元,如寄存器、缓存等,所述存储单元还可以是所述无线接入设备端内的位于所述芯片外部的存储单元,如只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)等。The execution device, training device or terminal device provided in the embodiments of the present application may specifically be a chip, and the chip includes: a processing unit and a communication unit, wherein the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit, etc. The processing unit may execute the computer execution instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiment, or so that the chip in the training device executes the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register, a cache, etc. The storage unit may also be a storage unit located outside the chip in the wireless access device end, such as a read-only memory (ROM) or other types of static storage devices that can store static information and instructions, a random access memory (RAM), etc.
具体的,请参阅图11,图11为本申请实施例提供的芯片的一种结构示意图,所述芯片可以表现为神经网络处理器NPU 1100,NPU 1100作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路1103,通过控制器1104控制运算电路1103提取存储器中的矩阵数据并进行乘法运算。Specifically, see Figure 11 , which illustrates a schematic diagram of the chip structure provided in an embodiment of the present application. The chip may be a neural network processor (NPU) 1100. NPU 1100 is mounted on a host CPU (host CPU) as a coprocessor, with tasks assigned by the host CPU. The core of the NPU is arithmetic circuit 1103, which is controlled by controller 1104 to retrieve matrix data from memory and perform multiplication operations.
在一些实现中,运算电路1103内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路1103是二维脉动阵列。运算电路1103还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路1103是通用的矩阵处理器。In some implementations, arithmetic circuit 1103 includes multiple processing elements (PEs). In some implementations, arithmetic circuit 1103 is a two-dimensional systolic array. Arithmetic circuit 1103 can also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1103 is a general-purpose matrix processor.
For example, assume there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from weight memory 1102 and caches it on each PE in the arithmetic circuit. The arithmetic circuit then fetches matrix A data from input memory 1101 and performs the matrix operation with matrix B; the partial or final results of the resulting matrix are stored in the accumulator 1108.
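The accumulation of partial results can be modeled in a few lines. This is a toy model of the data flow only; the weight-stationary ordering (B cached, A streamed, one reduction step per cycle) is an assumption consistent with the description above:

```python
import numpy as np

# Toy model of the PE-array flow: matrix B is held stationary (cached per
# PE), columns of A stream in, and partial products accumulate in an
# accumulator until the full result A @ B is formed.
A = np.arange(6, dtype=float).reshape(2, 3)   # input matrix from input memory
B = np.ones((3, 2))                           # weight matrix cached in the PEs
acc = np.zeros((2, 2))                        # accumulator for partial results

for k in range(A.shape[1]):                   # one reduction step per cycle
    acc += np.outer(A[:, k], B[k, :])         # partial result accumulated

print(np.allclose(acc, A @ B))  # True: the accumulator holds the final matrix
```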
统一存储器1106用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)1105,DMAC被搬运到权重存储器1102中。输入数据也通过DMAC被搬运到统一存储器1106中。Unified memory 1106 is used to store input and output data. Weight data is directly transferred to weight memory 1102 via the Direct Memory Access Controller (DMAC) 1105. Input data is also transferred to unified memory 1106 via the DMAC.
BIU为Bus Interface Unit即,总线接口单元1110,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer,IFB)1109的交互。BIU stands for Bus Interface Unit, that is, the bus interface unit 1110, which is used for the interaction between the AXI bus and the DMAC and instruction fetch buffer (IFB) 1109.
总线接口单元1110(Bus Interface Unit,简称BIU),用于取指存储器1109从外部存储器获取指令,还用于存储单元访问控制器1105从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit 1110 (BIU) is used for the instruction fetch memory 1109 to obtain instructions from the external memory, and is also used for the storage unit access controller 1105 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器1106或将权重数据搬运到权重存储器1102中或将输入数据数据搬运到输入存储器1101中。DMAC is mainly used to move input data in the external memory DDR to the unified memory 1106 or to move weight data to the weight memory 1102 or to move input data to the input memory 1101.
向量计算单元1107包括多个运算处理单元,在需要的情况下,对运算电路1103的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/全连接层网络计算,如Batch Normalization(批归一化),像素级求和,对特征平面进行上采样等。The vector calculation unit 1107 includes multiple processing units. When necessary, it further processes the output of the calculation circuit 1103, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, etc. It is mainly used for non-convolutional/fully connected layer network calculations in neural networks, such as batch normalization, pixel-level summation, and upsampling of feature planes.
在一些实现中,向量计算单元1107能将经处理的输出的向量存储到统一存储器1106。例如,向量计算单元1107可以将线性函数;或,非线性函数应用到运算电路1103的输出,例如对卷积层提取的特征平面进行线性插值,再例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元1107生成归一化的值、像素级求和的值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路1103的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, the vector calculation unit 1107 can store the processed output vector in the unified memory 1106. For example, the vector calculation unit 1107 can apply a linear function or a nonlinear function to the output of the operation circuit 1103, such as linear interpolation of the feature plane extracted by the convolution layer, or accumulate a vector of values to generate an activation value. In some implementations, the vector calculation unit 1107 generates a normalized value, a pixel-level summed value, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1103, for example, for use in subsequent layers in a neural network.
控制器1104连接的取指存储器(instruction fetch buffer)1109,用于存储控制器1104使用的指令;An instruction fetch buffer 1109 connected to the controller 1104 is used to store instructions used by the controller 1104;
统一存储器1106,输入存储器1101,权重存储器1102以及取指存储器1109均为On-Chip存储器。外部存储器私有于该NPU硬件架构。Unified memory 1106, input memory 1101, weight memory 1102, and instruction fetch memory 1109 are all on-chip memories. External memories are private to the NPU hardware architecture.
其中,上述任一处提到的处理器,可以是一个通用中央处理器,微处理器,ASIC,或一个或多个用于控制上述程序执行的集成电路。The processor mentioned in any of the above places can be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.
另外需说明的是,以上所描述的装置实施例仅仅是示意性的,其中所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部模块来实现本实施例方案的目的。另外,本申请提供的装置实施例附图中,模块之间的连接关系表示它们之间具有通信连接,具体可以实现为一条或多条通信总线或信号线。It should also be noted that the device embodiments described above are merely illustrative, wherein the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the present embodiment. In addition, in the drawings of the device embodiments provided in this application, the connection relationship between the modules indicates that there is a communication connection between them, which can be specifically implemented as one or more communication buses or signal lines.
通过以上的实施方式的描述,所属领域的技术人员可以清楚地了解到本申请可借助软件加必需的通用硬件的方式来实现,当然也可以通过专用硬件包括专用集成电路、专用CPU、专用存储器、专用元器件等来实现。一般情况下,凡由计算机程序完成的功能都可以很容易地用相应的硬件来实现,而且,用来实现同一功能的具体硬件结构也可以是多种多样的,例如模拟电路、数字电路或专用电路等。但是,对本申请而言更多情况下软件程序实现是更佳的实施方式。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在可读取的存储介质中,如计算机的软盘、U盘、移动硬盘、ROM、RAM、磁碟或者光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,训练设备,或者网络设备等)执行本申请各个实施例所述的方法。Through the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary general hardware, and of course can also be implemented by special hardware including application-specific integrated circuits, special CPUs, special memories, special components, etc. In general, all functions performed by computer programs can be easily implemented with corresponding hardware, and the specific hardware structures used to implement the same function can also be diverse, such as analog circuits, digital circuits or special circuits, etc. However, for the present application, software program implementation is a better implementation method in most cases. Based on this understanding, the technical solution of the present application is essentially or the part that contributes to the prior art can be embodied in the form of a software product, which is stored in a readable storage medium, such as a computer's floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk or optical disk, etc., and includes a number of instructions to enable a computer device (which can be a personal computer, training equipment, or network equipment, etc.) to execute the methods described in each embodiment of the present application.
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。In the above embodiments, all or part of the embodiments may be implemented by software, hardware, firmware, or any combination thereof. When implemented by software, all or part of the embodiments may be implemented in the form of a computer program product.
所述计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行所述计算机程序指令时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、训练设备或数据中心通过有线(例如同轴电缆、光纤、数字用户线(DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、训练设备或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存储的任何可用介质或者是包含一个或多个可用介质集成的训练设备、数据中心等数据存储设备。所述可用介质可以是磁性介质,(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质(例如固态硬盘(Solid State Disk,SSD))等。The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the process or function described in the embodiment of the present application is generated in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions can be transmitted from one website, computer, training equipment or data center to another website, computer, training equipment or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can store or a data storage device such as a training device, data center, etc. that includes one or more available media. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive (SSD)).
Claims (19)
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410239097.5A CN120580617A (en) | 2024-03-01 | 2024-03-01 | A data processing method and device thereof |
| CN202410239097.5 | 2024-03-01 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2025180090A1 true WO2025180090A1 (en) | 2025-09-04 |
| WO2025180090A9 WO2025180090A9 (en) | 2025-10-30 |
Family
ID=96863398
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/CN2025/071236 Pending WO2025180090A1 (en) | 2024-03-01 | 2025-01-08 | Data processing method and apparatus |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112926472A (en) * | 2021-03-05 | 2021-06-08 | 深圳先进技术研究院 | Video classification method, device and equipment |
| CN113284055A (en) * | 2021-03-18 | 2021-08-20 | 华为技术有限公司 | Image processing method and device |
| CN116542289A (en) * | 2023-03-31 | 2023-08-04 | 华为技术有限公司 | Data processing method and device |
| CN117391138A (en) * | 2023-09-28 | 2024-01-12 | 华为技术有限公司 | A data processing method and its device |
| US20240029406A1 (en) * | 2021-04-08 | 2024-01-25 | Huawei Technologies Co., Ltd. | Image processing method, training method, and apparatus |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2025180090A9 (en) | 2025-10-30 |
| CN120580617A (en) | 2025-09-02 |
Similar Documents
| Publication | Title |
|---|---|
| CN111738403B (en) | Neural network optimization method and related equipment |
| CN113065636A (en) | Pruning processing method, data processing method and equipment for convolutional neural network |
| CN113516227B (en) | Neural network training method and device based on federal learning |
| WO2024260402A1 (en) | Data processing method and apparatus |
| CN114169393B (en) | Image classification method and related equipment thereof |
| CN113536970A (en) | Training method of video classification model and related device |
| WO2025040048A1 (en) | Data processing method and apparatus |
| WO2025067211A1 (en) | Data processing method and apparatus |
| CN112529149A (en) | Data processing method and related device |
| WO2024213099A1 (en) | Data processing method and apparatus |
| CN116258651A (en) | Image processing method and related device |
| CN116542289A (en) | Data processing method and device |
| WO2024245349A1 (en) | Data processing method and apparatus |
| CN114707643A (en) | Model segmentation method and related equipment thereof |
| WO2024140630A1 (en) | Model training method and related device |
| US20250157071A1 (en) | Data processing method and apparatus |
| WO2025092554A1 (en) | Data processing method and apparatus thereof |
| WO2025031373A1 (en) | Data processing method and device |
| WO2025044967A1 (en) | Data processing method and apparatus |
| WO2024055952A1 (en) | Data processing method and apparatus thereof |
| WO2025180090A1 (en) | Data processing method and apparatus |
| CN116433621A (en) | Data processing method and device |
| CN113065638B (en) | Neural network compression method and related equipment |
| CN117193523A (en) | Data processing method and device thereof |
| CN117765341A (en) | Data processing method and related devices |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 25760634; Country of ref document: EP; Kind code of ref document: A1 |