
WO2025108230A1 - Multi-model reasoning method and system, network processing unit, extended reality display chip and apparatus, and image processing method - Google Patents


Info

Publication number
WO2025108230A1
Authority
WO
WIPO (PCT)
Prior art keywords
model
storage space
reasoning
models
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/CN2024/132671
Other languages
French (fr)
Chinese (zh)
Inventor
陶表犁
庄佳衍
尹文
王易诚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gravityxr Electronics And Technology Co Ltd
Original Assignee
Gravityxr Electronics And Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202311587154.0A external-priority patent/CN120046721A/en
Priority claimed from CN202311582338.8A external-priority patent/CN120047302A/en
Application filed by Gravityxr Electronics And Technology Co Ltd filed Critical Gravityxr Electronics And Technology Co Ltd
Publication of WO2025108230A1 publication Critical patent/WO2025108230A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00 General purpose image data processing
    • G06T 1/20 Processor architectures; Processor configuration, e.g. pipelining

Definitions

  • the present invention relates to neural network reasoning technology and extended reality display technology, and in particular to a multi-model reasoning method, a multi-model reasoning system, a network processing unit, an extended reality display chip, an extended reality display device, a parallel processing method for image data, and a computer-readable storage medium.
  • a network processing unit is a processor that uses circuits to simulate human neurons and synaptic structures. It is widely used in the field of artificial intelligence (AI) technology to simulate humans' processing of various complex information such as images, sounds, and languages.
  • the existing technology usually uses multiple independently packaged NPU chips or multi-core NPU chips to perform multi-model reasoning of multiple neural network models.
  • the independently packaged NPU chips have fixed computing and storage resources.
  • this leads to problems such as high chip area, power consumption and cost, so such chips cannot be integrated on a large scale.
  • the multiple NPU cores (Core) of a multi-core NPU chip can share part of the chip's storage resources, so the above-mentioned problems of high area, power consumption, cost and insufficient flexibility also exist, and such chips cannot adapt to the multi-scenario and multi-functional application requirements of XR display devices.
  • the field is therefore in urgent need of a multi-model reasoning technology that can flexibly switch between multiple neural network models in a time-division multiplexed manner according to actual needs, thereby realizing latency-optimized reasoning and/or storage-space-optimized reasoning of multiple models.
  • Extended Reality (XR) display technology is an immersive display technology that uses modern high-tech means with computers as the core to create a digital environment that combines real and virtual things, providing users with seamless transitions between the virtual world and the real world. It mainly includes Virtual Reality (VR) display, Augmented Reality (AR) display, Mixed Reality (MR) display and many other implementation methods.
  • the existing technology usually needs to tile the entire image through the image signal processor (ISP), divide it into multiple N×M (N<H, M<W) tile images according to the data processing capability of the network processing unit (NPU), and store them in the ISP buffer.
  • the network processing unit (NPU) then reads the image data of each tile image from the ISP buffer one by one for network reasoning.
  • this introduces additional operations and intermediate data, such as slicing, zero-filling and windowing, and trimming of overlapping images, which increases the requirements for the overall computing power of the system.
  • it also requires the system to have a larger data cache space, which greatly increases the area, power consumption and cost of the image signal processor (ISP), thereby severely limiting the development and application of the existing network processing unit (NPU) to parallel processing of multiple channels of image data.
  • the present invention provides a multi-model reasoning method, a multi-model reasoning system, and a computer-readable storage medium, which can determine at least one switching node from multiple processing nodes of each model according to the goals of latency priority and/or storage space priority, and determine the storage space corresponding to each model, and then use the full computing resources of the network processing unit to perform time-division multiplexing reasoning of each model in turn according to the switching node, so as to achieve latency optimized reasoning and/or storage space optimized reasoning of multiple models.
  • the present invention also provides a network processing unit, an extended reality display chip, an extended reality display device, a parallel processing method for image data, and a computer-readable storage medium, which can acquire multiple channels of image data line by line in parallel, write each channel of image data into a buffer at a preset speed, and then use the full computing resources of the network processing unit through the i model to read and process the written i-th channel of image data from the buffer.
  • the present invention can improve the accuracy of image processing by eliminating the need to tile the entire image, and reduce the requirements for the overall computing power of the system and data cache space, thereby improving the ability of a single network processing unit (NPU) to process multiple channels of image data in parallel under the same hardware processing technology and cost conditions, and thereby flexibly and fully utilizing the overall computing power of the system to reduce network inference delay, so as to meet the needs of parallel processing of binocular image data in XR devices.
  • FIG. 1 shows an architecture diagram of a network processing unit provided according to some embodiments of the present invention.
  • FIG. 2 shows a flowchart of a multi-model reasoning method provided according to some embodiments of the present invention.
  • FIG. 3 shows a flowchart of determining switching nodes and storage space distribution for latency optimization reasoning according to some embodiments of the present invention.
  • FIG. 4 shows a schematic diagram of latency optimization reasoning provided according to some embodiments of the present invention.
  • FIG. 5 shows a flowchart of determining a switching node and storage space distribution for storage space optimization reasoning according to some embodiments of the present invention.
  • FIG. 6 shows a schematic diagram of storage space optimization reasoning provided according to some embodiments of the present invention.
  • FIG. 7 shows an architecture diagram of an extended reality display chip provided according to some embodiments of the present invention.
  • FIG. 8 shows a flowchart of parallel processing of multiple channels of image data according to some embodiments of the present invention.
  • FIG. 9 is a schematic diagram showing a line-by-line acquisition of image data according to some embodiments of the present invention.
  • FIG. 10 shows a structural diagram of an XR display chip provided according to some embodiments of the present invention.
  • FIG. 11 is a schematic diagram showing a noise reduction process according to some embodiments of the present invention.
  • the terms “installed”, “connected”, and “connected” should be understood in a broad sense, for example, it can be a fixed connection, a detachable connection, or an integral connection; it can be a mechanical connection or an electrical connection; it can be a direct connection, or it can be indirectly connected through an intermediate medium, or it can be the internal communication of two components.
  • although the terms “first”, “second”, “third”, etc. may be used herein to describe various components, regions, layers and/or parts, these components, regions, layers and/or parts should not be limited by these terms; these terms are only used to distinguish different components, regions, layers and/or parts. Therefore, the first component, region, layer and/or part discussed below may be referred to as a second component, region, layer and/or part without departing from some embodiments of the present invention.
  • the present invention provides a multi-model reasoning method, a multi-model reasoning system and a computer-readable storage medium, which can determine at least one switching node from multiple processing nodes of each model according to the goals of latency priority and/or storage space priority, and determine the storage space corresponding to each model, and then use the full computing resources of the network processing unit to perform time-division multiplexing reasoning of each model in turn according to the switching node, so as to achieve latency optimized reasoning and/or storage space optimized reasoning of multiple models.
  • the multi-model reasoning method provided in the first aspect of the present invention can be implemented via the multi-model reasoning system provided in the second aspect of the present invention.
  • the multi-model reasoning system can be configured in a network processing unit (NPU) chip, which is configured with a memory and a processor.
  • the memory includes but is not limited to the computer-readable storage medium provided in the third aspect of the present invention, on which computer instructions are stored.
  • the processor is connected to the memory and is configured to execute the computer instructions stored on the memory to implement the multi-model reasoning method provided in the first aspect of the present invention.
  • Figure 1 shows an architecture diagram of a network processing unit provided according to some embodiments of the present invention.
  • Figure 2 shows a flow chart of a multi-model reasoning method provided according to some embodiments of the present invention.
  • a multiplication and accumulation (MAC) array computing unit 11, a vector processing unit (VPU) 12, a time division multiplexing (TDM) unit 13, and a static random access memory (SRAM) for storing neural network reasoning data are integrated in the NPU chip 10.
  • the SRAM can meet the needs of neural network reasoning and is specifically divided into a parameter storage space SRAM_W 141 for storing neural network model parameters and an input and intermediate data storage space SRAM_F 142 for storing input data and intermediate data of neural network reasoning.
  • the NPU chip 10 can first obtain the reasoning data of multiple models M1 to Mm from a signal source such as an image signal processor (ISP) of the XR display device.
  • the reasoning data can be image data of multiple complete images synchronously obtained by multiple image signal processors, or multiple sliced image data obtained by one image signal processor after slicing (Tile) a complete image according to the cache space size of SRAM_F.
  • the multiple sliced images correspond to a part of the complete image respectively, and the amount of data is not greater than the cache space of SRAM_F.
  • the multiple models M1 to Mm can be the same neural network model with the same structure, parameters and functions, or different neural network models with different structures, parameters and functions.
  • the TDM unit 13 of the NPU chip 10 can search and determine at least one switching node from the multiple processing nodes of each model M1~Mm according to the goals of delay priority and/or storage space priority, and determine the static storage space corresponding to each model M1~Mm, and then the NPU chip 10 writes the reasoning data of each model M1~Mm into the input and intermediate data storage space SRAM_F in the corresponding static storage space for each model M1~Mm to read.
  • the neural network model can include multiple inherent processing nodes respectively, and complete a cycle of neural network reasoning by sequentially executing the relevant operations of each processing node.
  • the switching node is obtained by screening from each processing node according to the goals of delay priority and/or storage space priority, and is correspondingly distributed according to the delay requirements of each neural network model and/or the total cache data volume of the intermediate data generated by each neural network model to indicate the opening and closing time of the processing window of each neural network model.
  • Figure 3 shows a flow chart of determining a switching node and storage space distribution for latency optimization reasoning according to some embodiments of the present invention.
  • Figure 4 shows a schematic diagram of latency optimization reasoning according to some embodiments of the present invention.
  • the process of determining the storage space distribution of the latency optimization reasoning can be implemented in an offline manner based on a large number of pre-prepared reasoning data samples.
  • the present invention can first assume that there are m neural network models M1 to Mm that need to be processed in parallel, and respectively determine the reasoning cycle P1 to Pm (in ms) for each model M1 to Mm to complete a reasoning.
  • the TDM unit 13 in the NPU chip 10 can statically divide the full storage space of the NPU chip 10 into parameter storage spaces W1 to Wm for storing parameters of each model, and input and intermediate data storage spaces F1 to Fm for storing input data and intermediate data of each model according to the number of models m that need to be processed in parallel, so as to determine the first storage space distribution for storing the reasoning data of each model.
  • the parameter storage space W1 to Wm of each model M1 to Mm can be determined by compiling each model M1 to Mm respectively, and setting the target performance FPS (frames per second) and total bandwidth, and its total size should not exceed SRAM_W 141, that is, SUM(W1:Wm) ≤ SRAM_W.
  • the input and intermediate data storage space F1~Fm of each model M1~Mm can also be determined by compiling each model M1~Mm separately and setting the target performance FPS (frames per second) and total bandwidth, and the total size should not exceed SRAM_F 142, that is, SUM(F1:Fm) ≤ SRAM_F.
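  • as a minimal illustration of this static partitioning constraint (a sketch only; the SRAM sizes, model count and per-model requirements below are hypothetical example values, not figures from this disclosure), the feasibility check can be expressed as follows:
```python
# Minimal sketch of the static partition feasibility check described above.
# All sizes are hypothetical example values, not figures from the disclosure.

SRAM_W = 4 * 1024 * 1024   # total parameter storage space, bytes (assumed)
SRAM_F = 8 * 1024 * 1024   # total input/intermediate data storage space, bytes (assumed)

# Per-model requirements W1..Wm and F1..Fm obtained by compiling each model
# for a target FPS and total bandwidth (example values).
W = [1_500_000, 1_200_000, 900_000]    # parameter space per model
F = [3_000_000, 2_500_000, 2_000_000]  # input + intermediate space per model

def partition_is_feasible(W, F, sram_w, sram_f):
    """Check SUM(W1:Wm) <= SRAM_W and SUM(F1:Fm) <= SRAM_F."""
    return sum(W) <= sram_w and sum(F) <= sram_f

print(partition_is_feasible(W, F, SRAM_W, SRAM_F))  # True if the static split fits on-chip
```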
  • the TDM unit 13 can determine the total cycle N of multi-model reasoning according to the reasoning cycles P1-Pm of each model M1-Mm, and thereby determine the number of inferences of each model M1-Mm within the total cycle N.
  • the TDM unit 13 can determine at least one switching node of each model M1-Mm to be time-division multiplexed from multiple processing nodes inherent to each model M1-Mm according to the total cycle N and the number of inferences that each model M1-Mm needs to complete within a total cycle N, so as to give priority to meeting the delay requirements of each model M1-Mm (i.e., the interval between completing two neural network inferences).
  • the present invention uses two adjacent switching nodes i and i+1 to form a rotation time slice T for performing neural network inference on the corresponding model Mi.
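  • the following sketch illustrates, under assumed inference periods, how the total cycle N, the per-model inference counts and a delay-priority rotation of time slices could be derived; the periods, the slice length and the round-robin layout are illustrative assumptions rather than the patented scheduling rule:
```python
# Sketch of the delay-priority scheduling step: derive the total cycle N from the
# per-model inference periods P1..Pm, count how many inferences each model must
# complete within N, and lay the models out as rotating time slices.
# Periods below are illustrative; the real values come from compiling each model.
from math import lcm

P = {"M1": 8, "M2": 16, "M3": 32}          # inference period per model, ms (example)

N = lcm(*P.values())                        # total multi-model cycle, ms
num_inferences = {m: N // p for m, p in P.items()}

# One simple delay-priority layout: a fixed rotation slice T and a round-robin
# order in which each model appears proportionally to its required inference count.
T = N // sum(num_inferences.values())       # duration of one rotation time slice, ms
schedule = []
remaining = dict(num_inferences)
while any(remaining.values()):
    for m in P:
        if remaining[m]:
            schedule.append(m)
            remaining[m] -= 1

print(N, num_inferences, T)
print(schedule)   # ['M1', 'M2', 'M3', 'M1', 'M2', 'M1', 'M1'] for the periods above
```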
  • the TDM unit 13 can perform multi-model reasoning on the inference data samples of each model M1~Mm based on the above-mentioned first storage space distribution and the switching node i of each model M1~Mm, using the full computing resources of the NPU chip 10, so as to determine the storage space lacking in each model M1~Mm.
  • the above-mentioned inference data samples can have the same form and size as the inference data for subsequent multi-model inference, and have similar content (for example: both are image data of the same target size and with noise signals and/or mosaics).
  • the technician can connect the NPU chip 10 to an external buffer, and statically divide the cache space of the external buffer into m blocks according to the compilation requirements of each model M1~Mm, and then load the model parameters of each model M1~Mm into each parameter storage space W1~Wm inside the NPU chip 10 to complete the initialization of each model M1~Mm.
  • the TDM unit 13 may start the MAC array computing unit 11 and the VPU 12 in the NPU chip 10 to run the NPU chip 10 as shown in FIG4 .
  • the MAC array computing unit 11 and the VPU 12 may read the model parameters of the first model M1 from the first storage space W1 corresponding to the first model M1 in the above-mentioned first storage space distribution, and read the inference data samples of the first model M1 from the first storage space F1 corresponding to the first model M1 in the above-mentioned first storage space distribution, so as to perform the first model inference using the full computing resources of the NPU chip 10.
  • the NPU chip 10 may interrupt the first model inference, first write the first intermediate data generated by the first model inference into the first storage space F1 of the SRAM_F 142, then write the overflowed first intermediate data into the corresponding first external storage space, and count the data amount of the overflowed first intermediate data to determine the storage space S_M1_j missing from the first model M1, where j represents the rotation of the jth cycle.
  • the read and write operations and calculation operations of the NPU chip 10 can be performed independently and simultaneously.
  • the NPU chip 10 can also preferably write the inference data samples of the second model M2 into the second storage space F2 corresponding to the second model M2 in the first storage space distribution while performing the first model inference.
  • the MAC array computing unit 11 and the VPU 12 in the NPU chip 10 can read the inference data samples of the second model M2 from the second storage space F2 in real time via on-chip transmission, and switch the full computing resources of the NPU chip 10 to the second model M2 to perform the second model inference, so as to determine the second external storage space S_M2_j used for the second model inference as described above.
  • the present invention can write the inference data of the subsequent model i+1 into the corresponding storage space Fi in advance while performing the neural network inference calculation of the previous model i, thereby further improving the efficiency and real-time performance of multi-model inference.
  • the NPU chip 10 can perform round-robin calculations according to the fixed time slice T of each model M1~Mm as described above, so as to complete multiple rounds of round-robin calculation of all models M1~Mm when the N/T time slices of the total cycle N are consumed.
  • each model Mi has completed Num(Mi) inference calculations, and the external storage requirement size S_Mi_j required by the NPU chip 10 for each model i within a total cycle N is obtained.
  • the TDM unit 13 can optimize the first storage space distribution accordingly to determine a second storage space distribution that meets the multi-model reasoning requirements.
  • otherwise, if the total storage space cannot accommodate the storage space lacking in each model, the TDM unit 13 can determine that the on-chip storage space of the NPU chip 10 cannot support the time-division multiplexing of m models, and that the number of models needs to be reduced.
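  • a rough sketch of this offline profiling and redistribution step is given below; the overflow statistics S_Mi_j, the SRAM_F size and the rule of granting each model its measured peak requirement are illustrative assumptions, not the procedure claimed in this disclosure:
```python
# Sketch of the offline profiling/redistribution step. overflow[Mi][j] stands for
# S_Mi_j, the intermediate data model Mi spilled to the external buffer in rotation j;
# the numbers and the peak-based redistribution rule are assumptions for illustration.

SRAM_F = 8_600_000
F_first = {"M1": 3_000_000, "M2": 3_000_000, "M3": 2_000_000}   # first distribution
overflow = {                                                     # S_Mi_j from profiling
    "M1": [400_000, 350_000],
    "M2": [0, 0],
    "M3": [100_000, 120_000],
}

peak_need = {m: F_first[m] + max(s) for m, s in overflow.items()}
total_need = sum(peak_need.values())

if total_need <= SRAM_F:
    # Second distribution: give each model its measured peak requirement.
    F_second = peak_need
else:
    # On-chip SRAM_F cannot support time-division multiplexing of all m models;
    # the number of concurrently scheduled models has to be reduced.
    F_second = None

print(F_second)
```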
  • the NPU chip 10 can synchronously obtain reasoning data about multiple images through a multi-channel image signal processor (Image Signal Processor, ISP), and/or obtain multi-channel reasoning data by slicing at least one acquired reasoning data. Then, its MAC array computing unit 11 and VPU 12 use the full computing resources of the NPU chip 10 to perform time-division multiplexing reasoning of each model M1~Mm in turn according to the switching nodes determined by the TDM unit 13, so as to achieve latency optimization reasoning of multiple models M1~Mm.
  • the MAC array computing unit 11 and the VPU 12 can read the reasoning data of the first model M1 from the third storage space F1' corresponding to the first model M1 in the predetermined second storage space distribution, so as to use the full computing resources of the NPU chip 10 to perform the first model reasoning.
  • the MAC array computing unit 11 and the VPU 12 can interrupt the first model reasoning, write the third intermediate data generated by the first model reasoning into the third storage space F1', and read the reasoning data of the second model from the fourth storage space F2' corresponding to the second model M2 in the second storage space distribution, so as to use the full computing resources of the NPU chip 10 to perform the second model reasoning.
  • the MAC array computing unit 11 and the VPU 12 may interrupt the inference of the (m-1)th model as described above, write the intermediate data generated by the inference of the (m-1)th model into the storage space F(m-1)’, and read the inference data of model Mm from the storage space Fm’ corresponding to model Mm in the second storage space distribution, so as to utilize the full computing resources of the NPU chip 10 to perform the inference of the mth model, thereby realizing the delay-optimized inference of multiple models M1~Mm, and effectively controlling the delay of each model M1~Mm (i.e., the interval between completing two neural network inferences).
  • Figure 5 shows a flow chart of determining a switching node and storage space distribution for storage space optimization reasoning according to some embodiments of the present invention.
  • Figure 6 shows a schematic diagram of storage space optimization reasoning according to some embodiments of the present invention.
  • in some embodiments, each model M1-Mm performs time-division multiplexed round-robin reasoning according to a fixed time slice T0 or a time slice T composed of delay-priority switching nodes.
  • the present invention can also determine the switching node for storage space optimization reasoning in an offline manner based on a large number of pre-prepared reasoning data samples.
  • the present invention can still assume that there are m neural network models M1~Mm that need to be processed in parallel, and determine the inference cycle P1~Pm (in ms) for each model M1~Mm to complete an inference.
  • the TDM unit 13 in the NPU chip 10 can statically divide the full storage space of the NPU chip 10 into parameter storage spaces W1~Wm for storing parameters of each model, and input and intermediate data storage spaces F1~Fm for storing input data and intermediate data of each model according to the number of models m that need to be processed in parallel, so as to determine the third storage space distribution for storing the inference data of each model M1~Mm.
  • the parameter storage space W1~Wm of each model M1~Mm can be determined by compiling each model M1~Mm respectively and setting the target performance FPS (frames per second) and total bandwidth, and its total size should not exceed SRAM_W 141, that is, SUM(W1:Wm) ≤ SRAM_W.
  • the input and intermediate data storage space F1~Fm of each model M1~Mm can also be determined by compiling each model M1~Mm separately and setting the target performance FPS (frames per second) and total bandwidth, and the total size should not exceed SRAM_F 142, that is, SUM(F1:Fm) ≤ SRAM_F.
  • the TDM unit 13 can use the full computing resources of the NPU chip 10 to perform model reasoning on the inference data samples of each model M1~Mm based on the above-mentioned third storage space distribution and the processing nodes inherent in each model M1~Mm, so as to respectively determine the storage space lacking in each processing node of each model M1~Mm.
  • the inference data sample can have the same form and size as the inference data for subsequent multi-model inference, and have similar content (for example: both are of the same target size and have image data with noise signals and/or mosaics).
  • the technician can connect the NPU chip 10 to an external buffer, and statically divide the cache space of the external buffer into m blocks according to the compilation requirements of each model M1~Mm, and then load the model parameters of each model M1~Mm into each parameter storage space W1~Wm inside the NPU chip 10 to complete the initialization of each model M1~Mm.
  • the TDM unit 13 can start the MAC array computing unit 11 and the VPU 12 in the NPU chip 10 to use the full computing resources of the NPU chip 10 to perform model reasoning on the reasoning data samples of each model M1~Mm in turn.
  • the NPU chip 10 can interrupt the current model reasoning in response to any processing node in the model Mi, first write the fifth intermediate data generated by the model reasoning into the fifth storage space Fi corresponding to the model Mi in the third storage space distribution, and then write the overflowed fifth intermediate data into the external storage space, so as to count the amount of data S_Mi_j of the model Mi overflowing to the external buffer at each processing node j according to the amount of data of the overflowed fifth intermediate data, and thereby determine the storage space lacking in each model M1~Mm at its processing nodes.
  • where j ≤ Ni, j is the jth node of the model Mi, and Ni is the total number of nodes of the model Mi.
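  • a brief sketch of this per-node profiling is shown below; the spill figures and the idea of ranking nodes by the smallest overflow are illustrative assumptions used to make the step concrete:
```python
# Sketch of the per-node profiling used for storage-space-optimized reasoning:
# for each model Mi and each of its Ni processing nodes j, record how much
# intermediate data would overflow the on-chip space Fi (values are illustrative),
# then rank nodes so switching can happen where the least data must be cached.

S = {   # S[model][j] = data overflowing to the external buffer at node j (bytes)
    "M1": [0, 120_000, 40_000, 0, 300_000],
    "M2": [50_000, 0, 0, 200_000],
}

def candidate_switch_nodes(spill_per_node, k=2):
    """Return the k node indices with the smallest overflow, in ascending order of spill."""
    ranked = sorted(range(len(spill_per_node)), key=lambda j: spill_per_node[j])
    return ranked[:k]

for model, spill in S.items():
    print(model, candidate_switch_nodes(spill))
# M1 -> [0, 3]  (nodes where no intermediate data would spill off-chip)
# M2 -> [1, 2]
```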
  • the TDM unit 13 may first start the MAC array computing unit 11 and the VPU 12 in the NPU chip 10, and use the full computing resources of the NPU chip 10 to perform model inference on the inference data samples of each model M1 to Mm in turn.
  • the MAC array computing unit 11 and the VPU 12 may read the model parameters of the model Mi from the sixth storage space Wi corresponding to the model Mi in the third storage space distribution, and read the inference data samples of the model Mi from the sixth storage space Fi corresponding to the model Mi in the third storage space distribution, so as to use the full computing resources of the NPU chip 10 to perform the sixth model inference.
  • the NPU chip 10 can interrupt the model reasoning of the model Mi by time-division multiplexing multiple models, first write the sixth intermediate data generated by the model reasoning into the sixth storage space Fi, and then re-read the sixth intermediate data from the sixth storage space to continue the model reasoning between the first processing node and the second processing node, and so on, until the complete reasoning of the model Mi is completed to determine the calculation time Mi_Tn required for the model Mi.
  • each model M1~Mm can be switched more frequently for time-division multiplexing multi-model reasoning, thereby further reducing the delay of each model M1~Mm (i.e., the interval between completing two neural network inferences).
  • the NPU chip 10 can synchronously obtain reasoning data about multiple images through a multi-channel image signal processor (ISP), and/or obtain multi-channel reasoning data by slicing at least one acquired reasoning data. Then, its MAC array computing unit 11 and VPU 12 use the full computing resources of the NPU chip 10 to perform time-division multiplexing reasoning of each model M1~Mm in turn according to the switching nodes determined by the TDM unit 13, so as to realize storage space optimization reasoning of multiple models M1~Mm.
  • the MAC array computing unit 11 and the VPU 12 can read the reasoning data of the third model Mi from the seventh storage space F1 corresponding to the third model Mi in the predetermined third storage space distribution, so as to perform the third model reasoning using the full computing resources of the NPU chip 10.
  • the MAC array computing unit 11 and the VPU 12 can interrupt the third model reasoning, write the seventh intermediate data generated by the third model reasoning into the seventh storage space F1, and read the reasoning data of the fourth model Mi+1 from the eighth storage space corresponding to the fourth model Mi+1 in the third storage space distribution, so as to perform the fourth model reasoning using the full computing resources of the NPU chip 10.
  • the MAC array computing unit 11 and the VPU 12 can interrupt the inference of the (m-1)th model as described above, write the intermediate data generated by the inference of the (m-1)th model into the storage space F(m-1), and read the inference data of the model Mm from the storage space Fm corresponding to the model Mm in the third storage space distribution, so as to utilize the full computing resources of the NPU chip 10 to perform the inference of the mth model, thereby realizing the storage space optimized inference of multiple models M1~Mm, and effectively controlling the intermediate data generated by the time-division multiplexing inference of each model M1~Mm.
  • the calculation time of each model M1~Mm is divided into multiple small time slots.
  • the above-mentioned multi-model reasoning method, multi-model reasoning system and computer-readable storage medium provided by the present invention can rotate the neural network reasoning calculation of each model M1~Mm by time-division multiplexing reasoning, and immediately save the intermediate data such as the state information calculated by the current model Mi after the corresponding time slot ends, so as to rotate to the neural network reasoning calculation of the next model M(i+1), thereby effectively controlling the delay of each model M1~Mm and reducing the cache space required for the intermediate data generated by multi-model reasoning.
  • each model M1~Mm caches its intermediate data in on-chip storage and shares the full computing resources of the NPU chip 10 for neural network reasoning.
  • the present invention can effectively shorten the switching time between each model M1~Mm and avoid the idle waste of computing resources in the entire multi-model reasoning system, thereby improving the computing efficiency of each model M1~Mm as a whole to support concurrent reasoning of multiple models M1~Mm.
  • the existing technology usually needs to tile the entire image through the image signal processor (Image Signal Processor, ISP), divide it into multiple N×M (N<H, M<W) tile images according to the data processing capability of the network processing unit (NPU), and store them in the ISP buffer. Then the network processing unit (NPU) reads the image data of each tile image from the ISP buffer one by one for network reasoning.
  • the present invention provides a network processing unit, an extended reality display chip, an extended reality display device, a parallel processing method for image data, and a computer-readable storage medium, which can acquire multiple channels of image data line by line in parallel, write each channel of image data into a buffer at a preset speed, and then read and process the written i-th channel of image data from the buffer by using the full computing resources of the network processing unit through the i-th model.
  • the present invention can improve the accuracy of image processing by eliminating the need for tile processing of the entire image, and reduce the requirements for the overall computing power of the system and data cache space, thereby improving the ability of a single network processing unit (NPU) to process multiple channels of image data in parallel under the same hardware processing technology and cost conditions, and thereby flexibly and fully utilizing the overall computing power of the system to reduce network inference delay, so as to meet the needs of parallel processing of binocular image data in extended reality (Extended Reality, XR) display devices.
  • the parallel processing method of the image data provided in the seventh aspect of the present invention can be implemented via the extended reality (XR) display chip provided in the fifth aspect of the present invention.
  • Figure 7 shows an architecture diagram of an extended reality display chip provided according to some embodiments of the present invention.
  • the above-mentioned extended reality (XR) display chip provided in the fifth aspect of the present invention can be configured in the above-mentioned extended reality (XR) display device provided in the sixth aspect of the present invention, which is configured with a memory (not shown), at least two image signal processors 71-72, and a network processing unit 80 provided in the fourth aspect of the present invention.
  • the memory includes but is not limited to the above-mentioned computer-readable storage medium provided in the fifth aspect of the present invention, on which computer instructions are stored.
  • the at least two image signal processors 71-72 are respectively connected to the left eye camera and the right eye camera of the extended reality display device to obtain the real scene image of the real world collected by them.
  • the network processing unit 80 is respectively connected to the memory and each image signal processor 71-72, and is suitable for reading and executing the computer instructions stored on the memory to implement the above-mentioned parallel processing method of image data provided in the fourth aspect of the present invention, so as to alternately obtain the image data outputted by each image signal processor 71-72 line by line, and perform parallel processing on the obtained image data.
  • the network processing unit 80, the extended reality display chip and the extended reality display device are also only a non-restrictive implementation method provided by the present invention, and do not constitute a limitation on the execution subject and execution order of each step in the parallel processing method of the image data.
  • FIG. 8 shows a flowchart of parallel processing of image data according to some embodiments of the present invention.
  • the network processing unit 80 provided by the present invention is configured with hardware devices such as caches 811 to 812, a multiplication and accumulation (MAC) array computing unit 82, a vector processing unit (VPU) 83, and a register 84, and is also configured with a software program containing a plurality of pre-trained image processing models, wherein computing operations and storage operations can be performed separately and independently, thereby enabling full computing resource sharing of the network processing unit 80 through the MAC array computing unit 82 and the vector processing unit (VPU) 83, so as to flexibly support the network processing unit 80 to perform operations such as data reading, computing, and caching.
  • those skilled in the art may also configure a variety of neural network models with different parameters, structures and/or functions to adapt to various image processing requirements such as denoising (Denoise) and demosaicing (Demosaic), so as to correspondingly meet the requirements of multifunctional parallel processing of binocular image data in XR devices.
  • each cache space can be preferably divided into an input buffer space, a parameter buffer space, and a feature buffer space to cache the image data currently to be processed by the corresponding image processing model, the model parameter data, and the intermediate data generated by the neural network inference performed by the corresponding image processing model.
  • the first buffer 811 and the second buffer 812 are used below to refer to the two independent buffer spaces.
  • the first buffer 811 is connected to the image signal processor 71 via the ISP buffer 711
  • the second buffer 812 is connected to the image signal processor 72 via the ISP buffer 721.
  • the two buffers read two channels of image data line by line from the corresponding ISP buffers 711 and 721 at a preset first speed v1, respectively, so that each image processing model can read the currently written image data from the corresponding buffers 811-812 according to the corresponding switching nodes, and use the full computing resources of the network processing unit 80 to process the image data, so as to perform parallel processing of the two channels of image data of the binocular camera.
  • the image processing model can include multiple inherent processing nodes, and complete a cycle of neural network reasoning by sequentially executing the relevant operations of each processing node.
  • the switching node can be obtained by screening from the processing nodes of each image processing model according to the goals of delay priority and/or storage space priority, and correspondingly distributed according to the delay requirements of each neural network model and/or the total cache data volume of intermediate data generated by each neural network model to indicate the opening and closing time of the processing window of each image processing model.
  • each of the above-mentioned image processing models can respectively include multiple inherent processing nodes, and complete a cycle of neural network reasoning by sequentially executing relevant operations of each processing node.
  • the technician can divide the full storage space of the network processing unit 80 in advance according to the number of models N that need to be processed in parallel in an offline manner to determine the first storage space distribution for storing the input image data, model parameter data, intermediate data and other reasoning data of each model.
  • the technician can also determine the total cycle of the N model reasoning to complete one reasoning according to the reasoning cycle of each model to complete one reasoning, and thereby determine the number of reasoning times of each model in the total cycle.
  • the technician can select and determine at least one switching node for time-division multiplexing each model from multiple processing nodes of each model according to the total cycle and the number of reasoning times of each model, and then based on the first storage space distribution and the switching nodes of each model, use the full computing resources of the network processing unit 80 to perform multi-model reasoning on the reasoning data samples of multiple models to determine the storage space lacking for each model.
  • the technicians can optimize the first storage space distribution according to the total storage space of the network processing unit 80 and the storage space lacking in each model to determine the second storage space distribution that meets the multi-model reasoning requirements, and thereby determine the static partitioning scheme of the buffers 811 to 812 to ensure that the parallel processing of the binocular image data of the XR device can be completed within the specified delay range.
  • the technician can divide the full storage space of the network processing unit 80 according to the number of models N that need to be processed in parallel to determine the third storage space distribution for storing the reasoning data of each model, and determine the total cycle of multi-model reasoning according to the reasoning cycle of each model as described above, and then determine the number of reasoning times of each model in the total cycle.
  • the technician can use the full computing resources of the network processing unit 80 to perform model reasoning on the reasoning data samples of each model based on the third storage space distribution and each processing node of each model, so as to respectively determine the storage space lacking in each processing node of each model.
  • the technician can divide each model into different numbers of subgraphs at its processing nodes, in ascending order of the lacking storage space, and perform reasoning tests to respectively determine the calculation time required for each model.
  • the technician can determine the maximum number of subgraphs whose reasoning times and calculation time meet the total performance requirements of multi-model reasoning, and select the switching node that needs to cache the minimum amount of intermediate data from the processing nodes of each model according to the position of the processing node corresponding to the maximum number of subgraphs.
  • the total performance requirement of the multi-model reasoning can be represented by accumulating, over all image processing models, the product of the number of inferences and the calculation time of each model.
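  • the following sketch illustrates one way this selection could work, assuming the total performance requirement is read as the accumulated product of inference count and per-inference calculation time for each model; the cycle budget, timings and per-subgraph switching overhead are invented for illustration:
```python
# Sketch of the subgraph-count selection described above: try increasing numbers of
# subgraphs per model (splitting at the nodes that need the least cached data),
# and keep the largest count whose cumulative inference time, summed as
# num_inferences * calc_time over all models, still fits the total cycle budget.
# Timings and switching overheads below are invented for illustration.

TOTAL_CYCLE_MS = 32.0
num_inferences = {"M1": 4, "M2": 2}
base_calc_ms   = {"M1": 3.0, "M2": 6.0}   # single-inference compute time, unsplit
switch_overhead_ms = 0.2                  # extra cost each additional subgraph adds

def total_time(num_subgraphs):
    total = 0.0
    for m, n in num_inferences.items():
        calc = base_calc_ms[m] + (num_subgraphs - 1) * switch_overhead_ms
        total += n * calc
    return total

best = max(
    (k for k in range(1, 9) if total_time(k) <= TOTAL_CYCLE_MS),
    default=1,
)
print(best, total_time(best))   # largest subgraph count that still fits the cycle
```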
  • the present invention can further reduce the amount of intermediate data generated by time-division multiplexing of each image processing model, thereby further reducing the area, power consumption and cost of the network processing unit 80 and the image signal processors 71-72.
  • the present invention can determine at least one switching node from multiple processing nodes of each image processing model according to the specific goals of latency priority and/or storage space priority, and determine the storage space corresponding to each image processing model, so as to utilize the full computing resources of the network processing unit 80 to perform time-division multiplexing reasoning of each image processing model in sequence, so as to realize parallel reasoning of latency optimization and/or storage space optimization of multiple models.
  • the two-channel image signal processors 71-72 can be connected to the left camera and the right camera of the XR display device respectively, so as to obtain the collected left-eye image data and right-eye image data from the image sensors of the left camera and the right camera respectively, and pre-process them. Afterwards, the two-channel image signal processors 71-72 can continuously transmit the pre-processed left-eye image data and right-eye image data to the corresponding ISP buffers 711-721 line by line at the second speed v2 in a ping-pong buffer read-write manner within the preset exposure time t1.
  • the left-eye image data and the right-eye image data collected by the binocular camera of the XR display device can be original image data with noise signals and/or mosaics.
  • the image signal processors 71-72 can synchronously write the left-eye image data collected by the left-eye camera and the right-eye image data collected by the right-eye camera to the corresponding ISP buffers 711-721 line by line.
  • the input buffer space of the ISP buffers 711-721 will be filled with the 1st to Mth lines of image data of the left-eye image and the right-eye image at the same time.
  • the image signal processors 71-72 can send an interrupt instruction to the network processing unit 80 to notify it to use the above-mentioned exposure time t1 as a fixed time slice and write the 1st to Mth lines of image data cached on the ISP buffers 711-721 line by line to the corresponding buffers 811-812 on the network processing unit 80 at the above-mentioned first speed v1 (v1 ≥ N × v2).
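  • as a purely illustrative check of this speed relationship (the figures are assumptions, not values from this disclosure): with N = 2 image channels each written into its ISP buffer at v2 = 1 line per unit time, the network processing unit has to drain the buffers at a combined rate of at least v1 = N × v2 = 2 lines per unit time; otherwise the lines written during the next exposure window would accumulate in the ISP buffers and eventually overflow.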
  • the network processing unit 80 can first determine that the first processing window of the first model is open, thereby reading the 1st to Mth rows of image data currently written into the left-eye image from the first buffer 811, and driving the above-mentioned MAC array calculation unit 82 and the vector processing unit (VPU) 83, so as to support the first model to process the currently written left-eye image data with the full computing resources of the network processing unit 80, and generate corresponding first intermediate data.
  • the network processing unit 80 can determine that the first processing window of the above-mentioned first model is closed, and the second processing window of the second model is opened, so as to first write the first intermediate data generated by the first model between the first switching node T11 and the second switching node T21 back to the feature buffer space of the first buffer 811 to free up the computing resources of the network processing unit 80, and then read the 1st to Mth rows of image data currently written in the right-eye image from the second buffer 812, and drive the above-mentioned MAC array computing unit 82 and vector processing unit (VPU) 83 to switch all computing resources of the network processing unit 80 to the second model to support the second model to process the currently written right-eye image data to generate corresponding second intermediate data.
  • the image signal processors 71-72 can continue to write the left eye image data collected by the left eye camera and the right eye image data collected by the right eye camera to the corresponding ISP buffers 711-721 line by line while the network processing unit 80 reads the image data cached in the ISP buffers 711-721, so as to realize the dynamic synchronization of reading and writing data.
  • the input buffer space of the ISP buffers 711-721 will be filled with the image data of the M+1 to 2Mth lines of the left eye image and the right eye image again.
  • the image signal processors 71-72 may again send an interrupt instruction to the network processing unit 80, notifying it to continue to use the above-mentioned exposure time t1 as a fixed time slice and write out the M+1 to 2Mth lines of image data cached in the ISP buffers 711-721 to the corresponding buffers 811-812 on the network processing unit 80 line by line at the above-mentioned first speed v1 (v1 ≥ N × v2), thereby preventing the 2M+1 to 3Mth lines of left-eye image data written into the ISP buffers 711-721 within the next exposure time 2t1 to 3t1 from accumulating and overflowing.
  • the network processing unit 80 can determine that the second processing window of the above-mentioned second model is closed, and the first processing window of the first model is opened again, so as to first write the second intermediate data generated by the second model between the second switching node T21 and the first switching node T12 back to the feature buffer space of the second buffer 812 to free up the computing resources of the network processing unit 80, and then read the M+1 to 2Mth rows of image data currently written in the left-eye image from the first buffer 811, as well as the first intermediate data generated by the previous round of the first neural network reasoning, and drive the above-mentioned MAC array computing unit 82 and the vector processing unit (VPU) 83 to switch all computing resources of the network processing unit 80 back to the first model to support the first model to continue processing the currently written left-eye image data to regenerate the corresponding first intermediate data.
  • the two image processing models in the network processing unit 80 can perform neural network inference on the left-eye image and the right-eye image alternately window by window according to the predetermined switching nodes and the reading and writing method of the ping-pong cache, thereby meeting the demand for parallel processing of binocular image data in the XR device by configuring less ISP cache space (for example: 2M lines of image data).
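  • the toy simulation below illustrates this alternating, window-by-window scheme; the line counts, the number of lines per exposure window and the print statements are illustrative assumptions rather than the actual buffer management of the chip:
```python
# Toy simulation of the alternating, window-by-window scheme described above:
# two ISPs each deliver M lines per exposure window, and the two image processing
# models take turns using the full NPU to consume the lines most recently written
# into their respective buffers. Line counts and window sizes are illustrative.

M = 4                 # lines delivered per exposure window t1 (assumed)
TOTAL_LINES = 12      # lines per frame per eye (assumed)

buffers = {"left": [], "right": []}     # NPU-side buffers 811/812 (input portions)
processed = {"left": 0, "right": 0}

for window_start in range(0, TOTAL_LINES, M):
    new_lines = list(range(window_start, window_start + M))
    # ISPs write the next M lines of both eyes during this exposure window.
    buffers["left"].extend(new_lines)
    buffers["right"].extend(new_lines)

    # The NPU then opens the first model's window, processes the newly written
    # left-eye lines, saves its intermediate data, and switches to the second model.
    for eye in ("left", "right"):
        lines = buffers[eye][processed[eye]:]
        processed[eye] += len(lines)
        print(f"window {window_start // M}: model for {eye} eye processed lines {lines}")
```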
  • FIG. 9 shows a schematic diagram of acquiring image data line by line according to some embodiments of the present invention.
  • the XR display chip provided by the present invention can use the sliding window shown in the figure to perform convolution calculation from left to right and from top to bottom in the direction of row priority.
  • This way of reading and calculating image data is consistent with the direction in which the image signal processor 71-72 writes data to the ISP buffer 711-721.
  • the real-time flow of image data from the image signal processor 71-72 to the network processing unit 80 can be achieved by quantitatively configuring the read and write speeds of the image signal processor 71-72 and the network processing unit 80, thereby reducing the waiting time of the image data in the ISP buffer 711-721 cache, so as to reduce the latency of image processing.
  • the row-priority image data reading and calculation method adopted by the present invention only needs to write and cache a few rows (for example: 3 rows corresponding to the sliding window size) of image data to perform neural network inference of the corresponding image in real time, thereby greatly reducing the cache space requirements of the ISP buffers 711 to 721 and, in turn, the area, power consumption and cost of the image signal processor, and eliminating the need for tile processing of the entire image, which reduces the requirements for the overall system computing power and data cache space.
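  • a minimal sketch of why row-priority processing only needs a few buffered rows is given below, using an ordinary 3×3 convolution as a stand-in for the first layer of the image processing model; the image content, kernel and sizes are arbitrary example values:
```python
# Minimal sketch of why row-priority processing only needs a few buffered rows:
# a 3x3 sliding-window convolution can emit one output row as soon as three input
# rows are available, so the line buffer never has to hold the whole image.
import numpy as np

H, W, K = 8, 10, 3
image = np.arange(H * W, dtype=float).reshape(H, W)
kernel = np.ones((K, K)) / (K * K)        # simple box filter as a stand-in

line_buffer = []                          # holds at most K rows at a time
outputs = []
for row in image:                         # rows arrive one at a time, top to bottom
    line_buffer.append(row)
    if len(line_buffer) == K:
        window = np.stack(line_buffer)    # K x W slab currently in the buffer
        out_row = [
            float((window[:, c:c + K] * kernel).sum())
            for c in range(W - K + 1)     # slide left to right along the row
        ]
        outputs.append(out_row)
        line_buffer.pop(0)                # drop the oldest row; keep only K-1 rows

print(len(outputs), "output rows computed with only", K, "rows buffered at once")
```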
  • FIG. 10 shows a structural diagram of an XR display chip provided according to some embodiments of the present invention.
  • in addition to the internal buffer 91 of the network processing unit 90, the XR display chip only needs to configure a small-area, small-capacity (for example: 0.4MB) ISP buffer 92 outside the network processing unit 80 to meet the needs of parallel processing of binocular image data in the XR device, thereby facilitating the further development and application of the network processing unit 80 to parallel processing of multi-channel image data.
  • each image processing model may include a multi-layer neural network structure.
  • the image processing model may perform convolution calculations from left to right and from top to bottom in a row-first direction to synchronously complete the neural network reasoning of the corresponding number of layers.
  • Figure 11 shows a schematic diagram of a noise reduction process provided according to some embodiments of the present invention.
  • as shown in Figure 11, the AI noise reduction (AIDenoise) model is divided into two parts, an encoder and a decoder, at the algorithm module level.
  • AIDenoise reasoning based on machine learning is a process of estimating a latent clean image from an actually observed noisy image.
  • the image processing model can feature map the image through an encoder, and then integrate and restore the image features through a decoder to finally output a clean image with noise eliminated.
  • the image processing model can use convolution to form a jump connection part between the encoder and the decoder, so that the features extracted by image downsampling are integrated into the upsampling part to promote the fusion of feature information.
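  • a minimal PyTorch sketch of such an encoder/decoder structure with a skip connection is shown below; the framework choice, channel counts and depth are assumptions for illustration and are not the network claimed in this disclosure:
```python
# Minimal PyTorch sketch of the encoder/decoder denoiser outlined above, with a
# skip connection carrying downsampled features into the upsampling path.
# Channel counts and depths are placeholders, not the patented network.
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.down = nn.Sequential(nn.MaxPool2d(2), nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        # Skip connection: encoder features are concatenated into the decoder input.
        self.dec = nn.Sequential(nn.Conv2d(32 + 16, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 3, 3, padding=1))

    def forward(self, noisy):
        e = self.enc(noisy)                  # feature mapping of the noisy image
        d = self.up(self.down(e))            # downsample, then restore resolution
        fused = torch.cat([d, e], dim=1)     # fuse down- and up-sampled features
        return self.dec(fused)               # estimate of the clean image

out = TinyDenoiser()(torch.randn(1, 3, 64, 64))
print(out.shape)   # torch.Size([1, 3, 64, 64])
```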
  • in this way, the latent mapping to the corresponding reference image is obtained, and the image processing model can perform neural network reasoning on the acquired noisy image according to the latent mapping to finally obtain a clean image with noise eliminated.
  • this AI denoising not only better preserves the edge texture details of the image, but also can use the architecture of the network processing unit (NPU) for parallel computing, thereby making full use of hardware performance to speed up the computing operation rate.
  • in response to completing the neural network reasoning of the preset number of layers L according to the currently written multiple lines of image data, the image processing model begins to generate result data about the initial multiple lines of the corresponding left/right eye images.
  • the result data generated by the image processing model can be denoised image data after eliminating noise.
  • the result data generated by the image processing model can also be restored image data after eliminating mosaics.
  • the network processing unit 80 can output the generated result data line by line to the corresponding ISP buffers 712 and 722 at the first speed v1 as shown in Figure 7 according to the reading and writing method of the ping-pong buffer, and the ISP buffers 712 and 722 return the received result data of the preset number of lines to the corresponding image signal processors 71-72 at the second speed v2, so that the image data in each image signal processor 71-72 can flow through the network processing unit 80, thereby reducing the size requirements of the ISP buffers 712 and 722 and the internal buffers 811-812 of the network processing unit 80, and reducing the latency of image processing.
  • the network processing unit 80 can complete the neural network reasoning of the preset number of layers L when processing the Kth line of image data, and output the 1st to Mth lines of processing result data, and output the M+1st to 2Mth lines, 2M+1st to 3Mth lines, and other subsequent processing result data window by window as the corresponding model processing window is opened and closed.
  • K can be determined by the receptive field of the image processing model. For the same model that processes binocular image data, it can have the same K value of about 60 to 100 lines.
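  • a small sketch of how K could be estimated from the vertical receptive field of the stacked layers is given below; the layer list is an arbitrary assumption, and the standard receptive-field recurrence is used only to make the relationship concrete:
```python
# Sketch of how the number of input lines K relates to the receptive field:
# for a stack of layers, the vertical receptive field grows with each kernel and
# stride, and the first output line can only be produced once K input lines have
# arrived. The layer list below is an arbitrary example, not the actual model.

def vertical_receptive_field(layers):
    """layers: list of (kernel_height, stride) from first to last layer."""
    rf, jump = 1, 1
    for k, s in layers:
        rf += (k - 1) * jump
        jump *= s
    return rf

example_layers = [(3, 1), (3, 2), (3, 1), (3, 2), (3, 1)]   # assumed stack
K = vertical_receptive_field(example_layers)
print(K)   # number of image lines needed before the first output line
```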
  • the network processing unit 80 can alternately obtain the multiple original image data provided by the multiple image signal processors 71-72 while returning the processing result data to each image signal processor 71-72 in equal amounts, thereby realizing the flow and dynamic balance of image data.
  • the network processing unit 80 may also preferably delete multiple lines of image data that only involve the first L-1 layers of neural network inference to save feature buffer space of the buffers 811-812.
  • the network processing unit 80, XR display chip, XR display device, parallel processing method of image data, and computer-readable storage medium can acquire multiple channels of image data line by line in parallel, write each channel of image data into the buffer at a preset speed, and then use the full computing resources of the network processing unit through the i-th model to read and process the written i-th channel of image data from the buffer.
  • the present invention can improve the accuracy of image processing by eliminating the need for tile processing of the entire image, and reduce the requirements for the overall computing power of the system and data cache space, thereby improving the ability of a single network processing unit (NPU) to process multiple channels of image data in parallel under the same hardware processing technology and cost conditions, and thereby flexibly and fully utilizing the overall computing power of the system to reduce network inference delay, so as to meet the needs of parallel processing of binocular image data in XR devices.
  • the steps of the method or algorithm described in conjunction with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two.
  • the software module may reside in a RAM memory, a flash memory, a ROM memory, an EPROM memory, an EEPROM memory, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to a processor so that the processor can read and write information from/to the storage medium.
  • a storage medium may be integrated into a processor.
  • the processor and the storage medium may reside in an ASIC.
  • the ASIC may reside in a user terminal.
  • the processor and the storage medium may reside in a user terminal as discrete components.
  • the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, each function may be stored on or transmitted over a computer-readable medium as one or more instructions or code.
  • Computer-readable media include both computer storage media and communication media, including any medium that facilitates the transfer of a computer program from one place to another. Storage media may be any available medium that can be accessed by a computer. As an example and not limitation, such a computer-readable medium may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, disk storage or other magnetic storage device, or any other medium that can be used to carry or store the desired program code in the form of an instruction or data structure and can be accessed by a computer.
  • any connection is also properly termed a computer-readable medium; for example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, a fiber optic cable, a twisted pair, a digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies are included in the definition of the medium.
  • Disk and disc as used herein include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, wherein disk often reproduces data magnetically, while disc reproduces data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
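The line-flow behaviour sketched in the feature list above can be pictured with a small, purely hypothetical Python sketch. The values of K and M, the bounded window standing in for the internal buffers 811~812, and the rule of emitting one output window per M further input lines are assumptions made only to illustrate the pattern, not the patented implementation.

```python
from collections import deque

def stream_lines(total_lines, K, M):
    # Bounded window standing in for the NPU's internal feature buffers:
    # only the most recent K input lines are kept at any time.
    window = deque(maxlen=K)
    emitted = 0
    for line_no in range(1, total_lines + 1):
        window.append(line_no)                       # one line written at speed v1
        if line_no == K or (line_no > K and (line_no - K) % M == 0):
            first, last = emitted + 1, emitted + M   # one output window of M result lines
            print(f"emit result lines {first}-{last} (lines buffered: {len(window)})")
            emitted = last
    return emitted  # remaining lines would be flushed when the frame ends

stream_lines(total_lines=720, K=64, M=16)
```

The point of the sketch is only that the first output window appears once K lines have been seen, further windows follow as more lines arrive, and the buffered line count never exceeds K.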


Abstract

The present invention provides a multi-model reasoning method and system, a network processing unit, an extended reality display chip and apparatus, and an image processing method. The multi-model reasoning method comprises the following steps: acquiring reasoning data of a plurality of models; determining at least one switching node from a plurality of processing nodes of each model on the basis of a target having a time delay priority and/or a storage space priority, and determining a storage space corresponding to each model; writing the reasoning data of each model into a corresponding storage space of a network processing unit; and on the basis of the switching node, using full computing resources of the network processing unit to sequentially perform time-division multiplexing reasoning of the models so as to implement time delay optimization reasoning and/or storage space optimization reasoning of the plurality of models. By using the described configurations, the multi-model reasoning method can flexibly switch time-division multiplexing among a plurality of neural network models according to actual requirements, thereby implementing time delay optimization reasoning and/or storage space optimization reasoning of the plurality of models.

Description

Multi-model reasoning method and system, network processing unit, extended reality display chip and device, and image processing method

This application claims priority to the Chinese patent application No. 202311587154.0, filed on November 24, 2023 and entitled "A multi-model reasoning method, system and storage medium", and to the Chinese patent application No. 202311582338.8, filed on November 24, 2023 and entitled "Network processing unit, extended reality display chip and device, image processing method"; the entire contents of both applications are incorporated herein by reference.

Technical Field

The present invention relates to neural network reasoning technology and extended reality display technology, and in particular to a multi-model reasoning method, a multi-model reasoning system, a network processing unit, an extended reality display chip, an extended reality display device, a parallel processing method for image data, and a computer-readable storage medium.

Background Art

A network processing unit (NPU) is a processor that uses circuits to simulate human neurons and synaptic structures. It is widely used in the field of artificial intelligence (AI) technology to simulate humans' processing of various complex information such as images, sounds, and languages.

With the continuous development of AI technology, the data dimensions, data volume and model complexity involved have grown explosively, placing enormous demands on NPU computing power. The existing technology usually uses multiple independently packaged NPU chips or multi-core NPU chips to perform multi-model reasoning over multiple neural network models. However, an independently packaged NPU chip has fixed computing and storage resources: on the one hand, its chip area, power consumption and cost are too high for large-scale integration; on the other hand, it lacks the flexibility to reallocate computing and storage resources as the number of models increases or decreases, and therefore cannot meet the multi-scenario, multi-function application requirements of extended reality (XR) display devices. By contrast, although the multiple NPU cores of a multi-core NPU chip can share part of the chip's storage resources, the number of NPU cores and the computing resources of each core are fixed independently, so the same problems of excessive area, power consumption and cost and of insufficient flexibility remain, and such chips likewise cannot meet the multi-scenario, multi-function application requirements of XR display devices.

In order to overcome the above defects of the prior art, the field urgently needs a multi-model reasoning technology that can flexibly switch the time-division multiplexing among multiple neural network models according to actual requirements, thereby realizing latency-optimized reasoning and/or storage-space-optimized reasoning of the multiple models.

In addition, Extended Reality (XR) display technology is an immersive display technology that uses modern high-tech means with computers as the core to create a digital environment that combines real and virtual things, providing users with seamless transitions between the virtual world and the real world. It mainly includes Virtual Reality (VR) display, Augmented Reality (AR) display, Mixed Reality (MR) display and many other implementation methods.

In the existing technologies in the fields of mobile phones, cameras and surveillance, a separate network processing unit (NPU) is usually configured for each image signal processor (ISP) to perform network inference. As a result, the overall computing power of the system cannot be flexibly and fully utilized, which wastes the system's overall computing power and increases network inference latency.

In addition, after obtaining an entire image of resolution H×W from the camera module, the existing technology usually needs the image signal processor (ISP) to tile the entire image: according to the data processing capability of the network processing unit (NPU), the image is divided into multiple N×M (N&lt;H, M&lt;W) tiles that are stored in the ISP buffer, and the network processing unit (NPU) then reads the image data of each tile from the ISP buffer one by one for network reasoning. On the one hand, this introduces additional operations and intermediate data such as tiling, zero-padding and windowing, and cropping of overlapping images, which raises the requirements on the overall computing power of the system; on the other hand, it requires the system to have a larger data cache space, which greatly increases the area, power consumption and cost of the image signal processor (ISP), thereby severely limiting the development and application of existing network processing units (NPUs) toward the parallel processing of multiple channels of image data.

In order to overcome the above-mentioned defects of the prior art, there is an urgent need in the art for a parallel processing technology for multi-channel image data, which improves the accuracy of image processing by eliminating the need for tile processing of the entire image and reduces the requirements for the overall system computing power and data cache space. Thus, under the same hardware processing technology and cost conditions, the ability of a single network processing unit (NPU) to process multi-channel image data in parallel is improved, and the overall computing power of the system is flexibly and fully utilized to reduce network inference latency, thereby meeting the needs of parallel processing of binocular image data in XR devices.

Summary of the Invention

A brief summary of one or more aspects is given below to provide a basic understanding of these aspects. This summary is not an exhaustive overview of all conceived aspects, and is neither intended to identify the key or decisive elements of all aspects nor to define the scope of any or all aspects. Its only purpose is to give some concepts of one or more aspects in a simplified form as a prelude to a more detailed description that will be given later.

In order to overcome the above-mentioned defects of the prior art, the present invention provides a multi-model reasoning method, a multi-model reasoning system, and a computer-readable storage medium, which can determine at least one switching node from multiple processing nodes of each model according to the goals of latency priority and/or storage space priority, and determine the storage space corresponding to each model, and then use the full computing resources of the network processing unit to perform time-division multiplexing reasoning of each model in turn according to the switching node, so as to achieve latency optimized reasoning and/or storage space optimized reasoning of multiple models.

In addition, the present invention also provides a network processing unit, an extended reality display chip, an extended reality display device, a parallel processing method for image data, and a computer-readable storage medium, which can acquire multiple channels of image data line by line in parallel, write each channel of image data into a buffer at a preset speed, and then use the full computing resources of the network processing unit through the i-th model to read and process the written i-th channel of image data from the buffer. By adopting these configurations, the present invention can improve the accuracy of image processing by eliminating the need to tile the entire image, and reduce the requirements for the overall computing power of the system and data cache space, thereby improving the ability of a single network processing unit (NPU) to process multiple channels of image data in parallel under the same hardware processing technology and cost conditions, and thereby flexibly and fully utilizing the overall computing power of the system to reduce network inference delay, so as to meet the needs of parallel processing of binocular image data in XR devices.

BRIEF DESCRIPTION OF THE DRAWINGS

The above features and advantages of the present invention can be better understood after reading the detailed description of the embodiments of the present disclosure in conjunction with the following drawings. In the drawings, the components are not necessarily drawn to scale, and components with similar related properties or features may have the same or similar reference numerals.

FIG. 1 shows an architecture diagram of a network processing unit provided according to some embodiments of the present invention.

FIG. 2 shows a flowchart of a multi-model reasoning method provided according to some embodiments of the present invention.

FIG. 3 shows a flowchart of determining switching nodes and storage space distribution for latency optimization reasoning according to some embodiments of the present invention.

FIG. 4 shows a schematic diagram of latency optimization reasoning provided according to some embodiments of the present invention.

FIG. 5 shows a flowchart of determining switching nodes and storage space distribution for storage space optimization reasoning according to some embodiments of the present invention.

FIG. 6 shows a schematic diagram of storage space optimization reasoning provided according to some embodiments of the present invention.

FIG. 7 shows an architecture diagram of an extended reality display chip provided according to some embodiments of the present invention.

FIG. 8 shows a flowchart of parallel processing of multiple channels of image data according to some embodiments of the present invention.

FIG. 9 shows a schematic diagram of line-by-line acquisition of image data according to some embodiments of the present invention.

FIG. 10 shows a structural diagram of an XR display chip provided according to some embodiments of the present invention.

FIG. 11 shows a schematic diagram of noise reduction processing according to some embodiments of the present invention.

DETAILED DESCRIPTION

The following specific embodiments illustrate the implementation of the present invention, and those skilled in the art can easily understand other advantages and effects of the present invention from the contents disclosed in this specification. Although the description of the present invention is presented in conjunction with preferred embodiments, this does not mean that the features of the invention are limited to those embodiments. On the contrary, the purpose of presenting the invention in conjunction with embodiments is to cover other options or modifications that may extend from the claims of the present invention. The following description contains many specific details in order to provide a thorough understanding of the present invention; the present invention may also be practiced without these details. In addition, some specific details are omitted from the description to avoid confusing or obscuring the focus of the present invention.

In the description of the present invention, it should be noted that, unless otherwise clearly specified and limited, the terms "installed", "connected" and "coupled" should be understood in a broad sense; for example, they may indicate a fixed connection, a detachable connection, or an integral connection; a mechanical connection or an electrical connection; a direct connection, an indirect connection through an intermediate medium, or internal communication between two components. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.

In addition, the terms "upper", "lower", "left", "right", "top", "bottom", "horizontal" and "vertical" used in the following description should be understood as the directions shown in the paragraph and the related drawings. Such relative terms are only used for the convenience of description and do not mean that the device described therein must be manufactured or operated in a specific direction, and therefore should not be understood as limiting the present invention.

It is understood that although the terms "first", "second", "third", etc. may be used herein to describe various components, regions, layers and/or parts, these components, regions, layers and/or parts should not be limited by these terms, and these terms are only used to distinguish different components, regions, layers and/or parts. Therefore, the first component, region, layer and/or part discussed below may be referred to as a second component, region, layer and/or part without departing from some embodiments of the present invention.

As mentioned above, the prior art usually uses multiple independently packaged network processing unit (NPU) chips or multi-core NPU chips to perform multi-model reasoning over multiple neural network models. However, an independently packaged NPU chip has fixed computing and storage resources: on the one hand, its chip area, power consumption and cost are too high for large-scale integration; on the other hand, it lacks the flexibility to reallocate computing and storage resources as the number of models increases or decreases, and therefore cannot meet the multi-scenario, multi-function application requirements of extended reality (XR) display devices. By contrast, although the multiple NPU cores of a multi-core NPU chip can share part of the chip's storage resources, the number of NPU cores and the computing resources of each core are fixed independently, so the same problems of excessive area, power consumption and cost and of insufficient flexibility remain, and such chips likewise cannot meet the multi-scenario, multi-function application requirements of XR display devices.

In order to overcome the above-mentioned defects existing in the prior art, the present invention provides a multi-model reasoning method, a multi-model reasoning system and a computer-readable storage medium, which can determine at least one switching node from multiple processing nodes of each model according to the goals of latency priority and/or storage space priority, and determine the storage space corresponding to each model, and then use the full computing resources of the network processing unit to perform time-division multiplexing reasoning of each model in turn according to the switching node, so as to achieve latency optimized reasoning and/or storage space optimized reasoning of multiple models.

In some non-limiting embodiments, the multi-model reasoning method provided in the first aspect of the present invention can be implemented via the multi-model reasoning system provided in the second aspect of the present invention. Specifically, the multi-model reasoning system can be configured in a network processing unit (NPU) chip, which is configured with a memory and a processor. The memory includes but is not limited to the computer-readable storage medium provided in the third aspect of the present invention, on which computer instructions are stored. The processor is connected to the memory and is configured to execute the computer instructions stored on the memory to implement the multi-model reasoning method provided in the first aspect of the present invention.

The following will describe the working principles of the above-mentioned NPU chip and multi-model reasoning system in conjunction with some embodiments of the multi-model reasoning method. Those skilled in the art will understand that the embodiments of these multi-model reasoning methods are only some non-restrictive implementation methods provided by the present invention, which are intended to clearly demonstrate the main concept of the present invention and provide some specific solutions that are convenient for the public to implement, rather than to limit all functions or all working modes of the NPU chip and multi-model reasoning system. Similarly, the NPU chip and multi-model reasoning system are only a non-restrictive implementation method provided by the present invention, and do not constitute any limitation on the execution subject and execution order of each step in these multi-model reasoning methods.

Please refer to Figure 1 and Figure 2. Figure 1 shows an architecture diagram of a network processing unit provided according to some embodiments of the present invention. Figure 2 shows a flow chart of a multi-model reasoning method provided according to some embodiments of the present invention.

As shown in FIG. 1, in some embodiments of the present invention, a multiplication and accumulation (MAC) array computing unit 11, a vector processing unit (VPU) 12, a time division multiplexing (TDM) unit 13, and a static random access memory (SRAM) for storing neural network reasoning data are integrated in the NPU chip 10. Here, the SRAM can meet the needs of neural network reasoning and is specifically divided into a parameter storage space SRAM_W 141 for storing neural network model parameters and an input and intermediate data storage space SRAM_F 142 for storing input data and intermediate data of neural network reasoning.

As shown in FIG. 2, in the process of multi-model reasoning, the NPU chip 10 can first obtain the reasoning data of multiple models M1~Mm from a signal source such as an image signal processor (ISP) of the XR display device. Here, the reasoning data can be image data of multiple complete images acquired synchronously by multiple image signal processors, or multiple pieces of tiled image data obtained by one image signal processor after tiling a complete image according to the cache space size of SRAM_F. Each tile corresponds to a part of the complete image, and its data amount is not greater than the cache space of SRAM_F. The multiple models M1~Mm can be identical neural network models with the same structure, parameters and functions, or different neural network models with different structures, parameters and functions.

Afterwards, the TDM unit 13 of the NPU chip 10 can search for and determine at least one switching node from the multiple processing nodes of each model M1~Mm according to the goals of latency priority and/or storage space priority, and determine the static storage space corresponding to each model M1~Mm; the NPU chip 10 then writes the reasoning data of each model M1~Mm into the input and intermediate data storage space SRAM_F within the corresponding static storage space for each model M1~Mm to read. Here, each neural network model contains multiple inherent processing nodes and completes one cycle of neural network reasoning by executing the operations of these processing nodes in sequence. The switching nodes are selected from these processing nodes according to the goals of latency priority and/or storage space priority, and are distributed according to the latency requirement of each neural network model and/or the total amount of intermediate data each neural network model needs to cache, so as to indicate the opening and closing times of each neural network model's processing window.

For details, please refer to Figures 3 and 4. Figure 3 shows a flow chart of determining a switching node and storage space distribution for latency optimization reasoning according to some embodiments of the present invention. Figure 4 shows a schematic diagram of latency optimization reasoning according to some embodiments of the present invention.

In the embodiment shown in FIG. 3, the process of determining the storage space distribution for latency-optimized reasoning can be carried out offline, based on a large number of pre-prepared reasoning data samples. Specifically, assume there are m neural network models M1~Mm to be processed in parallel, and determine the reasoning period P1~Pm (in ms) for each model M1~Mm to complete one inference. The TDM unit 13 in the NPU chip 10 can then, according to the number m of models to be processed in parallel, statically divide the full storage space of the NPU chip 10 into parameter storage spaces W1~Wm for storing the parameters of each model and input and intermediate data storage spaces F1~Fm for storing the input data and intermediate data of each model, so as to determine the first storage space distribution for storing the reasoning data of each model. Here, the parameter storage space W1~Wm of each model M1~Mm can be determined by compiling each model M1~Mm and setting the target performance in FPS (frames per second) and the total bandwidth, and its total size should not exceed SRAM_W 141, i.e., SUM(W1:Wm)≤SRAM_W. Correspondingly, the input and intermediate data storage space F1~Fm of each model M1~Mm can also be determined by compiling each model M1~Mm and setting the target performance in FPS and the total bandwidth, and its total size should not exceed SRAM_F 142, i.e., SUM(F1:Fm)≤SRAM_F.
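As a minimal illustration of these static partition constraints (the compiler and the FPS/bandwidth settings are outside the scope of this sketch), a feasibility check in Python might look as follows; the function name and the kilobyte figures are assumptions introduced only for the example.

```python
def check_static_partition(param_sizes, feature_sizes, sram_w, sram_f):
    # SUM(W1:Wm) must fit the parameter SRAM and SUM(F1:Fm) the feature SRAM.
    return sum(param_sizes) <= sram_w and sum(feature_sizes) <= sram_f

# Hypothetical sizes in KB for three models against a 512 KB / 1024 KB split.
print(check_static_partition([128, 96, 64], [256, 384, 256],
                             sram_w=512, sram_f=1024))  # True
```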

Afterwards, the TDM unit 13 can determine the total cycle N of multi-model reasoning from the reasoning periods P1~Pm of the models M1~Mm, and from this determine how many inferences each model M1~Mm performs within the total cycle N. Specifically, the total cycle N can be determined as the least common multiple (LCM) of the reasoning periods P1~Pm, i.e., N=LCM(P1,P2,…,Pm)=LCM(P1,LCM(P2,…,LCM(Pm-1,Pm))). The TDM unit 13 can obtain the number of inferences of each model i within one total cycle N by calculating Num(Mi)=N/Pi, and thereby determine the timing allocation of the models M1~Mm within one total cycle N (for example, one inference of model M1, two of model M2 and four of model M3 within one total cycle N).
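A minimal sketch of this step in Python, assuming the periods are given in integer milliseconds; it reproduces the example above of one inference of M1, two of M2 and four of M3 per total cycle.

```python
from math import lcm  # Python 3.9+

def plan_total_cycle(periods_ms):
    # N = LCM(P1, ..., Pm); Num(Mi) = N / Pi
    N = lcm(*periods_ms)
    counts = [N // p for p in periods_ms]
    return N, counts

# P1=16 ms, P2=8 ms, P3=4 ms  ->  N=16, one M1, two M2, four M3 per total cycle
print(plan_total_cycle([16, 8, 4]))  # (16, [1, 2, 4])
```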

Afterwards, according to the total cycle N and the number of inferences each model M1~Mm needs to complete within one total cycle N, the TDM unit 13 can determine at least one switching node for time-division multiplexing each model M1~Mm from among the processing nodes inherent to each model, so as to give priority to meeting the latency requirement of each model M1~Mm (i.e., the interval between completing two neural network inferences). Here, two adjacent switching nodes i and i+1 form the rotation time slice T during which the corresponding model Mi performs neural network reasoning. The number of rotation time slices T of model Mi within one total cycle N is Num_T(i)=N/T*Num(Mi).

Afterwards, based on the above first storage space distribution and the switching nodes i of the models M1~Mm, the TDM unit 13 can perform multi-model reasoning on the inference data samples of each model M1~Mm using the full computing resources of the NPU chip 10, so as to determine the storage space each model M1~Mm lacks.

Specifically, the above-mentioned inference data samples can have the same form and size as the inference data for subsequent multi-model inference, and have similar content (for example: both are image data of the same target size and with noise signals and/or mosaics). The technician can connect the NPU chip 10 to an external buffer, and statically divide the cache space of the external buffer into m blocks according to the compilation requirements of each model M1~Mm, and then load the model parameters of each model M1~Mm into each parameter storage space W1~Wm inside the NPU chip 10 to complete the initialization of each model M1~Mm.

Afterwards, the TDM unit 13 can start the MAC array computing unit 11 and the VPU 12 in the NPU chip 10 to run the NPU chip 10 as shown in FIG. 4. In response to starting the first switching node of the first model M1, the MAC array computing unit 11 and the VPU 12 can read the model parameters of the first model M1 from the first storage space W1 corresponding to the first model M1 in the first storage space distribution, and read the inference data samples of the first model M1 from the first storage space F1 corresponding to the first model M1 in the first storage space distribution, so as to perform the first model inference using the full computing resources of the NPU chip 10. In response to starting the second switching node of the subsequent second model M2, the NPU chip 10 can interrupt the first model inference, first write the first intermediate data generated by the first model inference into the first storage space F1 of the SRAM_F 142, then write the overflowing first intermediate data into the corresponding first external storage space, and count the amount of overflowing first intermediate data to determine the storage space S_M1_j that the first model M1 lacks, where j denotes the j-th rotation cycle.
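The offline accounting of the lacking storage space can be pictured with the following Python sketch; the per-rotation intermediate-data sizes and the allocated Fi values are invented numbers used only to illustrate how S_Mi_j and S_Mi=MAX(S_Mi_j) would be tallied.

```python
def profile_missing_space(per_rotation_sizes, feature_space):
    # per_rotation_sizes[name][j]: intermediate data (bytes) model 'name' holds
    # when it is interrupted in rotation j; feature_space[name]: its allocated Fi.
    missing = {}
    for name, sizes in per_rotation_sizes.items():
        spills = [max(0, size - feature_space[name]) for size in sizes]  # S_Mi_j
        missing[name] = max(spills) if spills else 0                      # S_Mi = MAX(S_Mi_j)
    return missing

# Invented numbers: M1 overflows by 2 KB in its worst rotation, M2 never overflows.
print(profile_missing_space({"M1": [10_240, 12_288], "M2": [4_096, 3_072]},
                            {"M1": 10_240, "M2": 8_192}))  # {'M1': 2048, 'M2': 0}
```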

Furthermore, the read/write operations and the computation operations of the NPU chip 10 can be performed independently and simultaneously. The NPU chip 10 can also preferably write the inference data samples of the second model M2 into the second storage space F2 corresponding to the second model M2 in the first storage space distribution while the first model inference is being performed. In this way, in response to starting the second switching node of the subsequent second model M2, the MAC array computing unit 11 and the VPU 12 in the NPU chip 10 can read the inference data samples of the second model M2 from the second storage space F2 in real time via on-chip transmission, and switch the full computing resources of the NPU chip 10 to the second model M2 to perform the second model inference, so as to determine the second external storage space S_M2_j used by the second model inference as described above. By handling the separated data read/write and computation operations in parallel, the present invention can write the inference data of the next model i+1 into its corresponding storage space in advance while the neural network inference of the previous model i is being computed, thereby further improving the efficiency and real-time performance of multi-model inference.

Similarly, the NPU chip 10 can perform rotation calculations according to the fixed time slice T of each model M1~Mm as described above, completing multiple rotations of all models M1~Mm once the N/T time slices of the total cycle are consumed. At this point, each model Mi has completed Num(Mi) inference calculations, and the external storage requirement S_Mi_j of the NPU chip 10 for each model i within one total cycle N is obtained. Afterwards, the TDM unit 13 can take the maximum of the external storage space needed by each model i over the multiple rotations, i.e., S_Mi=MAX(S_Mi_j), to determine the storage space S_Mi that each model M1~Mm lacks in the first storage space distribution.

Please continue to refer to Figure 3. After determining the storage space S_Mi that is missing for each model in the first storage space distribution, the TDM unit 13 can optimize the first storage space distribution accordingly to determine a second storage space distribution that meets the multi-model reasoning requirements.

Specifically, in determining the second storage space distribution, the TDM unit 13 can preferably expand the corresponding input and intermediate data storage space Fi by the storage space S_Mi that model Mi lacks in the first storage space distribution (i.e., Fi'=Fi+S_Mi). In addition, to ensure that the total size of the input and intermediate data storage spaces F1'~Fm' of the models M1~Mm in the second storage space distribution does not exceed SRAM_F 142, i.e., SUM(F1':Fm')≤SRAM_F, the TDM unit 13 can correspondingly reduce the input and intermediate data storage space Fj of at least one of the remaining models Mj (i.e., Fj'=Fj-S_Mi), so as to determine a second storage space distribution in which no model M1~Mm lacks storage space. If no distribution satisfying SUM(F1':Fm')≤SRAM_F can be obtained, the TDM unit 13 can determine that the on-chip storage space of the NPU chip 10 cannot support time-division multiplexing of m models, and the number of models needs to be reduced.
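One possible rebalancing pass is sketched below in Python; how a real compiler would pick which models give up space is not specified in this description, so the largest-donor-first rule used here is only an assumption.

```python
def rebalance_feature_space(feature_space, missing, sram_f):
    # Grow Fi by the lacking space S_Mi, then shrink models that did not spill
    # (largest first, an assumed policy) until SUM(F1':Fm') <= SRAM_F.
    grown = {m: feature_space[m] + missing.get(m, 0) for m in feature_space}
    deficit = sum(grown.values()) - sram_f
    if deficit <= 0:
        return grown
    donors = sorted((m for m in grown if missing.get(m, 0) == 0),
                    key=lambda m: grown[m], reverse=True)
    for m in donors:
        take = min(deficit, grown[m])
        grown[m] -= take
        deficit -= take
        if deficit == 0:
            return grown
    return None  # on-chip space cannot support m models; reduce the model count

# Hypothetical KB figures: M1 lacks 64 KB, the others donate within a 1024 KB SRAM_F.
print(rebalance_feature_space({"M1": 512, "M2": 384, "M3": 128}, {"M1": 64}, sram_f=1024))
```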

Further, after optimizing the above first storage space distribution to determine the second storage space distribution, the TDM unit 13 can also reinitialize the state of each model M1~Mm based on the second storage space distribution, and re-run the rotation calculation as described above to check whether any model M1~Mm still lacks storage space under the second storage space distribution. If any model M1~Mm still lacks storage space under the second storage space distribution, the TDM unit 13 can optimize the on-chip storage space of the NPU chip 10 again as described above, until no model Mi lacks storage space, that is, S_Mi=0 for every model.

Please continue to refer to Figures 2 and 4. After determining the switching nodes for latency-optimized reasoning and the storage space corresponding to each model M1~Mm, the NPU chip 10 can synchronously obtain reasoning data about multiple images through multiple image signal processors (ISPs), and/or obtain multiple channels of reasoning data by tiling at least one channel of acquired reasoning data. Its MAC array computing unit 11 and VPU 12 then use the full computing resources of the NPU chip 10 to perform time-division multiplexed reasoning of each model M1~Mm in turn according to the switching nodes determined by the TDM unit 13, so as to achieve latency-optimized reasoning of the multiple models M1~Mm.

Specifically, in the process of latency-optimized reasoning, in response to starting the first switching node of the first model M1, the MAC array computing unit 11 and the VPU 12 can read the reasoning data of the first model M1 from the third storage space F1' corresponding to the first model M1 in the predetermined second storage space distribution, so as to perform the first model inference using the full computing resources of the NPU chip 10. Then, in response to starting the second switching node of the subsequent second model M2, the MAC array computing unit 11 and the VPU 12 can interrupt the first model inference, write the third intermediate data generated by the first model inference into the third storage space F1', and read the reasoning data of the second model from the fourth storage space F2' corresponding to the second model M2 in the second storage space distribution, so as to perform the second model inference using the full computing resources of the NPU chip 10. By analogy, in response to starting the m-th switching node of the subsequent model Mm, the MAC array computing unit 11 and the VPU 12 can interrupt the (m-1)-th model inference as described above, write the intermediate data generated by the (m-1)-th model inference into the storage space F(m-1)', and read the reasoning data of model Mm from the storage space Fm' corresponding to model Mm in the second storage space distribution, so as to perform the m-th model inference using the full computing resources of the NPU chip 10, thereby realizing latency-optimized reasoning of the multiple models M1~Mm and effectively controlling the latency of each model M1~Mm (i.e., the interval between completing two neural network inferences).
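The interrupt-and-resume pattern of this time-division multiplexed loop can be mimicked in a few lines of Python, with each model represented by a generator that yields at its switching nodes; the model bodies and node counts are placeholders, not the actual NPU workloads.

```python
def make_model(name, num_nodes):
    # The model runs segment by segment and yields at every switching node,
    # which is where the scheduler may hand the full compute resources to the
    # next model; the local variable 'state' plays the role of the intermediate
    # data kept in that model's Fi' space between time slots.
    def run():
        state = 0
        for node in range(1, num_nodes + 1):
            state += node          # placeholder for one segment of inference
            yield f"{name}: node {node} done, state={state}"
    return run()

models = [make_model("M1", 3), make_model("M2", 2)]

# Round-robin: switch to the next model at every switching node until all finish.
active = list(models)
while active:
    for gen in list(active):
        try:
            print(next(gen))       # run one segment with full compute resources
        except StopIteration:
            active.remove(gen)     # this model completed its inference
```

Running the sketch interleaves the two models segment by segment, which is exactly the rotation behaviour described above, only without any real computation.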

In addition, please refer to Figure 5 and Figure 6. Figure 5 shows a flow chart of determining a switching node and storage space distribution for storage space optimization reasoning according to some embodiments of the present invention. Figure 6 shows a schematic diagram of storage space optimization reasoning according to some embodiments of the present invention.

As described above, in embodiments where each model M1~Mm performs time-division multiplexed rotation reasoning according to a fixed time slice T0 or a time slice T formed by latency-priority switching nodes, it is quite possible that the on-chip storage space of the NPU chip 10 cannot support the time-division multiplexing of m models, causing the multi-model reasoning to fail. Therefore, in the embodiment shown in FIG. 5, the present invention can also determine the switching nodes for storage-space-optimized reasoning offline, based on a large number of pre-prepared reasoning data samples.

Specifically, it can still be assumed that there are m neural network models M1~Mm to be processed in parallel, and the reasoning period P1~Pm (in ms) for each model M1~Mm to complete one inference is determined. Afterwards, the TDM unit 13 in the NPU chip 10 can, according to the number m of models to be processed in parallel, statically divide the full storage space of the NPU chip 10 into parameter storage spaces W1~Wm for storing the parameters of each model and input and intermediate data storage spaces F1~Fm for storing the input data and intermediate data of each model, so as to determine the third storage space distribution for storing the reasoning data of each model M1~Mm. Here, the parameter storage space W1~Wm of each model M1~Mm can be determined by compiling each model M1~Mm and setting the target performance in FPS (frames per second) and the total bandwidth, and its total size should not exceed SRAM_W 141, i.e., SUM(W1:Wm)≤SRAM_W. Correspondingly, the input and intermediate data storage space F1~Fm of each model M1~Mm can also be determined by compiling each model M1~Mm and setting the target performance in FPS and the total bandwidth, and its total size should not exceed SRAM_F 142, i.e., SUM(F1:Fm)≤SRAM_F.

Afterwards, the TDM unit 13 can determine the total cycle N of multi-model reasoning according to the reasoning cycles P1-Pm of each model M1-Mm, and thereby determine the number of reasonings of each model M1-Mm within the total cycle N, that is, N=LCM(P1, P2,…, Pm)=LCM(P1, LCM(P2,…, LCM(Pm-1, Pm))). The TDM unit 13 can obtain the number of reasonings of each model i within a total cycle N by calculating Num(Mi)=N/Pi, and thereby determine the timing distribution of each model M1-Mm within a total cycle N (for example: multi-model parallel reasoning of model M1 once, model M2 twice, and model M3 four times within a total cycle N).

Afterwards, based on the above third storage space distribution and the processing nodes inherent to each model M1~Mm, the TDM unit 13 can use the full computing resources of the NPU chip 10 to perform model reasoning that traverses the reasoning data samples of each model M1~Mm, so as to determine the storage space each model M1~Mm lacks at each of its processing nodes.

Specifically, the inference data samples can have the same form and size as the inference data for subsequent multi-model inference, and have similar content (for example: both are image data of the target size with noise signals and/or mosaics). The technician can connect the NPU chip 10 to an external buffer, and statically divide the cache space of the external buffer into m blocks according to the compilation requirements of each model M1~Mm, and then load the model parameters of each model M1~Mm into each parameter storage space W1~Wm inside the NPU chip 10 to complete the initialization of each model M1~Mm.

Afterwards, the TDM unit 13 can start the MAC array computing unit 11 and the VPU 12 in the NPU chip 10 to perform, with the full computing resources of the NPU chip 10, model reasoning that traverses the reasoning data samples of each model M1~Mm in turn. Specifically, while performing the model reasoning of any model Mi that traverses its processing nodes, the NPU chip 10 can interrupt the current model reasoning in response to any processing node of the model Mi, first write the fifth intermediate data generated by the model reasoning into the fifth storage space Fi corresponding to the model Mi in the third storage space distribution, and then write the overflowing fifth intermediate data into the external storage space; according to the amount of overflowing fifth intermediate data, it can thus count the amount of data S_Mi_j that model Mi overflows into the external buffer at each processing node j, and thereby determine the storage space each model M1~Mm lacks at each of its processing nodes. Here, j≤Ni, where j is the j-th node of model Mi and Ni is the total number of nodes of model Mi.

Please continue to refer to FIG. 5. After determining the storage space each model M1~Mm lacks at each processing node, the TDM unit 13 can sort the amounts of storage space lacking at the processing nodes of each model M1~Mm, denoted S_Mi[N]=sort(S_Mi_j), and, in ascending order of the lacking storage space, divide each model into different numbers n of subgraphs. For example, when dividing into two subgraphs, the TDM unit 13 can determine the cut position of each subgraph according to the processing node S_Mi[0]. For another example, when dividing into three subgraphs, the TDM unit 13 can determine the cut positions according to the processing nodes S_Mi[0] and S_Mi[1]. By analogy, when dividing into n subgraphs, the TDM unit 13 can determine the cut positions according to the processing nodes S_Mi[0]~S_Mi[n-2], finally obtaining, for each model M1~Mm, subgraph divisions of different numbers n, Mi=set{S_i(1),S_i(2),...,S_i(n-1)}, where S_i(n-1) denotes the division of model Mi into a set of n-1 subgraphs.
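A small Python sketch of this node-selection rule, under the assumption that the per-node spill profile S_Mi_j has already been measured as described above; the spill figures are invented.

```python
def choose_split_nodes(per_node_spill, n):
    # per_node_spill[j] is S_Mi_j, the data a model spills to the external
    # buffer if it is interrupted at processing node j.  To split the model
    # into n subgraphs, pick the n-1 nodes with the smallest spill
    # (S_Mi[0] .. S_Mi[n-2] after sorting), so the intermediate data carried
    # across each cut stays as small as possible.
    order = sorted(range(len(per_node_spill)), key=lambda j: per_node_spill[j])
    return sorted(order[: n - 1])   # node indices where the model is cut

# Hypothetical spill profile (bytes) for one model with 6 processing nodes:
print(choose_split_nodes([512, 64, 2048, 128, 4096, 256], n=3))  # [1, 3]
```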

Afterwards, the TDM unit 13 can perform inference tests on the subgraph sets of model Mi to determine the required computation time Mi_Tn, and, according to the positions of the processing nodes corresponding to the maximum number of subgraphs n_max for which the number of inferences Num(Mi) and the computation time Mi_Tn satisfy the total performance requirement of multi-model reasoning, determine the switching node positions of model Mi, C_i=set{Node1,Node2,…,Node_j}.

Specifically, during the inference test, the TDM unit 13 can first start the MAC array computing unit 11 and the VPU 12 in the NPU chip 10, and use the full computing resources of the NPU chip 10 to perform model inference on the inference data samples of each model M1~Mm in turn. In response to the first processing node of any model Mi, the MAC array computing unit 11 and the VPU 12 can read the model parameters of the model Mi from the sixth storage space Wi corresponding to the model Mi in the third storage space distribution, and read the inference data samples of the model Mi from the sixth storage space Fi corresponding to the model Mi in that distribution, so as to perform the sixth model inference using the full computing resources of the NPU chip 10. Then, in response to the subsequent second processing node of the model Mi, the NPU chip 10 can interrupt the model inference of Mi just as when time-division multiplexing multiple models, first write the sixth intermediate data generated by the model inference into the sixth storage space Fi, then re-read the sixth intermediate data from that storage space to continue the model inference between the first processing node and the second processing node, and so on, until the complete inference of model Mi is finished, so as to determine the computation time Mi_Tn required by the model Mi.

Afterwards, the TDM unit 13 can determine the total performance requirement of multi-model reasoning, i.e., m*N, from the number of models m and the total cycle N of multi-model reasoning, and then determine the maximum number of subgraphs n_max that satisfies the total performance requirement of multi-model reasoning (i.e., SUM(Num(Mi)*Mi_Tn)≤m*N), so as to determine the switching node positions of model Mi, Ci=set{Node1,Node2,…,Node_j}, according to the positions of the processing nodes corresponding to this maximum number of subgraphs n_max. Here, a larger n means more subgraphs are cut, so the models M1~Mm can be switched more frequently for time-division multiplexed multi-model reasoning, further reducing the latency of each model M1~Mm (i.e., the interval between completing two neural network inferences).
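The selection of n_max can be sketched as a simple search in Python; the Num(Mi) values and the measured times Mi_Tn used below are hypothetical.

```python
def max_subgraph_count(num_inferences, compute_time_by_n, m, N):
    # compute_time_by_n[i][n-1] is Mi_Tn; the schedule is feasible while
    # SUM_i(Num(Mi) * Mi_Tn) <= m * N.  Return the largest feasible n (0 if none).
    n_max = 0
    for n in range(1, min(len(t) for t in compute_time_by_n) + 1):
        total = sum(num_inferences[i] * compute_time_by_n[i][n - 1]
                    for i in range(len(num_inferences)))
        if total <= m * N:
            n_max = n
    return n_max

# Hypothetical: two models, Num = [1, 2], total cycle N = 16 ms, m = 2.
print(max_subgraph_count([1, 2], [[6.0, 6.5, 7.2], [4.0, 4.4, 5.0]], m=2, N=16))  # 3
```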

Referring still to FIG. 2 and FIG. 5, after the switching nodes for storage-space-optimized reasoning and the storage space corresponding to each of the models M1 to Mm have been determined, the NPU chip 10 may synchronously acquire inference data for multiple image channels via multiple image signal processors (ISPs), and/or obtain multiple channels of inference data by tiling at least one acquired channel of inference data. Its MAC array computing unit 11 and VPU 12 then use the full computing resources of the NPU chip 10 to perform time-division multiplexed inference of the models M1 to Mm in turn according to the switching nodes determined by the TDM unit 13, thereby realizing storage-space-optimized reasoning of the multiple models M1 to Mm.

Specifically, during storage-space-optimized reasoning, in response to the third switching node Ci that starts the third model Mi, the MAC array computing unit 11 and the VPU 12 may read the inference data of the third model Mi from the seventh storage space F1 corresponding to the third model Mi in the predetermined third storage space distribution, so as to perform the third model inference using the full computing resources of the NPU chip 10. Afterwards, in response to the fourth switching node Ci+1 that starts the subsequent fourth model Mi+1, the MAC array computing unit 11 and the VPU 12 may interrupt the third model inference, write the seventh intermediate data generated by the third model inference into the seventh storage space F1, and read the inference data of the fourth model Mi+1 from the eighth storage space corresponding to the fourth model Mi+1 in the third storage space distribution, so as to perform the fourth model inference using the full computing resources of the NPU chip 10. By analogy, in response to the m-th switching node Cm that starts the subsequent model Mm, the MAC array computing unit 11 and the VPU 12 may interrupt the (m-1)-th model inference as described above, write the intermediate data generated by the (m-1)-th model inference into the storage space F(m-1), and read the inference data of the model Mm from the storage space Fm corresponding to the model Mm in the third storage space distribution, so as to perform the m-th model inference using the full computing resources of the NPU chip 10, thereby realizing storage-space-optimized reasoning of the multiple models M1 to Mm and effectively bounding the intermediate data generated by the time-division multiplexed inference of the models M1 to Mm.
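The rotation described above can be pictured with the following minimal Python sketch, in which each model's inference is modelled as a generator that yields at every switching node so the scheduler can spill its state to that model's space Fi before handing the full compute resources to the next model. The generator-based modelling and all names are illustrative assumptions, not the firmware of the NPU chip 10.

```python
def make_model(name, num_segments):
    """Builds a stand-in model whose inference pauses at every switching node."""
    def run():
        state = 0
        for segment in range(num_segments):   # one segment between two switching nodes
            state += 1                        # stand-in for the MAC/VPU computation
            yield (name, segment, state)      # switching node reached: hand back control
    return run

def tdm_rotate(model_runs, feature_spaces):
    """Rotates the full compute resources over the models; feature_spaces[i] plays Fi."""
    active = {i: run() for i, run in enumerate(model_runs)}
    while active:
        for i in list(active):
            try:
                feature_spaces[i] = next(active[i])   # run model i until its next switch node,
            except StopIteration:                     # spilling its intermediate state to Fi
                del active[i]                         # this model has finished one full inference

spaces = {}
tdm_rotate([make_model("M1", 3), make_model("M2", 2)], spaces)
```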

In summary, by adopting switching nodes determined according to latency-priority and/or storage-space-priority goals, the computation time of each of the models M1 to Mm is divided into many small time slots. The multi-model reasoning method, multi-model reasoning system and computer-readable storage medium provided by the present invention can therefore rotate through the neural network inference computations of the models M1 to Mm by time-division multiplexing, and immediately save intermediate data such as the state information computed by the current model Mi once its time slot ends, so as to rotate to the neural network inference computation of the next model M(i+1). This effectively controls the latency of each of the models M1 to Mm and reduces the cache space required by the intermediate data generated by multi-model reasoning. In addition, since each of the models M1 to Mm caches its intermediate data in on-chip storage and shares the full computing resources of the NPU chip 10 for neural network inference, the present invention can effectively shorten the switching time between the models M1 to Mm and avoid idle, wasted computing resources in the multi-model reasoning system, thereby improving the overall computing efficiency of the models M1 to Mm to support concurrent inference of the multiple models M1 to Mm.

In addition, as mentioned above, after acquiring an entire image with a resolution of H×W from the camera module, the prior art usually needs to tile the entire image through an image signal processor (ISP), splitting it into multiple N×M (N<H, M<W) tile images according to the data processing capability of the network processing unit (NPU) and storing them in an ISP buffer, after which the network processing unit (NPU) reads the image data of each tile image from the ISP buffer one by one for network inference. On the one hand, this introduces additional operations and intermediate data such as tiling, zero-padding with windowing, and cropping of overlapping images, raising the requirements on the overall computing power of the system; on the other hand, it requires the system to provide a larger data cache, greatly increasing the area, power consumption and cost of the image signal processor (ISP) and thereby severely limiting the development and application of existing network processing units (NPUs) toward parallel processing of multiple channels of image data.

In order to overcome the above defects of the prior art, the present invention provides a network processing unit, an extended reality display chip, an extended reality display device, a parallel processing method for image data, and a computer-readable storage medium, which can acquire multiple channels of image data line by line in parallel, write each channel of image data into a buffer at a preset speed, and then use the full computing resources of the network processing unit, via the i-th model, to read and process the written i-th channel of image data from the buffer. With these configurations, the present invention improves image-processing precision by eliminating the need to tile the entire image, and lowers the requirements on the overall computing power and data cache space of the system, thereby increasing the ability of a single network processing unit (NPU) to process multiple channels of image data in parallel under the same hardware process and cost conditions. The overall computing power of the system can thus be used flexibly and fully to reduce network inference latency, so as to meet the need for parallel processing of binocular image data in extended reality (XR) display devices.

In some non-limiting embodiments, the parallel processing method for image data provided by the seventh aspect of the present invention may be implemented via the extended reality (XR) display chip provided by the fifth aspect of the present invention. Please refer to FIG. 7, which shows an architecture diagram of an extended reality display chip provided according to some embodiments of the present invention.

In the embodiment shown in FIG. 7, the extended reality (XR) display chip provided by the fifth aspect of the present invention may be configured in the extended reality (XR) display device provided by the sixth aspect of the present invention, and is provided with a memory (not shown), at least two image signal processors 71-72, and the network processing unit 80 provided by the fourth aspect of the present invention. The memory includes, but is not limited to, the computer-readable storage medium provided by the fifth aspect of the present invention, on which computer instructions are stored. The at least two image signal processors 71-72 are respectively connected to the left-eye camera and the right-eye camera of the extended reality display device to acquire the real-world images they capture. The network processing unit 80 is connected to the memory and to each of the image signal processors 71-72, and is adapted to read and execute the computer instructions stored in the memory to implement the parallel processing method for image data provided by the fourth aspect of the present invention, so as to acquire, alternately and line by line, the image data output by each of the image signal processors 71-72 and process the acquired image data in parallel.

The working principles of the network processing unit 80, the extended reality display chip and the extended reality display device are described below with reference to some embodiments of the parallel processing method for image data. Those skilled in the art will understand that these embodiments of the parallel processing method are only some non-limiting implementations provided by the present invention, intended to clearly present the main concept of the invention and to provide specific solutions convenient for the public to implement, rather than to limit all functions or all working modes of the network processing unit 80, the extended reality display chip and the extended reality display device. Likewise, the network processing unit 80, the extended reality display chip and the extended reality display device are themselves only non-limiting implementations provided by the present invention, and do not limit the executing entities or the execution order of the steps of the parallel processing method for image data.

Please refer to FIG. 7 together with FIG. 8. FIG. 8 shows a flowchart of parallel processing of image data provided according to some embodiments of the present invention.

As shown in FIG. 7, the network processing unit 80 provided by the present invention is configured with hardware such as buffers 811-812, a multiply-accumulate (MAC) array computing unit 82, a vector processing unit (VPU) 83 and a register 84, and with a software program containing multiple pre-trained image processing models. Its computing operations and storage operations can be performed separately and independently, so that the full computing resources of the network processing unit 80 are shared through the MAC array computing unit 82 and the vector processing unit (VPU) 83, flexibly supporting the network processing unit 80 in data reading, computing and caching operations.

In some embodiments, technicians may, in an offline manner, train and compile in advance the input parameters of a neural network model for image processing such as denoising and/or demosaicing according to the size of the entire real-scene image, and, adapted to the specific usage scenario of the binocular camera in the XR display device, initialize the trained model N times (for example, N = 2) to obtain N neural network models with the same function, corresponding respectively to the image signal processors 71-72.

Those skilled in the art will understand that the above scheme of configuring N neural network models with the same function is only a non-limiting implementation provided by the present invention, intended to clearly present the main concept of the invention and to provide a specific solution convenient for the public to implement, rather than to limit the scope of protection of the present invention.

Optionally, in other embodiments, those skilled in the art may also configure multiple neural network models with different parameters, structures and/or functions to suit various image processing requirements such as denoising and demosaicing, so as to correspondingly meet the need for multifunctional parallel processing of binocular image data in XR devices.

In addition, in some embodiments, the buffers 811-812 may be implemented as static random access memory (SRAM), which is arranged at the input of the network processing unit 80 and, adapted to the specific usage scenario of the binocular camera in the XR display device, is statically divided into N (for example, N = 2) independent cache spaces. Further, each cache space may preferably be divided into an input buffer space, a parameter buffer space and a feature buffer space, which respectively cache the image data currently to be processed by the corresponding image processing model, its model parameter data, and the intermediate data generated by the neural network inference of that image processing model.
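A minimal Python sketch of this static SRAM split is shown below, assuming an even three-way division of each model's space into input, parameter and feature regions; the sizes and the even split are illustrative assumptions rather than the ratios actually used by the buffers 811-812.

```python
from dataclasses import dataclass

@dataclass
class ModelBufferSpace:
    input_bytes: int      # image rows currently waiting to be processed by this model
    parameter_bytes: int  # weights of the corresponding image processing model
    feature_bytes: int    # intermediate data produced by its neural network inference

def split_sram(total_bytes, n_models):
    per_model = total_bytes // n_models
    # The even three-way split inside each space is an assumption, not the patent's ratio.
    return [ModelBufferSpace(per_model // 3, per_model // 3, per_model // 3)
            for _ in range(n_models)]

spaces = split_sram(total_bytes=2 * 1024 * 1024, n_models=2)  # e.g. N = 2 for binocular data
```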

In the following, the first buffer 811 and the second buffer 812 refer to the two independent cache spaces described above. The first buffer 811 is connected to the image signal processor 71 via the ISP buffer 711, and the second buffer 812 is connected to the image signal processor 72 via the ISP buffer 721. The two buffers read the two channels of image data line by line from the corresponding ISP buffers 711 and 721 at a preset first speed v1, so that each image processing model, according to its corresponding switching nodes, alternately reads the currently written image data from its corresponding buffer 811 or 812 and processes that channel of image data using the full computing resources of the network processing unit 80, thereby processing the two channels of image data of the binocular camera in parallel. Here, the first speed v1 is not less than N times the second speed v2 at which each of the image signal processors 71-72 outputs image data. Each image processing model may contain multiple inherent processing nodes and completes one cycle of neural network inference by executing the operations of those processing nodes in sequence. The switching nodes may be selected from the processing nodes of each image processing model according to latency-priority and/or storage-space-priority goals, and are correspondingly distributed according to the latency requirements of each neural network model and/or the total cached data volume of the intermediate data generated by each neural network model, so as to indicate the opening and closing times of the processing window of each image processing model.

Specifically, each of the above image processing models may contain multiple inherent processing nodes, and completes one cycle of neural network inference by executing the operations associated with those processing nodes in sequence.

For the above embodiment in which the switching nodes are distributed according to the latency requirements of each image processing model, technicians may, in an offline manner, divide the full storage space of the network processing unit 80 in advance according to the number N of models to be processed in parallel, so as to determine a first storage space distribution for storing the inference data of each model, such as its input image data, model parameter data and intermediate data. In addition, technicians may determine, from the inference cycle each model needs to complete one inference, the total cycle within which the N models each complete one inference, and from this determine the number of inferences of each model within that total cycle. Then, according to the total cycle and the inference count of each model, technicians may select at least one switching node for time-division multiplexing each model from that model's processing nodes, and, based on the first storage space distribution and the switching nodes of each model, perform multi-model reasoning on the inference data samples of the multiple models using the full computing resources of the network processing unit 80 to determine the storage space each model lacks. Finally, technicians may optimize the first storage space distribution according to the total storage space of the network processing unit 80 and the storage space each model lacks, so as to determine a second storage space distribution that meets the multi-model reasoning requirements, and thereby determine the static partitioning scheme of the buffers 811-812, ensuring that the parallel processing of the binocular image data of the XR device can be completed within the specified latency range.
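The following Python sketch illustrates, under stated assumptions, how such a second storage space distribution could be derived offline from the missing-space figures of a trial run: spaces that overflowed are expanded and the other models' spaces are shrunk. The inputs and the simple redistribution rule are hypothetical and stand in for whatever planning tool would actually be used.

```python
def refine_distribution(first_dist, missing_bytes, total_bytes):
    """first_dist / missing_bytes: {model: bytes}. Assumes the non-lacking models have
    enough slack to give up; a real planner would also respect minimum space sizes."""
    second = dict(first_dist)
    deficit = {m: b for m, b in missing_bytes.items() if b > 0}
    donors = [m for m in first_dist if m not in deficit] or list(first_dist)
    need = sum(deficit.values())
    share = -(-need // len(donors))          # ceiling division: bytes each donor gives up
    for m, extra in deficit.items():
        second[m] += extra                   # expand the spaces of the lacking models
    for m in donors:
        second[m] -= share                   # shrink the remaining models' spaces
    assert sum(second.values()) <= total_bytes, "distribution must fit the NPU storage"
    return second

second = refine_distribution({"M1": 512, "M2": 512}, {"M1": 128, "M2": 0}, total_bytes=1024)
```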

In addition, for the above embodiment in which the switching nodes are distributed according to the total cached data volume of the intermediate data generated by each image processing model, technicians may divide the full storage space of the network processing unit 80 according to the number N of models to be processed in parallel, so as to determine a third storage space distribution for storing the inference data of each model, determine the total cycle of multi-model reasoning from the inference cycle of each model as described above, and from this determine the number of inferences of each model within that total cycle. Then, based on the third storage space distribution and the processing nodes of each model, technicians may use the full computing resources of the network processing unit 80 to perform model inference on the inference data samples of each model, so as to determine the storage space each model lacks at each of its processing nodes. Next, in ascending order of lacking storage space, technicians may divide each model into different numbers of subgraphs according to its processing nodes and run inference tests to determine the computation time each model requires. Technicians may then determine the maximum subgraph count whose inference counts and computation times satisfy the total performance requirement of multi-model reasoning, and, according to the positions of the processing nodes corresponding to that maximum subgraph count, select from each model's processing nodes the switching nodes that require the smallest amount of intermediate data to be cached. Here, the total performance requirement of multi-model reasoning can be characterized by the accumulated sum of the inference counts and computation times of the image processing models. By selecting the positions of the processing nodes corresponding to the maximum subgraph count to determine the switching nodes of each model, the present invention further reduces the amount of intermediate data generated when the image processing models are time-division multiplexed, thereby further reducing the area, power consumption and cost of the network processing unit 80 and the image signal processors 71-72.

Thus, according to the specific goals of latency priority and/or storage space priority, the present invention can determine at least one switching node from the processing nodes of each image processing model and determine the storage space corresponding to each image processing model, so as to use the full computing resources of the network processing unit 80 to perform time-division multiplexed inference of the image processing models in turn, thereby realizing latency-optimized parallel inference and/or storage-space-optimized parallel inference of the multiple models.

As shown in FIG. 8, when processing the two channels of image data of the binocular camera in parallel, the two image signal processors 71-72 may be connected respectively to the left-eye camera and the right-eye camera of the XR display device, so as to acquire the captured left-eye image data and right-eye image data from the image sensors of the left-eye camera and the right-eye camera respectively and pre-process them. The two image signal processors 71-72 may then, within a preset exposure time t1 and using a ping-pong buffer read/write scheme, continuously transfer the pre-processed left-eye image data and right-eye image data, line by line at the second speed v2, to the corresponding ISP buffers 711 and 721. Here, for the noise-removal and/or demosaicing image processing functions, the left-eye image data and right-eye image data captured by the binocular camera of the XR display device may be raw image data containing noise and/or a mosaic pattern.

Specifically, under the above ping-pong buffer read/write scheme, the image signal processors 71-72 may synchronously write the left-eye image data captured by the left-eye camera and the right-eye image data captured by the right-eye camera, line by line, into the corresponding ISP buffers 711 and 721. When the preset exposure time t1 ends, the input buffer spaces of the ISP buffers 711 and 721 will simultaneously hold lines 1 to M of the left-eye image and of the right-eye image. At this point, the image signal processors 71-72 may issue an interrupt to the network processing unit 80, instructing it to take the exposure time t1 as a fixed time slice and to write lines 1 to M of the image data cached in the ISP buffers 711 and 721, line by line and at the first speed v1 (v1 ≥ N·v2), into the corresponding buffers 811-812 of the network processing unit 80.
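The timing constraint implied here, namely that the NPU must drain the ISP buffers at v1 ≥ N·v2 so the ping-pong scheme never overflows, can be checked with a small arithmetic sketch such as the one below; the concrete numbers are illustrative assumptions only.

```python
def rows_buffered(t1_ms, v2_rows_per_ms, v1_rows_per_ms, n_streams, slices):
    """Worst-case backlog (in rows) left in the ISP buffers after `slices` time slices."""
    backlog = 0
    for _ in range(slices):
        backlog += n_streams * v2_rows_per_ms * t1_ms    # rows written by all ISPs in one slice
        backlog -= min(backlog, v1_rows_per_ms * t1_ms)  # rows the NPU drains in the same slice
    return backlog

# With v1 = N * v2 the backlog never grows beyond one slice's worth of rows:
assert rows_buffered(t1_ms=4, v2_rows_per_ms=2, v1_rows_per_ms=4, n_streams=2, slices=100) <= 16
```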

Afterwards, in response to reaching the predetermined first switching node T11 that triggers the first model, the network processing unit 80 may first determine that the first processing window of the first model is open, read from the first buffer 811 the lines 1 to M of the left-eye image that have already been written, and drive the MAC array computing unit 82 and the vector processing unit (VPU) 83 so that the full computing resources of the network processing unit 80 support the first model in processing the currently written left-eye image data and generating the corresponding first intermediate data.

Then, in response to reaching the predetermined second switching node T21 that triggers the second model, the network processing unit 80 may determine that the first processing window of the first model is closed and the second processing window of the second model is open. It therefore first writes the first intermediate data generated by the first model between the first switching node T11 and the second switching node T21 back into the feature buffer space of the first buffer 811 to free the computing resources of the network processing unit 80, then reads from the second buffer 812 the lines 1 to M of the right-eye image that have already been written, and drives the MAC array computing unit 82 and the vector processing unit (VPU) 83 to switch the full computing resources of the network processing unit 80 to the second model, so that the second model processes the currently written right-eye image data and generates the corresponding second intermediate data.

Further, under the above ping-pong buffer read/write scheme, while the network processing unit 80 is reading the image data cached in the ISP buffers 711 and 721, the image signal processors 71-72 may continue to write the left-eye image data captured by the left-eye camera and the right-eye image data captured by the right-eye camera, line by line, into the corresponding ISP buffers 711 and 721, so that data read-in and write-out remain dynamically synchronized. When the preset exposure time 2t1 ends, the input buffer spaces of the ISP buffers 711 and 721 will again hold lines M+1 to 2M of the left-eye image and of the right-eye image. The image signal processors 71-72 may again issue an interrupt to the network processing unit 80 as described above, instructing it to continue taking the exposure time t1 as a fixed time slice and to write lines M+1 to 2M of the image data cached in the ISP buffers 711 and 721, line by line and at the first speed v1 (v1 ≥ N·v2), into the corresponding buffers 811-812 of the network processing unit 80, thereby preventing the backlog and overflow of the lines 2M+1 to 3M of left-eye image data that will be written into the ISP buffers 711 and 721 during the next exposure time 2t1 to 3t1.

Then, in response to reaching again the first switching node T12 that triggers the first model, the network processing unit 80 may determine that the second processing window of the second model is closed and the first processing window of the first model is open again. It therefore first writes the second intermediate data generated by the second model between the second switching node T21 and the first switching node T12 back into the feature buffer space of the second buffer 812 to free the computing resources of the network processing unit 80, then reads from the first buffer 811 the lines M+1 to 2M of the left-eye image that have already been written, together with the first intermediate data generated by the previous round of the first neural network inference, and drives the MAC array computing unit 82 and the vector processing unit (VPU) 83 to switch the full computing resources of the network processing unit 80 back to the first model, so that the first model continues processing the currently written left-eye image data and regenerates the corresponding first intermediate data.

By analogy, the two image processing models in the network processing unit 80 can, according to the predetermined switching nodes and the ping-pong buffer read/write scheme, alternately perform neural network inference on the left-eye image and the right-eye image window by window, thereby meeting the need for parallel processing of binocular image data in XR devices with a relatively small ISP cache (for example, 2M lines of image data).
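A minimal sketch of this window-by-window alternation is given below, assuming two equally sized streams and a trivial stand-in for the per-window computation; it is meant only to show how each model consumes the M rows most recently written for its own eye while carrying its intermediate state forward, not to reproduce the actual image processing models.

```python
def alternate_windows(left_rows, right_rows, rows_per_window):
    """Two models take turns owning the full compute resources, window by window."""
    state = {"left": None, "right": None}        # stand-in for each model's feature buffer
    results = {"left": [], "right": []}
    for start in range(0, len(left_rows), rows_per_window):
        for eye, rows in (("left", left_rows), ("right", right_rows)):
            window = rows[start:start + rows_per_window]      # rows 1..M, then M+1..2M, ...
            # "Process" the window together with the intermediate state of the previous window.
            state[eye] = (state[eye] or 0) + sum(window)
            results[eye].append(state[eye])
    return results

out = alternate_windows(list(range(16)), list(range(16)), rows_per_window=8)
```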

Please refer further to FIG. 9, which shows a schematic diagram of acquiring image data line by line provided according to some embodiments of the present invention.

As shown in FIG. 9, the XR display chip provided by the present invention may use the illustrated sliding window to perform convolution computation in row-priority order, from left to right and from top to bottom. On the one hand, this way of reading and computing image data is consistent with the direction in which the image signal processors 71-72 write data into the ISP buffers 711 and 721; by quantitatively configuring the read/write speeds of the image signal processors 71-72 and the network processing unit 80, image data can flow in real time from the image signal processors 71-72 to the network processing unit 80, reducing the time the image data waits in the ISP buffers 711 and 721 and thereby reducing image-processing latency.

On the other hand, a traditional column-direction computation scheme must write and cache all of the image's rows (all 22 rows in the illustrated example) before neural network inference can begin, and therefore has to split the whole image into multiple tile images to lower the cache requirement. In contrast, the row-priority reading and computation scheme adopted by the present invention only needs to write and cache a few rows of image data (for example, the 3 rows corresponding to the sliding window size) before the neural network inference of the corresponding image can proceed in real time. This greatly reduces the cache space required of the ISP buffers 711 and 721, greatly lowering the area, power consumption and cost of the image signal processor, and eliminates the need to tile the whole image, thereby lowering the requirements on the overall computing power and data cache space of the system.
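The benefit of the row-priority scheme can be illustrated with the following pure-Python sketch of a streamed 3×3 convolution that starts producing output as soon as three rows are buffered; the toy image, kernel and helper names are assumptions for illustration and do not reflect the NPU's actual dataflow.

```python
def stream_convolve(rows_iter, width, kernel):
    """rows_iter yields one image row (a list of `width` numbers) at a time."""
    buffered = []
    for row in rows_iter:
        buffered.append(row)
        if len(buffered) < 3:
            continue                               # only 3 rows ever need to be cached
        out_row = []
        for x in range(width - 2):
            acc = sum(kernel[j][i] * buffered[j][x + i] for j in range(3) for i in range(3))
            out_row.append(acc)
        yield out_row                              # an output row is ready after just 3 input rows
        buffered.pop(0)                            # drop the oldest row once it is no longer needed

rows = ([float(x) for x in range(8)] for _ in range(6))   # toy 6x8 image streamed row by row
blur = [[1 / 9] * 3 for _ in range(3)]
outputs = list(stream_convolve(rows, width=8, kernel=blur))
```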

Please refer to FIG. 10 for details. FIG. 10 shows a structural diagram of an XR display chip provided according to some embodiments of the present invention.

As shown in FIG. 10, by adopting the above row-priority image data reading and computation scheme provided by the present invention, apart from the internal buffer 91 of the network processing unit 90, the XR display chip only needs to arrange a small-area, small-capacity (for example, 0.4 MB) ISP buffer 92 outside the network processing unit 80 to meet the need for parallel processing of binocular image data in XR devices, which is favorable for the further development and application of the network processing unit 80 toward parallel processing of multiple channels of image data.

Further, in some embodiments of the invention, each image processing model may include a multi-layer neural network structure. In response to reading, from the corresponding buffer 811 or 812, a number of image lines satisfying the above sliding window size (for example, 3 lines), the image processing model can perform convolution computation in row-priority order, from left to right and from top to bottom, and synchronously complete the neural network inference of the corresponding layers.

Please refer to FIG. 11 for details. FIG. 11 shows a schematic diagram of noise-reduction processing provided according to some embodiments of the present invention. Taking the AI denoising (AI Denoise) processing shown in FIG. 11 as an example, at the algorithm-module level it is divided into an encoder part and a decoder part. Machine-learning-based AI Denoise inference is a process of estimating the latent clean image from an actually observed noisy image. The image processing model maps the image into features through the encoder, and then integrates and restores the image features through the decoder, finally outputting a clean, denoised image. Specifically, during neural network inference, the image processing model may use convolutions to form skip connections between the encoder and the decoder, so that the features extracted when downsampling the image are fused into the upsampling part, promoting the fusion of feature information. Then, by learning the noise in training images, a latent mapping to the corresponding reference images is obtained, and the image processing model can perform denoising neural network inference on an acquired noisy image according to that latent mapping, finally obtaining a clean, denoised image. Compared with traditional denoising methods, this AI denoising not only better preserves the edge and texture details of the image, but can also exploit the architecture of the network processing unit (NPU) for parallelized computation, making full use of the hardware performance to speed up the computation.
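For illustration only, the following PyTorch sketch shows the encoder/decoder-with-skip-connection structure described above in its most minimal form; the layer count, channel widths and kernel sizes are arbitrary assumptions and are not the denoising network actually deployed on the NPU.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Minimal encoder/decoder with a single skip connection, mirroring the structure
    described above; all layer sizes are arbitrary illustrative assumptions."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)  # downsample / encode
        self.dec = nn.ConvTranspose2d(8, 8, kernel_size=2, stride=2)    # upsample / decode
        self.out = nn.Conv2d(8 + 1, 1, kernel_size=3, padding=1)        # fuse skip + features

    def forward(self, noisy):
        feat = torch.relu(self.enc(noisy))
        up = torch.relu(self.dec(feat))
        fused = torch.cat([up, noisy], dim=1)  # skip connection merges the full-resolution input
        return self.out(fused)                 # estimate of the latent clean image

clean = TinyDenoiser()(torch.randn(1, 1, 64, 64))  # toy 64x64 single-channel noisy image
```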

Afterwards, in response to completing the neural network inference of a preset number of layers L on the image lines written so far, the image processing model begins to generate result data for the initial lines of the corresponding left-eye/right-eye image. Here, for the above noise-removal image processing function, the result data generated by the image processing model may be denoised image data; correspondingly, for the above demosaicing image processing function, the result data generated by the image processing model may be restored, demosaiced image data.

Then, as shown in FIG. 7, the network processing unit 80 may output the generated result data line by line, at the first speed v1 and using the ping-pong buffer read/write scheme, to the corresponding ISP buffers 712 and 722, and the ISP buffers 712 and 722 return the received preset number of lines of result data to the corresponding image signal processors 71-72 at the second speed v2, so that the image data in each of the image signal processors 71-72 keeps flowing through the network processing unit 80. This reduces the size requirements on the ISP buffers 712 and 722 and on the internal buffers 811-812 of the network processing unit 80, and lowers image-processing latency.

Specifically, continuing with the raw image data shown in FIG. 9 as an example and assuming its total resolution exceeds 1000 lines, the network processing unit 80 can complete the neural network inference of the preset number of layers L when it has processed line K of the image data, output lines 1 to M of the processing result data, and then, as the processing window of the corresponding model opens and closes, output the subsequent processing result data window by window: lines M+1 to 2M, lines 2M+1 to 3M, and so on. Here, K may be determined by the receptive field of the image processing model; for the same model processing binocular image data, K is the same for both channels, on the order of 60 to 100 lines. The value of M may correspond to the number of image data lines written from the ISP buffers 711 and 721 into the corresponding buffers 811-812 of the network processing unit 80 within the fixed time slice t1 (for example, M = 8), so as to prevent the backlog and overflow of image data in the buffers 811-812.
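The relationship between K, M and the fixed time slice t1 can be summarized with the small sketch below: with a receptive field of K input rows and M rows delivered per slice, the first output rows appear roughly after ceil(K/M) slices. The example values are assumptions for illustration.

```python
import math

def first_output_slice(k_rows, m_rows_per_slice):
    """Number of fixed time slices t1 before the model has seen its full receptive field."""
    return math.ceil(k_rows / m_rows_per_slice)

# e.g. a receptive field of K = 80 rows with M = 8 rows delivered per exposure slice:
slices_before_first_output = first_output_slice(80, 8)   # -> 10 slices of t1
```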

In this way, by outputting the generated result data line by line, at a fixed rate and in fixed amounts, to the corresponding ISP buffers 712 and 722 under the ping-pong buffer read/write scheme, the network processing unit 80 can return processing result data to each of the image signal processors 71-72 in equal amounts while alternately acquiring the multiple channels of raw image data provided by the image signal processors 71-72, thereby keeping the image data flowing and dynamically balanced.

Further, in some embodiments, in response to completing the neural network inference of the preset number of layers L and outputting the generated result data to the corresponding ISP buffers 712 and 722, the network processing unit 80 may preferably also delete the image lines that are only involved in the inference of the first L-1 layers, so as to save the feature buffer space of the buffers 811-812.

In summary, the network processing unit 80, the XR display chip, the XR display device, the parallel processing method for image data and the computer-readable storage medium provided by the present invention can all acquire multiple channels of image data line by line in parallel, write each channel of image data into a buffer at a preset speed, and then use the full computing resources of the network processing unit, via the i-th model, to read and process the written i-th channel of image data from the buffer. With these configurations, the present invention improves image-processing precision by eliminating the need to tile the entire image, and lowers the requirements on the overall computing power and data cache space of the system, thereby increasing the ability of a single network processing unit (NPU) to process multiple channels of image data in parallel under the same hardware process and cost conditions, and flexibly and fully using the overall computing power of the system to reduce network inference latency, so as to meet the need for parallel processing of binocular image data in XR devices.

Although the above methods are illustrated and described as a series of actions for simplicity of explanation, it should be understood and appreciated that these methods are not limited by the order of the actions, because, according to one or more embodiments, some actions may occur in a different order and/or concurrently with other actions that are illustrated and described herein, or that are not illustrated and described herein but would be understood by those skilled in the art.

Those skilled in the art will appreciate that information, signals and data may be represented using any of a variety of different technologies and techniques. For example, the data, instructions, commands, information, signals, bits, symbols and chips referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or optical particles, or any combination thereof.

Those skilled in the art will further appreciate that the various illustrative logic blocks, modules, circuits and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends on the particular application and the design constraints imposed on the overall system. Skilled artisans may implement the described functionality in different ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integrated into the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside as discrete components in a user terminal.

In one or more exemplary embodiments, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software as a computer program product, the functions may be stored on or transmitted over a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. A storage medium may be any available medium that can be accessed by a computer. By way of example and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Any connection is also properly termed a computer-readable medium. For example, if the software is transmitted from a website, server or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio and microwave are included in the definition of medium. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

The previous description of the disclosure is provided to enable any person skilled in the art to make or use the disclosure. Various modifications to the disclosure will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other variations without departing from the spirit or scope of the disclosure. Thus, the disclosure is not intended to be limited to the examples and designs described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (28)

1. A multi-model reasoning method, characterized by comprising the following steps:
acquiring inference data of a plurality of models;
determining, according to a latency-priority and/or storage-space-priority goal, at least one switching node from a plurality of processing nodes of each of the models, and determining a storage space corresponding to each of the models;
writing the inference data of each of the models into the corresponding storage space of a network processing unit; and
performing, according to the switching nodes, time-division multiplexed inference of each of the models in turn using the full computing resources of the network processing unit, so as to realize latency-optimized reasoning and/or storage-space-optimized reasoning of the plurality of models.

2. The multi-model reasoning method according to claim 1, wherein the step of determining, according to the latency-priority goal, at least one switching node from the plurality of processing nodes of each of the models and determining the storage space corresponding to each of the models comprises:
dividing the full storage space of the network processing unit according to the number of models to be processed in parallel, so as to determine a first storage space distribution for storing the inference data of each of the models;
determining a total cycle of multi-model reasoning according to the inference cycle of each of the models, and determining the number of inferences of each of the models within the total cycle;
determining, according to the total cycle and the number of inferences of each of the models, at least one switching node for time-division multiplexing each of the models from the plurality of processing nodes of each of the models;
performing, based on the first storage space distribution and the switching nodes of each of the models, multi-model reasoning on inference data samples of the plurality of models using the full computing resources of the network processing unit, so as to determine the storage space each of the models lacks; and
optimizing the first storage space distribution according to the total storage space of the network processing unit and the storage space each of the models lacks, so as to determine a second storage space distribution that meets the multi-model reasoning requirements.
3. The multi-model reasoning method according to claim 2, wherein the step of performing, based on the first storage space distribution and the switching nodes between the models, multi-model reasoning on the inference data samples of the plurality of models using the full computing resources of the network processing unit to determine the storage space each of the models lacks comprises:
connecting the network processing unit to an external buffer;
in response to a first switching node that starts a first model, reading inference data samples of the first model from a first storage space corresponding to the first model in the first storage space distribution, performing first model inference using the full computing resources of the network processing unit, and counting a first external storage space used by the first model inference; and
determining the storage space the first model lacks according to the first external storage space.

4. The multi-model reasoning method according to claim 3, wherein the step of, in response to the first switching node that starts the first model, reading the inference data samples of the first model from the first storage space corresponding to the first model in the first storage space distribution, performing the first model inference using the full computing resources of the network processing unit, and counting the first external storage space used by the first model inference comprises:
in response to a second switching node that starts a subsequent second model, interrupting the first model inference, first writing first intermediate data generated by the first model inference into the first storage space, and then writing the overflowed first intermediate data into the external storage space; and
determining the first external storage space according to the data volume of the overflowed first intermediate data.
5. The multi-model reasoning method according to claim 4, wherein the step of performing, based on the first storage space distribution and the switching nodes between the models, multi-model reasoning on the inference data samples of the plurality of models using the full computing resources of the network processing unit to determine the storage space each of the models lacks further comprises:
while performing the first model inference, writing the inference data samples of the second model into a second storage space corresponding to the second model in the first storage space distribution; and
in response to the second switching node, reading the inference data samples of the second model from the second storage space, switching the full computing resources of the network processing unit to the second model, performing second model inference using the full computing resources, and determining a second external storage space used by the second model inference.

6. The multi-model reasoning method according to claim 3, wherein the inference data comprises model parameter data, input data and intermediate data, the storage space distribution involves a parameter storage space, an input data storage space and an intermediate data storage space of each of the models, and the step of optimizing the first storage space distribution according to the total storage space of the network processing unit and the storage space each of the models lacks, so as to determine the second storage space distribution that meets the multi-model reasoning requirements, comprises:
expanding, according to the storage space each of the models lacks in the first storage space distribution, the corresponding input data storage space and intermediate data storage space, and/or reducing the input data storage space and intermediate data storage space of the remaining models, so as to determine a second storage space distribution in which none of the models lacks storage space.
7. The multi-model reasoning method according to claim 1, characterized in that the step of performing, according to the switching nodes, time-division multiplexed reasoning of each of the models in sequence using the full computing resources of the network processing unit, to achieve latency-optimized reasoning and/or storage-space-optimized reasoning of the multiple models, comprises:
in response to starting a first switching node of a first model, reading the reasoning data of the first model from a third storage space corresponding to the first model in a predetermined second storage space distribution, and performing first model reasoning using the full computing resources of the network processing unit; and
in response to starting a second switching node of a subsequent second model, interrupting the first model reasoning, writing third intermediate data generated by the first model reasoning into the third storage space, reading the reasoning data of the second model from a fourth storage space corresponding to the second model in the second storage space distribution, and performing second model reasoning using the full computing resources of the network processing unit, to achieve latency-optimized reasoning of the multiple models.

8. The multi-model reasoning method according to claim 1, characterized in that the step of determining, according to the storage-space-priority goal, at least one switching node from the multiple processing nodes of each of the models and determining the storage space corresponding to each of the models comprises:
dividing the full storage space of the network processing unit according to the number of models to be processed in parallel, to determine a third storage space distribution for storing the reasoning data of each of the models;
determining a total cycle of the multi-model reasoning according to the reasoning cycle of each of the models, and determining the number of reasoning runs of each of the models within the total cycle;
based on the third storage space distribution and the processing nodes of each of the models, performing model reasoning on the reasoning data samples of each of the models using the full computing resources of the network processing unit, to determine the storage space each of the models lacks at each of the processing nodes;
in ascending order of the lacking storage space, dividing each of the models into different numbers of subgraphs according to the processing nodes, and performing reasoning tests to determine the computation time required by each of the models;
determining a maximum number of subgraphs for which the number of reasoning runs and the computation time meet the total performance requirement of the multi-model reasoning; and
determining the switching nodes of each of the models according to the positions of the processing nodes corresponding to the maximum number of subgraphs.

9. The multi-model reasoning method according to claim 8, characterized in that the step of performing model reasoning on the reasoning data samples of each of the models based on the third storage space distribution and the processing nodes of each of the models, using the full computing resources of the network processing unit, to determine the storage space each of the models lacks at each of the processing nodes comprises:
connecting the network processing unit to an external buffer;
performing model reasoning on the reasoning data samples of each of the models in turn using the full computing resources of the network processing unit;
in response to any of the processing nodes of any of the models, interrupting the current model reasoning of the corresponding model, first writing fifth intermediate data generated by the model reasoning into a fifth storage space corresponding to the model in the third storage space distribution, and then writing the overflowing fifth intermediate data into the external storage space; and
determining a fifth external storage space that the model lacks at the processing node according to the data amount of the overflowing fifth intermediate data.

10. The multi-model reasoning method according to claim 8, characterized in that the step of performing reasoning tests to determine the computation time required by each of the models comprises:
performing model reasoning on the reasoning data samples of each of the models in turn using the full computing resources of the network processing unit;
in response to a first processing node of any of the models, reading the reasoning data sample of the model from a sixth storage space corresponding to the model in the third storage space distribution, and performing the model reasoning using the full computing resources of the network processing unit;
in response to a subsequent second processing node of the model, interrupting the model reasoning of the model, first writing sixth intermediate data generated by the model reasoning into the sixth storage space, and then reading the sixth intermediate data from the sixth storage space to continue the model reasoning between the first processing node and the second processing node, and so on, until the complete reasoning of the model is finished; and
determining the computation time required by the model according to the time taken to complete the complete reasoning of the model.
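The subgraph search in claims 8 to 10 can be pictured as follows; the measurement callback, the run counts and the time budget are assumed inputs, and the linear scan over the subgraph count k is only one possible search order.

```python
def max_feasible_subgraphs(measure_time, runs_per_cycle, budget, k_max):
    """measure_time(model, k) -> time for one run of the model cut into k subgraphs."""
    best = 1
    for k in range(1, k_max + 1):
        total = sum(measure_time(m, k) * n for m, n in runs_per_cycle.items())
        if total <= budget:
            best = k          # finer cuts shrink the on-chip intermediate footprint
    return best

# Toy example: per-run time grows with k (switching overhead), budget of 10 ms per cycle.
toy_time = lambda model, k: 2.0 + 0.5 * k
assert max_feasible_subgraphs(toy_time, {"det": 2, "seg": 1}, budget=10.0, k_max=8) == 2
```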
11. The multi-model reasoning method according to claim 8, characterized in that, before determining the maximum number of subgraphs for which the number of reasoning runs and the computation time meet the total performance requirement of the multi-model reasoning, the multi-model reasoning method further comprises the following step:
determining the total performance requirement of the multi-model reasoning according to the number of the models and the total cycle of the multi-model reasoning.

12. The multi-model reasoning method according to claim 1, characterized in that the step of performing, according to the switching nodes, time-division multiplexed reasoning of each of the models in sequence using the full computing resources of the network processing unit, to achieve latency-optimized reasoning and/or storage-space-optimized reasoning of the multiple models, comprises:
in response to starting a third switching node of a third model, reading the reasoning data of the third model from a seventh storage space corresponding to the third model in a predetermined third storage space distribution, and performing third model reasoning using the full computing resources of the network processing unit; and
in response to starting a fourth switching node of a subsequent fourth model, interrupting the third model reasoning, writing seventh intermediate data generated by the third model reasoning into the seventh storage space, reading the reasoning data of the fourth model from an eighth storage space corresponding to the fourth model in the third storage space distribution, and performing fourth model reasoning using the full computing resources of the network processing unit, to achieve storage-space-optimized reasoning of the multiple models.

13. The multi-model reasoning method according to claim 1, characterized in that the step of obtaining the reasoning data of the multiple models comprises:
synchronously acquiring reasoning data about multiple channels of images via multiple channels of image signal processors; and/or
performing data slicing on at least one acquired channel of reasoning data to obtain multiple channels of reasoning data.

14. A multi-model reasoning system, characterized by comprising:
a memory having computer instructions stored thereon; and
a processor connected to the memory and configured to execute the computer instructions stored on the memory, to implement the multi-model reasoning method according to any one of claims 1 to 13.

15. The multi-model reasoning system according to claim 14, characterized in that the multi-model reasoning system is configured in a network processing unit.
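The data-slicing option in claim 13 amounts to cutting one acquired frame into several bands that are then handled as independent reasoning channels; the NumPy helper below is a hypothetical illustration, not the chip's actual data path.

```python
import numpy as np

def slice_into_channels(frame: np.ndarray, n_channels: int):
    """Split an (H, W) frame into n_channels roughly equal row bands."""
    return np.array_split(frame, n_channels, axis=0)

bands = slice_into_channels(np.zeros((480, 640), dtype=np.uint8), 4)
assert len(bands) == 4 and bands[0].shape == (120, 640)
```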
16. A network processing unit, connected to multiple channels of image signal processors and provided with a buffer and multiple pre-trained image processing models, characterized in that:
the network processing unit alternately acquires N channels of image data line by line via N channels of image signal processors, and writes each channel of the image data into the buffer at a first speed, where N is an integer greater than 1 and the first speed is not less than N times a second speed at which each channel of the image signal processors outputs image data;
in response to reaching at least one preset switching node i, an image processing model i reads and processes the currently written i-th channel of image data from the buffer using the full computing resources of the network processing unit, where i is an integer not greater than N; and
in response to reaching a subsequent switching node i+1, the network processing unit caches the i-th channel of intermediate data generated by the image processing model i, so that the subsequently written i-th channel of image data continues to be read and processed at the next switching node i.

17. The network processing unit according to claim 16, characterized in that the network processing unit acquires the entire i-th channel of image data line by line via an image signal processor i, and
the image processing model i reads previously cached intermediate data i according to multiple corresponding switching nodes i, and processes, window by window, the currently written multiple lines of the i-th channel of image data, to update the intermediate data i about the i-th channel of image data and/or generate result data i about the i-th channel of image data.

18. The network processing unit according to claim 17, characterized in that the image processing model i comprises a multi-layer neural network structure and is configured to:
in response to completing neural network reasoning of a preset number L of layers based on the currently written multiple lines of the image data, output result data line by line to the corresponding image signal processor i at the first speed, and delete the multiple lines of the image data involved only in the reasoning of the first L-1 layers of the neural network.

19. The network processing unit according to any one of claims 16 to 18, characterized in that the switching nodes are distributed at preset time intervals, or distributed according to the latency requirements of each of the image processing models, or distributed according to the total cached data amount of each of the intermediate data.
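The line-interleaving rule of claim 16 can be mimicked in a few lines; the generator below is a software stand-in for the hardware write port (all names are assumptions), and it only keeps pace with the ISPs if that port runs at least N times faster than each stream, as the claim requires.

```python
from collections import deque

def interleave_lines(streams):
    """streams: N iterables of image lines; yields (channel index, line)."""
    queues = [deque(s) for s in streams]
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                yield i, q.popleft()   # channel i's next line goes into the buffer

# Two 3-line streams end up written alternately: L0, R0, L1, R1, L2, R2.
written = [line for _, line in interleave_lines([["L0", "L1", "L2"], ["R0", "R1", "R2"]])]
assert written == ["L0", "R0", "L1", "R1", "L2", "R2"]
```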
20. The network processing unit according to claim 19, characterized in that the steps of determining the switching nodes distributed according to the latency requirements and determining the storage space corresponding to each of the image processing models comprise:
dividing the full storage space of the buffer according to the number of models to be processed in parallel, to determine a first storage space distribution for storing the reasoning data of each of the image processing models;
determining a total cycle of the multi-model reasoning according to the reasoning cycle of each of the image processing models, and determining the number of reasoning runs of each of the image processing models within the total cycle;
determining, according to the total cycle and the number of reasoning runs of each of the image processing models, at least one switching node for time-division multiplexing each of the image processing models from the multiple processing nodes of each of the image processing models;
based on the first storage space distribution and the switching nodes of each of the image processing models, performing multi-model reasoning on the reasoning data samples of the multiple image processing models using the full computing resources of the network processing unit, to determine the storage space each of the image processing models lacks; and
optimizing the first storage space distribution according to the full storage space and the storage space each of the image processing models lacks, to determine a second storage space distribution that meets the multi-model reasoning requirements.
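Claim 20 leaves open how the total cycle is derived from the per-model reasoning cycles; one natural choice, assumed here, is the least common multiple, which also fixes each model's run count within the cycle.

```python
from math import lcm   # Python 3.9+

def cycle_and_runs(periods_ms):
    """periods_ms: dict model -> reasoning period in milliseconds."""
    total = lcm(*periods_ms.values())
    runs = {m: total // p for m, p in periods_ms.items()}
    return total, runs

total, runs = cycle_and_runs({"left_eye": 8, "right_eye": 8, "hand_tracking": 16})
assert total == 16 and runs == {"left_eye": 2, "right_eye": 2, "hand_tracking": 1}
```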
21. The network processing unit according to claim 19, characterized in that the steps of determining the switching nodes distributed according to the total cached data amount and determining the storage space corresponding to each of the image processing models comprise:
dividing the full storage space of the buffer according to the number of models to be processed in parallel, to determine a third storage space distribution for storing the reasoning data of each of the image processing models;
determining a total cycle of the multi-model reasoning according to the reasoning cycle of each of the image processing models, and determining the number of reasoning runs of each of the image processing models within the total cycle;
based on the third storage space distribution and the processing nodes of each of the image processing models, performing model reasoning on the reasoning data samples of each of the image processing models using the full computing resources of the network processing unit, to determine the storage space each of the image processing models lacks at each of the processing nodes;
in ascending order of the lacking storage space, dividing each of the image processing models into different numbers of subgraphs according to the processing nodes, and performing reasoning tests to determine the computation time required by each of the image processing models;
determining a maximum number of subgraphs for which the number of reasoning runs and the computation time meet the total performance requirement of the multi-model reasoning; and
determining the switching nodes of each of the image processing models according to the positions of the processing nodes corresponding to the maximum number of subgraphs.

22. The network processing unit according to any one of claims 16 to 18, characterized in that the buffer is statically divided into N buffer spaces, to cache the image data currently to be processed and the intermediate data generated by previous processing.
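Claim 22's static division of the buffer is the simplest possible allocation; the sketch below merely carves N equal, fixed regions (the sizes and offsets are illustrative, not values from the specification).

```python
def static_partition(buffer_bytes, n_channels):
    """Return (offset, size) pairs for N equal, fixed regions of the buffer."""
    size = buffer_bytes // n_channels
    return [(i * size, size) for i in range(n_channels)]

regions = static_partition(8 * 2**20, 2)     # an 8 MiB buffer in the binocular case
assert regions == [(0, 4 * 2**20), (4 * 2**20, 4 * 2**20)]
```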
23. The network processing unit according to claim 16, characterized in that the network processing unit is connected to a binocular camera via two channels of image signal processors and is configured to:
alternately acquire, line by line via the two channels of image signal processors, left-eye image data and right-eye image data of the binocular camera, and write the left-eye image data and the right-eye image data into the buffer respectively;
in response to reaching a first switching node, read and process the currently written left-eye image data from the buffer via a first image processing model using the full computing resources of the network processing unit, to generate corresponding first intermediate data;
in response to reaching a second switching node, cache the first intermediate data, switch the full computing resources of the network processing unit to a second image processing model, and read and process the currently written right-eye image data from the buffer via the second image processing model, to generate corresponding second intermediate data; and
in response to the next first switching node, cache the second intermediate data, read the first intermediate data, switch the full computing resources of the network processing unit back to the first image processing model, and read and process the further written left-eye image data from the buffer via the first image processing model, to update the first intermediate data and/or generate result data about the left-eye image data.

24. The network processing unit according to claim 16 or 23, characterized in that the left-eye image data and the right-eye image data are original image data with noise, and the result data about the left-eye image data and the right-eye image data are denoised image data with the noise removed; or
the left-eye image data and the right-eye image data are original image data with mosaics, and the result data about the left-eye image data and the right-eye image data are restored image data with the mosaics removed.

25. An extended reality display chip, characterized by comprising:
at least two channels of image signal processors, connected respectively to a left-eye camera and a right-eye camera of an extended reality display device; and
the network processing unit according to any one of claims 16 to 24, connected respectively to each channel of the image signal processors, to synchronously acquire the image data output by each channel of the image signal processors and process the image data in parallel.
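The binocular ping-pong of claim 23 alternates the two eye models while parking each one's intermediate data across switches; the toy scheduler below uses placeholder models (my own names, not the patented image processing networks) to show the ordering only.

```python
def binocular_schedule(left_chunks, right_chunks, left_model, right_model):
    """left_chunks / right_chunks: newly written line groups per switching node."""
    left_state, right_state = None, None
    outputs = []
    for l_chunk, r_chunk in zip(left_chunks, right_chunks):
        left_state, out_l = left_model(l_chunk, left_state)       # first switching node
        right_state, out_r = right_model(r_chunk, right_state)    # second switching node
        outputs.append((out_l, out_r))
    return outputs

# Toy usage: "models" that merely count how many line groups they have consumed.
count = lambda chunk, state: ((state or 0) + 1, f"saw {len(chunk)} lines")
result = binocular_schedule([["l0", "l1"], ["l2"]], [["r0"], ["r1", "r2"]], count, count)
assert result == [("saw 2 lines", "saw 1 lines"), ("saw 1 lines", "saw 2 lines")]
```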
26. An extended reality display device, characterized by comprising:
a binocular camera; and
the extended reality display chip according to claim 25, wherein the extended reality display chip is connected to the binocular camera, to synchronously acquire the multiple channels of image data output by the binocular camera and process each channel of the image data in parallel.

27. A parallel processing method for image data, characterized by comprising the following steps:
alternately acquiring N channels of image data line by line via N channels of image signal processors, and writing each channel of the image data into a buffer of a network processing unit at a first speed, where N is an integer greater than 1 and the first speed is not less than N times a second speed at which each channel of the image signal processors outputs image data;
in response to reaching at least one preset switching node i, reading and processing the currently written i-th channel of image data from the buffer via an image processing model i pre-trained and configured in the network processing unit, using the full computing resources of the network processing unit, where i is an integer not greater than N; and
in response to reaching a subsequent switching node i+1, caching the i-th channel of intermediate data generated by the image processing model i, so that the subsequently written i-th channel of image data continues to be read and processed at the next switching node i.

28. A computer-readable storage medium having computer instructions stored thereon, characterized in that, when the computer instructions are executed by a processor, the multi-model reasoning method according to any one of claims 1 to 13 or the parallel processing method for image data according to claim 27 is implemented.
PCT/CN2024/132671 2023-11-24 2024-11-18 Multi-model reasoning method and system, network processing unit, extended reality display chip and apparatus, and image processing method Pending WO2025108230A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202311587154.0A CN120046721A (en) 2023-11-24 2023-11-24 Multi-model reasoning method, system and storage medium
CN202311582338.8A CN120047302A (en) 2023-11-24 2023-11-24 Network processing unit, augmented reality display chip, device and image processing method
CN202311587154.0 2023-11-24
CN202311582338.8 2023-11-24

Publications (1)

Publication Number Publication Date
WO2025108230A1 (en) 2025-05-30

Family

ID=95826037

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2024/132671 Pending WO2025108230A1 (en) 2023-11-24 2024-11-18 Multi-model reasoning method and system, network processing unit, extended reality display chip and apparatus, and image processing method

Country Status (1)

Country Link
WO (1) WO2025108230A1 (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210019652A1 (en) * 2019-07-18 2021-01-21 Qualcomm Incorporated Concurrent optimization of machine learning model performance
CN114286035A (en) * 2021-12-29 2022-04-05 杭州海康机器人技术有限公司 Image acquisition card, image acquisition method and image acquisition system
CN115511693A (en) * 2022-08-22 2022-12-23 阿里巴巴(中国)有限公司 Neural network model processing method and device
CN116010049A (en) * 2022-12-20 2023-04-25 爱芯元智半导体(上海)有限公司 Compiling and executing method, chip and electronic equipment of neural network algorithm task

Similar Documents

Publication Publication Date Title
EP3664093B1 (en) Semiconductor memory device employing processing in memory (pim) and method of operating the semiconductor memory device
CN105872432B (en) Device and method for fast adaptive frame rate conversion
US11436017B2 (en) Data temporary storage apparatus, data temporary storage method and operation method
US11868871B1 (en) Circuit for executing stateful neural network
CN107832843A (en) A kind of information processing method and Related product
CN107657581A (en) A convolutional neural network (CNN) hardware accelerator and acceleration method
CN114298295B (en) Chip, accelerator card, electronic device, and data processing method
CN111028360B (en) Data reading and writing method and system in 3D image processing, storage medium and terminal
JP7108702B2 (en) Processing for multiple input datasets
CN114254563A (en) Data processing method and device, electronic equipment and storage medium
US20120203942A1 (en) Data processing apparatus
CN118261805A (en) Image processing fusion method and system based on FPGA
CN113962873B (en) Image denoising method, storage medium and terminal device
You et al. Eyecod: Eye tracking system acceleration via flatcam-based algorithm and hardware co-design
Cadenas et al. Parallel pipelined array architectures for real-time histogram computation in consumer devices
WO2025108230A1 (en) Multi-model reasoning method and system, network processing unit, extended reality display chip and apparatus, and image processing method
CN120047302A (en) Network processing unit, augmented reality display chip, device and image processing method
CN118982453A (en) A SLAM hardware acceleration architecture for resource-constrained environments
Yang et al. A communication library for mapping dataflow applications on manycore architectures
CN116739901A (en) Video super-processing method and device, electronic equipment and storage medium
JP2006107532A (en) Information processing system and information processing method
CN118537769B (en) Method, device, equipment and medium for rapid segmentation of breast lesion features in medical videos
CN120046721A (en) Multi-model reasoning method, system and storage medium
CN118632056B (en) Volumetric scene streaming transmission method, device and medium based on 3DGS
JP7721693B2 (en) Video processing method, device, electronic device, and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 24893387

Country of ref document: EP

Kind code of ref document: A1