
CN118674603A - Time-based frame generation via time-aware machine learning model - Google Patents

Time-based frame generation via time-aware machine learning model

Info

Publication number
CN118674603A
Authority
CN
China
Prior art keywords
graphics
frame
data
memory
processing
Prior art date
Legal status
Pending
Application number
CN202311798194.XA
Other languages
Chinese (zh)
Inventor
S. Panneer
N. Jain
S. Kim
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Priority date
Filing date
Publication date
Priority claimed from US 18/478,233 (published as US20240311950A1)
Application filed by Intel Corp filed Critical Intel Corp
Publication of CN118674603A


Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 1/00: General purpose image data processing
    • G06T 1/20: Processor architectures; Processor configuration, e.g. pipelining
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/16: Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/044: Recurrent networks, e.g. Hopfield networks
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/06: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063: Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computational Mathematics (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Neurology (AREA)
  • Image Generation (AREA)

Abstract

The title of the present disclosure is "time-based frame generation via a time-aware machine learning model". Described herein is a graphics processor configured to perform time-based frame generation via a time-aware machine learning model, which enables the generation of frames at a target timestamp relative to the render time of an input frame. For example, for an extrapolated frame generated by the time-aware machine learning model, a low relative timestamp indicates that the extrapolated frame occurs temporally near the last frame in the sequence of frames and should therefore closely resemble that frame. A higher relative timestamp indicates that the extrapolated frame should show a greater degree of evolution based on the optical flow.

Description

Time-based frame generation via a time-aware machine learning model

Cross-Reference

This application claims priority to U.S. Provisional Application No. 63/490,618, filed March 16, 2023, which is hereby incorporated herein by reference.

Technical Field

The present disclosure relates generally to data processing via graphics processors and, more particularly, to methods that enable time-based frame generation via a time-aware machine learning model.

Background

A neural network that performs frame generation can be trained based on rendered frame data and optical flow data. The optical flow data can be a combination of optical flow generated by the game engine and optical flow data generated by hardware based on analysis of the rendered frames. The neural network then learns to generate new frames based on some number of input frames and the optical flow between those frames.
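As a rough illustration of this training setup, the following PyTorch sketch assembles two rendered frames plus the optical flow between them into one input tensor and regresses against the actual next rendered frame. The network shape, channel counts, loss, and the name FrameGenNet are illustrative assumptions, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class FrameGenNet(nn.Module):
    """Toy frame-generation network: consumes two rendered frames plus the
    optical flow between them and predicts the next frame. Layer sizes are
    illustrative, not those of any production model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # 2 RGB frames (6 channels) + 2-channel flow = 8 input channels
            nn.Conv2d(8, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),  # predicted RGB frame
        )

    def forward(self, frame0, frame1, flow):
        x = torch.cat([frame0, frame1, flow], dim=1)
        return self.net(x)

# One training step against the rendered frame that actually followed frame1.
model = FrameGenNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame0 = torch.rand(1, 3, 64, 64)   # stand-ins for rendered frame data
frame1 = torch.rand(1, 3, 64, 64)
flow = torch.rand(1, 2, 64, 64)     # engine- or hardware-derived optical flow
target = torch.rand(1, 3, 64, 64)   # ground-truth next rendered frame
loss = nn.functional.l1_loss(model(frame0, frame1, flow), target)
loss.backward()
opt.step()
```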

Brief Description of the Drawings

The embodiments described herein are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements, and in which:

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;

FIGS. 2A-2D illustrate parallel processor components;

FIGS. 3A-3C are block diagrams of graphics multiprocessors and multiprocessor-based GPUs;

FIGS. 4A-4F illustrate an exemplary architecture in which a plurality of GPUs is communicatively coupled to a plurality of multi-core processors;

FIG. 5 illustrates a graphics processing pipeline;

FIG. 6 illustrates a machine learning software stack;

FIG. 7 illustrates a general-purpose graphics processing unit;

FIG. 8 illustrates a multi-GPU computing system;

FIGS. 9A-9B illustrate layers of an exemplary deep neural network;

FIG. 10 illustrates an exemplary recurrent neural network;

FIG. 11 illustrates training and deployment of a deep neural network;

FIG. 12A is a block diagram illustrating distributed learning;

FIG. 12B is a block diagram illustrating a programmable network interface and data processing unit;

FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) suitable for performing inferencing using a trained model;

FIG. 14 is a block diagram of a processing system;

FIGS. 15A-15C illustrate computing systems and graphics processors;

FIGS. 16A-16C illustrate block diagrams of additional graphics processor and compute accelerator architectures;

FIG. 17 is a block diagram of a graphics processing engine of a graphics processor;

FIGS. 18A-18C illustrate thread execution logic including an array of processing elements employed in a graphics processor core;

FIG. 19 illustrates a tile of a multi-tile processor, according to an embodiment;

FIG. 20 is a block diagram illustrating graphics processor instruction formats;

FIG. 21 is a block diagram of an additional graphics processor architecture;

FIGS. 22A-22B illustrate a graphics processor command format and command sequence;

FIG. 23 illustrates an exemplary graphics software architecture for a data processing system;

FIG. 24A is a block diagram illustrating an IP core development system;

FIG. 24B illustrates a cross-sectional side view of an integrated circuit package assembly;

FIG. 24C illustrates a package assembly that includes multiple units of hardware logic chiplets connected to a substrate (e.g., base die);

FIG. 24D illustrates a package assembly including interchangeable chiplets;

FIG. 25 is a block diagram illustrating an exemplary system-on-chip integrated circuit;

FIGS. 26A-26B are block diagrams illustrating exemplary graphics processors for use within an SoC;

FIG. 27 is a block diagram of a data processing system, according to an embodiment;

FIGS. 28A-28B illustrate matrix operations performed by an instruction pipeline, according to an embodiment;

FIG. 29 illustrates a compute block that includes codec-enabled disaggregated systolic logic;

FIGS. 30A-30B illustrate systems for frame extrapolation and frame interpolation using a unified time-aware machine learning model, according to embodiments;

FIG. 31 illustrates a timeline of end-to-end operations for frame rendering on a graphics processor;

FIG. 32 illustrates a timeline in which neural frame generation is used to maintain a target frame pace;

FIGS. 33A-33B illustrate multiple GPU engines having multiple associated command buffers that enable asynchronous render and compute operations;

FIG. 34 illustrates a method of generating a time-aware machine learning model, according to an embodiment;

FIG. 35 illustrates a method of time-based frame generation via a time-aware machine learning model, according to an embodiment;

FIG. 36 illustrates a method of time-based frame generation that is asynchronous with the render rate, according to an embodiment;

FIG. 37 illustrates a system for maintaining consistent frame pacing via time-aware frame generation; and

FIG. 38 is a block diagram of a computing device including a graphics processor, according to an embodiment.

Detailed Description

Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, and depth testing. Traditionally, graphics processors used fixed-function compute units to process graphics data. More recently, however, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data.

To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single-instruction, multiple-thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming, Chapter 3, pages 37-51 (2013).

A graphics processing unit (GPU) is communicatively coupled to host/processor cores to accelerate, for example, graphics operations, machine-learning operations, pattern analysis operations, and/or various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or another interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). Alternatively, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal processor bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic for efficiently processing these commands/instructions.

In the following description, numerous specific details are set forth to provide a more thorough understanding. However, it will be apparent to one skilled in the art that the embodiments described herein may be practiced without one or more of these specific details. In other instances, well-known features have not been described to avoid obscuring the details of the present embodiments.

Described herein is a technique that enables time-based frame generation via a time-aware machine learning model. The frame generation model is trained not only to generate new frames based on input frames and optical flow, but also to evolve that optical flow to target a specific future timestamp.
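A minimal sketch of what such timestamp conditioning could look like, assuming the relative timestamp t is supplied both as a scale factor on the flow and as a broadcast input channel; the layer sizes and the name TimeAwareFrameGen are hypothetical:

```python
import torch
import torch.nn as nn

class TimeAwareFrameGen(nn.Module):
    """Sketch of timestamp conditioning: the relative timestamp t (0.0 = at
    the last rendered frame, 1.0 = one full frame interval later) is fed to
    the network as a constant extra channel, and the input flow is pre-scaled
    by t so the model targets the requested point in time."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(9, 32, 3, padding=1), nn.ReLU(),  # 6 frame + 2 flow + 1 time
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, frame0, frame1, flow, t):
        n, _, h, w = frame1.shape
        t_plane = torch.full((n, 1, h, w), float(t))
        scaled_flow = flow * t  # evolve the flow toward the target timestamp
        return self.net(torch.cat([frame0, frame1, scaled_flow, t_plane], dim=1))

model = TimeAwareFrameGen()
# Extrapolate a frame at 40% of a frame interval past the last rendered frame.
out = model(torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64),
            torch.rand(1, 2, 64, 64), t=0.4)
```

Pre-scaling the flow by t is one simple way to express "evolve the flow toward the target timestamp"; a trained model would learn a richer, nonlinear evolution than this linear scaling.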

System Overview

FIG. 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processor(s) 102 and a system memory 104 communicating via an interconnection path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset component or may be integrated within the one or more processor(s) 102. The memory hub 105 couples with an I/O subsystem 111 via a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that can enable the computing system 100 to receive input from one or more input device(s) 108. Additionally, the I/O hub 107 can enable a display controller, which may be included in the one or more processor(s) 102, to provide outputs to one or more display device(s) 110A. In one embodiment the one or more display device(s) 110A coupled with the I/O hub 107 can include a local, internal, or embedded display device.

The processing subsystem 101 includes, for example, one or more parallel processor(s) 112 coupled to the memory hub 105 via a bus or other communication link 113. The communication link 113 may be one of any number of standards-based communication link technologies or protocols, such as, but not limited to, PCI Express, or may be a vendor-specific communications interface or communications fabric. The one or more parallel processor(s) 112 may form a computationally focused parallel or vector processing system that can include a large number of processing cores and/or processing clusters, such as a many integrated core (MIC) processor. For example, the one or more parallel processor(s) 112 form a graphics processing subsystem that can output pixels to one of the one or more display device(s) 110A coupled via the I/O hub 107. The one or more parallel processor(s) 112 can also include a display controller and display interface (not shown) to enable a direct connection to one or more display device(s) 110B.

Within the I/O subsystem 111, a system storage unit 114 can connect to the I/O hub 107 to provide a storage mechanism for the computing system 100. An I/O switch 116 can be used to provide an interface mechanism to enable connections between the I/O hub 107 and other components, such as a network adapter 118 and/or a wireless network adapter 119 that may be integrated into the platform, and various other devices that can be added via one or more add-in device(s) 120. The add-in device(s) 120 can also include, for example, one or more external graphics processor devices, graphics cards, and/or compute accelerators. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of a Wi-Fi, Bluetooth, near field communication (NFC), or other network device that includes one or more wireless radios.

The computing system 100 can include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, and the like, which can also be connected to the I/O hub 107. Communication paths interconnecting the various components in FIG. 1 may be implemented using any suitable protocols, such as PCI (Peripheral Component Interconnect) based protocols (e.g., PCI Express), or any other bus or point-to-point communication interfaces and/or protocol(s), such as the NVLink high-speed interconnect, Compute Express Link™ (CXL™) (e.g., CXL.mem), Infinity Fabric (IF), Ethernet (IEEE 802.3), remote direct memory access (RDMA), InfiniBand, Internet Wide Area RDMA Protocol (iWARP), Transmission Control Protocol (TCP), User Datagram Protocol (UDP), quick UDP Internet Connections (QUIC), RDMA over Converged Ethernet (RoCE), Intel QuickPath Interconnect (QPI), Intel Ultra Path Interconnect (UPI), Intel On-Chip System Fabric (IOSF), Omnipath, HyperTransport, Advanced Microcontroller Bus Architecture (AMBA) interconnect, OpenCAPI, Gen-Z, Cache Coherent Interconnect for Accelerators (CCIX), 3GPP Long Term Evolution (LTE) (4G), 3GPP 5G and variations thereof, or wired or wireless interconnect protocols known in the art. In some examples, data can be copied or stored to virtualized storage nodes using a protocol such as non-volatile memory express (NVMe) over Fabrics (NVMe-oF) or NVMe.

The one or more parallel processor(s) 112 may incorporate circuitry optimized for graphics and video processing, including, for example, video output circuitry, and constitute a graphics processing unit (GPU). Alternatively or additionally, the one or more parallel processor(s) 112 can incorporate circuitry optimized for general-purpose processing while preserving the underlying computational architecture, described in greater detail herein. Components of the computing system 100 may be integrated with one or more other system elements on a single integrated circuit. For example, the one or more parallel processor(s) 112, the memory hub 105, the processor(s) 102, and the I/O hub 107 can be integrated into a system-on-chip (SoC) integrated circuit. Alternatively, the components of the computing system 100 can be integrated into a single package to form a system-in-package (SIP) configuration. In one embodiment at least a portion of the components of the computing system 100 can be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computing system.

It will be appreciated that the computing system 100 shown herein is illustrative and that variations and modifications are possible. The connection topology, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112, may be modified as desired. For example, system memory 104 can be connected to the processor(s) 102 directly rather than through a bridge, while other devices communicate with the system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In other embodiments, the I/O hub 107 and memory hub 105 may be integrated into a single chip. It is also possible that two or more sets of processor(s) 102 are attached via multiple sockets, which can couple with two or more instances of the parallel processor(s) 112.

Some of the particular components shown herein are optional and may not be included in all implementations of the computing system 100. For example, any number of add-in cards or peripherals may be supported, or some components may be eliminated. Furthermore, some architectures may use different terminology for components similar to those illustrated in FIG. 1. For example, the memory hub 105 may be referred to as a northbridge in some architectures, while the I/O hub 107 may be referred to as a southbridge.

FIG. 2A illustrates a parallel processor 200. The parallel processor 200 may be a GPU, a GPGPU, or the like as described herein. The various components of the parallel processor 200 may be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). The illustrated parallel processor 200 may be one or more of the parallel processor(s) 112 shown in FIG. 1.

The parallel processor 200 includes a parallel processing unit 202. The parallel processing unit includes an I/O unit 204 that enables communication with other devices, including other instances of the parallel processing unit 202. The I/O unit 204 may be directly connected to other devices. For instance, the I/O unit 204 connects with other devices via the use of a hub or switch interface, such as the memory hub 105. The connections between the memory hub 105 and the I/O unit 204 form a communication link 113. Within the parallel processing unit 202, the I/O unit 204 connects with a host interface 206 and a memory crossbar 216, where the host interface 206 receives commands directed to performing processing operations and the memory crossbar 216 receives commands directed to performing memory operations.

When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations to perform those commands to a front end 208. In one embodiment the front end 208 couples with a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. The scheduler 210 ensures that the processing cluster array 212 is properly configured and in a valid state before tasks are distributed to the processing clusters of the processing cluster array 212. The scheduler 210 can be implemented via firmware logic executing on a microcontroller. The microcontroller-implemented scheduler 210 is configurable to perform complex scheduling and work distribution operations at coarse and fine granularity, enabling rapid preemption and context switching of threads executing on the processing cluster array 212. Preferably, the host software can submit workloads for scheduling on the processing cluster array 212 via one of multiple graphics processing doorbells. In other examples, polling for new workloads or interrupts can be used to identify or indicate availability of work to perform. The workloads can then be automatically distributed across the processing cluster array 212 by the scheduler 210 logic within the scheduler microcontroller.

The processing cluster array 212 can include up to "N" processing clusters (e.g., cluster 214A, cluster 214B, through cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 can allocate work to the clusters 214A-214N of the processing cluster array 212 using various scheduling and/or work distribution algorithms, which may vary depending on the workload arising for each type of program or computation. The scheduling can be handled dynamically by the scheduler 210 or can be assisted in part by compiler logic during compilation of program logic configured for execution by the processing cluster array 212. Optionally, different clusters 214A-214N of the processing cluster array 212 can be allocated for processing different types of programs or for performing different types of computations.

The processing cluster array 212 can be configured to perform various types of parallel processing operations. For example, the processing cluster array 212 is configured to perform general-purpose parallel compute operations. For instance, the processing cluster array 212 can include logic to execute processing tasks including filtering of video and/or audio data, performing modeling operations, including physics operations, and performing data transformations.

The processing cluster array 212 is configured to perform parallel graphics processing operations. In such embodiments in which the parallel processor 200 is configured to perform graphics processing operations, the processing cluster array 212 can include additional logic to support the execution of such graphics processing operations, including but not limited to texture sampling logic to perform texture operations, as well as tessellation logic and other vertex processing logic. Additionally, the processing cluster array 212 can be configured to execute graphics-processing-related shader programs such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. The parallel processing unit 202 can transfer data from system memory via the I/O unit 204 for processing. During processing the transferred data can be stored to on-chip memory (e.g., parallel processor memory 222) and then written back to system memory.

In embodiments in which the parallel processing unit 202 is used to perform graphics processing, the scheduler 210 may be configured to divide the processing workload into approximately equal-sized tasks to better enable distribution of the graphics processing operations to multiple clusters 214A-214N of the processing cluster array 212. In some of these embodiments, portions of the processing cluster array 212 can be configured to perform different types of processing. For example, a first portion may be configured to perform vertex shading and topology generation, a second portion may be configured to perform tessellation and geometry shading, and a third portion may be configured to perform pixel shading or other screen-space operations to produce a rendered image for display. Intermediate data produced by one or more of the clusters 214A-214N may be stored in buffers to allow the intermediate data to be transmitted between the clusters 214A-214N for further processing.

During operation, the processing cluster array 212 can receive processing tasks to be executed via the scheduler 210, which receives commands defining processing tasks from the front end 208. For graphics processing operations, processing tasks can include indices of the data to be processed, e.g., surface (patch) data, primitive data, vertex data, and/or pixel data, as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). The scheduler 210 may be configured to fetch the indices corresponding to the tasks or may receive the indices from the front end 208. The front end 208 can be configured to ensure the processing cluster array 212 is configured to a valid state before the workload specified by incoming command buffers (e.g., batch buffers, push buffers, etc.) is initiated.

Each of the one or more instances of the parallel processing unit 202 can couple with parallel processor memory 222. The parallel processor memory 222 can be accessed via the memory crossbar 216, which can receive memory requests from the processing cluster array 212 as well as the I/O unit 204. The memory crossbar 216 can access the parallel processor memory 222 via a memory interface 218. The memory interface 218 can include multiple partition units (e.g., partition unit 220A, partition unit 220B, through partition unit 220N) that can each couple to a portion (e.g., a memory unit) of the parallel processor memory 222. The number of partition units 220A-220N may be configured to be equal to the number of memory units, such that a first partition unit 220A has a corresponding first memory unit 224A, a second partition unit 220B has a corresponding second memory unit 224B, and an Nth partition unit 220N has a corresponding Nth memory unit 224N. In other embodiments, the number of partition units 220A-220N may not be equal to the number of memory devices.

The memory units 224A-224N can include various types of memory devices, including dynamic random-access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory. Optionally, the memory units 224A-224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). Persons skilled in the art will appreciate that the specific implementation of the memory units 224A-224N can vary and can be selected from one of various conventional designs. Render targets, such as frame buffers or texture maps, may be stored across the memory units 224A-224N, allowing the partition units 220A-220N to write portions of each render target in parallel to efficiently use the available bandwidth of the parallel processor memory 222. In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory.

Optionally, any one of the clusters 214A-214N of the processing cluster array 212 has the ability to process data that will be written to any of the memory units 224A-224N within the parallel processor memory 222. The memory crossbar 216 can be configured to transfer the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output. Each cluster 214A-214N can communicate with the memory interface 218 through the memory crossbar 216 to read from or write to various external memory devices. In one of the embodiments with the memory crossbar 216, the memory crossbar 216 has a connection to the memory interface 218 to communicate with the I/O unit 204, as well as a connection to a local instance of the parallel processor memory 222, enabling the processing units within the different processing clusters 214A-214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. Generally, the memory crossbar 216 may, for example, be able to use virtual channels to separate traffic streams between the clusters 214A-214N and the partition units 220A-220N.

While a single instance of the parallel processing unit 202 is illustrated within the parallel processor 200, any number of instances of the parallel processing unit 202 can be included. For example, multiple instances of the parallel processing unit 202 can be provided on a single add-in card, or multiple add-in cards can be interconnected. For example, the parallel processor 200 can be an add-in device, such as the add-in device 120 of FIG. 1, which may be a graphics card, such as a discrete graphics card that includes one or more GPUs, one or more memory devices, and device-to-device or network or fabric interfaces. The different instances of the parallel processing unit 202 can be configured to interoperate even if the different instances have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. Optionally, some instances of the parallel processing unit 202 can include higher-precision floating-point units relative to other instances. Systems incorporating one or more instances of the parallel processing unit 202 or the parallel processor 200 can be implemented in a variety of configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and/or embedded systems. An orchestrator can form composite nodes for workload execution using one or more of: disaggregated processor resources, cache resources, memory resources, storage resources, and networking resources.

In one embodiment, the parallel processing unit 202 can be partitioned into multiple instances. Those multiple instances can be configured to execute workloads associated with different clients in an isolated manner, enabling a predetermined quality of service to be provided for each client. For example, each cluster 214A-214N can be compartmentalized and isolated from the other clusters, allowing the processing cluster array 212 to be divided into multiple compute partitions or instances. In such a configuration, workloads that execute on an isolated partition are protected from faults or errors associated with a different workload that executes on a different partition. The partition units 220A-220N can be configured to enable dedicated and/or isolated paths to the memory for the clusters 214A-214N associated with the respective compute partitions. This datapath isolation enables the compute resources within a partition to communicate with one or more assigned memory units 224A-224N without being subjected to interference from the activities of other partitions.

FIG. 2B is a block diagram of a partition unit 220. The partition unit 220 may be an instance of one of the partition units 220A-220N of FIG. 2A. As illustrated, the partition unit 220 includes an L2 cache 221, a frame buffer interface 225, and a ROP 226 (raster operations unit). The L2 cache 221 is a read/write cache that is configured to perform load and store operations received from the memory crossbar 216 and the ROP 226. Read misses and urgent write-back requests are output by the L2 cache 221 to the frame buffer interface 225 for processing. Updates can also be sent to the frame buffer via the frame buffer interface 225 for processing. In one embodiment the frame buffer interface 225 interfaces with one of the memory units of the parallel processor memory 222 of FIG. 2A, such as the memory units 224A-224N. The partition unit 220 may additionally or alternatively also interface with one of the memory units of the parallel processor memory via a memory controller (not shown).

In graphics applications, the ROP 226 is a processing unit that performs raster operations such as stencil, z-test, blending, and the like. The ROP 226 then outputs processed graphics data that is stored in graphics memory. In some embodiments the ROP 226 includes or couples with a codec (CODEC) 227 that includes compression logic to compress depth or color data that is written to memory or the L2 cache 221 and decompress depth or color data that is read from memory or the L2 cache 221. The compression logic can be lossless compression logic that makes use of one or more of multiple compression algorithms. The type of compression performed by the CODEC 227 can vary based on the statistical characteristics of the data to be compressed. For example, in one embodiment, delta color compression is performed on depth and color data on a per-tile basis. In one embodiment the CODEC 227 includes compression and decompression logic that can compress and decompress compute data associated with machine learning operations. The CODEC 227 can, for example, compress sparse matrix data for sparse machine learning operations. The CODEC 227 can also compress sparse matrix data that is encoded in a sparse matrix format (e.g., coordinate list encoding (COO), compressed sparse row (CSR), compressed sparse column (CSC), etc.) to generate compressed and encoded sparse matrix data. The compressed and encoded sparse matrix data can be decompressed and/or decoded before being processed by processing elements, or the processing elements can be configured to consume compressed, encoded, or compressed and encoded data for processing.
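For readers unfamiliar with the sparse formats named above, this hand-rolled Python sketch shows the compressed sparse row (CSR) encoding itself; it illustrates the storage format only and is not the compression scheme of the CODEC 227.

```python
import numpy as np

def to_csr(dense):
    """Encode a dense 2D matrix in compressed sparse row (CSR) form:
    the nonzero values, their column indices, and per-row offsets
    into those two arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return np.array(values), np.array(col_idx), np.array(row_ptr)

m = np.array([[0, 0, 3],
              [1, 0, 0],
              [0, 2, 0]])
vals, cols, rows = to_csr(m)
# vals=[3 1 2], cols=[2 0 1], rows=[0 1 2 3]: only 3 of 9 entries stored.
```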

The ROP 226 may be included within each processing cluster (e.g., clusters 214A-214N of FIG. 2A) instead of within the partition unit 220. In such embodiments, read and write requests for pixel data rather than pixel fragment data are transmitted over the memory crossbar 216. The processed graphics data may be displayed on a display device, such as one of the one or more display device(s) 110A-110B of FIG. 1, routed for further processing by the processor(s) 102, or routed for further processing by one of the processing entities within the parallel processor 200 of FIG. 2A.

FIG. 2C is a block diagram of a processing cluster 214 within a parallel processing unit. For example, the processing cluster is an instance of one of the processing clusters 214A-214N of FIG. 2A. The processing cluster 214 can be configured to execute many threads in parallel, where the term "thread" refers to an instance of a particular program executing on a particular set of input data. Optionally, single-instruction, multiple-data (SIMD) instruction issue techniques may be used to support parallel execution of a large number of threads without providing multiple independent instruction units. Alternatively, single-instruction, multiple-thread (SIMT) techniques may be used to support parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each one of the processing clusters. Unlike a SIMD execution regime, where all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Persons skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
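A software analogy for SIMT branch divergence, sketched with NumPy masks (illustrative only; real hardware uses per-lane execution masks rather than array indexing):

```python
import numpy as np

# Eight "threads" of a SIMT group evaluate: y = x*2 if x is even else x+1.
x = np.arange(8)
mask = (x % 2 == 0)            # threads taking the "if" path

y = np.empty_like(x)
y[mask] = x[mask] * 2          # pass 1: even lanes execute, odd lanes idle
y[~mask] = x[~mask] + 1        # pass 2: odd lanes execute, even lanes idle
# Both sides of the branch are issued; each lane is active in only one pass,
# which is why divergent branches cost extra cycles on SIMT hardware.
```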

Operation of the processing cluster 214 can be controlled via a pipeline manager 232 that distributes processing tasks to SIMT parallel processors. The pipeline manager 232 receives instructions from the scheduler 210 of FIG. 2A and manages execution of those instructions via a graphics multiprocessor 234 and/or a texture unit 236. The illustrated graphics multiprocessor 234 is an exemplary instance of a SIMT parallel processor. However, various types of SIMT parallel processors of differing architectures may be included within the processing cluster 214. One or more instances of the graphics multiprocessor 234 can be included within a processing cluster 214. The graphics multiprocessor 234 can process data, and a data crossbar 240 can be used to distribute the processed data to one of multiple possible destinations, including facilitating the exchange of data between graphics multiprocessors within the processing cluster 214. The pipeline manager 232 can facilitate the distribution of processed data by specifying destinations for processed data to be distributed via the data crossbar 240.

Each graphics multiprocessor 234 within the processing cluster 214 can include an identical set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). The functional execution logic can be configured in a pipelined manner in which new instructions can be issued before previous instructions are complete. The functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, bit shifting, and computation of various algebraic functions. The same functional-unit hardware can be leveraged to perform different operations, and any combination of functional units may be present.

The instructions transmitted to the processing cluster 214 constitute a thread. A set of threads executing across the set of parallel processing engines is a thread group. A thread group executes the same program on different input data. Each thread within a thread group can be assigned to a different processing engine within a graphics multiprocessor 234. A thread group may include fewer threads than the number of processing engines within the graphics multiprocessor 234. When a thread group includes fewer threads than the number of processing engines, one or more of the processing engines may be idle during cycles in which that thread group is being processed. A thread group may also include more threads than the number of processing engines within the graphics multiprocessor 234. When the thread group includes more threads than the number of processing engines within the graphics multiprocessor 234, processing can be performed over consecutive clock cycles. Optionally, multiple thread groups can be executed concurrently on the graphics multiprocessor 234.
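The relationship between thread-group size and processing-engine count described above reduces to ceiling arithmetic, as in this small sketch (function name and numbers are illustrative):

```python
import math

def issue_cycles(threads_in_group, processing_engines):
    """Cycles needed to issue one instruction for every thread in the group
    when at most `processing_engines` threads can execute per cycle."""
    return math.ceil(threads_in_group / processing_engines)

print(issue_cycles(24, 32))  # 1 cycle, with 8 engines idle during that cycle
print(issue_cycles(64, 32))  # 2 consecutive cycles for the oversized group
```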

The graphics multiprocessor 234 can include an internal cache memory to perform load and store operations. Optionally, the graphics multiprocessor 234 can forgo an internal cache and use a cache memory (e.g., a level 1 (L1) cache 248) within the processing cluster 214. Each graphics multiprocessor 234 also has access to level 2 (L2) caches within the partition units (e.g., partition units 220A-220N of FIG. 2A) that are shared among all processing clusters 214 and may be used to transfer data between threads. The graphics multiprocessor 234 may also access off-chip global memory, which can include one or more of local parallel processor memory and/or system memory. Any memory external to the parallel processing unit 202 may be used as global memory. Embodiments in which the processing cluster 214 includes multiple instances of the graphics multiprocessor 234 can share common instructions and data, which may be stored in the L1 cache 248.

Each processing cluster 214 may include an MMU 245 (memory management unit) that is configured to map virtual addresses into physical addresses. In other embodiments, one or more instances of the MMU 245 may reside within the memory interface 218 of FIG. 2A. The MMU 245 includes a set of page table entries (PTEs) used to map a virtual address to a physical address of a tile and, optionally, a cache line index. The MMU 245 may include address translation lookaside buffers (TLBs) or caches that may reside within the graphics multiprocessor 234 or the L1 cache 248 of the processing cluster 214. The physical address is processed to distribute surface data access locality to allow efficient request interleaving among the partition units. The cache line index may be used to determine whether a request for a cache line is a hit or a miss.
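A toy sketch of the TLB-then-page-table translation flow described above, with Python dicts standing in for both structures; the page size, table contents, and function name are assumptions for illustration:

```python
PAGE_SIZE = 4096
page_table = {0x00: 0x8A, 0x01: 0x13}  # virtual page number -> physical page number (PTEs)
tlb = {}                                # small cache of recent translations

def translate(vaddr):
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn in tlb:                      # TLB hit: no page-table walk needed
        ppn = tlb[vpn]
    else:                               # TLB miss: consult the page table entries
        ppn = page_table[vpn]
        tlb[vpn] = ppn                  # cache the translation for reuse
    return ppn * PAGE_SIZE + offset

print(hex(translate(0x0042)))   # miss, then cached: 0x8a042
print(hex(translate(0x0043)))   # hit in the TLB
```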

In graphics and computing applications, a processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, e.g., determining texture sample positions, reading texture data, and filtering the texture data. Texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from the L1 cache within the graphics multiprocessor 234 and is fetched from an L2 cache, local parallel processor memory, or system memory, as needed. Each graphics multiprocessor 234 outputs processed tasks to the data crossbar 240 to provide the processed task to another processing cluster 214 for further processing or to store the processed task in an L2 cache, local parallel processor memory, or system memory via the memory crossbar 216. A preROP 242 (pre-raster operations unit) is configured to receive data from the graphics multiprocessor 234 and direct data to ROP units, which may be located with the partition units as described herein (e.g., partition units 220A-220N of FIG. 2A). The preROP 242 unit can perform optimizations for color blending, organize pixel color data, and perform address translations.

It will be appreciated that the core architecture described herein is illustrative and that variations and modifications are possible. Any number of processing units (e.g., graphics multiprocessors 234, texture units 236, preROPs 242, etc.) may be included within a processing cluster 214. Further, while only one processing cluster 214 is shown, a parallel processing unit as described herein may include any number of instances of the processing cluster 214. Optionally, each processing cluster 214 may be configured to operate independently of the other processing clusters 214 using separate and distinct processing units, L1 caches, L2 caches, and so forth.

FIG. 2D shows an example of the graphics multiprocessor 234 in which the graphics multiprocessor 234 couples with the pipeline manager 232 of the processing cluster 214. The graphics multiprocessor 234 has an execution pipeline including, but not limited to, an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general-purpose graphics processing unit (GPGPU) cores 262, and one or more load/store units 266. The GPGPU cores 262 and the load/store units 266 are coupled with cache memory 272 and shared memory 270 via a memory and cache interconnect 268. The graphics multiprocessor 234 may additionally include tensor and/or ray tracing cores 263 that include hardware logic to accelerate matrix and/or ray tracing operations.

The instruction cache 252 may receive a stream of instructions to execute from the pipeline manager 232. The instructions are cached in the instruction cache 252 and dispatched for execution by the instruction unit 254. The instruction unit 254 may dispatch instructions as thread groups (e.g., warps), with each thread of the thread group assigned to a different execution unit within a GPGPU core 262. An instruction can access any of a local, shared, or global address space by specifying an address within a unified address space. The address mapping unit 256 can be used to translate addresses in the unified address space into distinct memory addresses that can be accessed by the load/store units 266.
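
As a hedged illustration of warp-granular dispatch, the CUDA C++ kernel below reports how threads are grouped into warps on current NVIDIA hardware; the launch shape is arbitrary, and nothing here reproduces the instruction unit 254 itself.

#include <cstdio>

// Each thread derives its warp and lane; the threads of one warp are
// dispatched together and execute the same instruction stream (SIMT).
__global__ void warp_layout() {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    int warp = threadIdx.x / warpSize;   // warpSize is 32 on current NVIDIA GPUs
    int lane = threadIdx.x % warpSize;
    if (lane == 0)
        printf("block %d warp %d starts at global thread %d\n", blockIdx.x, warp, tid);
}

int main() {
    warp_layout<<<2, 128>>>();           // 2 blocks x 128 threads = 8 warps total
    cudaDeviceSynchronize();
    return 0;
}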

The register file 258 provides a set of registers for the functional units of the graphics multiprocessor 234. The register file 258 provides temporary storage for operands connected to the data paths of the functional units (e.g., GPGPU cores 262, load/store units 266) of the graphics multiprocessor 234. The register file 258 may be divided between each of the functional units such that each functional unit is allocated a dedicated portion of the register file 258. For example, the register file 258 may be divided between the different warps being executed by the graphics multiprocessor 234.

The GPGPU cores 262 may each include floating point units (FPUs) and/or integer arithmetic logic units (ALUs) that are used to execute instructions of the graphics multiprocessor 234. In some implementations, the GPGPU cores 262 may include hardware logic that may otherwise reside within the tensor and/or ray tracing cores 263. The GPGPU cores 262 can be similar in architecture or can differ in architecture. For example, and in one embodiment, a first portion of the GPGPU cores 262 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores includes a double-precision FPU. Optionally, the FPUs can implement the IEEE 754-2008 standard for floating point arithmetic or enable variable-precision floating point arithmetic. The graphics multiprocessor 234 may additionally include one or more fixed-function or special-function units to perform specific functions such as copy-rectangle or pixel-blending operations. One or more of the GPGPU cores can also include fixed- or special-function logic.

The GPGPU cores 262 may include SIMD logic capable of performing a single instruction on multiple sets of data. Optionally, the GPGPU cores 262 can physically execute SIMD4, SIMD8, and SIMD16 instructions and logically execute SIMD1, SIMD2, and SIMD32 instructions. SIMD instructions for the GPGPU cores can be generated at compile time by a shader compiler, or generated automatically when executing programs written and compiled for single program multiple data (SPMD) or SIMT architectures. Multiple threads of a program configured for the SIMT execution model can be executed via a single SIMD instruction. For example, and in one embodiment, eight SIMT threads that perform the same or similar operations can be executed in parallel via a single SIMD8 logic unit.
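
The minimal CUDA C++ sketch below shows eight SIMT lanes cooperating, one warp-wide operation per step, loosely analogous to the SIMD8 case described above; the warp-shuffle reduction is a standard CUDA idiom and stands in for, rather than reproduces, the mechanism described here.

#include <cstdio>

// Eight SIMT threads executing a warp-wide, SIMD-style reduction:
// each shuffle step is a single instruction issued across the lanes.
__global__ void eight_lane_sum() {
    int lane = threadIdx.x;                       // launched with 8 threads
    int v = lane;                                 // per-lane operand
    const unsigned mask = 0xffu;                  // lanes 0..7 participate
    for (int off = 4; off > 0; off >>= 1)
        v += __shfl_down_sync(mask, v, off, 8);   // width 8: stay within the group
    if (lane == 0) printf("sum of lanes 0..7 = %d\n", v);  // 0+1+...+7 = 28
}

int main() {
    eight_lane_sum<<<1, 8>>>();
    cudaDeviceSynchronize();
    return 0;
}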

The memory and cache interconnect 268 is an interconnect network that connects each of the functional units of the graphics multiprocessor 234 to the register file 258 and to the shared memory 270. For example, the memory and cache interconnect 268 is a crossbar interconnect that allows the load/store unit 266 to implement load and store operations between the shared memory 270 and the register file 258. The register file 258 can operate at the same frequency as the GPGPU cores 262, so data transfer between the GPGPU cores 262 and the register file 258 has very low latency. The shared memory 270 can be used to enable communication between threads that execute on the functional units within the graphics multiprocessor 234. The cache memory 272 can be used as a data cache, for example, to cache texture data communicated between the functional units and the texture unit 236. The shared memory 270 can also be used as a program-managed cache. The shared memory 270 and the cache memory 272 can couple with the data crossbar 240 to enable communication with other components of the processing cluster. In addition to the automatically cached data stored in the cache memory 272, threads executing on the GPGPU cores 262 can programmatically store data within the shared memory.
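
As an illustrative analogy only, the CUDA C++ kernel below uses explicitly managed shared memory for thread-to-thread communication within a block, the standard block-wide reduction pattern; the buffer size and launch shape are assumptions for the example.

#include <cstdio>

// Block-wide sum through programmatically managed shared memory.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[256];                 // explicitly managed, per-block storage
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;
    buf[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();                           // make the stores visible block-wide
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = buf[0];
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, (n / 256) * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    block_sum<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("partial sum of block 0: %f\n", out[0]);  // prints 256.000000
    cudaFree(in); cudaFree(out);
    return 0;
}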

FIGS. 3A-3C illustrate additional graphics multiprocessors, according to embodiments. FIGS. 3A-3B illustrate graphics multiprocessors 325, 350, which are related to the graphics multiprocessor 234 of FIG. 2C and may be used in place of one of them. Therefore, the disclosure of any features in combination with the graphics multiprocessor 234 herein also discloses a corresponding combination with the graphics multiprocessors 325, 350, but is not limited as such. FIG. 3C illustrates a graphics processing unit (GPU) 380 that includes dedicated sets of graphics processing resources arranged into multi-core groups 365A-365N, which correspond to the graphics multiprocessors 325, 350. The illustrated graphics multiprocessors 325, 350 and the multi-core groups 365A-365N can be streaming multiprocessors (SMs) capable of simultaneous execution of a large number of execution threads.

The graphics multiprocessor 325 of FIG. 3A includes multiple additional instances of execution resource units relative to the graphics multiprocessor 234 of FIG. 2D. For example, the graphics multiprocessor 325 may include multiple instances of the instruction units 332A-332B, the register files 334A-334B, and the texture unit(s) 344A-344B. The graphics multiprocessor 325 also includes multiple sets of graphics or compute execution units (e.g., GPGPU cores 336A-336B, tensor cores 337A-337B, ray tracing cores 338A-338B) and multiple sets of load/store units 340A-340B. The execution resource units have a common instruction cache 330, a texture and/or data cache memory 342, and a shared memory 346.

The various components can communicate via an interconnect fabric 327. The interconnect fabric 327 may include one or more crossbar switches to enable communication between the various components of the graphics multiprocessor 325. The interconnect fabric 327 is a separate, high-speed network fabric layer upon which each component of the graphics multiprocessor 325 is stacked. The components of the graphics multiprocessor 325 communicate with remote components via the interconnect fabric 327. For example, the cores 336A-336B, 337A-337B, and 338A-338B can each communicate with the shared memory 346 via the interconnect fabric 327. The interconnect fabric 327 can arbitrate communication within the graphics multiprocessor 325 to ensure a fair bandwidth allocation between components.

The graphics multiprocessor 350 of FIG. 3B includes multiple sets of execution resources 356A-356D, where each set of execution resources includes multiple instruction units, register files, GPGPU cores, and load/store units, as illustrated in FIG. 2D and FIG. 3A. The execution resources 356A-356D can work in concert with the texture unit(s) 360A-360D for texture operations while sharing an instruction cache 354 and a shared memory 353. For example, the execution resources 356A-356D can share the instruction cache 354 and the shared memory 353, as well as multiple instances of a texture and/or data cache memory 358A-358B. The various components can communicate via an interconnect fabric 352 similar to the interconnect fabric 327 of FIG. 3A.

Persons skilled in the art will understand that the architectures described in FIG. 1, FIGS. 2A-2D, and FIGS. 3A-3B are descriptive and not limiting as to the scope of the present embodiments. Thus, the techniques described herein may be implemented on any properly configured processing unit without departing from the scope of the embodiments described herein, including, without limitation, one or more mobile application processors; one or more desktop or server central processing units (CPUs), including multi-core CPUs; one or more parallel processing units, such as the parallel processing unit 202 of FIG. 2A; and one or more graphics processors or special-purpose processing units.

The parallel processor or GPGPU as described herein may be communicatively coupled to host/processor cores to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor/cores over a bus or other interconnect (e.g., a high-speed interconnect such as PCIe, NVLink, or other known, standardized, or proprietary protocols). In other embodiments, the GPU may be integrated on the same package or chip as the cores and communicatively coupled to the cores over an internal processor bus/interconnect (i.e., internal to the package or chip). Regardless of the manner in which the GPU is connected, the processor cores may allocate work to the GPU in the form of sequences of commands/instructions contained in a work descriptor. The GPU then uses dedicated circuitry/logic to efficiently process these commands/instructions.

FIG. 3C illustrates a graphics processing unit (GPU) 380 that includes dedicated sets of graphics processing resources arranged into multi-core groups 365A-365N. While the details of only a single multi-core group 365A are provided, it will be appreciated that the other multi-core groups 365B-365N may be equipped with the same or similar sets of graphics processing resources. The details described with respect to the multi-core groups 365A-365N may also apply to any of the graphics multiprocessors 234, 325, 350 described herein.

As illustrated, a multi-core group 365A may include a set of graphics cores 370, a set of tensor cores 371, and a set of ray tracing cores 372. A scheduler/dispatcher 368 schedules and dispatches graphics threads for execution on the various cores 370, 371, 372. A set of register files 369 stores operand values used by the cores 370, 371, 372 when executing the graphics threads. These may include, for example, integer registers for storing integer values, floating point registers for storing floating point values, vector registers for storing packed data elements (integer and/or floating point data elements), and tile registers for storing tensor/matrix values. The tile registers may be implemented as combined sets of vector registers.

One or more combined level 1 (L1) caches and shared memory units 373 store graphics data such as texture data, vertex data, pixel data, ray data, bounding volume data, etc., locally within each multi-core group 365A. One or more texture units 374 can also be used to perform texturing operations such as texture mapping and sampling. A level 2 (L2) cache 375, shared by all or a subset of the multi-core groups 365A-365N, stores graphics data and/or instructions for multiple concurrent graphics threads. As illustrated, the L2 cache 375 may be shared across a plurality of the multi-core groups 365A-365N. One or more memory controllers 367 couple the GPU 380 to a memory 366, which may be a system memory (e.g., DRAM) and/or a dedicated graphics memory (e.g., GDDR6 memory).

Input/output (I/O) circuitry 363 couples the GPU 380 to one or more I/O devices 362 such as digital signal processors (DSPs), network controllers, or user input devices. An on-chip interconnect may be used to couple the I/O devices 362 to the GPU 380 and the memory 366. One or more I/O memory management units (IOMMUs) 364 of the I/O circuitry 363 couple the I/O devices 362 directly to the system memory 366. Optionally, the IOMMU 364 manages multiple sets of page tables to map virtual addresses to physical addresses in the system memory 366. The I/O devices 362, CPU(s) 361, and GPU(s) 380 may then share the same virtual address space.

In one implementation of the IOMMU 364, the IOMMU 364 supports virtualization. In this case, it may manage a first set of page tables to map guest/graphics virtual addresses to guest/graphics physical addresses, and a second set of page tables to map the guest/graphics physical addresses to system/host physical addresses (e.g., within the system memory 366). The base addresses of each of the first and second sets of page tables may be stored in control registers and swapped out on a context switch (e.g., so that the new context is provided with access to the relevant set of page tables). While not illustrated in FIG. 3C, each of the cores 370, 371, 372 and/or the multi-core groups 365A-365N may include translation lookaside buffers (TLBs) to cache guest virtual to guest physical translations, guest physical to host physical translations, and guest virtual to host physical translations.

The CPU(s) 361, GPU(s) 380, and I/O devices 362 may be integrated on a single semiconductor chip and/or chip package. The illustrated memory 366 may be integrated on the same chip, or may be coupled to the memory controllers 367 via an off-chip interface. In one implementation, the memory 366 comprises GDDR6 memory that shares the same virtual address space as other physical system-level memories, although the underlying principles described herein are not limited to this specific implementation.

The tensor cores 371 may include a plurality of execution units specifically designed to perform matrix operations, which are the fundamental compute operations used to perform deep learning operations. For example, simultaneous matrix multiplication operations may be used for neural network training and inferencing. The tensor cores 371 may perform matrix processing using a variety of operand precisions, including single-precision floating point (e.g., 32 bits), half-precision floating point (e.g., 16 bits), integer words (16 bits), bytes (8 bits), and nibbles (4 bits). For example, a neural network implementation extracts features of each rendered scene, potentially combining details from multiple frames, to construct a high-quality final image.

In deep learning implementations, parallel matrix multiplication work may be scheduled for execution on the tensor cores 371. The training of neural networks, in particular, requires a significant number of matrix dot product operations. In order to process an inner-product formulation of an N x N x N matrix multiply, the tensor cores 371 may include at least N dot-product processing elements. Before the matrix multiply begins, one entire matrix is loaded into tile registers, and at least one column of a second matrix is loaded each cycle for N cycles. Each cycle, there are N dot products that are processed.
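
For a concrete, hedged analogy, the sketch below uses CUDA's publicly documented WMMA API, in which a single warp computes a 16x16x16 tile product with FP32 accumulation. This is a stand-in for tile-register matrix hardware in general, not the implementation described here, and it requires nvcc with -arch=sm_70 or newer.

#include <mma.h>
#include <cuda_fp16.h>
#include <cstdio>
using namespace nvcuda;

// One warp computes a 16x16x16 half-precision tile product with FP32 accumulate.
__global__ void wmma_16x16x16(const half* a, const half* b, float* c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment(fc, 0.0f);
    wmma::load_matrix_sync(fa, a, 16);            // leading dimension 16
    wmma::load_matrix_sync(fb, b, 16);
    wmma::mma_sync(fc, fa, fb, fc);               // fc += fa * fb
    wmma::store_matrix_sync(c, fc, 16, wmma::mem_row_major);
}

int main() {
    half *a, *b; float *c;
    cudaMallocManaged(&a, 256 * sizeof(half));
    cudaMallocManaged(&b, 256 * sizeof(half));
    cudaMallocManaged(&c, 256 * sizeof(float));
    for (int i = 0; i < 256; ++i) { a[i] = __float2half(1.0f); b[i] = __float2half(1.0f); }
    wmma_16x16x16<<<1, 32>>>(a, b, c);            // exactly one warp
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);                  // 16.0: a dot product of 16 ones
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}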

Matrix elements may be stored at different precisions depending on the particular implementation, including 16-bit words, 8-bit bytes (e.g., INT8), and 4-bit nibbles (e.g., INT4). Different precision modes may be specified for the tensor cores 371 to ensure that the most efficient precision is used for different workloads (e.g., inferencing workloads, which can tolerate quantization to bytes and nibbles). Supported formats additionally include 64-bit floating point (FP64) and non-IEEE floating point formats such as the bfloat16 format (e.g., Brain floating point), a 16-bit floating point format with one sign bit, eight exponent bits, and eight significand bits, of which seven are explicitly stored. One embodiment includes support for a reduced-precision tensor-float (TF32) mode, which performs computations using the range of FP32 (8 exponent bits) and the precision of FP16 (10 mantissa bits). Reduced-precision TF32 operations can be performed on FP32 inputs and produce FP32 outputs at higher performance relative to FP32 and with increased precision relative to FP16. In one embodiment, one or more 8-bit floating point formats (FP8) are supported.
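
A minimal sketch of the bfloat16 layout described above (one sign bit, eight exponent bits, seven explicitly stored mantissa bits): the host-side CUDA C++ below converts FP32 to bfloat16 by keeping the high 16 bits, with round-to-nearest-even; NaN handling is omitted for brevity.

#include <cstdint>
#include <cstdio>
#include <cstring>

// FP32 -> bfloat16 by keeping the top 16 bits (sign, 8 exponent bits,
// 7 stored mantissa bits), with round-to-nearest-even.
uint16_t float_to_bf16(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    uint32_t round = 0x7fffu + ((bits >> 16) & 1u);   // ties to even
    return (uint16_t)((bits + round) >> 16);
}

float bf16_to_float(uint16_t h) {
    uint32_t bits = (uint32_t)h << 16;                // low mantissa bits become zero
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    float x = 3.14159265f;
    printf("%.8f -> %.8f after a bf16 round trip\n",
           x, bf16_to_float(float_to_bf16(x)));
    return 0;
}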

In one embodiment, the tensor cores 371 support a sparse operation mode for matrices in which the vast majority of values are zero. The tensor cores 371 include support for sparse input matrices encoded in a sparse matrix representation (e.g., coordinate list encoding (COO), compressed sparse row (CSR), compressed sparse column (CSC), etc.). The tensor cores 371 also include support for compressed sparse matrix representations in the event that the sparse matrix representation may be further compressed. Compressed matrix data, encoded matrix data, and/or compressed-and-encoded matrix data, along with the associated compression and/or encoding metadata, can be read by the tensor cores 371 and the non-zero values can be extracted. For example, for a given input matrix A, a non-zero value can be loaded from the compressed and/or encoded representation of at least a portion of matrix A. Based on the location of the non-zero value in matrix A, which may be determined from the index or coordinate metadata associated with the non-zero value, a corresponding value in input matrix B may be loaded. Depending on the operation to be performed (e.g., multiply), the load of the value from input matrix B may be bypassed if the corresponding value is a zero value. In one embodiment, the pairings of values for certain operations, such as multiply operations, may be pre-scanned by scheduler logic, and only operations between non-zero inputs are scheduled. Depending on the dimensions of matrix A and matrix B and the operation to be performed, output matrix C may be dense or sparse. Where output matrix C is sparse, and depending on the configuration of the tensor cores 371, output matrix C may be output in a compressed format, a sparse encoding, or a compressed sparse encoding.
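
As a software illustration of one of the named encodings, the host-side sketch below builds a compressed sparse row (CSR) matrix and performs a multiply that loads only non-zero operands; it models the data layout only, not the tensor core scheduler.

#include <cstdio>
#include <vector>

// Compressed sparse row (CSR): only non-zero values are stored, plus the
// column index of each value and the offset at which each row starts.
struct Csr {
    std::vector<float> vals;     // non-zero values
    std::vector<int>   cols;     // column index per value
    std::vector<int>   row_ptr;  // row i occupies [row_ptr[i], row_ptr[i+1])
};

// y = A * x, touching only non-zero entries (zero operands are never loaded).
std::vector<float> spmv(const Csr& a, const std::vector<float>& x) {
    std::vector<float> y(a.row_ptr.size() - 1, 0.0f);
    for (size_t i = 0; i + 1 < a.row_ptr.size(); ++i)
        for (int k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k)
            y[i] += a.vals[k] * x[a.cols[k]];
    return y;
}

int main() {
    // The 3x3 matrix [[5,0,0],[0,0,2],[0,3,0]] in CSR form.
    Csr a{{5.f, 2.f, 3.f}, {0, 2, 1}, {0, 1, 2, 3}};
    auto y = spmv(a, {1.f, 1.f, 1.f});
    printf("y = [%g, %g, %g]\n", y[0], y[1], y[2]);  // [5, 2, 3]
    return 0;
}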

The ray tracing cores 372 may accelerate ray tracing operations for both real-time and non-real-time ray tracing implementations. In particular, the ray tracing cores 372 may include ray traversal/intersection circuitry for performing ray traversal using bounding volume hierarchies (BVHs) and identifying intersections between rays and primitives enclosed within the BVH volumes. The ray tracing cores 372 may also include circuitry for performing depth testing and culling (e.g., using a Z buffer or similar arrangement). In one implementation, the ray tracing cores 372 perform traversal and intersection operations in concert with the image denoising techniques described herein, at least a portion of which may be executed on the tensor cores 371. For example, the tensor cores 371 may implement a deep learning neural network to perform denoising of frames generated by the ray tracing cores 372. However, the CPU(s) 361, graphics cores 370, and/or ray tracing cores 372 may also implement all or a portion of the denoising and/or deep learning algorithms.

In addition, as described above, a distributed approach to denoising may be employed, in which the GPU 380 is in a computing device coupled to other computing devices over a network or high-speed interconnect. Under this distributed approach, the interconnected computing devices may share neural network learning/training data to improve the speed with which the overall system learns to perform denoising for different types of image frames and/or different graphics applications.

The ray tracing cores 372 may process all BVH traversal and/or ray-primitive intersections, saving the graphics cores 370 from being overloaded with thousands of instructions per ray. For example, each ray tracing core 372 includes a first set of specialized circuitry for performing bounding box tests (e.g., for traversal operations) and/or a second set of specialized circuitry for performing ray-triangle intersection tests (e.g., intersecting rays which have been traversed). Thus, for example, the multi-core group 365A can simply launch a ray probe, and the ray tracing cores 372 independently perform ray traversal and intersection and return hit data (e.g., a hit, no hit, multiple hits, etc.) to the thread context. The other cores 370, 371 are freed to perform other graphics or compute work while the ray tracing cores 372 perform the traversal and intersection operations.
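
A hedged sketch of the first class of test follows: the standard "slab" ray/axis-aligned bounding box intersection test, written as host-side CUDA C++. It illustrates the mathematics of a bounding box test, not the dedicated circuitry described above.

#include <algorithm>
#include <cstdio>

struct Vec3 { float x, y, z; };

// Slab test: a ray hits an axis-aligned box iff the parametric intervals in
// which it lies inside each pair of slabs overlap. inv_dir is 1/direction.
bool ray_aabb(Vec3 origin, Vec3 inv_dir, Vec3 lo, Vec3 hi) {
    float t0 = 0.0f, t1 = 1e30f;
    const float o[3]  = {origin.x, origin.y, origin.z};
    const float id[3] = {inv_dir.x, inv_dir.y, inv_dir.z};
    const float l[3]  = {lo.x, lo.y, lo.z};
    const float h[3]  = {hi.x, hi.y, hi.z};
    for (int i = 0; i < 3; ++i) {
        float ta = (l[i] - o[i]) * id[i];
        float tb = (h[i] - o[i]) * id[i];
        t0 = std::max(t0, std::min(ta, tb));
        t1 = std::min(t1, std::max(ta, tb));
    }
    return t0 <= t1;   // overlapping interval: the box is hit
}

int main() {
    Vec3 origin{0, 0, -5}, inv_dir{1e30f, 1e30f, 1.0f};  // ray along +z
    printf("hit: %d\n", ray_aabb(origin, inv_dir, {-1, -1, -1}, {1, 1, 1}));  // 1
    return 0;
}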

Optionally, each ray tracing core 372 may include a traversal unit to perform BVH testing operations and/or an intersection unit that performs ray-primitive intersection tests. The intersection unit generates a "hit", "no hit", or "multiple hit" response, which it provides to the appropriate thread. During the traversal and intersection operations, the execution resources of the other cores (e.g., the graphics cores 370 and tensor cores 371) are freed to perform other forms of graphics work.

In one optional embodiment described below, a hybrid rasterization/ray tracing approach is used in which work is distributed between the graphics cores 370 and the ray tracing cores 372.

The ray tracing cores 372 (and/or the other cores 370, 371) may include hardware support for a ray tracing instruction set such as Microsoft's DirectX Ray Tracing (DXR), which includes a DispatchRays command, as well as ray-generation, closest-hit, any-hit, and miss shaders, which enable the assignment of unique sets of shaders and textures to each object. Another ray tracing platform that may be supported by the ray tracing cores 372, graphics cores 370, and tensor cores 371 is the Vulkan API (e.g., Vulkan version 1.1.85 or later). Note, however, that the underlying principles described herein are not limited to any particular ray tracing ISA.

In general, the various cores 372, 371, 370 may support a ray tracing instruction set that includes instructions/functions for one or more of: ray generation, closest hit, any hit, ray-primitive intersection, per-primitive and hierarchical bounding box construction, miss, visit, and exceptions. More specifically, a preferred embodiment includes ray tracing instructions to perform one or more of the following functions (a minimal sketch of how several of these stages fit together follows the list):

Ray generation: Ray generation instructions may be executed for each pixel, sample, or other user-defined work assignment.

Closest hit: A closest hit instruction may be executed to locate the closest intersection point of a ray with primitives within a scene.

Any hit: An any-hit instruction identifies multiple intersections between a ray and primitives within a scene, potentially to identify a new closest intersection point.

Intersection: An intersection instruction performs a ray-primitive intersection test and outputs a result.

Per-primitive bounding box construction: This instruction builds a bounding box around a given primitive or group of primitives (e.g., when building a new BVH or other acceleration data structure).

Miss: Indicates that a ray misses all geometry within a scene, or within a specified region of a scene.

Visit: Indicates the child volumes a ray will traverse.

Exceptions: Includes various types of exception handlers (e.g., invoked for various error conditions).
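
The sketch below models how a few of these stages compose, using plain host-side CUDA C++: a ray-generation loop, an analytic ray/sphere intersection standing in for the intersection stage, and closest-hit/miss functions standing in for shaders. All names and the one-sphere scene are illustrative assumptions.

#include <cmath>
#include <cstdio>
#include <optional>

struct Ray { float ox, oy, oz, dx, dy, dz; };   // origin and unit direction
struct Hit { float t; };                         // distance to the closest hit

// Intersection stage: analytic ray/sphere test (unit sphere at the origin).
std::optional<Hit> intersect(const Ray& r) {
    float b = r.ox * r.dx + r.oy * r.dy + r.oz * r.dz;
    float c = r.ox * r.ox + r.oy * r.oy + r.oz * r.oz - 1.0f;
    float disc = b * b - c;
    if (disc < 0.0f) return std::nullopt;        // miss
    return Hit{-b - std::sqrt(disc)};            // nearest root
}

// Closest-hit and miss "shaders" for this minimal pipeline.
float closest_hit(const Hit& h) { return 1.0f / (1.0f + h.t); }  // shade by distance
float miss()                    { return 0.0f; }                 // background

int main() {
    // Ray-generation stage: one ray per "pixel" along a scanline.
    for (int px = 0; px < 5; ++px) {
        Ray r{-2.0f + px, 0.0f, -3.0f, 0.0f, 0.0f, 1.0f};
        auto h = intersect(r);
        printf("pixel %d -> %s (%.3f)\n", px, h ? "hit " : "miss",
               h ? closest_hit(*h) : miss());
    }
    return 0;
}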

In one embodiment, the ray tracing cores 372 may be adapted to accelerate general-purpose compute operations that can be accelerated using computational techniques analogous to ray intersection tests. A compute framework may be provided that enables shader programs to be compiled into low-level instructions and/or primitives that perform general-purpose compute operations via the ray tracing cores. Exemplary computational problems that may benefit from compute operations performed on the ray tracing cores 372 include computations involving beam, wave, ray, or particle propagation within a coordinate space. Interactions associated with that propagation can be computed relative to a geometry or mesh within the coordinate space. For example, computations associated with electromagnetic signal propagation through an environment can be accelerated via the use of instructions or primitives executed via the ray tracing cores. Refraction and reflection of the signals by objects in the environment can be computed as direct ray tracing simulations.

The ray tracing cores 372 can also be used to perform computations that are not directly analogous to ray tracing. For example, mesh projection, mesh refinement, and volume sampling computations can be accelerated using the ray tracing cores 372. Generic coordinate space computations, such as nearest-neighbor computations, can also be performed. For example, the set of points near a given point can be discovered by defining a bounding box in the coordinate space around the point. The BVH and ray probe logic within the ray tracing cores 372 can then be used to determine the set of point intersections within the bounding box. The intersections constitute the origin point and the nearest neighbors to that origin point. Computations performed using the ray tracing cores 372 can be performed in parallel with computations performed on the graphics cores 370 and tensor cores 371. A shader compiler can be configured to compile a compute shader or other general-purpose graphics processing program into low-level primitives that can be parallelized across the graphics cores 370, tensor cores 371, and ray tracing cores 372.

Techniques for GPU to Host Processor Interconnection

FIG. 4A illustrates an exemplary architecture in which a plurality of GPUs 410-413 (e.g., such as the parallel processor 200 shown in FIG. 2A) are communicatively coupled to a plurality of multi-core processors 405-406 over high-speed links 440A-440D (e.g., buses, point-to-point interconnects, etc.). The high-speed links 440A-440D may support a communication throughput of 4 GB/s, 30 GB/s, 80 GB/s, or higher, depending on the implementation. Various interconnect protocols may be used including, but not limited to, PCIe 4.0 or 5.0 and NVLink 2.0. However, the underlying principles described herein are not limited to any particular communication protocol or throughput.

Two or more of the GPUs 410-413 may be interconnected over high-speed links 442A-442B, which may be implemented using the same or different protocols/links than those used for the high-speed links 440A-440D. Similarly, two or more of the multi-core processors 405-406 may be connected over a high-speed link 443, which may be a symmetric multi-processor (SMP) bus operating at 20 GB/s, 30 GB/s, 120 GB/s, or lower or higher speeds. Alternatively, all communication between the various system components shown in FIG. 4A may be accomplished using the same protocols/links (e.g., over a common interconnect fabric). As mentioned, however, the underlying principles described herein are not limited to any particular type of interconnect technology.

Each of the multi-core processors 405 and 406 may be communicatively coupled to a processor memory 401-402 via memory interconnects 430A-430B, respectively, and each GPU 410-413 may be communicatively coupled to GPU memory 420-423 over GPU memory interconnects 450A-450D, respectively. The memory interconnects 430A-430B and 450A-450D may utilize the same or different memory access technologies. By way of example, and not limitation, the processor memories 401-402 and GPU memories 420-423 may be volatile memories such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high bandwidth memory (HBM), and/or may be non-volatile memories such as 3D XPoint/Optane or Nano-Ram. For example, some portion of the memories may be volatile memory and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy). The memory subsystems described herein may be compatible with a number of memory technologies, such as the Double Data Rate versions released by JEDEC (Joint Electron Device Engineering Council).

As described below, although the various processors 405-406 and GPUs 410-413 may be physically coupled to a particular memory 401-402, 420-423, respectively, a unified memory architecture may be implemented in which the same virtual system address space (also referred to as the "effective address" space) is distributed among all of the various physical memories. For example, the processor memories 401-402 may each comprise 64 GB of the system memory address space and the GPU memories 420-423 may each comprise 32 GB of the system memory address space (resulting in a total of 256 GB of addressable memory in this example).
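
As a hedged, API-level analogy to such a unified effective address space, the CUDA C++ example below allocates managed memory so the same pointer is valid on both the host and the device; it demonstrates the programming model only, not the architecture described here.

#include <cstdio>

__global__ void scale(float* v, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float* v = nullptr;
    cudaMallocManaged(&v, n * sizeof(float));     // one pointer, one address space
    for (int i = 0; i < n; ++i) v[i] = 1.0f;      // written by the host
    scale<<<(n + 255) / 256, 256>>>(v, n, 2.0f);  // read/written by the device
    cudaDeviceSynchronize();
    printf("v[0] = %f\n", v[0]);                  // prints 2.000000
    cudaFree(v);
    return 0;
}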

FIG. 4B illustrates additional optional details of an interconnection between a multi-core processor 407 and a graphics acceleration module 446. The graphics acceleration module 446 may include one or more GPU chips integrated on a line card that is coupled to the processor 407 via the high-speed link 440. Alternatively, the graphics acceleration module 446 may be integrated on the same package or chip as the processor 407.

The illustrated processor 407 includes a plurality of cores 460A-460D, each with a translation lookaside buffer 461A-461D and one or more caches 462A-462D. The cores may include various other components for executing instructions and processing data that are not illustrated to avoid obscuring the underlying principles of the components described herein (e.g., instruction fetch units, branch prediction units, decoders, execution units, reorder buffers, etc.). The caches 462A-462D may comprise level 1 (L1) and level 2 (L2) caches. In addition, one or more shared caches 456 may be included in the caching hierarchy and shared by sets of the cores 460A-460D. For example, one embodiment of the processor 407 includes 24 cores, each with its own L1 cache, twelve shared L2 caches, and twelve shared L3 caches. In this embodiment, one L2 cache and one L3 cache are shared by two adjacent cores. The processor 407 and the graphics accelerator integration module 446 connect with system memory 441, which may include the processor memories 401-402.

Coherency is maintained for data and instructions stored in the various caches 462A-462D, 456 and system memory 441 via inter-core communication over a coherence bus 464. For example, each cache may have cache coherency logic/circuitry associated with it to communicate over the coherence bus 464 in response to detected reads or writes to particular cache lines. In one implementation, a cache snooping protocol is implemented over the coherence bus 464 to snoop cache accesses. Cache snooping/coherency techniques are well understood by those of skill in the art and will not be described in detail here to avoid obscuring the underlying principles described herein.

A proxy circuit 425 may be provided that communicatively couples the graphics acceleration module 446 to the coherence bus 464, allowing the graphics acceleration module 446 to participate in the cache coherence protocol as a peer of the cores. In particular, an interface 435 provides connectivity to the proxy circuit 425 over the high-speed link 440 (e.g., a PCIe bus, NVLink, etc.), and an interface 437 connects the graphics acceleration module 446 to the high-speed link 440.

In one implementation, an accelerator integration circuit 436 provides cache management, memory access, context management, and interrupt management services on behalf of a plurality of graphics processing engines 431, 432, ..., N of the graphics acceleration module 446. The graphics processing engines 431, 432, ..., N may each comprise a separate graphics processing unit (GPU). Alternatively, the graphics processing engines 431, 432, ..., N may comprise different types of graphics processing engines within a GPU, such as graphics execution units, media processing engines (e.g., video encoders/decoders), samplers, and block image transfer (BLIT) engines. In other words, the graphics acceleration module may be a GPU with a plurality of graphics processing engines 431-432, ..., N, or the graphics processing engines 431-432, ..., N may be individual GPUs integrated on a common package, line card, or chip. The graphics processing engines 431-432, ..., N may be configured with any of the graphics processor or compute accelerator architectures described herein.

The accelerator integration circuit 436 may include a memory management unit (MMU) 439 for performing various memory management functions such as virtual-to-physical memory translations (also referred to as effective-to-real memory translations) and memory access protocols for accessing the system memory 441. The MMU 439 may also include a translation lookaside buffer (TLB) (not shown) for caching virtual/effective to physical/real address translations. In one implementation, a cache 438 stores commands and data for efficient access by the graphics processing engines 431, 432, ..., N. The data stored in the cache 438 and the graphics memories 433-434, ..., M may be kept coherent with the core caches 462A-462D, 456 and the system memory 441. As mentioned, this may be accomplished via the proxy circuit 425, which takes part in the cache coherency mechanism on behalf of the cache 438 and the graphics memories 433-434, ..., M (e.g., sending updates to the cache 438 related to modifications/accesses of cache lines on the processor caches 462A-462D, 456, and receiving updates from the cache 438).

A set of registers 445 stores context data for threads executed by the graphics processing engines 431-432, ..., N, and a context management circuit 448 manages these thread contexts. For example, the context management circuit 448 may perform save and restore operations to save and restore the contexts of the various threads during context switches (e.g., where a first thread is saved and a second thread is restored so that the second thread can be executed by a graphics processing engine). For example, on a context switch, the context management circuit 448 may store the current register values to a designated region in memory (e.g., identified by a context pointer). It may then restore the register values when returning to that context. An interrupt management circuit 447 may, for example, receive and process interrupts received from system devices.
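
A toy model of the save/restore flow is sketched below in host-side CUDA C++; the ThreadContext layout and the function names are illustrative assumptions, not the register set 445 or the circuit's actual behavior.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Toy model only: ThreadContext stands in for whatever state the context
// pointer designates; the layout is an assumption for illustration.
struct ThreadContext {
    uint64_t gpr[32];   // general-purpose register values
    uint64_t pc;        // resume point
};

// Save the outgoing thread's registers to its save area, then load the
// incoming thread's registers in their place.
void context_switch(ThreadContext* live, ThreadContext* save_area,
                    const ThreadContext* restore_area) {
    std::memcpy(save_area, live, sizeof(ThreadContext));
    std::memcpy(live, restore_area, sizeof(ThreadContext));
}

int main() {
    ThreadContext live = {{1, 2}, 100}, saved = {}, incoming = {{7, 8}, 200};
    context_switch(&live, &saved, &incoming);
    printf("live.pc=%llu saved.pc=%llu\n",
           (unsigned long long)live.pc, (unsigned long long)saved.pc);  // 200, 100
    return 0;
}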

In one implementation, virtual/effective addresses from a graphics processing engine 431 are translated to real/physical addresses in the system memory 441 by the MMU 439. Optionally, the accelerator integration circuit 436 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 446 and/or other accelerator devices. The graphics accelerator module 446 may be dedicated to a single application executing on the processor 407 or may be shared between multiple applications. Optionally, a virtualized graphics execution environment is provided in which the resources of the graphics processing engines 431-432, ..., N are shared with multiple applications, virtual machines (VMs), or containers. The resources may be subdivided into "slices" that are allocated to different VMs and/or applications based on the processing requirements and priorities associated with the VMs and/or applications, or based on a predetermined partitioning profile for the graphics accelerator module 446. VMs and containers may be used interchangeably herein.

A virtual machine (VM) can be software that runs an operating system and one or more applications. A VM can be defined by a specification, configuration files, a virtual disk file, a non-volatile random access memory (NVRAM) settings file, and a log file, and is backed by the physical resources of a host computing platform. A VM can include an operating system (OS) or application environment installed on software that imitates dedicated hardware. The end user has the same experience on a virtual machine as they would have on dedicated hardware. Specialized software, called a hypervisor, fully emulates the CPU, memory, hard disk, network, and other hardware resources of a PC client or server, enabling virtual machines to share the resources. The hypervisor can emulate multiple virtual hardware platforms that are isolated from each other, allowing virtual machines to run operating systems such as Windows Server and VMware ESXi on the same underlying physical host.

A container can be a software package of applications, configurations, and dependencies so that the applications run reliably from one computing environment to another. Containers can share an operating system installed on the server platform and run as isolated processes. A container can be a software package that contains everything the software needs to run, such as system tools, libraries, and settings. Containers are not installed like traditional software programs, which allows them to be isolated from other software and from the operating system itself. The isolated nature of containers provides several benefits. First, the software in a container will run the same way in different environments. For example, a container that includes PHP and MySQL can run in exactly the same way on both a Linux computer and a Windows machine. Second, containers provide added security, since the software will not affect the host operating system. While an installed application may alter system settings and modify resources, such as the Windows registry, a container can only modify settings within the container.

Thus, the accelerator integration circuit 436 acts as a bridge to the system for the graphics acceleration module 446 and provides address translation and system memory cache services. In one embodiment, to facilitate the bridging functionality, the accelerator integration circuit 436 may also include shared I/O 497 (e.g., PCIe, USB, or other elements) and hardware to enable system control over voltage, clocking, performance, thermals, and security. The shared I/O 497 may utilize separate physical connections or may traverse the high-speed link 440. In addition, the accelerator integration circuit 436 may provide virtualization facilities for the host processor to manage virtualization of the graphics processing engines, interrupts, and memory management.

Because the hardware resources of the graphics processing engines 431-432, ..., N are mapped explicitly to the real address space seen by the host processor 407, any host processor can address these resources directly using an effective address value. One optional function of the accelerator integration circuit 436 is the physical separation of the graphics processing engines 431-432, ..., N so that they appear to the system as independent units.

One or more graphics memories 433-434, ..., M may be coupled to each of the graphics processing engines 431-432, ..., N, respectively. The graphics memories 433-434, ..., M store instructions and data being processed by each of the graphics processing engines 431-432, ..., N. The graphics memories 433-434, ..., M may be volatile memories such as DRAMs (including stacked DRAMs), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and/or may be non-volatile memories such as 3D XPoint/Optane, Samsung Z-NAND, or Nano-Ram.

To reduce data traffic over the high-speed link 440, biasing techniques may be used to ensure that the data stored in the graphics memories 433-434, ..., M is data that will be used most frequently by the graphics processing engines 431-432, ..., N, and preferably not used (at least not frequently) by the cores 460A-460D. Similarly, the biasing mechanism attempts to keep data needed by the cores (and preferably not by the graphics processing engines 431-432, ..., N) within the system memory 441 and the cores' caches 462A-462D, 456.

According to a variant shown in FIG. 4C, the accelerator integration circuit 436 is integrated within the processor 407. The graphics processing engines 431-432, ..., N communicate directly over the high-speed link 440 to the accelerator integration circuit 436 via an interface 437 and an interface 435 (which, again, may utilize any form of bus or interface protocol). The accelerator integration circuit 436 may perform the same operations as those described with respect to FIG. 4B, but potentially at a higher throughput given its close proximity to the coherence bus 464 and the caches 462A-462D, 456.

The embodiments described may support different programming models, including a dedicated-process programming model (no graphics acceleration module virtualization) and shared programming models (with virtualization). The latter may include programming models controlled by the accelerator integration circuit 436 and programming models controlled by the graphics acceleration module 446.

In embodiments of the dedicated-process model, the graphics processing engines 431, 432, ..., N may be dedicated to a single application or process under a single operating system. The single application can funnel other application requests to the graphics engines 431, 432, ..., N, providing virtualization within a VM/partition.

In the shared programming models, the graphics processing engines 431-432, ..., N may be shared by multiple VM/application partitions. The shared models require a system hypervisor to virtualize the graphics processing engines 431-432, ..., N to allow access by each operating system. For single-partition systems without a hypervisor, the graphics processing engines 431-432, ..., N are owned by the operating system. In both cases, the operating system may virtualize the graphics processing engines 431-432, ..., N to provide access to each process or application.

For the shared programming model, the graphics acceleration module 446 or the individual graphics processing engines 431-432 ... N use a process handle to select a process element. The process elements may be stored in the system memory 441 and may be addressable using the effective-address-to-real-address translation techniques described herein. The process handle may be an implementation-specific value provided to the host process when registering its context with the graphics processing engines 431-432 ... N (that is, calling system software to add the process element to the process element linked list). The lower 16 bits of the process handle may be the offset of the process element within the process element linked list.
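
As a rough illustration of the handle-decoding step just described, the following C sketch extracts a process-element offset from the lower 16 bits of a handle. The struct fields, function names, and 64-bit handle width are hypothetical; the document specifies only that the lower 16 bits carry the offset, not a concrete layout.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical process-element record; the field layout is illustrative only. */
struct process_element {
    uint64_t work_descriptor;   /* WD: a job or a pointer to a job queue */
    uint64_t process_state;     /* saved state for the owning application */
};

/* The lower 16 bits of the handle are taken here as a byte offset of the
 * element within the process-element list; the remaining bits are treated
 * as opaque, implementation-specific data. */
static inline struct process_element *
lookup_process_element(uint8_t *pe_list_base, uint64_t process_handle)
{
    size_t offset = (size_t)(process_handle & 0xFFFFu);
    return (struct process_element *)(pe_list_base + offset);
}
```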

FIG. 4D illustrates an exemplary accelerator integrated slice 490. As used herein, a "slice" comprises a specified portion of the processing resources of the accelerator integrated circuit 436. An application effective address space 482 within the system memory 441 stores process elements 483. A process element 483 may be stored in response to a GPU invocation 481 from an application 480 executed on the processor 407. A process element 483 contains the process state for the corresponding application 480. A work descriptor (WD) 484 contained in the process element 483 may be a single job requested by an application, or may contain a pointer to a queue of jobs. In the latter case, the WD 484 is a pointer to the job request queue in the application's address space 482.

The graphics acceleration module 446 and/or the individual graphics processing engines 431-432 ... N can be shared by all or a subset of the processes in the system. For example, the technologies described herein may include an infrastructure for setting up the process state and sending a WD 484 to the graphics acceleration module 446 to start a job in a virtualized environment.

In one implementation, the dedicated-process programming model is implementation-specific. In this model, a single process owns the graphics acceleration module 446 or an individual graphics processing engine 431. Because the graphics acceleration module 446 is owned by a single process, when the graphics acceleration module 446 is assigned, the hypervisor initializes the accelerator integrated circuit 436 for the owning partition, and the operating system initializes the accelerator integrated circuit 436 for the owning process.

In operation, a WD fetch unit 491 in the accelerator integrated slice 490 fetches the next WD 484, which includes an indication of the work to be done by one of the graphics processing engines of the graphics acceleration module 446. As illustrated, data from the WD 484 may be stored in registers 445 and used by the MMU 439, the interrupt management circuit 447, and/or the context management circuit 448. For example, the MMU 439 may include segment/page walk circuitry for accessing segment/page tables 486 within the OS virtual address space 485. The interrupt management circuit 447 may process interrupt events 492 received from the graphics acceleration module 446. When performing graphics operations, an effective address 493 generated by a graphics processing engine 431-432 ... N is translated to a real address by the MMU 439.

The same set of registers 445 may be duplicated for each graphics processing engine 431-432 ... N and/or graphics acceleration module 446, and may be initialized by the hypervisor or the operating system. Each of these duplicated registers may be included in an accelerator integrated slice 490. In one embodiment, each graphics processing engine 431-432 ... N may be presented to the hypervisor 496 as a distinct graphics processor device. QoS settings can be configured for clients of a specific graphics processing engine 431-432 ... N, and data isolation between the clients of each engine can be enabled. Exemplary registers that may be initialized by the hypervisor are shown in Table 1.

Table 1 - Registers initialized by the hypervisor

Exemplary registers that may be initialized by the operating system are shown in Table 2.

Table 2 - Registers initialized by the operating system

1  Process and thread identification
2  Effective address (EA) context save/restore pointer
3  Virtual address (VA) accelerator utilization record pointer
4  Virtual address (VA) storage segment table pointer
5  Authority mask
6  Work descriptor

Each WD 484 may be specific to a particular graphics acceleration module 446 and/or graphics processing engine 431-432 ... N. It contains all the information required by a graphics processing engine 431-432 ... N to do its work, or it can be a pointer to a memory location where the application has set up a command queue of work to be completed.

FIG. 4E illustrates additional optional details of a shared model. It includes a hypervisor real address space 498 in which a process element list 499 is stored. The hypervisor real address space 498 is accessible via the hypervisor 496, which virtualizes the graphics acceleration module engines for the operating system 495.

The shared programming models allow all or a subset of the processes, from all or a subset of the partitions in the system, to use a graphics acceleration module 446. There are two programming models in which the graphics acceleration module 446 is shared by multiple processes and partitions: time-sliced sharing and graphics-directed sharing.

In this model, the system hypervisor 496 owns the graphics acceleration module 446 and makes its function available to all operating systems 495. For a graphics acceleration module 446 to support virtualization by the system hypervisor 496, the graphics acceleration module 446 may adhere to the following requirements: 1) An application's job request must be autonomous (that is, state does not need to be maintained between jobs), or the graphics acceleration module 446 must provide a context save and restore mechanism. 2) An application's job request is guaranteed by the graphics acceleration module 446 to complete in a specified amount of time, including any translation faults, or the graphics acceleration module 446 provides the ability to preempt the processing of a job. 3) The graphics acceleration module 446 must guarantee fairness between processes when operating in the directed shared programming model.

For the directed shared model, the application 480 may be required to make an operating system 495 system call with a graphics acceleration module 446 type, a work descriptor (WD), an authority mask register (AMR) value, and a context save/restore area pointer (CSRP). The graphics acceleration module 446 type describes the targeted acceleration function for the system call. The graphics acceleration module 446 type may be a system-specific value. The WD is formatted specifically for the graphics acceleration module 446, and the WD can be in the form of a graphics acceleration module 446 command, an effective address pointer to a user-defined structure, an effective address pointer to a queue of commands, or any other data structure to describe the work to be done by the graphics acceleration module 446. In one embodiment, the AMR value is the AMR state to use for the current process. The value passed to the operating system is similar to an application setting the AMR. If the accelerator integrated circuit 436 and graphics acceleration module 446 implementations do not support a User Authority Mask Override Register (UAMOR), the operating system may apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. The hypervisor 496 may optionally apply the current Authority Mask Override Register (AMOR) value before placing the AMR into the process element 483. The CSRP may be one of the registers 445 containing the effective address of an area in the application's address space 482 for the graphics acceleration module 446 to save and restore the context state. This pointer is optional if no state is required to be saved between jobs or when a job is preempted. The context save/restore area may be pinned system memory.
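
The two mask-application steps above can be sketched as follows. This is purely illustrative: it assumes that "applying" a UAMOR or AMOR to an AMR is a bitwise AND, and all function names and register widths are hypothetical rather than taken from the document.

```c
#include <stdint.h>

/* Hypothetical 64-bit mask registers; encodings are implementation-specific. */
typedef uint64_t amr_t;

/* OS side: if the hardware lacks a UAMOR, fold the current UAMOR into the
 * AMR supplied by the application before passing it in the hypervisor call. */
static amr_t os_prepare_amr(amr_t app_amr, amr_t current_uamor, int hw_has_uamor)
{
    return hw_has_uamor ? app_amr : (app_amr & current_uamor);
}

/* Hypervisor side: optionally apply the current AMOR before placing the
 * (potentially masked) AMR into the process element. */
static amr_t hv_prepare_amr(amr_t amr_from_os, amr_t current_amor)
{
    return amr_from_os & current_amor;
}
```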

Upon receiving the system call, the operating system 495 may verify that the application 480 has been registered and given authority to use the graphics acceleration module 446. The operating system 495 then calls the hypervisor 496 with the information shown in Table 3.

Table 3 - OS call parameters to the hypervisor

Upon receiving the hypervisor call, the hypervisor 496 verifies that the operating system 495 has registered and been given authority to use the graphics acceleration module 446. The hypervisor 496 then puts the process element 483 into the process element linked list for the corresponding graphics acceleration module 446 type. The process element may include the information shown in Table 4.

Table 4 - Process element information

1   Work descriptor (WD)
2   Authority mask register (AMR) value (potentially masked)
3   Effective address (EA) context save/restore area pointer (CSRP)
4   Process ID (PID) and optional thread ID (TID)
5   Virtual address (VA) accelerator utilization record pointer (AURP)
6   Virtual address of the storage segment table pointer (SSTP)
7   Logical interrupt service number (LISN)
8   Interrupt vector table, derived from the hypervisor call parameters
9   State register (SR) value
10  Logical partition ID (LPID)
11  Real address (RA) hypervisor accelerator utilization record pointer
12  Storage Descriptor Register (SDR)
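
For illustration only, the Table 4 contents could be mirrored in a C structure such as the sketch below. The field widths, ordering, and names are assumptions; the document specifies what information a process element carries, not a binary layout.

```c
#include <stdint.h>

/* Illustrative C view of the Table 4 process-element information. */
struct process_element_info {
    uint64_t wd;        /* 1: work descriptor (WD) */
    uint64_t amr;       /* 2: authority mask register value, potentially masked */
    uint64_t ea_csrp;   /* 3: effective address (EA) context save/restore area pointer */
    uint32_t pid;       /* 4: process ID */
    uint32_t tid;       /* 4: optional thread ID */
    uint64_t va_aurp;   /* 5: virtual address (VA) accelerator utilization record pointer */
    uint64_t va_sstp;   /* 6: virtual address of the storage segment table pointer */
    uint32_t lisn;      /* 7: logical interrupt service number */
    uint64_t ivt;       /* 8: interrupt vector table, derived from hypervisor call parameters */
    uint64_t sr;        /* 9: state register value */
    uint32_t lpid;      /* 10: logical partition ID */
    uint64_t ra_aurp;   /* 11: real address (RA) hypervisor accelerator utilization record pointer */
    uint64_t sdr;       /* 12: storage descriptor register */
};
```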

The hypervisor may initialize the registers 445 for multiple accelerator integrated slices 490.

As illustrated in FIG. 4F, in one optional implementation, a unified memory addressable via a common virtual memory address space is employed, where the common virtual memory address space is used to access the physical processor memories 401-402 and the GPU memories 420-423. In this implementation, operations executed on the GPUs 410-413 utilize the same virtual/effective memory address space to access the processor memories 401-402, and vice versa, thereby simplifying programmability. A first portion of the virtual/effective address space may be allocated to the processor memory 401, a second portion to the second processor memory 402, a third portion to the GPU memory 420, and so on. The entire virtual/effective memory space (sometimes referred to as the effective address space) may thereby be distributed across each of the processor memories 401-402 and GPU memories 420-423, allowing any processor or GPU to access any physical memory with a virtual address mapped to that memory.

Bias/coherence management circuitry 494A-494E within one or more of the MMUs 439A-439E may be provided that ensures cache coherence between the caches of the host processors (e.g., 405) and the GPUs 410-413, and that implements biasing techniques indicating the physical memories in which certain types of data should be stored. While multiple instances of the bias/coherence management circuitry 494A-494E are illustrated in FIG. 4F, the bias/coherence circuitry may be implemented within the MMU of one or more host processors 405 and/or within the accelerator integrated circuit 436.

The GPU-attached memories 420-423 may be mapped as part of system memory and accessed using shared virtual memory (SVM) technology, but without suffering the typical performance drawbacks associated with full system cache coherence. The ability of the GPU-attached memories 420-423 to be accessed as system memory without onerous cache coherence overhead provides a beneficial operating environment for GPU offload. This arrangement allows the host processor 405 to set up operands and access computation results without the overhead of traditional I/O DMA data copies. Such traditional copies involve driver calls, interrupts, and memory-mapped I/O (MMIO) accesses, which are all inefficient relative to simple memory accesses. At the same time, the ability to access the GPU-attached memories 420-423 without cache coherence overheads can be critical to the execution time of an offloaded computation. For example, in cases with substantial streaming write memory traffic, cache coherence overhead can significantly reduce the effective write bandwidth seen by the GPUs 410-413. The efficiency of operand setup, the efficiency of result access, and the efficiency of GPU computation all play a role in determining the effectiveness of GPU offload.

The selection between GPU bias and host processor bias may be driven by a bias tracker data structure. For example, a bias table may be used, which may be a page-granular structure (that is, controlled at the granularity of a memory page) that includes 1 or 2 bits per GPU-attached memory page. The bias table may be implemented in a stolen memory range of one or more of the GPU-attached memories 420-423, with or without a bias cache in the GPUs 410-413 (e.g., to cache frequently/recently used entries of the bias table). Alternatively, the entire bias table may be maintained within the GPU.

In one implementation, the bias table entry associated with each access to a GPU-attached memory 420-423 is accessed prior to the actual access to the GPU memory, causing the following operations. First, local requests from a GPU 410-413 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 420-423. Local requests from the GPU that find their page in host bias are forwarded to the processor 405 (e.g., over the high-speed link, as discussed above). Optionally, requests from the processor 405 that find the requested page in host processor bias complete the request like a normal memory read. Alternatively, requests directed to a GPU-biased page may be forwarded to the GPU 410-413. If the GPU is not currently using the page, the GPU may then transition the page to host processor bias.
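
A minimal sketch of this lookup-and-route flow is given below, assuming 4 KiB pages, a 1-bit-per-page bias table, and hypothetical names throughout; the actual page granularity, bit encoding, and routing hardware are implementation-specific.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12  /* assumed 4 KiB pages; granularity is implementation-specific */

enum page_bias { HOST_BIAS = 0, GPU_BIAS = 1 };

/* One bit per GPU-attached memory page (the text also allows 2 bits/page),
 * e.g. backed by a stolen memory range as described above. */
struct bias_table {
    const uint8_t *bits;
    size_t num_pages;
};

static enum page_bias bias_lookup(const struct bias_table *bt, uint64_t gpu_mem_offset)
{
    size_t page = (size_t)(gpu_mem_offset >> PAGE_SHIFT);
    return ((bt->bits[page / 8] >> (page % 8)) & 1) ? GPU_BIAS : HOST_BIAS;
}

/* Routing decision for a GPU-local request: GPU-biased pages go straight to
 * local GPU memory; host-biased pages are forwarded to the host processor
 * over the high-speed link. */
enum route { ROUTE_LOCAL_GPU_MEMORY, ROUTE_HOST_PROCESSOR };

static enum route route_gpu_request(const struct bias_table *bt, uint64_t gpu_mem_offset)
{
    return bias_lookup(bt, gpu_mem_offset) == GPU_BIAS
               ? ROUTE_LOCAL_GPU_MEMORY
               : ROUTE_HOST_PROCESSOR;
}
```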

The bias state of a page can be changed by a software-based mechanism, a hardware-assisted software-based mechanism, or, for a limited set of cases, a purely hardware-based mechanism.

One mechanism for changing the bias state employs an API call (e.g., OpenCL), which in turn calls the GPU's device driver, which in turn sends a message (or enqueues a command descriptor) to the GPU directing it to change the bias state and, for some transitions, to perform a cache flushing operation in the host. The cache flushing operation is required for a transition from host processor 405 bias to GPU bias, but not for the opposite transition.

Cache coherency may be maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 405. To access these pages, the processor 405 may request access from the GPU 410, which may or may not grant access right away, depending on the implementation. Thus, to reduce communication between the host processor 405 and the GPU 410, it is beneficial to ensure that GPU-biased pages are those required by the GPU but not by the host processor 405, and vice versa.

Graphics Processing Pipeline

FIG. 5 illustrates a graphics processing pipeline 500. A graphics multiprocessor (such as the graphics multiprocessor 234 of FIG. 2D, the graphics multiprocessor 325 of FIG. 3A, or the graphics multiprocessor 350 of FIG. 3B) can implement the illustrated graphics processing pipeline 500. The graphics multiprocessor can be included within a parallel processing subsystem as described herein, such as the parallel processor 200 of FIG. 2A, which may be related to the parallel processor(s) 112 of FIG. 1 and may be used in place of one of those parallel processors. Various parallel processor systems can implement the graphics processing pipeline 500 via one or more instances of a parallel processing unit (e.g., the parallel processing unit 202 of FIG. 2A) as described herein. For example, a shader unit (e.g., the graphics multiprocessor 234 of FIG. 2C) may be configured to perform the functions of one or more of a vertex processing unit 504, a tessellation control processing unit 508, a tessellation evaluation processing unit 512, a geometry processing unit 516, and a fragment/pixel processing unit 524. The functions of the data assembler 502, the primitive assemblers 506, 514, 518, the tessellation unit 510, the rasterizer 522, and the raster operations unit 526 may also be performed by other processing engines within a processing cluster (e.g., the processing cluster 214 of FIG. 2A) and corresponding partition units (e.g., the partition units 220A-220N of FIG. 2A). The graphics processing pipeline 500 may also be implemented using dedicated processing units for one or more functions. It is also possible for one or more portions of the graphics processing pipeline 500 to be performed by parallel processing logic within a general-purpose processor (e.g., a CPU). Optionally, one or more portions of the graphics processing pipeline 500 can access on-chip memory (e.g., the parallel processor memory 222 of FIG. 2A) via a memory interface 528, which may be an instance of the memory interface 218 of FIG. 2A. The graphics processing pipeline 500 may also be implemented via a multi-core group 365A as in FIG. 3C.

The data assembler 502 is a processing unit that may collect vertex data for surfaces and primitives. The data assembler 502 then outputs the vertex data, which includes the vertex attributes, to the vertex processing unit 504. The vertex processing unit 504 is a programmable execution unit that executes vertex shader programs, lighting and transforming the vertex data as specified by the vertex shader programs. The vertex processing unit 504 reads data stored in cache, local, or system memory for use in processing the vertex data, and may be programmed to transform the vertex data from an object-based coordinate representation to a world-space coordinate space or a normalized device coordinate space.

A first instance of a primitive assembler 506 receives vertex attributes from the vertex processing unit 504. The primitive assembler 506 reads stored vertex attributes as needed and constructs graphics primitives for processing by the tessellation control processing unit 508. The graphics primitives include triangles, line segments, points, patches, and so forth, as supported by various graphics processing application programming interfaces (APIs).

The tessellation control processing unit 508 treats the input vertices as control points for a geometric patch. The control points are transformed from an input representation from the patch (e.g., the patch's bases) to a representation suitable for use in surface evaluation by the tessellation evaluation processing unit 512. The tessellation control processing unit 508 can also compute tessellation factors for the edges of geometric patches. A tessellation factor applies to a single edge and quantifies the view-dependent level of detail associated with that edge. The tessellation unit 510 is configured to receive the tessellation factors for the edges of a patch and to tessellate the patch into multiple geometric primitives (such as line, triangle, or quadrilateral primitives), which are transmitted to the tessellation evaluation processing unit 512. The tessellation evaluation processing unit 512 operates on the parameterized coordinates of the subdivided patch to generate a surface representation and vertex attributes for each vertex associated with the geometric primitives.

A second instance of a primitive assembler 514 receives vertex attributes from the tessellation evaluation processing unit 512 as needed, reads stored vertex attributes, and constructs graphics primitives for processing by the geometry processing unit 516. The geometry processing unit 516 is a programmable execution unit that executes geometry shader programs to transform the graphics primitives received from the primitive assembler 514 as specified by the geometry shader programs. The geometry processing unit 516 may be programmed to subdivide the graphics primitives into one or more new graphics primitives and to calculate parameters used to rasterize the new graphics primitives.

The geometry processing unit 516 may be able to add or delete elements in the geometry stream. The geometry processing unit 516 outputs parameters and vertices specifying new graphics primitives to the primitive assembler 518. The primitive assembler 518 receives the parameters and vertices from the geometry processing unit 516 and constructs graphics primitives for processing by a viewport scale, cull, and clip unit 520. The geometry processing unit 516 reads data stored in parallel processor memory or system memory for use in processing the geometry data. The viewport scale, cull, and clip unit 520 performs clipping, culling, and viewport scaling, and outputs the processed graphics primitives to the rasterizer 522.

The rasterizer 522 can perform depth culling and other depth-based optimizations. The rasterizer 522 also performs scan conversion on the new graphics primitives to generate fragments, and outputs those fragments and associated coverage data to the fragment/pixel processing unit 524. The fragment/pixel processing unit 524 is a programmable execution unit configured to execute fragment shader programs or pixel shader programs. The fragment/pixel processing unit 524 transforms fragments or pixels received from the rasterizer 522, as specified by the fragment or pixel shader programs. For example, the fragment/pixel processing unit 524 may be programmed to perform operations including, but not limited to, texture mapping, shading, blending, texture correction, and perspective correction to produce shaded fragments or pixels that are output to the raster operations unit 526. The fragment/pixel processing unit 524 can read data stored in either the parallel processor memory or the system memory for use when processing the fragment data. Fragment or pixel shader programs may be configured to shade at sample, pixel, tile, or other granularities depending on the sampling rate configured for the processing units.

The raster operations unit 526 is a processing unit that performs raster operations, including but not limited to stencil, z-test, blending, and the like, and outputs pixel data as processed graphics data to be stored in graphics memory (e.g., the parallel processor memory 222 of FIG. 2A and/or the system memory 104 of FIG. 1), to be displayed on the one or more display devices 110A-110B, or to be further processed by one of the one or more processors 102 or parallel processors 112. The raster operations unit 526 may be configured to compress z or color data that is written to memory and decompress z or color data that is read from memory.

Machine Learning Overview

The architecture described above can be applied to perform training and inference operations using machine learning models. Machine learning has been successful at solving many kinds of tasks. The computations that arise when training and using machine learning algorithms (e.g., neural networks) lend themselves naturally to efficient parallel implementations. Accordingly, parallel processors such as general-purpose graphics processing units (GPGPUs) have played a significant role in the practical implementation of deep neural networks. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In a SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. The efficiency provided by parallel machine learning algorithm implementations allows the use of high-capacity networks and enables those networks to be trained on larger datasets.

A machine learning algorithm is an algorithm that can learn based on a set of data. For example, machine learning algorithms can be designed to model high-level abstractions within a data set. For example, image recognition algorithms can be used to determine which of several categories a given input belongs to; regression algorithms can output a numerical value given an input; and pattern recognition algorithms can be used to generate translated text or to perform text-to-speech and/or speech recognition.

An exemplary type of machine learning algorithm is a neural network. There are many types of neural networks; a simple type of neural network is a feedforward network. A feedforward network may be implemented as an acyclic graph in which the nodes are arranged in layers. Typically, a feedforward network topology includes an input layer and an output layer separated by at least one hidden layer. The hidden layer transforms input received by the input layer into a representation that is useful for generating output in the output layer. The network nodes are fully connected via edges to the nodes in adjacent layers, but there are no edges between nodes within each layer. Data received at the nodes of the input layer of a feedforward network is propagated (i.e., "fed forward") to the nodes of the output layer via an activation function that calculates the states of the nodes of each successive layer in the network based on coefficients ("weights") respectively associated with each of the edges connecting the layers. Depending on the specific model being represented by the algorithm being executed, the output from the neural network algorithm can take various forms.
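
The feed-forward propagation just described can be summarized in a short sketch: each node of a layer applies an activation function to the weighted sum of the previous layer's node states. The sigmoid activation and all names below are illustrative choices, not taken from this document.

```c
#include <math.h>
#include <stddef.h>

static double sigmoid(double x) { return 1.0 / (1.0 + exp(-x)); }

/* One fully connected layer of a feedforward network: each output node j
 * computes the weighted sum of all input nodes plus a bias, then applies
 * the activation function. */
static void dense_forward(const double *in, size_t n_in,
                          const double *weights,  /* n_out x n_in, row-major */
                          const double *bias,     /* n_out entries */
                          double *out, size_t n_out)
{
    for (size_t j = 0; j < n_out; ++j) {
        double sum = bias[j];
        for (size_t i = 0; i < n_in; ++i)
            sum += weights[j * n_in + i] * in[i];
        out[j] = sigmoid(sum);  /* state of node j in the next layer */
    }
}
```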

Before a machine learning algorithm can be used to model a particular problem, the algorithm is trained using a training data set. Training a neural network involves selecting a network topology, using a set of training data representing the problem being modeled by the network, and adjusting the weights until the network model performs with minimal error for all instances of the training data set. For example, during a supervised learning training process for a neural network, the output produced by the network in response to an input representing an instance in the training data set is compared to the "correct" labeled output for that instance, an error signal representing the difference between the output and the labeled output is calculated, and the weights associated with the connections are adjusted to minimize that error as the error signal is propagated backward through the layers of the network. The network is considered "trained" when the errors for each of the outputs generated from the instances of the training data set are minimized.

The accuracy of a machine learning algorithm can be affected significantly by the quality of the data set used to train the algorithm. The training process can be computationally intensive and may require a significant amount of time on a conventional general-purpose processor. Accordingly, parallel processing hardware is used to train many types of machine learning algorithms. This is particularly useful for optimizing the training of neural networks, as the computations performed in adjusting the coefficients in a neural network lend themselves naturally to parallel implementations. Specifically, many machine learning algorithms and software applications have been adapted to make use of the parallel processing hardware within general-purpose graphics processing devices.

FIG. 6 is a generalized diagram of a machine learning software stack 600. A machine learning application 602 is any logic that can be configured to train a neural network using a training dataset or to implement machine intelligence using a trained deep neural network. The machine learning application 602 can include training and inference functionality for a neural network and/or specialized software that can be used to train a neural network before deployment. The machine learning application 602 can implement any type of machine intelligence, including but not limited to image recognition, mapping and localization, autonomous navigation, speech synthesis, medical imaging, or language translation. Example machine learning applications 602 include, but are not limited to, voice-based virtual assistants, image or facial recognition algorithms, autonomous navigation, and the software tools used to train the machine learning models used by the machine learning applications 602.

Hardware acceleration for the machine learning application 602 can be enabled via a machine learning framework 604. The machine learning framework 604 can provide a library of machine learning primitives. Machine learning primitives are basic operations commonly performed by machine learning algorithms. Without the machine learning framework 604, developers of machine learning algorithms would be required to create and optimize the main computational logic associated with the machine learning algorithm, and then re-optimize that computational logic as new parallel processors are developed. Instead, the machine learning application can be configured to perform the necessary computations using the primitives provided by the machine learning framework 604. Exemplary primitives include tensor convolutions, activation functions, and pooling, which are computational operations performed while training a convolutional neural network (CNN). The machine learning framework 604 can also provide primitives to implement basic linear algebra subprograms performed by many machine learning algorithms, such as matrix and vector operations. Examples of the machine learning framework 604 include, but are not limited to, TensorFlow, TensorRT, PyTorch, MXNet, Caffe, and other high-level machine learning frameworks.

The machine learning framework 604 can process input data received from the machine learning application 602 and generate the appropriate input to a compute framework 606. The compute framework 606 can abstract the underlying instructions provided to the GPGPU driver 608, enabling the machine learning framework 604 to take advantage of hardware acceleration via the GPGPU hardware 610 without requiring the machine learning framework 604 to have intimate knowledge of the architecture of the GPGPU hardware 610. Additionally, the compute framework 606 can enable hardware acceleration for the machine learning framework 604 across a variety of types and generations of the GPGPU hardware 610. Exemplary compute frameworks 606 include the CUDA compute framework and associated machine learning libraries, such as the CUDA Deep Neural Network (cuDNN) library. The machine learning software stack 600 can also include communication libraries or frameworks to facilitate multi-GPU and multi-node compute.

GPGPU Machine Learning Acceleration

FIG. 7 illustrates a general-purpose graphics processing unit 700, which may be the parallel processor 200 of FIG. 2A or the parallel processor(s) 112 of FIG. 1. The general-purpose processing unit (GPGPU) 700 may be configured to provide support for hardware acceleration of primitives provided by a machine learning framework, to accelerate the processing of the type of computational workloads associated with training deep neural networks. Additionally, the GPGPU 700 can be linked directly to other instances of the GPGPU to create a multi-GPU cluster, improving training speed for particularly deep neural networks. Primitives are also supported to accelerate inference operations for deployed neural networks.

The GPGPU 700 includes a host interface 702 to enable a connection with a host processor. The host interface 702 may be a PCI Express interface. However, the host interface can also be a vendor-specific communications interface or communications fabric. The GPGPU 700 receives commands from the host processor and uses a global scheduler 704 to distribute execution threads associated with those commands to a set of processing clusters 706A-706H. The processing clusters 706A-706H share a cache memory 708. The cache memory 708 can serve as a higher-level cache for the cache memories within the processing clusters 706A-706H. The illustrated processing clusters 706A-706H may correspond with the processing clusters 214A-214N of FIG. 2A.

The GPGPU 700 includes memory 714A-714B coupled with the processing clusters 706A-706H via a set of memory controllers 712A-712B. The memory 714A-714B can include various types of memory devices, including dynamic random-access memory (DRAM) or graphics random-access memory, such as synchronous graphics random-access memory (SGRAM), including graphics double data rate (GDDR) memory. The memory 714A-714B may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM).

Each of the processing clusters 706A-706H may include a set of graphics multiprocessors, such as the graphics multiprocessor 234 of FIG. 2D, the graphics multiprocessor 325 of FIG. 3A, or the graphics multiprocessor 350 of FIG. 3B, or may include a multi-core group 365A-365N as in FIG. 3C. The graphics multiprocessors of the compute clusters include multiple types of integer and floating-point logic units that can perform computational operations at a range of precisions, including those suited for machine learning computations. For example, at least a subset of the floating-point units in each of the processing clusters 706A-706H can be configured to perform 16-bit or 32-bit floating-point operations, while a different subset of the floating-point units can be configured to perform 64-bit floating-point operations.

Multiple instances of the GPGPU 700 can be configured to operate as a compute cluster. The communication mechanism used by the compute cluster for synchronization and data exchange varies across embodiments. For example, the multiple instances of the GPGPU 700 communicate over the host interface 702. In one embodiment, the GPGPU 700 includes an I/O hub 709 that couples the GPGPU 700 with a GPU link 710 that enables a direct connection to other instances of the GPGPU. The GPU link 710 may be coupled to a dedicated GPU-to-GPU bridge that enables communication and synchronization between multiple instances of the GPGPU 700. Optionally, the GPU link 710 couples with a high-speed interconnect to transmit data to and receive data from other GPGPUs or parallel processors. The multiple instances of the GPGPU 700 may be located in separate data processing systems and can communicate via a network device that is accessible via the host interface 702. The GPU link 710 may be configured to enable a connection to a host processor in addition to, or as an alternative to, the host interface 702.

While the illustrated configuration of the GPGPU 700 can be configured to train neural networks, an alternate configuration of the GPGPU 700 can be configured for deployment within a high-performance or low-power inferencing platform. In an inferencing configuration, the GPGPU 700 includes fewer of the processing clusters 706A-706H relative to the training configuration. Additionally, the memory technology associated with the memory 714A-714B may differ between inferencing and training configurations. In one embodiment, the inferencing configuration of the GPGPU 700 can support inferencing-specific instructions. For example, an inferencing configuration can provide support for one or more 8-bit integer or floating-point dot product instructions, which are commonly used during inferencing operations for deployed neural networks.

FIG. 8 illustrates a multi-GPU computing system 800. The multi-GPU computing system 800 can include a processor 802 coupled to multiple GPGPUs 806A-806D via a host interface switch 804. The host interface switch 804 may be a PCI Express switch device that couples the processor 802 to a PCI Express bus over which the processor 802 can communicate with the set of GPGPUs 806A-806D. Each of the multiple GPGPUs 806A-806D can be an instance of the GPGPU 700 of FIG. 7. The GPGPUs 806A-806D can interconnect via a set of high-speed point-to-point GPU-to-GPU links 816. The high-speed GPU-to-GPU links can connect to each of the GPGPUs 806A-806D via a dedicated GPU link, such as the GPU link 710 of FIG. 7. The P2P GPU links 816 enable direct communication between each of the GPGPUs 806A-806D without requiring communication over the host interface bus to which the processor 802 is connected. With GPU-to-GPU traffic directed to the P2P GPU links, the host interface bus remains available for system memory access or to communicate with other instances of the multi-GPU computing system 800, for example, via one or more network devices. While in FIG. 8 the GPGPUs 806A-806D connect to the processor 802 via the host interface switch 804, the processor 802 may alternatively include direct support for the P2P GPU links 816 and connect directly to the GPGPUs 806A-806D. In one embodiment, the P2P GPU links 816 enable the multi-GPU computing system 800 to operate as a single logical GPU.

Machine Learning Neural Network Implementations

The computing architecture described herein can be configured to perform the types of parallel processing that are particularly suited for training and deploying neural networks for machine learning. A neural network can be generalized as a network of functions having a graph relationship. As is known in the art, there are a variety of types of neural network implementations used in machine learning. One exemplary type of neural network is the feedforward network, as previously described.

A second exemplary type of neural network is the convolutional neural network (CNN). A CNN is a specialized feedforward neural network for processing data having a known, grid-like topology, such as image data. Accordingly, CNNs are commonly used for computer vision and image recognition applications, but they may also be used for other types of pattern recognition, such as speech and language processing. The nodes in the CNN input layer are organized into a set of "filters" (feature detectors inspired by the receptive fields found in the retina), and the output of each set of filters is propagated to nodes in successive layers of the network. The computations for a CNN include applying the convolution mathematical operation to each filter to produce the output of that filter. Convolution is a specialized kind of mathematical operation performed by two functions to produce a third function that is a modified version of one of the two original functions. In convolutional network terminology, the first function to the convolution can be referred to as the input, while the second function can be referred to as the convolution kernel. The output may be referred to as the feature map. For example, the input to a convolution layer can be a multidimensional array of data that defines the various color components of an input image. The convolution kernel can be a multidimensional array of parameters, where the parameters are adapted by the training process for the neural network.
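
In the common discrete 2D formulation (a standard expression, not quoted from this document), the feature map $S$ produced by convolving an input $I$ with a $K_h \times K_w$ kernel $K$ can be written, in the cross-correlation form used by most CNN implementations, as:

```latex
S(i, j) = \sum_{m=0}^{K_h - 1} \sum_{n=0}^{K_w - 1} I(i + m,\, j + n)\, K(m, n)
```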

A recurrent neural network (RNN) is a family of feedforward neural networks that include feedback connections between layers. RNNs enable modeling of sequential data by sharing parameter data across different parts of the neural network. The architecture for an RNN includes cycles. The cycles represent the influence of the present value of a variable on its own value at a future time, as at least a portion of the output data from the RNN is used as feedback for processing subsequent input in a sequence. This feature makes RNNs particularly useful for language processing due to the variable nature in which language data can be composed.

The figures described below present exemplary feedforward, CNN, and RNN networks, and describe a general process for respectively training and deploying each of those types of networks. It will be understood that these descriptions are exemplary and non-limiting as to any specific embodiment described herein, and the concepts illustrated can be applied generally to deep neural networks and machine learning techniques in general.

The exemplary neural networks described above can be used to perform deep learning. Deep learning is machine learning using deep neural networks. The deep neural networks used in deep learning are artificial neural networks composed of multiple hidden layers, as opposed to shallow neural networks that include only a single hidden layer. Deeper neural networks are generally more computationally intensive to train. However, the additional hidden layers of the network enable multistep pattern recognition that results in reduced output error relative to shallow machine learning techniques.

Deep neural networks used in deep learning typically include a front-end network to perform feature recognition, coupled to a back-end network that represents a mathematical model that can perform operations (e.g., object classification, speech recognition, etc.) based on the feature representation provided to the model. Deep learning enables machine learning to be performed without requiring hand-crafted feature engineering for the model. Instead, deep neural networks can learn features based on statistical structure or correlation within the input data. The learned features can be provided to a mathematical model that can map the detected features to an output. The mathematical model used by the network is generally specialized for the specific task to be performed, and different models will be used to perform different tasks.

Once the neural network is structured, a learning model can be applied to the network to train the network to perform specific tasks. The learning model describes how to adjust the weights within the model to reduce the output error of the network. Backpropagation of errors is a common method used to train neural networks. An input vector is presented to the network for processing. The output of the network is compared to the desired output using a loss function, and an error value is calculated for each of the neurons in the output layer. The error values are then propagated backward until each neuron has an associated error value that roughly represents its contribution to the original output. The network can then learn from those errors using an algorithm, such as the stochastic gradient descent algorithm, to update the weights of the neural network.
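
A minimal sketch of the stochastic-gradient-descent weight update for an output layer is given below. It assumes a squared-error loss and sigmoid outputs, which are illustrative choices; the document does not fix a particular loss function or activation.

```c
#include <stddef.h>

/* Per-sample SGD update for an output layer: compare the network output with
 * the desired output, form a per-neuron error term (delta), then adjust each
 * weight against its contribution to that error. The delta below is the
 * gradient of a squared-error loss through a sigmoid output. */
static void sgd_update_output_layer(
    const double *in, size_t n_in,     /* activations feeding this layer */
    const double *out, size_t n_out,   /* sigmoid outputs of this layer */
    const double *target,              /* desired ("labeled") outputs */
    double *weights,                   /* n_out x n_in, row-major */
    double *bias,                      /* n_out entries */
    double lr)                         /* learning rate */
{
    for (size_t j = 0; j < n_out; ++j) {
        /* delta_j = (out - target) * sigmoid'(net) = (out - target) * out * (1 - out) */
        double delta = (out[j] - target[j]) * out[j] * (1.0 - out[j]);
        for (size_t i = 0; i < n_in; ++i)
            weights[j * n_in + i] -= lr * delta * in[i];
        bias[j] -= lr * delta;
    }
}
```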

FIGS. 9A-9B illustrate an exemplary convolutional neural network. FIG. 9A illustrates the various layers within a CNN. As shown in FIG. 9A, an exemplary CNN used to model image processing can receive input 902 describing the red, green, and blue (RGB) components of an input image. The input 902 can be processed by multiple convolutional layers (e.g., convolutional layer 904, convolutional layer 906). The output from the multiple convolutional layers may optionally be processed by a set of fully connected layers 908. Neurons in a fully connected layer have full connections to all activations in the previous layer, as previously described for a feedforward network. The output from the fully connected layers 908 can be used to generate an output result from the network. The activations within the fully connected layers 908 can be computed using matrix multiplication instead of convolution. Not all CNN implementations make use of fully connected layers 908. For example, in some implementations, the convolutional layer 906 can generate the output for the CNN.

The convolutional layers are sparsely connected, which differs from the traditional neural network configuration found in the fully connected layers 908. Traditional neural network layers are fully connected, such that every output unit interacts with every input unit. However, as illustrated, the convolutional layers are sparsely connected because the output of the convolution of a receptive field (rather than the respective state value of each of the nodes in the receptive field) is input to the nodes of the subsequent layer. The kernels associated with the convolutional layers perform convolution operations, the output of which is sent to the next layer. The dimensionality reduction performed within the convolutional layers is one aspect that enables the CNN to scale to process large images.

FIG. 9B illustrates exemplary computation stages within a convolutional layer of a CNN. Input 912 to a convolutional layer of a CNN can be processed in three stages of a convolutional layer 914. The three stages can include a convolution stage 916, a detector stage 918, and a pooling stage 920. The convolutional layer 914 can then output data to a successive convolutional layer. The final convolutional layer of the network can generate output feature map data or provide input to a fully connected layer, for example, to generate a classification value for the input to the CNN.

In the convolution stage 916, several convolutions are performed in parallel to produce a set of linear activations. The convolution stage 916 can include an affine transformation, which is any transformation that can be specified as a linear transformation plus a translation. Affine transformations include rotations, translations, scaling, and combinations of these transformations. The convolution stage computes the output of functions (e.g., neurons) that are connected to specific regions in the input, which can be determined as the local regions associated with the neurons. The neurons compute a dot product between their weights and the region in the local input to which they are connected. The output from the convolution stage 916 defines a set of linear activations that are processed by successive stages of the convolutional layer 914.
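
The dot-product computation described above can be sketched as follows, assuming a single 2D input channel and a single 3x3 kernel (both hypothetical); each output value is the dot product between the kernel weights and the local input region to which that "neuron" is connected:

import numpy as np

def conv2d_valid(image, kernel):
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            region = image[i:i + kh, j:j + kw]      # local receptive field
            out[i, j] = np.sum(region * kernel)     # dot product with the weights
    return out

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
activations = conv2d_valid(image, kernel)           # set of linear activations (6x6)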

The linear activations can be processed by a detector stage 918. In the detector stage 918, each linear activation is processed by a non-linear activation function. The non-linear activation function increases the non-linear properties of the overall network without affecting the receptive fields of the convolutional layer. Several types of non-linear activation functions may be used. One particular type is the rectified linear unit (ReLU), which uses an activation function defined as f(x) = max(0, x), such that the activation is thresholded at zero.

The pooling stage 920 uses a pooling function that replaces the output of the convolutional layer 906 with a summary statistic of the nearby outputs. The pooling function can be used to introduce translation invariance into the neural network, such that small translations of the input do not change the pooled output. Invariance to local translation can be useful in scenarios where the presence of a feature in the input data is more important than the precise location of the feature. Various types of pooling functions can be used during the pooling stage 920, including max pooling, average pooling, and l2-norm pooling. Additionally, some CNN implementations do not include a pooling stage. Instead, such implementations substitute an additional convolution stage having an increased stride relative to previous convolution stages.
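
Continuing the sketch, the detector and pooling stages can be illustrated as ReLU followed by 2x2 max pooling over the linear activations; the array sizes are illustrative assumptions:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)                # f(x) = max(0, x), thresholded at zero

def max_pool2d(x, size=2):
    h, w = x.shape[0] // size, x.shape[1] // size
    x = x[:h * size, :w * size].reshape(h, size, w, size)
    return x.max(axis=(1, 3))                # summary statistic of nearby outputs

activations = np.random.randn(6, 6)          # linear activations from the convolution stage
detected = relu(activations)                 # detector stage
pooled = max_pool2d(detected, size=2)        # pooling stage -> (3, 3)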

The output from the convolutional layer 914 can then be processed by the next layer 922. The next layer 922 can be an additional convolutional layer or one of the fully connected layers 908. For example, the first convolutional layer 904 of FIG. 9A can output to the second convolutional layer 906, while the second convolutional layer can output to a first layer of the fully connected layers 908.

FIG. 10 illustrates an exemplary recurrent neural network 1000. In a recurrent neural network (RNN), the previous state of the network influences the output of the current state of the network. RNNs can be built in a variety of ways using a variety of functions. The use of RNNs generally revolves around using mathematical models to predict the future based on a prior sequence of inputs. For example, an RNN may be used to perform statistical language modeling to predict an upcoming word given a previous sequence of words. The illustrated RNN 1000 can be described as having an input layer 1002 that receives an input vector, hidden layers 1004 to implement a recurrent function, a feedback mechanism 1005 to enable a 'memory' of previous states, and an output layer 1006 to output a result. The RNN 1000 operates based on time steps. The state of the RNN at a given time step is influenced based on the previous time step via the feedback mechanism 1005. For a given time step, the state of the hidden layers 1004 is defined by the previous state and the input at the current time step. An initial input (x1) at a first time step can be processed by the hidden layer 1004. A second input (x2) can be processed by the hidden layer 1004 using state information determined during the processing of the initial input (x1). A given state can be computed as s_t = f(Ux_t + Ws_(t-1)), where U and W are parameter matrices. The function f is generally a nonlinearity, such as the hyperbolic tangent function (tanh) or a variant of the rectifier function f(x) = max(0, x). However, the specific mathematical function used in the hidden layers 1004 can vary depending on the specific implementation details of the RNN 1000.
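
The recurrence s_t = f(Ux_t + Ws_(t-1)) can be sketched as follows, assuming tanh for the function f and hypothetical dimensions; the inputs and sizes are not taken from FIG. 10:

import numpy as np

state_dim, input_dim = 8, 4
rng = np.random.default_rng(0)
U = rng.normal(0, 0.1, (state_dim, input_dim))   # input parameter matrix
W = rng.normal(0, 0.1, (state_dim, state_dim))   # recurrent parameter matrix

def rnn_step(x_t, s_prev):
    return np.tanh(U @ x_t + W @ s_prev)         # s_t = f(U x_t + W s_(t-1))

s = np.zeros(state_dim)                          # initial state
for x_t in rng.normal(size=(5, input_dim)):      # sequence of 5 input vectors
    s = rnn_step(x_t, s)                         # feedback: previous state carries forward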

In addition to the basic CNN and RNN networks described, acceleration for variants of those networks can also be enabled. One example RNN variant is the long short-term memory (LSTM) RNN. LSTM RNNs are capable of learning long-term dependencies that may be necessary for processing longer sequences of language. A variant of the CNN is a convolutional deep belief network, which has a structure similar to a CNN and is trained in a manner similar to a deep belief network. A deep belief network (DBN) is a generative neural network composed of multiple layers of stochastic (random) variables. DBNs can be trained layer by layer using greedy unsupervised learning. The learned weights of the DBN can then be used to provide a pre-trained neural network by determining an optimal initial set of weights for the neural network. In further embodiments, acceleration for reinforcement learning is enabled. In reinforcement learning, an artificial agent learns by interacting with its environment. The agent is configured to optimize certain objectives to maximize cumulative reward.

FIG. 11 illustrates the training and deployment of a deep neural network. Once a given network has been structured for a task, the neural network is trained using a training dataset 1102. Various training frameworks 1104 have been developed to enable hardware acceleration of the training process. For example, the machine learning framework 604 of FIG. 6 may be configured as a training framework 1104. The training framework 1104 can hook into an untrained neural network 1106 and enable the untrained neural network to be trained using the parallel processing resources described herein to generate a trained neural network 1108.

To start the training process, initial weights may be chosen randomly or by pre-training using a deep belief network. The training cycle is then performed in either a supervised or unsupervised manner.

Supervised learning is a learning method in which training is performed as a mediated operation, such as when the training dataset 1102 includes inputs paired with the desired outputs for those inputs, or where the training dataset includes inputs having known outputs and the outputs of the neural network are manually graded. The network processes the inputs and compares the resulting outputs against a set of expected or desired outputs. Errors are then propagated back through the system. The training framework 1104 can make adjustments to the weights that control the untrained neural network 1106. The training framework 1104 can provide tools to monitor how well the untrained neural network 1106 is converging toward a model suitable for generating correct answers based on known input data. The training process occurs repeatedly as the weights of the network are adjusted to refine the output generated by the neural network. The training process can continue until the neural network reaches a statistically desired accuracy associated with a trained neural network 1108. The trained neural network 1108 can then be deployed to implement any number of machine learning operations to generate inference results 1114 based on the input of new data 1112.
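
A compact, hedged skeleton of such a mediated training loop, using a toy perceptron as a stand-in for the untrained neural network 1106 and an accuracy threshold as the stopping criterion (the model, data, and threshold are illustrative assumptions, not the framework's API):

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))                  # input data
y = (X[:, 0] + X[:, 1] > 0).astype(float)     # known (desired) outputs
w = np.zeros(2)

def accuracy():
    return np.mean(((X @ w) > 0) == y)

epoch = 0
while accuracy() < 0.95 and epoch < 100:       # train until a statistically desired accuracy
    for xi, yi in zip(X, y):
        pred = float((xi @ w) > 0)
        w += 0.1 * (yi - pred) * xi            # adjust weights based on the error
    epoch += 1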

Unsupervised learning is a learning method in which the network attempts to train itself using unlabeled data. Thus, for unsupervised learning, the training dataset 1102 will include input data without any associated output data. The untrained neural network 1106 can learn groupings within the unlabeled input and can determine how individual inputs are related to the overall dataset. Unsupervised training can be used to generate a self-organizing map, which is a type of trained neural network 1108 capable of performing operations useful in reducing the dimensionality of data. Unsupervised training can also be used to perform anomaly detection, which allows the identification of data points in an input dataset that deviate from the normal patterns of the data.

Variations on supervised and unsupervised training may also be employed. Semi-supervised learning is a technique in which the training dataset 1102 includes a mix of labeled and unlabeled data of the same distribution. Incremental learning is a variant of supervised learning in which input data is continuously used to further train the model. Incremental learning enables the trained neural network 1108 to adapt to the new data 1112 without forgetting the knowledge instilled within the network during the initial training.

Whether supervised or unsupervised, the training process for particularly deep neural networks may be too computationally intensive for a single compute node. Instead of using a single compute node, a distributed network of computational nodes can be used to accelerate the training process.

FIG. 12A is a block diagram illustrating distributed learning. Distributed learning is a training model that uses multiple distributed computing nodes to perform supervised or unsupervised training of a neural network. The distributed computational nodes can each include one or more host processors and one or more of the general-purpose processing nodes, such as the highly parallel general-purpose graphics processing unit 700 of FIG. 7. As illustrated, distributed learning can be performed with model parallelism 1202, data parallelism 1204, or a combination of model and data parallelism 1206.

In model parallelism 1202, different computational nodes in a distributed system can perform training computations for different parts of a single network. For example, each layer of a neural network can be trained by a different processing node of the distributed system. The benefits of model parallelism include the ability to scale to particularly large models. Splitting the computations associated with different layers of the neural network enables the training of very large neural networks in which the weights of all layers would not fit into the memory of a single node. In some instances, model parallelism can be particularly useful in performing unsupervised training of large neural networks.
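
A toy sketch of model parallelism, simulating two processing nodes in-process with each "node" holding the weights of one layer (all shapes hypothetical; a real system would place these weights in separate nodes' memory and transfer the intermediate activations over an interconnect):

import numpy as np

rng = np.random.default_rng(2)
W_node0 = rng.normal(0, 0.1, (16, 32))   # layer-1 weights, owned by node 0
W_node1 = rng.normal(0, 0.1, (32, 4))    # layer-2 weights, owned by node 1

def node0_forward(x):
    return np.tanh(x @ W_node0)          # computed on node 0

def node1_forward(h):
    return h @ W_node1                   # activations "transferred" to node 1

x = rng.normal(size=(8, 16))
out = node1_forward(node0_forward(x))    # (8, 4)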

In data parallelism 1204, the different nodes of the distributed network have a complete instance of the model, and each node receives a different portion of the data. The results from the different nodes are then combined. While different approaches to data parallelism are possible, all data parallel training approaches require a technique for combining results and synchronizing the model parameters between the nodes. Exemplary approaches to combining data include parameter averaging and update-based data parallelism. Parameter averaging trains each node on a subset of the training data and sets the global parameters (e.g., weights, biases) to the average of the parameters from each node. Parameter averaging uses a central parameter server that maintains the parameter data. Update-based data parallelism is similar to parameter averaging, except that instead of transferring parameters from the nodes to the parameter server, the updates to the model are transferred. Additionally, update-based data parallelism can be performed in a decentralized manner, in which the updates are compressed and transferred between nodes.
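
Parameter averaging can be sketched as follows, simulating each node in-process as one SGD step of a toy linear model over its shard of the training data; the central parameter server is reduced here to a mean over the returned parameters (all names and shapes are hypothetical):

import numpy as np

def local_train(params, shard, lr=0.1):
    # One toy SGD step of a linear model y = X @ w on this node's data shard.
    X, y = shard
    grad = X.T @ (X @ params - y) / len(y)
    return params - lr * grad

rng = np.random.default_rng(0)
global_params = np.zeros(4)
shards = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(4)]

for _ in range(10):  # each round mimics a parameter-server synchronization
    node_params = [local_train(global_params.copy(), s) for s in shards]
    global_params = np.mean(node_params, axis=0)   # average of per-node parameters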

Combined model and data parallelism 1206 can be implemented, for example, in a distributed system in which each computational node includes multiple GPUs. Each node can have a complete instance of the model, with separate GPUs within each node used to train different portions of the model.

Distributed training has increased overhead relative to training on a single machine. However, the parallel processors and GPGPUs described herein can each implement various techniques to reduce the overhead of distributed training, including techniques to enable high-bandwidth GPU-to-GPU data transfer and accelerated remote data synchronization.

FIG. 12B is a block diagram illustrating a programmable network interface 1210 and data processing unit. The programmable network interface 1210 is a programmable network engine that can be used to accelerate network-based compute tasks within a distributed environment. The programmable network interface 1210 can couple with a host system via a host interface 1270. The programmable network interface 1210 can be used to accelerate network or storage operations for CPUs or GPUs of the host system. The host system can be, for example, a node of a distributed learning system used to perform distributed training, for example, as shown in FIG. 12A. The host system can also be a data center node within a data center.

In one embodiment, access to remote storage containing model data can be accelerated by the programmable network interface 1210. For example, the programmable network interface 1210 can be configured to present remote storage devices as local storage devices to the host system. The programmable network interface 1210 can also accelerate remote direct memory access (RDMA) operations performed between GPUs of the host system and GPUs of remote systems. In one embodiment, the programmable network interface 1210 can enable storage functionality such as, but not limited to, NVMe-oF. The programmable network interface 1210 can also accelerate encryption, data integrity, compression, and other operations for remote storage on behalf of the host system, allowing remote storage to approach the latencies of storage devices directly attached to the host system.

The programmable network interface 1210 can also perform resource allocation and management on behalf of the host system. Storage security operations can be offloaded to the programmable network interface 1210 and performed in concert with the allocation and management of remote storage resources. Network-based operations to manage access to the remote storage that would otherwise be performed by a processor of the host system can instead be performed by the programmable network interface 1210.

In one embodiment, network and/or data security operations can be offloaded from the host system to the programmable network interface 1210. Data center security policies for a data center node can be handled by the programmable network interface 1210 instead of the processors of the host system. For example, the programmable network interface 1210 can detect and mitigate an attempted network-based attack (e.g., DDoS) on the host system, preventing the attack from compromising the availability of the host system.

The programmable network interface 1210 can include a system on a chip (SoC 1220) that executes an operating system via multiple processor cores 1222. The processor cores 1222 can include general-purpose processor (e.g., CPU) cores. In one embodiment, the processor cores 1222 can also include one or more GPU cores. The SoC 1220 can execute instructions stored in a memory device 1240. A storage device 1250 can store local operating system data. The storage device 1250 and the memory device 1240 can also be used to cache remote data for the host system. Network ports 1260A-1260B enable a connection to a network or fabric and facilitate network access for the SoC 1220 and, via the host interface 1270, for the host system. The programmable network interface 1210 can also include an I/O interface 1275, such as a USB interface. The I/O interface 1275 can be used to couple external devices to the programmable network interface 1210 or as a debug interface. The programmable network interface 1210 also includes a management interface 1230 that enables software on the host device to manage and configure the programmable network interface 1210 and/or the SoC 1220. In one embodiment, the programmable network interface 1210 may also include one or more accelerators or GPUs 1245 to accept offload of parallel compute tasks from the SoC 1220, the host system, or remote systems coupled via the network ports 1260A-1260B.

Exemplary Machine Learning Applications

Machine learning can be applied to solve a variety of technological problems, including but not limited to computer vision, autonomous driving and navigation, speech recognition, and language processing. Computer vision has traditionally been one of the most active research areas for machine learning applications. Applications of computer vision range from reproducing human visual abilities, such as recognizing faces, to creating new categories of visual abilities. For example, computer vision applications can be configured to recognize sound waves from the vibrations induced in objects visible in a video. Parallel processor accelerated machine learning enables computer vision applications to be trained using significantly larger training datasets than previously feasible and enables inferencing systems to be deployed using low-power parallel processors.

Parallel processor accelerated machine learning has autonomous driving applications, including lane and road sign recognition, obstacle avoidance, navigation, and driving control. Accelerated machine learning techniques can be used to train driving models based on datasets that define the appropriate responses to specific training inputs. The parallel processors described herein enable rapid training of the increasingly complex neural networks used for autonomous driving solutions and enable the deployment of low-power inferencing processors in a mobile platform suitable for integration into autonomous vehicles.

Parallel processor accelerated deep neural networks have enabled machine learning approaches to automatic speech recognition (ASR). ASR includes the creation of a function that computes the most probable linguistic sequence given an input acoustic sequence. Accelerated machine learning using deep neural networks has enabled the replacement of the hidden Markov models (HMMs) and Gaussian mixture models (GMMs) previously used for ASR.

Parallel processor accelerated machine learning can also be used to accelerate natural language processing. Automatic learning procedures can make use of statistical inference algorithms to produce models that are robust to erroneous or unfamiliar input. Exemplary natural language processor applications include automatic machine translation between human languages.

The parallel processing platforms used for machine learning can be divided into training platforms and deployment platforms. Training platforms are generally highly parallel and include optimizations to accelerate multi-GPU single-node training and multi-node, multi-GPU training. Exemplary parallel processors suited for training include the general-purpose graphics processing unit 700 of FIG. 7 and the multi-GPU computing system 800 of FIG. 8. In contrast, deployed machine learning platforms generally include lower-power parallel processors suitable for use in products such as cameras, autonomous robots, and autonomous vehicles.

Additionally, machine learning techniques can be applied to accelerate or enhance graphics processing activities. For example, a machine learning model can be trained to recognize output generated by a GPU accelerated application and generate an upscaled version of that output. Such techniques can be applied to accelerate the generation of high-resolution images for a gaming application. Various other graphics pipeline activities can benefit from the use of machine learning. For example, machine learning models can be trained to perform tessellation operations on geometry data to increase the complexity of geometric models, allowing fine-detailed geometry to be automatically generated from geometry of relatively lower detail.

FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) 1300 suitable for performing inferencing using a trained model. The SOC 1300 can integrate processing components including a media processor 1302, a vision processor 1304, a GPGPU 1306, and a multi-core processor 1308. The GPGPU 1306 can be a GPGPU as described herein, such as the GPGPU 700, and the multi-core processor 1308 can be a multi-core processor as described herein, such as the multi-core processors 405-406. The SOC 1300 can additionally include on-chip memory 1305 that can enable a shared on-chip data pool that is accessible by each of the processing components. The processing components can be optimized for low-power operation to enable deployment to a variety of machine learning platforms, including autonomous vehicles and autonomous robots. For example, one implementation of the SOC 1300 can be used as a portion of the main control system for an autonomous vehicle. Where the SOC 1300 is configured for use in autonomous vehicles, the SOC is designed and configured for compliance with the relevant functional safety standards of the deployment jurisdiction.

During operation, the media processor 1302 and the vision processor 1304 can work in concert to accelerate computer vision operations. The media processor 1302 can enable low-latency decoding of multiple high-resolution (e.g., 4K, 8K) video streams. The decoded video streams can be written to a buffer in the on-chip memory 1305. The vision processor 1304 can then parse the decoded video and perform preliminary processing operations on the frames of the decoded video in preparation for processing the frames using a trained image recognition model. For example, the vision processor 1304 can accelerate convolution operations for a CNN used to perform image recognition on the high-resolution video data, while back-end model computations are performed by the GPGPU 1306.

The multi-core processor 1308 can include control logic to assist with the sequencing and synchronization of data transfers and shared memory operations performed by the media processor 1302 and the vision processor 1304. The multi-core processor 1308 can also function as an application processor to execute software applications that can make use of the inferencing compute capability of the GPGPU 1306. For example, at least a portion of the navigation and driving logic can be implemented in software executing on the multi-core processor 1308. Such software can issue computational workloads directly to the GPGPU 1306, or the computational workloads can be issued to the multi-core processor 1308, which can offload at least a portion of those operations to the GPGPU 1306.

The GPGPU 1306 can include compute clusters such as a low-power configuration of the processing clusters 706A-706H within the general-purpose graphics processing unit 700. The compute clusters within the GPGPU 1306 can support instructions that are specifically optimized to perform inferencing computations on a trained neural network. For example, the GPGPU 1306 can support instructions to perform low-precision computations such as 8-bit and 4-bit integer vector operations.
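
As a hedged illustration of such low-precision inferencing arithmetic, the following sketch quantizes weights and activations to 8-bit integers, accumulates the dot product in wider integers, and rescales the result; the symmetric quantization scheme is an assumption for illustration, not the definition of any particular instruction set:

import numpy as np

def quantize(x):
    # Symmetric per-tensor quantization to int8 (assumes x is not all zeros).
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(3)
w = rng.normal(size=256)                   # trained weights (float)
a = rng.normal(size=256)                   # activations (float)
wq, w_scale = quantize(w)
aq, a_scale = quantize(a)

# 8-bit integer vector operation with wide accumulation, then rescale.
int_dot = np.dot(wq.astype(np.int32), aq.astype(np.int32))
approx = int_dot * w_scale * a_scale
print(float(np.dot(w, a)), float(approx))  # close, up to quantization error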

Additional System Overview

FIG. 14 is a block diagram of a processing system 1400. Elements of FIG. 14 having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a manner similar to those, can include the same components, and can be linked to other entities, such as those described elsewhere herein, but are not limited to such. The system 1400 can be used in a single-processor desktop system, a multiprocessor workstation system, or a server system having a large number of processors 1402 or processor cores 1407. The system 1400 can be a processing platform incorporated within a system-on-a-chip (SoC) integrated circuit for use in mobile, handheld, or embedded devices, such as within Internet-of-things (IoT) devices with wired or wireless connectivity to a local or wide area network.

The system 1400 can be a processing system having components that correspond to those of FIG. 1. For example, in different configurations, the processor(s) 1402 or processor core(s) 1407 can correspond to the processor(s) 102 of FIG. 1. The graphics processor(s) 1408 can correspond to the parallel processor(s) 112 of FIG. 1. The external graphics processor 1418 can be one of the add-in device(s) 120 of FIG. 1.

The system 1400 can include, couple with, or be integrated within: a server-based gaming platform; a game console, including a game and media console; a mobile gaming console, a handheld game console, or an online game console. The system 1400 can be part of a mobile phone, a smartphone, a tablet computing device, or a mobile Internet-connected device such as a laptop with low internal storage capacity. The processing system 1400 can also include, couple with, or be integrated within: a wearable device, such as a smart watch wearable device; smart eyewear or clothing enhanced with augmented reality (AR) or virtual reality (VR) features to provide visual, audio, or tactile outputs to supplement real-world visual, audio, or tactile experiences or otherwise provide text, audio, graphics, video, holographic images or video, or tactile feedback; other augmented reality (AR) devices; or other virtual reality (VR) devices. The processing system 1400 can include, or be part of, a television or set-top box device. The system 1400 can include, couple with, or be integrated within a self-driving vehicle such as a bus, tractor trailer, car, motor or electric power cycle, plane, or glider (or any combination thereof). The self-driving vehicle can use the system 1400 to process the environment sensed around the vehicle.

The one or more processors 1402 can include one or more processor cores 1407 to process instructions which, when executed, perform operations for system and user software. At least one of the one or more processor cores 1407 can be configured to process a specific instruction set 1409. The instruction set 1409 can facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). One or more processor cores 1407 can process a different instruction set 1409, which can include instructions to facilitate the emulation of other instruction sets. A processor core 1407 can also include other processing devices, such as a Digital Signal Processor (DSP).

The processor 1402 can include a cache memory 1404. Depending on the architecture, the processor 1402 can have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of the processor 1402. In some embodiments, the processor 1402 also uses an external cache (e.g., a Level-3 (L3) cache or Last Level Cache (LLC)) (not shown), which can be shared among the processor cores 1407 using known cache coherency techniques. A register file 1406 can additionally be included in the processor 1402 and can include different types of registers for storing different types of data (e.g., integer registers, floating point registers, status registers, and an instruction pointer register). Some registers can be general-purpose registers, while other registers can be specific to the design of the processor 1402.

The one or more processors 1402 can be coupled with one or more interface buses 1410 to transmit communication signals such as address, data, or control signals between the processor 1402 and other components in the system 1400. In one of these embodiments, the interface bus 1410 can be a processor bus, such as a version of the Direct Media Interface (DMI) bus. However, processor buses are not limited to the DMI bus and can include one or more Peripheral Component Interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. For example, the processor(s) 1402 can include an integrated memory controller 1416 and a platform controller hub 1430. The memory controller 1416 facilitates communication between a memory device and other components of the system 1400, while the platform controller hub (PCH) 1430 provides connections to I/O devices via a local I/O bus.

The memory device 1420 can be a dynamic random-access memory (DRAM) device, a static random-access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device having suitable performance to serve as process memory. The memory device 1420 can, for example, operate as system memory for the system 1400, to store data 1422 and instructions 1421 for use when the one or more processors 1402 execute an application or process. The memory controller 1416 also couples with an optional external graphics processor 1418, which can communicate with the one or more graphics processors 1408 in the processors 1402 to perform graphics and media operations. In some embodiments, graphics, media, and/or compute operations can be assisted by an accelerator 1412, which is a coprocessor that can be configured to perform a specialized set of graphics, media, or compute operations. For example, the accelerator 1412 can be a matrix multiplication accelerator used to optimize machine learning or compute operations. The accelerator 1412 can be a ray-tracing accelerator that can be used to perform ray-tracing operations in concert with the graphics processor 1408. In one embodiment, an external accelerator 1419 can be used in place of or in concert with the accelerator 1412.

A display device 1411 can be provided that can connect to the processor(s) 1402. The display device 1411 can be one or more of an internal display device, as in a mobile electronic device or a laptop device, or an external display device attached via a display interface (e.g., DisplayPort, etc.). The display device 1411 can be a head mounted display (HMD), such as a stereoscopic display device for use in virtual reality (VR) applications or augmented reality (AR) applications.

The platform controller hub 1430 can enable peripherals to connect to the memory device 1420 and the processor 1402 via a high-speed I/O bus. The I/O peripherals include, but are not limited to, an audio controller 1446, a network controller 1434, a firmware interface 1428, a wireless transceiver 1426, touch sensors 1425, and a data storage device 1424 (e.g., non-volatile memory, volatile memory, hard disk drive, flash memory, NAND, 3D NAND, 3D Xpoint/Optane, etc.). The data storage device 1424 can connect via a storage interface (e.g., SATA) or via a peripheral bus, such as a Peripheral Component Interconnect bus (e.g., PCI, PCI Express). The touch sensors 1425 can include touch screen sensors, pressure sensors, or fingerprint sensors. The wireless transceiver 1426 can be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver such as a 3G, 4G, 5G, or Long-Term Evolution (LTE) transceiver. The firmware interface 1428 enables communication with system firmware and can be, for example, a unified extensible firmware interface (UEFI). The network controller 1434 can enable a network connection to a wired network. In some embodiments, a high-performance network controller (not shown) couples with the interface bus 1410. The audio controller 1446 can be a multi-channel high-definition audio controller. In some of these embodiments, the system 1400 includes an optional legacy I/O controller 1440 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. The platform controller hub 1430 can also connect to one or more Universal Serial Bus (USB) controllers 1442 to connect input devices, such as keyboard and mouse 1443 combinations, a camera 1444, or other USB input devices.

It will be appreciated that the system 1400 shown is exemplary and not limiting, as other types of data processing systems that are differently configured may also be used. For example, an instance of the memory controller 1416 and the platform controller hub 1430 can be integrated into a discrete external graphics processor, such as the external graphics processor 1418. The platform controller hub 1430 and/or the memory controller 1416 can be external to the one or more processors 1402. For example, the system 1400 can include an external memory controller 1416 and platform controller hub 1430, which can be configured as a memory controller hub and a peripheral controller hub within a system chipset that is in communication with the processor(s) 1402.

For example, circuit boards ("sleds") can be used on which components such as CPUs, memory, and other components are placed and which are designed for increased thermal performance. Processing components such as the processors can be located on a top side of a sled, while near memory, such as DIMMs, is located on a bottom side of the sled. As a result of the enhanced airflow provided by this design, the components can operate at higher frequencies and power levels than in typical systems, thereby increasing performance. Furthermore, the sleds are configured to blindly mate with power and data communication cables in a rack, thereby enhancing their ability to be quickly removed, upgraded, reinstalled, and/or replaced. Similarly, individual components located on the sleds, such as processors, accelerators, memory, and data storage drives, are configured to be easily upgraded due to their increased spacing from each other. In the illustrative embodiment, the components additionally include hardware attestation features to prove their authenticity.

A data center can utilize a single network architecture ("fabric") that supports multiple other network architectures, including Ethernet and Omni-Path. The sleds can be coupled to switches via optical fibers, which provide higher bandwidth and lower latency than typical twisted pair cabling (e.g., Category 5, Category 5e, Category 6, etc.). Due to the high-bandwidth, low-latency interconnections and network architecture, the data center may, in use, pool resources such as memory, accelerators (e.g., GPUs, graphics accelerators, FPGAs, ASICs, neural network and/or artificial intelligence accelerators, etc.), and data storage drives that are physically disaggregated, and provide them to compute resources (e.g., processors) on an as-needed basis, enabling the compute resources to access the pooled resources as if they were local.

A power supply or power source can provide voltage and/or current to the system 1400 or any component or system described herein. In one example, the power supply includes an AC-to-DC (alternating current to direct current) adapter to plug into a wall outlet. Such AC power can be a renewable energy (e.g., solar) power source. In one example, the power source includes a DC power source, such as an external AC-to-DC converter. A power source or power supply can also include wireless charging hardware to charge via proximity to a charging field. The power source can include an internal battery, an alternating current supply, a motion-based power supply, a solar power supply, or a fuel cell source.

FIGS. 15A-15C illustrate computing systems and graphics processors. Elements of FIGS. 15A-15C having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a similar manner, can include the same components, and can be linked to other entities, such as those described elsewhere herein, but are not limited to such.

FIG. 15A is a block diagram of a processor 1500, which can be a variant of one of the processors 1402 and can be used in place of one of them. Therefore, the disclosure of any features herein in combination with the processor 1500 also discloses a corresponding combination with the processor(s) 1402, but is not limited to such. The processor 1500 can have one or more processor cores 1502A-1502N, an integrated memory controller 1514, and an integrated graphics processor 1508. Where an integrated graphics processor 1508 is excluded, a system that includes the processor will include a graphics processor device within a system chipset or coupled via a system bus. The processor 1500 can include additional cores up to and including additional core 1502N represented by the dashed-line boxes. Each of the processor cores 1502A-1502N includes one or more internal cache units 1504A-1504N. In some embodiments, each processor core 1502A-1502N also has access to one or more shared cache units 1506. The internal cache units 1504A-1504N and the shared cache units 1506 represent a cache memory hierarchy within the processor 1500. The cache memory hierarchy can include at least one level of instruction and data cache within each processor core and one or more levels of shared mid-level cache, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, where the highest level of cache before external memory is classified as the LLC. In some embodiments, cache coherency logic maintains coherency between the various cache units 1506 and 1504A-1504N.

The processor 1500 can also include a set of one or more bus controller units 1516 and a system agent core 1510. The one or more bus controller units 1516 manage a set of peripheral buses, such as one or more PCI or PCI Express buses. The system agent core 1510 provides management functionality for the various processor components. The system agent core 1510 can include one or more integrated memory controllers 1514 to manage access to various external memory devices (not shown).

For example, one or more of the processor cores 1502A-1502N can include support for simultaneous multi-threading. The system agent core 1510 includes components for coordinating and operating the cores 1502A-1502N during multi-threaded processing. The system agent core 1510 can additionally include a power control unit (PCU), which includes logic and components to regulate the power state of the processor cores 1502A-1502N and the graphics processor 1508.

The processor 1500 can additionally include a graphics processor 1508 to execute graphics processing operations. In some of these embodiments, the graphics processor 1508 couples with the set of shared cache units 1506 and the system agent core 1510, which includes the one or more integrated memory controllers 1514. The system agent core 1510 can also include a display controller 1511 to drive graphics processor output to one or more coupled displays. The display controller 1511 can also be a separate module coupled with the graphics processor via at least one interconnect, or can be integrated within the graphics processor 1508.

A ring-based interconnect 1512 can be used to couple the internal components of the processor 1500. However, alternative interconnect units can be used, such as a point-to-point interconnect, a switched interconnect, or other techniques, including techniques well known in the art. In some of these embodiments having the ring-based interconnect 1512, the graphics processor 1508 couples with the ring-based interconnect 1512 via an I/O link 1513.

The exemplary I/O link 1513 represents at least one of multiple varieties of I/O interconnects, including an on-package I/O interconnect that facilitates communication between various processor components and a high-performance memory module 1518, such as an eDRAM module or a high-bandwidth memory (HBM) module. Optionally, each of the processor cores 1502A-1502N and the graphics processor 1508 can use the high-performance memory module 1518 as a shared Last Level Cache.

The processor cores 1502A-1502N can be, for example, homogeneous cores executing the same instruction set architecture. Alternatively, the processor cores 1502A-1502N are heterogeneous in terms of instruction set architecture (ISA), where one or more of the processor cores 1502A-1502N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. The processor cores 1502A-1502N can be heterogeneous in terms of microarchitecture, where one or more cores having relatively higher power consumption couple with one or more power cores having lower power consumption. As another example, the processor cores 1502A-1502N are heterogeneous in terms of computational capability. Additionally, the processor 1500 can be implemented on one or more chips or as an SoC integrated circuit having the illustrated components, in addition to other components.

FIG. 15B is a block diagram of hardware logic of a graphics processor core block 1519, according to some embodiments described herein. In some embodiments, elements of FIG. 15B having the same reference numbers (or names) as the elements of any other figure herein can operate or function in a manner similar to that described elsewhere herein. In one embodiment, the graphics processor core block 1519 is an example of one partition of a graphics processor. The graphics processor core block 1519 can be included within the integrated graphics processor 1508 of FIG. 15A or a discrete graphics processor, parallel processor, and/or compute accelerator. A graphics processor as described herein can include multiple graphics core blocks based on target power and performance envelopes. Each graphics processor core block 1519 can include a function block 1530 coupled with multiple graphics cores 1521A-1521F that include modular blocks of fixed function logic and general-purpose programmable logic. The graphics processor core block 1519 also includes shared/cache memory 1536 that is accessible by all graphics cores 1521A-1521F, rasterizer logic 1537, and additional fixed function logic 1538.

In some embodiments, the function block 1530 includes a geometry/fixed function pipeline 1531 that can be shared by all graphics cores in the graphics processor core block 1519. In various embodiments, the geometry/fixed function pipeline 1531 includes a 3D geometry pipeline, a video front-end unit, a thread spawner and global thread dispatcher, and a unified return buffer manager, which manages unified return buffers. In one embodiment, the function block 1530 also includes a graphics SoC interface 1532, a graphics microcontroller 1533, and a media pipeline 1534. The graphics SoC interface 1532 provides an interface between the graphics processor core block 1519 and other core blocks within a graphics processor or compute accelerator SoC. The graphics microcontroller 1533 is a programmable sub-processor that is configurable to manage various functions of the graphics processor core block 1519, including thread dispatch, scheduling, and preemption. The media pipeline 1534 includes logic to facilitate the decoding, encoding, pre-processing, and/or post-processing of multimedia data, including image and video data. The media pipeline 1534 implements media operations via requests to compute or sampling logic within the graphics cores 1521A-1521F. One or more pixel backends 1535 can also be included within the function block 1530. The pixel backends 1535 include a buffer memory to store pixel color values and can perform blend operations and lossless color compression on rendered pixel data.

In one embodiment, the graphics SoC interface 1532 enables the graphics processor core block 1519 to communicate with general-purpose application processor cores (e.g., CPUs) and/or other components within an SoC or a system host CPU that is coupled with the SoC via a peripheral interface. The graphics SoC interface 1532 also enables communication with off-chip memory hierarchy elements such as a shared last-level cache memory, system RAM, and/or embedded on-chip or on-package DRAM. The SoC interface 1532 can also enable communication with fixed-function devices within the SoC, such as camera imaging pipelines, and enables the use and/or implementation of global memory atomics that may be shared between the graphics processor core block 1519 and CPUs within the SoC. The graphics SoC interface 1532 can also implement power management controls for the graphics processor core block 1519 and enable an interface between the clock domain of the graphics processor core block 1519 and other clock domains within the SoC. In one embodiment, the graphics SoC interface 1532 enables receipt of command buffers from a command streamer and global thread dispatcher that are configured to provide commands and instructions to each of one or more graphics cores within a graphics processor. The commands and instructions can be dispatched to the media pipeline 1534 when media operations are to be performed, and to the geometry and fixed function pipeline 1531 when graphics processing operations are to be performed. When compute operations are to be performed, compute dispatch logic can dispatch the commands to the graphics cores 1521A-1521F, bypassing the geometry and media pipelines.
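
As a purely illustrative aid (not part of any claimed embodiment), the following minimal Python sketch models the routing behavior just described: media commands flow to the media pipeline, graphics commands to the geometry and fixed function pipeline, and compute commands bypass both and go directly to the graphics cores. All names and structures here are invented for illustration.

```python
from enum import Enum, auto

class EngineClass(Enum):
    MEDIA = auto()
    RENDER_3D = auto()
    COMPUTE = auto()

def dispatch(command):
    """Route a decoded command to the appropriate target, mirroring the
    routing described above. This is a toy model, not driver or hardware
    behavior."""
    if command["engine"] is EngineClass.MEDIA:
        return "media_pipeline_1534"
    if command["engine"] is EngineClass.RENDER_3D:
        return "geometry_fixed_function_pipeline_1531"
    # Compute dispatch bypasses the geometry and media pipelines.
    return "graphics_cores_1521A_1521F"

print(dispatch({"engine": EngineClass.COMPUTE}))  # graphics_cores_1521A_1521F
```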

The graphics microcontroller 1533 can be configured to perform various scheduling and management tasks for the graphics processor core block 1519. In one embodiment, the graphics microcontroller 1533 can perform scheduling of graphics workloads and/or compute workloads on the various vector engines 1522A-1522F, 1524A-1524F and matrix engines 1523A-1523F, 1525A-1525F within the graphics cores 1521A-1521F. In this scheduling model, host software executing on a CPU core of an SoC including the graphics processor core block 1519 can submit workloads to one of multiple graphics processor doorbells, which invokes a scheduling operation on the appropriate graphics engine. Scheduling operations include determining which workload to run next, submitting a workload to a command streamer, preempting existing workloads running on an engine, monitoring progress of a workload, and notifying host software when a workload is complete. In one embodiment, the graphics microcontroller 1533 can also facilitate low-power or idle states for the graphics processor core block 1519, providing the graphics processor core block 1519 with the ability to save and restore registers within the graphics processor core block 1519 across low-power state transitions independently of the operating system and/or graphics driver software on the system.
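
The scheduling flow described above (doorbell submission, selecting the next workload, submission to a command streamer, and host notification on completion) can be pictured with the following toy Python model; it is a hedged sketch of the general pattern, not the microcontroller firmware itself, and all names are invented.

```python
import collections

class GraphicsMicrocontrollerSketch:
    """Toy model of the scheduling flow: host software rings a doorbell
    to submit work, the scheduler picks the next workload, submits it to
    a command streamer stand-in, and records completion for the host."""

    def __init__(self):
        self.run_queue = collections.deque()
        self.completed = []

    def ring_doorbell(self, workload):
        # Host software submits a workload by writing to a doorbell.
        self.run_queue.append(workload)

    def schedule(self):
        while self.run_queue:
            workload = self.run_queue.popleft()   # decide what runs next
            self._submit_to_command_streamer(workload)
            self.completed.append(workload)       # notify host software

    def _submit_to_command_streamer(self, workload):
        print(f"running {workload}")

gm = GraphicsMicrocontrollerSketch()
gm.ring_doorbell("compute_kernel_A")
gm.ring_doorbell("render_pass_B")
gm.schedule()
print(gm.completed)  # ['compute_kernel_A', 'render_pass_B']
```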

The graphics processor core block 1519 can have more than or fewer than the illustrated graphics cores 1521A-1521F, up to N modular graphics cores. For each set of N graphics cores, the graphics processor core block 1519 can also include shared/cache memory 1536, which can be configured as shared memory or cache memory, rasterizer logic 1537, and additional fixed function logic 1538 to accelerate various graphics and compute processing operations.

Within each graphics core 1521A-1521F is a set of execution resources that can be used to perform graphics, media, and compute operations in response to requests by graphics pipelines, media pipelines, or shader programs. The graphics cores 1521A-1521F include multiple vector engines 1522A-1522F, 1524A-1524F, matrix acceleration units 1523A-1523F, 1525A-1525D, cache/shared local memory (SLM), samplers 1526A-1526F, and ray tracing units 1527A-1527F.

The vector engines 1522A-1522F, 1524A-1524F are general-purpose graphics processing units capable of performing floating-point and integer/fixed-point logic operations in service of a graphics, media, or compute operation, including graphics, media, or compute/GPGPU programs. The vector engines 1522A-1522F, 1524A-1524F can operate at variable vector widths using SIMD, SIMT, or SIMT+SIMD execution modes. The matrix acceleration units 1523A-1523F, 1525A-1525D include matrix-matrix and matrix-vector acceleration logic that improves performance on matrix operations, particularly low-precision and mixed-precision (e.g., INT8, FP16, BF16, FP8) matrix operations used for machine learning. In one embodiment, each of the matrix acceleration units 1523A-1523F, 1525A-1525D includes one or more systolic arrays of processing elements that can perform concurrent matrix multiply or dot product operations on matrix elements.
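
The mixed-precision matrix operations mentioned above typically multiply low-precision inputs while accumulating at higher precision. The following NumPy sketch is illustrative only; it assumes an FP16-input/FP32-accumulate convention rather than any specific hardware datapath, and shows the rank-1 multiply-accumulate pattern that a systolic array performs element-wise across its grid.

```python
import numpy as np

# Mixed-precision GEMM pattern: low-precision (FP16) inputs,
# wider (FP32) accumulator.
a = np.random.rand(4, 8).astype(np.float16)
b = np.random.rand(8, 4).astype(np.float16)

acc = np.zeros((4, 4), dtype=np.float32)
for k in range(a.shape[1]):
    # Each step is a rank-1 update: the multiply-accumulate a grid of
    # processing elements performs in parallel on each cycle.
    acc += np.outer(a[:, k], b[k, :]).astype(np.float32)

# Matches a full FP32 matrix multiply within floating-point tolerance.
print(np.allclose(acc, a.astype(np.float32) @ b.astype(np.float32)))
```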

The samplers 1526A-1526F can read media or texture data into memory and can sample data differently based on a configured sampler state and the texture/media format that is being read. Threads executing on the vector engines 1522A-1522F, 1524A-1524F or matrix acceleration units 1523A-1523F, 1525A-1525D can make use of the cache/SLM 1528A-1528F within each graphics core 1521A-1521F. The cache/SLM 1528A-1528F can be configured as a pool of cache memory or shared memory that is local to each of the respective graphics cores 1521A-1521F. The ray tracing units 1527A-1527F within the graphics cores 1521A-1521F include ray traversal/intersection circuitry for performing ray traversal using bounding volume hierarchies (BVHs) and identifying intersections between rays and primitives enclosed within the BVH volumes. In one embodiment, the ray tracing units 1527A-1527F include circuitry for performing depth testing and culling (e.g., using a depth buffer or similar arrangement). In one implementation, the ray tracing units 1527A-1527F perform traversal and intersection operations in concert with image denoising, at least a portion of which may be performed using an associated matrix acceleration unit 1523A-1523F, 1525A-1525D.

Figure 15C is a block diagram of a general-purpose graphics processing unit (GPGPU) 1570 that can be configured as a graphics processor (e.g., the graphics processor 1508) and/or compute accelerator, according to embodiments described herein. The GPGPU 1570 can interconnect with host processors (e.g., one or more CPUs 1546) and memory 1571, 1572 via one or more system and/or memory buses. The memory 1571 may be system memory that can be shared with the one or more CPUs 1546, while the memory 1572 is device memory that is dedicated to the GPGPU 1570. For example, components within the GPGPU 1570 and the memory 1572 may be mapped into memory addresses that are accessible to the one or more CPUs 1546. Access to the memory 1571 and 1572 may be facilitated via a memory controller 1568. The memory controller 1568 may include an internal direct memory access (DMA) controller 1569 or can include logic to perform operations that would otherwise be performed by a DMA controller.

The GPGPU 1570 includes multiple cache memories, including an L2 cache 1553, L1 cache 1554, an instruction cache 1555, and shared memory 1556, at least a portion of which may also be partitioned as cache memory. The GPGPU 1570 also includes multiple compute units 1560A-1560N. Each compute unit 1560A-1560N includes a set of vector registers 1561, scalar registers 1562, vector logic units 1563, and scalar logic units 1564. The compute units 1560A-1560N can also include local shared memory 1565 and a program counter 1566. The compute units 1560A-1560N can couple with a constant cache 1567, which can be used to store constant data, which is data that will not change during the run of a kernel or shader program that executes on the GPGPU 1570. The constant cache 1567 may be a scalar data cache, and cached data can be fetched directly into the scalar registers 1562.

During operation, the one or more CPUs 1546 can write commands into registers in the GPGPU 1570, or into memory in the GPGPU 1570 that has been mapped into an accessible address space. A command processor 1557 can read the commands from registers or memory and determine how those commands will be processed within the GPGPU 1570. A thread dispatcher 1558 can then be used to dispatch threads to the compute units 1560A-1560N to perform those commands. Each compute unit 1560A-1560N can execute threads independently of the other compute units. Additionally, each compute unit 1560A-1560N can be independently configured for conditional computation and can conditionally output the results of computation to memory. The command processor 1557 can interrupt the one or more CPUs 1546 when the submitted commands are complete.
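
The host/device command flow just described can be summarized with a small illustrative Python model: the host writes commands into a mapped address space, a command processor drains and interprets them, a thread dispatcher fans work out to compute units, and the host is interrupted on completion. Class and method names are hypothetical stand-ins, not any real driver API.

```python
import queue

class GPGPUModel:
    """Toy host/device command flow, purely for illustration."""

    def __init__(self, num_compute_units=4):
        self.mmio_commands = queue.Queue()  # stands in for mapped registers/memory
        self.num_compute_units = num_compute_units

    def host_write_command(self, cmd):
        # One or more CPUs write commands into the mapped address space.
        self.mmio_commands.put(cmd)

    def command_processor_run(self):
        results = []
        while not self.mmio_commands.empty():
            cmd = self.mmio_commands.get()
            results.extend(self._thread_dispatcher(cmd))
        self._interrupt_host(results)

    def _thread_dispatcher(self, cmd):
        # Dispatch one thread per compute unit; each executes independently.
        return [f"{cmd} on CU{cu}" for cu in range(self.num_compute_units)]

    def _interrupt_host(self, results):
        print(f"interrupt: {len(results)} threads completed")

gpu = GPGPUModel()
gpu.host_write_command("kernel_launch")
gpu.command_processor_run()  # interrupt: 4 threads completed
```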

Figures 16A-16C illustrate block diagrams of additional graphics processor and compute accelerator architectures provided by embodiments described herein, for example, in accordance with Figures 15A-15C. Elements of Figures 16A-16C having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a manner similar to that in the other figures, can include the same components, and can be linked to other entities, such as those described elsewhere herein, but are not limited to such.

Figure 16A is a block diagram of a graphics processor 1600, which may be a discrete graphics processing unit or may be a graphics processor integrated with a plurality of processing cores or other semiconductor devices, such as, but not limited to, memory devices or network interfaces. The graphics processor 1600 may be a variant of the graphics processor 1508 and may be used in place of the graphics processor 1508. Therefore, the disclosure of any features in combination with the graphics processor 1508 herein also discloses a corresponding combination with the graphics processor 1600 but is not limited to such. The graphics processor communicates via a memory-mapped I/O interface to registers on the graphics processor and with commands placed into the processor memory. The graphics processor 1600 may include a memory interface 1614 to access memory. The memory interface 1614 can be an interface to local memory, one or more internal caches, one or more shared external caches, and/or to system memory.

Optionally, the graphics processor 1600 also includes a display controller 1602 to drive display output data to a display device 1618. The display controller 1602 includes hardware for one or more overlay planes for the display and composition of multiple layers of video or user interface elements. The display device 1618 can be an internal or external display device. In one embodiment, the display device 1618 is a head-mounted display device, such as a virtual reality (VR) display device or an augmented reality (AR) display device. The graphics processor 1600 may include a video codec engine 1606 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group (MPEG) formats such as MPEG-2, Advanced Video Coding (AVC) formats such as H.264/MPEG-4 AVC, H.265/HEVC, Alliance for Open Media (AOMedia) VP8 and VP9, as well as the Society of Motion Picture & Television Engineers (SMPTE) 421M/VC-1, and Joint Photographic Experts Group (JPEG) formats such as JPEG and Motion JPEG (MJPEG).

The graphics processor 1600 may include a block image transfer (BLIT) engine 1603 to perform two-dimensional (2D) rasterizer operations including, for example, bit-boundary block transfers. However, alternatively, 2D graphics operations may be performed using one or more components of a graphics processing engine (GPE) 1610. In some embodiments, the GPE 1610 is a compute engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.

The GPE 1610 may include a 3D pipeline 1612 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act upon 3D primitive shapes (e.g., rectangles, triangles, etc.). The 3D pipeline 1612 includes programmable and fixed function elements that perform various tasks within the element and/or spawn execution threads to a 3D/Media subsystem 1615. While the 3D pipeline 1612 can be used to perform media operations, an embodiment of the GPE 1610 also includes a media pipeline 1616 that is specifically used to perform media operations, such as video post-processing and image enhancement.

The media pipeline 1616 may include fixed function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration in place of, or on behalf of, the video codec engine 1606. The media pipeline 1616 may additionally include a thread spawning unit to spawn threads for execution on the 3D/Media subsystem 1615. The spawned threads perform computations for the media operations on one or more graphics execution units included in the 3D/Media subsystem 1615.

The 3D/Media subsystem 1615 may include logic for executing threads spawned by the 3D pipeline 1612 and media pipeline 1616. The pipelines can send thread execution requests to the 3D/Media subsystem 1615, which includes thread dispatch logic for arbitrating and dispatching the various requests to available thread execution resources. The execution resources include an array of graphics execution units to process the 3D threads and media threads. The 3D/Media subsystem 1615 may include one or more internal caches for thread instructions and data. Additionally, the 3D/Media subsystem 1615 may also include shared memory, including registers and addressable memory, to share data between threads and to store output data.

Figure 16B illustrates a graphics processor 1620, which is a variant of the graphics processor 1600 and may be used in place of the graphics processor 1600 and vice versa. Therefore, the disclosure of any features in combination with the graphics processor 1600 herein also discloses a corresponding combination with the graphics processor 1620 but is not limited to such. The graphics processor 1620 has a tiled architecture, according to embodiments described herein. The graphics processor 1620 may include a graphics processing engine cluster 1622 having multiple instances of the graphics processing engine 1610 of Figure 16A within graphics engine tiles 1610A-1610D. Each graphics engine tile 1610A-1610D can be interconnected via a set of tile interconnects 1623A-1623F. Each graphics engine tile 1610A-1610D can also be connected to a memory module or memory device 1626A-1626D via memory interconnects 1625A-1625D. The memory devices 1626A-1626D can use any graphics memory technology. For example, the memory devices 1626A-1626D may be graphics double data rate (GDDR) memory. The memory devices 1626A-1626D may be high-bandwidth memory (HBM) modules that can be on-die with their respective graphics engine tile 1610A-1610D. The memory devices 1626A-1626D may be stacked memory devices that can be stacked on top of their respective graphics engine tile 1610A-1610D. Each graphics engine tile 1610A-1610D and associated memory 1626A-1626D may reside on separate chiplets, which are bonded to a base die or base substrate, as described in further detail in Figures 24B-24D.

The graphics processor 1620 can be configured with a non-uniform memory access (NUMA) system in which memory devices 1626A-1626D are coupled with associated graphics engine tiles 1610A-1610D. A given memory device may be accessed by graphics engine tiles other than the tile to which the device is directly connected. However, access latency to the memory devices 1626A-1626D may be lowest when accessing a local tile. In one embodiment, a cache coherent NUMA (ccNUMA) system is enabled that uses the tile interconnects 1623A-1623F to enable communication between cache controllers within the graphics engine tiles 1610A-1610D to maintain a consistent memory image when more than one cache stores the same memory location.
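
To make the NUMA behavior concrete, the sketch below uses invented, purely hypothetical latency numbers to show why access to a tile's locally attached memory is cheapest while remote accesses pay an extra cost over the tile interconnect; the figures model the pattern only, not any measured hardware.

```python
# Hypothetical latencies for illustration only.
LOCAL_LATENCY_NS = 100   # access to the tile's directly attached memory
REMOTE_HOP_NS = 40       # extra cost of crossing the tile interconnect

def access_latency(requesting_tile, memory_tile):
    """Return the modeled latency for a memory access in a tiled NUMA
    system: local accesses are cheapest; remote accesses add a hop."""
    if requesting_tile == memory_tile:
        return LOCAL_LATENCY_NS
    return LOCAL_LATENCY_NS + REMOTE_HOP_NS

print(access_latency(0, 0))  # 100 (local tile)
print(access_latency(0, 2))  # 140 (remote tile, via tile interconnect)
```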

The graphics processing engine cluster 1622 can connect with an on-chip or on-package fabric interconnect 1624. In one embodiment, the fabric interconnect 1624 includes a network processor, a network on a chip (NoC), or another switching processor to enable the fabric interconnect 1624 to act as a packet-switched fabric interconnect that switches data packets between components of the graphics processor 1620. The fabric interconnect 1624 can enable communication between the graphics engine tiles 1610A-1610D and components such as the video codec engine 1606 and one or more copy engines 1604. The copy engines 1604 can be used to move data out of, into, and between the memory devices 1626A-1626D and memory that is external to the graphics processor 1620 (e.g., system memory). The fabric interconnect 1624 can also be used to interconnect the graphics engine tiles 1610A-1610D. The graphics processor 1620 may optionally include a display controller 1602 to enable a connection with an external display device 1618. The graphics processor may also be configured as a graphics or compute accelerator. In the accelerator configuration, the display controller 1602 and display device 1618 may be omitted.

The graphics processor 1620 can connect to a host system via a host interface 1628. The host interface 1628 can enable communication between the graphics processor 1620, system memory, and/or other system components. The host interface 1628 can be, for example, a PCI Express bus or another type of host system interface. For example, the host interface 1628 can be an NVLink or NVSwitch interface. The host interface 1628 and fabric interconnect 1624 can cooperate to enable multiple instances of the graphics processor 1620 to act as a single logical device. Cooperation between the host interface 1628 and fabric interconnect 1624 can also enable the individual graphics engine tiles 1610A-1610D to be presented to the host system as distinct logical graphics devices.

Figure 16C illustrates a compute accelerator 1630, according to embodiments described herein. The compute accelerator 1630 can include architectural similarities with the graphics processor 1620 of Figure 16B and is optimized for compute acceleration. A compute engine cluster 1632 can include a set of compute engine tiles 1640A-1640D that include execution logic that is optimized for parallel or vector-based general-purpose compute operations. The compute engine tiles 1640A-1640D may not include fixed function graphics processing logic, although in some embodiments one or more of the compute engine tiles 1640A-1640D can include logic to perform media acceleration. The compute engine tiles 1640A-1640D can connect to memory 1626A-1626D via memory interconnects 1625A-1625D. The memory 1626A-1626D and memory interconnects 1625A-1625D may be similar technology as in the graphics processor 1620 or can be different. The compute engine tiles 1640A-1640D can also be interconnected via a set of tile interconnects 1623A-1623F and may be connected with and/or interconnected by the fabric interconnect 1624. In one embodiment, the compute accelerator 1630 includes a large L3 cache 1636 that can be configured as a device-wide cache. The compute accelerator 1630 can also connect to a host processor and memory via the host interface 1628 in a similar manner as the graphics processor 1620 of Figure 16B.

The compute accelerator 1630 can also include an integrated network interface 1642. In one embodiment, the integrated network interface 1642 includes a network processor and controller logic that enables the compute engine cluster 1632 to communicate over a physical layer interconnect 1644 without requiring data to traverse memory of a host system. In one embodiment, one of the compute engine tiles 1640A-1640D is replaced by network processor logic, and data to be transmitted or received via the physical layer interconnect 1644 may be transmitted directly to or from the memory 1626A-1626D. Multiple instances of the compute accelerator 1630 may be joined via the physical layer interconnect 1644 into a single logical device. Alternatively, the various compute engine tiles 1640A-1640D may be presented as distinct network-accessible compute accelerator devices.

Graphics processing engine

Figure 17 is a block diagram of a graphics processing engine 1710 of a graphics processor in accordance with some embodiments. The graphics processing engine (GPE) 1710 may be a version of the GPE 1610 shown in Figure 16A and may also represent a graphics engine tile 1610A-1610D of Figure 16B. Elements of Figure 17 having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a similar manner as in the other figures, can include the same components, and can be linked to other entities, such as those described elsewhere herein, but are not limited to such. For example, the 3D pipeline 1612 and media pipeline 1616 of Figure 16A are also illustrated in Figure 17. The media pipeline 1616 is optional in some embodiments of the GPE 1710 and may not be explicitly included within the GPE 1710. For example, and in at least one embodiment, a separate media and/or image processor is coupled to the GPE 1710.

The GPE 1710 may couple with or include a command streamer 1703, which provides a command stream to the 3D pipeline 1612 and/or media pipeline 1616. Alternatively or additionally, the command streamer 1703 may be directly coupled to a unified return buffer 1718. The unified return buffer 1718 may be communicatively coupled to a graphics core cluster 1714. Optionally, the command streamer 1703 is coupled with memory, which can be system memory, or one or more of internal cache memory and shared cache memory. The command streamer 1703 can receive commands from the memory and send the commands to the 3D pipeline 1612 and/or media pipeline 1616. The commands are directives fetched from a ring buffer, which stores commands for the 3D pipeline 1612 and media pipeline 1616. The ring buffer can additionally include batch command buffers storing batches of multiple commands. The commands for the 3D pipeline 1612 can also include references to data stored in memory, such as, but not limited to, vertex and geometry data for the 3D pipeline 1612 and/or image data and memory objects for the media pipeline 1616. The 3D pipeline 1612 and media pipeline 1616 process the commands and data by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to the graphics core cluster 1714. The graphics core cluster 1714 may include one or more blocks of graphics cores (e.g., graphics core block 1715A, graphics core block 1715B), each block including one or more graphics cores. Each graphics core includes a set of graphics execution resources that includes general-purpose and graphics-specific execution logic to perform graphics and compute operations, as well as fixed function texture processing logic and/or machine learning and artificial intelligence acceleration logic.
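
The ring buffer from which the command streamer fetches commands can be illustrated with a conventional producer/consumer ring, as in the following Python sketch; this models the general head/tail scheme only, not the register-level behavior of any particular hardware, and the command tuples are invented placeholders.

```python
class RingBuffer:
    """Fixed-size ring: a driver-like producer writes at the tail,
    a command-streamer-like consumer fetches from the head."""

    def __init__(self, size):
        self.storage = [None] * size
        self.head = 0   # consumer (command streamer) position
        self.tail = 0   # producer (driver) position
        self.size = size

    def write(self, cmd):
        if (self.tail + 1) % self.size == self.head:
            raise BufferError("ring full")
        self.storage[self.tail] = cmd
        self.tail = (self.tail + 1) % self.size

    def fetch(self):
        if self.head == self.tail:
            return None  # ring empty
        cmd = self.storage[self.head]
        self.head = (self.head + 1) % self.size
        return cmd

ring = RingBuffer(8)
ring.write(("3D", "draw"))
ring.write(("MEDIA", "decode"))
while (cmd := ring.fetch()) is not None:
    target = "3D pipeline" if cmd[0] == "3D" else "media pipeline"
    print("dispatch to", target)
```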

In various embodiments, the 3D pipeline 1612 can include fixed function and programmable logic to process one or more shader programs, such as vertex shaders, geometry shaders, pixel shaders, fragment shaders, compute shaders, or other shader programs, by processing the instructions and dispatching execution threads to the graphics core cluster 1714. The graphics core cluster 1714 provides a unified block of execution resources for use in processing these shader programs. Multi-purpose execution logic (e.g., execution units) within the graphics core blocks 1715A-1715B of the graphics core cluster 1714 includes support for various 3D API shader languages and can execute multiple simultaneous execution threads associated with multiple shaders.

The graphics core cluster 1714 may include execution logic to perform media functions, such as video and/or image processing. In addition to graphics processing operations, the execution units can also include general-purpose logic that is programmable to perform parallel general-purpose computational operations. The general-purpose logic can perform processing operations in parallel or in conjunction with general-purpose logic within the processor core(s) 1407 of Figure 14 or the cores 1502A-1502N as in Figure 15A.

Output data generated by threads executing on the graphics core cluster 1714 can output data to memory in a unified return buffer (URB) 1718. The URB 1718 can store data for multiple threads. The URB 1718 may be used to send data between different threads executing on the graphics core cluster 1714. The URB 1718 may additionally be used for synchronization between threads on the graphics core cluster 1714 and fixed function logic within the shared function logic 1720.

Optionally, the graphics core cluster 1714 may be scalable, such that the array includes a variable number of graphics cores, each having a variable number of execution units based on the target power and performance level of the GPE 1710. The execution resources may be dynamically scalable, such that execution resources may be enabled or disabled as needed.

The graphics core cluster 1714 couples with shared function logic 1720 that includes multiple resources that are shared between the graphics cores in the graphics core array. The shared functions within the shared function logic 1720 are hardware logic units that provide specialized supplemental functionality to the graphics core cluster 1714. In various embodiments, the shared function logic 1720 includes, but is not limited to, sampler 1721 logic, math 1722 logic, and inter-thread communication (ITC) 1723 logic. Additionally, one or more caches 1725 within the shared function logic 1720 may be implemented.

A shared function is implemented at least in a case where the demand for a given specialized function is insufficient for inclusion within the graphics core cluster 1714. Instead, a single instantiation of that specialized function is implemented as a stand-alone entity in the shared function logic 1720 and shared among the execution resources within the graphics core cluster 1714. The precise set of functions that are shared between the graphics core cluster 1714 and included within the graphics core cluster 1714 varies across embodiments. Specific shared functions within the shared function logic 1720 that are used extensively by the graphics core cluster 1714 may be included within shared function logic 1716 within the graphics core cluster 1714. Optionally, the shared function logic 1716 within the graphics core cluster 1714 can include some or all logic within the shared function logic 1720. All logic elements within the shared function logic 1720 may be duplicated within the shared function logic 1716 of the graphics core cluster 1714. Alternatively, the shared function logic 1720 is excluded in favor of the shared function logic 1716 within the graphics core cluster 1714.

Graphics processing resources

Figures 18A-18C illustrate execution logic including an array of processing elements employed in a graphics processor, according to embodiments described herein. Figure 18A illustrates a graphics core cluster, according to an embodiment. Figure 18B illustrates a vector engine of a graphics core, according to an embodiment. Figure 18C illustrates a matrix engine of a graphics core, according to an embodiment. Elements of Figures 18A-18C having the same reference numbers as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. For example, the elements of Figures 18A-18C can be considered in the context of the graphics processor core block 1519 of Figure 15B and/or the graphics core blocks 1715A-1715B of Figure 17. In one embodiment, the elements of Figures 18A-18C have similar functionality to equivalent components of the graphics processor 1508 of Figure 15A or the GPGPU 1570 of Figure 15C.

As shown in Figure 18A, in one embodiment the graphics core cluster 1714 includes a graphics core block 1715, which may be either graphics core block 1715A or graphics core block 1715B of Figure 17. The graphics core block 1715 can include any number of graphics cores (e.g., graphics core 1815A, graphics core 1815B, through graphics core 1815N). Multiple instances of the graphics core block 1715 may be included. In one embodiment, the elements of the graphics cores 1815A-1815N have similar or equivalent functionality as the elements of the graphics cores 1521A-1521F of Figure 15B. In such an embodiment, the graphics cores 1815A-1815N each include circuitry including, but not limited to, vector engines 1802A-1802N, matrix engines 1803A-1803N, memory load/store units 1804A-1804N, instruction caches 1805A-1805N, data caches/shared local memory 1806A-1806N, ray tracing units 1808A-1808N, and samplers 1810A-1810N. The circuitry of the graphics cores 1815A-1815N can additionally include fixed function logic 1812A-1812N. The number of vector engines 1802A-1802N and matrix engines 1803A-1803N within the graphics cores 1815A-1815N of a design can vary based on the workload, performance, and power targets for the design.

With reference to the graphics core 1815A, the vector engine 1802A and matrix engine 1803A are configurable to perform parallel compute operations on data in a variety of integer and floating-point data formats based on instructions associated with shader programs. Each vector engine 1802A and matrix engine 1803A can act as a programmable general-purpose computational unit that is capable of executing multiple simultaneous hardware threads while processing multiple data elements in parallel for each thread. The vector engine 1802A and matrix engine 1803A support the processing of variable-width vectors at various SIMD widths, including but not limited to SIMD8, SIMD16, and SIMD32. Input data elements can be stored as a packed data type in a register, and the vector engine 1802A and matrix engine 1803A can process the various elements based on the data size of the elements. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the vector is processed as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (byte (B) size data elements). However, different vector widths and register sizes are possible. In one embodiment, the vector engine 1802A and matrix engine 1803A are also configurable to perform SIMT operations on thread groups of various sizes (e.g., 8, 16, or 32 threads).
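
The packed-data interpretation of a 256-bit register can be demonstrated with NumPy views, which reinterpret the same 32 bytes at different element widths; this is an illustrative analogy in host code, not a model of the hardware register file.

```python
import numpy as np

# A 256-bit (32-byte) register's worth of data.
reg = np.arange(32, dtype=np.uint8)     # 32 x 8-bit elements  (byte, B)

print(reg.view(np.uint16).size)         # 16 x 16-bit elements (Word, W)
print(reg.view(np.uint32).size)         # 8  x 32-bit elements (Double Word, DW)
print(reg.view(np.uint64).size)         # 4  x 64-bit elements (Quad-Word, QW)
```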

Continuing with the graphics core 1815A, the memory load/store unit 1804A services memory access requests that are issued by the vector engine 1802A, matrix engine 1803A, and/or other components of the graphics core 1815A that have access to memory. The memory access requests can be processed by the memory load/store unit 1804A to load or store the requested data to or from cache or memory into a register file associated with the vector engine 1802A and/or matrix engine 1803A. The memory load/store unit 1804A can also perform prefetching operations. With additional reference to Figure 19, in one embodiment the memory load/store unit 1804A is configured to provide SIMT scatter/gather prefetch or block prefetch for data stored in memory 1910, from memory that is local to other tiles via the tile interconnect 1908, or from system memory. Prefetch can be performed to a specific L1 cache (e.g., data cache/shared local memory 1806A), the L2 cache 1904, or the L3 cache 1906. In one embodiment, a prefetch to the L3 cache 1906 automatically results in the data being stored in the L2 cache 1904.

The instruction cache 1805A stores instructions to be executed by the graphics core 1815A. In one embodiment, the graphics core 1815A also includes instruction fetch and prefetch circuitry that fetches or prefetches instructions into the instruction cache 1805A. The graphics core 1815A also includes instruction decode logic to decode instructions within the instruction cache 1805A. The data cache/shared local memory 1806A can be configured as a data cache that is managed by a cache controller that implements a cache replacement policy and/or configured as explicitly managed shared memory. The ray tracing unit 1808A includes circuitry to accelerate ray tracing operations. The sampler 1810A provides texture sampling for 3D operations and media sampling for media operations. The fixed function logic 1812A includes fixed function circuitry that is shared between the various instances of the vector engine 1802A and matrix engine 1803A. The graphics cores 1815B-1815N can operate in a similar manner as the graphics core 1815A.

Functionality of the instruction caches 1805A-1805N, data caches/shared local memory 1806A-1806N, ray tracing units 1808A-1808N, samplers 1810A-1810N, and fixed function logic 1812A-1812N corresponds with equivalent functionality in the graphics processor architectures described herein. For example, the instruction caches 1805A-1805N can operate in a similar manner as the instruction cache 1555 of Figure 15C. The data caches/shared local memory 1806A-1806N, ray tracing units 1808A-1808N, and samplers 1810A-1810N can operate in a similar manner as the cache/SLM 1528A-1528F, ray tracing units 1527A-1527F, and samplers 1526A-1526F of Figure 15B. The fixed function logic 1812A-1812N can include elements of the geometry/fixed function pipeline 1531 and/or additional fixed function logic 1538 of Figure 15B. In one embodiment, the ray tracing units 1808A-1808N include circuitry to perform ray tracing acceleration operations performed by the ray tracing cores 372 of Figure 3C.

As shown in Figure 18B, in one embodiment the vector engine 1802 includes an instruction fetch unit 1837, a general register file array (GRF) 1824, an architectural register file array (ARF) 1826, a thread arbiter 1822, a send unit 1830, a branch unit 1832, a set of SIMD floating point units (FPUs) 1834, and, in one embodiment, a set of integer SIMD ALUs 1835. The GRF 1824 and ARF 1826 include the set of general register files and architectural register files associated with each hardware thread that may be active in the vector engine 1802. In one embodiment, per-thread architectural state is maintained in the ARF 1826, while data used during thread execution is stored in the GRF 1824. The execution state of each thread, including the instruction pointers for each thread, can be held in thread-specific registers in the ARF 1826. Register renaming may be used to dynamically allocate registers to hardware threads.

In one embodiment, the vector engine 1802 has an architecture that is a combination of Simultaneous Multi-Threading (SMT) and fine-grained Interleaved Multi-Threading (IMT). The architecture has a modular configuration that can be fine-tuned at design time based on a target number of simultaneous threads and the number of registers per graphics core, where graphics core resources are divided across logic used to execute multiple simultaneous threads. The number of logical threads that may be executed by the vector engine 1802 is not limited to the number of hardware threads, and multiple logical threads can be assigned to each hardware thread.

In one embodiment, the vector engine 1802 can co-issue multiple instructions, which may each be different instructions. The thread arbiter 1822 can dispatch the instructions to one of the send unit 1830, branch unit 1832, or SIMD FPU(s) 1834 for execution. Each execution thread can access 128 general-purpose registers within the GRF 1824, where each register can store 32 bytes, accessible as a variable-width vector of 32-byte data elements. In one embodiment, each thread has access to 4 kilobytes within the GRF 1824, although embodiments are not so limited, and greater or fewer register resources may be provided in other embodiments. In one embodiment, the vector engine 1802 is partitioned into seven hardware threads that can independently perform computational operations, although the number of threads per vector engine 1802 can also vary according to embodiments. For example, in one embodiment, up to 16 hardware threads are supported. In an embodiment in which seven threads may access 4 kilobytes, the GRF 1824 can store a total of 28 kilobytes. Where 16 threads may access 4 kilobytes, the GRF 1824 can store a total of 64 kilobytes. Flexible addressing modes can permit registers to be addressed together to build effectively wider registers or to represent strided rectangular block data structures.
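
The register-file capacity figures above follow directly from the stated parameters, as this small worked example shows (128 registers of 32 bytes each per thread):

```python
REGISTERS_PER_THREAD = 128
BYTES_PER_REGISTER = 32

per_thread = REGISTERS_PER_THREAD * BYTES_PER_REGISTER  # bytes per thread
print(per_thread)                # 4096 bytes = 4 KiB per thread
print(7 * per_thread // 1024)    # 28 KiB total for 7 hardware threads
print(16 * per_thread // 1024)   # 64 KiB total for 16 hardware threads
```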

In one embodiment, memory operations, sampler operations, and other longer-latency system communications are dispatched via "send" instructions that are executed by the message-passing send unit 1830. In one embodiment, branch instructions are dispatched to a dedicated branch unit 1832 to facilitate SIMD divergence and eventual convergence.

In one embodiment, the vector engine 1802 includes one or more SIMD floating point units (FPU(s)) 1834 to perform floating-point operations. In one embodiment, the FPU(s) 1834 also support integer computation. In one embodiment, the FPU(s) 1834 can execute up to M number of 32-bit floating-point (or integer) operations, or execute up to 2M 16-bit integer or 16-bit floating-point operations. In one embodiment, at least one of the FPU(s) provides extended math capability to support high-throughput transcendental math functions and double-precision 64-bit floating point. In some embodiments, a set of 8-bit integer SIMD ALUs 1835 is also present and may be specifically optimized to perform operations associated with machine learning computations. In one embodiment, the SIMD ALUs are replaced by an additional set 1834 of SIMD ALUs that are configurable to perform integer and floating-point operations. In one embodiment, the SIMD FPUs 1834 and SIMD ALUs 1835 are configurable to execute SIMT programs. In one embodiment, combined SIMD+SIMT operation is supported.

In one embodiment, arrays of multiple instances of the vector engine 1802 can be instantiated in a graphics core. For scalability, product architects can choose the exact number of vector engines per graphics core grouping. In one embodiment, the vector engine 1802 can execute instructions across a plurality of execution channels. In a further embodiment, each thread executed on the vector engine 1802 is executed on a different channel.

As shown in Figure 18C, in one embodiment the matrix engine 1803 includes an array of processing elements that are configured to perform tensor operations including vector/matrix and matrix/matrix operations, such as, but not limited to, matrix multiply and/or dot product operations. The matrix engine 1803 may be configured with M rows and N columns of processing elements (PE 1852AA through PE 1852MN) that include multiplier and adder circuits organized in a pipelined fashion. In one embodiment, the processing elements 1852AA-1852MN make up the physical pipeline stages of an N-wide and M-deep systolic array that can be used to perform vector/matrix or matrix/matrix operations in a data-parallel manner, including matrix multiply, fused multiply-add, dot product, or other general matrix-matrix multiplication (GEMM) operations. In one embodiment, the matrix engine 1803 supports 16-bit and 8-bit floating-point operations, as well as 8-bit, 4-bit, 2-bit, and binary integer operations. The matrix engine 1803 can also be configured to accelerate specific machine learning operations. In such embodiments, the matrix engine 1803 can be configured with support for the bfloat (brain floating point) 16-bit floating-point format, or a tensor float 32-bit floating-point format (TF32), which have a different number of mantissa bits and exponent bits relative to Institute of Electrical and Electronics Engineers (IEEE) 754 formats.
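
A functional sketch of what such a grid computes is shown below: an output-stationary model in which the processing element at position (i, j) accumulates a[i, k] * b[k, j] on each step. This Python/NumPy model illustrates the GEMM dataflow only; it does not model the pipelining, operand movement, or timing of an actual systolic array.

```python
import numpy as np

def systolic_gemm(a, b):
    """Output-stationary sketch of an M x N grid of multiply-add
    processing elements: on each systolic step k, PE (i, j) accumulates
    a[i, k] * b[k, j] into its local register."""
    m, depth = a.shape
    _, n = b.shape
    acc = np.zeros((m, n), dtype=a.dtype)  # one accumulator per PE
    for k in range(depth):                 # one step per depth element
        for i in range(m):
            for j in range(n):
                acc[i, j] += a[i, k] * b[k, j]
    return acc

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.arange(12, dtype=np.float32).reshape(3, 4)
print(np.array_equal(systolic_gemm(a, b), a @ b))  # True
```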

In one embodiment, during each cycle, each stage can add the result of an operation performed at that stage to the output of the previous stage. In other embodiments, the pattern of data movement between the processing elements 1852AA-1852MN can vary after a set of compute cycles based on the instruction or macro-operation being performed. For example, in one embodiment, partial sum loopback is enabled and the processing elements can instead add the output of the current cycle to the output generated in the previous cycle. In one embodiment, the final stage of the systolic array can be configured with a loopback to the initial stage of the systolic array. In such embodiments, the number of physical pipeline stages can be decoupled from the number of logical pipeline stages supported by the matrix engine 1803. For example, where the processing elements 1852AA-1852MN are configured as a systolic array of M physical stages, a loopback from stage M to the initial pipeline stage can enable the processing elements 1852AA-1852MN to operate as a systolic array of, for example, 2M, 3M, 4M, etc. logical pipeline stages.
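
For readers unfamiliar with systolic operation, the following C sketch models one pass through an M-deep, N-wide array in which each stage adds its product to the partial sum handed down from the previous stage. The dimensions and the loopback handling are assumptions for illustration, not the engine's actual microarchitecture.

    #include <stddef.h>

    #define M_STAGES 4  /* assumed physical pipeline depth (rows of PEs) */
    #define N_WIDTH  4  /* assumed array width (columns of PEs) */

    /*
     * One pass of a weight-stationary systolic accumulation: stage m
     * multiplies its stationary weight by the incoming activation and
     * adds the partial sum from stage m-1. Feeding the previous pass's
     * outputs back in as 'carry_in' models the stage-M-to-initial-stage
     * loopback that yields 2M, 3M, ... logical stages.
     */
    void systolic_pass(const float weights[M_STAGES][N_WIDTH],
                       const float activations[M_STAGES],
                       const float carry_in[N_WIDTH],
                       float partial_out[N_WIDTH]) {
        for (int n = 0; n < N_WIDTH; ++n) {
            float partial = carry_in[n];      /* injected at the first stage */
            for (int m = 0; m < M_STAGES; ++m)
                partial += weights[m][n] * activations[m];
            partial_out[n] = partial;         /* emerges from the final stage */
        }
    }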

In one embodiment, the matrix engine 1803 includes memories 1841A-1841N, 1842A-1842M to store input data in the form of row and column data for the input matrices. The memories 1842A-1842M are configurable to store row elements (A0-Am) of a first input matrix and the memories 1841A-1841N are configurable to store column elements (B0-Bn) of a second input matrix. The row and column elements are provided as inputs to the processing elements 1852AA-1852MN for processing. In one embodiment, row and column elements of the input matrices can be stored in a systolic register file 1840 within the matrix engine 1803 before those elements are provided to the memories 1841A-1841N, 1842A-1842M. In one embodiment, the systolic register file 1840 is excluded and the memories 1841A-1841N, 1842A-1842M are loaded from registers in an associated vector engine (e.g., GRF 1824 of the vector engine 1802 of FIG. 18B) or from other memory of the graphics core that includes the matrix engine 1803 (e.g., data cache/shared local memory 1806A for matrix engine 1803A of FIG. 18A). Results generated by the processing elements 1852AA-1852MN are then output to an output buffer and/or written to a register file (e.g., systolic register file 1840, GRF 1824, data cache/shared local memory 1806A-1806N) for further processing by other functional units of the graphics processor or for output to memory.

In some embodiments, the matrix engine 1803 is configured with support for input sparsity, where multiply operations for sparse regions of input data can be bypassed by skipping multiply operations that have zero-value operands. In one embodiment, the processing elements 1852AA-1852MN are configured to skip the performance of certain operations that have zero-value input. In one embodiment, sparsity within an input matrix can be detected and operations having known zero output values can be bypassed before being submitted to the processing elements 1852AA-1852MN. The loading of zero-value operands into the processing elements can be bypassed and the processing elements 1852AA-1852MN can be configured to perform multiplications on the non-zero-value input elements. The matrix engine 1803 can also be configured with support for output sparsity, such that operations with results that are pre-determined to be zero are bypassed. For input sparsity and/or output sparsity, in one embodiment, metadata is provided to the processing elements 1852AA-1852MN to indicate, for a processing cycle, which processing elements and/or data channels are to be active during that cycle.
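
A minimal C sketch of the metadata-driven gating described above, assuming a per-lane bitmask as the metadata format (the actual hardware encoding is not specified here): a lane is marked active only when both operands are non-zero, and inactive lanes skip the multiply entirely.

    #include <stdint.h>

    #define LANES 8  /* assumed number of data channels */

    /* A lane is active only if both inputs are non-zero, so products
     * that are known to be zero are never issued. */
    uint8_t build_active_mask(const float a[LANES], const float b[LANES]) {
        uint8_t mask = 0;
        for (int i = 0; i < LANES; ++i)
            if (a[i] != 0.0f && b[i] != 0.0f)
                mask |= (uint8_t)(1u << i);
        return mask;
    }

    /* Multiply-accumulate that consults the metadata mask per lane. */
    void masked_mac(const float a[LANES], const float b[LANES],
                    float acc[LANES], uint8_t active_mask) {
        for (int i = 0; i < LANES; ++i) {
            if (!(active_mask & (1u << i)))
                continue;            /* zero operand: bypass the multiply */
            acc[i] += a[i] * b[i];
        }
    }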

In one embodiment, the matrix engine 1803 includes hardware to enable operations on sparse data having a compressed representation of a sparse matrix that stores non-zero values and metadata identifying the positions of those non-zero values within the matrix. Exemplary compressed representations include but are not limited to compressed tensor representations such as compressed sparse row (CSR), compressed sparse column (CSC), and compressed sparse fiber (CSF) representations. Support for compressed representations enables operations to be performed on input in a compressed tensor format without requiring the compressed representation to be decompressed or decoded. In such embodiments, operations can be performed only on non-zero input values and the resulting non-zero output values can be mapped into an output matrix. In some embodiments, hardware support is also provided for machine-specific lossless data compression formats that are used when transmitting data within hardware or across system buses. Such data may be retained in a compressed format for sparse input data, and the matrix engine 1803 can use the compression metadata for the compressed data to enable operations to be performed only on non-zero values, or to enable blocks of zero-data input to be bypassed for multiply operations.
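
The CSR layout named above is easy to make concrete. The C sketch below performs a sparse matrix-vector multiply directly on a CSR-encoded matrix, touching only the stored non-zero values and using the row-pointer and column-index metadata to locate them; it is a software analogy for the hardware behavior, not a description of the engine itself.

    #include <stddef.h>

    /* CSR: non-zero values plus metadata locating them in the matrix. */
    typedef struct {
        size_t rows;
        const size_t *row_ptr;  /* rows + 1 entries; row r spans
                                   [row_ptr[r], row_ptr[r+1]) */
        const size_t *col_idx;  /* column of each non-zero */
        const float  *values;   /* the non-zero values themselves */
    } csr_matrix;

    /* y = A * x, multiplying only the stored non-zeros. */
    void csr_spmv(const csr_matrix *a, const float *x, float *y) {
        for (size_t r = 0; r < a->rows; ++r) {
            float sum = 0.0f;
            for (size_t k = a->row_ptr[r]; k < a->row_ptr[r + 1]; ++k)
                sum += a->values[k] * x[a->col_idx[k]];
            y[r] = sum;
        }
    }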

In various embodiments, input data can be provided by a programmer in a compressed tensor representation, or a codec can compress the input data into the compressed tensor representation or another sparse data encoding. Additionally, to support compressed tensor representations, streaming compression of sparse input data can be performed before the input data is provided to the processing elements 1852AA-1852MN. In one embodiment, compression is performed on data written to a cache memory associated with the graphics core cluster 1714, with the compression being performed using an encoding that is supported by the matrix engine 1803. In one embodiment, the matrix engine 1803 includes support for input having structured sparsity, in which a pre-determined level or pattern of sparsity is imposed upon the input data. This data may be compressed to a known compression ratio, with the compressed data being processed by the processing elements 1852AA-1852MN according to metadata associated with the compressed data.

FIG. 19 illustrates a tile 1900 of a multi-tile processor, according to an embodiment. In one embodiment, the tile 1900 is representative of one of the graphics engine tiles 1610A-1610D of FIG. 16B or the compute engine tiles 1640A-1640D of FIG. 16C. The tile 1900 of the multi-tile graphics processor includes an array of graphics core clusters (e.g., graphics core cluster 1714A, graphics core cluster 1714B, through graphics core cluster 1714N), with each graphics core cluster having an array 515A-515N of graphics cores. The tile 1900 also includes a global dispatcher 1902 to dispatch threads to processing resources of the tile 1900.

The tile 1900 can include or couple with an L3 cache 1906 and memory 1910. In various embodiments, the L3 cache 1906 may be excluded or the tile 1900 can include additional levels of cache, such as an L4 cache. In one embodiment, each instance of the tile 1900 in the multi-tile graphics processor has an associated memory 1910, such as in FIG. 16B and FIG. 16C. In one embodiment, a multi-tile processor can be configured as a multi-chip module in which the L3 cache 1906 and/or memory 1910 reside on separate chiplets from the graphics core clusters 1714A-1714N. In this context, a chiplet is an at least partially packaged integrated circuit that includes distinct units of logic that can be assembled with other chiplets into a larger package. For example, the L3 cache 1906 can be included in a dedicated cache chiplet or can reside on the same chiplet as the graphics core clusters 1714A-1714N. In one embodiment, the L3 cache 1906 can be included in an active base die or active interposer, as illustrated in FIG. 24C.

A memory fabric 1903 enables communication among the graphics core clusters 1714A-1714N, L3 cache 1906, and memory 1910. An L2 cache 1904 couples with the memory fabric 1903 and is configurable to cache transactions performed via the memory fabric 1903. A tile interconnect 1908 enables communication with other tiles on the graphics processor and may be one of the tile interconnects 1623A-1623F of FIG. 16B and FIG. 16C. In embodiments in which the L3 cache 1906 is excluded from the tile 1900, the L2 cache 1904 may be configured as a combined L2/L3 cache. The memory fabric 1903 is configurable to route data to the L3 cache 1906 or to a memory controller associated with the memory 1910 based on the presence or absence of the L3 cache 1906 in a specific implementation. The L3 cache 1906 can be configured as a per-tile cache that is dedicated to the processing resources of the tile 1900, or may be a partition of a GPU-wide L3 cache.

FIG. 20 is a block diagram illustrating graphics processor instruction formats 2000. Graphics processor execution units support an instruction set having instructions in multiple formats. The solid-lined boxes illustrate the components that are generally included in an execution unit instruction, while the dashed lines include components that are optional or that are included only in a subset of the instructions. In some embodiments, the instruction formats 2000 described and illustrated are macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decode once the instruction is processed. Thus, a single instruction may cause hardware to perform multiple micro-operations.

The graphics processor execution units as described herein may natively support instructions in a 128-bit instruction format 2010. A 64-bit compacted instruction format 2030 is available for some instructions based on the selected instruction, instruction options, and number of operands. The native 128-bit instruction format 2010 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 2030. The native instructions available in the 64-bit format 2030 vary by embodiment. The instruction is compacted in part using a set of index values in an index field 2013. The execution unit hardware references a set of compaction tables based on the index values and uses the compaction table outputs to reconstruct a native instruction in the 128-bit instruction format 2010. Other sizes and formats of instruction can be used.
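
The index-based compaction scheme can be pictured with a short C sketch: the compact encoding carries an index that selects an entry from a compaction table, and the table output is merged with the directly stored bits to rebuild the native 128-bit form. The field positions, table size, and merge rule below are invented for illustration; only the table-lookup structure follows the text.

    #include <stdint.h>

    #define TABLE_SIZE 32  /* assumed number of compaction-table entries */

    typedef struct { uint64_t lo, hi; } native_insn_128;

    /* Per-unit table of common bit patterns (contents are hypothetical). */
    static const native_insn_128 compaction_table[TABLE_SIZE];

    native_insn_128 decompact(uint64_t compact) {
        uint32_t index = (uint32_t)((compact >> 8) & (TABLE_SIZE - 1));
        native_insn_128 out = compaction_table[index]; /* shared bits   */
        out.lo |= compact & 0xFFu;                     /* direct fields */
        return out;
    }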

For each format, the instruction opcode 2012 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit performs each instruction across all data channels of the operands. An instruction control field 2014 may enable control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For instructions in the 128-bit instruction format 2010, an exec-size field 2016 limits the number of data channels that will be executed in parallel. The exec-size field 2016 may not be available for use in the 64-bit compact instruction format 2030.

Some execution unit instructions have up to three operands, including two source operands, src0 2020 and src1 2022, and one destination operand (dest 2018). Other instructions, such as, for example, data manipulation instructions, dot-product instructions, multiply-add instructions, or multiply-accumulate instructions, may have a third source operand (e.g., SRC2 2024). The instruction opcode 2012 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction. The execution units may also support multiple-destination instructions, where one or more of the destinations is implied or implicit based on the instruction and/or the specified destination.

The 128-bit instruction format 2010 may include an access/address mode field 2026 specifying, for example, whether a direct register addressing mode or an indirect register addressing mode is used. When the direct register addressing mode is used, the register address of one or more operands is provided directly by bits in the instruction.

The 128-bit instruction format 2010 may also include an access/address mode field 2026, which specifies an address mode and/or an access mode for the instruction. The access mode can be used to define a data access alignment for the instruction. Access modes including a 16-byte-aligned access mode and a 1-byte-aligned access mode may be supported, where the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction may use byte-aligned addressing for source and destination operands and, when in a second mode, the instruction may use 16-byte-aligned addressing for all source and destination operands.

The address mode portion of the access/address mode field 2026 can determine whether the instruction is to use direct or indirect addressing. When the direct register addressing mode is used, bits in the instruction directly provide the register address of one or more operands. When the indirect register addressing mode is used, the register address of one or more operands may be computed based on an address register value and an address immediate field in the instruction.
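
A compact way to state the two addressing modes is the following C sketch, in which the field widths and struct layout are assumptions for the example: in direct mode the instruction bits name the register outright, while in indirect mode the register number is computed from the address register value plus the address immediate.

    #include <stdint.h>

    typedef struct {
        uint8_t  indirect;        /* address-mode bit from field 2026 */
        uint16_t reg_field;       /* register number encoded in the insn */
        int16_t  addr_immediate;  /* offset used only in indirect mode */
    } operand_encoding;

    uint16_t effective_register(const operand_encoding *op,
                                uint16_t address_register_value) {
        if (!op->indirect)
            return op->reg_field;  /* direct: bits name the register */
        /* indirect: address register value plus address immediate */
        return (uint16_t)(address_register_value + op->addr_immediate);
    }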

Instructions may be grouped based on opcode 2012 bit-fields to simplify opcode decode 2040. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely an example. A move and logic opcode group 2042 may include data movement and logic instructions (e.g., move (mov), compare (cmp)). The move and logic group 2042 may share the five least significant bits (LSB), where move (mov) instructions are in the form of 0000xxxxb and logic instructions are in the form of 0001xxxxb. A flow control instruction group 2044 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 2046 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 2048 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 2048 performs the arithmetic operations in parallel across data channels. A vector math group 2050 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic, such as dot-product calculations, on vector operands. In one embodiment, the illustrated opcode decode 2040 can be used to determine which portion of an execution unit will be used to execute a decoded instruction. For example, some instructions may be designated as systolic instructions that will be performed by a systolic array. Other instructions, such as ray-tracing instructions (not shown), can be routed to a ray-tracing core or ray-tracing logic within a slice or partition of execution logic.
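
Since the group encodings above all differ in the upper opcode bits, a decoder can classify an instruction with a single shift, as in this C sketch (the enum names are ours; the bit patterns are the ones listed in the text):

    #include <stdint.h>

    typedef enum {
        GRP_MOVE,          /* 0000xxxxb: mov                 */
        GRP_LOGIC,         /* 0001xxxxb: cmp and other logic */
        GRP_FLOW_CONTROL,  /* 0010xxxxb (0x20): call, jmp    */
        GRP_MISC,          /* 0011xxxxb (0x30): wait, send   */
        GRP_PARALLEL_MATH, /* 0100xxxxb (0x40): add, mul     */
        GRP_VECTOR_MATH,   /* 0101xxxxb (0x50): dp4          */
        GRP_UNKNOWN
    } opcode_group;

    opcode_group classify_opcode(uint8_t opcode) {
        switch (opcode >> 4) {     /* the upper bits select the group */
        case 0x0: return GRP_MOVE;
        case 0x1: return GRP_LOGIC;
        case 0x2: return GRP_FLOW_CONTROL;
        case 0x3: return GRP_MISC;
        case 0x4: return GRP_PARALLEL_MATH;
        case 0x5: return GRP_VECTOR_MATH;
        default:  return GRP_UNKNOWN;
        }
    }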

Graphics Pipeline

FIG. 21 is a block diagram of a graphics processor 2100, according to another embodiment. Elements of FIG. 21 having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a manner similar to those, can comprise the same components, and can be linked to other entities, such as those described elsewhere herein, but are not limited to such.

The graphics processor 2100 may include different types of graphics processing pipelines, such as a geometry pipeline 2120, a media pipeline 2130, a display engine 2140, thread execution logic 2150, and a render output pipeline 2170. The graphics processor 2100 may be a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor may be controlled by register writes to one or more control registers (not shown) or via commands issued to the graphics processor 2100 over a ring interconnect 2102. The ring interconnect 2102 may couple the graphics processor 2100 to other processing components, such as other graphics processors or general-purpose processors. Commands from the ring interconnect 2102 are interpreted by a command streamer 2103, which supplies instructions to individual components of the geometry pipeline 2120 or the media pipeline 2130.

The command streamer 2103 may direct the operation of a vertex fetcher 2105 that reads vertex data from memory and executes vertex-processing commands provided by the command streamer 2103. The vertex fetcher 2105 may provide vertex data to a vertex shader 2107, which performs coordinate-space transformation and lighting operations on each vertex. The vertex fetcher 2105 and the vertex shader 2107 may execute vertex-processing instructions by dispatching execution threads to the graphics cores 2152A-2152B via a thread dispatcher 2131.

The graphics cores 2152A-2152B may be an array of vector processors having an instruction set for performing graphics and media operations. The graphics cores 2152A-2152B may have an attached L1 cache 2151 that is specific to each array or shared between the arrays. The cache can be configured as a data cache, an instruction cache, or a single cache that is partitioned to contain data and instructions in different partitions.

The geometry pipeline 2120 may include tessellation components to perform hardware-accelerated tessellation of 3D objects. A programmable hull shader 2111 may configure the tessellation operations. A programmable domain shader 2117 may provide back-end evaluation of the tessellation output. A tessellator 2113 may operate at the direction of the hull shader 2111 and contain special-purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to the geometry pipeline 2120. In addition, if tessellation is not used, the tessellation components (e.g., hull shader 2111, tessellator 2113, and domain shader 2117) can be bypassed. The tessellation components can operate based on data received from the vertex shader 2107.

Complete geometric objects may be processed by a geometry shader 2119 via one or more threads dispatched to the graphics cores 2152A-2152B, or can proceed directly to a clipper 2129. The geometry shader may operate on entire geometric objects, rather than on vertices or patches of vertices as in previous stages of the graphics pipeline. If tessellation is disabled, the geometry shader 2119 receives input from the vertex shader 2107. The geometry shader 2119 may be programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled.

Before rasterization, a clipper 2129 processes the vertex data. The clipper 2129 may be a fixed-function clipper or a programmable clipper having clipping and geometry shader functions. A rasterizer and depth test component 2173 in the render output pipeline 2170 may dispatch pixel shaders to convert the geometric objects into per-pixel representations. The pixel shader logic may be included in the thread execution logic 2150. Optionally, an application can bypass the rasterizer and depth test component 2173 and access unrasterized vertex data via a stream-out unit 2123.

The graphics processor 2100 has an interconnect bus, interconnect fabric, or some other interconnect mechanism that allows data and messages to pass among the major components of the processor. In some embodiments, the graphics cores 2152A-2152B and associated logic units (e.g., L1 cache 2151, sampler 2154, texture cache 2158, etc.) interconnect via a data port 2156 to perform memory access and to communicate with render output pipeline components of the processor. The sampler 2154, caches 2151, 2158, and graphics cores 2152A-2152B may each have separate memory access paths. Optionally, the texture cache 2158 can also be configured as a sampler cache.

The render output pipeline 2170 may contain a rasterizer and depth test component 2173 that converts vertex-based objects into an associated pixel-based representation. The rasterizer logic may include a windower/masker unit to perform fixed-function triangle and line rasterization. An associated render cache 2178 and depth cache 2179 are also available in some embodiments. A pixel operations component 2177 performs pixel-based operations on the data, though in some instances, pixel operations associated with 2D operations (e.g., bit-block image transfers with blending) are performed by the 2D engine 2141, or substituted at display time by the display controller 2143 using overlay display planes. A shared L3 cache 2175 may be available to all graphics components, allowing the sharing of data without the use of main system memory.

The media pipeline 2130 may include a media engine 2137 and a video front end 2134. The video front end 2134 may receive pipeline commands from the command streamer 2103. The media pipeline 2130 may include a separate command streamer. The video front end 2134 may process media commands before sending the commands to the media engine 2137. The media engine 2137 may include thread-spawning functionality to spawn threads for dispatch to the thread execution logic 2150 via the thread dispatcher 2131.

The graphics processor 2100 may include a display engine 2140. This display engine 2140 may be external to the processor 2100 and may couple with the graphics processor via the ring interconnect 2102, or some other interconnect bus or fabric. The display engine 2140 may include a 2D engine 2141 and a display controller 2143. The display engine 2140 may contain special-purpose logic capable of operating independently of the 3D pipeline. The display controller 2143 may couple with a display device (not shown), which may be a system-integrated display device, as in a laptop computer, or an external display device attached via a display device connector.

The geometry pipeline 2120 and the media pipeline 2130 may be configurable to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). Driver software for the graphics processor may translate API calls that are specific to a particular graphics or media library into commands that can be processed by the graphics processor. Support may be provided for the Open Graphics Library (OpenGL), Open Computing Language (OpenCL), and/or Vulkan graphics and compute APIs, all from the Khronos Group. Support may also be provided for the Direct3D library from the Microsoft Corporation. A combination of these libraries may be supported. Support may also be provided for the Open Source Computer Vision Library (OpenCV). A future API with a compatible 3D pipeline would also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.

Graphics Pipeline Programming

FIG. 22A is a block diagram illustrating a graphics processor command format 2200 used to program graphics processing pipelines, such as, for example, the pipelines described herein in conjunction with FIG. 16A, FIG. 17, and FIG. 21. FIG. 22B is a block diagram illustrating a graphics processor command sequence 2210 according to an embodiment. The solid-lined boxes in FIG. 22A illustrate the components that are generally included in a graphics command, while the dashed lines include components that are optional or that are included only in a subset of the graphics commands. The exemplary graphics processor command format 2200 of FIG. 22A includes fields to identify a client 2202 of the command, a command operation code (opcode) 2204, and a data field 2206. A sub-opcode 2205 and a command size 2208 are also included in some commands.

The client 2202 may specify the client unit of the graphics device that processes the command data. A graphics processor command parser may examine the client field of each command to condition the further processing of the command and route the command data to the appropriate client unit. The graphics processor client units may include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit may have a corresponding processing pipeline that processes the commands. Once the command is received by the client unit, the client unit reads the opcode 2204 and, if present, the sub-opcode 2205 to determine the operation to perform. The client unit performs the command using information in the data field 2206. For some commands, an explicit command size 2208 is expected to specify the size of the command. The command parser may automatically determine the size of at least some of the commands based on the command opcode. Commands may be aligned via multiples of a double word. Other command formats can also be used.
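
A software-side parser for this header layout might look like the C sketch below. The exact bit positions of the client, opcode, and sub-opcode fields are invented for the example (they are not given in the text); what the sketch does preserve is the field set and the double-word alignment of commands.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uint32_t client;       /* client unit 2202 */
        uint32_t opcode;       /* opcode 2204 */
        uint32_t sub_opcode;   /* sub-opcode 2205, when present */
        uint32_t size_dwords;  /* command size 2208, in double words */
    } command_header;

    static command_header parse_header(uint32_t dword0) {
        command_header h;
        h.client      = (dword0 >> 29) & 0x7;   /* hypothetical layout */
        h.opcode      = (dword0 >> 23) & 0x3F;
        h.sub_opcode  = (dword0 >> 16) & 0x7F;
        h.size_dwords = (dword0 & 0xFF) + 2;    /* header + payload */
        return h;
    }

    void walk_commands(const uint32_t *buf, size_t len_dwords) {
        size_t pos = 0;
        while (pos < len_dwords) {
            command_header h = parse_header(buf[pos]);
            /* route buf[pos+1 .. pos+h.size_dwords-1] to client h.client */
            pos += h.size_dwords;  /* double-word-aligned stride */
        }
    }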

The flow diagram in FIG. 22B illustrates an exemplary graphics processor command sequence 2210. Software or firmware of a data processing system that features an exemplary graphics processor may use a version of the command sequence shown to set up, execute, and terminate a set of graphics operations. A sample command sequence is shown and described for purposes of example only, and is not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor will process the sequence of commands in an at least partially concurrent manner.

The graphics processor command sequence 2210 may begin with a pipeline flush command 2212 to cause any active graphics pipeline to complete the currently pending commands for the pipeline. Optionally, the 3D pipeline 2222 and the media pipeline 2224 may not operate concurrently. The pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the command parser for the graphics processor will pause command processing until the active drawing engines complete pending operations and the relevant read caches are invalidated. Optionally, any data in the render cache that is marked "dirty" can be flushed to memory. The pipeline flush command 2212 can be used for pipeline synchronization or before placing the graphics processor into a low-power state.

A pipeline select command 2213 may be used when a command sequence requires the graphics processor to explicitly switch between pipelines. The pipeline select command 2213 may be required only once within an execution context before issuing pipeline commands, unless the context is to issue commands for both pipelines. A pipeline flush command 2212 may be required immediately before a pipeline switch via the pipeline select command 2213.

A pipeline control command 2214 may configure a graphics pipeline for operation and may be used to program the 3D pipeline 2222 and the media pipeline 2224. The pipeline control command 2214 may configure the pipeline state for the active pipeline. The pipeline control command 2214 may be used for pipeline synchronization and to clear data from one or more cache memories within the active pipeline before processing a batch of commands.

Commands related to the return buffer state 2216 may be used to configure a set of return buffers for the respective pipelines to write data. Some pipeline operations require the allocation, selection, or configuration of one or more return buffers into which the operations write intermediate data during processing. The graphics processor may also use one or more return buffers to store output data and to perform cross-thread communication. The return buffer state 2216 may include selecting the size and number of return buffers to use for a set of pipeline operations.

The remaining commands in the command sequence differ based on the active pipeline for operations. Based on a pipeline determination 2220, the command sequence is tailored to the 3D pipeline 2222 beginning with the 3D pipeline state 2230, or to the media pipeline 2224 beginning at the media pipeline state 2240.

The commands to configure the 3D pipeline state 2230 include 3D state setting commands for vertex buffer state, vertex element state, constant color state, depth buffer state, and other state variables that are to be configured before 3D primitive commands are processed. The values of these commands are determined at least in part based on the particular 3D API in use. The 3D pipeline state 2230 commands may also be able to selectively disable or bypass certain pipeline elements if those elements will not be used.

A 3D primitive 2232 command may be used to submit 3D primitives to be processed by the 3D pipeline. Commands and associated parameters that are passed to the graphics processor via the 3D primitive 2232 command are forwarded to the vertex fetch function in the graphics pipeline. The vertex fetch function uses the 3D primitive 2232 command data to generate vertex data structures. The vertex data structures are stored in one or more return buffers. The 3D primitive 2232 command may be used to perform vertex operations on 3D primitives via vertex shaders. To process vertex shaders, the 3D pipeline 2222 dispatches shader execution threads to graphics processor execution units.

The 3D pipeline 2222 may be triggered via an execute 2234 command or event. A register write may trigger command execution. An execution may be triggered via a "go" or "kick" command in the command sequence. Command execution may be triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline will perform geometry processing for the 3D primitives. Once operations are complete, the resulting geometric objects are rasterized and the pixel engine colors the resulting pixels. Additional commands to control pixel shading and pixel back-end operations may also be included for those operations.
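
Putting the sequence together, a driver-side routine might emit the commands in the order just described. The emit helper and the numeric opcodes below are placeholders invented for the sketch; only the ordering (flush, select, control, state, primitive, execute) follows the text.

    #include <stdint.h>

    typedef struct { uint32_t *cur; } cmdbuf;

    static void emit(cmdbuf *cb, uint32_t dw) { *cb->cur++ = dw; }

    /* Minimal 3D submission in the order described above. */
    void build_3d_sequence(cmdbuf *cb) {
        emit(cb, 0xA0000000u); /* pipeline flush 2212: drain pending work   */
        emit(cb, 0xA1000001u); /* pipeline select 2213: choose the 3D pipe  */
        emit(cb, 0xA2000003u); /* pipeline control 2214: sync, clear caches */
        emit(cb, 0xA3000002u); /* return buffer state 2216                  */
        emit(cb, 0xA4000005u); /* 3D pipeline state 2230: depth, vertex...  */
        emit(cb, 0xA5000004u); /* 3D primitive 2232: submit geometry        */
        emit(cb, 0xA6000000u); /* execute 2234: the 'go'/'kick' trigger     */
    }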

The graphics processor command sequence 2210 may follow the media pipeline 2224 path when performing media operations. In general, the specific use and manner of programming for the media pipeline 2224 depends on the media or compute operations to be performed. Specific media decode operations may be offloaded to the media pipeline during media decode. The media pipeline can also be bypassed and media decode can be performed in whole or in part using resources provided by one or more general-purpose processing cores. The media pipeline may also include elements for general-purpose graphics processor unit (GPGPU) operations, where the graphics processor is used to perform SIMD vector operations using computational shader programs that are not explicitly related to the rendering of graphics primitives.

The media pipeline 2224 may be configured in a manner similar to the 3D pipeline 2222. A set of commands to configure the media pipeline state 2240 is dispatched or placed into a command queue before the media object commands 2242. Commands for the media pipeline state 2240 may include data to configure the media pipeline elements that will be used to process the media objects. This includes data to configure the video decode and video encode logic within the media pipeline, such as the encode or decode format. Commands for the media pipeline state 2240 may also support the use of one or more pointers to "indirect" state elements that contain a batch of state settings.

Media object commands 2242 may supply pointers to media objects for processing by the media pipeline. The media objects include memory buffers containing the video data to be processed. Optionally, all media pipeline state must be valid before issuing a media object command 2242. Once the pipeline state is configured and the media object commands 2242 are queued, the media pipeline 2224 is triggered via an execute command 2244 or an equivalent execute event (e.g., a register write). Output from the media pipeline 2224 may then be post-processed by operations provided by the 3D pipeline 2222 or the media pipeline 2224. GPGPU operations may be configured and executed in a manner similar to media operations.

Graphics Software Architecture

FIG. 23 illustrates an exemplary graphics software architecture for a data processing system 2300. Such a software architecture may include a 3D graphics application 2310, an operating system 2320, and at least one processor 2330. The processor 2330 may include a graphics processor 2332 and one or more general-purpose processor core(s) 2334. The processor 2330 may be a variant of the processor 1402 or any other of the processors described herein. The processor 2330 may be used in place of the processor 1402 or any other of the processors described herein. Therefore, the disclosure of any features in combination with the processor 1402, or any other of the processors described herein, also discloses a corresponding combination with the processor 2330, but is not limited to such. Moreover, elements of FIG. 23 having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a similar manner, can comprise the same components, and can be linked to other entities, such as those described elsewhere herein, but are not limited to such. The graphics application 2310 and operating system 2320 are each executed in the system memory 2350 of the data processing system.

The 3D graphics application 2310 may contain one or more shader programs including shader instructions 2312. The shader language instructions may be in a high-level shader language, such as the High-Level Shader Language (HLSL) of Direct3D, the OpenGL Shader Language (GLSL), and so forth. The application may also include executable instructions 2314 in a machine language suitable for execution by the general-purpose processor core 2334. The application may also include graphics objects 2316 defined by vertex data.

The operating system 2320 may be an operating system from the Microsoft Corporation, a proprietary UNIX-like operating system, or an open-source UNIX-like operating system using a variant of the Linux kernel. The operating system 2320 can support a graphics API 2322, such as the Direct3D API, the OpenGL API, or the Vulkan API. When the Direct3D API is in use, the operating system 2320 uses a front-end shader compiler 2324 to compile any shader instructions 2312 in HLSL into a lower-level shader language. The compilation may be a just-in-time (JIT) compilation, or the application can perform shader pre-compilation. During compilation of the 3D graphics application 2310, high-level shaders may be compiled into low-level shaders. The shader instructions 2312 may be provided in an intermediate form, such as a version of the Standard Portable Intermediate Representation (SPIR) used by the Vulkan API.

A user-mode graphics driver 2326 may contain a back-end shader compiler 2327 to convert the shader instructions 2312 into a hardware-specific representation. When the OpenGL API is in use, shader instructions 2312 in the GLSL high-level language are passed to the user-mode graphics driver 2326 for compilation. The user-mode graphics driver 2326 may use operating system kernel-mode functions 2328 to communicate with a kernel-mode graphics driver 2329. The kernel-mode graphics driver 2329 may communicate with the graphics processor 2332 to dispatch commands and instructions.

IP Core Implementations

One or more aspects may be implemented by representative code stored on a machine-readable medium that represents and/or defines logic within an integrated circuit such as a processor. For example, the machine-readable medium may include instructions that represent various logic within the processor. When read by a machine, the instructions may cause the machine to fabricate the logic to perform the techniques described herein. Such representations, known as "IP cores," are reusable units of logic for an integrated circuit that may be stored on a tangible, machine-readable medium as a hardware model that describes the structure of the integrated circuit. The hardware model may be supplied to various customers or manufacturing facilities, which load the hardware model on fabrication machines that manufacture the integrated circuit. The integrated circuit may be fabricated such that the circuit performs the operations described in association with any of the embodiments described herein.

FIG. 24A is a block diagram illustrating an IP core development system 2400 that may be used to manufacture an integrated circuit to perform operations, according to an embodiment. The IP core development system 2400 may be used to generate modular, reusable designs that can be incorporated into a larger design or used to construct an entire integrated circuit (e.g., an SOC integrated circuit). A design facility 2430 can generate a software simulation 2410 of an IP core design in a high-level programming language (e.g., C/C++). The software simulation 2410 can be used to design, test, and verify the behavior of the IP core using a simulation model 2412. The simulation model 2412 may include functional, behavioral, and/or timing simulations. A register transfer level (RTL) design 2415 can then be created or synthesized from the simulation model 2412. The RTL design 2415 is an abstraction of the behavior of the integrated circuit that models the flow of digital signals between hardware registers, including the associated logic performed using the modeled digital signals. In addition to an RTL design 2415, lower-level designs at the logic level or transistor level may also be created, designed, or synthesized. Thus, the particular details of the initial design and simulation may vary.

The RTL design 2415 or an equivalent may be further synthesized by the design facility into a hardware model 2420, which may be in a hardware description language (HDL) or some other representation of physical design data. The HDL may be further simulated or tested to verify the IP core design. The IP core design can be stored for delivery to a third-party fabrication facility 2465 using non-volatile memory 2440 (e.g., hard disk, flash memory, or any non-volatile storage medium). Alternatively, the IP core design may be transmitted over a wired connection 2450 or a wireless connection 2460 (e.g., via the Internet). The fabrication facility 2465 may then fabricate an integrated circuit that is based at least in part on the IP core design. The fabricated integrated circuit can be configured to perform operations in accordance with at least one embodiment described herein.

FIG. 24B illustrates a cross-section side view of an integrated circuit package assembly 2470. The integrated circuit package assembly 2470 illustrates an implementation of one or more processor or accelerator devices as described herein. The package assembly 2470 includes multiple units of hardware logic 2472, 2474 connected to a substrate 2480. The logic 2472, 2474 may be implemented at least partly in configurable logic or fixed-functionality logic hardware, and can include one or more portions of any of the processor core(s), graphics processor(s), or other accelerator devices described herein. Each unit of logic 2472, 2474 can be implemented within a semiconductor die and coupled with the substrate 2480 via an interconnect structure 2473. The interconnect structure 2473 may be configured to route electrical signals between the logic 2472, 2474 and the substrate 2480, and can include interconnects such as, but not limited to, bumps or pillars. The interconnect structure 2473 may be configured to route electrical signals such as, for example, input/output (I/O) signals and/or power or ground signals associated with the operation of the logic 2472, 2474. Optionally, the substrate 2480 may be an epoxy-based laminate substrate. The substrate 2480 may also include other suitable types of substrates. The package assembly 2470 can be connected to other electrical devices via a package interconnect 2483. The package interconnect 2483 may be coupled to a surface of the substrate 2480 to route electrical signals to other electrical devices, such as a motherboard, another chipset, or a multi-chip module.

The units of logic 2472, 2474 may be electrically coupled with a bridge 2482 that is configured to route electrical signals between the logic 2472 and the logic 2474. The bridge 2482 may be a dense interconnect structure that provides a route for electrical signals. The bridge 2482 may include a bridge substrate composed of glass or a suitable semiconductor material. Electrical routing features can be formed on the bridge substrate to provide a chip-to-chip connection between the logic 2472 and the logic 2474.

Although two units of logic 2472, 2474 and a bridge 2482 are illustrated, embodiments described herein may include more or fewer logic units on one or more dies. The one or more dies may be connected by zero or more bridges, as the bridge 2482 may be excluded when the logic is included on a single die. Alternatively, multiple dies or units of logic can be connected by one or more bridges. Additionally, multiple logic units, dies, and bridges can be connected together in other possible configurations, including three-dimensional configurations.

FIG. 24C illustrates a package assembly 2490 that includes multiple units of hardware logic chiplets connected to a substrate 2480 (e.g., a base die). A graphics processing unit, parallel processor, and/or compute accelerator as described herein can be composed from diverse silicon chiplets that are separately manufactured. In this context, a chiplet is an at least partially packaged integrated circuit that includes distinct units of logic that can be assembled with other chiplets into a larger package. A diverse set of chiplets with different IP core logic can be assembled into a single device. Additionally, the chiplets can be integrated into a base die or base chiplet using active interposer technology. The concepts described herein enable the interconnection and communication between the different forms of IP within the GPU. IP cores can be manufactured using different process technologies and composed during manufacturing, which avoids the complexity of converging multiple IPs, especially on a large SoC with several flavors of IP, to the same manufacturing process. Enabling the use of multiple process technologies improves the time to market and provides a cost-effective way to create multiple product SKUs. Additionally, the disaggregated IPs are more amenable to being power-gated independently; components that are not in use on a given workload can be powered off, reducing overall power consumption.

In various embodiments, the package assembly 2490 can include fewer or a greater number of components and chiplets that are interconnected by a fabric 2485 or one or more bridges 2487. The chiplets within the package assembly 2490 may have a 2.5D arrangement using Chip-on-Wafer-on-Substrate stacking, in which multiple dies are stacked side by side on a silicon interposer that includes through-silicon vias (TSVs) to couple the chiplets with the substrate 2480, which includes electrical connections to the package interconnect 2483.

In one embodiment, the silicon interposer is an active interposer 2489 that includes embedded logic in addition to TSVs. In such embodiments, the chiplets within the package assembly 2490 are arranged using 3D face-to-face die stacking on top of the active interposer 2489. The active interposer 2489 can include hardware logic for I/O 2491, cache memory 2492, and other hardware logic 2493, in addition to an interconnect fabric 2485 and a silicon bridge 2487. The fabric 2485 enables communication between the various logic chiplets 2472, 2474 and the logic 2491, 2493 within the active interposer 2489. The fabric 2485 can be a NoC interconnect or another form of packet-switched fabric that switches data packets between components of the package assembly. For complex assemblies, the fabric 2485 may be a dedicated chiplet that enables communication between the various hardware logic of the package assembly 2490.

Bridge structures 2487 within the active interposer 2489 may be used to facilitate a point-to-point interconnect between, for example, logic or I/O chiplets 2474 and memory chiplets 2475. In some implementations, bridge structures 2487 may also be embedded within the substrate 2480.

The hardware logic chiplets can include special-purpose hardware logic chiplets 2472, logic or I/O chiplets 2474, and/or memory chiplets 2475. The hardware logic chiplets 2472 and logic or I/O chiplets 2474 may be implemented at least partly in configurable logic or fixed-functionality logic hardware and can include one or more portions of any of the processor core(s), graphics processor(s), parallel processors, or other accelerator devices described herein. The memory chiplets 2475 can be DRAM (e.g., GDDR, HBM) memory or cache (SRAM) memory. The cache memory 2492 within the active interposer 2489 (or substrate 2480) can act as a global cache for the package assembly 2490, as part of a distributed global cache, or as a dedicated cache for the fabric 2485.

Each chiplet can be fabricated as a separate semiconductor die and coupled with a base die that is embedded within or coupled with the substrate 2480. The coupling with the substrate 2480 can be performed via an interconnect structure 2473. The interconnect structure 2473 can be configured to route electrical signals between the various chiplets and logic within the substrate 2480. The interconnect structure 2473 can include interconnects such as, but not limited to, bumps or pillars. In some embodiments, the interconnect structure 2473 can be configured to route electrical signals such as, for example, input/output (I/O) signals and/or power or ground signals associated with the operation of the logic, I/O, and memory chiplets. In one embodiment, an additional interconnect structure couples the active interposer 2489 with the substrate 2480.

The substrate 2480 may be an epoxy-based laminate substrate; however, it is not limited to that, and the substrate 2480 may also include other suitable types of substrates. The package assembly 2490 can be connected to other electrical devices via a package interconnect 2483. The package interconnect 2483 may be coupled to a surface of the substrate 2480 to route electrical signals to other electrical devices, such as a motherboard, another chipset, or a multi-chip module.

A logic or I/O chiplet 2474 and a memory chiplet 2475 can be electrically coupled via a bridge 2487 that is configured to route electrical signals between the logic or I/O chiplet 2474 and the memory chiplet 2475. The bridge 2487 may be a dense interconnect structure that provides a route for electrical signals. The bridge 2487 may include a bridge substrate composed of glass or a suitable semiconductor material. Electrical routing features can be formed on the bridge substrate to provide a chip-to-chip connection between the logic or I/O chiplet 2474 and the memory chiplet 2475. The bridge 2487 may also be referred to as a silicon bridge or an interconnect bridge. For example, the bridge 2487 is an Embedded Multi-die Interconnect Bridge (EMIB). Alternatively, the bridge 2487 may simply be a direct connection from one chiplet to another chiplet.

FIG. 24D illustrates a package assembly 2494 including interchangeable chiplets 2495, according to an embodiment. The interchangeable chiplets 2495 can be assembled into standardized slots on one or more base chiplets 2496, 2498. The base chiplets 2496, 2498 can be coupled via a bridge interconnect 2497, which can be similar to the other bridge interconnects described herein and may be, for example, an EMIB. Memory chiplets can also be connected to logic or I/O chiplets via a bridge interconnect. I/O and logic chiplets can communicate via an interconnect fabric. The base chiplets can each support one or more slots in a standardized format for one of logic, I/O, or memory/cache.

SRAM and power delivery circuits can be fabricated into one or more of the base chiplets 2496, 2498, which can be fabricated using a different process technology relative to the interchangeable chiplets 2495 that are stacked on top of the base chiplets. For example, the base chiplets 2496, 2498 can be fabricated using a larger process technology, while the interchangeable chiplets can be manufactured using a smaller process technology. One or more of the interchangeable chiplets 2495 may be memory (e.g., DRAM) chiplets. Different memory densities can be selected for the package assembly 2494 based on the power and/or performance targeted for the product that uses the package assembly 2494. Additionally, logic chiplets with a different number or type of functional units can be selected at time of assembly based on the power and/or performance targeted for the product. Additionally, chiplets containing IP logic cores of differing types can be inserted into the interchangeable chiplet slots, enabling hybrid processor designs that can mix and match IP blocks of different technologies.

Exemplary System-on-Chip Integrated Circuit

FIGS. 25-26B illustrate exemplary integrated circuits and associated graphics processors that may be fabricated using one or more IP cores. Other logic and circuits may be included in addition to what is illustrated, including additional graphics processors/cores, peripheral interface controllers, or general-purpose processor cores. Elements of FIGS. 25-26B having the same or similar names as the elements of any other figure herein describe the same elements as in the other figures, can operate or function in a manner similar to those in the other figures, can comprise the same components, and can be linked to other entities, such as, but not limited to, those described elsewhere herein.

FIG. 25 is a block diagram illustrating an exemplary system-on-chip integrated circuit 2500 that may be fabricated using one or more IP cores. The exemplary integrated circuit 2500 includes one or more application processors 2505 (e.g., CPUs) and at least one graphics processor 2510, which may be a variant of the graphics processors 1408, 1508, 2510, or may be any graphics processor described herein and may be used in place of any graphics processor described. Therefore, the disclosure of any feature herein in combination with a graphics processor also discloses a corresponding combination with the graphics processor 2510, but is not limited to that. The integrated circuit 2500 may additionally include an image processor 2515 and/or a video processor 2520, either of which may be a modular IP core from the same design facility or multiple different design facilities. The integrated circuit 2500 may include peripheral or bus logic including a USB controller 2525, a UART controller 2530, an SPI/SDIO controller 2535, and an I2S/I2C controller 2540. Additionally, the integrated circuit can include a display device 2545 coupled to one or more of a high-definition multimedia interface (HDMI) controller 2550 and a mobile industry processor interface (MIPI) display interface 2555. Storage may be provided by a flash memory subsystem 2560, including flash memory and a flash memory controller. A memory interface may be provided via a memory controller 2565 for access to SDRAM or SRAM memory devices. Some integrated circuits additionally include an embedded security engine 2570.

FIGS. 26A-26B are block diagrams illustrating exemplary graphics processors for use within an SoC, according to embodiments described herein. The illustrated graphics processors may be variants of the graphics processors 1408, 1508, 2510, or of any other graphics processor described herein. The graphics processors may be used in place of the graphics processors 1408, 1508, 2510, or any other of the graphics processors described herein. Therefore, the disclosure of any features in combination with the graphics processors 1408, 1508, 2510, or any other of the graphics processors described herein also discloses a corresponding combination with the graphics processors of FIGS. 26A-26B, but is not limited to that. FIG. 26A illustrates an exemplary graphics processor 2610 of a system-on-chip integrated circuit that may be fabricated using one or more IP cores, according to an embodiment. FIG. 26B illustrates an additional exemplary graphics processor 2640 of a system-on-chip integrated circuit that may be fabricated using one or more IP cores, according to an embodiment. The graphics processor 2610 of FIG. 26A is an example of a low-power graphics processor core. The graphics processor 2640 of FIG. 26B is an example of a higher-performance graphics processor core. For example, as mentioned at the outset of this paragraph, each of the graphics processors 2610, 2640 can be a variant of the graphics processor 2510 of FIG. 25.

As shown in FIG. 26A, the graphics processor 2610 includes a vertex processor 2605 and one or more fragment processors 2615A-2615N (e.g., 2615A, 2615B, 2615C, 2615D, through 2615N-1 and 2615N). The graphics processor 2610 can execute different shader programs via separate logic, such that the vertex processor 2605 is optimized to execute operations for vertex shader programs, while the one or more fragment processors 2615A-2615N execute fragment (e.g., pixel) shading operations for fragment or pixel shader programs. The vertex processor 2605 performs the vertex processing stage of the 3D graphics pipeline and generates primitive and vertex data. The fragment processor(s) 2615A-2615N use the primitive and vertex data generated by the vertex processor 2605 to produce a framebuffer that is displayed on a display device. The fragment processor(s) 2615A-2615N may be optimized to execute fragment shader programs as provided for in the OpenGL API, which may be used to perform operations similar to a pixel shader program as provided for in the Direct3D API.

The graphics processor 2610 additionally includes one or more memory management units (MMUs) 2620A-2620B, cache(s) 2625A-2625B, and circuit interconnect(s) 2630A-2630B. The one or more MMUs 2620A-2620B provide virtual-to-physical address mapping for the graphics processor 2610, including for the vertex processor 2605 and/or fragment processor(s) 2615A-2615N, which may reference vertex data or image/texture data stored in memory, in addition to vertex data or image/texture data stored in the one or more caches 2625A-2625B. The one or more MMUs 2620A-2620B may be synchronized with other MMUs within the system, including one or more MMUs associated with the one or more application processors 2505, the image processor 2515, and/or the video processor 2520 of FIG. 25, such that each processor 2505-2520 can participate in a shared or unified virtual memory system. Components of the graphics processor 2610 may correspond with components of other graphics processors described herein. The one or more MMUs 2620A-2620B may correspond with the MMU 245 of FIG. 2C. The vertex processor 2605 and the fragment processors 2615A-2615N may correspond with the graphics multiprocessor 234. The one or more circuit interconnects 2630A-2630B enable the graphics processor 2610 to interface with other IP cores within the SoC, either via an internal bus of the SoC or via a direct connection, according to embodiments. The one or more circuit interconnects 2630A-2630B may correspond with the data crossbar 240 of FIG. 2C. Further correspondence may be found between analogous components of the graphics processor 2610 and the various graphics processor architectures described herein.

As shown in FIG. 26B, the graphics processor 2640 includes the one or more MMUs 2620A-2620B, the caches 2625A-2625B, and the circuit interconnects 2630A-2630B of the graphics processor 2610 of FIG. 26A. The graphics processor 2640 includes one or more shader cores 2655A-2655N (e.g., 2655A, 2655B, 2655C, 2655D, 2655E, 2655F, through 2655N-1 and 2655N), which provide a unified shader core architecture in which a single core, or a single type of core, can execute all types of programmable shader code, including shader program code to implement vertex shaders, fragment shaders, and/or compute shaders. The exact number of shader cores present can vary among embodiments and implementations. Additionally, the graphics processor 2640 includes an inter-core task manager 2645, which acts as a thread dispatcher to dispatch execution threads to the one or more shader cores 2655A-2655N, and a tiling unit 2658 to accelerate tiling operations for tile-based rendering, in which rendering operations for a scene are subdivided in image space, for example, to exploit local spatial coherence within the scene or to optimize use of internal caches. The shader cores 2655A-2655N may correspond with, for example, the graphics multiprocessor 234 of FIG. 2D, the graphics multiprocessors 325, 350 of FIGS. 3A and 3B, respectively, or the multi-core group 365A of FIG. 3C.

Tensor Acceleration Logic for Graphics and Machine Learning Workloads

FIG. 27 is a block diagram of a data processing system 2700, according to an embodiment. The data processing system 2700 is a heterogeneous processing system having a processor 2702, unified memory 2710, and a GPGPU 2720 that includes machine learning acceleration logic. The processor 2702 and the GPGPU 2720 can be any of the processors and GPGPU/parallel processors described herein. For example, with additional reference to FIG. 1, the processor 2702 can be a variant of, and/or share an architecture with, a processor of the illustrated one or more processor(s) 102, and the GPGPU 2720 can be a variant of, and/or share an architecture with, a parallel processor of the illustrated one or more parallel processor(s) 112. With additional reference to FIG. 14, the processor 2702 can be a variant of, and/or share an architecture with, a processor of the illustrated processor(s) 1402, and the GPGPU 2720 can be a variant of, and/or share an architecture with, a graphics processor of the illustrated graphics processor(s) 1408.

The processor 2702 can execute instructions for a compiler 2715 stored in system memory 2712. The compiler 2715 executes on the processor 2702 to compile source code 2714A into compiled code 2714B. The compiled code 2714B can include instructions that may be executed by the processor 2702 and/or instructions that may be executed by the GPGPU 2720. Compilation of instructions to be executed by the GPGPU can be facilitated using shader or compute program compilers, such as, for example, the shader compiler 2327 and/or the shader compiler 2324 of FIG. 23. During compilation, the compiler 2715 can perform operations to insert metadata, including hints as to the level of data parallelism present in the compiled code 2714B and/or hints regarding the data locality associated with threads to be dispatched based on the compiled code 2714B. The compiler 2715 can include the information necessary to perform such operations, or the operations can be performed with the assistance of a runtime library 2716. The runtime library 2716 can also assist the compiler 2715 in the compilation of the source code 2714A and can include instructions that are linked at runtime with the compiled code 2714B to facilitate execution of the compiled instructions on the GPGPU 2720. The compiler 2715 can also facilitate register allocation for variables via a register allocator (RA) and generate load and store instructions to move data for a variable between memory and the register assigned for that variable.
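The sketch below illustrates, purely as a hypothetical example, how a compile step might attach the kind of parallelism and locality hints described above as metadata on a compiled kernel; the names and heuristics are illustrative and are not part of the described toolchain.

```python
# Minimal sketch (hypothetical names and heuristics): a compile step that
# attaches parallelism/locality hints as metadata to a compiled kernel,
# mirroring the hints the compiler 2715 is described as inserting.
from dataclasses import dataclass

@dataclass
class KernelBinary:
    code: bytes                 # device instructions (stand-in)
    parallelism_hint: int       # e.g., data-parallel width the kernel exposes
    locality_hint: str          # e.g., "tile-local" or "streaming"

def compile_kernel(source: str) -> KernelBinary:
    code = bytes(source, "utf-8")   # stand-in for real code generation
    # A real compiler would derive these hints from loop/dataflow analysis.
    parallelism_hint = 16 if "parallel_for" in source else 1
    locality_hint = "tile-local" if "shared[" in source else "streaming"
    return KernelBinary(code, parallelism_hint, locality_hint)
```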

The unified memory 2710 represents a unified address space that can be accessed by the processor 2702 and the GPGPU 2720. The unified memory can include system memory 2712 as well as GPGPU memory 2718. The GPGPU memory 2718 is memory within an address space of the GPGPU 2720 and can include some or all of the system memory 2712. In one embodiment, the compiled code 2714B stored in the system memory 2712 can be mapped into the GPGPU memory 2718 for access by the GPGPU 2720. The GPGPU memory 2718 also includes GPGPU local memory 2728 of the GPGPU 2720. The GPGPU local memory 2728 can include, for example, HBM or GDDR memory.

The GPGPU 2720 includes multiple compute blocks 2724A-2724N, which can include one or more of the variety of processing resources described herein. The processing resources can be or include a variety of different computational resources such as, for example, execution units, compute units, streaming multiprocessors, graphics multiprocessors, or multi-core groups. In one embodiment, the GPGPU 2720 additionally includes a tensor accelerator 2723 (e.g., a matrix accelerator), which can include one or more special-function compute units that are designed to accelerate a subset of matrix operations (e.g., dot products, etc.). The tensor accelerator 2723 may also be referred to as a tensor core. In one embodiment, logic components within the tensor accelerator 2723 may be distributed across the processing resources of the multiple compute blocks 2724A-2724N.

The GPGPU 2720 can also include a set of resources that can be shared by the compute blocks 2724A-2724N and the tensor accelerator 2723, including but not limited to a cache 2727, a power and performance module 2726, and a set of registers 2725. In one embodiment, the registers 2725 include directly and indirectly accessible registers, where the indirectly accessible registers are optimized for use by the tensor accelerator 2723. The power and performance module 2726 can be configured to adjust power delivery and clock frequencies for the compute blocks 2724A-2724N to power-gate idle components within the compute blocks 2724A-2724N. In various embodiments, the cache 2727 can include an instruction cache and/or a lower-level data cache.

The GPGPU 2720 can additionally include an L3 data cache 2730, which can be used to cache data accessed from the unified memory 2710 by the tensor accelerator 2723 and/or the compute elements within the compute blocks 2724A-2724N. In one embodiment, the L3 data cache 2730 includes shared local memory 2732 that can be shared by the compute elements within the compute blocks 2724A-2724N and the tensor accelerator 2723.

In one embodiment, the GPGPU 2720 includes instruction handling logic, such as a fetch and decode unit 2721 and a scheduler controller 2722. The fetch and decode unit 2721 includes a fetch unit and a decode unit to fetch and decode instructions for execution by one or more of the tensor accelerator 2723 or the compute blocks 2724A-2724N. The instructions can be scheduled via the scheduler controller 2722 to the appropriate functional unit within the tensor accelerator or the compute blocks 2724A-2724N. In one embodiment, the scheduler controller 2722 is an ASIC configurable to perform advanced scheduling operations. In one embodiment, the scheduler controller 2722 is a microcontroller or a low-energy-per-instruction processing core capable of executing scheduler instructions loaded from a firmware module.

In one embodiment, some functions to be performed by the compute blocks 2724A-2724N can be directly scheduled to, or offloaded to, the tensor accelerator 2723. In various embodiments, the tensor accelerator 2723 includes processing element logic configured to efficiently perform matrix compute operations, such as the multiply and add operations and dot product operations used by 3D graphics or compute shader programs. In one embodiment, the tensor accelerator 2723 can be configured to accelerate operations used by machine learning frameworks. In one embodiment, the tensor accelerator 2723 is an application-specific integrated circuit explicitly configured to perform a specific set of parallel matrix multiplication and/or addition operations. In one embodiment, the tensor accelerator 2723 is a field-programmable gate array (FPGA) that provides fixed-function logic that can be updated between workloads. In one embodiment, the set of compute operations that can be performed by the tensor accelerator 2723 may be limited relative to the operations that can be performed by the compute blocks 2724A-2724N. However, the tensor accelerator 2723 can perform parallel tensor operations at a significantly higher throughput relative to the compute blocks 2724A-2724N.

FIGS. 28A-28B illustrate a matrix operation 2805 performed by an instruction pipeline 2800, according to an embodiment. FIG. 28A illustrates the instruction pipeline 2800 when configured with a systolic array 2808 within the tensor accelerator 2723. FIG. 28B illustrates the instruction pipeline when configured with graphics processor cores 2810A-2810N that each include a matrix engine 2812A-2812N.

As shown in FIG. 28A, the instruction pipeline 2800 can be configured to perform a matrix operation 2805, such as, but not limited to, a dot product operation. The dot product of two vectors is a scalar value that is equal to the sum of the products of corresponding components of the vectors.

The dot product can be calculated as shown in equation (1) below:

$$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i \qquad (1)$$

The dot product can be used in a convolution operation for a convolutional neural network (CNN). While a 2D convolution is illustrated, an N-dimensional convolution can be performed on an N-dimensional volume using N-dimensional filters. A receptive field tile 2802 highlights a portion of an input volume in an input volume buffer 2804. The input volume buffer can be stored in memory 2830. A dot product matrix operation 2805 can be performed between the data within the receptive field tile 2802 and a convolutional filter to generate a data point within an output buffer 2806, which can also be stored in the memory 2830. The memory 2830 can be any of the memories described herein, including the system memory 2712, the GPGPU memory 2718, or one or more of the cache memories 2727, 2730 of FIG. 27.
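The following is a minimal sketch, in Python with NumPy, of the sliding-window computation described above: each output point is the dot product of the receptive field tile with the filter, per equation (1). The function name and shapes are illustrative, not part of the described embodiments.

```python
# Minimal sketch (assumed 2D shapes): each output point of a 2D convolution
# is the dot product of the receptive field tile with the filter, computed
# by sliding the tile across the input volume.
import numpy as np

def conv2d_valid(input_volume: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    h, w = input_volume.shape
    kh, kw = kernel.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            tile = input_volume[y:y + kh, x:x + kw]   # receptive field tile
            out[y, x] = np.sum(tile * kernel)         # dot product, eq. (1)
    return out
```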

The combination of the data points within the output buffer 2806 represents an activation map generated by the convolution operation. Each point within the activation map is generated by sliding the receptive field tile across the input volume buffer 2804. The activation map data can be input to an activation function to determine an output activation value. In one embodiment, convolution of the input volume buffer 2804 can be defined within a framework as a high-level matrix operation 2805. The high-level matrix operations can be performed via primitive operations, such as basic linear algebra subprogram (BLAS) operations. The primitive operations can be accelerated via hardware instructions executed by the instruction pipeline 2800.

The instruction pipeline 2800 used to accelerate hardware instructions can include the instruction fetch and decode unit 2721, which can fetch and decode hardware instructions, and the scheduler controller 2722, which can schedule decoded instructions to one or more processing resources within the tensor accelerator 2723 and/or the compute blocks 2724A-2724N. In one embodiment, a hardware instruction can be scheduled to the compute blocks 2724A-2724N and offloaded to the tensor accelerator 2723. The one or more hardware instructions and associated data to perform the matrix operation 2805 can be stored in the memory 2830. Output of the hardware instruction can also be stored in the memory 2830.

In one embodiment, the tensor accelerator 2723 can execute one or more hardware instructions to perform the matrix operation 2805 using a systolic array 2808 of processing elements. The systolic array 2808 includes a combination of programmable and fixed-function hardware that is configurable to perform matrix-matrix and matrix-vector dot product operations, as well as other operations, such as matrix-matrix and matrix-vector fused multiply-add operations. The systolic array 2808 includes an array of matrix engines, each of which may be configured similarly to the matrix engine 1803 of FIG. 18C.

In various embodiments, instead of or in addition to the tensor accelerator 2723, matrix acceleration logic can also be included within the processing resources of the compute blocks 2724A-2724N. For example, as shown in FIG. 28B, in one embodiment each compute block (e.g., compute block 2724N) includes an array of graphics cores 2810A-2810N. Each graphics core in the array of graphics cores 2810A-2810N can include a matrix accelerator 2812A-2812N. In one embodiment, the graphics cores 2810A-2810N are the graphics cores 1815A-1815N of FIG. 18A, and the matrix accelerators 2812A-2812N include a version of the matrix engine 1803 of FIG. 18C. The scheduler controller 2722 can schedule matrix operations (dot product, fused multiply-add, etc.) to the available matrix accelerators 2812A-2812N within the graphics cores 2810A-2810N of the various compute blocks 2724A-2724N.

While in one embodiment each of the compute blocks 2724A-2724N includes an array of graphics cores 2810A-2810N, in another embodiment the compute blocks 2724A-2724N share an architecture with the processing clusters 214A-214N of the processing cluster array of FIG. 2A. In such embodiments, the compute blocks 2724A-2724N include multiple graphics multiprocessors 234 as in FIG. 2C, which include internal components as illustrated in FIG. 2D. Thus, the graphics multiprocessors within the compute blocks can include a load/store unit 266, GPGPU cores 262, and tensor/RT cores 263. In one embodiment, the compute blocks 2724A-2724N can include the multi-core groups 365A-365N of the GPU 380 of FIG. 3C and include multiple sets of GFX cores 370, tensor cores 371, and ray tracing cores 372. In such embodiments, the scheduler controller 2722 can schedule instructions to perform matrix operations to the tensor/RT cores 263 and/or the tensor cores 371 within the compute blocks 2724A-2724N. Accelerated matrix operations include dot product operations, matrix multiply operations, and/or fused multiply-add operations, which can be performed on integer or floating-point matrix elements and at various levels of precision. Additionally, in one embodiment, the compute blocks 2724A-2724N can include variants of the compute units 1560A-1560N of FIG. 15C, where such variants include matrix acceleration logic as described herein (e.g., a systolic array, tensor core, or systolic tensor core) that can execute integer or floating-point matrix acceleration instructions. In each configuration, the processing elements within each compute block 2724A-2724N can cooperatively execute a thread block cluster of a kernel program.

FIG. 29 illustrates a compute block 2900 that includes codec-enabled disaggregated systolic logic. In one embodiment, instead of including a systolic array 2808 in a separate tensor accelerator 2723 as in FIG. 28A, or a matrix engine 2812A-2812N in each graphics core 2815A-2815N, a disaggregated set of systolic arrays 2912A-2912B can be included in a compute block 2900 that is similar to one of the compute blocks 2724A-2724N of FIG. 27. The compute block 2900 can include a plurality of interconnected processing resources (PRs 2908A-2908O), which can be similar to any of the processing resource architectures described herein, such as, but not limited to, the processing resources described herein. In one embodiment, the systolic arrays 2912A-2912B include codecs 2924A-2924B that enable the encoding and decoding of input and output data that is received for processing.

The systolic arrays 2912A-2912B include a W-wide and D-deep network of data processing units that can be used to perform vector or other data-parallel operations in a systolic manner, similar to the other systolic arrays described herein. In one embodiment, the systolic arrays 2912A-2912B can be configured to perform matrix operations, such as matrix dot product operations. In one embodiment, the systolic arrays 2912A-2912B support 16-bit and 8-bit floating-point operations and 8-bit and 4-bit integer operations, as well as binary, bipolar binary, ternary, and one-hot bit operations. In one embodiment, the systolic arrays 2912A-2912B can be configured to accelerate machine learning operations. In such embodiments, the systolic arrays 2912A-2912B can be configured with support for the brain floating point (bfloat) 16-bit floating-point format. By including the systolic arrays 2912A-2912B within the compute block 2900 but external to the PRs 2908A-2908O, the size and number of the systolic arrays 2912A-2912B can be scaled independently of the number of PRs 2908A-2908O. Additionally, communication bandwidth within the PRs that would otherwise be consumed by systolic array activity can be preserved. Furthermore, the systolic arrays 2912A-2912B can be clock/power gated when matrix workloads are not being executed.
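A minimal software model of the matrix dot product computation such an array performs is sketched below. It assumes a simple output-stationary formulation of C = A x B; an actual systolic array pipelines operands between neighboring processing elements on each cycle, which is not modeled here.

```python
# Minimal sketch (pure software model, not the hardware): an output-stationary
# grid of multiply-accumulate elements computing C = A @ B. Real systolic
# arrays pipeline operands through neighboring elements each cycle; here the
# per-element MAC is shown directly for clarity.
import numpy as np

def systolic_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n))
    for depth in range(k):             # one "wave" through the D-deep array
        for i in range(m):
            for j in range(n):
                c[i, j] += a[i, depth] * b[depth, j]   # per-element MAC
    return c
```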

Communication between the systolic arrays 2912A-2912B and the PRs 2908A-2908O can be performed via a cache or shared local memory (cache/SLM 2910) and/or a shared register file 2914. In one embodiment, instead of a distinct shared register file 2914, the cache/SLM 2910 can be partitioned for use as a shared register file. The shared register file 2914 can be structured similarly to other GPGPU register files described herein. The shared register file can also include a set of special-purpose registers that are used to configure the interaction between the systolic arrays 2912A-2912B and the PRs 2908A-2908O. The cache/SLM 2910 can be an L1 cache, an L2 cache, and/or a block of explicitly addressable on-die memory.

Matrix data to be processed by the systolic arrays 2912A-2912B can be stored in the cache/SLM 2910. Processing commands or instructions can be provided to the systolic arrays 2912A-2912B via the shared register file 2914. Processing results can be read by the PRs 2908A-2908O from the cache/SLM 2910 or from destination/output registers within the shared register file. During operation, communication traffic can be localized to the systolic arrays 2912A-2912B, the cache/SLM 2910, and/or the shared register file 2914, instead of consuming bus/fabric bandwidth within the PRs 2908A-2908O. Any of the PRs 2908A-2908O within the compute block 2900 can offload matrix workloads to either or both of the systolic arrays 2912A-2912B. A message can be sent from a PR to a systolic array with a command that specifies the operation to perform and the operands for that operation. The systolic arrays 2912A-2912B can perform the requested operation (multiply/add, fused multiply/add, multiply/accumulate, dot product, etc.) and output the results to the shared register file 2914. Input, intermediate, and/or output data for the requested operations can be stored in the cache/SLM 2910, and multiple dependent operations can be chained. In one embodiment, the systolic arrays 2912A-2912B can also perform activation functions when performing operations for the training of, or inference using, a neural network, including but not limited to sigmoid, ReLU, and hyperbolic tangent (TanH) activations. In such embodiments, operations for a neural network can be offloaded to the systolic arrays 2912A-2912B at a coarse granularity.
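The sketch below illustrates one plausible shape for such an offload message; the descriptor fields and names are hypothetical and only mirror the command elements described above (operation, operands, destination, optional activation, and chaining).

```python
# Minimal sketch (hypothetical message format): a PR offloads a matrix
# multiply followed by a ReLU activation to a systolic array by writing a
# command descriptor. Field names are illustrative only.
from dataclasses import dataclass

@dataclass
class SystolicCommand:
    opcode: str                      # "matmul", "fma", "mac", "dot", ...
    src0: int                        # cache/SLM address of operand A
    src1: int                        # cache/SLM address of operand B
    dst: int                         # output register or cache/SLM address
    activation: str | None = None    # e.g., "relu", "sigmoid", "tanh"
    chain_next: bool = False         # chain a dependent command

cmd = SystolicCommand(opcode="matmul", src0=0x1000, src1=0x2000,
                      dst=0x3000, activation="relu")
```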

The PRs 2908A-2908O can provide input data to the systolic arrays 2912A-2912B in a compressed format, and the codecs 2924A-2924B can be used to decompress the data. When output data is ready to be provided to the PRs 2908A-2908O, the data can remain decompressed if the PR will perform operations on the data and does not support the direct reading of compressed data. If the PRs 2908A-2908O support reads of compressed data, or will not perform additional operations on the data, the output data can be re-encoded. Zero-based encoding can be used, and compression can be enabled or disabled based on the degree of sparsity of the data. Alternatively, other forms of encoding can be used based on the distribution of the data to be processed or output. For example, the codecs 2924A-2924B can be configured to decode sparse data that is encoded according to zero-based compression or using another form of compression described herein (e.g., one-based, two-based, near-zero, near-one, near-two, etc.). Additional exemplary encoding or compression techniques that may be supported include unique absolute value (UAV) table encoding, significance map (SM) encoding, table encoding (TE), unique value coordinate (UVC) encoding, and mean encoding (ME). Metadata for the encoded data indicates the type of encoding format used for the data. In one embodiment, a specific encoding format can be selected for a specific type of data, such as kernel data or feature data. In one embodiment, a statistical analysis is performed on the data before encoding, to enable an appropriate encoder to be selected for each block of data. In one embodiment, data generated during SM encoding can be used to facilitate the supply of compressed data to the systolic arrays 2912A-2912B. In a zero-based SM encoding mode, only the non-zero values within a block are encoded. The number of non-zero values in a block of samples is indicated in a header, which is followed by a significance map that indicates a mapping of the non-zero values within the block. The non-zero values of the samples are then encoded in order of occurrence within the stream.
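A minimal sketch of the zero-based SM encoding mode just described follows: the header carries the non-zero count, the significance map marks the non-zero positions, and the non-zero values follow in stream order. The exact bit-level layout is not specified above, so the representation here is illustrative.

```python
# Minimal sketch of zero-based significance-map (SM) encoding: header with
# the non-zero count, a bitmap marking non-zero positions, then the non-zero
# values in order of occurrence. The concrete layout is illustrative.
def sm_encode(block: list[float]) -> tuple[int, list[int], list[float]]:
    significance_map = [1 if v != 0 else 0 for v in block]
    nonzeros = [v for v in block if v != 0]
    header = len(nonzeros)               # non-zero count
    return header, significance_map, nonzeros

def sm_decode(header: int, significance_map: list[int],
              nonzeros: list[float]) -> list[float]:
    assert header == len(nonzeros)
    it = iter(nonzeros)
    return [next(it) if bit else 0.0 for bit in significance_map]

# Example: a sparse block round-trips losslessly.
h, sm, nz = sm_encode([0.0, 3.5, 0.0, 0.0, -1.0, 0.0])
assert sm_decode(h, sm, nz) == [0.0, 3.5, 0.0, 0.0, -1.0, 0.0]
```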

Time-Based Frame Generation via a Time-Aware Machine Learning Model

Existing frame generation models are trained based on frame and optical flow data but are not trained to have a notion of the amount of time that elapses between frames. Described herein are techniques to 1) train an interpolation/extrapolation model on frame data, optical flow data, and render timestamps for frames generated at a target frame rate, and 2) use the resulting model to extrapolate or interpolate frames as needed to maintain the target frame rate if an in-progress frame will not be complete before the next frame deadline. Maintaining consistent frame pacing, with updated frame data for every display refresh cycle, enables a smooth and responsive gameplay experience to be achieved even at lower render rates, because time-correlated frames can be generated.
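The sketch below illustrates, under assumed tensor shapes, what one training sample for such a time-aware model might contain; the structure and field names are hypothetical.

```python
# Minimal sketch (assumed shapes, hypothetical names): one training sample
# for a time-aware frame generation model pairs two rendered frames, the
# optical flow between them, a target timestamp, and the ground-truth
# frame rendered at that timestamp.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameGenSample:
    frame_a: np.ndarray       # H x W x 3, earlier rendered frame
    frame_b: np.ndarray       # H x W x 3, later rendered frame
    flow_ab: np.ndarray       # H x W x 2, optical flow from A to B
    target_timestamp: float   # time of the ground-truth frame: after B
                              # (extrapolation) or between A and B
                              # (interpolation)
    target_frame: np.ndarray  # H x W x 3, ground truth at that time
```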

FIGS. 30A-30B illustrate systems for frame extrapolation and frame interpolation using a unified time-aware machine learning model, according to embodiments. FIG. 30A illustrates a system 3000 for frame extrapolation based on frame data, optical flow, and render timestamps. FIG. 30B illustrates a system 3010 for frame interpolation based on frame data, optical flow, and render timestamps. In general, frame extrapolation is considered a harder problem relative to frame interpolation, as it predicts a future frame from past frames and other inputs related to those past frames, such as the optical flow between them. In contrast, frame interpolation is a relatively easier problem, as it predicts an intermediate frame from past and current frames. However, frame interpolation comes with a higher latency cost and may not be suitable for some use cases. By enabling both frame interpolation and extrapolation, a user may elect to use frame interpolation for latency-tolerant applications with higher quality requirements and frame extrapolation for applications that are latency-sensitive but have relatively less strict requirements for perceptible quality. Additionally, by providing the render timestamps associated with the training frame sequences, the machine learning model is able to learn a notion of the relative amount of time that elapses between rendered frames. Once properly trained, the model can be used to interpolate between input frames or to extrapolate the next frame at a specific time relative to another frame.

As shown in FIG. 30A, the system 3000, when configured for frame extrapolation, can load a unified neural network 3005 with extrapolation weights 3007, which are generated by training the unified neural network 3005 to extrapolate a future frame based on two existing frames. A first frame 3001 (frame 1) and a second frame 3002 (frame 2) can be provided as input to the unified neural network 3005, along with the optical flow 3006 between frame 1 and frame 2. The unified neural network 3005 can then generate an extrapolated frame 3003 (frame 3).

When the unified neural network 3005 is a time-aware neural network that has been trained using training data that includes timestamps indicating the relative amount of time between frames, the unified neural network 3005 can be used to target a specific relative timestamp for frame generation. For example, a target timestamp 3004 can be provided that indicates the relative amount of time after the second frame 3002 that is to be represented by the extrapolated frame. A low relative timestamp indicates that the extrapolated frame 3003 will be presented close in time after the second frame 3002 and should appear relatively similar to the second frame 3002. A higher relative timestamp indicates that the extrapolated frame 3003 should show a greater degree of evolution based on the optical flow 3006, as the frame will be displayed a greater amount of time after the display of the second frame 3002.
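As a rough illustration of the role of the target timestamp 3004, the sketch below scales the frame 1 to frame 2 optical flow by the target time and forward-warps frame 2. This is not the trained network 3005, only a simple stand-in for the time-conditioned extrapolation it learns.

```python
# Minimal sketch (not the trained model): time-aware extrapolation by
# scaling the frame1->frame2 optical flow by the target timestamp and
# forward-warping frame 2. A larger target timestamp produces more motion,
# mirroring the behavior described for the target timestamp 3004.
import numpy as np

def extrapolate(frame2: np.ndarray, flow_12: np.ndarray,
                frame_interval: float, target_ts: float) -> np.ndarray:
    h, w, _ = frame2.shape
    scale = target_ts / frame_interval       # fraction of one frame of motion
    out = np.zeros_like(frame2)
    ys, xs = np.mgrid[0:h, 0:w]
    dst_x = np.clip((xs + flow_12[..., 0] * scale).astype(int), 0, w - 1)
    dst_y = np.clip((ys + flow_12[..., 1] * scale).astype(int), 0, h - 1)
    out[dst_y, dst_x] = frame2[ys, xs]       # forward warp (splat)
    return out
```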

As shown in FIG. 30B, the system 3010, when configured for frame interpolation, can load the unified neural network 3005 with interpolation weights 3017, which are generated by training the unified neural network 3005 to interpolate an intermediate frame between two existing frames. A first frame 3011 (frame 1) and a third frame 3013 (frame 3) can be provided as input to the unified neural network 3005, along with the optical flow 3016 between frame 1 and frame 3. The unified neural network 3005 can then generate an interpolated frame 3012 (frame 2) between frame 1 and frame 3.

When the unified neural network 3005 is a time-aware neural network that has been trained using training data that includes timestamps indicating the relative amount of time between frames, the unified neural network 3005 can be used to generate a frame that is to be displayed at a specific target time between the first frame 3011 and the third frame 3013. For example, a target timestamp 3014 can be provided that indicates a point in time between the first frame 3011 and the third frame 3013. A low relative timestamp indicates that the interpolated frame 3012 will be presented close in time after the first frame 3011 and should appear closer to the first frame 3011. A higher relative timestamp indicates that the interpolated frame 3012 should appear closer to the third frame 3013.
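As a rough illustration of the role of the target timestamp 3014, the sketch below normalizes the timestamp to a blend weight between the two input frames. A naive cross-fade stands in for the learned flow-guided interpolation performed by the unified network 3005.

```python
# Minimal sketch (not the trained model): the target timestamp, normalized
# to [0, 1] between the two input frames, selects where the interpolated
# frame falls; a cross-fade stands in for the learned interpolation.
import numpy as np

def interpolate(frame1: np.ndarray, frame3: np.ndarray,
                t1: float, t3: float, target_ts: float) -> np.ndarray:
    alpha = (target_ts - t1) / (t3 - t1)   # 0 -> near frame 1, 1 -> near frame 3
    return (1.0 - alpha) * frame1 + alpha * frame3
```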

Existing techniques can apply adaptive resolution or adaptive quality to hit a frame rate target. Such techniques reduce the render resolution or render quality to enable the graphics processor to meet or exceed a predetermined frame rate target. In contrast with adaptive resolution or quality techniques, embodiments described herein can adaptively perform neural frame generation to meet a target frame rate if a determination is made that an in-progress frame will not be completed in time to meet the next frame display deadline. In such a circumstance, a new frame can be extrapolated based on previously rendered frames, to be displayed at the next frame display deadline. Where a time-aware machine learning model is in use, a specific target timestamp can be provided to the machine learning model to indicate the time at which the generated frame will be displayed relative to the input frames.

The graphics processor can be configured to present frames for display at a predetermined pace, which in one embodiment corresponds to the refresh rate of the display to which the output of the graphics processor is to be displayed. For example, a display with a refresh rate of 60 Hz refreshes the image 60 times per second and can display a new image every 16.6 milliseconds. A display with a refresh rate of 120 Hz refreshes the image one hundred and twenty times per second and can display a new image every 8.3 milliseconds. When vertical synchronization (Vsync) is enabled, the graphics processor is configured to synchronize updates of frame data with the refresh rate of the display. This synchronization causes the graphics processor to update the display only when the display is ready to display a new frame, such as during the vertical blanking interval (Vblank), which is the time between the end of the last visible line of a frame and the beginning of the first visible line of the next frame. During the vertical blanking interval, the graphics processor swaps the buffer containing the newly rendered frame with the buffer currently being displayed on screen. When the display begins refreshing from the new buffer, the updated frame is displayed on screen. This synchronization prevents visual artifacts such as screen tearing. However, if the graphics processor is unable to render a new frame in time to meet the next vertical blanking interval, the previously rendered frame is presented for the next frame. Stuttering can occur if too many frames pass before new frame data is presented to the display.
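The refresh arithmetic above follows directly from the reciprocal of the refresh rate, as the following minimal sketch shows.

```python
# Minimal sketch of the refresh arithmetic above: the frame interval is the
# reciprocal of the refresh rate, and each Vsync deadline falls one interval
# after the previous one.
def frame_interval_ms(refresh_hz: float) -> float:
    return 1000.0 / refresh_hz

def next_vsync_deadline(last_vsync_ms: float, refresh_hz: float) -> float:
    return last_vsync_ms + frame_interval_ms(refresh_hz)

assert round(frame_interval_ms(60.0), 1) == 16.7   # ~16.6-16.7 ms per frame
assert round(frame_interval_ms(120.0), 1) == 8.3
```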

Adaptive refresh rate techniques, such as variable refresh rate (VRR), allow the display to adjust its refresh rate to match the output of the graphics card, thereby eliminating screen tearing and stutter without the need for Vsync. While VRR enables the display to adapt to the update pace of the graphics processor, it is still optimal for the graphics processor to meet a target frame cadence to maintain smooth and responsive gameplay. In one embodiment, this frame cadence can be set by the user, or the user can select from a set of available frame cadences supported by the graphics processor.

FIG. 31 illustrates a timeline 3100 of end-to-end operation for frame rendering on a graphics processor. A critical input frame is a frame that corresponds to input provided by a user of a game application, such as a locally executed or cloud-based game application. Critical input 3102 received at time T1 propagates through the game application 3104. The game application 3104, or a rendering engine associated with the game application 3104, can generate GPU commands 3106 to be placed in a render queue. A GPU flush 3108 enqueues the commands associated with the critical input 3102 behind previously enqueued commands in the GPU render queue, which may include commands to render frames F0-F3. The GPU finishes rendering at T5 (GPU complete 3110) and presents frame F4 to the display at T7. When the display is timed to a vertical synchronization signal (V-sync signal 3112), display of the frame corresponding to the critical input 3102 is delayed until the V-sync event at T6. Thus, the T7-T1 latency 3120 between the critical input 3102 and the display 3114 of the critical input frame is determined by the rendering time of the critical input frame, as well as by the frames that were enqueued for rendering ahead of the critical input frame in the render queue. Note additionally that when a new frame is not ready to be presented in time for the next display update cycle (which can be indicated by the V-sync signal 3112), the previous frame is displayed again.

FIG. 32 illustrates a timeline 3200 in which neural frame generation is used to maintain a target frame cadence. When presentation of a new frame by the GPU is timed to the vertical synchronization (V-sync) signal 3112 via a wait-for-V-sync configuration, display of a frame that is not ready for presentation in time can be delayed until the next V-sync event. When a new frame is not ready to be presented, the previous frame is displayed again. For example, for frames in the render queue, frame F0 completes in time to be displayed during the next target display update interval (as determined based on the V-blank or V-sync signal 3112 interval or a predetermined target frame rate). Frame F1 completes at time T2, in time for the next target display update interval.

However, in one embodiment, a determination is made at time T5 that frame F2 will not complete in time for the next target display update interval. At this point, a frame generation request can be submitted to the queue of GPU/AI commands 3205 to perform neural frame generation that extrapolates frame A1 based on frames F0 and F1. The neural frame generation then executes in a deterministic amount of time and completes at time T6 (AI complete 3206), in time for display at time T7. Frame A1 can be generated based on frame F0, frame F1, and a target display timestamp that specifies the amount of time between when frame F1 completed and when the generated frame A1 will be displayed (e.g., T7-T2). Frame A1 can be generated by warping frame F1 using a predicted optical flow between frame F1 and the generated frame A1, or a direct pixel generation model can be used to extrapolate the frame data of frame A1 based on frames F0 and F1 and the optical flow between frames F0 and F1.
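
A minimal sketch of this extrapolation path is given below, assuming a time-aware flow model and a warp function (both hypothetical interfaces not defined in this document): the target timestamp is normalized against the spacing of the input frames before the flow toward the target time is predicted and applied.

```python
def extrapolate_frame(f0, f1, t0, t1, t_target, flow_model, warp):
    """Extrapolate a frame for display time t_target (> t1) from the two
    most recent rendered frames f0 (at t0) and f1 (at t1).

    flow_model(f0, f1, dt) -> flow field from f1 toward the target time,
    where dt expresses how far past f1 the target lies, in units of the
    input-frame spacing. warp(frame, flow) resamples a frame along a flow.
    """
    dt = (t_target - t1) / (t1 - t0)         # e.g., (T7 - T2), normalized
    flow_to_target = flow_model(f0, f1, dt)  # predicted optical flow
    return warp(f1, flow_to_target)          # warped, extrapolated frame
```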

In one embodiment, if frame F2 does in fact complete in time to be presented during the next target update interval, frame F2 can be displayed in place of frame A1. In one embodiment, the frame data of frame A1 may be discarded and go unused. In one embodiment, the in-progress frame F2 is preempted to enable an early start for frame F3, increasing the likelihood that frame F3 will complete before the next target display update interval. In one embodiment, the data generated for frame F2 can be used in combination with frame A1 and/or frame F1 to generate frame A2 for display during the next target display update interval. For example, given input that includes the completion timestamps of frames F1 and F2 and the target display time of frame A1, the time-aware machine learning model (e.g., unified neural network 3005) can then generate frame data that should be displayed at the time of the target display update interval at which frame A2 will be displayed. In one embodiment, frame A2 can be generated by warping frame F2 in time, based on frame A1, frame F2 (and optionally frame F1), and an estimate of how the scene will have evolved by the time frame A2 will be displayed. Additional frame generation can be performed to generate frame A3 for display while frame F3 is being rendered. The combination of generated frame A3 and rendered frame F3 can be used to generate frames A4 and A5 for display while frame F4 is being rendered, where frame F4 contains scene data responsive to the input received at time T1. Frame F4 can be displayed at the next target display update interval after it completes.

FIGS. 33A-33B illustrate multiple GPU engines with multiple associated command buffers that enable asynchronous render and compute operations. FIG. 33A shows a copy engine 3310, a graphics engine 3320, and a compute engine 3330 of a GPU 3360. FIG. 33B shows multithreaded input into multiple command queues of the GPU 3360. As shown in FIG. 33A, the copy engine 3310, the graphics engine 3320, and the compute engine 3330 can execute workloads independently and can also synchronize when the work of one engine depends on, or is related to, that of another engine. For example, the copy engine 3310 can execute commands to perform a copy-resource operation 3312 to copy resources that will be used by a draw operation 3324 executed by the graphics engine 3320. The graphics engine 3320 can execute a wait operation 3322 until it receives a signal 3314 from the copy engine 3310 indicating that the copy-resource operation 3312 is complete. After completing the draw operation 3324, the graphics engine 3320 can signal 3326 the compute engine 3330, which can execute a compute operation (such as an AI draw operation 3334) to produce an AI-generated frame based on one or more previous frames. When the AI draw operation 3334 completes, the compute engine 3330 can then signal 3336 the graphics engine 3320. In one embodiment, the graphics engine 3320 can continue executing draw operations 3328 to render additional frame data for an upcoming frame.

As shown in FIG. 33B, separate threads 3342A-3342D executed by a CPU can insert workloads into the various command queues associated with the GPU 3360. The threads 3342A-3342D can be, for example, threads of a graphics driver or of other CPU-executed software that manages or facilitates operation of the GPU 3360. Each thread 3342A-3342D can insert a command list (e.g., command list 3344) into any of a copy queue 3350, a render queue 3352, and a compute queue 3354. The command list 3344 can include one or more command buffers that contain the commands to be executed by the various engines. The copy queue 3350 is used to submit commands to the copy engine 3310. The render queue 3352 can include commands executed by the graphics engine 3320 and the copy engine 3310. The compute queue 3354 can include commands executed by the compute engine 3330, the graphics engine 3320, and the copy engine 3310.
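
The signal/wait dependency chain between the three engines can be modeled on the CPU with events, as in the hedged Python sketch below; the real mechanism would be hardware fences between GPU queues, and all names and payloads here are illustrative.

```python
import threading

copy_done = threading.Event()   # models signal 3314
draw_done = threading.Event()   # models signal 3326
ai_done = threading.Event()     # models signal 3336

def copy_engine():
    # copy-resource operation 3312: stage resources for the draw
    copy_done.set()

def graphics_engine():
    copy_done.wait()            # wait operation 3322
    # draw operation 3324 using the copied resources
    draw_done.set()
    # draw operation 3328 for the upcoming frame can proceed concurrently

def compute_engine():
    draw_done.wait()
    # AI draw operation 3334: generate a frame from previous frames
    ai_done.set()

for fn in (copy_engine, graphics_engine, compute_engine):
    threading.Thread(target=fn).start()
ai_done.wait()
```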

In one embodiment, one of the threads 3342A-3342D (e.g., CPU thread 3342A) can perform workload management operations that include workload execution tracking. The progress of workload execution on the various engines can be tracked based on, for example, performance and/or event monitoring circuitry associated with the copy engine 3310, the graphics engine 3320, and the compute engine 3330. When the thread's workload tracking logic determines that an in-progress draw command may not meet the next target display update interval, a command can be submitted asynchronously to generate an AI frame based on two or more previously rendered frames and the associated optical flow. The command to generate the AI frame can execute on the compute engine 3330 asynchronously with respect to the rendering operations executing on the graphics engine 3320, where the AI frame is generated based on a set of rendered frames, the optical flow between those frames, and the target timestamp at which the generated frame is expected to be displayed.

In one embodiment, the compute engine 3330 is configured to continuously generate new frames via an AI draw and/or frame generation model based on output from the graphics engine 3320. The render queue 3352 can receive command lists to render frames based on API commands submitted by the game engine of a graphics application, while the compute queue 3354 receives commands to execute the AI draw and/or frame generation model via the compute engine 3330, asynchronously with respect to the operation of the graphics engine 3320. In such embodiments, either rendered frames or AI-generated frames can be displayed to the screen, depending on a predetermined frame pacing. The predetermined frame pacing can, for example, be tied to the display refresh cycle, while accounting for motion deltas between frames based on real GPU timestamps rather than frame-based motion deltas. In one embodiment, timestamp-based generation can be used to continuously generate new frames at a frequency synchronized with a high-frequency adaptive refresh rate. AI frame generation can execute at a fixed ML execution cost with a deterministic execution time. AI frame generation can be configured to generate all frames for display based on a sliding window of source frames produced by the graphics engine 3320 in response to commands from the game application.

In one embodiment, at least a portion of the workload execution tracking and/or the asynchronous AI frame generation can be managed by GPU hardware, firmware, or a kernel program. For example, persistent program code configured for execution on the GPU (such as GPU firmware executing on a graphics microcontroller, or a GPGPU program executing on a dedicated GPU hardware thread) can be configured to automatically trigger execution of an AI draw routine based on data stored in a preconfigured memory location designed to store the most recently rendered frame, along with a history buffer containing one or more previous frames that have been warped and aligned based on optical flow data. In such embodiments, the most recent frame and the one or more previously rendered frames serve as a continuously updated source of data for the AI draw and/or frame generation model. The persistent program code can then use the timestamp-aware AI model to continuously generate frames based on the configured refresh cycle of the display to which the GPU is attached.

FIG. 34 illustrates a method 3400 of generating a time-aware machine learning model, according to an embodiment. The model can be generated by performing operations to train an interpolation/extrapolation frame generation model using frame data, optical flow, and timestamps associated with the frame data to generate interpolation weights for the model (3402). Further operations can be performed to train the interpolation/extrapolation frame generation model using frame data, optical flow, and timestamps associated with the frame data to generate extrapolation weights for the model (3404). The model can then be deployed as an interpolation/extrapolation frame generation model capable of interpolating or extrapolating to a target timestamp relative to the input frames (3406).
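
A sketch of the two-phase training in method 3400 is shown below using PyTorch; the model, the batch format, and the normalized-timestamp convention (0 < t < 1 for interpolation targets, t > 1 for extrapolation targets) are assumptions for illustration, not details taken from this document.

```python
import torch
import torch.nn.functional as F

def train_phase(model, optimizer, batches, mode):
    """One pass of method 3400: mode='interp' fits interpolation weights
    (3402) on targets between the input frames; mode='extrap' fits
    extrapolation weights (3404) on targets beyond the newest input frame."""
    for f0, f1, flow, target_frame, t in batches:
        assert (0.0 < t < 1.0) if mode == "interp" else (t > 1.0)
        t_in = torch.tensor([t], dtype=torch.float32)
        pred = model(f0, f1, flow, t_in)      # time-conditioned prediction
        loss = F.l1_loss(pred, target_frame)  # pixel reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```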

FIG. 35 illustrates a method 3500 of time-based frame generation via a time-aware machine learning model, according to an embodiment. A graphics processor can receive a workload to render frame data based on API commands to render a frame that are received from an application (3502). The workload can be received from a graphics driver associated with the application. The graphics processor can then process the workload at a command queue of a graphics engine (3504). The workload can be submitted to the command queue by the graphics driver based on the API commands received from the application. During rendering, workload management logic within the graphics processor or graphics driver can track the rendering progress of the submitted workload for the frame (3506). The workload management logic of the graphics processor can determine whether the in-progress frame will meet a target display update deadline (3507). If the frame will meet the deadline ("yes" at 3507), the graphics processor can continue executing the workload for the frame (3508) and then display the rendered frame based on the processed workload (3509). If the frame will not meet the deadline ("no" at 3507), the graphics processor can trigger neural frame generation via a compute engine of the graphics processor (3510). The graphics processor can then display the generated frame (3511).
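
The decision at 3507 reduces to comparing an estimated completion time against the display deadline; the sketch below is a hypothetical rendering of that control flow, where every helper callable is an assumption for illustration.

```python
def present_next_frame(frame_job, deadline_ms,
                       estimated_finish_ms, neural_generate, present):
    """Hypothetical control flow for steps 3507-3511 of method 3500."""
    if estimated_finish_ms(frame_job) <= deadline_ms:            # 3507: yes
        present(frame_job.finish())                              # 3508, 3509
    else:                                                        # 3507: no
        present(neural_generate(target_timestamp=deadline_ms))   # 3510, 3511
```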

FIG. 36 illustrates a method 3600 of time-based frame generation that is asynchronous with the rendering rate, according to an embodiment. A graphics processor can render frames via a graphics engine based on API commands (3602). The API commands, associated with a 3D API, can be received by a graphics driver and used to render frames of a 3D scene associated with an application, such as a 3D gaming application. The graphics processor can then generate optical flow data based on frame data that includes the currently rendered frame and at least one previously rendered frame (3604). The graphics processor can then store the frame data and the optical flow data to a predetermined location in device memory (3606). The predetermined location is at a memory address that is known to, and accessible by, both graphics and compute resources. Asynchronously with the rendering of new frames via the graphics engine, frame generation logic can read the frame data and optical flow data from device memory (3608). The frame generation logic can execute via a compute engine of the graphics processor. The frame generation logic can then generate a frame for display via the compute engine based on a timestamp corresponding to the next display update (3610). In this manner, a new frame can be generated for display at every frame update cycle, using a time-aware machine learning model, based on the most recently available set of frame data and optical flow data.
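
The handoff in method 3600 can be pictured as a producer/consumer pair sharing a fixed buffer location, sketched below with CPU threading primitives standing in for device memory known to both the graphics and compute engines; every name here is illustrative rather than drawn from this document.

```python
import threading

shared = {"frames": None, "flow": None}  # stands in for the fixed device
lock = threading.Lock()                  # memory location of step 3606

def render_loop(render_next, compute_flow):
    prev = None
    while True:
        cur = render_next()                    # 3602: render via graphics
        if prev is not None:
            flow = compute_flow(prev, cur)     # 3604: optical flow
            with lock:                         # 3606: publish latest data
                shared["frames"], shared["flow"] = (prev, cur), flow
        prev = cur

def generation_loop(model, present, next_display_timestamp):
    while True:
        with lock:                             # 3608: read latest data
            frames, flow = shared["frames"], shared["flow"]
        if frames is not None:                 # 3610: generate for display
            present(model(frames, flow, next_display_timestamp()))
```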

FIG. 37 illustrates a system 3700 that maintains consistent frame pacing via time-aware frame generation. The system 3700 includes the GPU 3360, which is coupled with a display 3720 via a display controller 3710. The display controller is configured to provide frame data to the display 3720 at a rate determined by, or proportional to, the refresh rate of the display 3720. New frames can be generated via the graphics engine 3320, based at least in part on shader programs executed via the graphics processing resources within compute blocks 2900A-2900N. To increase the frame rate and provide the appearance of smooth gameplay, additional frames can be interpolated or extrapolated using a neural frame generation model that executes via the matrix or tensor accelerators within the compute blocks 2900A-2900N (e.g., systolic arrays 2912A-2912B of FIG. 29). In a conventional GPU, when a new frame is not ready in time for the next update, the previously rendered frame is presented again.

The system 3700 is configured to attempt to present a new frame for display at every display update interval (up to a predetermined maximum refresh rate). In one embodiment, if rendered frame data 3702 is ready for display, the newly rendered frame is presented; otherwise, generated frame data 3708 can be displayed. For example, if a determination is made that a newly rendered frame will not be ready in time for the next display update, neural frame generation can be triggered to generate a new frame using the time-aware machine learning model described herein. In various embodiments, this determination can be made by CPU-executed workload management logic 3705A, by GPU-executed workload management logic 3705B, or by workload management logic 3705A-3705B executing on both the CPU and the GPU. In one embodiment, the GPU-executed workload management logic 3705B executes via the graphics microcontroller 1533 of FIG. 15B. The determination that a newly rendered frame will not be ready in time for the next display update can be made based on the estimated time to completion of the frame on the graphics engine 3320, as compared with the estimated completion time of time-aware frame generation.

Because the machine learning model used for frame generation is time-aware, a timestamp corresponding to the time at which the generated frame will be displayed is provided as an input to the frame generation process, allowing the machine learning model to evolve the optical flow data 3704 generated between a set of rendered frames so as to target a specific time either between the input frames from the rendered frame data 3702 or beyond the most recently rendered frame. The optical flow data 3704 can be generated using a media/OFA engine 3730, which performs media encode/decode operations, generates dense optical flow between input frames, or generates sparse motion vectors between input frames, where those sparse motion vectors can be upscaled using motion vector upscaling models that execute via the matrix or tensor accelerators within the compute blocks 2900A-2900N. A history buffer 3706, in which previous frames are warped and accumulated over time, can also be maintained to provide an additional sampling source for anti-aliasing, supersampling, or frame generation models that make use of temporally accumulated data.
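
The temporal accumulation into the history buffer 3706 can be sketched as an exponentially weighted blend of the flow-aligned history with the newest frame; the use of NumPy, the warp callable, and the blend factor below are assumptions for illustration, as the document does not specify the accumulation scheme.

```python
import numpy as np

def accumulate_history(history, new_frame, flow_to_new, warp, alpha=0.1):
    """Warp the accumulated history into the new frame's coordinates using
    optical flow, then blend in the new frame. The result provides extra
    temporally accumulated samples for anti-aliasing, supersampling, or
    frame generation."""
    if history is None:
        return new_frame.astype(np.float32)
    aligned = warp(history, flow_to_new)           # align history to frame
    return (1.0 - alpha) * aligned + alpha * new_frame
```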

In one embodiment, the system 3700 can be configured in a mode in which strict frame timing is maintained by presenting primarily or entirely from the generated frame data 3708. At predetermined intervals that can be synchronized or correlated with the update rate of the display 3720, the generated frame data 3708 is presented to the display 3720 via the display controller. In this mode, the time-aware machine learning model executed via the compute blocks 2900A-2900N can generate new frames by warping rendered frames according to optical flow data that has been evolved to the specific timestamp corresponding to the target display time of the generated frame.

Additional Exemplary Computing Devices

FIG. 38 is a block diagram of a computing device 3800 that includes a graphics processor 3804, according to an embodiment. Versions of the computing device 3800 can be, or be included within, a communication device such as a set-top box (e.g., an Internet-based cable television set-top box), a global positioning system (GPS)-based device, and the like. The computing device 3800 can also be, or be included within, a mobile computing device such as a cellular phone, smartphone, personal digital assistant (PDA), tablet computer, laptop computer, e-reader, smart television, television platform, wearable device (e.g., glasses, watches, bracelets, smartcards, jewelry, items of clothing, etc.), media player, and the like. For example, in one embodiment the computing device 3800 includes a mobile computing device employing an integrated circuit ("IC"), such as a system on a chip ("SoC" or "SOC"), that integrates various hardware and/or software components of the computing device 3800 on a single chip.

The computing device 3800 includes the graphics processor 3804. The graphics processor 3804 represents any graphics processor described herein. In one embodiment, the graphics processor 3804 includes a cache 3814, which can be a single cache or divided into multiple segments of cache memory, including but not limited to any number of L1, L2, L3, or L4 caches, render caches, depth caches, sampler caches, and/or shader unit caches. In one embodiment, the cache 3814 can be a last-level cache shared with the application processor 3806. In one embodiment, the computing device 3800 includes CXL logic 3812 to facilitate the sharing and transfer of data between the application processor 3806 and the graphics processor 3804. The computing device 3800 can also include hardware and software logic 3813 to enable scalable I/O virtualization (S-IOV) or single-root I/O virtualization (SR-IOV) to provide virtual instances of the graphics processor 3804 to software domains executed by the application processor 3806.

In one embodiment, the graphics processor 3804 includes a graphics microcontroller 3815 that implements control and scheduling logic for the graphics processor. The control and scheduling logic can be firmware executed by the graphics microcontroller 3815. The firmware can be loaded at boot by the graphics driver logic 3822. The firmware can also be programmed into an electrically erasable programmable read-only memory, or loaded from a flash memory device within the graphics microcontroller 3815. The firmware can enable a GPU OS 3816 that includes device management logic 3817, driver logic 3818, and a scheduler 3819. The GPU OS 3816 can also include a graphics memory manager 3820, which can supplement or replace the graphics memory manager 3821 within the graphics driver logic 3822.

The graphics processor 3804 also includes a GPGPU engine 3844 that includes one or more graphics engines, graphics processor cores, and other graphics execution resources as described herein. Such graphics execution resources can be presented in forms including, but not limited to, execution units, shader engines, fragment processors, vertex processors, graphics multiprocessors, streaming multiprocessors, graphics processor clusters, or any collection of computing resources suitable for the processing of graphics or image resources, or for performing general-purpose compute operations in a heterogeneous processing system that includes integrated or discrete graphics and/or parallel processing elements. The processing resources of the GPGPU engine 3844 can be included within multiple tiles of hardware logic connected to a substrate. The GPGPU engine 3844 can include GPU tiles 3845 that include graphics processing and execution resources, caches, samplers, and the like. The GPU tiles 3845 can also include local volatile memory, or can be coupled with one or more memory tiles.

The GPGPU engine 3844 can also include one or more special tiles 3846 that include, for example, a non-volatile memory tile 3856, a network processor tile 3857, and/or a general-purpose compute tile 3858. The GPGPU engine 3844 also includes a matrix multiplication accelerator 3860. The general-purpose compute tile 3858 can also include logic to accelerate matrix multiplication operations. The non-volatile memory tile 3856 can include non-volatile memory cells and controller logic. The controller logic of the non-volatile memory tile 3856 can be managed by one of the device management logic 3817 or the driver logic 3818. The network processor tile 3857 can include network processing resources coupled to a physical interface within the input/output (I/O) resources 3810 of the computing device 3800. The network processor tile 3857 can be managed by one or more of the device management logic 3817 or the driver logic 3818.

In one embodiment, the matrix multiplication accelerator 3860 is a modular, scalable sparse matrix multiplication accelerator. The matrix multiplication accelerator 3860 can include multiple processing paths, where each processing path includes multiple pipeline stages and each processing path can execute a separate instruction. In various embodiments, the matrix multiplication accelerator 3860 can have the architectural features of any one or more of the matrix multiplication accelerators described herein. For example, in one embodiment the matrix multiplication accelerator 3860 is a systolic array configurable to operate with a multiple of four logical stages (e.g., four, eight, twelve, sixteen, etc.). In one embodiment, the matrix multiplication accelerator 3860 includes one or more instances of a two-path matrix multiplication accelerator with a four-stage pipeline, or of a four-path matrix multiplication accelerator with a two-stage pipeline. In one embodiment, the matrix multiplication accelerator 3860 includes processing elements configured as a scalable sparse matrix multiplication accelerator. The matrix multiplication accelerator 3860 can be used to accelerate matrix operations executed via XMX extensions, or via another compute library that facilitates matrix compute operations. As an example, the matrix multiplication accelerator 3860 can be used to execute the operations associated with the convolution and upconvolution/transposed-convolution layers of the optical flow estimation, frame generation, and neural super-resolution networks described herein, which can be included in a time-based frame generator 3823 used by the computing device 3800. The time-based frame generator 3823 can include one or more time-aware neural networks as described herein, such as the unified neural network 3005 of FIGS. 30A-30B.

As illustrated, in one embodiment, and in addition to the graphics processor 3804, the computing device 3800 can further include any number and type of hardware components and/or software components, including but not limited to an application processor 3806, memory 3808, and input/output (I/O) sources 3810. The application processor 3806 can interact with a hardware graphics pipeline to share graphics pipeline functionality. Processed data is stored in buffers in the hardware graphics pipeline, and state information is stored in the memory 3808. The resulting data can be transferred to a display controller for output via a display device. The display device can be of various types, such as a cathode ray tube (CRT), thin film transistor (TFT), liquid crystal display (LCD), or organic light emitting diode (OLED) array, and can be configured to display information to a user via a graphical user interface.

The application processor 3806 can include one or more processors, such as the processor(s) 102 of FIG. 1, and can be the central processing unit (CPU) used, at least in part, to execute an operating system (OS) 3802 of the computing device 3800. The OS 3802 can serve as an interface between the hardware and/or physical resources of the computing device 3800 and one or more users. The OS 3802 can include driver logic for the various hardware devices in the computing device 3800. The driver logic can include graphics driver logic 3822, which can include a user-mode graphics driver and/or a kernel-mode graphics driver. The graphics driver logic can include a graphics memory manager 3821 to manage the virtual memory address space of the graphics processor 3804. The graphics memory manager 3821 can facilitate a unified virtual address space that is accessible by both the application processor 3806 and the graphics processor 3804.

It is contemplated that in some embodiments the graphics processor 3804 may exist as part of the application processor 3806 (such as part of a physical CPU package), in which case at least a portion of the memory 3808 may be shared by the application processor 3806 and the graphics processor 3804, although at least a portion of the memory 3808 may be exclusive to the graphics processor 3804, or the graphics processor 3804 may have a separate store of memory. The memory 3808 can also be shared with a discrete version of the graphics processor 3804 via the CXL logic 3812.

The memory 3808 can include a pre-allocated region of a buffer (e.g., a frame buffer); however, a person of ordinary skill in the art will understand that the embodiments are not so limited and that any memory accessible to the lower graphics pipeline can be used. The memory 3808 can include various forms of random access memory (RAM) (e.g., SDRAM, SRAM, etc.) holding applications that make use of the graphics processor 3804 to render a desktop or 3D graphics scene. A memory controller hub can access data in the memory 3808 and forward it to the graphics processor 3804 for graphics pipeline processing. The memory 3808 can be made available to other components within the computing device 3800. For example, any data (e.g., input graphics data) received from the various I/O sources 3810 of the computing device 3800 can be temporarily queued into the memory 3808 before being operated upon by one or more processors (e.g., the application processor 3806) in the implementation of a software program or application. Similarly, data that a software program determines should be sent from the computing device 3800 to an external entity through one of the computing system interfaces, or stored in an internal storage element, is often temporarily queued in the memory 3808 before it is transmitted or stored.

The I/O sources can include devices such as touchscreens, touch panels, touchpads, virtual or regular keyboards, virtual or regular mice, ports, connectors, network devices, and the like, and can attach via a platform controller hub. Additionally, the I/O sources 3810 can include one or more I/O devices (e.g., networking adapters) implemented to transfer data to and/or from the computing device 3800, or to provide large-scale non-volatile storage (e.g., SSD/HDD) within the computing device 3800. User input devices, including alphanumeric and other keys, can be used to communicate information and command selections to the graphics processor 3804. Another type of user input device is a cursor control, such as a mouse, trackball, touchscreen, touchpad, or cursor direction keys, used to communicate direction information and command selections to the GPU and to control cursor movement on the display device. The camera and microphone arrays of the computing device 3800 can be used to observe gestures, record audio and video, and receive and transmit visual and audio commands.

The I/O sources 3810 can include one or more network interfaces. The network interfaces can include associated network processing logic and/or be coupled with the network processor tile 3857. The one or more network interfaces can provide access to a LAN, a wide area network (WAN), a metropolitan area network (MAN), a personal area network (PAN), Bluetooth, a cloud network, a cellular or mobile network (e.g., 3rd Generation (3G), 4th Generation (4G), 5th Generation (5G), etc.), an intranet, the Internet, and the like. The network interface(s) can include, for example, a wireless network interface having one or more antennas. The network interface(s) can also include, for example, a wired network interface to communicate with remote devices via a network cable, which can be, for example, an Ethernet cable, a coaxial cable, a fiber optic cable, a serial cable, or a parallel cable.

The network interface(s) can provide access to a LAN, for example, by conforming to IEEE 802.11 standards, and/or the wireless network interface can provide access to a personal area network, for example, by conforming to Bluetooth standards. Other wireless network interfaces and/or protocols, including previous and subsequent versions of the standards, can also be supported. In addition to, or instead of, communication via wireless LAN standards, the network interface(s) can provide wireless communication using, for example, Time Division Multiple Access (TDMA) protocols, Global System for Mobile Communications (GSM) protocols, Code Division Multiple Access (CDMA) protocols, and/or any other type of wireless communication protocol.

It is to be appreciated that systems equipped with fewer or more components than the examples described above may be preferred for certain implementations. Therefore, the configuration of the computing devices described herein can vary from implementation to implementation depending on numerous factors, such as price constraints, performance requirements, technological improvements, or other circumstances. Examples include (without limitation) mobile devices, personal digital assistants, mobile computing devices, smartphones, cellular phones, handsets, one-way pagers, two-way pagers, messaging devices, computers, personal computers (PCs), desktop computers, laptop computers, notebook computers, handheld computers, tablet computers, servers, server arrays or server farms, web servers, network servers, Internet servers, workstations, microcomputers, mainframe computers, supercomputers, network appliances, web appliances, distributed computing systems, multiprocessor systems, processor-based systems, consumer electronics, programmable consumer electronics, televisions, digital televisions, set-top boxes, wireless access points, base stations, subscriber stations, mobile subscriber centers, radio network controllers, routers, hubs, gateways, bridges, switches, machines, or combinations thereof.

Embodiments can be provided, for example, as a computer program product that can include one or more machine-readable media having stored thereon machine-executable instructions that, when executed by one or more machines (such as a computer, a network of computers, or other electronic devices), can cause the one or more machines to perform operations in accordance with the embodiments described herein. Machine-readable media can include, but are not limited to, floppy diskettes, optical disks, CD-ROMs (compact disc read-only memories) and magneto-optical disks, ROMs, RAMs, EPROMs (erasable programmable read-only memories), EEPROMs (electrically erasable programmable read-only memories), magnetic or optical cards, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions.

Moreover, embodiments can be downloaded as a computer program product, wherein the program can be transferred from a remote computer (e.g., a server) to a requesting computer (e.g., a client) via a communication link (e.g., a modem and/or network connection) by way of one or more data signals embodied in and/or modulated by a carrier wave or other propagation medium.

Throughout this document, the term "user" may be referred to interchangeably as "viewer," "observer," "person," "individual," "end user," and/or the like. Note that throughout this document, terms like "graphics domain" may be referenced interchangeably with "graphics processing unit," "graphics processor," or simply "GPU," and similarly, "CPU domain" or "host domain" may be referenced interchangeably with "computer processing unit," "application processor," or simply "CPU."

It is to be noted that terms like "node," "computing node," "server," "server device," "cloud computer," "cloud server," "cloud server computer," "machine," "host machine," "device," "computing device," "computer," "computing system," and the like may be used interchangeably throughout this document. It is to be further noted that terms like "application," "software application," "program," "software program," "package," "software package," and the like may be used interchangeably throughout this document. Also, terms like "job," "input," "request," "message," and the like may be used interchangeably throughout this document.

It is contemplated that terms like "request," "query," "job," "work," "work item," and "workload" may be referenced interchangeably throughout this document. Similarly, an "application" or "agent" may refer to or include a computer program, software application, game, workstation application, etc., offered through an application programming interface (API), such as a free rendering API, such as Open Graphics Library (OpenGL®), Open Computing Language (OpenCL®), DirectX® 11, DirectX® 12, etc., where "dispatch" may be interchangeably referred to as "work unit" or "draw," and similarly, "application" may be interchangeably referred to as "workflow" or simply "agent." For example, a workload, such as that of a three-dimensional (3D) game, may include and issue any number and type of "frames," where each frame may represent an image (e.g., a sailboat, a human face). Further, each frame may include and offer any number and type of work units, where each work unit may represent a part (e.g., the mast of the sailboat, the forehead of the human face) of the image (e.g., the sailboat, the human face) represented by its corresponding frame. However, for the sake of consistency, each item may be referenced by a single term (e.g., "dispatch," "agent," etc.) throughout this document.

References herein to "one embodiment," "an embodiment," "example embodiment," etc. indicate that the embodiment described may include a particular feature, structure, or characteristic, but not every embodiment necessarily includes that particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to effect such a feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.

Described herein is a graphics processor that includes a memory interface, a processing cluster, and first circuitry. The graphics processing cluster includes multiple processing resources connected by a data interconnect. The graphics processor includes first circuitry that receives workload requests from an application, via programming interface calls, to render frames, and that processes those requests using a command queue of a graphics engine. The graphics processor also tracks the progress of the workload to determine whether a frame will meet its target display update deadline, and continues executing the workload if the frame will meet the deadline. If the frame will not meet the deadline, the graphics processor will instead request neural frame generation. The graphics processor then displays the rendered frame, or the generated frame, at the target display update deadline. The graphics processor includes second circuitry to perform neural frame generation of frames. The second circuitry includes a compute engine associated with a processing resource, and the processing resource includes a matrix accelerator to perform matrix multiplication operations on behalf of the compute engine. The second circuitry is configured to execute, via the compute engine, operations associated with a time-aware machine learning model to perform neural frame generation of a frame. The time-aware machine learning model is trained to estimate optical flow at a target timestamp based on multiple input frames, render timestamps associated with the multiple input frames, and the optical flow between the multiple input frames. The second circuitry can estimate the optical flow at the target timestamp using the time-aware machine learning model and warp a previously rendered frame based on the optical flow estimated at the target timestamp. The optical flow estimated at the target timestamp is an extrapolated optical flow, and the previously rendered frame is warped to extrapolate the generated frame.

One embodiment provides a method that includes receiving a workload to render frame data for a frame to be rendered based on programming interface calls received from an application, processing the workload submitted to a command queue of a graphics engine configured to render the frame data for the application, tracking the progress of the submitted workload for the frame, determining whether the frame will meet a target display update deadline, continuing execution of the workload to render the frame data in response to determining that the frame will meet the target display update deadline, requesting neural frame generation of the frame in response to determining that the frame will not meet the target display update deadline, and displaying the rendered or generated frame at the target display update deadline. The method additionally includes performing neural frame generation of the frame via a compute engine, which includes performing matrix multiplication operations via a matrix engine associated with the compute engine. Performing neural frame generation of the frame includes executing, via the compute engine, operations associated with a time-aware machine learning model. The time-aware machine learning model is trained to estimate optical flow at a target timestamp based on multiple input frames, render timestamps associated with the multiple input frames, and the optical flow between the multiple input frames. The method additionally includes estimating optical flow at the target timestamp via the time-aware machine learning model and warping a previously rendered frame based on the optical flow estimated at the target timestamp. The optical flow estimated at the target timestamp is an extrapolated optical flow, and the previously rendered frame is warped to extrapolate the generated frame.

One embodiment provides a graphics processing system that includes a memory device and a graphics processor associated with the memory device. The graphics processor includes a processing cluster having multiple processing resources coupled via a data interconnect. The data interconnect enables the multiple processing resources to cooperatively execute thread blocks of workloads submitted to the graphics processor. The graphics processor also includes first circuitry to process input data via a processing resource of the multiple processing resources. The first circuitry is configured to receive a workload to render frame data for a frame to be rendered based on programming interface calls received from an application, and to execute the workload submitted to the command queue of the graphics engine to render the frame data.

In response to a determination that workload execution for the frame will meet a target display update deadline, the first circuitry will continue executing the workload to render the frame data. In response to a determination that workload execution for the frame will not meet the target display update deadline, the first circuitry will instead display a frame created via neural frame generation. The graphics processing system includes second circuitry that tracks the progress of the submitted workload for the frame and determines whether the frame will meet the target display update deadline. The second circuitry will signal the first circuitry in response to a determination that the frame will not meet the target display update deadline and will then initiate a request for neural frame generation of the frame. The graphics processing system also includes third circuitry to perform the neural frame generation of the frame. The third circuitry includes a compute engine associated with the processing resource. The compute engine performs matrix multiplication operations via a matrix accelerator of the processing resource to execute operations associated with a time-aware machine learning model trained to estimate optical flow at a target timestamp. The time-aware machine learning model is trained to estimate optical flow at the target timestamp based on multiple input frames, render timestamps associated with the multiple input frames, and the optical flow between the multiple input frames. The third circuitry can estimate an extrapolated optical flow at the target timestamp and warp a previously rendered frame based on the optical flow estimated at the target timestamp to extrapolate the generated frame.

One embodiment provides a method of generating a frame for display via a graphics processor. The method includes rendering a frame via a graphics engine based on a programming interface command; generating optical flow data based on frame data that includes a currently rendered frame and a previously rendered frame; storing the frame data and the optical flow data to a predetermined location in a device memory of the graphics processor; and executing frame generation logic via a compute engine of the graphics processor to generate a frame for display based on a timestamp corresponding to the next display update. The method additionally includes rendering additional frame data asynchronously with execution of the frame generation logic, reading the most recently available set of optical flow data and frame data stored to the predetermined location in the device memory, and generating the frame for display based on the timestamp corresponding to the next display update.
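As a conceptual sketch of that asynchronous arrangement, the snippet below uses a Python lock-guarded slot as a stand-in for the predetermined device-memory location: the render side publishes each (frame, flow, timestamp) set, and the frame-generation side reads whichever set is most recently complete. The class and function names, and the use of host threading in place of graphics and compute queues, are illustrative assumptions.

import threading

class FrameSlot:
    """Stand-in for the predetermined location in device memory."""
    def __init__(self):
        self._lock = threading.Lock()
        self._latest = None  # (frame_data, optical_flow, render_timestamp)

    def publish(self, frame_data, optical_flow, timestamp):
        # Called by the render loop once a frame and its flow are ready.
        with self._lock:
            self._latest = (frame_data, optical_flow, timestamp)

    def latest(self):
        # Called by the frame-generation loop; returns the newest full set.
        with self._lock:
            return self._latest

def generate_for_display(slot, next_display_timestamp, generate):
    data = slot.latest()  # most recently available frame/flow set
    if data is None:
        return None       # nothing published yet
    frame, flow, t_render = data
    return generate(frame, flow, t_render, next_display_timestamp)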

In the various embodiments described above, unless specifically noted otherwise, disjunctive language such as the phrase "at least one of A, B, or C" is intended to be understood to mean A, B, or C, or any combination thereof (e.g., A, B, and/or C). Thus, disjunctive language is not intended to, and should not be understood to, imply that a given embodiment requires at least one of A, at least one of B, or at least one of C to each be present. Similarly, items listed in the form "at least one of A, B, or C" can mean (A), (B), (C), (A and B), (B and C), or (A, B, and C).

Certain aspects of the techniques provided herein include logic and associated operations described in the form of, or with respect to, an algorithm. It should be noted that such logic can be implemented in software, firmware, and/or hardware. When implemented in software, such logic can be downloaded to reside on, and be operated from, different platforms used by a variety of operating systems, and executed by a processor to perform the associated operations. Logic implemented in firmware can execute on a microcontroller or processor device described herein. When implemented in hardware, such logic can take the form of digital logic. Such digital logic can also be associated with analog circuitry.

In some embodiments, terms like "display screen" and "display surface" may be used interchangeably to refer to the visible portion of a display device, while the rest of the display device may be embedded in a computing device such as a smartphone or wearable device. It is contemplated and to be noted that embodiments are not limited to any particular computing device, software application, hardware component, display device, display screen or surface, protocol, standard, or the like. For example, embodiments may be applied to and used with any number and type of real-time applications on any number and type of computers, such as desktops, laptops, tablet computers, smartphones, head-mounted displays, other wearable devices, and/or the like. Further, rendering scenarios that use the efficient performance of this technique may range from simple cases such as desktop composition to complex cases such as 3D games and augmented reality applications.

The foregoing description and drawings are to be regarded in an illustrative rather than a restrictive sense. Those skilled in the art will appreciate that various modifications and changes may be made to the embodiments described herein without departing from the broader spirit and scope of the features set forth in the appended claims.

Claims (25)

1. A graphics processor, comprising:
a memory interface;
a processing cluster coupled with the memory interface, the processing cluster comprising a plurality of processing resources coupled via a data interconnect; and
a first circuit module for processing input data via a processing resource of the plurality of processing resources, the first circuit module for:
receiving a workload for rendering frame data for a frame to be rendered based on a programming interface call received from an application;
processing the workload submitted to a command queue of a graphics engine to render the frame data;
tracking progress of the submitted workload for the frame;
determining whether the frame will meet a target display update deadline;
continuing to execute the workload to render the frame data in response to a determination that the frame will meet the target display update deadline;
requesting neural frame generation of the frame in response to a determination that the frame will not meet the target display update deadline; and
displaying the rendered or generated frame at the target display update deadline.
2. The graphics processor of claim 1, comprising a second circuit module to perform neural frame generation of the frame, the second circuit module comprising a compute engine associated with the processing resource.
3. The graphics processor of claim 2, the processing resource comprising a matrix accelerator to perform matrix multiplication operations on behalf of the compute engine.
4. The graphics processor of claim 3, the second circuit module configured to perform, via the compute engine, operations associated with a time-aware machine learning model trained to estimate optical flow at a target timestamp, to perform the neural frame generation of the frame.
5. The graphics processor of claim 4, the time-aware machine learning model trained to estimate the optical flow at the target timestamp based on a plurality of input frames, render timestamps associated with the plurality of input frames, and optical flow between the plurality of input frames.
6. The graphics processor of claim 5, the second circuit module to:
estimate optical flow at the target timestamp; and
warp a previously rendered frame based on the optical flow estimated at the target timestamp.
7. The graphics processor of claim 6, wherein the optical flow estimated at the target timestamp is an extrapolated optical flow and the previously rendered frame is warped to extrapolate the generated frame.
8. A method, comprising:
receiving a workload for rendering frame data for a frame to be rendered based on a programming interface call received from an application;
processing the workload submitted to a command queue of a graphics engine configured to render the frame data for the application;
tracking progress of the submitted workload for the frame;
determining whether the frame will meet a target display update deadline;
continuing to execute the workload to render the frame data in response to determining that the frame will meet the target display update deadline;
requesting neural frame generation of the frame in response to determining that the frame will not meet the target display update deadline; and
displaying the rendered or generated frame at the target display update deadline.
9. The method of claim 8, comprising performing, via a compute engine, neural frame generation of the frame.
10. The method of claim 9, comprising performing a matrix multiplication operation via a matrix engine associated with the compute engine to perform the neural frame generation.
11. The method of claim 10, comprising performing, via the compute engine, operations associated with a time-aware machine learning model trained to estimate optical flow at a target timestamp, to perform the neural frame generation of the frame.
12. The method of claim 11, the time-aware machine learning model trained to estimate the optical flow at the target timestamp based on a plurality of input frames, render timestamps associated with the plurality of input frames, and optical flow between the plurality of input frames.
13. The method of claim 12, comprising:
estimating optical flow at the target timestamp; and
warping a previously rendered frame based on the optical flow estimated at the target timestamp.
14. The method of claim 13, wherein the optical flow estimated at the target timestamp is an extrapolated optical flow and the previously rendered frame is warped to extrapolate the generated frame.
15. A graphics processing system, comprising:
a memory device;
a graphics processor coupled with the memory device, the graphics processor comprising a processing cluster including a plurality of processing resources coupled via a data interconnect; and
a first circuit module for processing input data via a processing resource of the plurality of processing resources, the first circuit module for:
receiving a workload for rendering frame data for a frame to be rendered based on a programming interface call received from an application;
executing the workload submitted to a command queue of a graphics engine to render the frame data;
continuing to execute the workload to render the frame data in response to a determination that execution of the workload for the frame will meet a target display update deadline; and
displaying a frame created via neural frame generation in response to a determination that execution of the workload for the frame will not meet the target display update deadline.
16. The graphics processing system of claim 15, comprising a second circuit module to:
track progress of the submitted workload for the frame;
determine whether the frame will meet the target display update deadline; and
signal the first circuit module in response to a determination that the frame will not meet the target display update deadline and initiate a request for neural frame generation of the frame.
17. The graphics processing system of claim 16, comprising a third circuit module for performing neural frame generation of the frame, the third circuit module comprising a compute engine associated with the processing resource.
18. The graphics processing system of claim 17, the processing resource comprising a matrix accelerator to perform matrix multiplication operations on behalf of the compute engine.
19. The graphics processing system of claim 18, the third circuit module configured to perform, via the compute engine, operations associated with a time-aware machine learning model trained to estimate optical flow at a target timestamp, to perform the neural frame generation of the frame.
20. The graphics processing system of claim 19, the time-aware machine learning model trained to estimate the optical flow at the target timestamp based on a plurality of input frames, render timestamps associated with the plurality of input frames, and optical flow between the plurality of input frames.
21. The graphics processing system of claim 20, the third circuit module to:
estimate optical flow at the target timestamp; and
warp a previously rendered frame based on the optical flow estimated at the target timestamp.
22. The graphics processing system of claim 21, wherein the optical flow estimated at the target timestamp is an extrapolated optical flow and the previously rendered frame is warped to extrapolate the generated frame.
23. A method of generating a frame for display, comprising:
rendering, via a graphics engine of a graphics processor, a frame based on a programming interface command;
generating optical flow data based on frame data including a currently rendered frame and a previously rendered frame;
storing the frame data and the optical flow data to a predetermined location in a device memory of the graphics processor; and
executing, via a compute engine of the graphics processor, frame generation logic to generate a frame for display based on a timestamp corresponding to a next display update, wherein the frame generation logic uses a time-aware machine learning model to generate the frame for display based on a set of optical flow data and rendered frame data read from the predetermined location in the device memory.
24. The method of claim 23, comprising:
rendering, via the graphics engine, additional frame data based on the programming interface command asynchronously with execution of the frame generation logic via the compute engine;
reading the most recently available set of optical flow data and frame data stored to the predetermined location in the device memory; and
generating the frame for display based on the timestamp corresponding to the next display update.
25. A system comprising means for performing the method as in claim 23 or 24.
CN202311798194.XA 2023-03-16 2023-12-25 Time-based frame generation via time-aware machine learning model Pending CN118674603A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/490618 2023-03-16
US18/478233 2023-09-29
US18/478,233 US20240311950A1 (en) 2023-03-16 2023-09-29 Time based frame generation via a temporally aware machine learning model

Publications (1)

Publication Number Publication Date
CN118674603A (en)

Family

ID=92721759

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311798194.XA Pending CN118674603A (en) 2023-03-16 2023-12-25 Time-based frame generation via time-aware machine learning model

Country Status (1)

Country Link
CN (1) CN118674603A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330523A (en) * 2017-04-28 2021-02-05 Intel Corp Computational optimization of low-precision machine learning operations
CN112330523B (en) * 2017-04-28 2025-05-27 Intel Corp Computational optimization for low-precision machine learning operations
US12373911B2 (en) 2017-04-28 2025-07-29 Intel Corporation Compute optimizations for low precision machine learning operations

Similar Documents

Publication Publication Date Title
CN112905240B (en) Architecture for block-sparse operations on systolic arrays
EP4163797A1 (en) Modular gpu architecture for clients and servers
EP4432208A1 (en) Time based frame generation via a temporally aware machine learning model
EP4152162B1 (en) Immediate offset of load store and atomic instructions
CN117529751A (en) Supersampling of time domain amortization using kernel splash networks
US12174783B2 (en) Systolic array of arbitrary physical and logical depth
WO2022271227A1 (en) Dual pipeline parallel systolic array
CN116091332A (en) Augment motion vectors via procedural shader output
CN116091294A (en) Motion vector refinement for supersampling of time domain amortization
CN116136776A (en) Forward progress guarantees using single-level synchronization at individual thread granularity
EP4432209A1 (en) Dynamic gpu frame generation and read/write scaling using command buffer predication
EP4432241A1 (en) Optical flow mip adjustment for rendering and encoding
EP4428823A1 (en) Unified latency aware neural network for frame interpolation and prediction
US20240312113A1 (en) Uv space rendering and ai processing
EP4432240A1 (en) Methodology to enable highly responsive gameplay in cloud and client gaming
EP4478197A1 (en) Coarse and fine filtering for gpu hardware-based performance monitoring
US20240419447A1 (en) Configurable processing resource event filter for gpu hardware-based performance monitoring
CN118674603A (en) Time-based frame generation via time-aware machine learning model
US20240168807A1 (en) Cross-thread register sharing for matrix multiplication compute
EP4432239B1 (en) Preserving g-buffer and optical flow in uv space
EP4553649A1 (en) Multiple register allocation sizes for gpu hardware threads
US20240312028A1 (en) Optical flow generation using rasterized triangle id buffers
EP4498237A1 (en) 32-bit channel-aligned integer multiplication via multiple multipliers per-channel
CN118674604A (en) Keeping G-buffer and optical flow in UV space
CN118674842A (en) Hardware efficient neural frame prediction using low resolution optical flow

Legal Events

Date Code Title Description
PB01 Publication