CN111666106A

CN111666106A - Data offload acceleration from multiple remote chips

Info

Publication number: CN111666106A
Application number: CN202010137127.3A
Authority: CN
Inventors: N·罗伯森; E·托马斯; D·马西奥洛斯基; E·安格拉达
Original assignee: Hewlett Packard Development Co LP
Current assignee: Hewlett Packard Development Co LP
Priority date: 2019-03-07
Filing date: 2020-03-02
Publication date: 2020-09-15
Also published as: DE102020105896A1

Abstract

Embodiments of the present disclosure relate to data offload acceleration from multiple remote chips. A data offload accelerator offloads data from a plurality of remote chips to a processor. A specification of a plurality of addresses for retrieving data from the plurality of remote chips is received into an address buffer bank of the data offload accelerator. A command to initiate capture of data from the plurality of remote chips is received into an offload control device of the data offload accelerator. Data from the multiple remote chips is captured in parallel into a data buffer bank of the data offload accelerator, and the processor is interrupted via the offload control device to pass at least a portion of the data to the processor.

Description

Data offload acceleration from multiple remote chips

Background

Application-specific integrated circuit (ASIC) chips typically include sideband interface slave ports for device configuration, management, and runtime status and monitoring functions. The interface may be defined by a proprietary physical/logical protocol or by an industry standard such as Peripheral Component Interconnect Express (PCIe). A processor such as a microprocessor or baseboard management controller (BMC) can initiate transactions through this sideband ASIC interface to access and manipulate addressable resources in the ASIC, such as control and status registers (CSRs) and memory-mapped data structures.

Description of the drawings

Features of the present disclosure are illustrated by way of example, and not limitation, in the following figures, in which like numerals indicate like elements:

FIG. 1 depicts a system architecture within which a data offload accelerator for offloading data from multiple remote chips to a processor may be implemented, according to one or more examples of the present disclosure;

FIG. 2 depicts a system architecture within which a data offload accelerator for offloading data from a plurality of remote chips to a processor may be implemented, according to one or more examples of the present disclosure;

FIG. 3 depicts a system architecture within which a data offload accelerator for offloading data from a plurality of remote chips to a processor may be implemented, according to one or more examples of the present disclosure;

FIG. 4 depicts a system architecture within which a data offload accelerator for offloading data from a plurality of remote chips to a processor may be implemented, according to one or more examples of the present disclosure;

FIG. 5 depicts a system architecture within which a data offload accelerator for offloading data from a plurality of remote chips to a processor may be implemented, according to one or more examples of the present disclosure;

FIG. 6 depicts a system architecture within which an initialization accelerator for initializing a plurality of remote chips may be implemented, according to one or more examples of the present disclosure;

FIG. 7 depicts a system architecture within which a data offload accelerator for offloading data from a plurality of remote chips to a processor may be implemented, according to one or more examples of the present disclosure;

FIG. 8 depicts a flowchart of a method for offloading data from a plurality of remote chips to a processor, according to one or more examples of the present disclosure; and

FIG. 9 depicts a flowchart of another method for offloading data from multiple remote chips to a processor, according to one or more examples of the present disclosure.

Detailed description

A data collection solution, such as one used to collect telemetry, implements a loop in the BMC firmware that reads one CSR or memory-mapped data element at a time across one to several BMC-to-ASIC management communication interfaces. For the BMC to retrieve a single CSR, the firmware sets up the request, starts the transaction, and retrieves the data from the hardware when notified. During setup, the firmware formats the transaction and performs multiple separate writes to the hardware to prepare it for execution. The firmware then initiates the request with another hardware write. Once the hardware completes the transaction, an asynchronous hardware interrupt notifies the firmware of the completion status and that any response data is available. Finally, the firmware performs multiple individual reads of the hardware to retrieve the returned data. To collect the entire data set, this process is executed in a loop that requests each CSR or memory-mapped data element individually. This serialized process is slow because of the overhead (software, operating system, firmware, drivers, storage resources, etc.) involved in sequentially initiating each request and servicing each response.
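To make the per-element cost of this serialized loop concrete, the following Python sketch counts firmware-to-hardware interactions for each CSR read. It is a model, not BMC firmware, and the number of setup writes and readback reads per transaction (four and two by default) are hypothetical placeholders, since the text only says "multiple".

```python
from dataclasses import dataclass


@dataclass
class SerialCost:
    """Tally of firmware <-> hardware interactions in the serialized loop."""
    setup_writes: int = 0
    start_writes: int = 0
    interrupts: int = 0
    response_reads: int = 0

    @property
    def total(self) -> int:
        return (self.setup_writes + self.start_writes
                + self.interrupts + self.response_reads)


def serial_collect(addresses, setup_writes_per_txn=4, reads_per_response=2):
    """Model one pass of the per-element firmware loop over `addresses`.

    The default per-transaction counts are hypothetical placeholders.
    """
    cost = SerialCost()
    for _address in addresses:
        cost.setup_writes += setup_writes_per_txn  # format the transaction
        cost.start_writes += 1                     # initiate the request
        cost.interrupts += 1                       # completion interrupt
        cost.response_reads += reads_per_response  # retrieve the return data
    return cost


print(serial_collect(range(1000)).total)  # 8000
```

Under these assumptions, every element costs eight interactions including one interrupt, so the total grows linearly with the number of CSRs collected.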

This configuration is sufficient if the BMC has local access to the ASIC and can sustain the expected performance (bandwidth, latency, sustained versus burst, etc.) dictated by the system's high-level requirements. However, this configuration does not meet the performance expectations of some system topologies.

A system and method related to data offload acceleration, for example for a one-to-many BMC-to-ASIC interface topology, are disclosed. A hardware data offload accelerator, e.g., within an integrated circuit (IC) such as a field-programmable gate array (FPGA) or an ASIC, reads, stores, and transfers data, such as telemetry data, from multiple chips, e.g., multiple ASICs that each have multiple chiplets with similar addressable memory and CSR structures. The data offload accelerator offloads the data to a processor, such as the BMC, that is remote from the chips. As used herein, "offloading" data means sending or transferring data between one or more processors and one or more chips. To implement the offload, the data offload accelerator includes a set of registers that allows the processor to control the data offload, one or more address buffers, and one or more data buffers. The processor may specify remote memory addresses for offloading, such as CSR addresses or other memory-mapped locations, in the one or more address buffers. These may be non-contiguous addresses at arbitrary independent locations within the predefined address space of the ASIC attached to the accelerator.

In one example, the data offload accelerator includes "fast" and "slow" address buffers (and corresponding "fast" and "slow" data buffers). "Fast" means that data is offloaded to the processor at a relatively fast refresh rate (e.g., 100 Hz), whereas "slow" means that data, possibly a larger data set, is offloaded to the processor at a relatively slow refresh rate (e.g., 1 Hz). Once one or more address buffers have been initialized, the processor can initiate data capture with a single write to a register within the data offload accelerator. When the data offload accelerator observes that its "start" bit is set, it issues a chip read for each of the specified addresses. These reads can be performed in parallel across multiple chip interfaces. As each chip responds, the data offload accelerator fills the appropriate data buffers in parallel with the associated response data. After storing some or all of the response data, the data offload accelerator interrupts the processor, which then retrieves the stored data from the data offload accelerator.
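The fast/slow split can be pictured as two independent capture schedules. The sketch below, assuming integer-millisecond periods and the example rates of 100 Hz and 1 Hz, lists when the processor would issue its single start write for each buffer pair over a time window; the function name and event format are illustrative, not part of the disclosure.

```python
def capture_schedule(duration_ms: int, rates_hz: dict) -> list:
    """Return sorted (time_ms, buffer_name) start-write events.

    Each event stands for one register write that kicks off a full capture
    of the corresponding address buffer; periods are rounded down to whole ms.
    """
    events = []
    for name, hz in rates_hz.items():
        if not 0 < hz <= 1000:
            raise ValueError("rate must be between 1 and 1000 Hz for ms periods")
        period_ms = 1000 // hz
        for t in range(0, duration_ms, period_ms):
            events.append((t, name))
    events.sort()
    return events


schedule = capture_schedule(1000, {"fast": 100, "slow": 1})
fast_starts = sum(1 for _, name in schedule if name == "fast")
slow_starts = sum(1 for _, name in schedule if name == "slow")
print(fast_starts, slow_starts)  # 100 1
```

Over one second the processor issues 101 start writes in total, regardless of how many addresses each buffer holds.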

Example benefits include faster data collection from multiple chips, because data can be read from and written to the chips in parallel, in particular across many CSRs within, for example, tens or hundreds of chips. Speed also improves because transaction overhead is significantly reduced: a single write by the processor to the data offload accelerator replaces the overhead (software, OS, firmware, drivers, storage resources, etc.) of the processor initiating a request to each CSR and servicing each response. That single write triggers the data offload accelerator to issue multiple read requests in hardware in parallel and to save the responses in parallel into the appropriate response data buffers.
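A back-of-the-envelope model shows why both effects matter. The sketch assumes a fixed per-request software overhead and a fixed per-read link latency (30 us and 20 us in the example, purely illustrative numbers) and that the accelerator keeps one read outstanding per interface; it ignores contention and the time spent writing the response buffers.

```python
def serial_time_us(n_reads: int, sw_overhead_us: int, read_us: int) -> int:
    # Firmware loop: every element pays the full software overhead plus one read.
    return n_reads * (sw_overhead_us + read_us)


def offloaded_time_us(n_reads: int, sw_overhead_us: int, read_us: int,
                      n_interfaces: int) -> int:
    # One start write pays the software overhead once; reads then proceed in
    # rounds, one outstanding read per chiplet interface (ceiling division).
    rounds = (n_reads + n_interfaces - 1) // n_interfaces
    return sw_overhead_us + rounds * read_us


print(serial_time_us(1000, 30, 20))        # 50000
print(offloaded_time_us(1000, 30, 20, 8))  # 2530
```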

In a particular example, a data offload accelerator offloads large amounts of telemetry data from multiple chips within multiple remote ICs. For example, each chip within each remote IC may contain hundreds or thousands of CSRs (or other memory-mapped locations) from which telemetry data can be offloaded. In this example, when the processor issues a single write command, the data offload accelerator may issue a first plurality of read requests to multiple chips in parallel, retrieve a first plurality of data responses in parallel from a first plurality of specified addresses, and store the first plurality of data responses in parallel in one or more data buffers. The data offload accelerator can optimize the ordering and size of these read requests to increase the efficiency of data transfer over the IC interfaces. For example, if the data offload accelerator identifies a block of contiguous memory in the processor-defined address buffer, it can group the discrete addresses into a single read request whose transactional read response is sized to cover the multiple discrete addresses in the contiguous range. Additionally, the data offload accelerator can sort the address buffer contents to facilitate such spatial-locality optimizations. Without receiving additional commands from the processor or incurring additional overhead, the data offload accelerator can repeat the parallel capture (issuing another plurality of read requests to the chips in parallel, receiving another plurality of data responses in parallel from another plurality of specified addresses, and storing those data responses in parallel in the one or more data buffers) until data from all specified addresses has been captured. Thus, when a data offload accelerator offloads large amounts of remote data, such as telemetry, significant overhead can be saved.
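The request-coalescing optimization can be sketched as sorting the address buffer and merging runs of adjacent CSRs into burst reads. This sketch assumes 8-byte (64-bit) CSRs and a hypothetical maximum burst length of 16 words; the returned (start, word_count) pairs stand in for read requests and are not a format defined by the disclosure.

```python
def coalesce_reads(addresses, word_bytes=8, max_words=16):
    """Group addresses into (start_address, word_count) burst read requests.

    Addresses are sorted first so that adjacent CSRs become neighbours;
    a run is extended only while the next address is exactly one word
    past its end and the burst has not reached max_words.
    """
    requests = []
    for addr in sorted(addresses):
        if requests:
            start, count = requests[-1]
            if addr == start + count * word_bytes and count < max_words:
                requests[-1] = (start, count + 1)
                continue
        requests.append((addr, 1))
    return requests


print(coalesce_reads([0x18, 0x00, 0x08, 0x10, 0x100, 0x108]))
# [(0, 4), (256, 2)]
```

Six scattered entries collapse into two read requests. Because each burst response covers several address-buffer entries, an accelerator would also need to scatter the returned words back to the slots of the original addresses; that bookkeeping is omitted here.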

Turning now to the drawings, FIG. 1 depicts a system architecture 100, according to one or more examples of the present disclosure, in which a data offload accelerator 112 may be implemented for offloading data from multiple remote chips 1 through M in a plurality of ICs 106-1 through 106-N of the system architecture 100 to a processor 102. In this case, M and N are integers whose values depend on the design of the system architecture 100. M and N may be the same or different.

The system architecture 100 also includes a data offload accelerator device 104, which includes the data offload accelerator 112 and is coupled to the processor 102 and to chips 1 through M of the ICs 106-1 through 106-N. Chips 1 through M are "remote" from the processor 102 and the data offload accelerator device 104, meaning that chips 1 through M are at least included on a different chipset or die than the processor 102 and the data offload accelerator device 104. In a particular example, the ICs 106-1 through 106-N, each including "remote" chips 1 through M, are formed on a semiconductor substrate different from the one or more semiconductor substrates on which the processor 102 and the data offload accelerator device 104 are formed.

In one example, the processor 102 is a baseboard management controller (BMC). In another example, the processor 102 is a microprocessor. In another example, the system architecture 100 may include multiple processors. For example, the system architecture 100 may include a processor 130, to which the data offload accelerator 112 may also offload data from chips 1 through M within the ICs 106-1 through 106-N. In yet another example, the system architecture 100 may include multiple data offload accelerator devices 104 mounted on one or more printed circuit assemblies and coupled to one or more processors.

As shown, each of the ICs 106-1 through 106-N is an ASIC; accordingly, ICs 106-1 through 106-N are also labeled ASIC 1 through N in FIG. 1 (and FIGS. 2-6) and are referred to herein as ASICs 1 through N. ASICs 1 through N may have very-large-scale integration (VLSI) designs. For example, ASICs 1 through N may each have multiple chiplets 1 through M (as shown) and may be fabricated using stacked silicon interconnect (SSI) or other three-dimensional IC design techniques, in which multiple ASIC dies are embedded in a single substrate or IC package. As further shown, chiplets 1 through M each include one or more addressable CSRs; in one example, each CSR is a 64-bit register. Although only two ASICs are shown, the N ASICs may include more than two ASICs. Likewise, although only two chiplets per ASIC are shown, the M chiplets may include more than two chiplets. In another example, the system architecture 100 may include a single ASIC with chiplets 1 through M. In yet another example, the system architecture 100 may include ASICs 1 through N that each have a single chiplet.

In an example, the data offload accelerator 112 is used to offload telemetry data from the CSRs to the processor 102. In a particular example, ASICs 1 through N at least partially form a switched fabric network, and the processor 102 may monitor the telemetry data from the CSRs for certain conditions, so that responsive or corrective action can be taken to prevent unwanted problems within the switched fabric network.

The data offload accelerator device 104 is a hardware device with various hardware components, as described below. In one example, the data offload accelerator device 104 is an IC, for example an FPGA. Alternatively, the data offload accelerator device 104 is an ASIC. An example benefit of implementing the data offload accelerator device 104 as an FPGA is greater design flexibility, because the device can be reprogrammed as needed, e.g., in the field or in the factory. Another example benefit of an FPGA implementation is that the data offload accelerator algorithm it uses can be optimized and/or changed based on user requirements or hardware changes within the system architecture 100.

The data offload accelerator device 104 also includes an interface bridge 110 that couples the processor 102 to the data offload accelerator 112. The data offload accelerator device 104 further includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which couple the data offload accelerator 112 to chiplets 1 through M within ASICs 1 through N. Where the system architecture 100 includes multiple processors, such as processors 102 and 130, the interface bridge 110 may connect all of the processors to the data offload accelerator 112.

The interface bridge 110 may include any suitable interface, such as one that enables serial transfer of data between the data offload accelerator 112 and the processor 102. The interface within the interface bridge 110 may further allow commands (or requests), interrupts, or any other suitable messages or information to be passed between the processor 102 and the data offload accelerator 112. One example interface bridge 110 includes an industry-standard PCIe interface. Where the system architecture 100 includes multiple processors (e.g., processors 102 and 130), the interface bridge 110 may include multiple PCIe interfaces, each coupled to one of the processors and to the data offload accelerator 112. Alternatively, the interface bridge 110 may include a PCIe switch coupled to each processor and to the data offload accelerator 112.

In one example, interfaces 122, 124-1 through 124-N, and 126-1 through 126-N allow transaction-based communication between the data offload accelerator 112 and chiplets 1 through M of ASICs 1 through N. Transaction-based communications, or "transactions," may include, for example, configuration requests (e.g., during initialization or at some later time), memory reads and writes, and responses to requests, e.g., data such as telemetry data. As shown, interfaces 124-1 through 124-N are connected between interface 122 and chiplets 1 through M of ASIC 1, respectively. Similarly, interfaces 126-1 through 126-N are connected between interface 122 and chiplets 1 through M of ASIC N, respectively, to access one or more CSRs within the chiplets.

As further shown, interfaces 124-1 through 124-N and 126-1 through 126-N each include physical, link, and (higher) protocol layers. In addition, as shown, interface 122 serves as a protocol-layer interface between the data offload accelerator 112 and interfaces 124-1 through 124-N and 126-1 through 126-N. In one example, interface 122 includes transaction buffers and routers. In a particular example, interfaces 124-1 through 124-N and 126-1 through 126-N are each proprietary interfaces, and interface 122 includes multiple transaction first-in, first-out (FIFO) routers. In this example, if the interface bridge 110 includes PCIe, the data offload accelerator device 104 may act as a protocol bridge between the processor 102 and chiplets 1 through M. Alternatively, interfaces 124-1 through 124-N and 126-1 through 126-N are PCIe interfaces, and interface 122 includes multiple PCIe transaction FIFO routers.

This example arrangement of interfaces 122, 124-1 through 124-N, and 126-1 through 126-N enables parallel transfer of transactions (e.g., requests) from the data offload accelerator 112 to chiplets 1 through M of ASICs 1 through N. Likewise, it enables parallel transfer of transactions (e.g., responses to requests) from chiplets 1 through M of ASICs 1 through N to the data offload accelerator 112. For example, after receiving a single write command from the processor 102, the data offload accelerator 112 may send read requests in parallel to chiplets 1 through M in ASICs 1 through N, with one read request sent on each of interfaces 124-1 through 124-N and 126-1 through 126-N. In response to the read requests, chiplets 1 through M within ASICs 1 through N may send data responses to the data offload accelerator in parallel, with one data response sent on each of interfaces 124-1 through 124-N and 126-1 through 126-N.
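The one-request-per-interface behaviour can be modeled as per-interface queues drained in rounds. In this sketch the interface labels reuse the reference numerals of FIG. 1 purely as names, and each round corresponds to one request in flight on every interface that still has work; this lock-step grouping is a simplification, since real hardware need not wait for a whole round to finish.

```python
from collections import defaultdict, deque


def dispatch_rounds(requests):
    """Split (interface, address) pairs into rounds of parallel reads.

    Each round holds at most one address per interface, mirroring one read
    request outstanding on each chiplet link at a time.
    """
    queues = defaultdict(deque)
    for interface, address in requests:
        queues[interface].append(address)
    rounds = []
    while any(queues.values()):
        rounds.append({iface: q.popleft() for iface, q in queues.items() if q})
    return rounds


reqs = [("124-1", 0x00), ("124-1", 0x08), ("126-1", 0x00), ("124-2", 0x10)]
for i, rnd in enumerate(dispatch_rounds(reqs)):
    print(i, rnd)
# 0 {'124-1': 0, '126-1': 0, '124-2': 16}
# 1 {'124-1': 8}
```

Four reads finish in two rounds because three different links carry traffic at once; the busiest interface sets the number of rounds.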

The data offload accelerator 112 includes an offload control device 114, an address buffer bank with a single address buffer 116, a response data buffer bank with a single response data buffer 120, and FSM transaction (TXN) processing logic 118 (referred to herein as transaction logic). In the example shown, the address buffer 116 is implemented as block random access memory (BRAM), but it may be implemented using any suitable memory technology capable of storing multiple addresses. The address buffer 116 is coupled to the processor 102, e.g., via a BRAM interface (I/F), and to the transaction logic 118, e.g., using a hardware connection. More specifically, the processor 102 may specify addresses in the address buffer 116, such as CSR addresses, to access and retrieve data from one or more of the chips 106-1 through 106-N. In one example, the addresses are linear, or sequential. In this example, the processor 102 may indicate a starting address and an address range in the address buffer 116. Alternatively, the addresses may be non-linear, heterogeneous, non-contiguous addresses, which can provide greater flexibility than linear addresses. In this alternative example, the address buffer 116 may implement a list feature that allows the processor 102 to write a list of non-linear addresses into the address buffer 116.
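The two ways of programming the address buffer, a start address plus range versus an explicit list, can be sketched as follows. The class, its method names, the 8-byte stride, and the capacity check are assumptions for illustration, not a register interface described in the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class AddressBuffer:
    """Model of address buffer 116: a bounded list of remote CSR addresses."""
    capacity: int
    entries: list = field(default_factory=list)

    def program_linear(self, start: int, count: int, stride: int = 8) -> None:
        # Linear mode: a start address and a range expand to sequential CSRs.
        self._load([start + i * stride for i in range(count)])

    def program_list(self, addresses) -> None:
        # List mode: arbitrary, non-contiguous addresses written as-is.
        self._load(list(addresses))

    def _load(self, addresses) -> None:
        if len(addresses) > self.capacity:
            raise ValueError(f"{len(addresses)} addresses exceed capacity {self.capacity}")
        self.entries = addresses


buf = AddressBuffer(capacity=8)
buf.program_linear(0x1000, 3)
print([hex(a) for a in buf.entries])  # ['0x1000', '0x1008', '0x1010']
buf.program_list([0x40, 0x2000, 0x18])
print([hex(a) for a in buf.entries])  # ['0x40', '0x2000', '0x18']
```

Linear mode needs only two values from the processor, while list mode trades more buffer writes for the freedom to target scattered CSRs.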

In the example shown, the response data buffer 120 is implemented as BRAM, but it may be implemented using any suitable memory technology capable of storing multiple data responses. The response data buffer 120 is coupled to the processor 102, e.g., via a BRAM I/F, and to the transaction logic 118. The transaction logic 118 may forward the data it retrieves from chips 106-1 through 106-N to the response data buffer 120 in parallel. The parallel lines between the transaction logic (e.g., 118) and the one or more data buffers (e.g., 120) represent multiple hardware connections that enable parallel forwarding of information, including data responses, from chiplets 1 through M. In some examples, transactions are also sent to chiplets 1 through M to initialize the CSRs. As shown, the response data buffer 120 is a monolithic buffer. Alternatively, and as described with respect to one or more of the example data offload accelerators disclosed herein, the response data buffer bank may include multiple data buffers, e.g., multiple data buffer sets, each having multiple data buffers.
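Because responses arrive in parallel and not necessarily in request order, one simple way to model the response data buffer is as a slot array indexed by position in the address buffer. The slot-per-address layout and the fill counter below are assumptions; the text does not specify how the buffer is organized.

```python
class ResponseDataBuffer:
    """Slot i holds the response for address-buffer entry i (assumed layout)."""

    def __init__(self, n_slots: int):
        self.slots = [None] * n_slots
        self.filled = 0

    def store(self, slot: int, value: int) -> None:
        # Responses may land in any order; count each slot only once.
        if self.slots[slot] is None:
            self.filled += 1
        self.slots[slot] = value & 0xFFFF_FFFF_FFFF_FFFF  # keep a 64-bit CSR value

    def complete(self) -> bool:
        return self.filled == len(self.slots)


rdb = ResponseDataBuffer(3)
for slot, value in [(2, 0x30), (0, 0x10), (1, 0x20)]:  # out-of-order arrival
    rdb.store(slot, value)
print(rdb.slots, rdb.complete())  # [16, 32, 48] True
```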

The offload control device 114 is coupled to both the processor 102 and the transaction logic 118. The offload control device 114 may include any suitable circuitry that enables: the processor 102 to provide status indications (e.g., commands or other inputs) that control how the data offload accelerator 112 retrieves data from chiplets 1 through M; the transaction logic 118 to provide one or more status indications resulting from requesting, receiving, storing, and/or processing data from chiplets 1 through M; and the processor 102 to read one or more status indications from the data offload accelerator 112, such as those provided by the transaction logic 118.

The offload control device 114 includes multiple status indicator circuits, such as one or more bit registers, that enable different status indications. As shown, the offload control device 114 includes accelerator ready (Accel Ready), start, in progress (InProg), data available (DataAvail), error, address control (Addr Cnt), buffer offset/size, and performance counter (Cntrs) status indicator circuits. Depending, for example, on the information to be indicated by the processor 102 and the data offload accelerator 112, the offload control device 114 may have more or fewer status indicator circuits.
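One possible encoding of the single-bit indicators is a flag register. The bit positions and names below are hypothetical; the multi-bit fields (Addr Cnt, buffer offset/size, performance counters) would more plausibly live in separate registers and are not modeled.

```python
from enum import IntFlag


class OffloadStatus(IntFlag):
    """Hypothetical bit layout for the offload control device's flag bits."""
    ACCEL_READY = 1 << 0  # out of reset, chiplet links initialized
    START = 1 << 1        # written by the processor to begin a capture
    IN_PROG = 1 << 2      # capture under way
    DATA_AVAIL = 1 << 3   # results ready; accompanies the interrupt
    ERROR = 1 << 4        # e.g. read timeout or dropped link


def describe(raw: int) -> list:
    """Decode a raw register value into the names of the set flags."""
    status = OffloadStatus(raw)
    return [flag.name for flag in OffloadStatus if flag in status]


print(describe(0b01001))  # ['ACCEL_READY', 'DATA_AVAIL']
```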

The accelerator ready status indicator circuit enables the data offload accelerator 112 to indicate to the processor 102 that it is ready for work, for example, that it has come out of reset and that all links to chiplets 1 through M have been initialized. The start status indicator circuit enables the processor 102 to instruct the data offload accelerator 112 to begin processing the addresses specified in the address buffer 116. For example, using the start status indicator circuit, the processor 102 may issue a single write command to initiate data capture from chiplets 1 through M of one or more of ASICs 1 through N. Data capture, or capturing data, includes receiving data from a remote chip and storing that data. In one example, data capture includes sending a data request, receiving a data response that includes the data, and forwarding the data to one or more response data buffers. The in-progress status indicator circuit enables the transaction logic 118 to indicate to the processor 102 that retrieval of data from the addresses indicated in the address buffer 116 is in progress. The data available status indicator circuit enables the transaction logic 118 to indicate to the processor 102 that data is available for offloading. For example, the data available status indicator circuit enables the data offload accelerator 112 to interrupt the processor 102 in order to offload at least a portion of the retrieved data to the processor 102. The error status indicator circuit enables the transaction logic 118 to indicate to the processor 102 various errors that may occur, for example a read timeout or a connection to an ASIC being interrupted mid-transaction. The address control status indicator circuit enables the processor 102 to indicate to the transaction logic 118 how many addresses are specified in the address buffer 116. The buffer offset/size status indicator circuit enables the transaction logic 118 to indicate to the processor 102 where the retrieved data is located within the response data buffer 120. The buffer offset/size status indicator circuit also enables the data offload accelerator 112 to indicate to the processor 102 the size of the address buffer 116 and/or the response data buffer 120. The performance counter status indicator circuit enables the transaction logic 118 to report to the processor 102 various performance parameters, such as how long it took to retrieve and store the data.
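The processor-side sequence implied by these indicators (check ready, program the addresses and count, write start, wait for data available, read offset/size and data, acknowledge) can be sketched against a mock. MockAccelerator, its attribute names, its bit values, and the use of polling in place of the interrupt are all hypothetical; the mock completes the capture synchronously inside the start write.

```python
class MockAccelerator:
    """Hypothetical stand-in for the accelerator's registers and buffers.

    `remote` maps CSR address -> value and plays the role of the chiplets.
    """
    READY, START, IN_PROG, DATA_AVAIL, ERROR = 1, 2, 4, 8, 16

    def __init__(self, remote: dict):
        self.remote = remote
        self.status = self.READY
        self.addr_buf, self.addr_cnt = [], 0
        self.data_buf, self.buf_offset, self.buf_size = [], 0, 0

    def write_start(self) -> None:
        # The processor's single register write; the capture runs from here.
        self.status |= self.START | self.IN_PROG
        wanted = self.addr_buf[: self.addr_cnt]
        if any(a not in self.remote for a in wanted):
            self.status |= self.ERROR  # stands in for a read timeout
        else:
            self.data_buf = [self.remote[a] for a in wanted]
            self.buf_offset, self.buf_size = 0, len(self.data_buf)
            self.status |= self.DATA_AVAIL  # hardware would raise the interrupt here
        self.status &= ~(self.START | self.IN_PROG)


def collect(accel: MockAccelerator, addresses: list) -> list:
    """Processor-side sequence: program, start, wait for DataAvail, read back."""
    if not accel.status & accel.READY:
        raise RuntimeError("accelerator not ready")
    accel.addr_buf = list(addresses)
    accel.addr_cnt = len(addresses)  # Addr Cnt indicator
    accel.write_start()
    if accel.status & accel.ERROR:
        raise OSError("capture failed")
    if not accel.status & accel.DATA_AVAIL:
        raise RuntimeError("no data available")  # real firmware would sleep on the IRQ
    off, size = accel.buf_offset, accel.buf_size
    accel.status &= ~accel.DATA_AVAIL  # acknowledge
    return accel.data_buf[off: off + size]


accel = MockAccelerator({0x000: 11, 0x008: 22, 0x100: 33})
print(collect(accel, [0x100, 0x000]))  # [33, 11]
```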

Transaction logic 118 may be coupled to interface 122 to process transactions between data offload accelerator 112 and chiplets 1 through M of ASICs 1 through N. In a particular example, transaction logic 118 is implemented in hardware as a finite state machine (FSM) that performs various functions. For example, transaction logic 118 reads addresses from address buffer 116; constructs read requests in a format suitable for interfaces 122, 124-1 through 124-N, and 126-1 through 126-N and for chiplets 1 through M, and sends the read requests to those addresses; receives data in response to the read requests; and stores the data in the appropriate response data buffer, in this example response data buffer 120. Transaction logic 118 may also handle errors and provide status indications, for example as described above. In other examples, transaction logic 118 may perform other functions, such as comparing the retrieved data against one or more criteria and performing an action based on the comparison. In another example, transaction logic 118 may send read requests to at least some of the addresses specified in address buffer 116 in parallel, and may likewise receive data from multiple addresses in parallel.
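The read-request loop of transaction logic 118 can be illustrated as an explicit state machine. This is a hedged sketch: the state names, the `run_transactions` function, and the `read_fn` callback (standing in for interfaces 122 through 126-N) are assumptions, and for readability the model issues requests one at a time rather than in parallel.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    READ_ADDRESS = auto()
    SEND_REQUEST = auto()
    STORE = auto()
    DONE = auto()
    ERROR = auto()


def run_transactions(address_buffer, count, read_fn):
    """Step a toy FSM over the first `count` entries of address_buffer.

    read_fn(address) is a hypothetical stand-in for the chiplet interface:
    it returns the data word or raises TimeoutError.
    Returns (final_state, response_buffer, error).
    """
    response_buffer = []
    state, index, address, data, error = State.IDLE, 0, None, None, None
    while state not in (State.DONE, State.ERROR):
        if state is State.IDLE:
            state = State.READ_ADDRESS if count > 0 else State.DONE
        elif state is State.READ_ADDRESS:
            address = address_buffer[index]
            state = State.SEND_REQUEST
        elif state is State.SEND_REQUEST:
            try:
                data = read_fn(address)
                state = State.STORE
            except TimeoutError as exc:
                error = (address, exc)  # would set the error status indicator
                state = State.ERROR
        elif state is State.STORE:
            response_buffer.append(data)
            index += 1
            state = State.READ_ADDRESS if index < count else State.DONE
    return state, response_buffer, error
```

A hardware FSM would typically pipeline the SEND_REQUEST and STORE steps across many outstanding requests; the sequential form here only shows the order of operations for a single address.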

FIG. 2 depicts a system architecture 200 in which a data offload accelerator 212 may be implemented to offload data from multiple remote chiplets 1 through M of ASICs 1 through N of system architecture 200 to processor 102, in accordance with one or more examples of the present disclosure. System architecture 200 also includes a data offload accelerator device 204, which includes data offload accelerator 212 and is coupled both to processor 102 and to chiplets 1 through M of ASICs 1 through N. Processor 102 and ASICs 1 through N may be implemented similarly to those described above with reference to FIG. 1. Additionally, system architecture 200 may include multiple processors.

Data offload accelerator device 204 includes interface bridge 110, which may be implemented, and coupled to processor 102, similarly to the interface bridge described above with reference to FIG. 1. Data offload accelerator device 204 also includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which may be implemented, and coupled to chiplets 1 through M of ASICs 1 through N, similarly to those described above with reference to FIG. 1.

Data offload accelerator 212 is shown coupled to both interface bridge 110 and interface 122. As shown, data offload accelerator 212 includes: a plurality of accelerator registers 214; an address buffer bank that includes slow address buffer 208 and fast address buffer 202; a data buffer bank that includes a plurality of slow response data buffers 210, a first plurality of fast response data buffers 216 (FastA), and a second plurality of fast response data buffers 206 (FastB); and transaction logic 218. Transaction logic 218 may be implemented in hardware as a finite state machine and is coupled to interface 122, accelerator registers 214, slow address buffer 208, fast address buffer 202, slow response data buffers 210, and fast response data buffers 206 and 216.

Slow address buffer 208, fast address buffer 202, and each data buffer within slow response data buffers 210 and fast response data buffers 206 and 216 are shown as BRAM with a BRAM interface (BRAM I/F) for coupling the buffer to processor 102, but may be implemented as any suitable storage device. Transaction logic 218 may function similarly to transaction logic 118 described with reference to FIG. 1. However, rather than reading addresses from a single address buffer, transaction logic 218 may read addresses from multiple address buffers, for example "slow" address buffer 208 and "fast" address buffer 202. Additionally, rather than writing to a single response data buffer, transaction logic 218 may write in parallel to multiple response data buffers, for example "slow" response data buffers 210 and "fast" response data buffers 206 and 216.

As shown, accelerator registers 214 include accelerator ready, slow/fast start, slow/fast in-progress, slow/fast data available, slow/fast error, slow/fast address control, buffer offset/size, and performance counter status registers, which may operate similarly to the accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status indicator circuits of offload control device 114 described above with reference to FIG. 1. However, accelerator registers 214 provide indications for multiple address buffers, for example "slow" address buffer 208 and "fast" address buffer 202, rather than for a single address buffer, and for multiple response data buffers, for example "slow" response data buffers 210 and "fast" response data buffers 206 and 216, rather than for a single response data buffer.

Using data offload accelerator 212, processor 102 may specify addresses, such as CSR addresses, in both slow address buffer 208 and fast address buffer 202 in order to access, and retrieve data from, chiplets 1 through M of one or more of ASICs 1 through N. Transaction logic 218 reads the addresses from address buffers 202 and 208; constructs read requests and sends them to those addresses; receives data in response to the read requests; and forwards the data retrieved from chiplets 1 through M of ASICs 1 through N to response data buffers 206, 210, and 216 in parallel.

As shown, data offload accelerator 212 includes buffers of different performance classes or priorities, in this case "fast" buffers (e.g., 202, 206, and 216) and "slow" buffers (e.g., 208 and 210), which enable data to be offloaded to processor 102 at different rates. In one example, the "fast" buffers enable certain data to be retrieved and offloaded at a faster rate (e.g., 100 times faster) than the "slow" buffers. This example has two performance classes or priorities of buffers. However, depending on the data collection specification, there may be additional performance classes or priorities of buffers, for example based on the expected quality of results or on a particular CSR segmentation. For example, the performance classes or priorities may correspond to different data refresh rates, different bandwidths for the data captured by data offload accelerator 212 and/or passed to processor 102, different rates at which data is captured by data offload accelerator 212 and/or passed to processor 102, and so on. In a particular example, each performance class or priority is associated with its own address buffer and response data buffer.

Additionally, depending on the data collection specification, slow address buffer 208, fast address buffer 202, slow response data buffers 210, and fast response data buffers 206 and 216 may have different sizes. In one example, less data is captured at the faster rate than at the slower rate. Accordingly, fast address buffer 202 may be smaller than slow address buffer 208 because it stores fewer addresses, and fast response data buffers 206 and 216 may be smaller than slow response data buffers 210 because they store less data. In a particular example, transaction logic 218 may implement an arbitration scheme to determine how often fast response data buffers 206 and 216 are filled relative to slow response data buffers 210.
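The disclosure does not define the arbitration scheme. One simple possibility, assumed here purely for illustration, is a fixed-ratio arbiter that grants the fast class a set number of fills for every slow fill; with a ratio of 100 it would approximate the "100 times faster" example above. The `FixedRatioArbiter` name and its interface are hypothetical.

```python
class FixedRatioArbiter:
    """Grant the fast class `ratio` times for every grant to the slow class.

    A hypothetical stand-in for the arbitration scheme of transaction logic
    218; a real design might instead use weighted round-robin with credits
    or per-class timers.
    """

    def __init__(self, ratio):
        if ratio < 1:
            raise ValueError("ratio must be at least 1")
        self.ratio = ratio
        self.fast_grants = 0  # fast fills since the last slow fill

    def next_grant(self):
        """Return which buffer class ('fast' or 'slow') to fill next."""
        if self.fast_grants < self.ratio:
            self.fast_grants += 1
            return "fast"
        self.fast_grants = 0
        return "slow"
```

For example, `FixedRatioArbiter(3)` yields the grant sequence fast, fast, fast, slow, and then repeats.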

Another feature of data offload accelerator 212 is the use of multiple sets of fast response data buffers (e.g., 206 and 216). For example, data offload accelerator 212 may send requests to retrieve data to chips 106-1 through 106-N in parallel and may fill a given response data buffer with that data in parallel. Processor 102, however, may read data from a filled response data buffer serially. Consequently, offloading the data to processor 102 can take several times longer than capturing and writing it. In this case, using multiple fast response data buffers can support higher speeds. For example, while processor 102 reads data from FastA response data buffer 216, transaction logic 218 may collect data and fill FastB response data buffer 206. Similarly, while processor 102 reads data from FastB response data buffer 206, transaction logic 218 may collect data and fill FastA response data buffer 216. In another example, data offload accelerator 212 includes additional fast response data buffers, the number of which may depend, at least in part, on the relative speeds at which transaction logic 218 captures and fills the data and at which processor 102 reads it.
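The FastA/FastB alternation is a ping-pong (double-buffering) pattern. The sketch below models it in software under simplifying assumptions: the class and method names are hypothetical, fills and reads are single calls rather than concurrent activities, and there is no overrun protection if the accelerator fills both buffers before the processor reads either.

```python
class PingPongBuffers:
    """Two fast response buffers: one is filled while the other is drained."""

    def __init__(self):
        self.buffers = {"FastA": [], "FastB": []}
        self.fill_target = "FastA"  # buffer transaction logic writes next
        self.ready = None           # buffer the processor may read, if any

    def accelerator_fill(self, data):
        # Fill the current target, hand it to the processor, then swap so
        # the next capture goes into the other buffer.
        self.buffers[self.fill_target] = list(data)
        self.ready = self.fill_target
        self.fill_target = "FastB" if self.fill_target == "FastA" else "FastA"

    def processor_read(self):
        # Return (buffer name, contents), or None if nothing is ready.
        if self.ready is None:
            return None
        name, self.ready = self.ready, None
        return name, self.buffers[name]
```

Adding more buffers, as the text suggests, would turn the two-entry swap into a ring of N buffers sized to the ratio between capture speed and processor read speed.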

FIG. 3 depicts a system architecture 300 in which a data offload accelerator 312 may be implemented to offload data from multiple remote chiplets 1 through M of ASICs 1 through N of system architecture 300 to processor 102, in accordance with one or more examples of the present disclosure. System architecture 300 also includes a data offload accelerator device 304, which includes data offload accelerator 312 and is coupled both to processor 102 and to chiplets 1 through M of ASICs 1 through N. Processor 102 and ASICs 1 through N may be implemented similarly to those described above with reference to FIG. 1. Additionally, system architecture 300 may include multiple processors.

Data offload accelerator device 304 includes interface bridge 110, which may be implemented, and coupled to processor 102, similarly to the interface bridge described above with reference to FIG. 1. Data offload accelerator device 304 also includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which may be implemented, and coupled to chiplets 1 through M of ASICs 1 through N, similarly to those described above with reference to FIG. 1.

Data offload accelerator 312 is shown coupled to both interface bridge 110 and interface 122. As shown, data offload accelerator 312 includes: a plurality of accelerator registers 314; an address buffer bank that includes address buffer 302; a data buffer bank that includes a plurality of response data buffers 310; transaction logic 318; comparator circuit 306; and tag circuit 308. Transaction logic 318 may be implemented in hardware as a finite state machine and is coupled to interface 122, accelerator registers 314, address buffer 302, response data buffers 310, comparator circuit 306, and tag circuit 308.

Address buffer 302 and each data buffer within response data buffers 310 are shown as BRAM with a BRAM I/F for coupling the buffer to processor 102, but may be implemented using any suitable storage technology. Transaction logic 318 may function similarly to transaction logic 118 described with reference to FIG. 1. However, rather than writing to a single response data buffer, transaction logic 318 may write to multiple response data buffers 310 in parallel. Additionally, in the example shown, transaction logic 318 also provides input to comparator circuit 306 and provides indications based on the results of tag circuit 308.

As shown, accelerator registers 314 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers, which function similarly to the corresponding status indicator circuits of offload control device 114 described above with reference to FIG. 1. Accelerator registers 314 also include an address/data (A/D) first-in, first-out (FIFO) control (Cnt)/status register, which enables transaction logic 318 to provide one or more status indications to processor 102 based on the results of tag circuit 308. Furthermore, accelerator registers 314 provide indications for multiple response data buffers 310 rather than for a single response data buffer.

In an example scenario, the data from chiplets 1 through M of ASICs 1 through N may not change frequently. Therefore, to further improve efficiency, after transaction logic 318 first writes data to response data buffers 310 (e.g., data set n-1), it updates the data in response data buffers 310 only when that data changes, and the changed data can be tagged and passed to processor 102 instead of the complete data set. Comparator circuit 306 and tag circuit 308 enable this feature. In one example, tag circuit 308 is implemented as a set of "tag address/data FIFO" registers.

As shown, comparator circuit 306 implements a compare function, for example using combinational logic. In one example, when transaction logic 318 retrieves data set n, it may provide each data point (value) in data set n to comparator circuit 306. The compare function may then compare each data point in data set n with the corresponding data point in data set n-1 for a given address location in response data buffers 310. When the compare function outputs a difference to tag circuit 308, tag circuit 308 may store the difference and/or the actual data value as a "tag value" and store the corresponding address location in response data buffers 310 as a "tag address" in the tag address/data FIFO registers. In one example, the contents of the tag address/data FIFO registers of tag circuit 308 are accessible to processor 102. In a particular example, data offload accelerator 312 uses the A/D FIFO control/status register to indicate to processor 102 that the contents are available for transfer from the tag address/data FIFO registers.
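The compare-and-tag flow can be modeled in a few lines. This is an illustrative sketch only: `compare_and_tag` is a hypothetical name, dictionaries stand in for the response data buffer, a `deque` stands in for the tag address/data FIFO registers, and values are assumed to be integers so that a difference can be computed.

```python
from collections import deque


def compare_and_tag(previous, current, tag_fifo):
    """Compare data set n against data set n-1, address by address.

    previous: dict mapping response-buffer address -> value (data set n-1);
              updated in place, like the response data buffer.
    current:  dict mapping address -> value (data set n).
    tag_fifo: deque receiving (tag address, new value, difference) entries;
              difference is None when an address has no prior value.
    Returns the number of entries now waiting in the FIFO.
    """
    for address, value in current.items():
        old = previous.get(address)
        if old != value:
            diff = None if old is None else value - old
            tag_fifo.append((address, value, diff))
            previous[address] = value  # only changed locations are rewritten
    return len(tag_fifo)
```

The processor would then drain only the FIFO entries (for example with `tag_fifo.popleft()`) instead of rereading every address in the buffer.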

Thus, only a portion of the data, such as the tagged data value and/or the tagged difference between the data set n value and the data set n-1 value at a given address location, is offloaded to processor 102, rather than all of data set n and data set n-1. This speeds up reading of the data by processor 102, which can be important, for example, when data is to be read from thousands of CSRs. Furthermore, data offload accelerator 312 can monitor for significant changes in the data in its own hardware, which can be more efficient than first passing the data to processor 102 to perform the compare function.

FIG. 4 depicts a system architecture 400 in which a data offload accelerator 412 may be implemented to offload data from multiple remote chiplets 1 through M of ASICs 1 through N of system architecture 400 to processor 102, in accordance with one or more examples of the present disclosure. System architecture 400 also includes a data offload accelerator device 404, which includes data offload accelerator 412 and is coupled to processor 102 and to chiplets 1 through M of ASICs 1 through N. Processor 102 and ASICs 1 through N may be implemented similarly to those described above with reference to FIG. 1. Additionally, system architecture 400 may include multiple processors.

Data offload accelerator device 404 includes interface bridge 110, which may be implemented, and coupled to processor 102, similarly to the interface bridge described above with reference to FIG. 1. Data offload accelerator device 404 also includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which may be implemented, and coupled to chiplets 1 through M of ASICs 1 through N, similarly to those described above with reference to FIG. 1.

Data offload accelerator 412 is shown coupled to both interface bridge 110 and interface 122. As shown, data offload accelerator 412 includes: a plurality of accelerator registers 414; an address buffer bank that includes address buffer 402; a data buffer bank that includes a plurality of response data buffers 410; transaction logic 418; comparator circuit 406; and tag circuit 408. Transaction logic 418 may be implemented in hardware as a finite state machine and is coupled to interface 122, accelerator registers 414, address buffer 402, response data buffers 410, comparator circuit 406, and tag circuit 408.

Address buffer 402 and each data buffer within response data buffers 410 are shown as BRAM with a BRAM I/F for coupling the buffer to processor 102, but may be implemented using any suitable memory technology. Tag circuit 408 may function similarly to tag circuit 308 described with reference to FIG. 3. Transaction logic 418 may function similarly to transaction logic 118 described with reference to FIG. 1. However, rather than writing to a single response data buffer, transaction logic 418 may write to multiple response data buffers 410 in parallel. Additionally, in the example shown, transaction logic 418 also provides indications based on the results of tag circuit 408.

As shown, accelerator registers 414 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers, which function similarly to the corresponding status indicator circuits of offload control device 114 described above with reference to FIG. 1. Accelerator registers 414 also include an A/D FIFO control/status register, which enables transaction logic 418 to provide one or more status indications to processor 102 based on the results of tag circuit 408. Furthermore, accelerator registers 414 provide indications for multiple response data buffers 410 rather than for a single response data buffer.

In another example scenario, the compare and monitor functions of the FIG. 4 example may be more complex than those described with reference to FIG. 3. For example, comparator circuit 406 may include multiple data set criteria and pattern match registers, which allow data retrieved from chiplets 1 through M of one or more of ASICs 1 through N to be compared against one or more criteria. In one example, processor 102 programs the one or more criteria, against which the retrieved data is compared, into the data set criteria and pattern match registers of comparator circuit 406. As shown, the criteria that processor 102 may program include, but are not limited to, a bit range field, a pattern match type, and a pattern match value. The bit range field may indicate, for example, the range of bits of interest in a CSR to be compared. The pattern match type may indicate the type of comparison to perform, such as equal to, greater than, or less than. The pattern match value may indicate one or more values against which the retrieved data is compared using the pattern match type. In one example, comparator circuit 406 may monitor for certain conditions of interest based on defined telemetry response goals.
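One way to read the bit range field, pattern match type, and pattern match value together is as "extract a field from the CSR value, then compare it". The sketch below assumes an inclusive `lsb`..`msb` bit range and encodes the three match types as short strings; these encodings and the function names are hypothetical, since the disclosure does not give a register format.

```python
import operator

# Hypothetical encoding of the pattern match type register.
MATCH_OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def extract_bits(value, lsb, msb):
    """Return bits lsb..msb (inclusive) of a CSR value, shifted down to bit 0."""
    width = msb - lsb + 1
    return (value >> lsb) & ((1 << width) - 1)


def matches(value, criterion):
    """Apply one programmed criterion to a retrieved CSR value.

    criterion: dict with 'lsb' and 'msb' (bit range field), 'type'
    ('eq', 'gt' or 'lt', the pattern match type) and 'value'
    (the pattern match value).
    """
    field = extract_bits(value, criterion["lsb"], criterion["msb"])
    return MATCH_OPS[criterion["type"]](field, criterion["value"])
```

For example, a criterion of bits 4 through 7 equal to 0xF matches a CSR value of 0x1F0 and does not match 0x100.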

If the one or more (pattern match) criteria are met, tag circuit 408 may store a "tag value", such as the value of the data or some other value related to the data, and store the corresponding address location in response data buffers 410 as a "tag address" in the tag address/data FIFO registers. In one example, the contents of the tag address/data FIFO registers of tag circuit 408 are accessible to processor 102. In a particular example, data offload accelerator 412 uses the A/D FIFO control/status register to indicate to processor 102 that the contents are available for offloading from the tag address/data FIFO registers. Thus, only a portion of the data, such as tagged data values and/or other values related to the data value at a given address location, is offloaded to processor 102, rather than all of data set n and data set n-1.

FIG. 5 depicts a system architecture 500 in which a data offload accelerator 512 may be implemented to offload data from multiple remote chiplets 1 through M of ASICs 1 through N of system architecture 500 to processor 102, in accordance with one or more examples of the present disclosure. System architecture 500 also includes a data offload accelerator device 504, which includes data offload accelerator 512 and is coupled to processor 102 and to chiplets 1 through M of ASICs 1 through N. Processor 102 and ASICs 1 through N may be implemented similarly to those described above with reference to FIG. 1. Additionally, system architecture 500 may include multiple processors.

Data offload accelerator device 504 includes interface bridge 110, which may be implemented, and coupled to processor 102, similarly to the interface bridge described above with reference to FIG. 1. Data offload accelerator device 504 also includes interfaces 122, 124-1 through 124-N, and 126-1 through 126-N, which may be implemented, and coupled to chiplets 1 through M of ASICs 1 through N, similarly to those described above with reference to FIG. 1.

Data offload accelerator 512 is shown coupled to both interface bridge 110 and interface 122. As shown, data offload accelerator 512 includes: a plurality of accelerator registers 514; an address buffer bank that includes address buffer 502; a data buffer bank that includes a plurality of response data buffers 510; transaction logic 518; comparator circuit 506; and action circuit 508. Transaction logic 518 may be implemented in hardware as a finite state machine and is coupled to interface 122, accelerator registers 514, address buffer 502, response data buffers 510, comparator circuit 506, and action circuit 508.

Address buffer 502 and each data buffer within response data buffers 510 are shown as BRAM with a BRAM I/F for coupling the buffer to processor 102, but may be implemented using any suitable memory technology. Transaction logic 518 may function similarly to transaction logic 118 described with reference to FIG. 1. However, rather than writing to a single response data buffer, transaction logic 518 may write to multiple response data buffers 510 in parallel. Additionally, in the example shown, transaction logic 518 also provides indications based on the results of action circuit 508.

As shown, accelerator registers 514 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers, which function similarly to the corresponding status indicator circuits of offload control device 114 described above with reference to FIG. 1. Accelerator registers 514 also include a criteria/action CSR status register, which enables transaction logic 518 to provide one or more status indications to processor 102 based on the results of action circuit 508. Furthermore, accelerator registers 514 provide indications for multiple response data buffers 510 rather than for a single response data buffer.

In another example scenario, the compare and monitor functions of the FIG. 5 example may be more complex than those described with reference to FIG. 3. For example, comparator circuit 506 may include multiple data set criteria and pattern match registers, which allow data retrieved from chiplets 1 through M of one or more of ASICs 1 through N to be compared against one or more criteria. In one example, processor 102 programs the one or more criteria, against which the retrieved data is compared, into the data set criteria and pattern match registers of comparator circuit 506. As shown, the criteria that processor 102 may program include, but are not limited to, a bit range field, a pattern match type, a pattern match value, and a criterion class. The bit range field, pattern match type, and pattern match value registers may function similarly to those described with reference to FIG. 4. The criterion class register, however, adds a further criterion against which the retrieved data is compared. In one example, the criterion class is used to determine whether a retrieved data value comes from a CSR within a specified CSR class.
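The criterion class adds a filter on where a value came from before the value itself is tested. In the sketch below, CSR classes are assumed to be contiguous address ranges; the class names, address ranges, and the `matches_criterion` function are all hypothetical, since the disclosure does not say how CSR classes are defined.

```python
import operator

# Hypothetical CSR classes: class name -> half-open CSR address range.
CSR_CLASSES = {
    "link_status": range(0x1000, 0x1100),
    "error_counters": range(0x2000, 0x2100),
}
MATCH_OPS = {"eq": operator.eq, "gt": operator.gt, "lt": operator.lt}


def matches_criterion(address, value, criterion):
    """Apply a criterion that includes a criterion class.

    criterion: dict with 'class', 'lsb', 'msb', 'type' and 'value'.
    """
    # Criterion class: ignore values read from CSRs outside the class.
    if address not in CSR_CLASSES[criterion["class"]]:
        return False
    # Bit range field, pattern match type and pattern match value.
    width = criterion["msb"] - criterion["lsb"] + 1
    field = (value >> criterion["lsb"]) & ((1 << width) - 1)
    return MATCH_OPS[criterion["type"]](field, criterion["value"])
```

With a criterion such as "any error counter greater than zero", a nonzero value at 0x2004 matches, while the same value read from a link status CSR at 0x1004 is ignored.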

For example, where ASICs 1 to N form, or are included in, a switching network, the collected telemetry data can be processed to find or monitor specific performance signatures. When an abnormal signature is found, the data offload accelerator 512 can take action to recover from the condition while the switching network continues to operate. In this context, by way of non-limiting example, the data offload accelerator 512 may: look for conditions indicating that fabric paths are no longer operational; characterize new silicon and new switching paradigms; search for "black holes", i.e., ports that receive data but never deliver it to its destination; identify "brick walls", i.e., ports that never accept data; and so on.
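
One way such signatures could be detected from two successive telemetry snapshots is sketched below in Python. The per-port counters (`offered`, `accepted`, `delivered`) are hypothetical stand-ins for whatever CSR counters a real switch ASIC exposes; the point is only that a black hole shows accepted traffic with no deliveries, and a brick wall shows offered traffic with no acceptances.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class PortCounters:
    # Hypothetical per-port telemetry counters read from CSRs.
    offered: int    # frames presented to the port by its upstream link
    accepted: int   # frames the port accepted
    delivered: int  # frames the port forwarded toward their destination


def classify(before: PortCounters, after: PortCounters) -> str:
    d_offered = after.offered - before.offered
    d_accepted = after.accepted - before.accepted
    d_delivered = after.delivered - before.delivered
    if d_offered > 0 and d_accepted == 0:
        return "brick wall"   # traffic is offered but never accepted
    if d_accepted > 0 and d_delivered == 0:
        return "black hole"   # traffic is accepted but never delivered
    return "ok"


def scan(before: Dict[int, PortCounters], after: Dict[int, PortCounters]) -> Dict[int, str]:
    """Ports whose counters show an abnormal signature between two snapshots."""
    result = {}
    for port, now in after.items():
        verdict = classify(before[port], now)
        if verdict != "ok":
            result[port] = verdict
    return result
```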

If one or more criteria are met or a signature is found, then in this example scenario the action circuit 508 may take an action, such as a corrective action. For example, the action circuit 508 may take one or more actions within the response data buffer 510 directed at a chiplet, and in particular at a CSR within the chiplet. As shown, the action circuit 508 includes a plurality of acceleration action response registers, including action class, action type, action range, action status, and action performance (Perf.) counter registers, which give the action circuit flexibility in the actions it can take in response to one or more criteria being met. These actions can be delineated by the class, type, and range held in the corresponding registers. An indication of the action taken, and a result or other status related to that action, may be recorded in the action status register. Likewise, any performance metrics associated with the actions can be stored in the action performance counter register. Such actions may include, but are not limited to, writing a specific value to a CSR or to an address in the response data buffer 510, changing the routing table for a CSR to avoid black hole or brick wall network routing conditions, and the like.
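
A minimal Python model of the action registers might look like the following. It assumes two hypothetical action types (write a value, clear bits), represents the action range as a Python `range` of CSR addresses, and models the chiplet CSRs as a dictionary; the action status register becomes a log of `(address, old, new)` tuples and the performance counter counts CSR updates.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class ActionType(Enum):
    # Hypothetical action encodings.
    WRITE = 1        # write `value` to every CSR in the action range
    CLEAR_BITS = 2   # clear the bits of `value` in every CSR in the action range


@dataclass
class ActionRegisters:
    action_type: ActionType
    action_range: range   # CSR addresses the action applies to
    value: int
    status: List[Tuple[int, int, int]] = field(default_factory=list)  # (addr, old, new)
    perf_count: int = 0   # number of CSR updates performed


def take_action(criteria_met: bool, regs: ActionRegisters, csrs: Dict[int, int]) -> bool:
    """Apply the programmed action to a model of chiplet CSRs when the
    comparator reports that the criteria are met."""
    if not criteria_met:
        return False
    for addr in regs.action_range:
        old = csrs.get(addr, 0)
        if regs.action_type is ActionType.WRITE:
            new = regs.value
        else:
            new = old & ~regs.value
        csrs[addr] = new
        regs.status.append((addr, old, new))
        regs.perf_count += 1
    return True
```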

In an example, the processor 102 may access the contents of the acceleration action response registers of the action circuit 508. In a particular example, the data offload accelerator 512 uses the criterion/action CSR status register to indicate to the processor 102 that content can be viewed in, and transferred from, one or more action response registers of the action circuit 508. In yet another example, the data in the response data buffer 510 is also transferred to the processor 102. The action circuit 508 can thus carry out the action needed to resolve or mitigate a condition in the network faster than would be possible if all data from chiplets 1 to M of ASICs 1 to N first had to be serially offloaded to the processor 102 for analysis.

FIG. 6 depicts a system architecture 600 in which an initialization accelerator 612 may be implemented for initializing a plurality of remote chiplets 1 to M of ASICs 1 to N of the system architecture 600, in accordance with one or more examples of the present disclosure. The system architecture 600 also includes an initialization accelerator device 604, which includes the initialization accelerator 612 and is coupled to the processor 102 and to the plurality of chiplets 1 to M of ASICs 1 to N. The processor 102 and ASICs 1 to N may be implemented similarly to those described above with reference to FIG. 1. Additionally, the system architecture 600 may include multiple processors.

The initialization accelerator device 604 includes the interface bridge 110, which may be implemented and coupled to the processor 102 similarly to that described above with reference to FIG. 1. The initialization accelerator device 604 also includes interfaces 122, 124-1 to 124-N, and 126-1 to 126-N, which may be implemented similarly to those described above with reference to FIG. 1 and similarly coupled to chiplets 1 to M of ASICs 1 to N.

The initialization accelerator 612 is shown coupled to both the interface bridge 110 and the interface 122. As shown, the initialization accelerator 612 includes: a plurality of accelerator registers 614; an address buffer bank, which includes an address buffer 602; a data buffer bank, which includes a plurality of output data buffers 606; and transaction logic 618. The transaction logic 618 may be implemented in hardware as a finite state machine and is coupled to the interface 122, the accelerator registers 614, the address buffer 602, and the output data buffers 606.

As shown, the accelerator registers 614 include accelerator ready, start, in-progress, data available, error, address control, buffer offset/size, and performance counter status registers, which function similarly to the corresponding status indicator circuits of the offload control device 114 described above with reference to FIG. 1. In addition, the accelerator registers 614 can provide indications regarding the multiple output data buffers 606. In another example, the accelerator registers 614 can provide indications regarding a single output data buffer.

The address buffer 602 and each of the output data buffers 606 are illustrated as BRAM with a BRAM interface (I/F) coupling the buffer to the processor 102, but may be implemented in any suitable memory technology. The transaction logic 618 may function similarly to the transaction logic 118 described with reference to FIG. 1 to retrieve data from chips 106-1 to 106-N and store it. In this case, the output data buffers 606 act as response data buffers that store the retrieved data.

In some examples, however, the transaction logic 618 writes initialization data from the output data buffers 606 to chiplets 1 to M of one or more of ASICs 1 to N, e.g., to CSRs within chiplets 1 to M. The initialization data may be written in parallel over interfaces 122, 124-1 to 124-N, and 126-1 to 126-N. In an example, as part of initializing the chiplets, the processor 102 may specify multiple addresses in the address buffer 602 and fill the output data buffers 606 with the corresponding initialization data for those addresses. Then, upon receiving an indication from the processor 102, e.g., via the start register within the accelerator registers 614, the transaction logic 618 writes the initialization data from the output data buffers 606 to the chiplets at the stored addresses. The transaction logic 618 thus initializes the chiplets concurrently, using hardware to send write requests to the chiplets in parallel, which can greatly reduce initialization time compared with the processor 102 initializing each chiplet one at a time.
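
The pairing of the address buffer with the output data buffers, and the fan-out of writes across per-ASIC interfaces, can be sketched in software as follows. This is only an analogy: in the disclosure the parallelism comes from hardware transaction logic, whereas here a thread pool stands in for the interfaces 124-1 to 124-N, each ASIC is modeled as a dictionary of CSRs, and the `(asic_id, chiplet_id, csr_addr)` address encoding is hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Address = Tuple[int, int, int]  # hypothetical (asic_id, chiplet_id, csr_addr)


def parallel_init(address_buffer: List[Address],
                  output_buffer: List[int],
                  asics: Dict[int, Dict[Tuple[int, int], int]]) -> int:
    """Write output_buffer[i] to address_buffer[i], one worker per ASIC
    interface, and return the number of writes performed."""
    if len(address_buffer) != len(output_buffer):
        raise ValueError("address and output data buffers differ in length")

    # Group writes by target ASIC; each group models one interface 124-x.
    per_asic: Dict[int, List[Tuple[Tuple[int, int], int]]] = defaultdict(list)
    for (asic_id, chiplet_id, csr), value in zip(address_buffer, output_buffer):
        per_asic[asic_id].append(((chiplet_id, csr), value))

    def write_all(asic_id: int) -> int:
        csrs = asics[asic_id]          # each worker touches only its own ASIC
        for key, value in per_asic[asic_id]:
            csrs[key] = value
        return len(per_asic[asic_id])

    with ThreadPoolExecutor(max_workers=max(1, len(per_asic))) as pool:
        return sum(pool.map(write_all, list(per_asic)))
```

Grouping by ASIC keeps each worker's writes confined to one target, which mirrors the way independent interfaces can proceed without contending with one another.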

In another example, the transaction logic 618 may write other types of data to chiplets 1 to M of one or more of ASICs 1 to N. For example, the transaction logic may write configuration data to the chiplets at some time after initialization to change values within the CSRs.

FIG. 7 depicts a system architecture 700 within which a data offload accelerator 712 may be implemented for offloading data from a plurality of remote chips 1 to M of a plurality of ICs 730-1 to 730-N of the system architecture 700 to the processor 102, in accordance with one or more examples of the present disclosure. ICs 730-1 to 730-N are also referred to herein as ICs 1 to N. The system architecture 700 also includes a data offload accelerator device 704, which includes the data offload accelerator 712 and is coupled to both the processor 102 and the plurality of chips 106-1 to 106-N. The processor 102 and chips 1 to M may be any type of chip that includes memory-mapped address locations containing data to be offloaded to the processor 102. Additionally, the system architecture 700 may include multiple processors.

The data offload accelerator device 704 includes the interface bridge 110, which may be implemented and coupled to the processor 102 similarly to that described above with reference to FIG. 1. The data offload accelerator device 704 also includes interfaces 122, 124-1 to 124-N, and 126-1 to 126-N, which may be implemented similarly to those described above with reference to FIG. 1 and similarly coupled to chips 1 to M of ICs 1 to N.

The data offload accelerator 712 is shown coupled to both the interface bridge 110 and the interface 122. As shown, the data offload accelerator 712 includes transaction logic 718, which is coupled to the remaining components of the data offload accelerator 712. The transaction logic 718 may include at least some of the functionality described above with reference to FIGS. 1, 2, 3, 4, and 6. The data offload accelerator 712 also includes a plurality of accelerator registers 714 that may function similarly to the accelerator registers 114 described with reference to FIG. 1. The data offload accelerator 712 further includes an address buffer bank with a slow address buffer 708 and a fast address buffer 702, which may function similarly to the slow address buffer 208 and fast address buffer 202 described with reference to FIG. 2. The data offload accelerator 712 also includes a data buffer bank with a plurality of slow response data buffers 710 and pluralities of fast response data buffers 722 and 706, which function similarly to the slow response data buffer 210 and the fast response data buffers 216 and 206 described with reference to FIG. 2. The data offload accelerator 712 further includes a comparator circuit 716 and an action circuit 720, which may function similarly to the comparator circuit 406 and action circuit 408 described with reference to FIG. 4. In certain examples, the processor 102 may program the data offload accelerator to operate in different modes that trigger its various functions.

FIG. 8 depicts a flowchart of a method 800 for offloading data from multiple remote chips to a processor, in accordance with one or more examples of the present disclosure. The example method 800, or portions thereof, may be performed by the example data offload accelerators 112, 212, 312, 412, 512, 612, and 712 shown in FIGS. 1 to 7, respectively. By way of example, however, method 800 is described with reference to the data offload accelerator 112 of FIG. 1. According to the example method 800, a specification of a plurality of addresses for retrieving data from a plurality of remote chiplets 1 to M of one or more of ASICs 1 to N is received (802) into an address buffer bank, in this case the address buffer 116. A command to initiate offloading of the data is received (804) into the offload control device 114. The transaction logic 118 captures (806) the data in parallel into the data buffer bank, in this case the response data buffer 120. Once the response data buffer 120 is at least partially or completely filled with data, the transaction logic 118 interrupts (808) the processor 102 to transfer at least a portion of the data.
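
Steps 802 to 808 can be traced end to end with the following Python model. It is a software analogy under stated assumptions, not the hardware design: chips are dictionaries of CSR values, addresses use a hypothetical `(chip_id, csr_addr)` encoding, a thread pool stands in for parallel capture, and a callback stands in for the processor interrupt.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Tuple

Address = Tuple[int, int]  # hypothetical (chip_id, csr_addr)


class OffloadAcceleratorModel:
    def __init__(self, chips: Dict[int, Dict[int, int]],
                 interrupt: Callable[[List[int]], None]) -> None:
        self.chips = chips              # chip_id -> {csr_addr: value}
        self.interrupt = interrupt      # stands in for the processor interrupt
        self.address_buffer: List[Address] = []
        self.response_buffer: List[int] = []

    def load_addresses(self, addresses: Iterable[Address]) -> None:
        """Step 802: the processor fills the address buffer."""
        self.address_buffer = list(addresses)

    def start(self) -> None:
        """Step 804: start command; runs steps 806 and 808."""
        def read(addr: Address) -> int:
            chip_id, csr = addr
            return self.chips[chip_id][csr]

        # Step 806: issue all reads concurrently; map() keeps address order.
        with ThreadPoolExecutor() as pool:
            self.response_buffer = list(pool.map(read, self.address_buffer))
        # Step 808: interrupt the processor to hand over the captured data.
        self.interrupt(self.response_buffer)
```

Usage: after `load_addresses([(0, 0x10), (1, 0x14)])` and `start()`, the callback receives the captured values in address-buffer order.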

In one example, the transaction logic 118 retrieves data based on one or more policies programmed into the transaction logic 118. In one example, the transaction logic 118 retrieves data from some, but not all, of ASICs 1 to N. In another example, the transaction logic 118 retrieves data from some, but not all, of chips 1 to M in one or more of ASICs 1 to N. In another example, the transaction logic 118 retrieves data from some, but not all, of the CSRs within one or more of chips 1 to M in one or more of ASICs 1 to N. In another example, the transaction logic 118 retrieves data of different sizes, e.g., different numbers of bits or bytes. In another example, each transaction may retrieve data of a different size.
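
The retrieval policies above amount to filtering the set of read requests by ASIC, chip, and CSR, with a per-transaction size. A minimal Python sketch, in which the `Request` and `Policy` types are hypothetical illustrations rather than anything defined in the disclosure:

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Request:
    asic: int
    chip: int
    csr: int
    size: int   # bytes to read in this transaction (may differ per request)


@dataclass(frozen=True)
class Policy:
    asics: Optional[FrozenSet[int]] = None   # None = all ASICs
    chips: Optional[FrozenSet[int]] = None   # None = all chips
    csrs: Optional[FrozenSet[int]] = None    # None = all CSRs

    def allows(self, r: Request) -> bool:
        return ((self.asics is None or r.asic in self.asics) and
                (self.chips is None or r.chip in self.chips) and
                (self.csrs is None or r.csr in self.csrs))


def select(requests: List[Request], policy: Policy) -> List[Request]:
    """Keep only the requests the programmed policy permits."""
    return [r for r in requests if policy.allows(r)]
```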

FIG. 9 depicts a flowchart of a method 900 for offloading data from multiple remote chips to a processor, in accordance with one or more examples of the present disclosure. The example method 900, or portions thereof, may be performed by the example data offload accelerators 112, 212, 312, 412, 512, 612, and 712 and by the example ASIC initialization accelerator 612 shown in FIGS. 1 to 7. For illustration, however, method 900 is described with reference to the data offload accelerator 712 of FIG. 7.

According to the example method 900, a specification of a plurality of addresses for providing data to, or retrieving data from, a plurality of remote chips 1 to M in one or more of ICs 1 to N is received into the address buffers 702 and 708 of the address buffer bank. In particular, a first portion of the memory addresses is received (902) in a first address buffer, e.g., the fast address buffer 702, and a second portion of the memory addresses is received (904) in a second address buffer, e.g., the slow address buffer 708. Depending on the configuration of the data offload accelerator (e.g., where it is configured similarly to the initialization accelerator 612 of FIG. 6), the transaction logic 718 may initialize (906) the plurality of remote chips 1 to M in one or more of ICs 1 to N in parallel at the specified addresses. In one example, the response data buffer bank contains initialization data, which the transaction logic 718 may write to remote chips 1 to M of one or more of ICs 1 to N to initialize or configure the chips, e.g., CSRs within chips 1 to M. In another example, the processor uses the accelerator registers to trigger the transaction logic 718 to initialize the CSRs.

A command to initiate offloading of the data is received (908) into the offload control device, in this case the accelerator registers 714. The transaction logic 718 captures (910) a first portion of the data into a first set of data buffers of the data buffer bank and captures (912) a second portion of the data into a second set of data buffers of the data buffer bank. For example, the transaction logic 718 requests and receives data from the remote chips 106-1 to 106-N in parallel. The transaction logic 718 then forwards, or writes, the first portion of the data in parallel into the fast response data buffers 706 and 722 of the data buffer bank, and writes the second portion of the data in parallel into the slow response data buffers 710. In one example, once one or more of the response data buffers 706, 722, or 710 are at least partially or completely filled with data, the transaction logic 718 interrupts (918) the processor 102 to offload at least a portion of the data. For example, once one of the fast response data buffers, e.g., 722, is full, the transaction logic 718 interrupts the processor 102 to offload the data from that buffer while the transaction logic 718 fills the other fast response data buffer 706.
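
The alternation between the two fast response data buffers is a classic ping-pong (double) buffering scheme. The Python sketch below is a sequential model under simplifying assumptions: the buffer depth is a hypothetical parameter, the interrupt is a callback, and the processor is assumed to finish draining a buffer before the callback returns, whereas in hardware the drain would overlap with capture into the other buffer.

```python
from typing import Callable, List


class PingPongCapture:
    """Two fast response buffers: when the active one fills, the processor
    is interrupted to drain it while capture continues into the other."""

    def __init__(self, depth: int, interrupt: Callable[[int, List[int]], None]) -> None:
        self.buffers: List[List[int]] = [[], []]
        self.depth = depth            # hypothetical buffer capacity in words
        self.active = 0               # index of the buffer being filled
        self.interrupt = interrupt    # called with (buffer index, contents)

    def capture(self, word: int) -> None:
        buf = self.buffers[self.active]
        buf.append(word)
        if len(buf) == self.depth:
            full = self.active
            self.active ^= 1                  # switch filling to the other buffer
            self.interrupt(full, list(buf))   # processor drains the full buffer
            buf.clear()                       # model: drain completes on return
```

Capturing the words 0 to 4 with a depth of 2 produces interrupts for buffer 0 with `[0, 1]` and buffer 1 with `[2, 3]`, leaving `[4]` in buffer 0.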

Where, as shown in FIG. 7, the data offload accelerator includes a comparator circuit (e.g., 716) and an action circuit (e.g., 720), the comparator circuit 716 may compare (914) the captured data with subsequently retrieved data or with one or more pattern matching criteria. At block 916, depending on how it is configured, the action circuit 720 may flag subsequent data that differs from the captured data, or may take an action based on whether the data satisfies one or more criteria. In another example, only the portion of the data that satisfies the one or more criteria or that has been flagged, rather than all of the data, is passed (918) to the processor 102. In another example, when at least some of the data satisfies one or more criteria, the circuit 720 (e.g., if configured similarly to the action circuit 508) takes corrective action (916).
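
The compare-and-flag path (blocks 914 to 918) can be sketched as a comparison of two snapshots keyed by CSR address. In this minimal Python model, which assumes snapshots are dictionaries from address to value, the returned `changed` dictionary is the reduced subset that would be passed to the processor instead of the full data set.

```python
from typing import Dict, Tuple


def compare_snapshots(captured: Dict[int, int],
                      subsequent: Dict[int, int]) -> Tuple[Dict[int, bool], Dict[int, int]]:
    """Return (flags, changed): flags[addr] is True where the subsequent
    value differs from the captured one (or the address is new); changed
    holds only those values, i.e. the subset passed to the processor."""
    flags = {addr: subsequent[addr] != captured.get(addr) for addr in subsequent}
    changed = {addr: subsequent[addr] for addr, differs in flags.items() if differs}
    return flags, changed
```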

For simplicity and illustrative purposes, the present disclosure is described primarily by reference to examples thereof. In the above description, numerous specific details are set forth to provide a thorough understanding of the present disclosure. It will be apparent, however, that the present disclosure may be practiced without limitation to these specific details. For example, the examples illustrate practicing the present disclosure with different hardware configurations and hardware combinations of data offload accelerators. The present disclosure may, however, be practiced using other combinations and configurations of data offload accelerators, or different configurations of ASIC initialization accelerators, not described herein. Additionally, some of the depicted elements may be removed and/or modified without departing from the scope of the described examples. In other instances, some methods and structures have not been described in detail so as not to unnecessarily obscure the present disclosure. As used herein, the term "including" means including but not limited to, and the term "comprising" means comprising but not limited to. Additionally, the term "having" means having but not limited to, and the term "containing" means containing but not limited to. As used herein, the term "about", when applied to a value, generally means within the tolerance range of the equipment used to produce the value or, in some examples, plus or minus 10%, plus or minus 5%, or plus or minus 1%, unless otherwise expressly specified.

Claims

1. A method for offloading data from a plurality of remote chips to a processor, the method comprising:

receiving a specification of a plurality of addresses for retrieving the data from the plurality of remote chips into an address buffer bank of a data offload accelerator;

receiving a command for initiating capture of the data from the plurality of remote chips into an offload control device of the data offload accelerator;

capturing the data from the plurality of remote chips in parallel into a data buffer bank of the data offload accelerator; and

interrupting the processor via the offload control device to pass at least a portion of the data to the processor.

2. The method of claim 1, wherein capturing the data from the plurality of remote chips comprises capturing telemetry data from a plurality of control and status registers within each remote chip of the plurality of remote chips.

3. The method of claim 1, comprising:

receiving, into the address buffer bank, a specification of a first portion of the plurality of addresses for retrieving a first portion of the data from the plurality of remote chips;

receiving, into the address buffer bank, a specification of a second portion of the plurality of addresses for retrieving a second portion of the data from the plurality of remote chips;

capturing the first portion of the data into the data buffer bank;

capturing the second portion of the data into the data buffer bank; and

passing the first portion of the data to the processor based on at least one criterion different from a criterion for passing the second portion of the data to the processor.

4. The method of claim 3, comprising:

receiving the specification of the first portion of the plurality of addresses into a first address buffer of the address buffer bank;

receiving the specification of the second portion of the plurality of addresses into a second address buffer of the address buffer bank;

capturing the first portion of the data into a first data buffer of the data buffer bank; and

capturing the second portion of the data into a second data buffer of the data buffer bank, wherein the at least one different criterion for passing the first portion of the data and the second portion of the data to the processor comprises one or both of a different rate or a different bandwidth.

5. The method of claim 1, wherein capturing the data from the plurality of remote chips into the data buffer bank in parallel comprises:

sending, from transaction logic of the data offload accelerator, a plurality of requests for the data in parallel to the plurality of remote chips;

receiving, in parallel by the transaction logic, a plurality of responses including the data from the plurality of remote chips; and

forwarding the data from the transaction logic into the data buffer bank in parallel.

6. The method of claim 1, comprising writing data to the plurality of remote chips in parallel.

7. The method of claim 6, wherein writing the data to the plurality of remote chips in parallel comprises initializing the plurality of remote chips prior to capturing the data.

8. The method of claim 1, comprising:

receiving subsequent data from the plurality of remote chips;

comparing the captured data with subsequent data; and

flagging the subsequent data that differs from the captured data.

9. The method of claim 8, comprising passing, to the processor, the subsequent data that differs from the captured data.

10. The method of claim 1, comprising:

comparing the data to one or more criteria; and

performing an action based on a determination that at least some of the data satisfies the one or more criteria.

11. A data offload accelerator for offloading data from a plurality of remote chips to a processor, the data offload accelerator comprising:

an address buffer bank for receiving a specification of a plurality of addresses for retrieving the data from the plurality of remote chips;

an offload control device coupled to the processor, the offload control device for receiving a command to initiate capture of the data from the plurality of remote chips and for interrupting the processor to pass at least a portion of the data to the processor;

transaction logic coupled to the address buffer bank and the offload control device, the transaction logic for retrieving the data from the plurality of remote chips; and

a data buffer bank coupled to the transaction logic via a plurality of physical couplings, the data buffer bank for receiving the data in parallel from the transaction logic.

12. The data offload accelerator of claim 11, wherein the data buffer bank comprises:

a first data buffer for receiving a first portion of the data for delivery to the processor; and

a second data buffer for receiving a second portion of the data to be passed to the processor based on at least one criterion different from a criterion for passing the first portion of the data to the processor.

13. The data offload accelerator of claim 12, wherein the address buffer bank comprises:

a first address buffer for receiving a specification of a first portion of the plurality of addresses for retrieving the first portion of the data; and

a second address buffer for receiving a specification of a second portion of the plurality of addresses for retrieving the second portion of the data.

14. The data offload accelerator of claim 11, wherein the offload control device comprises a first register for receiving, from the processor, the command to initiate the capture of the data from the plurality of remote chips.

15. The data offload accelerator of claim 14, wherein the offload control device further comprises: a second register for interrupting the processor.

16. The data offload accelerator of claim 11, comprising a comparator circuit coupled to the data buffer bank, the comparator circuit for making a comparison and providing an output based on the comparison.

17. The data offload accelerator of claim 16, wherein the comparator circuit comprises: logic for:

comparing the data received in the data buffer bank with subsequent data captured from the plurality of remote chips; and

providing an output indicating a difference between the data received in the data buffer bank and the subsequent data captured from the plurality of remote chips.

18. The data offload accelerator of claim 17, comprising at least one register coupled to the comparator circuit and the transaction logic, the at least one register for flagging the difference between the data received in the data buffer bank and the subsequent data captured from the plurality of remote chips.

19. The data offload accelerator of claim 16, wherein the comparator circuit comprises: at least one register for:

comparing the data to one or more criteria; and

providing an output indicating whether the data satisfies the one or more criteria.

20. The data offload accelerator of claim 19, comprising an action circuit coupled to the comparator circuit and the transaction logic, the action circuit for performing an action when the data satisfies the one or more criteria.