CN103718244A

CN103718244A - Acquisition method and device for media processing accelerator

Info

Publication number: CN103718244A
Application number: CN201280036339.6A
Authority: CN
Inventors: K·瓦伊蒂亚纳坦; B·G·雷迪
Original assignee: Intel Corp
Current assignee: Intel Corp
Priority date: 2011-07-25
Filing date: 2012-07-23
Publication date: 2014-04-09
Anticipated expiration: 2032-07-23
Also published as: KR101625418B1; CN103718244B; KR20140043455A; WO2013016295A1; US20130027416A1

Abstract

Apparatus, systems, and methods are described including dividing cache lines into at least a most significant portion and a next most significant portion, storing cache line contents in a register array such that the most significant portion of each cache line is stored in a first row of the register array and the next most significant portion of each cache line is stored in a second row of the register array. The contents of the first register portion of the first row may be provided to a barrel shifter, where the contents may be aligned and then stored in a buffer.

Description

Acquisition method and device for media processing accelerator

Background

Video planes are usually stored in memory in a tiled format to improve memory controller efficiency. Video processing algorithms often need to access a 2D region of interest (ROI) of arbitrary rectangular size at an arbitrary location within these video planes. Such locations may not be cache-aligned, and may span several non-adjacent cache lines and/or tiles. To acquire pixels from such a location, conventional approaches may over-fetch several cache lines of pixel data from memory and then perform swizzling, masking and reduction operations, which makes the acquisition process challenging.

Power-efficient media processing is typically performed by programmable vector or scalar architectures, or by fixed-function logic. In a conventional vector implementation, the pixel values of an ROI can be acquired using vector gather instructions, which typically involves: collecting some of the values in a row of pixel values from one cache line, masking out any invalid values, storing the values in a buffer or memory, collecting additional pixel values of that row from the next cache line, and repeating this process until a complete horizontal row of pixel values has been acquired. As a result, to accommodate the tiled format, a typical vector gather process often has to re-issue the same cache line multiple times using different masks.

Description of drawings

The material described herein is illustrated in the drawings by way of example and not by way of limitation. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference numerals have been repeated among the figures to indicate corresponding or analogous elements. In the drawings:

Figure 1 is a schematic diagram of an exemplary system;

Figure 2 illustrates an exemplary process;

Figure 3 illustrates an exemplary tiled memory format;

Figure 4 illustrates an exemplary tiled memory format;

Figures 5, 6 and 7 illustrate the exemplary system of Figure 1 in different environments;

Figure 8 illustrates an additional portion of the exemplary process of Figure 2;

Figure 9 illustrates the exemplary system of Figure 1 under an overflow condition; and

Figure 10 is a schematic diagram of an exemplary system, all arranged in accordance with at least some implementations of the present disclosure.

Detailed description

One or more embodiments are now described with reference to the figures. While specific structures and arrangements are discussed, it should be understood that this is done for illustrative purposes only. A person skilled in the art will recognize that other structures and arrangements may be used without departing from the spirit and scope of the description. It will be apparent to those skilled in the art that the techniques and/or arrangements described herein may also be used in a variety of systems and applications other than those described herein.

Although the following description sets forth various implementations that may appear in architectures such as system-on-chip (SoC) architectures, implementations of the techniques and/or arrangements described herein are not limited to particular architectures and/or computing systems, and may be implemented by any architecture and/or computing system serving a similar purpose. For example, the techniques and/or arrangements described herein may be implemented by various architectures employing, for example, multiple integrated circuit (IC) chips and/or packages, and/or by various computing devices and/or consumer electronics (CE) devices such as set-top boxes, smartphones, etc. Furthermore, while the following description may set forth numerous specific details, such as logic implementations, types and interrelationships of system components, logic partitioning/integration choices, etc., claimed subject matter may be practiced without such specific details. In other instances, some material, such as control structures and full software instruction sequences, may not be shown in detail so as not to obscure the material disclosed herein.

The material disclosed herein may be implemented in hardware, firmware, software, or any combination thereof. The material disclosed herein may also be implemented as instructions stored on a machine-readable medium, which may be read and executed by one or more processors. A machine-readable medium may include any medium and/or mechanism for storing or transmitting information in a form readable by a machine (e.g., a computing device). For example, a machine-readable medium may include: read-only memory (ROM); random-access memory (RAM); magnetic disk storage media; optical storage media; flash memory devices; electrical, optical, acoustical or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.); and other media.

References in the specification to "one embodiment," "an embodiment," "an exemplary embodiment," etc. indicate that the implementation described may include a particular feature, structure, or characteristic, but not every implementation need include that particular feature, structure, or characteristic. Moreover, such phrases do not necessarily refer to the same implementation. Furthermore, when a particular feature, structure, or characteristic is described in connection with an implementation, it is submitted that it is within the knowledge of one skilled in the art to effect such a feature, structure, or characteristic in connection with other implementations, whether or not explicitly described herein.

Figure 1 illustrates an exemplary implementation of an acquisition engine 100 according to the present disclosure. In various implementations, acquisition engine 100 may form at least a portion of a media processing accelerator. Acquisition engine 100 includes a register array 102, a barrel shifter 104, two acquisition (gather) register buffers (GRBs) 106 and 108, and a multiplexer (MUX) 110. Register array 102 includes a plurality of Tetris registers 112, 114, 116, 118 and 120, each having a plurality of register storage locations or portions 122. In various implementations, a Tetris register according to the present disclosure may be any temporary storage logic, such as processor register logic configured to be byte-marked or byte-enabled.

According to the present disclosure, acquisition engine 100 may be used to acquire video data from a region of interest (ROI) of a video plane stored in memory such as a cache (e.g., an L1 cache). In various implementations, the ROI may include any type of video data, such as pixel intensity values. In various implementations, engine 100 may be configured to store the contents of a plurality of cache lines (CLs) received from a cache (not shown) such that each cache line (e.g., CL1, CL2, etc.) is stored across the portions 122 of a corresponding one of the Tetris registers 112-120 of array 102. In various implementations, the first portions of the Tetris registers may form a first row 124 of array 102, the second portions of the Tetris registers may form a second row 126 of the array, and so on.

According to the present disclosure, cache line contents may be stored in array 102 such that different portions of the contents of each CL are stored in different portions of a corresponding one of the Tetris registers. For example, in various implementations, the most significant portion of CL1 may be stored in the first portion 128 of Tetris register 112, the most significant portion of CL2 may be stored in the first portion 130 of Tetris register 114, and so on. The next most significant portion of CL1 may be stored in the second portion 132 of Tetris register 112, the next most significant portion of CL2 may be stored in the second portion 134 of Tetris register 114, and so on.

According to the present disclosure, the number of rows of array 102 may match the number of octwords (OWs) in the cache lines to be processed, while the number of columns of array 102 (and hence the number of Tetris registers employed) may match the number of OWs per cache line plus one. In the example of Figure 1, engine 100 may be configured to acquire 64-byte cache lines, so that each Tetris register includes four portions 122 to store the four 16-byte OWs of the corresponding cache line, and array 102 therefore includes four rows. For example, the most significant OW of CL1 may be stored in portion 128 of Tetris register 112, the next most significant OW of CL1 may be stored in portion 132 of register 112, and so on. As will be explained in more detail below, in order to accommodate and process unaligned and/or overflowing cache line contents, an acquisition engine according to the present disclosure may include at least one more Tetris register than the number of OWs in a cache line. For example, to process 64-byte cache lines having four OWs, array 102 includes five Tetris registers 112-120, so that each row of array 102 spans a total width of 80 bytes.
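By way of non-limiting illustration, the layout of the register array described above can be modeled in software as follows. This is a minimal sketch rather than the hardware implementation: the function name `load_register_array` and the constants are hypothetical, and it assumes that the "most significant" OW of a cache line is its first 16 bytes.

```python
OW_BYTES = 16                       # one octword (OW)
CL_BYTES = 64                       # one cache line = four OWs
NUM_ROWS = CL_BYTES // OW_BYTES     # four rows in the array
NUM_REGS = NUM_ROWS + 1             # five Tetris registers (columns)

def load_register_array(cache_lines):
    """Return the array as a list of rows; row r holds OW r of every cache line.

    Assumption: OW 0 (the first 16 bytes) is treated as the most significant OW.
    """
    if len(cache_lines) > NUM_REGS:
        raise ValueError("more cache lines than Tetris registers")
    rows = [[] for _ in range(NUM_ROWS)]
    for cl in cache_lines:
        if len(cl) != CL_BYTES:
            raise ValueError("expected a 64-byte cache line")
        for r in range(NUM_ROWS):
            # Portion r of this Tetris register receives OW r of the cache line.
            rows[r].append(cl[r * OW_BYTES:(r + 1) * OW_BYTES])
    return rows

# Five cache lines, each filled with its own index so the layout is visible.
cache_lines = [bytes([i]) * CL_BYTES for i in range(1, 6)]
array = load_register_array(cache_lines)
row_width = sum(len(portion) for portion in array[0])
```

With five 64-byte cache lines loaded, each of the four rows holds five 16-byte portions, which gives the 80-byte row width mentioned above.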

Barrel shifter 104 may receive the contents of any row of register array 102. For example, barrel shifter 104 may be a 64-byte barrel shifter configured to receive the contents of row 124, which correspond to the most significant portions of the five cache lines stored in array 102. In various implementations, as will be explained in more detail below, barrel shifter 104 may align the contents of register portions 122 by, for example, left shifting them, and may then provide the aligned contents to GRB 106 or GRB 108. For example, barrel shifter 104 may receive the contents of the portions 122 of row 124 in successive iterations, align those contents and provide the aligned contents to GRB 106. Thus, barrel shifter 104 may receive the contents of register portion 128, align those contents, and then provide the aligned data to GRB 106. Barrel shifter 104 may then receive the contents of register portion 130, align those contents and provide the aligned data to GRB 106 for temporary storage adjacent to the aligned data corresponding to register portion 128, and so on, until the contents of row 124 have been aligned and stored in GRB 106 to form an aligned row of pixel data.
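The portion-by-portion alignment described above can be pictured at byte level as follows. This is a minimal sketch under stated assumptions, not the hardware design: `gather_row` and `offset` (the number of invalid leading bytes in the row) are hypothetical names, the buffer is assumed to be 64 bytes wide, and all portions are assumed to have the same length.

```python
def gather_row(parts, offset, grb_bytes=64):
    """Shift each portion left by `offset` bytes and deposit it in the buffer.

    Bytes that would land outside the buffer after the shift are dropped.
    """
    grb = bytearray(grb_bytes)
    for i, part in enumerate(parts):
        # Destination of the portion's first byte after the left shift.
        dst = i * len(part) - offset
        for j, value in enumerate(part):
            if 0 <= dst + j < grb_bytes:
                grb[dst + j] = value
    return bytes(grb)

# One 80-byte array row made of five 16-byte portions holding the values 0..79.
row = [bytes(range(i * 16, (i + 1) * 16)) for i in range(5)]
# Assume the ROI starts 5 bytes into the first portion.
aligned = gather_row(row, offset=5)
```

In this example the first five bytes of the first portion are treated as invalid, and the buffer ends up holding the 64 contiguous bytes that follow them.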

While engine 100 processes the contents of row 124 as just described, engine 100 may also process the contents of row 126 in a similar manner, until the contents of row 126 have been aligned and stored in GRB 108 to form a second aligned row of pixel values. In various implementations, as explained in more detail below, GRB 106 and GRB 108 may provide aligned rows of pixel data to a 2D register file (not shown) in ping-pong fashion, using MUX 110 to provide the contents of GRB 106 and GRB 108 alternately to the register file (RF).

In various implementations, acquisition engine 100 may be implemented in one or more integrated circuits (ICs), such as a system-on-chip (SoC) or an add-on IC of a consumer electronics (CE) media processing system. For example, engine 100 may be implemented by any device configured to process video data, such as, but not limited to, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), etc. As noted above, although engine 100 includes five Tetris registers 112-120 suitable for processing 64-byte cache lines, an acquisition engine according to the present disclosure may include any number of Tetris registers, depending on the size of the cache lines and/or of the ROI being processed.

Figure 2 illustrates a flowchart of an exemplary process 200 for implementing acquisition operations in accordance with various implementations of the present disclosure. Process 200 may include one or more operations, functions or actions as illustrated by one or more of blocks 201, 202, 204, 206, 208, 210 and 212 of Figure 2. By way of non-limiting example, process 200 will be described herein with reference to the exemplary acquisition engine 100 of Figure 1. Process 200 may begin at block 201, where acquisition processing for an ROI of a video plane is started. For example, process 200 may begin at block 201 with the start of acquisition processing for a 64x64 ROI (e.g., an ROI spanning 64 rows, each row having 64 bytes of pixel values).

At block 202, a first cache line (CL) may be received, where the CL corresponds to the first CL of data contained in the ROI. At block 204, the CL may be divided into a most significant portion, a next most significant portion, and so on. For example, if a 64-byte CL is received at block 202, the CL may be divided into four 16-byte OW portions. The CL portions may then be loaded into the register array such that the most significant portion is stored in the first location of the first row of the array, the next most significant portion is stored in the first location of the second row of the array, and so on. For example, a 64-byte CL (CL1) received by array 102 may be divided into four OWs and loaded into the register portions 122 of the first Tetris register 112, such that the most significant OW is stored in portion 128, the next most significant OW is stored in portion 132, and so on.

At block 208, a determination is made as to whether additional cache lines of data are to be obtained for the ROI. If additional CLs are to be obtained, process 200 may loop back and perform blocks 202-206 for the next CL in the ROI. For example, the next 64-byte CL (CL2) may be received by array 102, divided into four OWs and loaded into the register portions 122 of the second Tetris register 114, such that the most significant OW is stored in portion 130, the next most significant OW is stored in portion 134, and so on. In this manner, process 200 may continue to loop through successive iterations of blocks 202-206 until one or more additional CLs of the ROI have been loaded into array 102. Continuing the example above, three further CLs of the ROI (e.g., CL3, CL4 and CL5) may be received by array 102, divided into four OWs in a similar manner and loaded into the register portions 122 of the remaining Tetris registers 116, 118 and 120.

Figures 3 and 4 illustrate an exemplary tile-y format for storing video planes in tiled memory, in accordance with various implementations of the present disclosure. In Figure 3, a 4 KB tile 300 of memory may include eight (8) columns by thirty-two (32) rows of 16-byte-wide storage locations. In the tile-y format, tile 300 may store the four OWs of a 64-byte CL 302 as the first part of a column of tile 300. In this manner, tile 300 may store sixty-four (64) cache lines of data. In Figure 4, tile 300 is shown spanning a portion of a region 400 of memory such as a cache. With reference to process 200 and engine 100, the successive iterations of blocks 202-206 that load the CLs of the ROI may include successively loading cache lines 402-410 of tile 300 into array 102.
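As a non-limiting illustration of the tile geometry above, the following sketch maps a byte position within one tile to a byte offset and a cache line index. The text states only the 8 x 32 arrangement of 16-byte locations and that the four OWs of a cache line occupy the start of a column; the sketch additionally assumes that the OWs of a column are stored contiguously and that columns follow one another in memory. The function names are hypothetical.

```python
OW_BYTES = 16     # width of one storage location (one OW)
TILE_ROWS = 32    # rows per tile
TILE_COLS = 8     # OW-wide columns per tile
CL_BYTES = 64     # four consecutive OWs of a column form one cache line

def tile_y_offset(x, y):
    """Byte offset within a 4 KB tile of the byte at horizontal position x, row y.

    Assumption: column-major storage, one 16-byte OW per row within a column.
    """
    if not (0 <= x < TILE_COLS * OW_BYTES and 0 <= y < TILE_ROWS):
        raise ValueError("position lies outside the tile")
    column, byte_in_ow = divmod(x, OW_BYTES)
    return column * TILE_ROWS * OW_BYTES + y * OW_BYTES + byte_in_ow

def cache_line_index(x, y):
    """Index (0..63) of the cache line within the tile that holds byte (x, y)."""
    return tile_y_offset(x, y) // CL_BYTES

# Every cache line touched by a 4-row, 64-byte-wide region at the tile origin.
touched = sorted({cache_line_index(x, y) for y in range(4) for x in range(64)})
```

Under these assumptions a 64-byte-wide, four-row region starting at the tile origin touches four cache lines, one per column, each contributing one OW per row.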

Returning to the discussion of Figure 2, when one or more CLs of the ROI have been loaded into the register array, process 200 may continue at block 210 where, for each successive portion of the first row of the array, the portion is loaded into the barrel shifter and, if necessary, the contents of the portion are aligned. For example, block 210 may include loading the contents of first portion 128 of row 124 into shifter 104 and then left shifting the data to align it with GRB 106. In some implementations, if the cache lines were already aligned when they were loaded into the array at blocks 202-206, block 210 need not include aligning the contents. At block 212, the aligned first row of pixel values may be provided to a first acquisition buffer. For example, the aligned pixel value contents of row 124 may be provided from barrel shifter 104 to GRB 106.

For example, Figure 5 illustrates engine 100 in an environment 500 in which blocks 210 and 212 of process 200 are performed for a first register portion, in accordance with various implementations of the present disclosure. In environment 500, as shown, five CLs of the ROI have been loaded into array 102, and the contents of the ROI (shown by dashed markings) are not aligned relative to array 102. In this example, the first CL of the ROI (e.g., CL1) is loaded into the first Tetris register 112 such that each portion 122 of Tetris register 112 includes an invalid portion 502. According to the present disclosure, when block 210 is performed for the first register portion 128 of row 124, the contents of portion 128 are loaded into shifter 104 and left shifted so that, when the contents are provided to GRB 106 at block 210, the data is aligned with GRB 106 as shown.

Continuing the example, Figure 6 illustrates engine 100 in an environment 600 in which blocks 210 and 212 of process 200 are performed for the next register portion, in accordance with various implementations of the present disclosure. In environment 600, blocks 210 and 212 are performed for the next portion 130 of row 124 by loading the contents of portion 130 of Tetris register 114 into shifter 104, left shifting the data and then providing the aligned data to GRB 106, so that this data is stored adjacent to the aligned data from portion 128 as shown. In this manner, at the end of blocks 210 and 212, the fully aligned contents of row 124 may be stored in GRB 106, as shown in Figure 7, which illustrates engine 100 in an environment 700 in which blocks 210 and 212 of process 200 have been completed for the first register row 124, in accordance with various implementations of the present disclosure.

Returning to the discussion of Figure 2, when the aligned contents of the first row have been loaded into the first acquisition buffer at block 212, process 200 may continue with the processing of any additional rows of the register array. Figure 8 illustrates a flowchart of an additional portion of exemplary process 200 for implementing acquisition operations in accordance with various implementations of the present disclosure. The additional portion of process 200 may include one or more operations, functions or actions as illustrated by one or more of blocks 214, 215, 216, 218, 220 and 222 of Figure 8. By way of non-limiting example, the additional blocks of process 200 will also be described herein with reference to the exemplary acquisition engine 100 of Figure 1. Process 200 may continue at block 214 of Figure 8.

At block 214, the contents of the portions of the second row of the array may be successively loaded into the barrel shifter and, if necessary, aligned. At block 215, the aligned contents of the register portions may be loaded into a second acquisition buffer. For example, blocks 214 and 215 may include: loading the contents of first portion 132 of second row 126 into shifter 104, left shifting the data, loading the aligned data into GRB 108, loading the contents of second portion 134 of second row 126 into shifter 104, left shifting the data, loading the aligned data into GRB 108 adjacent to the aligned data from portion 132, and so on, until all portions of the second row have been processed. Thus, in this example, at the end of blocks 214 and 215, the aligned contents of second row 126 of register array 102 may have been loaded into GRB 108.

While block 214 and/or block 215 is being performed, the aligned contents of the first row may be provided from the first register buffer to the 2D register file at block 216. For example, block 216 may include using MUX 110 to provide the aligned first row of data stored in GRB 106 to the RF, where the data may be stored as a first row of data. At block 218, the aligned contents of the second row may be provided from the second register buffer to the RF. For example, block 218 may include using MUX 110 to provide the aligned second row of data stored in GRB 108 to the RF, where the data may be stored as a second row of data.
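The alternation between the two buffers described above can be sketched in software as follows. This is an illustrative model only: `ping_pong` is a hypothetical name, a list stands in for the 2D register file, and each loop step stands for the interval during which one buffer is drained through the MUX while the other is being filled.

```python
def ping_pong(aligned_rows):
    """Alternate two buffers: one is filled while the other is drained to the RF."""
    grb = [None, None]       # stand-ins for GRB 106 and GRB 108
    register_file = []       # stand-in for the 2D register file (RF)
    for step in range(len(aligned_rows) + 1):
        fill, drain = step % 2, (step + 1) % 2
        if step > 0:
            # The MUX selects the buffer that was filled during the previous step.
            register_file.append(grb[drain])
        if step < len(aligned_rows):
            # Meanwhile the other buffer receives the next aligned row.
            grb[fill] = aligned_rows[step]
    return register_file

rows_in = [b"row0", b"row1", b"row2", b"row3"]
rows_out = ping_pong(rows_in)
```

The rows reach the register file in their original order, while each buffer is only ever written during a step in which the MUX is reading the other one.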

Process 200 may continue at block 220, where additional rows of the register array are processed in a manner similar to that described above for the first two rows of the register array. Thus, for example, block 220 may cause the aligned contents of the three remaining rows of array 102 to be stored in the RF as the next three rows of data, and may complete the processing of those rows of the array. At block 222, a determination may be made as to whether more cache lines should be acquired for the ROI. For example, if the first iteration of process 200 has resulted in the acquisition of four rows of a 64x64 ROI, the acquisition operation may continue for the next four rows of the ROI. If the acquisition operation is to continue for the ROI, process 200 may return to Figure 2 and a second pass of process 200 may begin at block 201 for one or more additional cache lines of the ROI. Otherwise, if the acquisition operation is not to continue, process 200 may end.

Although implementations of exemplary process 200, as shown in Figures 2 and 8, may include performing all of the blocks shown in the order illustrated, the present disclosure is not limited in this respect and, in various examples, implementations of process 200 may include performing only a subset of the blocks shown and/or performing them in a different order than illustrated. For example, in various implementations, block 216 of Figure 8 may be performed before, during and/or after either or both of blocks 214 and 215. In addition, acquisition processing according to the present disclosure may be performed at different fill stages of the register array, so that if, at any time, one or more rows of the register array are empty, those rows may be loaded with ROI pixel values from the cache while the array rows that hold ROI pixel values are processed as described herein.

In addition, any one or more of the processes and/or blocks of Figures 2 and 8 may be performed in response to instructions provided by one or more computer program products. Such program products may include signal-bearing media providing instructions that, when executed by, for example, one or more processor cores, may provide the functionality described herein. The computer program products may be provided in any form of computer-readable medium. Thus, for example, a processor including one or more processor cores may perform one or more of the blocks shown in Figures 2 and 8 in response to instructions conveyed to the processor by a computer-readable medium.

Furthermore, although process 200 has been described herein in the context of exemplary acquisition engine 100 acquiring 64-byte cache lines for a 64x64 ROI of a video plane stored in a cache in tile-y format, the present disclosure is not limited to a particular cache line size, a particular ROI size or shape, and/or a particular tiled memory format. For example, to implement acquisition processing for ROIs wider than 64 bytes, one or more additional Tetris registers may be added to the register array. In addition, for ROIs of smaller width, such as a 32x64 ROI, the first two rows of the array may be collected into an acquisition buffer before being written out to the RF. Further, other tiled memory formats, such as tile-x, may be subjected to acquisition processing according to the present disclosure.

In various implementations, one or more processor cores may use engine 100 to perform process 200 for any size and/or shape of ROI and for any alignment of the ROI data relative to engine 100. In doing so, processor throughput may depend on the size, shape and/or alignment of the ROI. For example, in one non-limiting example, if the ROI to be acquired extends in the X direction (e.g., as a row of pixel values in tile-y format) and is fully aligned, one cache line may be processed in two cycles. In this case, throughput may be limited by the cache width. On the other hand, if the ROI extends in the Y direction (e.g., as a column of pixel values in tile-y format) and is fully aligned, one cache line may be processed in 64 cycles. In another non-limiting example, for a completely unaligned 17x17 ROI, one cache line may be processed in 12 cycles. In a final non-limiting example, the pixel values of an aligned 24x24 ROI may be acquired in 50 cycles, whereas if the 24x24 ROI is completely unaligned it may take 81 cycles to acquire all of the pixel values.

In various implementations, an acquisition process in accordance with the present disclosure may be performed under overflow conditions. For example, referring to exemplary acquisition engine 100, in some implementations the ROI may exceed the width of barrel shifter 104 and of GRB 106 and GRB 108. FIG. 9 illustrates engine 100 in a context 900 in which process 200 is performed under overflow conditions in accordance with various implementations of the present disclosure. As shown in FIG. 9, after GRB 106 has been filled with most of the first row, the overflow data 902 remaining from the first row may be placed in GRB 108. Processing of the remaining rows may continue in a similar manner.
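The overflow handling described above can be sketched as splitting each aligned row at the buffer width. The 64-byte buffer width, the 80-byte row width, the function name, and the use of Python lists to record what is written to each buffer are all assumptions for illustration only.

```python
def split_for_overflow(row: bytes, buffer_width: int = 64) -> tuple:
    """Return (part that fits the first buffer, overflow for the second buffer)."""
    if buffer_width <= 0:
        raise ValueError("buffer width must be positive")
    return row[:buffer_width], row[buffer_width:]


# Three hypothetical 80-byte ROI rows, each wider than the assumed 64-byte buffer.
roi_rows = [bytes([r]) * 80 for r in range(3)]

grb_106 = []  # records what would be written to the first buffer, row by row
grb_108 = []  # records the overflow placed in the second buffer, row by row
for row in roi_rows:
    head, overflow = split_for_overflow(row)
    grb_106.append(head)
    grb_108.append(overflow)
```

Concatenating the two parts recorded for a row reproduces the original row, so no pixel values are lost in this model.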

FIG. 10 illustrates an exemplary system 1000 in accordance with the present disclosure. System 1000 may be used to perform some or all of the various functions discussed herein and may include any device or collection of devices capable of acquisition processing in accordance with various implementations of the present disclosure. For example, system 1000 may include selected components of a computing platform or device such as a desktop, mobile, or tablet computer, a smartphone, a set-top box, and so on, although the present disclosure is not limited in this regard. In some implementations, system 1000 may be a computing platform or SoC based on an architecture (IA) for CE devices. Those skilled in the art will readily appreciate that the implementations described herein may be applied to alternative processing systems without departing from the scope of the present disclosure.

System 1000 includes a processor 1002 having one or more processor cores 1004. Processor cores 1004 may be any type of processor logic capable, at least in part, of executing software and/or processing data signals. In various examples, processor cores 1004 may include CISC processor cores, RISC microprocessor cores, VLIW microprocessor cores, and/or any number of processor cores implementing any combination of instruction sets, or any other processor devices, such as a digital signal processor or microcontroller. In various implementations, one or more processor cores 1004 may implement an acquisition engine and/or perform acquisition processing in accordance with the present disclosure.

Processor 1002 also includes a decoder 1006 that may be used to decode instructions received by, for example, a display processor 1008 and/or a graphics processor 1010 into control signals and/or microcode entry points. Although illustrated in system 1000 as components distinct from cores 1004, those skilled in the art will recognize that one or more of cores 1004 may implement decoder 1006, display processor 1008, and/or graphics processor 1010. In response to the control signals and/or microcode entry points, display processor 1008 and/or graphics processor 1010 may perform corresponding operations.

Processing cores 1004, decoder 1006, display processor 1008, and/or graphics processor 1010 may be communicatively and/or operably coupled through a system interconnect 1016 with each other and/or with various other system devices, which may include but are not limited to, for example, a memory controller 1014, an audio controller 1018, and/or peripherals 1020. Peripherals 1020 may include, for example, a Universal Serial Bus (USB) host port, a Peripheral Component Interconnect (PCI) Express port, a Serial Peripheral Interface (SPI) interface, an expansion bus, and/or other peripherals. Although FIG. 10 illustrates memory controller 1014 as coupled to decoder 1006 and processors 1008 and 1010 by interconnect 1016, in various implementations memory controller 1014 may be directly coupled to decoder 1006, display processor 1008, and/or graphics processor 1010.

In some implementations, system 1000 may communicate via an I/O bus (not shown) with various I/O devices not shown in FIG. 10. Such I/O devices may include but are not limited to, for example, a universal asynchronous receiver/transmitter (UART) device, a USB device, an I/O expansion interface, or other I/O devices. In various implementations, system 1000 may represent at least part of a system for undertaking mobile, network, and/or wireless communications.

System 1000 may further include memory 1012. Memory 1012 may be one or more discrete memory components, such as a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, or other memory devices. Memory 1012 may store instructions and/or data, represented by data signals, that may be executed by processor 1002. In some implementations, memory 1012 may include a system memory portion and a display memory portion. In various implementations, memory 1012 may store video data, such as frames of video data including pixel values that may, at various junctures, be stored as cache lines acquired by engine 100 and/or processed by process 200.

Although FIG. 10 illustrates memory 1012 as external to processor 1002, in various implementations processor 1002 includes one or more instances of an internal cache 1024, such as an L1 cache. In accordance with the present disclosure, cache 1024 may store video data, such as pixel values, in the form of cache lines arranged in tile-y format. Processor cores 1004 may access the data stored in cache 1024 to implement the acquisition functions described herein. In addition, cache 1024 may provide a 2D register file that stores the aligned data output of engine 100 and process 200. In various implementations, cache 1024 may receive video data, such as pixel values, from memory 1012.

The systems described above, and the processing performed by them as described herein, may be implemented in hardware, firmware, or software, or any combination thereof. In addition, any one or more features disclosed herein may be implemented in hardware, software, firmware, and combinations thereof, including discrete and integrated circuit logic, application-specific integrated circuit (ASIC) logic, and microcontrollers, and may be implemented as part of a domain-specific integrated circuit package or a combination of integrated circuit packages. The term software, as used herein, refers to a computer program product including a computer-readable medium having computer program logic stored therein to cause a computer system to perform one or more features and/or combinations of features disclosed herein.

While certain features set forth herein have been described with reference to various implementations, this description is not intended to be construed in a limiting sense. Hence, various modifications of the implementations described herein, as well as other implementations, which are apparent to persons skilled in the art to which the present disclosure pertains are deemed to lie within the spirit and scope of the present disclosure.

Claims

1. An apparatus for collecting pixel values, comprising:

a plurality of Tetris registers arranged as a register array, each Tetris register comprising at least a first register portion and a second register portion, wherein a first row of the register array comprises the first register portion of each Tetris register, the register array to store a plurality of cache lines of pixel values such that the first row of the register array stores the most significant portion of each cache line;

a barrel shifter to receive the most significant portions of the plurality of cache lines from the first row of the register array as a first row of pixel values, the barrel shifter to align the first row of pixel values; and

a first buffer to receive the aligned first row of pixel values from the barrel shifter.

2. The apparatus of claim 1, wherein a second row of the register array comprises the second register portion of each Tetris register, the register array to store the plurality of cache lines of pixel values such that the second row of the register array stores the next most significant portion of each of the cache lines, the barrel shifter to receive the next most significant portions of the plurality of cache lines from the second row of the register array as a second row of pixel values, and the barrel shifter to align the second row of pixel values, the apparatus further comprising:

a second buffer to receive the aligned second row of pixel values from the barrel shifter.

3. The apparatus of claim 1, further comprising:

a multiplexer coupled to the first buffer and the second buffer; and

a register file coupled to the multiplexer, wherein the multiplexer is configured to provide either the aligned first row of pixel values or the aligned second row of pixel values to the register file, and wherein the register file is configured to store the aligned second row of pixel values adjacent to the aligned first row of pixel values.

4. The apparatus of claim 1, wherein the most significant portion of each cache line comprises a row of pixel data in tile-y format.

5. The apparatus of claim 1, wherein each cache line comprises 64 bytes of pixel values, wherein the plurality of Tetris registers comprises at least five Tetris registers, wherein each Tetris register is configured to store 64 bytes of pixel values, and wherein the first register portion and the second register portion are each configured to store 16 bytes of pixel values.

6. The apparatus of claim 1, wherein, to align the first row of pixel values, the barrel shifter is configured to left shift the first row of pixel values.

7. A computer-implemented method, comprising:

receiving a plurality of cache lines;

dividing each cache line into at least a most significant portion and a next most significant portion;

storing the contents of the plurality of cache lines in a register array such that the most significant portion of each cache line is stored in a first row of the register array, the first row comprising a first plurality of register portions;

providing the contents of a first register portion of the first plurality of register portions to a barrel shifter;

aligning the contents of the first register portion of the first plurality of register portions; and

storing the aligned contents of the first register portion of the first plurality of register portions in a first buffer.

8. The method of claim 7, wherein storing the contents of the plurality of cache lines in the register array comprises storing the contents such that the next most significant portion of each cache line is stored in a second row of the register array, the second row comprising a second plurality of register portions, the method further comprising:

providing the contents of a first register portion of the second plurality of register portions to the barrel shifter;

aligning the contents of the first register portion of the second plurality of register portions; and

storing the aligned contents of the first register portion of the second plurality of register portions in a second buffer.

9. The method of claim 8, further comprising:

providing the aligned contents of the first register portion of the first plurality of register portions to a register file before providing the aligned contents of the first register portion of the second plurality of register portions to the register file.

10. The method of claim 7, wherein the register array includes a plurality of Tetris registers.

11. The method of claim 10, wherein the plurality of Tetris registers are arranged such that a first portion of each Tetris register stores the most significant portion of a cache line.

12. The method of claim 7, wherein aligning the contents of the first register portion of the first plurality of register portions comprises left-shifting the contents of the first register portion of the first plurality of register portions.

13. A system for collecting pixel values comprising:

a cache memory to store a plurality of cache memory lines of pixel values;

an acquisition engine coupled to the cache memory; and

an additional memory coupled to the acquisition engine, wherein instructions in the additional memory configure the acquisition engine to receive the plurality of cache lines from the cache memory, the acquisition engine comprising:

a plurality of Tetris registers arranged as a register array, each Tetris register comprising at least a first register portion and a second register portion, wherein a first row of the register array comprises the first register portion of each Tetris register, the register array to store the plurality of cache lines such that the first row of the register array stores the most significant portion of each cache line;

14. The system of claim 13, wherein a second row of the register array comprises the second register portion of each Tetris register, the register array to store the plurality of cache lines such that the second row of the register array stores the next most significant portion of each of the cache lines, the barrel shifter to receive the next most significant portions of the plurality of cache lines as a second row of pixel values, and the barrel shifter to align the second row of pixel values, the acquisition engine further comprising:

15. The system of claim 14, the acquisition engine further comprising:

a multiplexer coupled to the first buffer and the second buffer; and

16. The system of claim 13, wherein the cache memory is configured to store cache lines in tile-y format.

17. The system of claim 13, wherein each cache line comprises 64 bytes of pixel values, wherein the plurality of Tetris registers comprises at least five Tetris registers, wherein each Tetris register is configured to store 64 bytes of pixel values, and wherein the first register portion and the second register portion are each configured to store 16 bytes of pixel values.

18. The system of claim 13, wherein, to align the first row of pixel values, the barrel shifter is configured to left shift the first row of pixel values.

19. The system of claim 13, the additional memory to store video data and to provide a portion of the video data to the cache memory for storage as the plurality of cache lines.