WO2002084451A2 - Vector processor architecture and methods implemented in this architecture - Google Patents
Vector processor architecture and methods implemented in this architecture
- Publication number: WO2002084451A2 (PCT/US2002/020645)
- Authority: WIPO (PCT)
- Prior art keywords: vector, operand, data, instructions, register
- Prior art date
- Legal status: Ceased
Classifications
- All classifications fall under G06F—Electric digital data processing:
- G06F9/3822—Parallel decoding, e.g. parallel decode units
- G06F15/8084—Vector processors: data register access, special arrangements thereof, e.g. mask or switch
- G06F9/30036—Instructions to perform operations on packed data, e.g. vector, tile or matrix operations
- G06F9/30038—Instructions to perform operations on packed data using a mask
- G06F9/30065—Loop control instructions; iterative instructions, e.g. LOOP, REPEAT
- G06F9/30069—Instruction skipping instructions, e.g. SKIP
- G06F9/30072—Instructions to perform conditional operations, e.g. using predicates or guards
- G06F9/30192—Instruction operation extension or modification according to data descriptor, e.g. dynamic data typing
- G06F9/325—Address formation of the next instruction for loops, e.g. loop detection or loop counter
- G06F9/345—Addressing or accessing the instruction operand or the result: multiple operands or results
- G06F9/3455—Addressing of multiple operands or results using stride
- G06F9/3816—Instruction alignment, e.g. cache line crossing
- G06F9/3824—Operand accessing
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3853—Instruction issuing of compound instructions
- G06F9/3856—Reordering of instructions, e.g. using queues or age tags
- G06F9/3885—Concurrent instruction execution using a plurality of independent parallel functional units
Definitions
- the present invention relates to vector processors.
- the present invention involves a novel vector processor architecture, and hardware and processing features associated therewith.
- the invention may be understood to pertain to a vector processing architecture that provides both vector processing and superscalar processing features.
- a vector processor as described herein may perform both vector processing and superscalar register processing.
- this processing may comprise fetching instructions from an instruction stream, where the instruction stream comprises vector instructions and register instructions.
- the type of a fetched instruction is determined, and if the fetched instruction is a vector instruction, the instruction is routed to decoders of the vector processor in accordance with functional units used by the vector instruction.
- if the fetched instruction is a register instruction, a vector element slice of the vector processor that is associated with the register instruction is determined, one or more functional units that are associated with the register instruction are determined, and the register instruction is routed to the functional units of the vector element slice.
- the routing may be to instruction decoders associated with said functional units and said vector element slice.
- a vector processor as described above may comprise a plurality of vector element slices, each comprising a plurality of functional units, and a plurality of instruction decoders, each associated with a functional unit of one of the vector element slices, for providing instructions to an associated functional unit.
- the vector processor may further comprise a vector instruction router for routing a vector instruction to all instruction decoders associated with functional units used by said vector instruction, and a register instruction router for routing a register instruction to instruction decoders associated with a vector element slice and functional units associated with the register instruction.
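- As a minimal illustration of this routing decision, the C sketch below assumes a hypothetical instruction descriptor with explicit type, functional-unit and slice fields (the actual encoding is not specified here): a vector instruction fans out to the decoder of its functional unit in every slice, while a register instruction is delivered only to the decoder of its own slice.
```c
#include <stdio.h>

#define NUM_SLICES 8      /* hypothetical hardware vector length */
#define NUM_FUNITS 3      /* e.g. multiplier, array adder, ALU   */

typedef enum { INSTR_VECTOR, INSTR_REGISTER } instr_type_t;

typedef struct {
    instr_type_t type;
    int funit;            /* functional unit used by the instruction */
    int slice;            /* destination slice (register instructions only) */
    unsigned opcode;
} instr_t;

/* One decoder per (slice, functional unit) pair, modeled as a single slot. */
static instr_t decoders[NUM_SLICES][NUM_FUNITS];

static void route(const instr_t *in)
{
    if (in->type == INSTR_VECTOR) {
        /* Vector instruction: deliver to the decoder of this functional
           unit in every slice, so all elements execute it in SIMD. */
        for (int s = 0; s < NUM_SLICES; s++)
            decoders[s][in->funit] = *in;
    } else {
        /* Register instruction: deliver only to the decoder of the slice
           selected by its destination register. */
        decoders[in->slice][in->funit] = *in;
    }
}

int main(void)
{
    instr_t vadd = { INSTR_VECTOR,   2, 0, 0x10 };  /* vector ALU op        */
    instr_t radd = { INSTR_REGISTER, 2, 5, 0x11 };  /* scalar op on slice 5 */
    route(&vadd);
    route(&radd);
    printf("slice 5 ALU decoder holds opcode 0x%x\n", decoders[5][2].opcode);
    return 0;
}
```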
- a vector processor as described herein may also create Very Long Instruction Words (VLIW) from component instructions.
- this processing may comprise fetching a set of instructions from an instruction stream, the instruction stream comprising VLIW component instructions, and identifying VLIW component instructions according to their respective functional units.
- the processing may further comprise determining a group of VLIW component instructions that may be assigned to a single VLIW, and assigning the component instructions of the group to specific positions of a VLIW instruction according to their respective functional units. Identifying VLIW component instructions may be preceded by determining whether each of the fetched instructions is a VLIW component instruction. Determining whether a fetched instruction is a VLIW component instruction may be based on an instruction type and an associated functional unit of the instruction, and instruction types may include vector instructions, register instructions, load instructions or control instructions.
- the component instructions may include vector instructions and register instructions.
- a vector processor that forms Very Long Instruction Words (VLIW) from VLIW component instructions of an instruction stream as described herein may be designed by defining a set of VLIW component instructions, each component instruction being associated with a functional unit of the vector processor, defining grouping rules for VLIW component instructions that associate component instructions that may be executed in parallel, and defining associations between VLIW component instructions and specific positions of a VLIW instruction based on the functional unit of the component instruction.
- a vector processor as described herein that forms Very Long Instruction Words (VLIW) from VLIW component instructions of an instruction stream may comprise a plurality of vector element slices, each comprising a plurality of functional units, and a plurality of instruction decoders, each associated with a functional unit of one of the vector element slices, for providing instructions to an associated functional unit.
- the processor may further include a plurality of routers, each associated with a type of said functional units, for routing instructions to a decoder associated with a functional unit of the routed instruction, a plurality of pipeline registers, each corresponding to a type of said functional units, for storing instructions provided by instruction decoders corresponding to the same type of functional unit, and a plurality of instruction grouping decoders, for receiving instructions from an instruction stream and providing groups of VLIW component instructions of said stream to said plurality of routers.
- the VLIW instruction is comprised of the instructions stored in the respective pipeline registers.
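- As an illustration of the grouping step, the sketch below packs component instructions into VLIW slots keyed by functional unit and closes a word as soon as a slot would be occupied twice; the five-slot layout and the one-unit-per-slot rule are simplifying assumptions, not the patent's grouping rules.
```c
#include <stdio.h>

#define NUM_SLOTS 5   /* hypothetical: X-load, Y-load, multiply, add, write */

typedef struct {
    int funit;        /* functional unit => VLIW slot position */
    unsigned opcode;
} component_t;

typedef struct {
    int used[NUM_SLOTS];
    component_t slot[NUM_SLOTS];
} vliw_t;

/* Pack component instructions into VLIWs; a VLIW is emitted when the next
   component needs a slot that is already occupied (simple grouping rule). */
static int build_vliws(const component_t *in, int n, vliw_t *out, int max_out)
{
    int count = 0;
    vliw_t cur = {0};
    for (int i = 0; i < n; i++) {
        int pos = in[i].funit;          /* slot position follows the unit */
        if (cur.used[pos]) {            /* conflict: close current VLIW   */
            out[count++] = cur;
            if (count == max_out) return count;
            vliw_t empty = {0};
            cur = empty;
        }
        cur.used[pos] = 1;
        cur.slot[pos] = in[i];
    }
    out[count++] = cur;                 /* flush the last partial VLIW */
    return count;
}

int main(void)
{
    /* load X, load Y, multiply, add, load X (the second load starts a new VLIW) */
    component_t stream[] = { {0,1}, {1,2}, {2,3}, {3,4}, {0,5} };
    vliw_t words[4];
    int n = build_vliws(stream, 5, words, 4);
    printf("%d VLIW(s) formed; first word uses %s slot 2\n",
           n, words[0].used[2] ? "a filled" : "an empty");
    return 0;
}
```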
- a processor as described herein may also implement a method to deliver an instruction window, comprising a set of instructions, to a superscalar instruction decoder.
- the method may comprise fetching two adjacent lines of instructions that together contain a set of instructions to be delivered to the superscalar instruction decoder, each of the lines being at least the size of the set of instructions to be delivered, and reordering the positions of instructions of the two adjacent lines so as to position first and subsequent elements of the set of instructions to be delivered into first and subsequent positions corresponding to first and subsequent positions of the superscalar instruction decoder.
- Reordering the positions of the instructions may involve rotating the positions of said instructions within the two adjacent lines.
- the first line may comprise a portion of the set of instructions and the second line may comprise a remaining portion of the set of instructions.
- the method may obtain a line of instructions containing at least a set of instructions to be provided to the superscalar instruction decoder, provide the line of instructions to a rotator network along with a starting position of said set of instructions within the line, the rotator network having respective outputs coupled to inputs of a superscalar instruction decoder, and control the rotator network in accordance with the starting position of the set of instructions to output the first and subsequent instructions of the set of instructions to first and subsequent inputs of the superscalar decoder.
- the method may obtain at least a portion of a first line of instructions containing at least a portion of a set of instructions to be delivered to the superscalar instruction decoder, obtain at least a portion of a second line of instructions containing at least a remaining portion of said set of instructions, provide the first and second lines of instructions to a rotator network along with a starting position of the set of instructions, the rotator network having respective outputs coupled to inputs of a superscalar instruction decoder, and control the rotator network in accordance with the starting position of the set of instructions to output the first and subsequent instructions of the set of instructions to first and subsequent inputs of the superscalar decoder.
- Each line may contain the same number of instruction words as contained in an instruction window, or may contain more instruction words than contained in an instruction window.
- a processor as described herein may comprise a memory storing lines of superscalar instructions, a rotator for receiving at least portions of two lines of superscalar instructions that together contain a set of instructions, and a superscalar decoder having a set of inputs for receiving corresponding first and subsequent instructions of a superscalar instruction window, the rotator network providing the first and subsequent superscalar instructions of the instruction window from within the at least portions of two lines of instructions to the corresponding inputs of the superscalar decoder.
- the rotator may comprise a set of outputs corresponding in number to the number of superscalar instructions in a superscalar instruction window, and further corresponding to positions of instructions within the at least portions of two lines of instructions within the rotator.
- the rotator network may reorder the instructions of the at least portions of two lines of superscalar instructions within the rotator network to associate the first and subsequent superscalar instructions of the superscalar instruction window with first and subsequent outputs of the rotator network coupled to corresponding inputs of the superscalar decoder.
- the rotator network may reorder the positions of the instructions by rotating the instructions of the at least portions of two lines within the rotator. The reordering may be performed in accordance with a known position of a first instruction of the instruction window within the at least portions of two lines.
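- The rotator's effect can be modeled in software as below, assuming 8-word lines and an 8-instruction window (both illustrative sizes): the two adjacent lines are concatenated and the window's starting offset selects which instruction lands on decoder input 0, with subsequent instructions following in order.
```c
#include <stdio.h>

#define LINE_WORDS   8   /* hypothetical instruction words per memory line */
#define WINDOW_WORDS 8   /* instructions delivered to the decoder at once  */

/* Rotate the two fetched lines so that the instruction at 'start' within
   the first line appears at decoder input 0, the next at input 1, etc.
   Because each line is at least a full window wide, two adjacent lines
   always contain the whole window regardless of alignment. */
static void deliver_window(const unsigned line0[LINE_WORDS],
                           const unsigned line1[LINE_WORDS],
                           int start, unsigned decoder_in[WINDOW_WORDS])
{
    unsigned both[2 * LINE_WORDS];
    for (int i = 0; i < LINE_WORDS; i++) {
        both[i] = line0[i];
        both[LINE_WORDS + i] = line1[i];
    }
    for (int i = 0; i < WINDOW_WORDS; i++)
        decoder_in[i] = both[start + i];   /* rotator output i */
}

int main(void)
{
    unsigned line0[LINE_WORDS] = { 0, 1, 2, 3, 4, 5, 6, 7 };
    unsigned line1[LINE_WORDS] = { 8, 9, 10, 11, 12, 13, 14, 15 };
    unsigned window[WINDOW_WORDS];
    deliver_window(line0, line1, 5, window);   /* window starts mid-line */
    for (int i = 0; i < WINDOW_WORDS; i++)
        printf("decoder input %d <- instruction %u\n", i, window[i]);
    return 0;
}
```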
- a processor as described herein may also implement a method to address a memory line of a non-power of 2 multi- word wide memory in response to a linear address.
- the method may involve shifting the linear address by a fixed number of bit positions, and using high order bits of a sum of the shifted linear address and the unshifted linear address to address a memory line.
- the linear address may be shifted to the right or the left to achieve the desired position.
- the method may involve shifting the linear address by a fixed number of bit positions, adding the shifted linear address to the unshifted linear address to form an intermediate address, retaining a subset of high order address bits of the intermediate address as a modulo index, and using low order address bits of the intermediate address and the modulo index in a conversion process to obtain a starting position within a selected memory line.
- the conversion process may use a look-up table or a logic array.
- the method may involve shifting the linear address by a fixed number of bit positions, adding the shifted linear address to the unshifted linear address to form an intermediate address, retaining a subset of low order address bits of the intermediate address as a modulo index, and using the modulo index in a conversion process to obtain a starting position within a selected memory line.
- the method may involve isolating a subset of low order address bits of the linear address as a modulo index, and using the modulo index in a conversion process to obtain a starting position within a selected memory line.
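- A minimal sketch of the line/offset split for a non-power-of-2 line width, assuming a hypothetical 12-word-wide memory and a 16-bit fixed-point reciprocal (both illustrative constants, not taken from the patent): the reciprocal multiply stands in for the hardware's single shift-and-add, and the starting position within the line, which the patent obtains through a look-up table or logic array, is computed directly here.
```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry: 12-word-wide memory lines. */
#define LINE_WIDTH  12u
/* Fixed-point reciprocal of 12: ceil(2^16 / 12) = 5462; exact for addresses < 8192. */
#define RECIP       5462u
#define RECIP_SHIFT 16u

static void split_address(uint32_t linear, uint32_t *line, uint32_t *offset)
{
    /* High-order bits of (linear * reciprocal) select the memory line;
       in hardware the reciprocal would be folded into shifts and adds. */
    *line = (linear * RECIP) >> RECIP_SHIFT;
    /* Residue converted to a starting position within the selected line
       (the patent performs this step with a look-up table or logic array). */
    *offset = linear - *line * LINE_WIDTH;
}

int main(void)
{
    for (uint32_t a = 0; a < 8192; a++) {
        uint32_t line, off;
        split_address(a, &line, &off);
        assert(line == a / LINE_WIDTH && off == a % LINE_WIDTH);
    }
    printf("line/offset split verified for a 12-word-wide memory\n");
    return 0;
}
```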
- a processor as described herein may further perform an operation on first and second operand data having respective operand formats.
- the device may comprise a first hardware register specifying a type attribute representing an operand format of the first data, a second hardware register specifying a type attribute representing an operand format of the second data, an operand matching logic circuit determining a common operand format to be used for both of the first and second data in performing the operation based on the first type attribute of the first data and the second type attribute of the second data, and a functional unit that performs the operation in accordance with the common operand type.
- a related method as described herein may include specifying an operation type attribute representing an operation format of the operation, specifying in a hardware register an operand type attribute representing an operand format of data to be used by the operation, determining an operand conversion to be performed on the data to enable performance of the operation in accordance with the operation format based on the operation format and the operand format of the data, and performing the determined operand conversion.
- the operation type attribute may be specified in a hardware register or in a processor instruction.
- the operation format may be an operation operand format or an operation result format.
- a related method as described herein may include specifying in a hardware register an operation type attribute representing an operation format, specifying in a hardware register an operand type attribute representing a data operand format, and performing the operation in a functional unit of the computer in accordance with the specified operation type attribute and the specified operand type attribute.
- the operation format may be an operation operand format or an operation result format.
- a related method as described herein may provide an operation that is independent of data operand type.
- the method may comprise specifying in a hardware register an operand type attribute representing a data operand format of said data operand, and performing the operation in a functional unit of the computer in accordance with the specified operand type attribute.
- the method may comprise specifying in a first hardware register an operand type attribute representing an operand format of a first data operand, specifying in a second hardware register an operand type attribute representing an operand format of a second data operand, determining in an operand matching logic circuit a common operand format to be used for both of the first and second data in performing the operation based on the first type attribute of the first data and the second type attribute of the second data, and performing the operation in a functional unit of the computer in accordance with the determined common operand format.
- a related method for performing operand conversion in a computer device as described herein may comprise specifying in a hardware register an original operand type attribute representing an original operand format of operand data, specifying in a hardware register a converted operand type attribute representing a converted operand format to which the operand data is to be converted, and converting the data from the original operand format to the converted operand format in an operand format conversion logic circuit in accordance with the original operand type attribute and the converted operand type attribute.
- the operand conversion may occur automatically when a standard computational operation is requested.
- the operand conversion may implement sign extension for an operand having an original operand type attribute indicating a signed operand, zero fill for an operand having an original operand type attribute indicating an unsigned operand, positioning for an operand having an original operand type attribute indicating operand position, positioning for an operand in accordance with a converted operand type attribute indicating a converted operand position, or one of fractional, integer and exponential conversion for an operand according to the original operand type attribute or the converted operand type attribute.
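- A minimal sketch of the tagging idea, using a hypothetical type attribute of size, signedness and a fractional flag: matching logic picks a common format (the wider size, with fractional taking precedence over integer), and the conversion step sign-extends signed operands and zero-fills unsigned ones, as described above.
```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical operand type attribute, as would be held in a tag register. */
typedef struct {
    int bits;        /* 8, 16 or 32 */
    int is_signed;   /* sign-extend vs zero-fill on promotion */
    int fractional;  /* fractional vs integer interpretation  */
} type_attr_t;

/* Operand matching logic: choose one common format for both operands. */
static type_attr_t match(type_attr_t a, type_attr_t b)
{
    type_attr_t common;
    common.bits       = a.bits > b.bits ? a.bits : b.bits;   /* widest size */
    common.is_signed  = a.is_signed || b.is_signed;
    common.fractional = a.fractional || b.fractional;        /* frac wins   */
    return common;
}

/* Operand conversion: promote a raw value to a 32-bit container according
   to its original tag (sign extension for signed, zero fill for unsigned). */
static int32_t promote(uint32_t raw, type_attr_t t)
{
    if (t.bits == 32) return (int32_t)raw;
    uint32_t mask = (1u << t.bits) - 1u;
    uint32_t v = raw & mask;
    if (t.is_signed && (v & (1u << (t.bits - 1))))
        v |= ~mask;                      /* sign extension */
    return (int32_t)v;                   /* zero fill otherwise */
}

int main(void)
{
    type_attr_t x_t = { 16, 1, 0 };      /* signed 16-bit integer   */
    type_attr_t y_t = {  8, 0, 1 };      /* unsigned 8-bit fraction */
    type_attr_t c   = match(x_t, y_t);
    printf("common format: %d-bit %s %s\n", c.bits,
           c.is_signed ? "signed" : "unsigned",
           c.fractional ? "fractional" : "integer");
    printf("promote(0xFF80 as s16) = %d\n", promote(0xFF80u, x_t));
    return 0;
}
```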
- Another method in a device as described herein may conditionally perform operations on elements of a vector.
- the method may comprise generating a vector enable mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, generating a vector conditional mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, and, for each of the elements, applying logic to the vector enable mask bit and vector conditional mask bit that correspond to that element to determine if an operation is to be performed for that element.
- the logic may require the vector enable bit corresponding to an element to be set to enable an operation on the corresponding element to be performed.
- a related method as described herein may nest conditional controls for elements of a vector.
- the method may comprise generating a vector enable mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, generating a vector conditional mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, saving the vector enable mask to a temporary storage location, generating a nested vector enable mask comprising a logical combination of the vector enable mask with the vector conditional mask, and using the nested vector enable mask as a vector enable mask for a subsequent vector operation.
- the logical combination may use a bitwise "and” operation, a bitwise “or” operation, a bitwise “not” operation, or a bitwise "pass” operation.
- An alternative method may comprise generating a vector enable mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, generating a vector conditional mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, saving the vector enable mask to a temporary storage location, generating a nested vector enable mask by performing a bitwise "and" of the vector enable mask with the vector conditional mask, and using the nested vector enable mask as a vector enable mask for a subsequent vector operation.
- a further alternative method may comprise generating a vector enable mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, generating a vector conditional mask comprising a plurality of bits, each bit corresponding to a respective element of a vector, saving the vector enable mask to a temporary storage location, generating a nested vector enable mask by performing a bitwise "and" of the vector enable mask with a bitwise "not” of the vector conditional mask, and using the nested vector enable mask as a vector enable mask for a subsequent vector operation.
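- The nesting above can be modeled with bit masks, as in the sketch below (an 8-element vector is assumed): the enable mask is saved, AND-ed with the condition mask for the "if" path and with its complement for the "else" path, and then restored.
```c
#include <stdint.h>
#include <stdio.h>

#define NUM_ELEMS 8

/* Perform an element operation only where the enable mask allows it. */
static void guarded_add(int32_t v[NUM_ELEMS], int32_t inc, uint8_t enable)
{
    for (int i = 0; i < NUM_ELEMS; i++)
        if (enable & (1u << i))
            v[i] += inc;
}

int main(void)
{
    int32_t v[NUM_ELEMS] = { -3, 5, -1, 7, 0, -9, 2, 4 };
    uint8_t enable = 0xFF;                 /* all elements enabled       */

    /* Condition mask: one bit per element, e.g. "element is negative". */
    uint8_t cond = 0;
    for (int i = 0; i < NUM_ELEMS; i++)
        if (v[i] < 0) cond |= (uint8_t)(1u << i);

    uint8_t saved = enable;                /* save to temporary storage  */
    enable = saved & cond;                 /* nested mask for "if" body  */
    guarded_add(v, 100, enable);           /* only negatives get +100    */
    enable = saved & (uint8_t)~cond;       /* nested mask for "else"     */
    guarded_add(v, -1, enable);            /* non-negatives get -1       */
    enable = saved;                        /* restore on exit            */

    for (int i = 0; i < NUM_ELEMS; i++) printf("%d ", v[i]);
    printf("\n");
    return 0;
}
```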
- a device as described herein may also implement a method to improve responsiveness to program control operations.
- the method may comprise providing a separate computational unit designed for program control operations, positioning the separate computational unit early in the pipeline thereby reducing delays, and using the separate computation unit to produce a program control result early in the pipeline to control the execution address of a processor.
- a related method may improve the responsiveness to an operand address computation.
- the method may comprise providing a separate computational unit designed for operand address computations, positioning said separate computational unit early in the pipeline thereby reducing delays, and using said separate computation unit to produce a result early in the pipeline to be used as an operand address.
- a vector processor as described herein may further comprise a vector of multipliers computing multiplier results; and an array adder computational unit computing an arbitrary linear combination of the multiplier results.
- the array adder computational unit may have a plurality of numeric inputs that are added, subtracted or ignored according to a control vector comprising the numeric values 1, -1 and 0, respectively.
- the array adder computational unit may comprise at least 4 or at least 8 inputs, and may comprise at least 4 outputs.
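- The array adder's linear combinations can be sketched as a small matrix of 1/-1/0 controls applied to the multiplier results; the 8-input, 4-output shape below matches the minimum sizes mentioned above but is otherwise illustrative.
```c
#include <stdint.h>
#include <stdio.h>

#define AAU_INPUTS  8
#define AAU_OUTPUTS 4

/* Each output is an arbitrary add/subtract/ignore combination of the inputs,
   selected by a control vector of 1, -1 and 0 per input. */
static void array_add(const int32_t in[AAU_INPUTS],
                      const int8_t ctrl[AAU_OUTPUTS][AAU_INPUTS],
                      int64_t out[AAU_OUTPUTS])
{
    for (int o = 0; o < AAU_OUTPUTS; o++) {
        int64_t acc = 0;                 /* wider accumulator stands in for guard bits */
        for (int i = 0; i < AAU_INPUTS; i++)
            acc += (int64_t)ctrl[o][i] * in[i];
        out[o] = acc;
    }
}

int main(void)
{
    int32_t m[AAU_INPUTS] = { 1, 2, 3, 4, 5, 6, 7, 8 };   /* multiplier results */
    /* Output 0: full sum; 1: partial sum of the first half;
       2: butterfly-style difference; 3: alternating +/- combination. */
    int8_t ctrl[AAU_OUTPUTS][AAU_INPUTS] = {
        { 1, 1, 1, 1, 1, 1, 1, 1 },
        { 1, 1, 1, 1, 0, 0, 0, 0 },
        { 1,-1, 0, 0, 0, 0, 0, 0 },
        { 1,-1, 1,-1, 1,-1, 1,-1 },
    };
    int64_t q[AAU_OUTPUTS];
    array_add(m, ctrl, q);
    for (int o = 0; o < AAU_OUTPUTS; o++)
        printf("Q[%d] = %lld\n", o, (long long)q[o]);
    return 0;
}
```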
- a device as described herein may further provide an indication of a processor attempt to access an address yet to be loaded or stored.
- the device may comprise a current bulk transfer address register storing a current bulk transfer address, an ending bulk transfer address register storing an ending bulk transfer address, a comparison circuit coupled to the current bulk transfer address register and the ending bulk transfer address register, and to the processor, to provide a signal to the processor indicating whether an address received from the processor is between the current bulk transfer address and the ending bulk transfer address.
- the device may further produce a stall signal for stalling the processor until transfer to the address received from the processor is complete, or an interrupt signal for interrupting the processor to inform the processor that data at the address is unavailable.
- a related device may comprise a current bulk transfer address register storing a current bulk transfer address, and a comparison circuit coupled to the current bulk transfer address register and to the processor to provide a signal to the processor indicating whether a difference between the current bulk transfer address and an address received from the processor is within a specified stall range.
- the signal produced by the device may be a stall signal for stalling the processor until transfer to the address received from the processor is complete, or an interrupt signal for interrupting the processor to inform the processor that data at the address is unavailable.
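- A behavioural sketch of the hazard check (the register and structure names are illustrative): an access falling between the current and ending bulk-transfer addresses has not yet been transferred, so the comparison circuit raises a stall or interrupt request.
```c
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint32_t current;   /* next address the bulk transfer engine will move */
    uint32_t ending;    /* last address of the programmed bulk transfer    */
} bulk_xfer_t;

/* Returns nonzero (stall/interrupt request) if the processor tries to use
   an address whose data has not yet been loaded or stored by the transfer. */
static int hazard(const bulk_xfer_t *bt, uint32_t access_addr)
{
    return access_addr >= bt->current && access_addr <= bt->ending;
}

int main(void)
{
    bulk_xfer_t bt = { 0x1000, 0x1FFF };   /* transfer still in progress */
    printf("access 0x0800 -> %s\n", hazard(&bt, 0x0800) ? "stall" : "ok");
    printf("access 0x1800 -> %s\n", hazard(&bt, 0x1800) ? "stall" : "ok");
    printf("access 0x2400 -> %s\n", hazard(&bt, 0x2400) ? "stall" : "ok");
    return 0;
}
```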
- a device as described herein may further implement a method of controlling processing, comprising receiving an instruction to perform a vector operation using one or more vector data operands, and determining a number of vector data elements of the one or more vector data operands to be processed by the vector operation based on a number of vector data elements that constitute each vector data operand and a number of hardware elements available to perform the vector operation.
- the method may comprise receiving instructions to perform a plurality of vector operations, each vector operation using one or more vector data operands, for each of the plurality of vector operations, determining a number of vector data elements of each of the one or more vector data operands to be processed by the vector operation based on a number of vector data elements that constitute each vector data operand of the operation and a number of hardware elements available to perform the vector operation, and determining a number of vector data elements to be processed by all of the plurality of operations by comparing the number of vector data elements to be processed for each respective vector operation.
- a device as described herein may also implement a method for performing a vector operation on all data elements of a vector, comprising: setting a loop counter to a number of vector data elements to be processed, performing one or more vector operations on vector data elements of the vector, determining a number of vector data elements processed by the vector operations, subtracting the number of vector data elements processed from the loop counter, determining, after subtraction, whether additional vector data elements remain to be processed, and if additional vector data elements remain to be processed, performing further vector operations on remaining data elements of the vector.
- the method may further include reducing a number of vector data elements processed by the vector processor to accommodate a partial vector of data elements on a last loop iteration.
- a related method for reducing a number of operations performed for a last iteration of a processing loop may comprise setting a loop counter to a number of vector data elements to be processed, performing one or more vector operations on data elements of the vector, determining a number of vector data elements processed by the vector operations, subtracting the number of vector data elements processed from the loop counter, determining, after subtraction, whether additional vector data elements remain to be processed, and, if additional vector data elements remain to be processed and the number of additional vector data elements to be processed is less than a full vector of data elements, reducing either the number of available elements used to perform the vector operations or the number of vector data elements available for the last loop iteration.
- a device as described herein may also implement a method for controlling processing in a vector processor that comprises performing one or more vector operations on data elements of a vector, determining a number of data elements processed by the vector operations, and updating an operand address register by an amount corresponding to the number of data elements processed.
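- Taken together, the counter-based loop control and address update above amount to processing min(remaining, hardware elements) elements per pass, as in this sketch with 8 hardware elements assumed; the final pass automatically shrinks to the partial vector that remains, and the operand address advances by the number of elements processed.
```c
#include <stdio.h>

#define HW_ELEMENTS 8   /* hardware elements available per vector operation */

/* Scale every element of a vector, processing up to HW_ELEMENTS per pass. */
static void vector_scale(int *data, int total, int factor)
{
    int remaining = total;                 /* loop counter = elements left */
    int index = 0;                         /* acts as the operand address  */
    while (remaining > 0) {
        int n = remaining < HW_ELEMENTS ? remaining : HW_ELEMENTS;
        for (int i = 0; i < n; i++)        /* one "vector operation"       */
            data[index + i] *= factor;
        index     += n;                    /* post-update operand address  */
        remaining -= n;                    /* subtract elements processed  */
    }
}

int main(void)
{
    int v[19];
    for (int i = 0; i < 19; i++) v[i] = i;
    vector_scale(v, 19, 2);                /* 8 + 8 + 3-element last pass  */
    printf("v[18] = %d\n", v[18]);         /* prints 36                    */
    return 0;
}
```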
- a device as described herein may also implement a method for performing a loop operation.
- the method may comprise storing, in a match register, a value to be compared to a monitored register, designating a register as the monitored register, comparing the value stored in the match register with a value stored in the monitored register, and responding to a result of the comparison in accordance with a program-specified condition by one of branching or repeating a desired sequence of program instructions, thereby forming a program loop.
- the program specified condition may be one of equal to, not equal to, less than, less than or equal to, greater than, or greater than or equal to.
- the register to be monitored may be an address register.
- the program-specified condition may be an absolute difference between the value stored in the match register and the value stored in the address register, and responding to the result of the comparison may further comprise reducing a number of vector data elements to be processed on a last iteration of a loop.
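- A software model of the match-register loop (register names are illustrative): the monitored address register is compared against the match value on each pass, the loop repeats while the program-specified "not equal" condition holds, and the element count of the final pass is reduced when the remaining distance is less than a full vector.
```c
#include <stdio.h>
#include <stdlib.h>

#define HW_ELEMENTS 8

int main(void)
{
    int data[20];
    for (int i = 0; i < 20; i++) data[i] = 1;

    int addr_reg  = 0;      /* monitored register: operand address       */
    int match_reg = 20;     /* loop ends when the address reaches this   */
    int sum = 0;

    /* Repeat while the program-specified condition ("not equal") holds. */
    while (addr_reg != match_reg) {
        /* Reduce the element count when the remaining distance between
           the monitored register and the match value is a partial vector. */
        int remaining = abs(match_reg - addr_reg);
        int n = remaining < HW_ELEMENTS ? remaining : HW_ELEMENTS;
        for (int i = 0; i < n; i++)
            sum += data[addr_reg + i];
        addr_reg += n;      /* post-modified address, re-compared next pass */
    }
    printf("sum = %d\n", sum);   /* 20 */
    return 0;
}
```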
- a device as described herein may also implement a method of processing interrupts.
- the method may comprise monitoring an interrupt line for a signal indicating an interrupt to the superscalar processor, upon detection of an interrupt signal, fetching a group of instructions to be executed in response to the interrupt, and inhibiting in hardware an address update of a program counter, and executing the group of instructions.
- the group of instructions may include an instruction to disable further interrupts and an instruction to call a routine.
- a device as described herein may therefore perform a method comprising receiving an instruction, determining whether a vector satisfies a condition specified in the instruction, and, if the vector satisfies the condition specified in the instruction, branching to a new instruction.
- the condition may comprise a vector element condition specified in at least one of a vector enable mask and a vector condition mask.
- a device as described herein may also implement a method of providing a vector of data as a vector processor operand.
- the method may comprise obtaining a line of data containing at least a vector of data to be provided as the vector processor operand, providing the line of data to a rotator network along with a starting position of said vector of data within the line, the rotator network having respective outputs coupled to vector processor operand data inputs, and controlling the rotator network in accordance with the starting position of the vector of data to output the first and subsequent data elements of the vector of data to first and subsequent operand data inputs of the vector processor.
- a related method may comprise obtaining at least a portion of a first line of vector data containing at least a portion of a vector processor operand, obtaining at least a portion of a second line of vector data containing at least a remaining portion of said vector processor operand, providing the at least a portion of said first line of vector data and the at least a portion of said second line of vector data to a rotator network along with a starting position of said vector data, the rotator network having respective outputs coupled to vector processor operand data inputs, and controlling the rotator network in accordance with the starting position of the vector data to output the first and subsequent vector data elements to first and subsequent operand data inputs of the vector processor.
- a device as described herein may also implement a method to read a vector of data for a vector processor operand.
- the method may comprise reading into a local memory device a series of lines from a larger memory, obtaining from the local memory device at least a portion of a first line containing a portion of a vector processor operand, obtaining from the local memory device at least a portion of a second line containing a remaining portion of the vector processor operand, providing the at least a portion of the first line of vector data and the at least a portion of the second line of vector data to a rotator network along with a starting position of the vector data, the rotator network having respective outputs coupled to vector processor operand data inputs, and controlling the rotator network in accordance with the starting position of the vector data to output first and subsequent vector data elements to first and subsequent vector processor operand data inputs.
- Figure 1 shows an L-Hardware Element Vector Processor or L-Slice Super-Scalar Processor
- Figure 2 shows the Main Functional Units
- Figure 3 shows the Processor Pipeline
- Figure 4 shows the Placement Positions
- Figure 5 shows a VMU Element Pair
- Figure 6 shows High Word Detect Logic
- Figure 7 shows Basic Multiplier Cell
- Figure 8 shows a Summation Network
- Figure 9 shows an Array Adder Element
- Figure 10 shows an Array Adder Element Segments and Placement
- Figures 11a and 11b show an AAU Operand Promotion
- Figure 12 shows an Optimized Array Adder Element
- Figure 13 shows a VALU Element
- Figure 14 shows a VALU Element Segments and Placement
- Figures 15a and 15b show a VALU Operand Promotion
- Figure 16 shows a Demotion/Promotion Process
- Figure 17 shows a Fractional/Integer Value Demotion.
- Figure 18 shows a Size Demotion Hardware
- Figure 19 shows the Packer
- Figure 20 shows the Spreader
- Figure 21 shows a Size Promotion Hardware
- Figure 22 shows the Detailed Processor Pipeline
- Figure 23 shows the Overall Processor Data Flows
- Figure 24 shows a Double Clocked Memory Access Plan
- Figure 25 shows the Vector Prefetch and Load Units
- Figure 26 shows the Detailed Vector Prefetch and Load Units
- Figure 27 shows a Vector Rotator and Alignment
- Figure 28 shows a Vector Rotator Control
- Figure 29 shows Vector Operand Alignment Examples
- Figure 30 shows a Vector Operand Prefetch
- Figure 31 shows a Processor Pipeline Operation
- Figure 32 shows a Processor Pipeline Operation
- Figure 33 shows a Bulk Memory Transfer Hazard Detection
- Figure 34 shows the Instruction Prefetch and Fetch Units
- Figure 35 shows the Instruction Fetch Alignment
- Figure 36 shows the Detailed Instruction Prefetch and Fetch Units
- Figure 37 shows an Instruction Rotator
- Figure 38 shows an Instruction Rotator Control
- Figures 39a and 39b show an Instruction Grouping, Routing and Decoding;
- Figure 40 shows a Non-Power of 2 Memory Access;
- Figure 41 shows a Non-Power of 2 Memory Access Alternative Implementation 1;
- Figure 42 shows a Non-Power of 2 Memory Access Alternative Implementation 2;
- Figure 43 shows a Full 16 Element Rotator;
- Figure 44 shows an 11 Element to 10 Position Rotator;
- Figure 45 shows a Fractional Memory Alignment.
- Functional Unit - Dedicated hardware defined for certain tasks (functions). May refer to individual functional unit elements or to a vector of functional units.
- Computational Unit - Dedicated hardware (functional unit) designed for arithmetic operations.
- the VALU is a computational unit with its main purpose being arithmetic operations.
- Execution Unit - Same as a computational unit.
- Element - Hardware or a vector can be broken down into word size units. These units are referred to as elements.
- Hardware Element - A computational/execution unit is composed of duplicated hardware blocks called hardware elements.
- the VALU can add 8 words because it has 8 duplicated hardware elements that each add a word.
- Hardware elements are always 32 bits.
- Data Element - refers to data components of a data vector. Data elements may be in all the different sizes supported by the processor, 8, 16 or 32 bit.
- Slice - A set of hardware related to a particular element of the vector processor.
- a slice is usually selected by a particular destination register (Rn).
- Segment - A portion of a hardware element of the vector processor that allows processing of a smaller width operand.
- a single segment is used to operate on 8-bit elements (12 bits with guard).
- a pair of segments is used together to operate on 16-bit elements (24 bits with guard).
- all four segments are used to operate on a 32-bit element (48 bits with guard).
- Integer - An ordinary number (natural number) that may be all positive values (unsigned) or have both positive and negative values (signed).
- Fractional - A common representation used to express numbers in the range of [-1, 1) as a signed fractional number or [0, 2) as an unsigned fractional number.
- the most significant bit of the fractional number contains either a sign bit (for a signed fractional number) or an integer bit (for an unsigned fractional number).
- the next two most significant bits represent the fractions ½ and ¼ respectively, and so on.
- Exponential - A conventional floating-point number in IEEE single or double precision format. (The conventional name, "float", is not used as the single letter representation "F" is used for Fractional, hence, the name Exponential is used.)
- L - Usually refers to the hardware vector length. May refer to a Low piece of data when used as a subscript.
- H - Refers to a High piece of data when used as a subscript.
- G - Refers to the Guard bits in the extended precision registers.
- [n:m] - Represents a range of registers or bits arranged from the most significant, "n", to the least significant, "m".
- TOVEN - Tolon Vector Engine
- the Tolon Vector Engine (TOVEN) processor family uses an expandable base architecture optimized for digital signal processing (DSP) and other numeric-intensive applications. Specifically, the vector processor has been optimized for neural networks, FFTs, adaptive filters, DCTs, wavelets, Viterbi trellises, Turbo decoding and, in general, linear-algebra-intensive algorithms. Through the use of super-scalar instruction execution, control operations common in the physical layer processing for applications such as 802.11a/b/g wireless, GPRS and xDSL (ADSL, HDSL and VDSL) may be accommodated with a complementary performance increase. Multi-channel algorithm implementations for speech and wireline modems are supported through the consistent use of guarded operations.
- the TOVEN processor family is implemented as a super-scalar pipelined parallel vector processor using RISC-like instruction encoding. RISC instructions are generally regular, easy to decode, and can be quickly categorized by the TOVEN decoder. Certain instruction categories may require more complex decoding than others, and this is provided after the grouping. All instructions (with encoded operands) are currently 16 bits. Some non-vector instructions may specify an optional 16- or 32-bit constant following the instruction.
- the processor may operate in either Vector or Super-scalar mode (referred to as Register mode). Figure 1 illustrates the concurrent assignment of functional units for Vector mode and independent use of hardware "slices" in Register mode.
- the processing of data in Vector mode is SIMD (single instruction, multiple data) using multiple hardware elements. These processing hardware elements are duplicated to permit the parallel processing of data in Vector mode but also provide independent element "slices" for Register mode. Where processing hardware is not duplicated, pipeline logic is implemented to automatically reuse the available hardware within a pipeline stage to implement the programmer-specified operation transparently using two or more clock cycles rather than a single cycle.
- in Vector mode, 8-, 16- or 32-bit data sizes are supported, and a fixed size of 32 bits is used for Register mode.
- the native hardware elements operate on a 32-bit word size (optional 64 bit in future versions).
- up to 8 instructions may be issued in a single clock cycle.
- Vector or Register instructions are assigned to a particular instruction decoder.
- in Register mode, a traditional model is used whereby the instructions are assigned to the functional unit to which they pertain.
- the instructions in Register mode are directed through a "slice" of the vector-processing pipeline, where each "slice" normally corresponds to an element of the resulting vector. This permits super-scalar processing to exploit all hardware elements of the vector processor.
- the super-scalar processor may dispatch up to 8 instructions per clock cycle.
- the processor groups and assembles vector instructions from the super-scalar instruction stream and creates a very wide, multistage pipeline-instruction which operates in lock-step order on the various components of the vector processor.
- EPIC and VLIW instruction processors may offer similar vector performance using the technique of loop unrolling, but this requires many registers and an unnecessarily large code size.
- VLIW and EPIC processors further impose restricted combinations of instructions which a programmer or compiler must honor.
- in the TOVEN, assembling the multistage pipeline-instruction from smaller constituent vector instructions (primitive instructions) allows a programmer to specify only those operations required, without a need for filler functional-unit-specific NOPs.
- the TOVEN processor is well suited for pipelined operations. In a standard configuration, each functional unit occupies its own pipeline stage. This standard implementation uses an 11-stage pipeline. With the use of vector element-guarded operations, the vector-processing pipeline is well suited for super-pipelining, whereby the number of pipeline stages may be increased 3 to 4x while the clock rate may be increased into the GHz range. In order to provide responsiveness for program control purposes, a simple Scalar ALU is provided with a short pipeline.
- Program control logic, address computations and other simple general calculations and logic may be implemented in the Scalar ALU, and results are immediately available early in the pipeline. Where necessary, the pipeline implements a distributed control and hazard detection model to resolve resource contention, operand hazards and simulation of additional parallel hardware. Implementation of hardware-based control allows programs to be developed independently, without being built around the avoidance of hazard conditions. Of course, the best program would exploit full knowledge of hazards and avoid them where possible, but programmer-friendly, softly degraded performance is far better than a hard error condition.
- This manual provides a description of the processor family architecture, complete reference material for programmers and software examples for common signal, image and other applications. Additional application information is available in a companion manual.
- Table 1-1 shows the architecture configuration options for the Tolon Vector Engine Processor Family.
- the number of hardware elements and the width of the data memories are configurable based on the acceleration necessary. These sizes need not be powers of two.
- the TOVEN processor family is designed for the efficient support of DSP algorithms. 8, 16 and 32-bit sizes (Byte, Half-Word and Word) as signed/unsigned integer or fractional types are supported.
- Optional data formats include long integer or fractional (64 bit), compact floating point (16 bit in 6.10 format), IEEE single precision (32 bit) and IEEE double precision (64 bit) floating point operands.
- Extended precision accumulation for integer and fractional is supported with the following ranges: 48 bit for accumulating 32-bit numbers, 24 bit for accumulating 16-bit numbers, and 12 bit for accumulating 8-bit numbers. Rounding and shift operations are supported as per the ETSI basic speech primitives and for clipping/limiting of video data.
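- As an illustration of the guard-bit ranges quoted above, the sketch below accumulates 16-bit values in a 24-bit accumulator (held in a wider C integer) and saturates back to 16 bits on extraction, in the spirit of the ETSI basic operations; the processor's exact rounding behaviour is not reproduced.
```c
#include <stdint.h>
#include <stdio.h>

/* Accumulate 16-bit samples in a 24-bit accumulator (8 guard bits), then
   saturate the result back to 16 bits when it is extracted/stored. */

static int32_t sat24(int64_t x)   /* keep the accumulator inside 24 bits */
{
    if (x >  0x7FFFFF) return  0x7FFFFF;
    if (x < -0x800000) return -0x800000;
    return (int32_t)x;
}

static int16_t sat16(int32_t x)   /* clip to the 16-bit storage format   */
{
    if (x >  0x7FFF) return  0x7FFF;
    if (x < -0x8000) return -0x8000;
    return (int16_t)x;
}

int main(void)
{
    int16_t samples[64];
    for (int i = 0; i < 64; i++)
        samples[i] = 30000;             /* sum would overflow 16 bits     */

    int32_t acc = 0;                    /* 24-bit accumulator with guard  */
    for (int i = 0; i < 64; i++)
        acc = sat24((int64_t)acc + samples[i]);

    printf("24-bit accumulator: %d\n", acc);     /* 1,920,000: fits in 24 bits */
    printf("saturated 16-bit result: %d\n", sat16(acc));
    return 0;
}
```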
- the processor addressing modes (used for loading and storing registers) support post-address modification by positive or negative steps. Circular buffer addressing is also supported in hardware as part of the post-addressing operations.
- Table 1-2 summarizes the different data operand types, sizes, and formats.
- the TOVEN uses strongly typed operands and automatically performs type conversions (type-casting) according to the desired operation result. This is accomplished by "tagging" the data format in the appropriate registers. This tagging can be done manually or automatically allowing the programmer to take advantage of this feature or to treat it as transparent. This data format "tagging" is implicitly performed by most computer languages (such as C/C++) according to built-in rules for operating with mixed operands.
- VMU Vector Multiplier Unit
- AAU Array Adder Unit
- VALU Vector Arithmetic/Logic Unit
- SALU Scalar Arithmetic/Logic Unit
- Vector Results - M is the vector result from the VMU
- Q is the vector result from the AAU
- R is the primary result from the VALU
- T contains secondary results (such as division quotient) from the VALU.
- Data Address Generators - Dedicated multiple address generators supply addresses for X and Y vector operand access and result (M, Q, R, T) storage.
- Program Sequencer A program sequencer fetches groups of instructions for the superscalar instruction decoder. The sequencer supports XXX-cycle conditional branches and executes program loops with no overhead.
- each unit executes an element operation.
- the TOVEN is implemented in a series of interconnected vector units in a pipeline as shown in Figure 3.
- the Vector Pre-Fetch Unit (VPFU) (not shown) is responsible for accessing operands from the on-chip memory.
- the Vector Load Unit (VLU) responds to operand load instructions and delivers X and Y operands in the proper vector order to the execution units.
- the Vector Operand Conversion (VOC) is responsible for promoting and demoting operands as required for the concurrent operation(s).
- the Array Adder Unit (AAU) is responsible for the addition of vector elements from either the VMU, a prior VALU result or a memory vector operand.
- the Vector Arithmetic and Logic Unit (VALU) is responsible for classical ALU operations and implementation of the accumulate stage normally used in Multiply and Accumulate DSP operations.
- the Vector Write Unit (VWU) writes results back to the on-chip memory based on individual conditional controls for each element. Included within the result write path is a Vector Result Conversion (VRC) which rounds or saturates, converts formats, and reduces or increases precision.
- the on-chip memory is organized as a wide memory with the appearance of multiple access ports.
- the access ports are used for fetching the X and Y operands and writing the R result.
- Integral to the memory system is also a bulk transfer mechanism used for moving data to/from external bulk memory.
- a multistage instruction can be defined as a group of primitive instructions (opcodes) that would be grouped together.
- the multistage, single-cycle instruction to find the expected value of a vector given an accompanying probability vector is as follows:
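- The instruction listing itself is not reproduced in this text; purely as an illustration, the computation such a grouped multistage operation performs (load the X and Y vectors, multiply element-wise in the VMU, sum in the AAU, accumulate in the VALU) is the probability-weighted sum sketched below.
```c
#include <stdio.h>

#define VLEN 8   /* hypothetical hardware vector length */

int main(void)
{
    /* X operand: values; Y operand: their probabilities (summing to 1.0). */
    double x[VLEN] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    double p[VLEN] = { 0.05, 0.05, 0.10, 0.10, 0.20, 0.20, 0.15, 0.15 };

    double m[VLEN];            /* VMU stage: element-wise products         */
    for (int i = 0; i < VLEN; i++)
        m[i] = x[i] * p[i];

    double q = 0.0;            /* AAU stage: sum of the multiplier results */
    for (int i = 0; i < VLEN; i++)
        q += m[i];

    /* The VALU stage would accumulate q over successive vector loads for
       vectors longer than the hardware length; one pass is shown here.   */
    printf("expected value = %.3f\n", q);
    return 0;
}
```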
- the computational (execution) units of the TOVEN Processor are designed to support both Vector and Register mode instructions.
- Vector instructions make the hardware elements of a functional unit work together in Vector mode (SIMD).
- Register mode instructions make the hardware elements or the "slices" of a functional unit work independently.
- each element of a functional unit can be programmed in Register mode, but in Vector mode, all the elements in a particular functional unit are performing in SIMD and do not have to be individually programmed.
- Processor instructions are categorized as Vector (Type 7), Register (Types 4, 5 and 6) and General (Types 0, 1, 2 and 3). These instruction types are further described in Table 1-3.
- Vector and Register instruction groups are mutually exclusive as they both allocate the vector processor's pipeline functional resources according to different algorithms.
- in Vector mode, a vector load of each of X and Y, a vector multiply, an array addition, a vector ALU operation, and a vector write are executed together in one group (multistage instruction).
- in Register mode, one vector or scalar load of each of X and Y, any multiplication or ALU operation on an element of R, and a vector or scalar write are permitted to be executed together in one group.
- in either Vector or Register mode, most General instructions may be used. These include scalar/pointer load/store operations, immediate value set operations, scalar ALU operations, control transfer and miscellaneous operations.
- the vector computational units of the TOVEN Processor include the Vector Multiply Unit (VMU), the Array Adder Unit (AAU) and the Vector Arithmetic and Logic Unit (VALU).
- the scalar computations are performed in the Scalar Arithmetic and Logic Unit (SALU).
- SALU Scalar Arithmetic and Logic Unit
- the SALU is provided for performing simple computations for program control and initial addresses.
- the SALU is positioned early in the pipeline so that the effect of the full pipeline length can usually be avoided. This reduces penalties for branching and other change-of-control operations (calls and returns).
- VMU Vector Multiply Unit
- the Vector Multiply Unit operates on 8, 16 and 32-bit size data and produces 16, 32 and 32-bit results respectively.
- a result of a multiplication requires doubling the range of its operands.
- Multiplication of 32-bit data types in the VMU is limited to producing either the high or low 32-bit result.
- a high word result is needed when multiplying fractional numbers, whereas a low word result expresses the result of multiplying integer numbers.
- a mixed-mode fractional/integer multiplication is supported and the result is considered as fractional.
- Each multiplier hardware element (for a 32-bit word size) is responsible for operating with a mixture of signed and unsigned operands with both fractional and integer types: 1) four 8x8 integer/fractional multiplies to produce four 16-bit products
- the multiplier element also performs cross-wise multiplication (cross-product) of vectors that is used in multiplying real and imaginary parts in complex multiplication. For 32-bit operands, this exchange is performed outside of the basic element multiplier. For 16 and 8-bit operands, this exchange is performed within the multiplier element by computing appropriate partial products.
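- As a hedged illustration of why cross-wise products are useful, the standard complex-product identity (a + bi)(c + di) = (ac - bd) + (ad + bc)i can be assembled from one point-wise and one cross-wise multiply. The C sketch below models only the arithmetic, not the VMU datapath or its interleaving conventions.

```c
#include <stdio.h>

/* Complex vectors stored as interleaved {re, im} pairs.  Point-wise
 * products give a*c and b*d; cross-wise products give a*d and b*c.
 * Combining them (subtract/add) yields the complex product. Sketch only. */
static void complex_multiply(const float *x, const float *y, float *r, int pairs)
{
    for (int k = 0; k < pairs; k++) {
        float a = x[2*k], b = x[2*k + 1];
        float c = y[2*k], d = y[2*k + 1];
        float point0 = a * c, point1 = b * d;   /* point-wise terms */
        float cross0 = a * d, cross1 = b * c;   /* cross-wise terms */
        r[2*k]     = point0 - point1;           /* real part        */
        r[2*k + 1] = cross0 + cross1;           /* imaginary part   */
    }
}

int main(void)
{
    float x[2] = { 1.0f, 2.0f };   /* 1 + 2i */
    float y[2] = { 3.0f, 4.0f };   /* 3 + 4i */
    float r[2];
    complex_multiply(x, y, r, 1);
    printf("(%g) + (%g)i\n", r[0], r[1]);   /* -5 + 10i */
    return 0;
}
```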
- AAU Array Adder Unit
- the Array Adder Unit operates on 8, 16, and 32-bit size data and produces 12, 24, and 48-bit results respectively.
- the output data size is increased over the input data size because of guard bits.
- a matrix of this form allows the summation of an input vector (operand register), partial summation, permutation, and many other powerful transformations (such as an FFT, dyadic wavelet transform).
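- A minimal behavioral sketch of this operation is given below: each output is a signed combination of the input vector under a control matrix whose entries are assumed to take the values +1 (add), -1 (subtract) and 0 (exclude). The hardware segmentation and guard-bit growth are not modeled.

```c
#include <stdio.h>

#define N 8  /* number of inputs/outputs in this sketch */

/* Q = C * P where each C[j][k] is +1 (add), -1 (subtract) or 0 (exclude).
 * With a suitable C this yields a pass-through, a full or partial
 * summation, a permutation, or butterfly-style patterns. Model only. */
static void array_add(int C[N][N], const long long P[N], long long Q[N])
{
    for (int j = 0; j < N; j++) {
        long long acc = 0;
        for (int k = 0; k < N; k++) {
            if (C[j][k] > 0)      acc += P[k];
            else if (C[j][k] < 0) acc -= P[k];
        }
        Q[j] = acc;
    }
}

int main(void)
{
    int SUM[N][N];                        /* SUM pattern: every Qj = sum of all Pk */
    for (int j = 0; j < N; j++)
        for (int k = 0; k < N; k++)
            SUM[j][k] = 1;
    long long P[N] = {1, 2, 3, 4, 5, 6, 7, 8}, Q[N];
    array_add(SUM, P, Q);
    printf("Q[0] = %lld\n", Q[0]);        /* 36 */
    return 0;
}
```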
- the Vector Arithmetic and Logic Unit operates on 8, 16, 32-bit and also 12, 24, 48-bit size data producing a 12, 24 and 48-bit result respectively.
- the VALU input may be a result (stored in the R or Q register) from the AAU unit hence the support of 12, 24, 48-bit operand size is needed.
- register type "tagging" operand registers for the VALU can be different and the proper type cast will be performed automatically (transparent to the programmer).
- VALU The function of the VALU is to perform the traditional arithmetic, logical, shifting and rounding operations. Special considerations for ETSI routines are accommodated in overflow and shifting situations. Shift right operations should allow for optional rounding of the resulting LSB; shift left operations should allow for saturation.
- SALU Scalar Arithmetic and Logic Unit
- ALU instructions are supported with the result stored in a 32-bit register (the S register).
- the S register can be accessed by the VMU for vector-scalar multiplication.
- the conversion units of the TOVEN Processor include the Vector Operand Conversion (VOC) and Vector Result Conversion (VRC). Neither of these units responds to explicit instructions; rather, they perform the conversions as specified for the operations being performed with the operands being used.
- VOC Vector Operand Conversion
- VRC Vector Result Conversion
- VPFU Vector Pre-Fetch Unit
- the Vector Pre-Fetch Unit is responsible for accessing operands from the on-chip memory.
- VLU Vector Load Unit
- the Vector Load Unit (VLU) responds to operand load instructions and delivers X and Y operands in the proper vector order to the execution units.
- VWU Vector Write Unit
- VEM Vector Enable Mask
- VCM Vector Condition Mask
- Vector instructions execute unconditionally or use an Enabled condition, a True condition or a False condition.
- the Enabled condition, E executes if the corresponding bit in the Vector Enable Mask is one.
- the True condition, T executes if the corresponding bits in both the Vector Enable Mask and Condition Mask are one.
- the False condition, F executes if the corresponding bit in the Vector Enable Mask is a one and the corresponding bit in the Condition Mask is a zero. If no condition is specified, the instruction executes on all elements. Table 1-4 summarizes the vector instruction execution guards.
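- A sketch of the guard logic implied by Table 1-4 (assumed semantics paraphrasing the text, not a register-accurate model) is shown below; VEM and VCM are treated as simple bit masks indexed by element number.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { GUARD_NONE, GUARD_E, GUARD_T, GUARD_F } guard_t;

/* Returns true if element 'k' should execute under the given guard, based
 * on the Vector Enable Mask (vem) and Vector Condition Mask (vcm). */
static bool element_executes(guard_t g, uint32_t vem, uint32_t vcm, int k)
{
    bool e = (vem >> k) & 1u;
    bool c = (vcm >> k) & 1u;
    switch (g) {
    case GUARD_NONE: return true;        /* unconditional: all elements */
    case GUARD_E:    return e;           /* Enabled: VEM bit set        */
    case GUARD_T:    return e && c;      /* True: VEM and VCM bits set  */
    case GUARD_F:    return e && !c;     /* False: VEM set, VCM clear   */
    }
    return false;
}
```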
- the Vector Enable Mask is provided to facilitate the implementation of concurrent multi-channel algorithms such as vocoders.
- the Vector Enable Mask is used by a calling routine to selectively enable the channels (elements) for which the processing must be performed.
- the Vector Condition Mask register is used to enable/disable selective elements based on conditional codes.
- the looping mechanism works in multiples of the hardware vector length, such that if the hardware supports a vector length of 8, the loop count can be specified as 1/8th of the number of elements. Alternatively, the loop can be specified in the number of elements and decremented by the hardware vector length, VML or VAL. The last instantiation may even be partial, as the value of VML and/or VAL may be set to the remainder for the last pass through the loop. These temporarily changed values of VML and/or VAL may be restored upon completion of the loop. This mechanism allows software implementations to be independent of the hardware length of the vector units.
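- A hedged sketch of this strip-mining scheme: the loop is stepped by the hardware vector length and the final pass temporarily uses the remainder as the effective length, so the software loop count never depends on the hardware width. Names below are illustrative, not TOVEN registers.

```c
/* Process 'n' elements on hardware whose vector units are 'hw_len' wide.
 * The last pass may be partial; the effective length (cf. VML/VAL) is
 * temporarily reduced to the remainder and conceptually restored after. */
static void strip_mine(int n, int hw_len, void (*kernel)(int base, int len))
{
    int base = 0;
    while (n > 0) {
        int len = (n >= hw_len) ? hw_len : n;   /* remainder on last pass */
        kernel(base, len);
        base += len;
        n    -= len;
    }
}
```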
- Memory organization is Harvard with separate instruction and data memory. All data memory is however unified to be friendly to the compiler and programmer.
- pre-fetch operations (acting effectively as a cache) allow full-speed delivery of operands to the operational units.
- Data pre-fetch reads at least twice the amount of data consumed in any given clock cycle. This balances the throughput with respect to the consumption of pairs of data from different locations with the reading of sequential operands. Operands only need to be aligned according to their size to allow efficient access as on most RISC processors.
- the TOVEN implements a strongly typed system for identifying data operands and the conversions required for particular operations.
- Each data operand has characteristics of the following:
- Operand type may be Integer, Fractional or Exponential (floating point)
- Placement specifies positions 0 to 7 for Byte, 0 to 3 for Half-Word, 0 to 1 for Word, where 0 denotes the least significant position
- Placement refers to a position relative to a "virtual" 64-bit Long-Word and is used to identify the significance associated with each component data.
- Figure 4 illustrates the positions of Bytes, Half-Words and Words relative to a 64-bit Long-Word.
- Each position is type-aligned. For example, if one were accumulating 8-bit data (summing the elements of a vector, say y) with the result "r" being a 12-bit number, "position 0" would refer to bits 0 to 7 of r (r[7:0]) and "position 1" would refer to bits 8 to 11 of r (r[11:8]). In this case "position 1" would reference the guard bits. In reality, the accumulating register is 16 bits but only 12 bits are used, hence "position 1" provides just 4 bits of information. Exponential (floating point) support is currently not implemented, but is reserved for a future member of the TOVEN family.
- Fractional data is shown using either one sign or one integer bit with the rest of the bits as fractional.
- Other Fractional data formats may be used by the programmer by maintaining the location of the binary point (as on other DSPs).
- Table 2-1 summarizes the different data operand types, sizes, formats and placement:
- Double is S.11.52+1. A placement of 0 refers to the least significant position.
- operand-type information utilizes a "type register" associated with each operand and address pointer.
- The format of a type register is shown below in Table 2-2:
- the types are Fractional, Integer and Exponential.
- the operand type "Automatic" is used for automatic operand matching. The interpretation of "Automatic" is dependent on its use as an operand, operation, or result type.
- “Automatic” means the operand type is of the same type as the operation expects and hence no conversion is necessary.
- When used as an operation type the operation will be performed according to the type of its operands (operand matching logic is used to determine the common operation type). As a result type, "Automatic” is not used.
- Operand "size” and "position” are encoded into a common field. The position is enumerated from the least significant position to the most relative to a 64-bit word.
- a Byte may occupy any one of 8 positions, a Half-Word may occupy any one of 4 positions, a Word may occupy either of 2 positions, and a Long- Word may only be in one position.
- the size/position field value of "Unspecified" is used for operand matching of size and position properties but not of an operand type.
- the "sign" field indicates if the operand or result is to be considered Signed or Unsigned.
- This specification is used for multiplication and saturation.
- Multiplication uses the sign attributes of its operands to control its operation to be Signed/Signed, Unsigned/Unsigned or mixed.
- Saturation uses the sign attribute of its operand to control the saturation range (such as 0x8000 to 0x7fff for signed or 0x0000 to 0xffff for unsigned).
- the sign field of an operation type is unused.
- the type registers associated with vector data operands are:
- the destination registers, X[2:0] and Y[2:0], inherit the "tag" associated with the pointer they were loaded with. Hence if X0 is loaded using pointer X1, then the type attributes of X0 will be taken from TX1. Further, any changes to the type register, TX1, will immediately apply as the type of data held in X0.
- the type registers associated with the vector functional units are:
- TMOP specifies the VMU operand type; TRES specifies the VMU, AAU and VALU result type
- the vector operations performed through TOVEN are controlled through the use of this type information.
- the operands for the VMU are converted according to the type-register, TMOP. This may specify "Automatic" or "Unspecified" to allow the operand matching logic to determine the common type for the VMU operation.
- the results of the VMU, AAU and VALU are all specified according to the type-register, TRES.
- the operands for the AAU and VALU are also converted according to TRES. Again, specifying "Automatic” or "Unspecified” allows the operand matching logic to determine the common type for the AAU or VALU operation.
- the actual result of the VMU may be converted to match the type specified in TRES if necessary.
- the type registers associated with writing vector results are:
- the destination registers, M, Q, R and T may be converted according to the type register associated with the destination address pointer.
- an operand type-register is associated with each operand and result (and also with each address pointer).
- the operand type(s) and operation/result type(s) are used for controlling conversions for each operation.
- Instructions are provided to alter the type registers once operands are in registers.
- Operand promotion refers to conversions to larger operands with generally no loss of precision.
- the operand promotions performed according to operand and operation type attributes include:
- Operand promotions are performed in the preparation of the operands in the Vector Operand Conversion Unit (VOC) before the operand is delivered to the specific vector-processing unit (VMU, AAU or VALU).
- Result promotion is performed by the Vector Result Conversion Unit (VRC) when storing operands to memory through the Vector Write Unit (VWU).
- Both operand and operation types are used for promoting the operand. Promotion of operands may be implicit, by matching one form of operand with another (either to match the other data operand or to match the operation type). Depending on either the operation type or the other data operand, a conversion from one format to another is performed automatically. The conversion is equivalent to what is normally performed in high-level languages, such as C, when mixed operand types are used.
- one operand is Exponential (floating point) and the other is Integer
- an implicit conversion of Integer to Exponential is performed first and then the operation is performed.
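- The behavior is analogous to C's usual arithmetic conversions, as in the small example below; the analogy is the text's own (the conversion is "equivalent to what is normally performed in high-level languages").

```c
#include <stdio.h>

int main(void)
{
    int    i = 3;       /* Integer operand                                */
    double x = 0.5;     /* Exponential (floating-point) operand           */
    double r = i + x;   /* i is implicitly promoted to 3.0 before the add */
    printf("%f\n", r);  /* 3.500000 */
    return 0;
}
```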
- Vector instructions may operate on Integer or Fractional data with bytes, half-words or words sizes.
- the sign extension/zero fill controls the expansion into the higher order bits.
- the first step is a type conversion to the nearest exponential equivalent whereby no loss of precision is expected.
- the second step is then a promotion of a "smaller" exponential operand to a larger operand as discussed in the section
- Operand demotion refers to conversions to smaller operands with an intentional loss of precision.
- the demotion is performed to match operand types for specific operation type(s) and for operand storage.
- the operand demotions performed according to operand and operation type attributes include:
- VOC Vector Operand Conversion Unit
- VMU specific vector-processing unit
- VRC Vector Result Conversion Unit
- VWU Vector Write Unit
- operand positioning: the use of a portion of a word as a half-word operand, or a portion of a word or half-word as a byte operand, is implemented through operand positioning.
- the high or low-half of a word operand may be used as the half-word operand.
- the operand is considered as unsigned.
- the corresponding type register should be set accordingly for the selection of the desired high or low portion and the sign attributes of that portion. Table 2-4 shows this operand positioning.
- a demotion occurs on the storage of operands when a Floating-Point operand is to be stored in a Fractional variable, or used as Fractional instruction operand.
- the conversion may result in either an Integer or Fractional number.
- a Fractional number is assumed to be 1.7, 1.15 or 1.31 in either signed or unsigned format.
- Optional rounding and/or saturation may be used in the conversion to Integer or Fractional numbers.
- Video saturation may also be specified for saturating data to unsigned bytes using a maximum of 240 (235 for chroma) and a minimum of 16 for 656 video format.
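- A small sketch of this video saturation, clamping a wider intermediate value to an unsigned byte with the limits stated above (minimum 16; maximum 240, or 235 for chroma). The ranges are taken from the text; the function itself is only illustrative.

```c
#include <stdint.h>

/* Saturate a wider intermediate value to an unsigned byte using the video
 * limits given in the text: [16, 240] for luma, [16, 235] for chroma. */
static uint8_t video_saturate(int32_t v, int is_chroma)
{
    int32_t hi = is_chroma ? 235 : 240;
    if (v < 16) return 16;
    if (v > hi) return (uint8_t)hi;
    return (uint8_t)v;
}
```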
- the specific form of the instruction operation may be selected based on the promoted matching data operand types. For example, a type-independent "add" operation of two data operands may be in either Integer/Fractional or Exponential depending on the common promoted data operand type. The result may be further converted (promoted or demoted) for subsequent operations or storage according to desired operand type.
- the selection of the form of the type-independent instruction is much like operator overloading in C++. Data operands would be automatically promoted to a common type and the matching operation would be performed.
- the operand type would be a characteristic of a data operand
- the operand type would be passed into a routine or piece of code along with the data operand. This allows common code to operate on different and mixed types of data. A classic example of its utility is a maximum function. Any type of data operand may be compared with any other type of data operand using a type-independent "compare" instruction with automatic promotion.
- the TOVEN also performs other conversions as results are generated. These conversions are used to ensure reliable computations. They are discussed in the following sections.
- Redundant sign elimination is used automatically when two Fractional numbers are multiplied. This serves to eliminate the redundant sign bit formed by the multiplication of two S.15 numbers to form a S.31 result as an example.
- the redundant sign elimination is NOT performed for mixed Integer/Fractional or Integer only operations so as to preserve all result bits. The programmer is responsible for shifts in these cases. Multiplication of two Fractional operands or one Fractional and one Integer operand results in a Fractional result type. Only a multiplication of two Integer operands results in an Integer result type.
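- A sketch of a Q15 (S.15) fractional multiply illustrating the rule: the raw 32-bit product of two S.15 operands carries a redundant sign bit, which a one-bit left shift (doubling) removes to give a proper S.31 result; the -1 times -1 corner case discussed later is saturated. Illustrative only, not the VMU datapath.

```c
#include <stdint.h>

/* S.15 x S.15 -> S.31 fractional multiply.  The raw product of two Q15
 * values has two sign bits; doubling it (a one-bit left shift) removes the
 * redundant one.  0x8000 * 0x8000 (-1 * -1) would overflow to 0x80000000,
 * so it is saturated to the most positive value instead. */
static int32_t frac_mul_q15(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT32_MAX;                      /* corner case: -1 * -1   */
    return ((int32_t)a * (int32_t)b) * 2;      /* redundant sign removed */
}
```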
- Another example is - (-1) which should also result in a value of 1 but needs to be represented by a value of nearly 1 as a fractional number.
- This form of fractional negation is used frequently in the AAU and VALU. Conditions such as these should be detected and corrected in each processing stage where such corner cases may occur. Alternatively, the expansion by 1 bit could be accommodated in the processing of the AAU and VALU.
- VMU VECTOR MULTIPLIER UNIT
- the operands come from vector operand registers, X[2:0] or Y[2:0], a prior vector result, R, or a scalar operand, S.
- the result from the VMU is stored (returned) in register M.
- Point-wise vector multiplication is defined as:
- Cross-product or cross-wise vector multiplication is defined as:
- a complex number is represented by a real number followed by an imaginary number.
- a Complex multiplication is as follows:
- VMU Element pair is illustrated in Figure 5. Multiplexors, controlled by the decoded instruction, are used to select the operands. When using 32-bit data size, Xk and Xk-1 are exchanged between elements for performing cross-product/cross-wise multiplication.
- the operand-type registers provide sign and type attributes.
- the multiplier size is produced by the operand-size matching logic according to the multiplier-type register, TMOP.
- the VMU operates on 8, 16 or 32-bit data sizes and produces 16, 32 and 32-bit results respectively.
- a result of a multiplication requires doubling the range of its operands.
- Multiplication of 32-bit data types in the VMU is limited to producing either the high or low 32-bit result.
- a high word result is needed when multiplying Fractional numbers, whereas a low word result expresses the result of multiplying Integer numbers.
- a mixed-mode Fractional/Integer multiplication is supported and the result is considered as Fractional.
- Each multiplier hardware element (32-bit word size) is responsible for operating with a mixture of signed and unsigned operands with both Fractional and Integer types:
- the multiplier element is also required to perform cross-wise multiplication by interchanging a neighboring operand. For 32-bit operands, this exchange is performed outside of the basic element multiplier. For 16 and 8-bit operands, this exchange is performed within the multiplier element by computing appropriate partial products. Table 3-1 shows the multiplier result types and sign attributes.
- the multiplier corrects "corner" cases such as the multiplication of 0x8000 by 0x8000 as signed 16-bit numbers.
- VMU Vector Mode Operations
- the first instruction is point-wise vector multiplication or point-wise vector- scalar multiplication
- the second instruction is cross-wise vector multiplication or cross-wise vector-scalar multiplication
- the third is vector-vector multiplication (squaring) or scalar-scalar multiplication.
- the last two instructions are used for moving a value into the M register.
- when the 32-bit S register is used as an operand, a vector is created with each element of the vector equal to the value in the S register.
- the "V.SQR S" instruction would result in a vector (not scalar) stored in the M register with each element equaling the value in S squared.
- the VMU instructions for Register mode require an additional operand, "Rd", which selects the register (R0 to R7) to store the result.
- Rd, where "d" is also the hardware element slice, implicitly selects the operands Xi.d and Yi.d.
- the user need not specify the ".d" suffixes in the X and Y operands.
- R.CMULI Rd, [Xi.d, S], [Yj.d, S] // Rd = x(1) * y(0) + x(0) * y(1) [T, none].
- the dual operand VMU Register instructions are:
- VMU Type Conversions The operands for the VMU are converted according to the type register, TMOP. This may specify "Automatic" or "Unspecified" to allow the operand matching logic to determine the common type for the VMU operation.
- Table 3-3 shows the VMU operand conversions used when TMOP is set to a specific operand type.
- Word = Word * Word. When TMOP is explicitly set for a particular operation type, that is exactly the operand format used for the operation. In this case, both operands may be converted if necessary (using either promotion or demotion) into the common operand format.
- the result, M, of the VMU is specified according to the type register, TRES.
- the result of the VMU may be converted to match the type specified in TRES if necessary using a demotion operation. Since only a demotion is provided, it may be necessary to restrict the type specified in TMOP according to the type specified in TRES.
- Table 3-4 shows the VMU result conversion used to match the result format specified in TRES.
- the four 8x8 multiplication pairs are (using four 8x8 multipliers):
- the four 8x8 crosswise multiplication pairs are (using four 8x8 multipliers):
- the two 16x16 crosswise multiplications generate the following pairs, which are added and shifted to form the proper result (using eight 8x8 multipliers):
- the 32x32 fractional multiplication generates the following pairs, which are added and shifted to form a 32-bit fractional result (using ten 8x8 multipliers):
- the 32x32 integer multiplication generates the following pairs, which are added and shifted to form a 32-bit integer result (using ten 8x8 multipliers):
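- As a concrete illustration of building wider products from 8x8 partial products (the multiplier counts above are from the text), the standard schoolbook decomposition of one unsigned 16x16 multiply into four 8x8 products is sketched below.

```c
#include <stdint.h>
#include <stdio.h>
#include <assert.h>

/* Decompose a 16x16 unsigned multiply into four 8x8 partial products:
 * (aH*256 + aL) * (bH*256 + bL)
 *   = (aH*bH << 16) + ((aH*bL + aL*bH) << 8) + aL*bL                     */
static uint32_t mul16_from_8x8(uint16_t a, uint16_t b)
{
    uint32_t aH = a >> 8, aL = a & 0xff;
    uint32_t bH = b >> 8, bL = b & 0xff;
    return (aH * bH << 16) + ((aH * bL + aL * bH) << 8) + aL * bL;
}

int main(void)
{
    assert(mul16_from_8x8(0xBEEF, 0x1234) == 0xBEEFu * 0x1234u);
    puts("ok");
    return 0;
}
```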
- the check may be implemented by detecting if either (or both) of the two operands are zero. First, each of the 6 operands, A, B, C and E, F, G, is checked for a value of zero (using an 8-input OR). Then 6 AND gates check for a zero operand for each of these product terms. Finally, a 6-input OR combines the results of the 6 product tests. This logic to implement High-Word detection is shown in Figure 6.
- a full 64-bit product may be produced from two successive integer multiplications.
- the first multiplication produces the low order 32 bits and the second produces the upper 32 bits.
- a partial product from the first multiplication needs to be saved for the proper carry into the upper 32 bits. This may be specified using a word position of 1 for the result selecting the upper 32 bits.
- Ten 8x8 multipliers are needed for this implementation.
- a two-input multiplexor is used to select the input operands for about half of the multipliers.
- the 32x32 fractional multiplier inputs must all be accommodated.
- the six remaining terms may be overlapped with terms not used for their respective multiplications.
- Logic would be needed to select which set is used for each of the 6 multiplier products that have multiple selections.
- the assignment of products to Set B may be optimized with respect to several criteria. First, the cross multiplier unit terms, AH and DE should not be multiplexed as these may have longer signal delays. Next, the assignment of operand pairs may consider the commonality of an input operand and hence eliminate the need for one operand multiplexor. Finally, the resulting routing of the product terms into the adders may be considered. Following at least the first two suggested optimizations, the following sets given in Table 3-6 are recommended:
- the basic multiplier cell uses two 8-bit operands, referred to as operands mul_u and mul_v, two single-bit operand-sign indications (conveying either signed or unsigned), referred to as ind_u and ind_v, and produces a 16-bit partial product, referred to as product_uv.
- the overall operand sign and size types determine the operand-sign indications for the basic multiplier cell. Only the most significant byte of a signed operand is indicated as signed while the rest of the bytes are indicated as unsigned.
- multiplier cells also include one or two 2-input multiplexors for selection of Set A or Set B operands.
- the suggested Set A/B pairings allows for commonality in some multiplier inputs and often only one 2-input multiplexor is required.
- integer/fractional affects primarily normalization after the 8x8 product term additions. It does not affect the generation of the 8x8 partial product terms (except it selects the terms for producing an integer or fractional result from a 32x32-bit multiply.)
- the normalization process is implemented after the summation of the partial products as a simple one-bit shift to the left for a fractional result type.
- the 16-bit partial products are added together according to the operation.
- Table 3-7 shows the partial products to be added together.
- the structure of the summation network will be a set of multiplexors to select the desired operand(s) (or to select 0) and a set of adders.
- the number of full adders required is at least 13.
- An expected number is probably 15.
- L and H subscripts refer to the low and high 8 bits of the partial product terms respectively.
- FIG. 7 shows an illustrative implementation of the summation network using a full adder. The exact implementation of both components needs to be researched.
- a Wallace tree or an Additive Multiply technique (Section 12.2 of Computer Arithmetic by Parhami) may be suitable for the multiplier implementation.
- Some form of a CSA (Carry Save Adder) style adder (3 inputs, 2 outputs per level) may be appropriate for the implementation of the adder networks.
- multiplier cell When a multiplier cell is not needed, power should be conserved by setting (and holding) its inputs at a zero value. This could be done with the multiplexor or with a set of simple AND gates.
- the summation network should also perform similar power management.
- the clock used for internal pipeline stages (and anything else) should be gated off for the multiplier cells and adders in the summation network that are not needed.
- the multiplier should also be correct with "corner" cases such as the multiplication of 0x8000 by 0x8000 as signed 16-bit numbers (equivalent to -1).
- the result of -1 times -1 should be 1 and hence the proper arithmetic result should be 0x7fff ffff rather than 0x8000 0000.
- AAU The Array Adder Unit (AAU) performs the summation of an input vector (operand register), partial summation, permutation, and many other powerful transformations (such as an FFT, dyadic wavelet transform, and compare operations for Viterbi decoding).
- the Array Adder Unit is used to arithmetically combine elements of a VMU result, M, a prior VALU result, R, or from a memory operand X or Y.
- the C matrix may be fetched or altered for each subsequent instruction.
- the fundamental operation performed by this unit is
- FIG. 9 An AAU Element is illustrated in Figure 9.
- the multiplexor at the bottom right, controlled by the decoded instruction, is used to select the operands. Multiplexors along the left, controlled by a row of the C matrix, now referred to as a C vector (a matrix can be broken into row vectors), selects the addition or subtraction of each term.
- the sign (signed or unsigned) and type (Fractional or Integer) attributes are provided by the operand-type register.
- Figure 10 shows the implementation of the AAU as a set of 12-bit wide segments. Multiplexors control the delivery of operands for each segment as illustrated below the diagram of the segments. Sign extension is necessary when a smaller operand is used in a segment (such as an X or Y operand).
- Figures 11a and 11b show the multiplexors, operand positioning and sign extension processes.
- 3.3.2 AAU Standard Functions
- the Array Adder Unit controls each adder term with a pair of bits from the control matrix, C, to allow each P k to be excluded, added or subtracted.
- the encoding of the control bits is 00 for excluding, 01 for adding and 10 for subtracting.
- the combination 11 is not used and reserved.
- the C matrix representing the pattern to be used for add/subtract, is a set of 8 half-words with the first half-word for Q[0] (i.e. C[0][7 to 0]) and the last half-word for Q[7] (i.e. C[7][7 to 0]).
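- A sketch of decoding these per-term 2-bit controls from one half-word row of the C matrix is shown below; the placement of field k at bits [2k+1:2k] is an assumption made for illustration.

```c
#include <stdint.h>

/* Apply one row of the control matrix (a 16-bit half-word holding eight
 * 2-bit fields) to the eight terms P[0..7].  Field k is assumed to sit at
 * bits [2k+1:2k]; 00 = exclude, 01 = add, 10 = subtract, 11 = reserved.  */
static int64_t aau_row(uint16_t c_row, const int64_t P[8])
{
    int64_t q = 0;
    for (int k = 0; k < 8; k++) {
        unsigned ctrl = (c_row >> (2 * k)) & 3u;
        if (ctrl == 1)      q += P[k];
        else if (ctrl == 2) q -= P[k];
        /* 00 excludes the term; 11 is reserved and ignored here */
    }
    return q;
}
```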
- the pre-determined patterns are:
- PASS sets Qj to Pj for all terms.
- SUM produces every Qj as the same sum of all Pk.
- REAL is used to set Qj to Pj - Pj+1 and Qj+1 to 0 for all even j.
- IMAGINARY is used to set Qj to 0 and Qj+1 to Pj + Pj+1 for all even j.
- FFT2, FFT4 and FFT8 represent addition/subtraction patterns used for FFT Radix 2, 4 and 8 kernels respectively. The patterns and their use need to be evaluated. More patterns may be needed for computing FFTs efficiently.
- VITERBI may be used to perform several compares in parallel to accelerate the algorithm. It is likely that several different patterns may be necessary for the support of Viterbi decoding.
- DCT represents a group of addition/subtraction patterns used for the implementation of DCT and IDCT operations. Several patterns may be necessary.
- SCATTER represents a group of scatter/gather/merging patterns, which may be deemed useful to support.
- control matrix, C may be loaded using the address specified in ICn.
- VML the address specified in ICn.
- the multiplier vectors are normally half the width of the ALU vectors and the pre-fetch unit is designed to sustain full throughput to the ALU.
- the VMU result, M, the VALU result, R, or a direct operand, X or Y, may be used for the AAU operation.
- the result of the AAU is available as Q in the VALU.
- the AAU should be correct when forming -(-1) as a fractional number.
- the result may need to be approximated as the largest representable positive fractional value (a value of nearly 1).
- the single operand AAU Vector instruction is:
- the defined C matrix patterns are the following:
- the three operand AAU Register instructions are:
- the AAU performs a limited operand promotion whereby it places an operand X, Y or M, into either the low or high halves of an extended precision format compatible with the operand type.
- an operand X, Y or M may be positioned in bits 7 to 0, i.e., a placement of 0, or it may be positioned in the extended bits, bits 11 to 8, i.e., a placement of 1.
- Table 3-8 shows the placement and bit position of the different operands. (Note, all even placements are regarded the same as a placement of 0 and all odd placements are regarded the same as a placement of 1. This allows a more consistent identification of operand significance.)
- Figure 10 shows the implementation of the AAU as a set of 12-bit wide segments. Multiplexors control the delivery of operands for each segment as illustrated below the diagram of the segments. Sign extension is necessary when a smaller operand is used in a segment (such as an X or Y operand).
- Figure 11 shows the multiplexors, operand positioning and sign extension processes. The implementation of the array addition for each result element, Q j , is shown in Figure 9.
- An alternate implementation of the array addition uses a common first stage to form shared terms resulting from the combination of two inputs of either positive or negative polarity. These terms may then be selected for use in the second level of additions in the AAU.
- the implementation in this manner saves a number of adders, as only one addition and one subtraction (hereinafter referred to as "adders") are necessary.
- Table 3-9 shows the possible combinations of two inputs.
- a vector processor as described herein may comprise a vector of multipliers computing multiplier results; and an array adder computational unit computing an arbitrary linear combination of the multiplier results.
- the array adder computational unit may have a plurality of numeric inputs that are added, subtracted or ignored according to a control vector comprising the numeric values 1, -1 and 0, respectively.
- the array adder computational unit may comprise at least 4 or at least 8 inputs, and may comprise at least 4 outputs.
- the Vector ALU performs the traditional arithmetic, logical, shifting and rounding operations.
- the operands are the results of the VMU, AAU or VALU as M, Q, R or T respectively, the direct inputs X and Y, and the scalar S.
- the VALU result, T, is not available for all Register mode instructions.
- the operands for the VALU instructions are symbolized by the following:
- VALU instructions The basic operations performed by the VALU instructions are the following:
- R = A << exp (exp can be +, 0 or -; the shift is arithmetic or logical)
- R = A >> exp (exp can be +, 0 or -; the shift is arithmetic or logical)
- This unit is also responsible for conditional operations to perform merging, scatter and gather.
- a VALU Element is illustrated in Figure 13.
- the multiplexors at the left, controlled by the decoded instruction, are used to select the operands.
- the operand-type registers provide the sign and type attributes.
- VALU performs a variety of traditional arithmetic, logical, shifting and rounding operations.
- the operands are the results of the VMU, AAU or VALU as M, Q, R or T respectively, the direct inputs X and Y, and the scalar S.
- the VALU result, T, is not available for all Register mode instructions.
- the shift count for shift operations would need to be specified by a register or immediate value.
- the shift count may be either positive or negative where a negative shift count reverses the shift direction (as in C Language).
- the result of the shift may be optionally rounded and saturated.
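- A sketch of these shift semantics: a signed count reverses the direction (as in the C idiom), and a right shift may optionally add half an LSB before shifting to round toward the retained LSB. The rounding form shown is one common convention and is not claimed to be the exact VALU behavior.

```c
#include <stdint.h>

/* Arithmetic shift with a signed count: a positive count shifts left, a
 * negative count shifts right (counts assumed to stay below 32).  A right
 * shift optionally adds half an LSB first to round the retained LSB.
 * Assumes the usual arithmetic behavior of >> on signed values. */
static int32_t shift_signed(int32_t a, int count, int round)
{
    if (count >= 0)
        return (int32_t)((uint32_t)a << count);   /* left shift            */
    int n = -count;
    if (round)
        a += (int32_t)1 << (n - 1);               /* +0.5 LSB before shift */
    return a >> n;                                /* arithmetic right      */
}
```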
- the dual operand VALU Vector instructions are:
- the single operand VALU Vector instructions are:
- the three operand VALU Register instructions are:
- the dual operand VALU Register instructions are:
- R.SUB [Rs, Ts, Xi.d, Yj.d, S, C4, C16, C32] [T, none].
- R.SUBC [Rs, Xi.d, Yj.d, S, C4, C16, C32] [T, none].
- R.SUBR [Rs, Xi.d, Yj.d, S, C4, C16, C32] [T, none].
- R.DIV [Rs, Xi.d, Yj.d, S, C4, C16, C32]
- R.BITC [Rs, Xi.d, Yj.d, S, C4] [T, none]. R.BITI Rd, [Rs, Xi.d, Yj.d, S, C4] [T, none]
- R.SHLA Rd [Rs, Xi.d, Yj.d, S, C4, C16, C32] [T, none].
- R.SHLL Rd [Rs, Xi.d, Yj.d, S, C4, C16, C32] [T, none].
- R.SHRA Rd [Rs, Xi.d, Yj.d, S, C4, C16, C32] [T, none].
- R.SHRL Rd [Rs, Xi.d, Yj.d, S, C4, C16, C32]
- the single operand VALU Register instructions are:
- the VALU performs a limited operand promotion whereby it places an operand X, Y, M or S into either the low or high positions of an extended precision format compatible with the operand type.
- an operand X, Y, M or S may be positioned in bits 7 to 0, (placement of 0), or it may be positioned in the extended bits, bits 11 to 8, (placement of 1).
- Table 3-10 (Placement and Bit Position of Operands) shows the placement and bit position of the different operands.
- Figure 14 shows the implementation of the VALU as a set of 12-bit wide segments. Multiplexors control the delivery of operands for each segment as illustrated below the diagram of the segments. Sign extension is necessary when a smaller operand is used in a segment (such as an X or Y operand).
- Figures 15a and 15b show the multiplexors, operand positioning and sign extension processes.
- the Scalar ALU performs the simple arithmetic, logical and shifting operations for the support of program control flow operations and special address calculations not supported by the dedicated address pointer operations.
- the SALU is positioned early in the processor pipeline to permit both control flow operations (such as for program loops and other logic tests) and address calculations (such as for indexing into arrays) to be done without waiting for the full length of the standard processing pipeline.
- the SALU functional unit is positioned as shown in Figure 1-3 immediately after the SALU instruction decoder.
- the operands are the SALU result register, S, and an immediate constant, general purpose registers, G[7:0], the VAR registers consisting of (Izn, Tzn, Bzn and Lzn) as well as other special processor registers such as VEM and VCM.
- processor may also support operands from individual elements of M, Q, R, T, X and Y.
- the SALU performs a variety of traditional arithmetic, logical and shifting operations.
- the operands are the SALU result register, S, and an immediate constant, general purpose registers, G[7:0], the VAR registers consisting of (Izn, Tzn, Bzn and Lzn) as well as other special processor registers such as VEM and VCM.
- processor may also support operands from individual elements of M, Q, R, T, X and Y.
- the shift count for shift operations would need to be specified by a register or immediate value.
- the shift count may be either positive or negative where a negative shift count reverses the shift direction (as in C Language).
- the dual operand SALU Register instructions are: S.ABS S, [register, C4, C16, C32] [T, none].
- S.ADD S [register, C4, C16, C32] [T, none].
- S.CMP S [register, C4, C16, C32]
- a device as described herein may implement a method to improve responsiveness to program control operations.
- the method may comprise providing a separate computational unit designed for program control operations, positioning the separate computational unit early in the pipeline thereby reducing delays, and using the separate computational unit to produce a program control result early in the pipeline to control the execution address of a processor.
- a related method may improve the responsiveness to an operand address computation.
- the method may comprise providing a separate computational unit designed for operand address computations, positioning said separate computational unit early in the pipeline thereby reducing delays, and using said separate computation unit to produce a result early in the pipeline to be used as an operand address.
- Operand conversion units are used for the conversion of operands read from memory (X and Y), after the multiplier produces a result for storage into M, operand inputs to the AAU and VALU, and for result storage back to memory.
- the conversion of operands to/from memory is regarded as the most general.
- the other conversions are specialized for each of its associated units (VMU, AAU and VALU).
- VMU conversion is limited to operand demotion as growth in operand size is natural with multiplication. In order to match operand sizes and reduce complexity in vector length computation logic, VMU results may only be demoted.
- the AAU and VALU promote operands to permit them to represent a normal or a guard position. Support of the guard position is provided to allow a program to specify the full-extended precision maintained by the functional unit.
- Figure 16 illustrates the conversion process to convert a data operand for use in a vector processor unit.
- the first implementation is a linear sequence of the five processing functions.
- the second form exploits the knowledge that either a demotion or a promotion is being used (and not both). The processing delay may be reduced through use of this structure. It requires an additional multiplexor to select the properly formatted operand. Either process may be used to pass through an operand unaltered for the cases where no promotion/demotion is necessary.
- Fractional numbers are commonly saturated if the extended precision value (held in the guard bits) differs from the sign bits. Signed 32/48-bit Fractional numbers greater than 0x0000 7fff ffff are limited to this value, just as Fractional numbers less than 0xffff 8000 0000 are limited to that value. Unsigned 32/48-bit Fractional numbers greater than 0x0000 ffff ffff are limited to this value.
- Fractional numbers may also be rounded to improve the accuracy of the least significant bit retained.
- the value 0x0000 0000 8000 is effectively added (for positive numbers) or subtracted (for negative numbers) to round the fractional number prior to reducing its precision.
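- A sketch of this saturate-and-round demotion, applied to a 48-bit guarded fractional accumulator held sign-extended in a 64-bit integer and reduced to a signed 32-bit fraction. The limits and the 0x8000 rounding value come from the text; the code structure is illustrative.

```c
#include <stdint.h>

/* Demote a 48-bit fractional accumulator (sign-extended in an int64) to a
 * signed 32-bit fraction: round away the low 16 bits (0x8000 added for
 * positive values, subtracted for negative ones, per the text), then
 * saturate to the range 0xffff 8000 0000 .. 0x0000 7fff ffff. */
static int32_t frac_demote_48_to_32(int64_t acc48)
{
    acc48 += (acc48 >= 0) ? 0x8000 : -0x8000;  /* round half away from zero */
    int64_t v = acc48 / 65536;                 /* drop low 16 bits (trunc)  */
    if (v > INT32_MAX) return INT32_MAX;       /* 0x0000 7fff ffff limit    */
    if (v < INT32_MIN) return INT32_MIN;       /* 0xffff 8000 0000 limit    */
    return (int32_t)v;
}
```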
- Integer numbers may also be saturated identically to Fractional numbers. They are not, however, rounded. Integer saturation may also require limiting the values to smaller numeric ranges when reducing the precision from 32/48 bits to 16 bits, as an example. In addition, Integer numbers may be saturated to special ranges when they are used to convey image information. For some color image formats, the intensity (luminance) is to be bounded within the range [16, 240] and the color (chrominance) is to be bounded within the range [16, 235].
- Fractional demotion is used to round and/or saturate an operand before it is converted through demotion to a smaller sized operand.
- Integer demotion is used to saturate an operand before it is converted through demotion to a smaller sized operand.
- the data operand may be either 16 to 32-bits (or 48 bits for the result write conversion) in size.
- the Fractional demotion process is illustrated in Figure 17 and is described in the following subsections. Fractional demotion (saturation and rounding) should not be used in any conversions of Fractional operands if multi-precision operations are being performed in software.
- If the conversion is to a byte, then all bytes above the selected byte must be the same as the sign bit (or zero if unsigned). If not, the number is saturated to a value according to its sign (if it is signed; otherwise it is limited to the maximum value the converted size may represent). A similar conversion is performed if the conversion is to a half-word.
- A special Integer video saturation mode is provided for limiting luminance values to the range [16, 240] and chrominance values to the range [16, 235].
- the use of special limits is conveyed through the operand-type registers associated with the target operand. Note, the conversion need not be to a byte size for the special Integer video saturation modes.
- Table 4-1 shows the saturation limits for signed and unsigned operands.
- Rounding is used to more accurately represent a Fractional value when only a higher order partial word is being used as a target operand. Rounding may be either unbiased or biased. Most DSP algorithms prefer unbiased rounding to prevent an inadvertent bias. Speech coder algorithms explicitly require biased rounding because they were specified by a functional implementation, commonly performed on ordinary Integer processors, that unconditionally adds the rounding value.
- Size demotion is used to select the 8 or 16-bit sub-field of the 16 or 32-bit Integer or Fractional operand. (Fractional numbers are also subject to this demotion when converting operand sizes.)
- Figure 18 illustrates the hardware implementation of this processing.
- the symbol, b k [i:j], represents bits i to j of element k of vector b.
- a single byte result is placed on the lowest 8 bits.
- a half-word result is placed on the lowest 16 bits.
- a pair of bytes related to a single byte from each of two half-words is placed on the lowest 16 bits of the word (* indicates the usual position and A* or B* represents this alternative position). These conventions are considered as the "normalized" orientation for further processing by the Vector Packer. All positions not explicitly filled are do-not-care values. They may be held at zero (as a constant) value to conserve power by reducing switching of circuits.
- the packer re-organizes the data operands into a set of adjacent elements. This completes the process of demotion.
- the packing operation uses 1, 2 or 4 bytes from each 32-bit element.
- the normalized forms used are:
- the packer is responsible for compressing the unused space out of the vector so that each vector processor (up to the length of the vector) is delivered data for processing.
- This conversion step uses C* (instead of the position indicated by *) when converting from Half-Words to Bytes assuming the "normalized” orientation with the two Bytes packed into the lower Half-Word. This internal convention is used to simplify and regularize the packer logic.
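- A minimal sketch of the packer's job as described (and of the spreader's inverse job, discussed below): one byte is taken from each 32-bit element in its normalized low position and the unused space is compressed out so the demoted elements become adjacent. Lane counts and ordering are illustrative.

```c
#include <stdint.h>

/* Pack the low (normalized-position) byte of each 32-bit element into a
 * contiguous byte vector, compressing out the unused upper bytes. */
static void pack_bytes(const uint32_t *elems, int n, uint8_t *out)
{
    for (int i = 0; i < n; i++)
        out[i] = (uint8_t)(elems[i] & 0xff);
}

/* The spreader performs the inverse: each packed byte goes back into the
 * low position of its own 32-bit element (zero fill shown; sign extension
 * would be used for signed operands). */
static void spread_bytes(const uint8_t *in, int n, uint32_t *elems)
{
    for (int i = 0; i < n; i++)
        elems[i] = in[i];
}
```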
- Figure 19 illustrates the hardware implementation of the Vector Packer.
- Table 4-3 identifies the packing operation for representative 32-bit vector processors.
- This conversion step uses C* (instead of a B) when converting from Half-Words to Bytes assuming the "normalized" orientation with the two Bytes packed into the lower Half- Word. This internal convention is used to simplify and regularize the packer logic.
- corrective action may include trapping the processor to inform the developer or performing additional vector data operand pre-fetches to obtain all the required data.
- the partial vector would need to be saved in a register while the rest of the data is obtained.
- the packer network would need to allow for a distributor function to deliver the entire byte or half-word vector in pieces.
- the spreader re-organizes the data operands from a packed form into a higher precision data type (such as promoting U.8.0 data to a larger size).
- the spreading operation provides 1, 2 or 4 bytes for each 32-bit element in normalized form (position 0). If a "position" other than normalized is desired, then a second step is required.
- Figure 20 illustrates the hardware implementation of the Vector Spreader.
- Table 4-4 identifies the spreading operation for representative 32-bit vector processors
- a pair of bytes related to a single byte from each of two half-words is placed on the lowest 16 bits of the word (* indicates the usual position and A* or B* represents this alternative position). These conventions are considered as the "normalized" orientation for further processing by the Vector Spreader. All positions not explicitly filled are do-not-care values. They may be held at zero (as a constant) value to conserve power by reducing switching of circuits.
- Size promotion is used to position the smaller Integer or Fractional operand into the desired field of the target operand.
- the operand is presented as a set of bytes, ABCD.
- Figure 21 illustrates the hardware implementation.
- Table 4-5 specifies the size promotion.
- a byte operand may be placed into any byte of the half-word or word target operand. Sign extension may be used if the operand is signed; zero fill is otherwise used. Similar conversions are used for positioning half-word into word operands.
- This conversion step uses C* (instead of a B) when converting from Bytes to Half-Words assuming the "normalized” orientation with the two Bytes packed into the lower Half-Word. This internal convention is used to simplify and regularize the spreader logic.
- the Operand Matching Logic (shown in Figure 22) evaluates the types of operands and the scheduled operations. This logic determines common operand types for the VMU, AAU and VALU. This section describes the algorithm, coded in a C-like style. If "Auto" or "Unspecified" attributes are used in an operation-type register, TMOP or TRES, operand-type matching logic is used to adjust the operation type to the largest of the operands to be used for an operation. Otherwise, the operands are converted to the size requested for an operation according to TMOP or TRES as appropriate.
- VMU Operand and Operation Types are determined according to the following algorithm:
- AUTO represents an unspecified operand size
- OS8 represents an 8-bit operand/result size
- OS16 represents a 16-bit operand/result size
- OS32 represents a 32-bit operand/result size
- TMOP is the operand type register for the VMU
- TRES is the result type register for the VMU, AAU and ALU
- TU is the operand type register for U operand vector (an X, S operand)
- TV is the operand type register for V operand vector (an Y, S or R operand)
- TUV is the common operand type register for the VMU
- TM is the result type register for the VMU M result vector
- TM = OS16; } else /* either OS32 or OS16 */
- TUV = OS16; /* Needs adjustment for required result format */ } }
- the VMU result is optionally demoted after a computation to match the result format (according to TRES) used in the rest of the functional units.
- a 16-bit operand may be forced if a 32-bit result format is required.
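- A hedged reconstruction, in the same C-like style, of the matching idea: when TMOP is "Automatic" or "Unspecified" the common VMU operand size is the larger of the two operand sizes, and the result size follows the doubling rule (8 to 16, 16 to 32, 32 stays 32) subject to the TRES restriction noted above. The original listing is not reproduced; this is only a sketch.

```c
typedef enum { OS_AUTO = 0, OS8 = 8, OS16 = 16, OS32 = 32 } opsize_t;

/* Determine the common VMU operand size (TUV) and result size (TM).
 * Sketch of the matching rule only, not the full patent algorithm. */
static void vmu_match(opsize_t tmop, opsize_t tu, opsize_t tv,
                      opsize_t *tuv, opsize_t *tm)
{
    /* common operand size: explicit TMOP wins, otherwise the larger operand */
    *tuv = (tmop != OS_AUTO) ? tmop : (tu > tv ? tu : tv);

    /* multiplication doubles the range: 8->16, 16->32, 32->32 (high or low) */
    *tm = (*tuv == OS8) ? OS16 : OS32;
}
```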
- AAU Operand and Operation Types are determined according to the following algorithm:
- Symbols: AUTO represents an unspecified operand size
- OS8 represents an 8-bit operand/result size
- OS16 represents a 16-bit operand/result size
- OS32 represents a 32-bit operand/result size
- TRES is the result type register for the VMU, AAU and ALU
- TO is the operand type register for O operand vector (an X, Y, M or R operand)
- TQ is the result type register for the AAU Q result vector and the operand type for the AAU
- VALU Operand and Operation Types are determined according to the following algorithm:
- AUTO represents an unspecified operand size
- OS8 represents an 8-bit operand/result size
- OS16 represents a 16-bit operand/result size
- OS32 represents a 32-bit operand/result size
- TRES is the result type register for the VMU, AAU and ALU
- TA is the operand type register for A operand vector (an X, S, T, Q, M, or R operand)
- TB is the operand type register for B operand vector (an Y, S, T, Q, M, or R operand)
- TR is the result type register for the VALU R result vector and the common operand type for the VALU
- the type determination as exemplified above would need additional decisions when feeding back and forward operands such as R, M, Q and T.
- the operand type, TU, TV, TO, TA or TB would be taken from TR, TM, TQ or TT from the previous cycle (i.e. the type would correspond to the previously computed operand type).
- the operand type TO, TA or TB would be taken from the current cycle's TM or TQ (i.e. the type would correspond to the newly computed operand type).
- the adaptation of the algorithms to fully support the feedback and feed forward operands is relatively simple for one skilled in the art.
- a processor as described herein may perform an operation on first and second operand data having respective operand formats.
- the device may comprise a first hardware register specifying a type attribute representing an operand format of the first data, a second hardware register specifying a type attribute representing an operand format of the second data, an operand matching logic circuit determining a common operand format to be used for both of the first and second data in performing the operation based on the first type attribute of the first data and the second type attribute of the second data, and a functional unit that performs the operation in accordance with the common operand type.
- a related method as described herein may include specifying an operation type attribute representing an operation format of the operation, specifying in a hardware register an operand type attribute representing an operand format of data to be used by the operation, determining an operand conversion to be performed on the data to enable performance of the operation in accordance with the operation format based on the operation format and the operand format of the data, and performing the determined operand conversion.
- the operation type attribute may be specified in a hardware register or in a processor instruction.
- the operation format may be an operation operand format or an operation result format.
- a related method as described herein may include specifying in a hardware register an operation type attribute representing an operation format, specifying in a hardware register an operand type attribute representing a data operand format, and performing the operation in a functional unit of the computer in accordance with the specified operation type attribute and the specified operand type attribute.
- the operation format may be an operation operand format or an operation result format.
- a related method as described herein may provide an operation that is independent of data operand type.
- the method may comprise specifying in a hardware register an operand type attribute representing a data operand format of said data operand, and performing the operation in a functional unit of the computer in accordance with the specified operand type attribute.
- the method may comprise specifying in a first hardware register an operand type attribute representing an operand format of a first data operand, specifying in a second hardware register an operand type attribute representing an operand format of a second data operand, determining in an operand matching logic circuit a common operand format to be used for both of the first and second data in performing the operation based on the first type attribute of the first data and the second type attribute of the second data, and performing the operation in a functional unit of the computer in accordance with the determined common operand.
- a related method for performing operand conversion in a computer device as described herein may comprise specifying in a hardware register an original operand type attribute representing an original operand format of operand data, specifying in a hardware register a converted operand type attribute representing a converted operand format to which the operand data is to be converted, and converting the data from the original operand format to the converted operand format in an operand format conversion logic circuit in accordance with the original operand type attribute and the converted operand type attribute.
- the operand conversion may occur automatically when a standard computational operation is requested.
- the operand conversion may implement sign extension for an operand having an original operand type attribute indicating a signed operand, zero fill for an operand having an original operand type attribute indicating an unsigned operand, positioning for an operand having an original operand type attribute indicating operand position, positioning for an operand in accordance with a converted operand type attribute indicating a converted operand position, or one of fractional, integer and exponential conversion for an operand according to the original operand type attribute or the converted operand type attribute.
- the vector operand lengths corresponding to the data elements consumed by an operation may be determined. This process matches the number of elements processed by each unit.
- the vector length once determined, is used for loop control and for advancing the address pointer(s) related to the operand(s) accessed and consumed for an operation. Within a loop, it is assumed that all the operations will be of the same number of elements. For operand addressing, each pointer used may be incremented by a different value representing the number of elements consumed times the size of the operand in memory.
- the following algorithm is used for determining the number of elements processed:
- OS8 represents an 8-bit operand/result size
- OS16 represents a 16-bit operand/result size
- OS32 represents a 32-bit operand/result size
- L is the number of 32-bit hardware elements
- TUV is the common operand type register for the VMU
- TM is the result type register for the VMU M result vector
- TQ is the result type register for the AAU Q result vector and the operand type for the AAU
- TR is the result type register for the VALU R result vector and the common operand type for the VALU
- LM is the result length (in elements) register for the VMU M result vector
- LQ is the result length (in elements) register for the AAU Q result vector
- LR is the result length (in elements) register for the VALU R result vector
- VML is the length of vector (in elements) consumed by the VMU
- AAL is the length of vector (in elements) consumed by the AAU
- VAL is the length of vector (in elements) consumed by the VALU
- VML = 8; } else {
- VML = 16; }
- VML = 4; } }
- VML = LR
- An alternative implementation uses length information (in bytes, not counting extension/guard bits) associated with each of the operand and result registers.
- OS8 represents an 8-bit operand/result size
- OS16 represents a 16-bit operand/result size
- OS32 represents a 32-bit operand/result size
- Inputs:
- L is the number of 8-bit elements enabled (maximum value is the number of 8-bit hardware elements)
- TU is the operand type register for U operand vector (an X, S operand)
- TV is the operand type register for V operand vector (an Y, S or R operand)
- TUV is the common operand type register for the VMU
- TM is the result type register for the VMU M result vector
- TO is the operand type register for O operand vector (an X, Y, M or R operand)
- TQ is the result type register for the AAU Q result vector and the operand type for the AAU
- TA is the operand type register for A operand vector (an X, S, T, Q, M, or R operand)
- TB is the operand type register for B operand vector (an Y, S, T, Q, M, or R operand)
- TR is the result type register for the VALU R result vector and the common operand type for the VALU
- LU is the operand length register for U operand vector (an X, S operand)
- LV is the operand length register for V operand vector (a Y, S or R operand)
- LUV is the common operand length register for the VMU
- LM is the result length register for the VMU M result vector
- LO is the operand length register for O operand vector (an X, Y, M or R operand)
- LQ is the result length register for the AAU Q result vector and the operand length for the AAU
- LA is the operand length register for A operand vector (an X, S, T, Q, M, or R operand)
- LB is the operand length register for B operand vector (a Y, S, T, Q, M, or R operand)
- LR is the result length register for the VALU R result vector and the common operand length for the VALU
Outputs
- LM is the result length register for the VMU M result vector
- LQ is the result length register for the AAU Q result vector
- LR is the result length register for the VALU R result vector
- VML is the length of vector (in elements) consumed by the VMU
- AAL is the length of vector (in elements) consumed by the AAU
- VAL is the length of vector (in elements) consumed by the VALU
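The byte-length variant can be illustrated with a short sketch: each length register holds a byte count (extension/guard bits excluded), and the elements available from an operand are its byte length divided by its element size. The helper names and the rule of taking the smaller of the two operand counts for a two-operand unit such as the VMU are assumptions for illustration.

```c
#include <stddef.h>

/* Convert a register's byte length into a whole-element count; elem_bytes is
 * 1, 2 or 4 for OS8, OS16 or OS32 operands respectively. */
static size_t elements_from_bytes(size_t len_bytes, size_t elem_bytes)
{
    return len_bytes / elem_bytes;
}

/* Assumed rule for a two-operand unit (e.g. the VMU reading U and V): the
 * vector length consumed is limited by whichever operand supplies fewer
 * whole elements. */
static size_t common_vector_length(size_t lu_bytes, size_t u_elem_bytes,
                                   size_t lv_bytes, size_t v_elem_bytes)
{
    size_t u_elems = elements_from_bytes(lu_bytes, u_elem_bytes);
    size_t v_elems = elements_from_bytes(lv_bytes, v_elem_bytes);
    return (u_elems < v_elems) ? u_elems : v_elems;
}
```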
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Image Processing (AREA)
- Advance Control (AREA)
Abstract
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US10/467,225 US20040073773A1 (en) | 2002-02-06 | 2002-02-06 | Vector processor architecture and methods performed therein |
| AU2002338616A AU2002338616A1 (en) | 2001-02-06 | 2002-02-06 | Vector processor architecture and methods performed therein |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US26670601P | 2001-02-06 | 2001-02-06 | |
| US60/266,706 | 2001-02-06 | ||
| US27529601P | 2001-03-13 | 2001-03-13 | |
| US60/275,296 | 2001-03-13 |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| WO2002084451A2 true WO2002084451A2 (fr) | 2002-10-24 |
| WO2002084451A3 WO2002084451A3 (fr) | 2003-03-20 |
Family
ID=26951989
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2002/020645 Ceased WO2002084451A2 (fr) | 2001-02-06 | 2002-02-06 | Architecture de processeur vectoriel et procedes mis en oeuvre dans cette architecture |
Country Status (2)
| Country | Link |
|---|---|
| AU (1) | AU2002338616A1 (fr) |
| WO (1) | WO2002084451A2 (fr) |
Family Cites Families (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4888682A (en) * | 1983-09-09 | 1989-12-19 | International Business Machines Corp. | Parallel vector processor using multiple dedicated processors and vector registers divided into smaller registers |
| US5423051A (en) * | 1992-09-24 | 1995-06-06 | International Business Machines Corporation | Execution unit with an integrated vector operation capability |
| US5537606A (en) * | 1995-01-31 | 1996-07-16 | International Business Machines Corporation | Scalar pipeline replication for parallel vector element processing |
| US6401194B1 (en) * | 1997-01-28 | 2002-06-04 | Samsung Electronics Co., Ltd. | Execution unit for processing a data stream independently and in parallel |
| US5946496A (en) * | 1997-12-10 | 1999-08-31 | Cray Research, Inc. | Distributed vector architecture |
| US6282634B1 (en) * | 1998-05-27 | 2001-08-28 | Arm Limited | Apparatus and method for processing data having a mixed vector/scalar register file |
2002
- 2002-02-06 WO PCT/US2002/020645 patent/WO2002084451A2/fr not_active Ceased
- 2002-02-06 AU AU2002338616A patent/AU2002338616A1/en not_active Abandoned
Cited By (39)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| GB2470782A (en) * | 2009-06-05 | 2010-12-08 | Advanced Risc Mach Ltd | Conditional execution in a data processing apparatus handling vector instructions |
| US8661225B2 (en) | 2009-06-05 | 2014-02-25 | Arm Limited | Data processing apparatus and method for handling vector instructions |
| GB2470782B (en) * | 2009-06-05 | 2014-10-22 | Advanced Risc Mach Ltd | A data processing apparatus and method for handling vector instructions |
| WO2015052484A1 (fr) * | 2013-10-09 | 2015-04-16 | Arm Limited | Appareil et procédé de traitement de données pour effectuer des opérations d'accès vectoriel spéculatives |
| CN105593808A (zh) * | 2013-10-09 | 2016-05-18 | Arm有限公司 | 用于执行推测向量存取操作的数据处理装置和方法 |
| KR20160065144A (ko) * | 2013-10-09 | 2016-06-08 | 에이알엠 리미티드 | 데이터 처리장치 및 추론적 벡터 액세스 연산의 수행방법 |
| US9483438B2 (en) | 2013-10-09 | 2016-11-01 | Arm Limited | Apparatus and method for controlling the number of vector elements written to a data store while performing speculative vector write operations |
| US10261789B2 (en) | 2013-10-09 | 2019-04-16 | Arm Limited | Data processing apparatus and method for controlling performance of speculative vector operations |
| CN105593808B (zh) * | 2013-10-09 | 2019-08-16 | Arm 有限公司 | 用于执行推测向量存取操作的数据处理装置和方法 |
| EP3125108A1 (fr) * | 2015-07-31 | 2017-02-01 | ARM Limited | Traitement de donnees |
| WO2017021269A1 (fr) * | 2015-07-31 | 2017-02-09 | Arm Limited | Traitement de vecteur à l'aide de boucles de longueur de vecteur dynamique |
| US10430192B2 (en) | 2015-07-31 | 2019-10-01 | Arm Limited | Vector processing using loops of dynamic vector length |
| CN107851021A (zh) * | 2015-07-31 | 2018-03-27 | Arm 有限公司 | 使用动态矢量长度循环的矢量处理 |
| JP2018525735A (ja) * | 2015-07-31 | 2018-09-06 | エイアールエム リミテッド | 動的ベクトル長のループを用いたベクトル処理 |
| GB2548603A (en) * | 2016-03-23 | 2017-09-27 | Advanced Risc Mach Ltd | Program loop control |
| US10768938B2 (en) | 2016-03-23 | 2020-09-08 | Arm Limited | Branch instruction |
| KR20180126002A (ko) * | 2016-03-23 | 2018-11-26 | 에이알엠 리미티드 | 프로그램 루프 제어 |
| GB2548603B (en) * | 2016-03-23 | 2018-09-26 | Advanced Risc Mach Ltd | Program loop control |
| WO2017163039A1 (fr) * | 2016-03-23 | 2017-09-28 | Arm Limited | Commande de boucle de programme |
| GB2548602A (en) * | 2016-03-23 | 2017-09-27 | Advanced Risc Mach Ltd | Program loop control |
| GB2548602B (en) * | 2016-03-23 | 2019-10-23 | Advanced Risc Mach Ltd | Program loop control |
| TWI738744B (zh) | 2016-03-23 | 2021-09-11 | 英商Arm股份有限公司 | 用於程式迴圈控制的設備、方法及電腦程式產品 |
| US10747536B2 (en) | 2016-03-23 | 2020-08-18 | Arm Limited | Program loop control |
| US10768934B2 (en) | 2016-03-23 | 2020-09-08 | Arm Limited | Decoding predicated-loop instruction and suppressing processing in one or more vector processing lanes |
| EP3495947A4 (fr) * | 2016-08-05 | 2020-05-20 | Cambricon Technologies Corporation Limited | Dispositif d'exploitation et son procédé de d'exploitation |
| EP4083789A1 (fr) * | 2017-05-17 | 2022-11-02 | Google LLC | Puce d'apprentissage de réseau neuronal à usage spécial |
| WO2018213598A1 (fr) * | 2017-05-17 | 2018-11-22 | Google Llc | Puce d'apprentissage de réseau neuronal à usage spécial |
| US11275992B2 (en) | 2017-05-17 | 2022-03-15 | Google Llc | Special purpose neural network training chip |
| EP4361832A3 (fr) * | 2017-05-17 | 2024-08-07 | Google LLC | Puce d'apprentissage de réseau neuronal à usage spécial |
| CN115039070A (zh) * | 2020-02-10 | 2022-09-09 | Xmos有限公司 | 用于向量运算的旋转累加器 |
| CN112233220A (zh) * | 2020-10-15 | 2021-01-15 | 洛阳众智软件科技股份有限公司 | 基于OpenSceneGraph的体积光生成方法、装置、设备和存储介质 |
| CN112233220B (zh) * | 2020-10-15 | 2023-12-15 | 洛阳众智软件科技股份有限公司 | 基于OpenSceneGraph的体积光生成方法、装置、设备和存储介质 |
| CN112506468B (zh) * | 2020-12-09 | 2023-04-28 | 上海交通大学 | 支持高吞吐多精度乘法运算的risc-v通用处理器 |
| CN112506468A (zh) * | 2020-12-09 | 2021-03-16 | 上海交通大学 | 支持高吞吐多精度乘法运算的risc-v通用处理器 |
| CN114331803A (zh) * | 2021-12-24 | 2022-04-12 | 浙江大学 | 一种面向多阶数字半色调的高能效专用处理器 |
| CN115374923A (zh) * | 2022-01-30 | 2022-11-22 | 西安交通大学 | 基于risc-v扩展的通用神经网络处理器微架构 |
| CN115861026B (zh) * | 2022-12-07 | 2023-12-01 | 格兰菲智能科技有限公司 | 数据处理方法、装置、计算机设备、存储介质 |
| CN115861026A (zh) * | 2022-12-07 | 2023-03-28 | 格兰菲智能科技有限公司 | 数据处理方法、装置、计算机设备、存储介质 |
| CN120540709A (zh) * | 2025-04-30 | 2025-08-26 | 上海思朗科技股份有限公司 | 向量处理器、高性能处理器和电子设备 |
Also Published As
| Publication number | Publication date |
|---|---|
| AU2002338616A1 (en) | 2002-10-28 |
| WO2002084451A3 (fr) | 2003-03-20 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20040073773A1 (en) | Vector processor architecture and methods performed therein | |
| EP3513281B1 (fr) | Instruction de multiplication-addition vectorielle | |
| US7937559B1 (en) | System and method for generating a configurable processor supporting a user-defined plurality of instruction sizes | |
| US6829696B1 (en) | Data processing system with register store/load utilizing data packing/unpacking | |
| US7509483B2 (en) | Methods and apparatus for meta-architecture defined programmable instruction fetch functions supporting assembled variable length instruction processors | |
| KR100705507B1 (ko) | 확장가능한 프로세서 아키텍처에 진보된 명령어들을부가하는 방법 및 장치 | |
| US5958048A (en) | Architectural support for software pipelining of nested loops | |
| US5805875A (en) | Vector processing system with multi-operation, run-time configurable pipelines | |
| EP1124181B1 (fr) | Appareil de traitement de données | |
| US6754809B1 (en) | Data processing apparatus with indirect register file access | |
| EP1102163A2 (fr) | Processeur avec jeu d'instructions amélioré | |
| WO2002084451A2 (fr) | Architecture de processeur vectoriel et procedes mis en oeuvre dans cette architecture | |
| CN108205448B (zh) | 具有在每个维度上可选择的多维循环寻址的流引擎 | |
| JP3829166B2 (ja) | 極長命令語(vliw)プロセッサ | |
| JP2002517037A (ja) | 混合ベクトル/スカラレジスタファイル | |
| WO2000034887A9 (fr) | Systeme pour une selection dynamique d'une sous-instruction vliw permettant d'obtenir un parallelisme de duree d'execution dans un processeur indirect vliw | |
| CA2366830A1 (fr) | Procede d'indexage de fichier registre, et appareil permettant de commander de maniere indirecte un adressage de registre dans un processeur a tres long mot d'instruction | |
| JP2008530642A (ja) | 低レイテンシーの大量並列データ処理装置 | |
| WO1998006042A1 (fr) | Procede et appareil permettant de decondenser des instructions longues | |
| EP0982655A2 (fr) | Unité de traitement de données et procédé pour l'exécution d'instructions à longueur variable | |
| US6857063B2 (en) | Data processor and method of operation | |
| US5768553A (en) | Microprocessor using an instruction field to define DSP instructions | |
| Song | Demystifying epic and ia-64 | |
| Kuo et al. | Digital signal processor architectures and programming | |
| WO1998006040A1 (fr) | Support architectural pour le chevauchement par logiciel de boucles imbriquees |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| AK | Designated states |
Kind code of ref document: A2 Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NO NZ OM PH PL PT RO RU SD SE SG SI SK SL TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM ZW |
|
| AL | Designated countries for regional patents |
Kind code of ref document: A2 Designated state(s): GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE CH CY DE DK ES FI FR GB GR IE IT LU MC NL PT SE TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG |
|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | ||
| DFPE | Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101) | ||
| WWE | Wipo information: entry into national phase |
Ref document number: 10467225 Country of ref document: US |
|
| REG | Reference to national code |
Ref country code: DE Ref legal event code: 8642 |
|
| 122 | Ep: pct application non-entry in european phase | ||
| NENP | Non-entry into the national phase |
Ref country code: JP |
|
| WWW | Wipo information: withdrawn in national office |
Country of ref document: JP |