TWI470545B - Apparatus, processor, system, method, instruction, and logic for performing range detection - Google Patents
- Publication number
- TWI470545B (application TW98136966A)
- Authority
- TW
- Taiwan
- Prior art keywords
- range
- vector
- input
- logic
- complex
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/17—Function evaluation by approximation methods, e.g. inter- or extrapolation, smoothing, least mean square method
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
- G06F7/38—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
- G06F7/48—Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
- G06F7/57—Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483 – G06F7/556 or for performing logical operations
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Software Systems (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Mathematical Analysis (AREA)
- Computational Mathematics (AREA)
- Mathematical Physics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Algebra (AREA)
- Complex Calculations (AREA)
- Executing Machine-Instructions (AREA)
- Advance Control (AREA)
- Length Measuring Devices With Unspecified Measuring Means (AREA)
Description
Embodiments of the invention relate generally to the field of information processing and, more specifically, to the field of performing range detection in computing systems and microprocessors.
The performance of mathematical functions in computer hardware, such as microprocessors, can depend on the use of look-up tables (LUTs) stored in some location, such as cache or main memory. Single-instruction, multiple-data (SIMD) instructions may perform multiple memory operations to access LUTs when evaluating mathematical functions in hardware. For example, a SIMD instruction that evaluates a function of several input operands may access a LUT for each of those operands in order to obtain the result of the SIMD function. Because some processor architectures do not provide parallel access to several LUTs, but instead use the same memory access logic to access one or more LUTs, these LUT accesses may occur serially rather than in parallel, which limits the performance of the SIMD function.
Mathematical functions may be evaluated in some algorithms using splines (piecewise polynomial curves) or other polynomial-based techniques. In some prior art examples, the spline functions used to evaluate mathematical functions require multiple software operations to perform tasks such as range detection, coefficient matching, and polynomial calculation. Evaluating mathematical functions with splines can therefore be computationally intensive and relatively slow, which limits the usefulness of spline calculations in computer programs.
Embodiments of the invention may be used to improve the performance of mathematical computation in microprocessors and computers. In some embodiments, spline calculations may be used to perform various mathematical operations at a higher level of performance than some prior art spline calculations. In at least one embodiment, spline calculation performance may be improved by accelerating at least one of the most time-consuming and resource-consuming operations involved in the calculation. In one embodiment, a range detection instruction and corresponding hardware logic are provided to accelerate the detection of ranges within a spline, each range corresponding to one of the polynomials used in the spline calculation.
Figure 1 illustrates a microprocessor in which at least one embodiment of the invention may be used. In particular, Figure 1 illustrates microprocessor 100 having one or more processor cores 105 and 110, each associated with a local cache memory 107 and 113, respectively. Also illustrated in Figure 1 is a shared cache memory 115, which may store versions of at least some of the information stored in each of the local cache memories 107 and 113. In some embodiments, microprocessor 100 may also include other logic not shown in Figure 1, such as an integrated memory controller, an integrated graphics controller, and other logic to perform other functions within a computer system, such as input/output control. In one embodiment, each microprocessor in a multi-processor system, or each processor core in a multi-core processor, may include or otherwise be associated with logic 119 to perform range detection in response to an instruction, in accordance with one embodiment.
Figure 2 illustrates, by way of example, a front-side-bus (FSB) computer system in which one embodiment of the invention may be used. Any processor 201, 205, 210, or 215 may access information from any local level-one (L1) cache memory 220, 225, 230, 235, 240, 245, 250, 255 within or otherwise associated with one of the processor cores 223, 227, 233, 237, 243, 247, 253, 257. Furthermore, any processor 201, 205, 210, or 215 may access information from any one of the shared level-two (L2) cache memories 203, 207, 213, 217 or from system memory 260 via chipset 265. One or more of the processors in Figure 2 may include or otherwise be associated with logic 219 to execute a range detection instruction in accordance with one embodiment.
In addition to the FSB computer system illustrated in Figure 2, other system configurations may be used in conjunction with various embodiments of the invention, including point-to-point (P2P) interconnect systems and ring interconnect systems. The P2P system of Figure 3, for example, may include several processors, of which only two, processors 370 and 380, are shown by way of example. Processors 370, 380 may each include a local memory controller hub (MCH) 372, 382 to connect with memory 32, 34. Processors 370, 380 may exchange data via a point-to-point (PtP) interface 350 using PtP interface circuits 378, 388. Processors 370, 380 may each exchange data with a chipset 390 via individual PtP interfaces 352, 354 using point-to-point interface circuits 376, 394, 386, 398. Chipset 390 may also exchange data with a high-performance graphics circuit 338 via a high-performance graphics interface 339. Embodiments of the invention may be located within any processor having any number of processing cores, or within each of the PtP bus agents of Figure 3. In one embodiment, any processor core may include or otherwise be associated with a local cache memory (not shown). Furthermore, a shared cache memory (not shown) may be included in either processor or outside of both processors, yet connected with the processors via a P2P interconnect, such that the local cache information of either or both processors may be stored in the shared cache memory if a processor is placed into a low-power mode. One or more of the processors or cores in Figure 3 may include or otherwise be associated with logic 319 to execute a range detection instruction in accordance with one embodiment.
Spline calculations can eliminate the need for look-up tables (LUTs) and the expensive memory accesses associated with them. Figure 4, for example, illustrates a first-order spline function. In Figure 4, let "X" be an 8-element input vector, 256 bits in total, in which each element holds a data value "Xin" represented by 32 bits. For any given input "Xin", the spline function Y yields an output element "Yout", so that the outputs together form a vector W = Y(X). The elements of vector W may be evaluated using spline calculation operations that include range detection, coefficient matching, and polynomial calculation. At least one embodiment includes an instruction and logic to perform the range detection used in evaluating the spline function. In some embodiments the element size of vector X may be 8 bits; in other embodiments it may be 16, 32, 64, or 128 bits, and so on. Furthermore, in some embodiments the elements of X may be integers, floating-point values, single- or double-precision floating-point values, and so on.
In one embodiment, the range detection logic may include decode and execution logic to execute a range detection instruction having an instruction format and control fields to perform the expression "Range Vector (R) = Range_Detect(Input Vector (X), Range Limit Vector (RL))", where R is the range vector produced by the logic described in Figure 5, X is the input vector, and RL is a vector containing the first Xin of each range of the spline function. For example, in one embodiment the vector RL contains the first Xin of each range of Figure 4 (0, 10, 30, 50, 70, 80, 255), in some order corresponding to the input vector X.
In one embodiment, range detection matches each input point provided in the input vector X to a specific range of the spline function illustrated in Figure 4 and stores the result in a SIMD register. The following example shows an input vector X and the corresponding range detection vector for the spline described in Figure 4. The example describes operation on 16-bit fixed-point inputs; however, the same technique applies to 8- and 32-bit fixed-point and floating-point values, as well as to the different data types used in current and future vector extensions.
Let X be the following input vector, in which each element contains an Xin value along the x-axis of Figure 4:
Based on the above input vector X and the spline depicted in Figure 4, the range detection vector will contain the following:
In one embodiment, an instruction may be executed to generate the above range detection vector by operating on the input vector according to the spline of Figure 4. In one embodiment, the instruction causes the input vector elements to be compared with each of the range limits (0, 10, 30, 50, 70, 80 in Figure 4). In one embodiment, each range limit may be broadcast to a SIMD register and compared with the input vector X. In one embodiment in which each comparison produces 0 or -1 to indicate its outcome, subtracting and accumulating the comparison results yields the range of the spline in which each input point of the input vector X falls. The logic that performs these comparisons is illustrated in Figure 5, where x_i denotes an input point within input vector X, t_i denotes a range limit of the spline of Figure 4, and r_i denotes the resulting range, within range detection vector R, that corresponds to input point x_i. In other embodiments, the comparison may produce other values (e.g., 1 and 0), and the range detection vector R may be generated by comparing, adding or subtracting, and accumulating those values.
Figure 5a illustrates logic that may be used to generate a range detection vector R in response to executing a range detection instruction, according to one embodiment. In one embodiment, logic 500a includes input vector X 501a, which is compared by comparison logic 505a with range limit vector 510a, each element of which contains the range limit of the spline range that corresponds to the "i"th element of input vector X. In one embodiment, an element of input vector 501a is compared by comparison logic 505a with the corresponding element of range limit vector 510a. In one embodiment, the elements of zero vector 515a are added (517a) to the negative of the result of comparing input vector 501a with range limit vector 510a, to generate a 0 or -1 in each element of the comparison result. Input vector 501a is then compared with the corresponding elements of range limit vector 520a, and the negative of that result is added to the previous comparison result. This process continues for each element of range limit vector 510a, culminating in range detection vector 525a.
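The compare-and-accumulate scheme described above can be sketched in scalar Python. This is only an illustrative model under stated assumptions, not the patent's pseudocode or an actual SIMD implementation: the function name `range_detect_accumulate`, the use of `>=` as the comparison, and the sample inputs are all hypothetical; only the range limits 0, 10, 30, 50, 70, 80 come from the Figure 4 example.

```python
def range_detect_accumulate(x, limits):
    """Return, for each input, the 1-based index of the range containing it.

    Hypothetical scalar model of the compare-and-accumulate logic; each list
    stands in for one SIMD register.
    """
    r = [0] * len(x)                     # zero vector used as the accumulator
    for t in limits:                     # one pass per range limit
        # Broadcast the limit and compare it with every input element.
        # A comparison that holds yields -1 (all bits set), otherwise 0.
        cmp_result = [-1 if xi >= t else 0 for xi in x]
        # Subtracting the -1/0 result adds 1 for each limit at or below x_i.
        r = [ri - ci for ri, ci in zip(r, cmp_result)]
    return r


limits = [0, 10, 30, 50, 70, 80]         # range starts from the Figure 4 example
x = [5, 10, 29, 30, 55, 77, 80, 200]     # illustrative inputs, not the patent's
print(range_detect_accumulate(x, limits))  # [1, 2, 2, 3, 4, 5, 6, 6]
```

Each input ends up with the count of range limits that do not exceed it, which is its range number when the first limit is the lowest possible input.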
In one embodiment, the logic of Figure 5a may be used in conjunction with a program that uses at least one instruction set architecture, as illustrated by the following pseudocode:
Other techniques for determining the range detection vector R may be used in other embodiments, including logic to perform a binary search on the range limit vector elements. Figure 5b illustrates a binary search tree that may be used to generate range detection vector R, according to one embodiment. In the binary search tree 500b of Figure 5b, each element of input vector X 501b is compared with elements 510b of the range limit vector, beginning with a middle vector element (T4, in the case of an 8-element input vector and range limit vector) and continuing into each half vector (T5-T8 and T3-T1). In one embodiment, the following pseudocode illustrates the function of the binary search tree of Figure 5b, using instructions from an instruction set architecture.
In the above pseudocode, T represents the range limit vector and I represents the i-th element of the input vector X and of the range limit vector T.
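As a rough illustration of the binary-search alternative, the sketch below searches a sorted list of range limits for each input independently. It is a hypothetical scalar model rather than the patent's pseudocode: the name `range_detect_binary` and the sample inputs are assumptions, and a vector implementation would perform each comparison step on all elements at once.

```python
def range_detect_binary(x, limits):
    """Return the 1-based range index of each input by binary search.

    `limits` must be sorted in ascending order. The loop counts how many
    limits are less than or equal to the input, which is the range number
    when the first limit is the lowest possible input.
    """
    result = []
    for xi in x:
        lo, hi = 0, len(limits)
        while lo < hi:
            mid = (lo + hi) // 2         # start from the middle limit
            if xi >= limits[mid]:
                lo = mid + 1             # continue in the upper half
            else:
                hi = mid                 # continue in the lower half
        result.append(lo)
    return result


limits = [0, 10, 30, 50, 70, 80]         # range starts from the Figure 4 example
print(range_detect_binary([5, 10, 29, 30, 55, 77, 80, 200], limits))
```

For six limits the search needs at most three comparisons per input, compared with six in the compare-and-accumulate scheme.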
In one embodiment, an instruction and corresponding logic are used to generate the range detection vector R. Once the range detection vector R has been determined, the other operations associated with evaluating the spline function for the particular mathematical operation in question may be performed, including the coefficient matching and polynomial calculation operations.
In one embodiment, each polynomial corresponding to a range of the spline in Figure 4 has corresponding coefficients. Coefficient matching matches coefficient vector elements to the range detection vector elements generated in one embodiment of the invention. In the example illustrated in Figure 4 there are six ranges, which can be described by the following polynomials:
Range 1: y = 2*x (0 <= X < 10)
Range 2: y = 0*x + 20 (10 <= X < 30)
Range 3: y = -2*x + 20 (30 <= X < 50)
Range 4: y = 0*x - 20 (50 <= X < 70)
Range 5: y = 2*x - 20 (70 <= X < 80)
Range 6: y = 0 (80 <= X < 255)
Coefficient matching is based on the result of the range detection stage. The number of resulting coefficient vectors equals the highest polynomial order plus one. Continuing the example above, the resulting coefficient vectors C1 and C2 for the input vector X described in Figure 4 are illustrated below:
The order of all the polynomials in the above example is one, so the number of resulting coefficient vectors is two. In one embodiment, the C1 and C2 vectors are computed with a blend instruction, based on the output of the range detection stage described in Figures 5a and 5b, which stores the appropriate coefficient in the corresponding element of each of the two coefficient vectors C1 and C2.
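Coefficient matching can be pictured as a table gather indexed by the range detection vector. The sketch below is an assumption-laden illustration: the tables `C1_TABLE` and `C2_TABLE` are read off the six polynomials listed above (slope and constant term per range), the name `match_coefficients` and the sample range vector are hypothetical, and a plain list lookup stands in for whatever vector instruction performs the selection in hardware.

```python
# Per-range coefficients taken from the six example polynomials:
# slope (C1) and constant term (C2) for ranges 1 through 6.
C1_TABLE = [2, 0, -2, 0, 2, 0]
C2_TABLE = [0, 20, 20, -20, -20, 0]


def match_coefficients(r):
    """Gather C1 and C2 for each element of a 1-based range detection vector."""
    c1 = [C1_TABLE[ri - 1] for ri in r]
    c2 = [C2_TABLE[ri - 1] for ri in r]
    return c1, c2


r = [1, 2, 2, 3, 4, 5, 6, 6]             # illustrative range detection vector
c1, c2 = match_coefficients(r)
print(c1)  # [2, 0, 0, -2, 0, 2, 0, 0]
print(c2)  # [0, 20, 20, 20, -20, -20, 0, 0]
```

Because every polynomial in the example is first order, two coefficient vectors suffice; a higher-order spline would add one table and one gathered vector per additional coefficient.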
After the coefficients of the polynomials corresponding to input vector X have been computed, the polynomial evaluation may be performed for each input value in X. In one embodiment, polynomial calculation may be divided into two main operations. The first operation finds the offset of each input value from the beginning of its range of the spline. In one embodiment, the offsets may be found by matching the beginning of each range to each input point, for example with a blend instruction. The offset from the beginning of each range of the spline of Figure 4 is then computed by subtracting the starting value of the range from the corresponding input vector element. For example, point 77 in the spline of Figure 4 is assigned to range 5. Since range 5 begins at 70, the offset of the point from the beginning of its assigned range is 7. The second operation computes the output vector element for each input vector element. To compute the final output vector, the offset from the beginning of the range is used as the input to the relevant polynomial. For example, the range 5 polynomial is described by the formula y = 2*x - 20. For input vector element 77 the offset is 7, so the final value for point 77 is y = 2*(offset) - 20 = 2*(7) - 20 = -6. After the remaining polynomials corresponding to the input vector elements have been calculated, the results may be stored in a result vector. The vector of range starting values B, the offset vector O, and the output vector Y are illustrated below:
According to one embodiment, the output vector Y is computed by an expression; in this example, Y is computed as "Y = O*C1 + C2".
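The two evaluation steps, finding the offsets and then applying Y = O*C1 + C2, can be checked against the worked value for point 77 with a short sketch. The names `RANGE_START` and `evaluate_polynomials` are hypothetical, and the range starts are taken from the Figure 4 example; the sketch assumes the range detection and coefficient vectors have already been produced.

```python
RANGE_START = [0, 10, 30, 50, 70, 80]    # first Xin of ranges 1 through 6


def evaluate_polynomials(x, r, c1, c2):
    """Return (B, O, Y): range starts, offsets, and outputs for each input."""
    b = [RANGE_START[ri - 1] for ri in r]            # start of each input's range
    o = [xi - bi for xi, bi in zip(x, b)]            # offset into the range
    y = [oi * a + c for oi, a, c in zip(o, c1, c2)]  # Y = O*C1 + C2
    return b, o, y


# Point 77 lies in range 5, whose polynomial has C1 = 2 and C2 = -20.
b, o, y = evaluate_polynomials([77], [5], [2], [-20])
print(b, o, y)  # [70] [7] [-6]
```

The result reproduces the offset of 7 and the final value of -6 derived in the text.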
Figure 6 illustrates a flow diagram of operations that may be used in conjunction with at least one embodiment of the invention. In one embodiment, at operation 601 a range detection vector is generated. In one embodiment, the range detection vector is generated for each input vector element according to a process such as the binary search or the logic described herein. At operation 605, coefficient matching is performed to generate, based on the input vector elements, the coefficients of the polynomial corresponding to each range of the spline. At operation 610, the polynomial calculation is performed for each element of the input vector and the result is stored in a result vector.
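The three operations of the flow diagram can be chained in a compact sketch. Everything here is illustrative: the function names, the `STARTS` and `COEFFS` tables (read from the Figure 4 example), and the sample inputs are assumptions, and the standard-library `bisect_right` is used as a stand-in for the range detection step rather than as the method of the embodiment. Inputs are assumed to be non-negative.

```python
from bisect import bisect_right

# Range starts and per-range (C1, C2) pairs read from the Figure 4 example.
STARTS = [0, 10, 30, 50, 70, 80]
COEFFS = [(2, 0), (0, 20), (-2, 20), (0, -20), (2, -20), (0, 0)]


def op_601_detect(x):
    """Operation 601: 1-based range index per input (inputs assumed >= 0)."""
    return [bisect_right(STARTS, xi) for xi in x]


def op_605_match(r):
    """Operation 605: gather the coefficient vectors C1 and C2."""
    c1 = [COEFFS[ri - 1][0] for ri in r]
    c2 = [COEFFS[ri - 1][1] for ri in r]
    return c1, c2


def op_610_evaluate(x, r, c1, c2):
    """Operation 610: Y = O*C1 + C2, with O the offset into each range."""
    o = [xi - STARTS[ri - 1] for xi, ri in zip(x, r)]
    return [oi * a + b for oi, a, b in zip(o, c1, c2)]


def evaluate_piecewise(x):
    """Run operations 601, 605, and 610 in order on an input vector."""
    r = op_601_detect(x)
    c1, c2 = op_605_match(r)
    return op_610_evaluate(x, r, c1, c2)


print(evaluate_piecewise([5, 10, 29, 30, 55, 77, 80, 200]))
# [10, 20, 20, 20, -20, -6, 0, 0]
```

Keeping the stages separate mirrors the flow diagram and makes it straightforward to swap in a different range detection method without touching coefficient matching or evaluation.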
One or more aspects of at least one embodiment may be implemented by representative data stored on a computer-readable medium, which represents various logic within the processor and which, when read by a machine, causes the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores", may be stored on a tangible, computer-readable medium ("tape") and supplied to various customers or manufacturing facilities to load into the fabrication machines that actually make the logic or processor.
Thus, a method and apparatus for directing micro-architectural memory region accesses has been described. It is to be understood that the above description is intended to be illustrative and not restrictive. Many other embodiments will be apparent to those of skill in the art upon reading and understanding the above description. The scope of the invention should therefore be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
5 ... range
32 ... memory
34 ... memory
77 ... point
100 ... microprocessor
105 ... processor core
107 ... local cache memory
110 ... processor core
113 ... local cache memory
115 ... shared cache memory
119 ... logic
201 ... processor
203 ... cache memory
205 ... processor
207 ... cache memory
210 ... processor
213 ... cache memory
215 ... processor
217 ... cache memory
219 ... logic
220 ... cache memory
223 ... processor core
225 ... cache memory
227 ... processor core
230 ... cache memory
233 ... processor core
235 ... cache memory
237 ... processor core
240 ... cache memory
243 ... processor core
245 ... cache memory
247 ... processor core
250 ... cache memory
253 ... processor core
255 ... cache memory
257 ... processor core
260 ... system memory
265 ... chipset
319 ... logic
338 ... graphics circuit
339 ... graphics interface
350 ... point-to-point interface
352 ... point-to-point interface
354 ... point-to-point interface
370 ... processor
372 ... memory controller hub
376 ... point-to-point interface circuit
378 ... point-to-point interface circuit
380 ... processor
382 ... memory controller hub
386 ... point-to-point interface circuit
388 ... point-to-point interface circuit
390 ... chipset
394 ... point-to-point interface circuit
398 ... point-to-point interface circuit
500a ... logic
500b ... binary search tree
501a ... input vector
501b ... input vector
505a ... comparison logic
510a ... range limit vector
510b ... element
515a ... zero vector
520a ... range limit vector
525a ... range detection vector
601 ... operation
605 ... operation
610 ... operation
C1 ... coefficient vector
C2 ... coefficient vector
I ... element
R ... range detection vector
T ... range limit vector
T1 ... half vector
T2 ... half vector
T3 ... half vector
T4 ... middle vector element
T5 ... half vector
T6 ... half vector
T7 ... half vector
T8 ... half vector
X ... input vector
Embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings, in which like reference numerals refer to similar elements and in which:
Figure 1 illustrates a block diagram of a microprocessor in which at least one embodiment of the invention may be used;
Figure 2 illustrates a block diagram of a shared bus computer system in which at least one embodiment of the invention may be used;
Figure 3 illustrates a block diagram of a point-to-point interconnect computer system in which at least one embodiment of the invention may be used;
Figure 4 illustrates a spline divided into ranges, according to one embodiment;
Figure 5 is a schematic diagram of logic that may be used to accelerate range detection within a spline in response to a range detection instruction, according to one embodiment;
Figure 6 is a flow diagram of operations that may be used to perform at least one embodiment of the invention.
Claims (23)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US12/290,565 US8386547B2 (en) | 2008-10-31 | 2008-10-31 | Instruction and logic for performing range detection |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| TW201030607A (en) | 2010-08-16 |
| TWI470545B (en) | 2015-01-21 |
Family
ID=42063259
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| TW98136966A TWI470545B (en) | 2008-10-31 | 2009-10-30 | Apparatus,processor,system,method,instruction,and logic for performing range detection |
Country Status (7)
| Country | Link |
|---|---|
| US (1) | US8386547B2 (en) |
| JP (2) | JP5518087B2 (en) |
| KR (1) | KR101105474B1 (en) |
| CN (1) | CN101907987B (en) |
| DE (1) | DE102009051288A1 (en) |
| TW (1) | TWI470545B (en) |
| WO (1) | WO2010051298A2 (en) |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9454366B2 (en) | 2012-03-15 | 2016-09-27 | International Business Machines Corporation | Copying character data having a termination character from one memory location to another |
| US9280347B2 (en) | 2012-03-15 | 2016-03-08 | International Business Machines Corporation | Transforming non-contiguous instruction specifiers to contiguous instruction specifiers |
| US9459864B2 (en) | 2012-03-15 | 2016-10-04 | International Business Machines Corporation | Vector string range compare |
| US9715383B2 (en) | 2012-03-15 | 2017-07-25 | International Business Machines Corporation | Vector find element equal instruction |
| US9454367B2 (en) | 2012-03-15 | 2016-09-27 | International Business Machines Corporation | Finding the length of a set of character data having a termination character |
| US9459868B2 (en) | 2012-03-15 | 2016-10-04 | International Business Machines Corporation | Instruction to load data up to a dynamically determined memory boundary |
| US9588762B2 (en) | 2012-03-15 | 2017-03-07 | International Business Machines Corporation | Vector find element not equal instruction |
| US9268566B2 (en) | 2012-03-15 | 2016-02-23 | International Business Machines Corporation | Character data match determination by loading registers at most up to memory block boundary and comparing |
| US9459867B2 (en) | 2012-03-15 | 2016-10-04 | International Business Machines Corporation | Instruction to load data up to a specified memory boundary indicated by the instruction |
| US9710266B2 (en) | 2012-03-15 | 2017-07-18 | International Business Machines Corporation | Instruction to compute the distance to a specified memory boundary |
| US9495155B2 (en) * | 2013-08-06 | 2016-11-15 | Intel Corporation | Methods, apparatus, instructions and logic to provide population count functionality for genome sequencing and alignment |
| US9513907B2 (en) * | 2013-08-06 | 2016-12-06 | Intel Corporation | Methods, apparatus, instructions and logic to provide vector population count functionality |
| US20160124651A1 (en) * | 2014-11-03 | 2016-05-05 | Texas Instruments Incorporated | Method for performing random read access to a block of data using parallel lut read instruction in vector processors |
| US20190250917A1 (en) * | 2018-02-14 | 2019-08-15 | Apple Inc. | Range Mapping of Input Operands for Transcendental Functions |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20030018687A1 (en) * | 1999-04-29 | 2003-01-23 | Stavros Kalafatis | Method and system to perform a thread switching operation within a multithreaded processor based on detection of a flow marker within an instruction information |
| US20050044123A1 (en) * | 2003-08-22 | 2005-02-24 | Apple Computer, Inc. | Computation of power functions using polynomial approximations |
| CN1754187A (en) * | 2003-02-28 | 2006-03-29 | 索尼株式会社 | Image processing device, method, and program |
| US20070074007A1 (en) * | 2005-09-28 | 2007-03-29 | Arc International (Uk) Limited | Parameterizable clip instruction and method of performing a clip operation using the same |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4918618A (en) * | 1988-04-11 | 1990-04-17 | Analog Intelligence Corporation | Discrete weight neural network |
| JP3303835B2 (en) * | 1999-04-30 | 2002-07-22 | 日本電気株式会社 | Apparatus and method for generating pitch pattern for rule synthesis of speech |
| JP3688533B2 (en) * | 1999-11-12 | 2005-08-31 | 本田技研工業株式会社 | Degradation state evaluation method of exhaust gas purification catalyst device |
| WO2003019356A1 (en) | 2001-08-22 | 2003-03-06 | Adelante Technologies B.V. | Pipelined processor and instruction loop execution method |
2008
- 2008-10-31: US application US12/290,565 (US8386547B2), status: Active

2009
- 2009-10-28: JP application JP2011534699A (JP5518087B2), status: Active
- 2009-10-28: WO application PCT/US2009/062307 (WO2010051298A2), status: Ceased
- 2009-10-29: DE application DE102009051288A (DE102009051288A1), status: Pending
- 2009-10-30: KR application KR1020090104031A (KR101105474B1), status: Expired - Fee Related
- 2009-10-30: TW application TW98136966A (TWI470545B), status: IP Right Cessation
- 2009-10-31: CN application CN200910253082.XA (CN101907987B), status: Active

2014
- 2014-01-17: JP application JP2014006991A (JP5883462B2), status: Expired - Fee Related
Non-Patent Citations (1)
| Title |
|---|
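| Master's thesis, Department of Mechanical Engineering, National Cheng Kung University; graduate student: 陳武勇; advisor: 何旭彬; "Using Graphics Processors for B-Spline Finite Element Analysis" (「使用圖形處理器於B-Spline有限元素分析」), June 2007 * |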
Also Published As
| Publication number | Publication date |
|---|---|
| WO2010051298A2 (en) | 2010-05-06 |
| KR20100048928A (en) | 2010-05-11 |
| TW201030607A (en) | 2010-08-16 |
| KR101105474B1 (en) | 2012-01-13 |
| US8386547B2 (en) | 2013-02-26 |
| WO2010051298A3 (en) | 2010-07-08 |
| JP5518087B2 (en) | 2014-06-11 |
| US20100115014A1 (en) | 2010-05-06 |
| JP2014096174A (en) | 2014-05-22 |
| DE102009051288A1 (en) | 2010-05-06 |
| JP5883462B2 (en) | 2016-03-15 |
| CN101907987B (en) | 2015-05-20 |
| JP2012507796A (en) | 2012-03-29 |
| CN101907987A (en) | 2010-12-08 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| TWI470545B (en) | Apparatus,processor,system,method,instruction,and logic for performing range detection | |
| CN109240746B (en) | Apparatus and method for performing matrix multiplication | |
| JP5647859B2 (en) | Apparatus and method for performing multiply-accumulate operations | |
| JP5573134B2 (en) | Vector computer and instruction control method for vector computer | |
| US20210264273A1 (en) | Neural network processor | |
| JP5731937B2 (en) | Vector floating point argument reduction | |
| EP3451153B1 (en) | Apparatus and method for executing transcendental function operation of vectors | |
| CN110321161B (en) | Vector function fast lookup using SIMD instructions | |
| US12166878B2 (en) | System and method to improve efficiency in multiplication ladder-based cryptographic operations | |
| US5341320A (en) | Method for rapidly processing floating-point operations which involve exceptions | |
| WO2017185392A1 (en) | Device and method for performing four fundamental operations of arithmetic of vectors | |
| TWI493456B (en) | Method, apparatus and system for execution of a vector calculation instruction | |
| US11416261B2 (en) | Group load register of a graph streaming processor | |
| TWI587137B (en) | Improved simd k-nearest-neighbors implementation | |
| WO2024032027A1 (en) | Method for reducing power consumption, and processor, electronic device and storage medium | |
| WO2019023910A1 (en) | Data processing method and device | |
| CN110060195A (en) | A kind of method and device of data processing | |
| US10289386B2 (en) | Iterative division with reduced latency | |
| KR20140138053A (en) | Fma-unit, in particular for use in a model calculation unit for pure hardware-based calculation of a function-model | |
| US11080054B2 (en) | Data processing apparatus and method for generating a status flag using predicate indicators | |
| JP3310316B2 (en) | Arithmetic unit | |
| US20220326956A1 (en) | Processor embedded with small instruction set | |
| JP3773033B2 (en) | Data arithmetic processing apparatus and data arithmetic processing program | |
| US20110153702A1 (en) | Multiplication of a vector by a product of elementary matrices | |
| JP2002304288A (en) | Data operation processing device and data operation processing program | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | MM4A | Annulment or lapse of patent due to non-payment of fees | |