TWI236620B

TWI236620B - On-die mechanism for high-reliability processor

Info

Publication number: TWI236620B
Application number: TW092132000A
Authority: TW
Inventors: Hang Nguyen; Steven Tu; Alexander Honcharik; Sujat Jamil
Original assignee: Intel Corp
Priority date: 2002-12-19
Filing date: 2003-11-14
Publication date: 2005-07-21
Also published as: WO2004061666A3; CN1729456A; AU2003287729A1; US7055060B2; CN100375050C; AU2003287729A8; EP1573544A2; HK1079316A1; WO2004061666A2; US20040123201A1; JP2006510117A; ATE461484T1; TW200416595A; DE60331771D1; EP1573544B1

Abstract

A processor includes first and second execution cores that operate in a redundant (FRC) mode, an FRC check unit to compare results from the first and second execution cores, and an error check unit to detect recoverable errors in the first and second cores. The error detector disables the FRC checker, responsive to detection of a recoverable error. A multi-mode embodiment of the processor implements a multi-core mode in addition to the FRC mode. An arbitration unit regulates access to resources shared by the first and second execution cores in multi-core mode. The FRC checker is located proximate to the arbitration unit in the multi-mode embodiment.

Description

[Technical Field] The present invention relates to microprocessors and, more particularly, to mechanisms for handling errors in FRC-enabled processors.

[Prior Art] Servers and other high-end computing and communication systems are designed to provide high levels of reliability and availability. Soft errors pose a major challenge to both of these properties. Soft errors result from collisions between high-energy particles, such as alpha particles, and charge storage nodes. They are prevalent in storage arrays such as caches, TLBs, and the like, which include large numbers of charge storage nodes. They also occur in random state elements and logic. Soft error rates (SER) increase as device geometries shrink and device densities grow.

High-reliability systems include safeguards to detect and manage soft errors before they lead to silent, that is, undetected, data corruption (SDC). However, to the extent that the error detection/handling mechanisms that support high-reliability operation take a system away from its normal operation, the system's availability is reduced. For example, one such mechanism resets the system to its last known valid state if an error is detected. The system cannot perform its assigned tasks while it undergoes the reset operation.

One well-known mechanism for detecting soft errors is functional redundancy checking (FRC). A single processor with FRC capability may include replicated instruction execution cores on which the same instruction code is run. Depending on the particular embodiment, each replicated execution core may include one or more caches, register files, and supporting resources in addition to the basic execution units (integer, floating point, load/store, etc.). FRC hardware compares the results generated by each core and, if a discrepancy is detected, the FRC system passes control to an error-handling routine.
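
The comparison step described above can be illustrated with a small software model. This is only a sketch under stated assumptions: real FRC is done in hardware on signals, and the names `run_lockstep`, `healthy_core`, and `make_faulty_core` are hypothetical helpers invented for the illustration, not anything defined in the patent.

```python
def run_lockstep(instructions, core_a, core_b, on_mismatch):
    """Run the same instruction stream on two cores and compare each result.

    Results are committed only while both cores agree. On the first
    disagreement, control passes to the error handler and execution stops.
    """
    committed = []
    for instr in instructions:
        result_a = core_a(instr)
        result_b = core_b(instr)
        if result_a != result_b:
            # The comparison shows only that the cores disagree,
            # not which of the two results is correct.
            on_mismatch(instr, result_a, result_b)
            break
        committed.append(result_a)
    return committed


def healthy_core(x):
    """Hypothetical core model: doubles its input."""
    return x * 2


def make_faulty_core(bad_input):
    """Hypothetical core model that flips the low result bit for one input."""
    def core(x):
        result = x * 2
        return result ^ 1 if x == bad_input else result
    return core


errors = []
committed = run_lockstep(
    [1, 2, 3, 4],
    healthy_core,
    make_faulty_core(3),
    lambda instr, a, b: errors.append((instr, a, b)),
)
```

Note that the handler receives two differing results and no indication of which one is good, which is why an FRC error on its own is detectable but not recoverable.
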
The point at which results from the different execution cores are compared represents the FRC boundary of the system. Errors that are not detected at the FRC boundary can lead to SDC.

Because an FRC error indicates only that the execution cores disagree on a result, FRC errors are detectable but not recoverable. As noted above, the FRC error-handling routine typically resets the system to the last known point of reliable data. This reset mechanism is relatively time-consuming. It takes the system away from its normal operation and reduces system availability.

FRC is only one mechanism for handling soft errors, and for random logic and random state elements it is the primary mechanism. Array structures present a different picture. Array structures typically include parity and/or ECC hardware, which detects soft errors by examining properties of the data. In many cases the system can correct errors caused by data corruption using relatively fast hardware or software mechanisms. In an FRC-enabled processor, however, such errors may manifest as FRC errors, because they take the execution cores out of lock-step. Handling these otherwise recoverable errors through a reset mechanism reduces system availability.

[Summary of the Invention] The present invention relates to a mechanism that efficiently integrates recoverable and non-recoverable error-handling mechanisms in an FRC-enabled processor.

[Embodiment] The following description sets forth numerous specific details to provide a thorough understanding of the invention. However, those skilled in the art, having the benefit of this disclosure, will appreciate that the invention may be practiced without these specific details. In addition, various well-known methods, procedures, components, and circuits are not described in detail, in order to focus attention on the features of the present invention.
For example, aspects of the invention are illustrated using a dual-core processor, but those skilled in the art will recognize that it may be applied to processors with more than two cores, with appropriate modifications to the reset and recovery mechanisms.

FIG. 1 is a block diagram representing one embodiment of an FRC-enabled processor 110 in accordance with the present invention. Processor 110 includes first and second execution cores 120(a), 120(b) (generically, execution cores 120), an FRC checker 130, an error detector 140, a recovery module 150, a reset module 160, and shared resources 170. A portion of the FRC boundary is indicated by dashed line 104. For purposes of illustration, recovery module 150 and reset module 160 are shown as part of processor 110. These modules may be implemented in whole or in part in hardware, firmware, or software, and may reside on or off the processor die. Similarly, shared resources 170 may include components on the processor die as well as components on one or more different dies.

Each execution core 120 includes a data pipeline 124 and an error pipeline 128, which feed FRC checker 130 and error detector 140, respectively. Data pipeline 124 represents the logic that operates on various types of data as it moves through processor 110 toward FRC checker 130. The data processed by data pipeline 124 may include result operands, status flags, addresses, instructions, and the like that are generated and staged through processor 110 during code execution. Error pipeline 128 represents the logic that operates on various types of data to detect errors in the data and provide appropriate signals to error detector 140. For example, the signals may be one or more bits (flags) that represent the parity or ECC status of data retrieved from various storage arrays (not shown) of processor 110. Soft errors in these arrays may appear as parity or ECC error flags when the corrupted data is accessed.
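
As a rough illustration of how a storage array can raise an error flag when corrupted data is read, the sketch below models a single even-parity bit per word. It is a minimal sketch, not the patent's hardware: `ParityArray`, `parity_bit`, and `inject_bit_flip` are hypothetical names, and single-bit parity is an assumed simplification (it detects but cannot correct an error, unlike ECC).

```python
def parity_bit(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class ParityArray:
    """Toy storage array that keeps one parity bit alongside each word."""

    def __init__(self, size: int):
        self._data = [0] * size
        self._parity = [0] * size

    def write(self, index: int, word: int) -> None:
        self._data[index] = word
        self._parity[index] = parity_bit(word)

    def inject_bit_flip(self, index: int, bit: int) -> None:
        """Model a soft error: flip one stored data bit, leave parity alone."""
        self._data[index] ^= 1 << bit

    def read(self, index: int):
        """Return (word, error_flag). The flag plays the role of the signal
        an error pipeline would forward to the error detector."""
        word = self._data[index]
        error_flag = parity_bit(word) != self._parity[index]
        return word, error_flag


array = ParityArray(4)
array.write(0, 0b1011)
clean_read = array.read(0)
array.inject_bit_flip(0, 2)
corrupted_read = array.read(0)
```

The flag travels separately from the data word, which mirrors the separation between the data pipeline and the error pipeline described above.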

If an error reaches error detector 140 from either core 120, recovery module 150 is activated to execute a recovery routine. Recovery may be implemented with relatively low latency by hardware, software, firmware, or some combination of these. For example, the probability that data is corrupted in both execution cores 120 at the same (or nearly the same) time is extremely small. This leaves an uncorrupted copy of the data available, so that processor 110 can restore data integrity.
However, an FRC error will be triggered if corrupted data from one execution core and an uncorrupted version of the data from the other execution core reach FRC checker 130 before recovery module 150 is activated. Because FRC errors are not recoverable, reset module 160 resets the system if FRC checker 130 signals an FRC error before the underlying parity/ECC error is detected.

Not all FRC errors can be traced to underlying parity/ECC or other correctable soft errors. For those that can, it is faster to handle the underlying soft error through error detector 140 than to handle, through the FRC checker, the FRC error generated when the corrupted data reaches FRC boundary 104. As noted above, the latency of the reset process is much longer than that of the recovery process, and reset should be avoided if the error can be corrected by recovery module 150. In addition, reset usually brings down the entire system, whereas recovery results in only a temporary loss of performance. For this reason, FRC checker 130 is temporarily disabled if error detector 140 detects an error in either error pipeline 128, since execution cores 120 are then no longer in lock-step.

Execution cores 120 operate in lock-step in FRC mode, but data pipeline 124 and error pipeline 128 may operate relatively independently. For example, ECC hardware is relatively complex and consequently relatively slow, particularly for 2-bit errors. A flag signaling such an error may reach error detector 140 before, after, or at the same time as the associated data reaches FRC checker 130. This flexibility is generally beneficial. For example, it allows data to be used speculatively before its error status is determined.
Because soft errors are relatively rare, and error pipeline 128 is generally as fast as data pipeline 124, this flexibility is a net positive. As long as the error flag reaches error detector 140 in time to disable FRC checker 130 before the checker acts on a mismatch attributable to the corrupted data, the lower-latency recovery routine can be used.

As described below, processor 110 may implement strategies to mitigate race conditions between the recoverable and non-recoverable error mechanisms. For example, a streamlined signaling mechanism may be used in FRC mode to speed up disabling of FRC checker 130 in the event of a non-FRC error. In addition, an FRC error may be delayed for an interval before reset, in case a recoverable error signal that would cancel the reset arrives late.

In one embodiment of the invention, processor 110 can operate in either a high-reliability (e.g., FRC) mode or a high-performance (e.g., multi-core) mode. The operating mode may be selected when a computing system that includes processor 110 is initialized or reset. In FRC mode, execution cores 120(a) and 120(b) appear to the operating system as a single logical processor, and the results they generate are compared at FRC checker 130. If the results match, the machine state corresponding to the code sequence is updated.

In FRC mode, one of execution cores 120 is designated the master. The master execution core is the core responsible for updating the resources shared by execution cores 120. The other execution core 120 is designated the slave. The slave execution core is responsible for generating results from the same code sequence, which are checked against the results of the master execution core.
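
The race between the recoverable-error flag and the FRC mismatch, including the optional delay before reset mentioned earlier in this section, can be modeled as a simple timing comparison. This is a sketch under stated assumptions: times are abstract cycle counts, `resolve_error` and `reset_delay` are hypothetical names, and treating a tie as a win for recovery is an assumption made for the illustration rather than something the text specifies.

```python
def resolve_error(mismatch_time, error_flag_time, reset_delay=0):
    """Decide how an error event is handled.

    mismatch_time:   cycle at which mismatched data reaches the FRC checker,
                     or None if the cores never disagree.
    error_flag_time: cycle at which a parity/ECC flag reaches the error
                     detector, or None if no flag is raised.
    reset_delay:     cycles the FRC error is held before reset is triggered
                     (assumed mitigation window).

    Returns "none", "recover", or "reset".
    """
    if error_flag_time is None:
        # No recoverable error was signaled: a mismatch is a true FRC error.
        return "reset" if mismatch_time is not None else "none"
    if mismatch_time is None:
        # The flag arrived and no mismatch was ever seen: plain recovery.
        return "recover"
    # Assumption: the flag wins if it arrives no later than the end of the
    # window in which the FRC error is still being held.
    if error_flag_time <= mismatch_time + reset_delay:
        return "recover"
    return "reset"


outcomes = [
    resolve_error(10, 8),                  # flag first
    resolve_error(10, 12),                 # flag late, no delay window
    resolve_error(10, 12, reset_delay=4),  # flag late, but inside the window
    resolve_error(10, None),               # mismatch with no flag at all
    resolve_error(None, None),             # nothing happened
]
```

Widening `reset_delay` converts more late-flag cases from a reset into a recovery, at the cost of reacting more slowly to errors that really are unrecoverable.
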
Because an error can occur in either the master or the slave execution core, embodiments of the invention allow the master/slave designation to be changed dynamically. As explained below, this allows the slave execution core to take over the master designation in order to perform recovery if a recoverable error is detected in the execution core currently designated as master.

In multi-core mode, execution cores 120(a) and 120(b) may appear to the operating system as two different logical processors on a single processor die. In this mode, execution cores 120(a) and 120(b) process different code sequences, and each updates the machine state associated with the code sequence it processes. A portion of the machine state of a logical processor may be stored in a cache and/or register associated with the corresponding execution core. At some point on the processor die, the results from execution cores 120(a) and 120(b) are routed to shared resources 170 for storage (cache) or for transmission off the processor die (bus). In this embodiment, additional logic is provided to let execution cores 120(a) and 120(b) share resources 170. In general, multi-core mode allows the execution cores of the processor to be controlled separately.

FIG. 2 is a block diagram representing an embodiment of processor 110 that can operate in multiple modes, such as FRC mode and multi-core mode. In this embodiment, an arbitration unit 180 is provided to manage transactions from execution cores 120(a) and 120(b) to shared resources 170 when processor 110 operates in multi-core mode. Arbitration unit 180 is associated with FRC checker 130, which places the arbitration point for multi-core mode operation close to the FRC boundary for FRC mode operation.

In FRC mode, signals from execution cores 120 are processed by FRC checker 130, which compares them to detect soft errors in the execution cores. Placing FRC checker 130 and arbitration unit 180 close to each other extends the FRC boundary to cover most, if not all, of the logic in which signals from the two execution cores remain distinct. It also reduces the wiring needed to support processor 110 in both FRC and multi-core modes.

Extending the FRC boundary in this way naturally increases the time required for signals to reach FRC checker 130. This increase in "flight time" gives parity or ECC errors more time to reach error detector 140, which improves the chance of error recovery. As noted above, the recovery routine triggered by error detector 140 provides higher system availability than the reset routine triggered by FRC checker 130. Extending the FRC boundary thus increases both the amount of logic that is replicated for execution cores 120 and the flight time during which detectable errors can be identified. The former increases FRC protection, albeit through a reset mechanism. The latter increases the likelihood that errors that can be identified through parity, ECC, or similar core-specific properties are handled by the recovery routine rather than the reset routine.

FIG. 3A is a block diagram representing one embodiment of a computing system 300 in accordance with the present invention. System 300 includes a processor 310, a chipset 370, main memory 380, non-volatile memory 390, and peripheral devices 398. In system 300, processor 310 may operate in FRC mode or multi-core mode. The mode may be selected, for example, when computing system 300 is initialized or reset.

Chipset 370 manages communication among processor 310, main memory 380, non-volatile memory 390, and peripheral devices 398.

Processor 310 includes first and second execution cores 320(a) and 320(b) (generically, execution cores 320). Each execution core includes execution resources 324 and a bus cluster 328. Execution resources 324 may include one or more integer, floating-point, load/store, and branch execution units, as well as register files and caches to supply them with data (e.g., instructions, operands, addresses). Bus cluster 328 represents the logic that manages transactions to a cache 340 shared by execution cores 320(a) and 320(b), as well as transactions to a front-side bus (FSB) 360, which is provided for transactions that miss in shared cache 340. Resources corresponding to the error pipelines of FIGS. 1 and 2 may be associated with execution resources 324 and/or bus cluster 328.

Interface units (IFU) 330(a), 330(b) (generically, IFU 330) represent the boundary between execution cores 320 and the shared resources, cache 340 and FSB 360. In this embodiment, IFU 330 includes an FRC unit 332 and an arbitration unit 334. As noted above, FRC unit 332 and arbitration unit 334 receive signals from execution cores 320, and locating them close to each other saves wiring on the processor die. Also shown in FIG. 3A are error units 336(a) and 336(b), which include components that monitor for detectable errors in execution cores 320(a) and 320(b).

In FRC mode, FRC unit 332 compares signals from execution cores 320 for transactions to shared resources such as cache 340 and FSB 360. FRC unit 332 thus forms part of the FRC boundary of processor 310.

In multi-core mode, arbitration unit 334 monitors signals from execution cores 320 and grants access to its associated shared resource according to an arbitration algorithm. The algorithm used by arbitration unit 334 may be round-robin, priority-based, or a similar arbitration scheme. In both FRC and multi-core modes, error units 336 may monitor signals from execution cores 320 for recoverable errors.

Portions of recovery module 150 and reset module 160 (FIG. 2) may be located on processor 310 or elsewhere in system 300. In one embodiment, a recovery routine 392 and a reset routine 394 may be stored in non-volatile memory 390, and images of these routines may be loaded into main memory 380 for execution. In this embodiment, recovery module 150 and reset module 160 may include pointers to recovery routine 392 and reset routine 394 (or their images in main memory 380), respectively.

In this embodiment, system 300 also includes interrupt controllers 370(a) and 370(b) (generically, interrupt controllers 370) to handle interrupts for cores 320(a) and 320(b), respectively. Each interrupt controller 370 has first and second components 374 and 378 to accommodate the different clock domains in which interrupt controller 370 may operate. For example, FSB 360 typically operates at a different frequency than processor 310. Consequently, components of processor 310 that interact directly with FSB 360 typically operate in its clock domain, which is indicated as region 364 on processor 310.

In this embodiment, interrupt controller 370 also includes an FRC-boundary-like component in the form of XOR 372. XOR 372 signals an FRC error if it detects a mismatch between the outgoing signals from components 374(a) and 374(b) of execution cores 320(a) and 320(b).
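
The round-robin arbitration option mentioned above for the arbitration unit can be sketched as follows. This is an illustrative model only: `RoundRobinArbiter` and its `grant` method are hypothetical names, and the patent does not specify how the arbitration unit is implemented.

```python
class RoundRobinArbiter:
    """Grant a shared resource to one requester per call, rotating priority
    so that the most recently granted requester is considered last."""

    def __init__(self, num_requesters: int):
        self._n = num_requesters
        # Start as if the last requester was just served, so that
        # requester 0 has the highest priority on the first call.
        self._last = num_requesters - 1

    def grant(self, requests):
        """requests[i] is True if requester i wants the resource.
        Returns the index granted, or None if nobody is requesting."""
        for offset in range(1, self._n + 1):
            candidate = (self._last + offset) % self._n
            if requests[candidate]:
                self._last = candidate
                return candidate
        return None


arbiter = RoundRobinArbiter(2)
both_requesting = [arbiter.grant([True, True]) for _ in range(4)]
only_core_b = arbiter.grant([False, True])
idle = arbiter.grant([False, False])
```

With two cores contending continuously, grants alternate between them, so neither core can starve the other of the shared cache or bus.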
However, errors attributable to interrupt controller 370 may still arise from soft errors in components 378(a) and 378(b) in FSB clock domain 364. These errors may be detected through the discrepancies they introduce into the subsequent operation of execution cores 320(a) and 320(b).

In this embodiment of system 300, a common snoop block 362 handles snoop transactions to and from execution cores 320(a) and 320(b). XOR 366 provides FRC checking of the snoop responses from execution cores 320(a) and 320(b) and signals an error if a mismatch is detected. XOR 372 and XOR 366 may be disabled if processor 310 is operating in multi-core mode.

FIG. 3B is a block diagram representing an apparatus 344 for broadcasting the status of recoverable errors to components of computing system 300. For example, error units 336(a) and 336(b) may represent ECC or parity error detection logic for various arrays (e.g., registers, caches, buffers) of execution cores 320(a) and 320(b), respectively, and/or exception logic for handling these errors. An OR gate 338 monitors the error signals from execution cores 320 and, when an error signal is asserted, issues a signal that disables FRC unit 332. The error signal may be a high-level interrupt, such as the machine check abort (MCA) defined for Itanium processors. The output of OR gate 338 is also fed back to execution cores 320 to indicate to the error-free execution core that a recovery mechanism needs to start. A second OR gate 339 is provided to route error signals from the shared resources to execution cores 320.

If the error signal does not disable FRC unit 332, the corrupted data triggers an FRC error, and a recoverable error is handled as if it were a non-recoverable error, i.e., an FRC error. That is, the system undergoes a reset operation rather than a recovery operation.
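
The gating just described, in which an asserted error signal from either core disables the FRC unit and starts recovery, reduces to a few lines of Boolean logic. The sketch below is a simplified model: `FrcDecision` and `evaluate` are hypothetical names, timing is ignored entirely, and the second OR gate for shared-resource errors is left out.

```python
from dataclasses import dataclass


@dataclass
class FrcDecision:
    frc_enabled: bool     # is the FRC unit still allowed to compare?
    start_recovery: bool  # signal fed back to the cores to begin recovery
    frc_error: bool       # unrecoverable mismatch reported by the FRC unit


def evaluate(error_a: bool, error_b: bool, outputs_match: bool) -> FrcDecision:
    """Combine the per-core error signals with the FRC comparison result."""
    # Role of the OR gate: any core-level error signal is enough.
    any_error = error_a or error_b
    # An asserted error signal disables FRC checking.
    frc_enabled = not any_error
    # A mismatch only counts as an FRC error while checking is enabled.
    frc_error = frc_enabled and not outputs_match
    return FrcDecision(frc_enabled, any_error, frc_error)


no_fault = evaluate(False, False, True)
recoverable = evaluate(True, False, False)
unrecoverable = evaluate(False, False, False)
```

In the `recoverable` case the outputs differ, but the mismatch is suppressed because the error signal has already disabled the comparison. If the signal arrived too late, the same event would look like the `unrecoverable` case.
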
Depending on the particular operation of the system, there may be a number of cases in which the race to reach FRC unit 332, between the error signal and the mismatched data signals from the execution cores (generated by the recoverable error), is close. For this reason, apparatus 344 may include a mechanism that accelerates transmission of the error signal, at least in FRC mode.

In one embodiment, apparatus 344 supports a high-level interrupt, such as an MCA, that can operate in both FRC and high-performance modes. In high-performance mode, the error signal is subject to pipeline stalls, for example in the front end of the execution core or in the L2 cache. This ensures that no unnecessary MCAs are taken, since the event that triggered the stall may render the error signal moot. In FRC mode, the error signal bypasses these stalls. Bypassing stalls in FRC mode may produce some unnecessary error signals, but it also reduces the likelihood that an FRC error is triggered before the non-FRC error signal disables FRC unit 332. As discussed with reference to FIG. 7, embodiments of processor 110 also include a hardware mechanism to mitigate the race between the error signal and the core signals that reflect the corrupted data.

FIG. 4 is a block diagram representing a data path for one embodiment of computing system 300, including FRC components that support processor 310 in FRC mode. In this embodiment, cache 340, FSB 360, and execution cores 320 are all coupled through a series of buffers. For example, a write-out buffer (WOB) 410 stages data evicted from cache 340 to main memory 380, and a snoop data buffer (SDB) 420 provides snoop data from execution cores 320 or cache 340 to FSB 360 in response to snoop hits in these structures (in addition to shared cache 340, each execution core 320 may have one or more levels of cache).

A pair of write line buffers (WLB) 430(a), 430(b) stage data from execution cores 320(a), 320(b), respectively, to cache 340 or FSB 360, and a pair of read line buffers (RLB) 440(a), 440(b) stage data from FSB 360 to cache 340 or execution cores 320. Combined buffers (CB) 450(a), 450(b) collect data to be written to memory 380 and forward it to FSB 360 periodically. For example, multiple writes to the same line of memory may be collected in CB 450 before a write transaction on FSB 360 is triggered.

In this embodiment, the logic associated with these buffers provides FRC checking and data routing functions, depending on the mode in which processor 310 is operating. For example, logic block 454 represents XOR and MUX functions for the data in CB 450(a) and 450(b). If processor 310 is operating in FRC mode, the XOR function provides an FRC check. If processor 310 is operating in multi-core mode, the MUX provides a data routing function. Logic blocks 434 and 444 provide similar functions for the data in WLB 430(a), 430(b) and RLB 440(a), 440(b), respectively. MUXes 460, 470 and 480 route data from the various sources to cache 340, FSB 360 and execution cores 320.

As mentioned above, recovery mechanisms for errors detected within the FRC boundary can be implemented with different combinations of hardware, software and firmware modules. One embodiment of the recovery mechanism uses code that is closely associated with the processor. For example, the Intel® Itanium® family of processors uses a layer of firmware called the processor abstraction layer (PAL), which provides an abstraction of the processor to the rest of the computing system. Implementing recovery in PAL hides the recovery process from system-level code, such as the system abstraction layer (SAL, e.g. BIOS) and the operating system.
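The dual role of a logic block such as 454 can be illustrated with a short Python sketch. The function name `logic_block`, the mode strings and the `select_b` parameter are hypothetical. Forwarding core A's copy in FRC mode is also an assumption, modelled on the master/slave arrangement in which only one core's signals pass the FRC boundary.

```python
def logic_block(mode: str, data_a: int, data_b: int, select_b: bool = False):
    """Model a combined XOR/MUX block; returns (data_out, frc_error)."""
    if mode == "frc":
        # XOR of the two copies is nonzero exactly when they differ,
        # so a nonzero result is reported as an FRC mismatch.
        frc_error = (data_a ^ data_b) != 0
        # Assumption: core A's copy is the one forwarded in FRC mode.
        return data_a, frc_error
    if mode == "multicore":
        # MUX: route the selected core's data; no FRC check is performed.
        return (data_b if select_b else data_a), False
    raise ValueError(f"unknown mode: {mode!r}")
```

The same hardware path therefore serves both modes: the comparison result is only meaningful in FRC mode, and the select input is only meaningful in multi-core mode.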
A PAL-based implementation of the recovery mechanism should be fast enough to avoid triggering time-outs in code executed by the operating system. Recovery mechanisms can also be implemented in system-level code, such as SAL/BIOS, or in operating system code. These latter implementations are not subject to the same time constraints as PAL-based implementations. Unless stated otherwise, the recovery mechanisms described below can be implemented with code associated with any of the above sources.

FIG. 5 is a flowchart representing a mechanism for recovering from an error that is detected in an execution core before it triggers an FRC reset. In response to a parity, ECC or other error detected in an execution core, a signal is broadcast 510 to indicate the start of a recovery routine. Provided the error is detected before it triggers an FRC reset, the erroneous data is confined to one execution core, so the machine state of the other execution core can be used for recovery. Accordingly, the machine state of the good core is saved 520. To prepare the processor for recovery, both execution cores are initialized 530 to a specified state, and the saved machine state is restored 540 to the initialized cores. FRC mode is then re-established 550 and the processor returns 560 to the interrupted code.

In one embodiment of the invention, when processor 110 is operating in FRC mode, one of the execution cores 120 can be designated as the master core and the other as the slave core. In this embodiment, the signals generated by the master and slave cores are compared at the FRC boundary to determine whether a reset is required. If there is no FRC reset, the signal generated by the master core is forwarded to shared resources 170, and the signal generated by the slave core is discarded.
In this embodiment, a bit in a status register of each execution core 120 can be used to indicate whether it is the master or the slave execution core. This bit can be set when the system is initialized or reset. As detailed below, the master/slave status of an execution core can also be changed dynamically to allow recovery from errors in the core. For errors detected within the FRC boundary, such as recoverable errors, the actions of the master and slave cores differ, depending on the core in which the error originated.

FIG. 6 is a flowchart representing a mechanism for recovering from an error detected in the execution core designated as the slave execution core. The operations of the slave execution core are shown on the left, and the operations of the master execution core are shown on the right.

Routine 600 is initiated if the slave execution core detects 610 an error (parity, ECC, etc.). In a routine 600 implemented by PAL or comparable processor-level code, the broadcast of the interrupt signal can be confined to components within the processor chip, such as the master execution core. In addition to signaling the error, the slave execution core disables 630 the FRC unit and suspends its activity. Disabling the FRC unit prevents the error from triggering an FRC reset when it reaches the FRC boundary, and suspending activity in the slave execution core prevents it from interfering with recovery. In response to the interrupt 624, the master execution core determines 640 whether its state data contains any errors. For example, each execution core includes a status bit that is set when an error is detected. Except in the very rare case in which soft errors occur almost simultaneously in both execution cores, the master core will be clean. If the master core is not clean 640, no uncorrupted processor state is available for recovery. In this case, the master core issues 642 a reset signal to the slave execution core, and the computing system executes 644 a full, e.g. FRC-level, reset.

If the master core's state data is uncorrupted, the master core saves 660 its machine state and flushes 664 the queues and buffers in its pipeline. For example, the master core may save the contents of its data and control registers and of its low-level caches to a protected memory region. The master core also sends 668 a limited reset signal to the slave core and sets 676 its own resources to a specified state, e.g. by initializing its pipeline. The slave core detects 670 the limited reset and initializes 674 its pipeline, which synchronizes the states of the two cores. Once the two cores are synchronized in this way, FRC mode is re-enabled 680. This can be accomplished by having each core execute a handler routine that sets the appropriate bit in its status/control register. The saved state is restored 684 to both execution cores, and control is returned 690 to the interrupted code sequence.
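The flow of routine 600 can be summarized in a small Python model. This is only a sketch under stated assumptions: the `Core` and `Processor` classes, the dictionary used as "machine state", the `protected_memory` field and the returned strings are hypothetical stand-ins for hardware and firmware behaviour. The comments map each step to the reference numerals used in the description.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    state: dict = field(default_factory=dict)  # toy stand-in for machine state
    error: bool = False                        # error status bit
    halted: bool = False


@dataclass
class Processor:
    master: Core
    slave: Core
    frc_enabled: bool = True
    protected_memory: dict = field(default_factory=dict)


def routine_600(p: Processor) -> str:
    """Recover from an error detected in the slave core; return the outcome."""
    # 610-630: the slave has flagged an error, disables the FRC unit and halts.
    p.frc_enabled = False
    p.slave.halted = True

    # 640: the master checks its own error status bit.
    if p.master.error:
        # 642/644: no uncorrupted state is left, so fall back to a full reset.
        return "full_reset"

    # 660: save the master's machine state to protected memory.
    p.protected_memory["saved_state"] = dict(p.master.state)

    # 664-676: flush/initialize both cores so they start from the same state.
    for core in (p.master, p.slave):
        core.state = {}
        core.error = False
        core.halted = False

    # 680: re-enter FRC mode now that the cores are synchronized.
    p.frc_enabled = True

    # 684: restore the saved state into both cores.
    for core in (p.master, p.slave):
        core.state = dict(p.protected_memory["saved_state"])

    # 690: control returns to the interrupted code.
    return "resumed"
```

The key design point the model preserves is ordering: FRC checking stays off from the moment the error is signaled until both cores have been brought to the same initialized state.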

Method 600 represents an error recovery mechanism for the case in which the error is detected in the execution core currently designated as the slave. In one embodiment, the slave execution core is the execution core that does not "control" shared resources. For example, in FRC mode, the signals from the slave execution core are discarded after they are compared at the FRC boundary with the signals from the master execution core. If no FRC error is detected, the signals from the master execution core are used to control the shared resources outside the FRC boundary. If the error originates in the master execution core rather than the slave execution core, recovery can be handled by changing the master/slave designations of the two execution cores. For example, the master/slave designation can be indicated by the state of a bit associated with each execution core in a status register. The execution core whose status bit indicates master status controls the shared resources, which are used to carry out the state-saving operations of recovery routine 600, such as operation 660.
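Changing the master/slave designation before recovery can be sketched as follows. The `Core` class, its `is_master` and `halted` fields and the function name are hypothetical. The sketch assumes a single status bit per core and that the core in which the error originated always ends up as the halted slave.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    is_master: bool       # models the master/slave status bit
    halted: bool = False


def assign_roles_for_recovery(faulty: Core, other: Core) -> Core:
    """Make the faulty core the slave and return the core that drives recovery."""
    if faulty.is_master:
        # The error originated in the master: swap the designations so the
        # error-free core controls the shared resources during recovery.
        other.is_master = True
        faulty.is_master = False
    # In either case the faulty core suspends its activity.
    faulty.halted = True
    return other
```

After the swap, recovery can proceed exactly as in the slave-error case, because the error-free core is now the one that controls the shared resources used to save state.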
In one embodiment of the recovery routine, the execution core in which the error originated checks its master/slave status bit. If the status bit indicates that it is the slave execution core, method 600 is executed as described above. If the status bit indicates that it is the master execution core, it notifies the slave execution core to change its status to master, changes its own status to slave, and suspends its activity.

FIG. 7 is a block diagram showing an embodiment of an FRC checker 730 that can mitigate the race between recoverable error handling and unrecoverable error handling. FRC checker 730 includes a comparison unit 734, a queue 736 and a timer unit 738. Queue 736 receives the data from execution core (a), and comparison unit 734 compares the data from core A and core B and sets a status flag to indicate whether the comparison yields a match. If the data match, the status flag is set to indicate a match. If the data do not match, the flag is set to indicate a mismatch and timer unit 738 is triggered to start a countdown. If error detector 140 receives an error flag before the countdown ends, it disables FRC checker 730 and activates recovery module 150 to carry out the recovery routine.
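The timer-based arbitration between recovery and reset can be modelled in a few lines of Python. This is a hedged sketch: the `FrcChecker` class, the `grace_cycles` parameter and the returned strings are hypothetical, and queue 736 is not modelled. The sketch also assumes that a reset is taken when the countdown expires without an error flag, which matches the delay interval described in the recovery method claims.

```python
class FrcChecker:
    """Toy model of an FRC checker with a mismatch countdown timer."""

    def __init__(self, grace_cycles: int):
        self.grace_cycles = grace_cycles  # countdown length (assumed parameter)
        self.countdown = None             # None means no mismatch is pending
        self.enabled = True

    def step(self, data_a: int, data_b: int, error_flag: bool) -> str:
        """Advance one cycle and return the checker's decision."""
        if not self.enabled:
            return "disabled"
        if error_flag:
            # A recoverable error was reported in time: disable the
            # checker and hand over to the recovery routine.
            self.enabled = False
            self.countdown = None
            return "recover"
        if self.countdown is None:
            if data_a == data_b:
                return "match"
            # Mismatch: start the countdown instead of resetting at once.
            self.countdown = self.grace_cycles
            return "pending"
        # A mismatch is pending and no error flag has arrived yet.
        self.countdown -= 1
        if self.countdown > 0:
            return "pending"
        return "reset"
```

For example, with `grace_cycles=2` a mismatch followed by an error flag on the next cycle yields "recover", while a mismatch followed by two cycles without an error flag yields "reset".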

Disclosed above is a mechanism for handling recoverable and unrecoverable errors in a multi-core processor. The multiple cores can operate in an FRC mode, in which one or more checker units compare signals from the multiple cores to detect unrecoverable errors. In addition, each core includes an error unit to detect recoverable errors.
If a recoverable error is detected, the checker units are disabled and a recovery routine is carried out. A multi-mode embodiment of the multi-core processor may include an arbitration unit located near the checker to control access to shared resources. Placing the FRC boundary close to the shared resources increases the amount of logic protected by the FRC boundary and reduces the wiring required for multi-core mode operation. Embodiments of the present invention can detect all errors that go undetected in systems without FRC capability, and they support recovery from all detectable errors, including those that are traditionally handled by reset in other FRC-capable processors. The embodiments disclosed herein are intended to illustrate various features of the invention. Persons skilled in processor design, having the benefit of this disclosure, will recognize that variations and modifications of the disclosed embodiments fall within the spirit and scope of the appended claims.

[Brief description of the drawings]

The present invention can be understood with reference to the drawings, in which like elements are labeled with like numbers. These figures are provided to illustrate selected embodiments of the invention and are not intended to limit its scope.

FIG. 1 is a block diagram of a processor that includes dual execution cores and FRC detection and handling logic.
FIG. 2 is a block diagram of an embodiment of the processor of FIG. 1 that can operate in multiple modes.
FIG. 3A is a block diagram of an embodiment of a computing system that implements the multi-mode processor of FIG. 2.
FIG. 3B is a block diagram of a mechanism for signaling recoverable errors in the computing system of FIG. 3A.
FIG. 4 is a block diagram representing the data paths of the computing system of FIG. 3A.
FIG. 5 is a flowchart representing an embodiment of a mechanism for recovering from soft errors in an execution core.
FIG. 6 is a flowchart representing an embodiment of a mechanism for recovering from a soft error in a processor with multiple execution cores.
FIG. 7 is a block diagram representing an embodiment of an FRC checker that can mitigate the race between the recoverable error mechanism and the unrecoverable error mechanism.

Main component reference list

110 Processor
120 Execution core
120(a) First execution core
120(b) Second execution core
130 FRC checker
140 Error detector
150 Recovery module
160 Reset module
170 Shared resources
104 Dashed line (FRC boundary)
124 Data pipeline
128 Error pipeline
180 Arbitration unit
300 Comparison unit
310 Processor
370 Chipset
380 Main memory
390 Non-volatile memory
398 Peripheral devices
320 Execution core
324 Execution resources
320(a) First execution core
320(b) Second execution core
328 Bus cluster
340 Cache
360 Front-side bus
330(a) Interface unit
330(b) Interface unit
330 Interface unit
332 FRC unit
334 Arbitration unit
336(a) Error unit
336(b) Error unit
392 Recovery routine
394 Reset routine
370(a) Interrupt controller
370(b) Interrupt controller
370 Interrupt controller
374 First component
378 Second component
364 Region (FSB clock domain)
374(a) Component
374(b) Component
378(a) Component
378(b) Component
362 Snoop block
344 Device
338 OR gate
339 OR gate
410 Write-out buffer (WOB)
420 Snoop data buffer (SDB)
430(a) Write line buffer (WLB)
430(b) Write line buffer (WLB)
440(a) Read line buffer
440(b) Read line buffer
450(a) Combined buffer
450(b) Combined buffer
454 Logic block
434 Logic block
444 Logic block
600 Recovery routine
660 Operation
730 FRC checker
734 Comparison unit
736 Queue
738 Timer unit

Claims

1. A processor comprising: first and second execution cores to operate in an FRC mode; a resource to process transactions from at least one of the first and second execution cores; and an interface control unit to regulate access to the resource by the first and second execution cores, the interface control unit including an FRC check unit to compare transaction signals from the first and second execution cores and to signal an error if the comparison indicates a mismatch.

2. The processor of claim 1, further comprising an error detector to detect errors in the first and second execution cores and to disable the FRC checker in response to detecting an error.

3. The processor of claim 2, wherein the error detector includes first and second error detectors to detect errors in the first and second execution cores, respectively.

4. The processor of claim 3, wherein the first error detector, in response to an error in the first execution core, triggers an error signal that disables the FRC check unit and causes the second execution core to initiate a recovery process.

5. The processor of claim 4, wherein the second execution core is designated as an FRC slave and is redesignated as an FRC master in response to the error signal.

6. The processor of claim 5, wherein the second execution core saves its machine state data in a memory location and executes a sequence.

7. The processor of claim 2, wherein the first and second execution cores can also operate in a multi-core mode, and the interface control unit further includes an arbitration unit to regulate access to the shared resource by the two execution cores when operating in the multi-core mode.

8. The processor of claim 7, wherein the shared resource includes a cache that processes transactions from both the first and second cores in the multi-core mode and processes transactions from only one of the first and second cores in the FRC mode.

9. The processor of claim 7, wherein, when an error is detected, the error detector triggers an interrupt if the processor is in the multi-core mode and triggers an accelerated interrupt if the processor is in the FRC mode.

10. The processor of claim 9, wherein the accelerated interrupt bypasses a portion of the execution core that is traversed by the interrupt in the multi-core mode.

11. A computing system comprising: a first memory location to store a recovery routine; a second memory location to store a reset routine; first and second execution cores capable of operating in an FRC mode; an error unit to start the recovery routine in response to detecting an error in one of the first and second execution cores; and an FRC checker to start the reset routine in response to detecting a mismatch between signals from the first and second execution cores.

12. The computing system of claim 11, wherein the error unit disables the FRC checker in response to detecting the error in one of the first and second execution cores.

13. The computing system of claim 12, wherein the reset routine includes instructions executable by the first and second execution cores to initialize the first and second execution cores in a multi-core mode or in the FRC mode.

14. The computing system of claim 13, further comprising a cache that is shared by the first and second execution cores if the first and second execution cores are in the multi-core mode.

15. The computing system of claim 14, further comprising an arbitration unit to manage access to the cache by the first and second execution cores in the multi-core mode.

16. The computing system of claim 15, wherein, in the FRC mode, the FRC checker monitors transaction signals sent from the first and second execution cores to the arbitration unit and starts the reset routine in response to a mismatch in the transaction signals.

17. The computing system of claim 11, wherein the first and second execution cores operate in the FRC mode as a master execution core and a slave execution core, respectively.

18. The computing system of claim 17, wherein the first execution core is disabled and the second execution core operates as the master execution core in response to an error in the first execution core.

19. The computing system of claim 11, wherein the first and second execution cores can be initialized in a multi-core mode or in the FRC mode.

20. The computing system of claim 19, wherein the error unit triggers an interrupt to the first and second execution cores in response to an error in an execution core.

21. The computing system of claim 20, wherein the interrupt is an accelerated interrupt if the execution cores are in the FRC mode.

22. The computing system of claim 21, wherein the accelerated interrupt bypasses a portion of the execution core.

23. A recovery method comprising: operating first and second execution cores in an FRC mode; monitoring data in the first and second execution cores for errors; comparing signals generated by the first and second execution cores; executing a recovery routine in response to an error in the first or second execution core; and executing a reset routine in response to a mismatch between the signals generated by the first and second execution cores.

24. The recovery method of claim 23, further comprising suspending the signal comparison in response to an error in the first or second execution core.

25. The recovery method of claim 24, wherein executing a reset routine in response to the mismatch further comprises: starting a delay interval in response to the mismatch; and executing the reset routine if no error is detected in an execution core before the interval expires.

26. The recovery method of claim 25, further comprising executing the recovery routine if an error is detected in an execution core before the interval expires.

27. The recovery method of claim 23, wherein operating the first and second execution cores in the FRC mode comprises operating the first and second execution cores in the FRC mode or in a multi-core mode in response to a reset signal.

28. The recovery method of claim 27, wherein executing the recovery routine comprises: executing the recovery routine in response to an error signal delivered by an interrupt if the execution cores are operating in the multi-core mode; and executing the recovery routine in response to an error signal delivered by an accelerated interrupt if the execution cores are operating in the FRC mode.

29. The recovery method of claim 23, wherein operating the first and second execution cores in the FRC mode further comprises designating the first and second execution cores as a master execution core and a slave execution core, respectively.

30. The recovery method of claim 23, wherein executing the recovery routine further comprises deactivating the first execution core and designating the second execution core as the master execution core in response to an error in the first execution core.