
TWI236620B - On-die mechanism for high-reliability processor - Google Patents


Info

Publication number
TWI236620B
Authority
TW
Taiwan
Prior art keywords
execution
core
error
frc
mode
Prior art date
Application number
TW092132000A
Other languages
Chinese (zh)
Other versions
TW200416595A (en)
Inventor
Hang Nguyen
Steven Tu
Alexander Honcharik
Sujat Jamil
Original Assignee
Intel Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Intel Corp
Publication of TW200416595A
Application granted
Publication of TWI236620B

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/1641 Error detection by comparing the output of redundant processing systems where the comparison is not performed by the redundant processing components
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/1629 Error detection by comparing the output of redundant processing systems
    • G06F11/1654 Error detection by comparing the output of redundant processing systems where the output of only one of the redundant processing components can drive the attached hardware, e.g. memory or I/O
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2201/00 Indexing scheme relating to error detection, to error correction, and to monitoring
    • G06F2201/845 Systems in which the redundancy can be transformed in increased performance

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Hardware Redundancy (AREA)
  • Perforating, Stamping-Out Or Severing By Means Other Than Cutting (AREA)

Abstract

A processor includes first and second execution cores that operate in a redundant (FRC) mode, an FRC check unit to compare results from the first and second execution cores, and an error check unit to detect recoverable errors in the first and second cores. The error detector disables the FRC checker, responsive to detection of a recoverable error. A multi-mode embodiment of the processor implements a multi-core mode in addition to the FRC mode. An arbitration unit regulates access to resources shared by the first and second execution cores in multi-core mode. The FRC checker is located proximate to the arbitration unit in the multi-mode embodiment.

Description

[Technical Field]
The present invention relates to microprocessors and, more specifically, to mechanisms for handling errors in FRC-enabled processors.
[Prior Art]
Servers and other high-end computing and communication systems are designed to provide high levels of reliability and availability. Soft errors pose a major challenge to both properties. Soft errors result from collisions between high-energy particles, such as alpha particles, and charge storage nodes. They are common in storage arrays, such as caches, TLBs, and the like, which contain large numbers of charge storage nodes. They also occur in random state elements and logic. The rate at which soft errors occur (the soft error rate, or SER) rises as device geometries shrink and device densities increase.

High-reliability systems include safeguards that detect and manage soft errors before they lead to silent (that is, undetected) data corruption (SDC). However, to the extent that the error detection and handling mechanisms supporting high-reliability operation take a system away from its normal operation, the system's availability is reduced. For example, one such mechanism resets the system to its last known valid state if an error is detected. The system cannot carry out its assigned tasks while it is engaged in the reset operation.

One well-known mechanism for detecting soft errors is functional redundancy checking (FRC). A single FRC-enabled processor may include replicated instruction execution cores on which the same instruction code is run. Depending on the particular embodiment, each replicated execution core may include one or more caches, register files, and supporting resources in addition to the basic execution units (integer, floating point, load/store, and so on). FRC hardware compares the results generated by each core, and if it detects a discrepancy, the FRC system passes control to an error-handling routine.
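As an illustration only, the lock-step comparison described above can be sketched in software. This is a minimal model, not the patented hardware: the names `run_lockstep`, `flip_bit`, and `FRCError`, and the fault-injection parameter, are hypothetical and exist only for this sketch.

```python
import operator


class FRCError(Exception):
    """Raised when the two cores disagree at the FRC boundary."""


def flip_bit(value: int, bit: int) -> int:
    """Model a soft error as a single bit flip (illustrative assumption)."""
    return value ^ (1 << bit)


def run_lockstep(op, a: int, b: int, flip_in_core_b=None) -> int:
    """Run `op` on two replicated cores and compare the results."""
    result_a = op(a, b)              # core A
    result_b = op(a, b)              # core B runs the same instruction
    if flip_in_core_b is not None:   # hypothetical fault injection into core B
        result_b = flip_bit(result_b, flip_in_core_b)
    if result_a != result_b:
        # The check shows only that the cores disagree, not which one is wrong.
        raise FRCError(f"core A produced {result_a}, core B produced {result_b}")
    return result_a
```

Calling `run_lockstep(operator.add, 2, 3)` returns the agreed result, while passing `flip_in_core_b=0` raises `FRCError`. The exception carries no information about which core was corrupted, which is why a mismatch of this kind can only be answered with a reset.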
The point at which results from the different execution cores are compared represents the FRC boundary of the system. Errors that are not detected at the FRC boundary can lead to SDC.

Because an FRC error indicates only that the execution cores disagree on a result, FRC errors are detectable but not recoverable. As noted above, the FRC error-handling routine typically resets the system to the last known point of reliable data. This reset mechanism is relatively time-consuming. It takes the system away from its normal operation and reduces system availability.

FRC is only one mechanism for handling soft errors, and for random logic and random state elements it is the primary mechanism. Array structures present a different picture. They typically include parity and/or ECC hardware, which detects soft errors by examining properties of the data. In many cases, the system can correct errors caused by data corruption using relatively fast hardware or software mechanisms. In an FRC-enabled processor, however, such errors are likely to surface as FRC errors, because they take the execution cores out of lock-step. Handling these otherwise recoverable errors through a reset mechanism reduces system availability.

[Summary of the Invention]
The present invention relates to a mechanism that efficiently integrates recoverable and non-recoverable error-handling mechanisms in an FRC-enabled processor.

[Embodiment]
The following description sets forth numerous specific details to provide a thorough understanding of the invention. However, those skilled in the art, having the benefit of this disclosure, will appreciate that the invention may be practiced without these specific details. In addition, various well-known methods, procedures, components, and circuits have not been described in detail, in order to focus attention on the main features of the invention.
For example, aspects of the invention are illustrated using a dual-core processor, but those skilled in the art will recognize that the invention can be applied to processors with more than two cores, given appropriate modifications of the reset and recovery mechanisms.

FIG. 1 is a block diagram representing one embodiment of an FRC-enabled processor 110 in accordance with the present invention. Processor 110 includes first and second execution cores 120(a), 120(b) (collectively, execution cores 120), an FRC checker 130, an error detector 140, a recovery module 150, a reset module 160, and shared resources 170. A portion of the FRC boundary is indicated by dashed line 104. For purposes of illustration, recovery module 150 and reset module 160 are shown as part of processor 110. These modules may be implemented in whole or in part as hardware, firmware, or software, and may be located on or off the processor die. Similarly, shared resources 170 may include components on the processor die as well as components on one or more other dies.

Each execution core 120 includes a data pipeline 124 and an error pipeline 128, which feed FRC checker 130 and error detector 140, respectively. Data pipeline 124 represents the logic that operates on various types of data as the data moves through processor 110 toward FRC checker 130. The data processed by data pipeline 124 may include result operands, status flags, addresses, instructions, and the like that are generated and staged through processor 110 during code execution. Error pipeline 128 represents the logic that operates on various types of data to detect errors in the data and provide appropriate signals to error detector 140. For example, a signal may be one or more bits (flags) representing the parity or ECC status of data retrieved from the various storage arrays (not shown) of processor 110. Soft errors in these arrays may appear as parity or ECC error flags when the corrupted data is accessed.
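The way an array access can yield both a data value and an error flag can be sketched with a single parity bit per entry. This is a simplified illustration under stated assumptions: the function names are hypothetical, and real arrays may use ECC rather than the one-bit parity shown here.

```python
def parity(word: int) -> int:
    """Return 1 if `word` has an odd number of set bits, else 0."""
    return bin(word).count("1") % 2


def write_entry(word: int) -> tuple:
    """Store a word together with its parity bit."""
    return (word, parity(word))


def read_entry(entry: tuple) -> tuple:
    """Return (data, error_flag) for a stored entry.

    The data would travel down the data pipeline and the flag down the
    error pipeline; here they are simply returned together.
    """
    word, stored_parity = entry
    return word, parity(word) != stored_parity
```

A single flipped bit in a stored word changes its parity, so `read_entry` raises the flag. An even number of flipped bits leaves the parity unchanged and goes unnoticed by this check, which is one reason arrays may carry the stronger (and slower) ECC protection mentioned in the text.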

If an error reaches error detector 140 from either core 120, recovery module 150 is activated to execute a recovery routine. Recovery can be implemented with relatively low latency in hardware, software, firmware, or a combination of these. For example, the probability that data is corrupted in both execution cores 120 at the same (or nearly the same) time is extremely small.
This leaves an uncorrupted copy of the data from which processor 110 can restore data integrity. However, an FRC error is triggered if corrupted data from one execution core and an uncorrupted version of the data from the other execution core reach FRC checker 130 before recovery module 150 is activated. Because FRC errors are not recoverable, reset module 160 resets the system if FRC checker 130 signals an FRC error before the underlying parity/ECC error is detected.

Not all FRC errors can be traced to an underlying parity/ECC error or other correctable soft error. For those that can, it is faster to handle the underlying soft error through error detector 140 than to handle, through FRC checker 130, the FRC error that results when the corrupted data reaches FRC boundary 104. As noted above, the latency of the reset process is much longer than that of the recovery process, so reset is to be avoided if the error can be corrected by recovery module 150. In addition, a reset usually brings the entire system down, whereas recovery causes only a temporary loss of performance. For this reason, FRC checker 130 is temporarily disabled if error detector 140 detects an error in either error pipeline 128, because execution cores 120 are no longer in lock-step.

Execution cores 120 operate in lock-step in FRC mode, but data pipeline 124 and error pipeline 128 may operate relatively independently. For example, ECC hardware is relatively complex and therefore relatively slow, especially for 2-bit errors. A flag signaling such an error may reach error detector 140 before, after, or at the same time as the associated data reaches FRC checker 130. This flexibility is generally beneficial. For example, it allows data to be used speculatively before its error status is determined. Because soft errors are relatively rare, and error pipeline 128 is generally as fast as data pipeline 124, the flexibility is a net positive. As long as the error flag reaches error detector 140 in time to disable FRC checker 130 before it acts on a mismatch attributable to the corrupted data, the lower-latency recovery routine can be carried out.

As described below, processor 110 may implement strategies to mitigate the race between the recoverable and non-recoverable error mechanisms. For example, a streamlined signaling mechanism may be used in FRC mode to speed up the disabling of FRC checker 130 in the event of a non-FRC error. In addition, an FRC error may be held for an interval before reset, in case a late-arriving recoverable-error signal makes the reset unnecessary.

In one embodiment of the invention, processor 110 can operate in either a high-reliability (e.g., FRC) mode or a high-performance (e.g., multi-core) mode. The operating mode may be selected, for example, when a computing system that includes processor 110 is initialized or reset. In FRC mode, execution cores 120(a) and 120(b) appear to the operating system as a single logical processor, and the results they generate are compared at FRC checker 130. If the results match, the machine state corresponding to the code sequence is updated.

In FRC mode, one of execution cores 120 is designated the master. The master is the execution core responsible for updating the resources shared by execution cores 120. The other execution core 120 is designated the slave. The slave is responsible for generating results from the same code sequence, which are checked against those of the master.
Because an error may occur in either the master or the slave, embodiments of the invention allow the master/slave designation to be changed dynamically. As described below, this allows the slave to take over the master designation in order to carry out recovery if a recoverable error is detected in the execution core currently designated as the master.

In multi-core mode, execution cores 120(a) and 120(b) appear to the operating system as two different logical processors on a single processor die. In this mode, execution cores 120(a) and 120(b) process different code sequences, and each updates the machine state associated with the code sequence it processes. Portions of a logical processor's machine state may be stored in a cache and/or registers associated with the corresponding execution core. At some point on the processor die, results from execution cores 120(a) and 120(b) are directed to shared resources 170 for storage (cache) or transmission off the processor die (bus). In this embodiment, additional logic is provided to mediate access to shared resources 170 by execution cores 120(a) and 120(b). In general, multi-core mode allows the execution cores of the processor to be controlled separately.

FIG. 2 is a block diagram representing an embodiment of processor 110 that can operate in multiple modes, such as FRC mode and multi-core mode. In this embodiment, an arbitration unit 180 is provided to manage transactions to shared resources 170 by execution cores 120(a) and 120(b) when processor 110 operates in multi-core mode. Arbitration unit 180 is associated with FRC checker 130, which places the arbitration point for multi-core operation close to the FRC boundary for FRC operation.
In FRC mode, signals from execution cores 120 may be processed by FRC checker 130, which compares them to detect soft errors in the execution cores. Locating FRC checker 130 and arbitration unit 180 close to each other extends the FRC boundary to encompass most, if not all, of the logic in which signals from the two execution cores remain distinct. It also reduces the wiring needed to support processor 110 in both FRC mode and multi-core mode.

Extending the FRC boundary in this way naturally increases the time signals need to reach FRC checker 130. This increase in "flight time" may give parity or ECC errors more time to reach error detector 140, which improves the chances of error recovery. As noted above, recovery routines triggered by error detector 140 provide greater system availability than reset routines triggered by FRC checker 130. Extending the FRC boundary thus increases both the amount of logic that is replicated across execution cores 120 and the flight time during which detectable errors can be identified. The former increases FRC protection, albeit through a reset mechanism. The latter increases the likelihood that errors identifiable through parity, ECC, or similar core-specific properties are handled by a recovery routine rather than a reset routine.

FIG. 3A is a block diagram representing one embodiment of a computing system 300 in accordance with the present invention. System 300 includes a processor 310, chipset 370, main memory 380, non-volatile memory 390, and peripheral devices 398. In system 300, processor 310 may operate in FRC mode or multi-core mode. The mode may be selected, for example, when computing system 300 is initialized or reset.
Chipset 370 manages communication among processor 310, main memory 380, non-volatile memory 390, and peripheral devices 398.

Processor 310 includes first and second execution cores 320(a) and 320(b) (collectively, execution cores 320). Each execution core includes execution resources 324 and a bus cluster 328. Execution resources 324 may include, for example, one or more integer, floating-point, load/store, and branch execution units, as well as register files and caches that supply them with data (e.g., instructions, operands, addresses). Bus cluster 328 represents logic for managing transactions to a cache 340 that is shared by execution cores 320(a) and 320(b), as well as transactions to a front-side bus (FSB) 360 for those that miss in shared cache 340. Resources corresponding to the error pipelines of FIGS. 1 and 2 may be associated with execution resources 324 and/or bus cluster 328.

Interface units (IFUs) 330(a), 330(b) (collectively, IFU 330) represent the boundary between execution cores 320 and the shared resources, cache 340 and FSB 360. In this embodiment, IFU 330 includes an FRC unit 332 and an arbitration unit 334. As noted above, FRC unit 332 and arbitration unit 334 both receive signals from execution cores 320, and locating them close to each other saves wiring on the processor die. Also shown in FIG. 3A are error units 336(a) and 336(b), which include components that monitor for detectable errors in execution cores 320(a) and 320(b), respectively.

In FRC mode, FRC unit 332 compares signals from execution cores 320 for transactions to shared resources, such as cache 340 and FSB 360. FRC unit 332 thus forms part of the FRC boundary of processor 310.
In multi-core mode, arbitration unit 334 monitors signals from execution cores 320 and grants access to its associated shared resource according to an arbitration algorithm. The arbitration algorithm used by arbitration unit 334 may be, for example, a round-robin scheme, a priority-based scheme, or a similar algorithm. In both FRC mode and multi-core mode, error units 336 may monitor signals from execution cores 320 for recoverable errors.

Portions of recovery module 150 and reset module 160 (FIG. 2) may be located on processor 310 or elsewhere in system 300. In one embodiment, a recovery routine 392 and a reset routine 394 are stored in non-volatile memory 390, and images of these routines may be loaded into main memory 380 for execution. In this embodiment, recovery module 150 and reset module 160 may include pointers to recovery routine 392 and reset routine 394 (or to their images in main memory 380), respectively.

In this embodiment, system 300 also includes interrupt controllers 370(a) and 370(b) (collectively, interrupt controllers 370) that process interrupts for execution cores 320(a) and 320(b), respectively. Each interrupt controller 370 has first and second components 374 and 378, which accommodate the different clock domains in which interrupt controller 370 may operate. For example, FSB 360 typically operates at a different frequency than processor 310. Consequently, components of processor 310 that interact directly with FSB 360 typically operate in the FSB clock domain, indicated as area 364 on processor 310.

In this embodiment, interrupt controller 370 also includes an FRC-boundary-like component in the form of XOR 372. XOR 372 signals an FRC error if it detects a mismatch between the outgoing signals from components 374(a) and 374(b) of execution cores 320(a) and 320(b).
However, errors attributable to interrupt controllers 370 may still arise from soft errors in components 378(a) and 378(b) in FSB clock domain 364. These errors can be detected through the discrepancies they introduce into subsequent operations of execution cores 320(a) and 320(b).

In this embodiment of system 300, a common snoop block 362 processes snoop transactions to and from execution cores 320(a) and 320(b). XOR 366 provides FRC checking of snoop responses from execution cores 320(a) and 320(b) and signals an error if it detects a mismatch. XOR 372 and XOR 366 may be disabled if processor 310 is operating in multi-core mode.

FIG. 3B is a block diagram representing an apparatus 344 for broadcasting the status of recoverable errors to components of computing system 300. For example, error units 336(a) and 336(b) may represent ECC or parity error-detection logic for various arrays (e.g., registers, caches, buffers, and the like) of execution cores 320(a) and 320(b), respectively, and/or exception logic for processing these errors. An OR gate 338 monitors error signals from execution cores 320 and signals FRC unit 332 to disable itself when an error signal is asserted. The error signal may be a high-level interrupt, such as the machine check abort (MCA) defined for Itanium processors. The output of OR gate 338 is also fed back to execution cores 320 to indicate to the error-free execution core that a recovery mechanism needs to begin. A second OR gate 339 is provided to signal errors from the shared resources to execution cores 320.

If the error signal does not disable FRC unit 332 before the corrupted data triggers an FRC error, a recoverable error is treated as a non-recoverable error, that is, as an FRC error. The system then goes through a reset operation rather than a recovery operation.
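The broadcast logic of FIG. 3B, as described above, amounts to an OR of the per-core error signals whose output both disables the FRC comparison and tells the cores to begin recovery. The following is a minimal boolean model of that behavior; the function and key names are hypothetical, and the comparison is reduced to a single inequality test standing in for the XOR-based check.

```python
def broadcast_error(error_a: bool, error_b: bool) -> dict:
    """Model the OR gate that combines the error signals of both cores."""
    any_error = error_a or error_b
    return {
        "frc_enabled": not any_error,   # an asserted error disables the FRC unit
        "start_recovery": any_error,    # output is fed back to both cores
    }


def frc_unit(signal_a: int, signal_b: int, frc_enabled: bool) -> bool:
    """Return True if an FRC error is signaled (mismatch while enabled)."""
    return frc_enabled and (signal_a != signal_b)
```

If core A reports a parity error, `broadcast_error(True, False)` clears `frc_enabled`, so a later mismatch between the cores' signals no longer raises an FRC error and the recovery path is taken. The ordering problem discussed in the text arises when the mismatch reaches `frc_unit` while `frc_enabled` is still true.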
Depending on the particular operation of the system, there are several situations in which the error signal and the mismatched data signals from the execution cores (produced by a recoverable error) race to reach the FRC unit 332. For this reason, the device 344 may include a mechanism that accelerates error signal transmission, at least in FRC mode. In one embodiment, the device 344 supports a high-level interrupt, such as an MCA, that can operate in both FRC and high-performance modes. In high-performance mode, the error signal is subject to pipeline stalls, such as those in the front end of the execution core or in the L2 cache. This ensures that no unwanted MCA is taken, because the event that triggered the stall may invalidate the error signal. In FRC mode, the error signal bypasses these stalls. Bypassing the stalls in FRC mode may cause some unnecessary error signals, but it also reduces the chance that an FRC error is triggered before the non-FRC error signal disables the FRC unit 332. As explained with reference to Figure 7, embodiments of the processor 110 also include a hardware mechanism to mitigate the race between the error signal and the core signals that reflect the erroneous data.

Fig. 4 is a block diagram representing the data paths of an embodiment of the computing system 300, which includes FRC components for supporting the processor 310 in FRC mode. In this embodiment, the cache 340, the FSB 360, and the execution cores 320 are coupled through a series of buffers. For example, a write-out buffer (WOB) 410 stages data driven from the cache 340 to the main memory 380, and a snoop data buffer (SDB) 420 stages snoop data supplied from the execution cores 320 or the cache 340 to the FSB 360 in response to snoop hits in these structures (in addition to the shared cache 340, each execution core 320 may have one or more levels of cache).
A pair of write line buffers (WLB) 430(a), 430(b) stage data from the execution cores 320(a), 320(b), respectively, to the cache 340 or the FSB 360, and a pair of read line buffers (RLB) 440(a), 440(b) stage data from the FSB 360 to the cache 340 or the execution cores 320. The combining buffers (CB) 450(a), 450(b) collect data to be written to the memory 380 and send the data to the FSB 360 periodically. For example, multiple writes to the same memory line may be collected in CB 450 before a write transaction on the FSB 360 is triggered.

In this embodiment, the logic associated with these buffers provides FRC checking and data routing functions. For example, logic block 454 represents the XOR and MUX functions for the data in CB 450(a) and 450(b). If the processor 310 is operating in FRC mode, the XOR function provides an FRC check. If the processor 310 is operating in multi-core mode, the MUX provides a data routing function. Logic blocks 434 and 444 provide similar functions for the data in WLB 430(a) and 430(b) and RLB 440(a) and 440(b), respectively. MUXes 460, 470, 480 steer data from different sources to the cache 340, the FSB 360, and the execution cores 320.

As mentioned above, recovery mechanisms for errors detected within the FRC boundary can be implemented with different combinations of hardware, software, and firmware modules. One embodiment of the recovery mechanism uses code that is closely associated with the processor. For example, the Intel® Itanium® family of processors uses a layer of firmware called the processor abstraction layer (PAL), which provides an abstraction of the processor to the rest of the computing system. Implementing recovery in the PAL hides the recovery process from system-level code, such as the system abstraction layer (SAL), e.g., the BIOS, and the operating system.
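The dual role of the buffer logic blocks (XOR check in FRC mode, MUX routing in multi-core mode) can be illustrated with a small Python sketch. The function name `logic_block` and the `select_b` parameter are hypothetical, and the choice to forward the data from buffer (a) in FRC mode is an assumption made for illustration; the text does not say which copy is forwarded at this point.

```python
def logic_block(data_a: int, data_b: int, frc_mode: bool, select_b: bool = False):
    """Return (forwarded_data, frc_error) for a pair of buffer entries.

    FRC mode: XOR the two entries; any nonzero bit signals a mismatch.
    Multi-core mode: act as a 2:1 MUX and never raise an FRC error.
    """
    if frc_mode:
        mismatch = (data_a ^ data_b) != 0
        # Assumption: the entry from buffer (a) is the one passed downstream.
        return data_a, mismatch
    return (data_b if select_b else data_a), False
```

For example, `logic_block(0b1010, 0b1000, frc_mode=True)` reports a mismatch because the two entries differ in one bit, while in multi-core mode the same inputs are simply routed according to `select_b`.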
A PAL-based implementation of the recovery mechanism should be fast enough to avoid triggering a time-out enforced by the operating system. Recovery mechanisms can also use system-level code, such as SAL/BIOS, or operating system code. The latter implementations are not subject to the same time constraints as the PAL-based implementation. Unless stated otherwise, the recovery mechanisms described below can be implemented with code associated with any of the above resources.

FIG. 5 is a flowchart representing an error recovery mechanism for recovering from an error that is detected in an execution core before it triggers an FRC reset. In response to a parity, ECC, or other error detected in an execution core, a signal is broadcast 510 to indicate the start of a recovery routine. As long as the error is detected before it triggers an FRC reset, the erroneous data can be confined to the execution core in which it occurred, so that the machine state data of the other execution core can be used for recovery. Accordingly, the machine state of the good core is saved 520. To prepare the processor for recovery, both execution cores are initialized 530 to a specified state, and the saved machine state is restored 540 to the initialized cores. FRC mode is then reestablished 550 and the processor returns 560 to the interrupted code.

In one embodiment of the present invention, when the processor 110 is operating in FRC mode, one of the execution cores 120 can be designated as the master core and the other as the slave core. In this embodiment, the signals generated by the master and slave cores are compared at the FRC boundary to determine whether a reset is required. If there is no FRC reset, the signal generated by the master core is sent to the shared resource 170, and the signal generated by the slave core is discarded.
In this embodiment, a bit in a status register of each execution core 120 can be used to indicate whether it is the master or the slave execution core. This bit can be set when the system is initialized or reset. As detailed below, the master/slave status of an execution core can also be changed dynamically to support error recovery. For errors detected within the FRC boundary, such as recoverable errors, the actions of the master and slave cores differ, depending on which core the error originated in.

Fig. 6 is a flowchart representing a mechanism for recovering from an error detected in the execution core designated as the slave execution core. The operations of the slave execution core are shown on the left, and the operations of the master execution core are shown on the right.

If the slave execution core detects 610 an error (parity, ECC, etc.), routine 600 is initiated. In a routine 600 implemented by PAL or comparable processor-level code, the broadcast of the interrupt signal can be confined to components on the processor chip, such as the master execution core. In addition to signaling the error, the slave execution core disables 630 the FRC unit and suspends its own activity. Disabling the FRC unit prevents an FRC reset from being triggered when the error reaches the FRC boundary, and suspending activity in the slave execution core prevents it from interfering with the recovery process.

In response to the interrupt 624, the master execution core determines 640 whether its state data contains any errors. For example, each execution core may include a status bit that is set when an error is detected. Except in the very rare case where soft errors occur almost simultaneously in both execution cores, the master core will be clean. If the master core is not clean 640, no uncorrupted processor state is available for recovery. In this case, the master core issues 642 a reset signal to the slave execution core and the computing system performs 644 a full, e.g., FRC-level, reset.

If the state data of the master core is uncorrupted, the master core saves 660 its machine state and flushes 664 the queues and buffers in its pipeline. For example, the master core may save the contents of its data and control registers and low-level caches to a protected memory region. The master core also sends 668 a limited reset signal to the slave core and sets 676 its own resources to a specified state, e.g., by initializing its pipeline. The slave core detects 670 the limited reset and initializes 674 its pipeline, bringing the states of the two cores into sync.

Once the two cores are synchronized as described above, FRC mode is restarted 680. This can be accomplished by having each core execute a handler routine that sets the appropriate bit in its status/control register. The saved state is restored 684 to both execution cores, and control is returned 690 to the interrupted code sequence.
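The sequence of Fig. 6 can be walked through with a minimal Python sketch. It is a software model under simplifying assumptions, not the PAL implementation: the `Core` class, the `system` dictionary, and the function name are hypothetical, machine state is modeled as a plain dictionary, and pipeline initialization is modeled as clearing that dictionary. The step numbers in the comments follow the text.

```python
from dataclasses import dataclass, field


@dataclass
class Core:
    """Hypothetical model of an execution core."""
    role: str                      # "master" or "slave"
    state: dict = field(default_factory=dict)
    error: bool = False            # status bit set when an error is detected
    halted: bool = False


def recover_from_slave_error(master: Core, slave: Core, system: dict) -> str:
    # 630: the slave disables the FRC unit and suspends its own activity.
    system["frc_enabled"] = False
    slave.halted = True

    # 640: the master checks whether its own state is corrupted.
    if master.error:
        return "full_reset"        # 642/644: no clean state is available

    # 660: save the master's (clean) machine state.
    saved = dict(master.state)

    # 664-676: bring both cores to the same initialized state.
    master.state = {}
    slave.state = {}
    slave.error = False
    slave.halted = False

    # 680: restart FRC mode; 684: restore the saved state to both cores.
    system["frc_enabled"] = True
    master.state = dict(saved)
    slave.state = dict(saved)
    return "resumed"               # 690: return to the interrupted code
```

After a successful run, both cores hold identical copies of the master's saved state, which is the precondition for resuming lockstep comparison.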

Method 600 represents an error recovery mechanism for the case in which the error is detected in the execution core currently designated as the slave execution core. In one embodiment, the slave execution core is the execution core that does not "control" the shared resources. For example, in FRC mode, signals from the slave execution core are discarded after being compared at the FRC boundary with signals from the master execution core. If no FRC error is detected, the signals from the master execution core are used to control the shared resources outside the FRC boundary.

If the error originates in the master execution core rather than the slave execution core, recovery can be handled by swapping the master/slave designations of the two execution cores. For example, the master/slave designation can be indicated by the state of a bit in a status register associated with each execution core. The execution core whose status bit is in the master state controls the shared resources, which are used to implement the state-saving operations of recovery routine 600, such as operation 660.
In one embodiment of the recovery routine, the execution core in which the error originated checks its master/slave status bit. If the status bit indicates that it is the slave execution core, method 600 is executed as described above. If the status bit indicates that it is the master execution core, it signals the slave execution core to change its status to master, changes its own status to slave, and suspends activity.

Fig. 7 is a block diagram showing an embodiment of an FRC checker 730 that mitigates the race condition between recoverable error handling and unrecoverable error handling. The FRC checker 730 includes a compare unit 734, a queue 736, and a timer unit 738. The queue 736 receives data from the execution core (a), and the compare unit 734 compares the data from core A and core B and sets a status flag to indicate whether the comparison yields a match. If the data match, the status flag is set to indicate the match. If the data do not match, the flag is set to indicate a mismatch and the timer unit 738 is triggered to start a countdown. If the error detector 140 receives an error flag before the countdown ends, it disables the FRC checker 730 and activates the recovery module 150 to implement the recovery routine.
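The timer-based race mitigation of the FRC checker can be sketched as a small state machine in Python. This is an illustrative model only: the class name `FrcChecker`, the `step` method, and the cycle-based countdown are assumptions, and the "reset" outcome on timer expiry follows the behavior recited in claims 25 and 26 rather than anything stated in this paragraph.

```python
class FrcChecker:
    """Hypothetical model of an FRC checker with a mismatch countdown timer."""

    def __init__(self, timeout_cycles: int):
        self.timeout = timeout_cycles
        self.remaining = None      # None means no countdown is running
        self.enabled = True

    def step(self, data_a: int, data_b: int, error_flag: bool) -> str:
        """Advance one cycle and report the checker's decision."""
        if not self.enabled:
            return "disabled"
        if error_flag:
            # A recoverable error arrived in time: disable checking, recover.
            self.enabled = False
            self.remaining = None
            return "recover"
        if self.remaining is None:
            if data_a == data_b:
                return "match"
            self.remaining = self.timeout  # mismatch starts the countdown
            return "pending"
        # A countdown is running; new data is ignored until it resolves.
        self.remaining -= 1
        if self.remaining <= 0:
            return "reset"         # no error flag before expiry: unrecoverable
        return "pending"
```

The delay gives a late-arriving error signal a chance to reclassify a mismatch as recoverable, at the cost of postponing the reset for a genuine FRC error by `timeout_cycles` cycles.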

Disclosed above is a mechanism for handling recoverable and unrecoverable errors in a multi-core processor. The multiple cores can operate in an FRC mode, in which one or more checker units compare signals from the cores to detect unrecoverable errors. In addition, each core includes an error unit to detect recoverable errors. If a recoverable error is detected, the checker units are disabled and a recovery routine is executed. A multi-mode embodiment of the multi-core processor may include an arbitration unit near the checker to control access to shared resources. Placing the FRC boundary close to the shared resources can increase the amount of logic protected by the FRC boundary and reduce the wiring required for multi-core mode operation.

Embodiments of the present invention can detect errors that go undetected in systems without FRC capability, and support recovery from all detectable errors, including those that are traditionally handled by reset in other FRC-capable processors.

The embodiments disclosed herein are provided to illustrate different features of the invention. Those skilled in processor design, having the benefit of this disclosure, will recognize that various changes and modifications of the disclosed embodiments fall within the spirit and scope of the claims below.

[Brief description of the drawings]

The present invention can be understood with reference to the drawings, in which like elements are labeled with like numbers. These figures are provided to illustrate selected embodiments of the invention and are not intended to limit its scope.

Figure 1 is a block diagram of a processor that includes dual execution cores and FRC detection and handling logic.
Figure 2 is a block diagram of an embodiment of the processor of Figure 1 that can operate in multiple modes.
Figure 3A is a block diagram of an embodiment of a computing system that implements the multi-mode processor of Figure 2.
Figure 3B is a block diagram of a mechanism for signaling recoverable errors in the computing system of Figure 3A.
Figure 4 is a block diagram representing the data paths of the computing system of Figure 3A.
Figure 5 is a flowchart representing an embodiment of a mechanism for recovering from a soft error in an execution core.
Figure 6 is a flowchart representing an embodiment of a mechanism for recovering from a soft error in a multi-execution-core processor.
Figure 7 is a block diagram representing an embodiment of an FRC checker that mitigates race conditions between the recoverable error mechanism and the unrecoverable error mechanism.

Main component reference table

110 processor
120 execution core
120(a) first execution core
120(b) second execution core
130 FRC checker
140 error detector
150 recovery module
160 reset module
170 shared resource
104 dotted line (FRC boundary)
124 data pipeline
128 error pipeline
180 arbitration unit
300 comparison unit
310 processor
370 chipset
380 main memory
390 non-volatile memory
398 peripheral device
320 execution core
324 execution resources
320(a) first execution core
320(b) second execution core
328 bus cluster
340 cache
360 front-side bus
330(a) interface unit
330(b) interface unit
330 interface unit
332 FRC unit
334 arbitration unit
336(a) error unit
336(b) error unit
392 recovery routine
394 reset routine
370(a) interrupt controller
370(b) interrupt controller
370 interrupt controller
374 first component
378 second component
364 region (FSB clock domain)
374(a) component
374(b) component
378(a) component
378(b) component
362 snoop block
344 device
338 OR gate
339 OR gate
410 write-out buffer (WOB)
420 snoop data buffer (SDB)
430(a) write line buffer (WLB)
430(b) write line buffer (WLB)
440(a) read line buffer
440(b) read line buffer
450(a) combining buffer
450(b) combining buffer
454 logic block
434 logic block
444 logic block
600 recovery routine
660 operation
730 FRC checker
734 compare unit
736 queue
738 timer unit

Claims (1)

1. A processor comprising: first and second execution cores to operate in an FRC mode; a resource to process transactions from at least one of the first and second execution cores; and an interface control unit to regulate access to the resource by the first and second execution cores, the interface control unit including an FRC check unit to compare transaction signals from the first and second execution cores and to signal an error if the comparison indicates a mismatch.

2. The processor of claim 1, further comprising an error detector to detect errors in the first and second execution cores and to disable the FRC checker in response to detecting an error.

3. The processor of claim 2, wherein the error detector comprises first and second error detectors to detect errors in the first and second execution cores, respectively.

4. The processor of claim 3, wherein, in response to an error in the first execution core, the first error detector triggers an error signal, disables the FRC check unit, and initiates a recovery procedure using the second execution core.

5. The processor of claim 4, wherein the second execution core is designated as an FRC slave and is redesignated as an FRC master in response to the error signal.

6. The processor of claim 5, wherein the second execution core saves its machine state data to a memory location and executes a reset sequence.

7. The processor of claim 2, wherein the first and second execution cores can also operate in a multi-core mode and the interface control unit further includes an arbitration unit to regulate access to the shared resource by the two execution cores when they are operating in multi-core mode.

8. The processor of claim 7, wherein the shared resource comprises a cache that processes transactions from both the first and second cores in multi-core mode and processes transactions from only one of the first and second cores in FRC mode.

9. The processor of claim 7, wherein, when an error is detected, the error detector triggers an interrupt if the processor is in multi-core mode and triggers an accelerated interrupt if the processor is in FRC mode.

10. The processor of claim 9, wherein the accelerated interrupt bypasses a portion of the execution core that is traversed by the interrupt in multi-core mode.

11. A computing system comprising: a first memory location to store a recovery routine; a second memory location to store a reset routine; first and second execution cores capable of operating in an FRC mode; an error unit to initiate the recovery routine in response to detection of an error in one of the first and second execution cores; and an FRC checker to initiate the reset routine in response to detection of a mismatch between signals from the first and second execution cores.

12. The computing system of claim 11, wherein the error unit disables the FRC checker in response to detection of the error in one of the first and second execution cores.

13. The computing system of claim 12, wherein the reset routine includes instructions executable by the first and second execution cores to initialize the first and second execution cores in a multi-core mode or in the FRC mode.

14. The computing system of claim 13, further comprising a cache that is shared by the first and second execution cores if the first and second execution cores are in multi-core mode.

15. The computing system of claim 14, further comprising an arbitration unit to manage access to the cache by the first and second execution cores in multi-core mode.

16. The computing system of claim 15, wherein, in FRC mode, the FRC checker monitors transaction signals sent from the first and second execution cores to the arbitration unit and initiates the reset routine in response to a mismatch in the transaction signals.

17. The computing system of claim 11, wherein the first and second execution cores operate as a master execution core and a slave execution core, respectively, in FRC mode.

18. The computing system of claim 17, wherein the first execution core is disabled and the second execution core operates as the master execution core in response to an error in the first execution core.

19. The computing system of claim 11, wherein the first and second execution cores can be initialized in a multi-core mode or an FRC mode.

20. The computing system of claim 19, wherein the error unit triggers an interrupt to the first and second execution units in response to an error in an execution core.

21. The computing system of claim 20, wherein the interrupt is an accelerated interrupt if the execution core is in FRC mode.

22. The system of claim 21, wherein the accelerated interrupt bypasses a portion of the execution core.

23. A recovery method comprising: operating first and second execution cores in an FRC mode; monitoring data in the first and second execution cores for errors; comparing signals generated by the first and second execution cores; executing a recovery routine in response to an error in the first or second execution core; and executing a reset routine in response to a mismatch between the signals generated by the first and second execution cores.

24. The recovery method of claim 23, further comprising suspending the signal comparison in response to the error in the first or second execution core.

25. The recovery method of claim 24, wherein executing a reset routine in response to the mismatch further comprises: triggering a delay interval in response to the mismatch; and executing the reset routine if no error is detected in an execution core before the interval expires.

26. The recovery method of claim 25, further comprising executing the recovery routine if an error is detected in an execution core before the interval expires.

27. The recovery method of claim 23, wherein operating the first and second execution cores in FRC mode comprises operating the first and second execution cores in FRC mode or in multi-core mode in response to a reset signal.

28. The recovery method of claim 27, wherein executing the recovery routine comprises: executing the recovery routine in response to an error signal indicated by an interrupt if the execution cores are operating in multi-core mode; and executing the recovery routine in response to an error signal indicated by an accelerated interrupt if the execution cores are operating in FRC mode.

29. The recovery method of claim 23, wherein operating the first and second execution cores in FRC mode further comprises designating the first and second execution cores as a master execution core and a slave execution core, respectively.

30. The recovery method of claim 23, wherein executing the recovery routine further comprises disabling the first execution core and designating the second execution core as the master execution core in response to an error in the first execution core.
TW092132000A 2002-12-19 2003-11-14 On-die mechanism for high-reliability processor TWI236620B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US10/324,957 US7055060B2 (en) 2002-12-19 2002-12-19 On-die mechanism for high-reliability processor

Publications (2)

Publication Number Publication Date
TW200416595A TW200416595A (en) 2004-09-01
TWI236620B true TWI236620B (en) 2005-07-21

Family

ID=32593612

Family Applications (1)

Application Number Title Priority Date Filing Date
TW092132000A TWI236620B (en) 2002-12-19 2003-11-14 On-die mechanism for high-reliability processor

Country Status (9)

Country Link
US (1) US7055060B2 (en)
EP (1) EP1573544B1 (en)
JP (1) JP2006510117A (en)
CN (1) CN100375050C (en)
AT (1) ATE461484T1 (en)
AU (1) AU2003287729A1 (en)
DE (1) DE60331771D1 (en)
TW (1) TWI236620B (en)
WO (1) WO2004061666A2 (en)

Families Citing this family (91)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5260950A (en) * 1991-09-17 1993-11-09 Ncr Corporation Boundary-scan input circuit for a reset pin
US6625756B1 (en) * 1997-12-19 2003-09-23 Intel Corporation Replay mechanism for soft error recovery
US7139947B2 (en) 2000-12-22 2006-11-21 Intel Corporation Test access port
US7194671B2 (en) 2001-12-31 2007-03-20 Intel Corporation Mechanism handling race conditions in FRC-enabled processors
US7278080B2 (en) * 2003-03-20 2007-10-02 Arm Limited Error detection and recovery within processing stages of an integrated circuit
DE10349580A1 (en) * 2003-10-24 2005-05-25 Robert Bosch Gmbh Method and device for operand processing in a processor unit
US20050114735A1 (en) * 2003-11-20 2005-05-26 Smith Zachary S. Systems and methods for verifying core determinacy
US20050132239A1 (en) * 2003-12-16 2005-06-16 Athas William C. Almost-symmetric multiprocessor that supports high-performance and energy-efficient execution
US7296181B2 (en) * 2004-04-06 2007-11-13 Hewlett-Packard Development Company, L.P. Lockstep error signaling
US7237144B2 (en) * 2004-04-06 2007-06-26 Hewlett-Packard Development Company, L.P. Off-chip lockstep checking
US7290169B2 (en) * 2004-04-06 2007-10-30 Hewlett-Packard Development Company, L.P. Core-level processor lockstepping
US7873776B2 (en) * 2004-06-30 2011-01-18 Oracle America, Inc. Multiple-core processor with support for multiple virtual processors
US7685354B1 (en) 2004-06-30 2010-03-23 Sun Microsystems, Inc. Multiple-core processor with flexible mapping of processor cores to cache banks
KR20070062579A (en) * 2004-10-25 2007-06-15 로베르트 보쉬 게엠베하 Method and apparatus for distributing data from at least one data source in a multiprocessor system
DE102004051937A1 (en) * 2004-10-25 2006-05-04 Robert Bosch Gmbh Data distributing method for multiprocessor system, involves switching between operating modes e.g. safety and performance modes, of computer units, where data distribution and/or selection of data source is dependent upon one mode
US7856569B2 (en) * 2004-10-25 2010-12-21 Robert Bosch Gmbh Method and device for a switchover and for a data comparison in a computer system having at least two processing units
JP5053854B2 (en) * 2004-10-25 2012-10-24 ローベルト ボッシュ ゲゼルシャフト ミット ベシュレンクテル ハフツング Method and apparatus for switching in a computer system having at least two implementation units
DE502005005286D1 (en) * 2004-10-25 2008-10-16 Bosch Gmbh Robert DEVICE AND METHOD FOR MODULE SWITCHING ON A COMPUTER SYSTEM WITH AT LEAST TWO OUTPUT UNITS
DE102005037213A1 (en) * 2004-10-25 2007-02-15 Robert Bosch Gmbh Operating modes switching method for use in computer system, involves switching between operating modes using switching unit, where switching is triggered by signal generated outside system, and identifier is assigned to signal
US20060112226A1 (en) * 2004-11-19 2006-05-25 Hady Frank T Heterogeneous processors sharing a common cache
JP2006178636A (en) * 2004-12-21 2006-07-06 Nec Corp Fault tolerant computer and its control method
JP3897046B2 (en) * 2005-01-28 2007-03-22 横河電機株式会社 Information processing apparatus and information processing method
US7467325B2 (en) 2005-02-10 2008-12-16 International Business Machines Corporation Processor instruction retry recovery
US20060184771A1 (en) * 2005-02-11 2006-08-17 International Business Machines Mini-refresh processor recovery as bug workaround method using existing recovery hardware
US8732368B1 (en) * 2005-02-17 2014-05-20 Hewlett-Packard Development Company, L.P. Control system for resource selection between or among conjoined-cores
JP4555713B2 (en) * 2005-03-17 2010-10-06 富士通株式会社 Error notification method and information processing apparatus
US7747932B2 (en) * 2005-06-30 2010-06-29 Intel Corporation Reducing the uncorrectable error rate in a lockstepped dual-modular redundancy system
DE102005037247A1 (en) * 2005-08-08 2007-02-15 Robert Bosch Gmbh Method and device for controlling a memory access in a computer system having at least two execution units
DE102005037233A1 (en) * 2005-08-08 2007-02-15 Robert Bosch Gmbh Method and device for data processing
DE102005037234A1 (en) * 2005-08-08 2007-02-15 Robert Bosch Gmbh Device and method for storing data and / or commands in a computer system having at least two execution units and at least one first memory or memory area for data and / or commands
DE102005037246A1 (en) * 2005-08-08 2007-02-15 Robert Bosch Gmbh Method and device for controlling a computer system having at least two execution units and a comparison unit
US7502957B2 (en) * 2005-09-09 2009-03-10 International Business Machines Corporation Method and system to execute recovery in non-homogeneous multi processor environments
US7412353B2 (en) 2005-09-28 2008-08-12 Intel Corporation Reliable computing with a many-core processor
JP4653841B2 (en) * 2006-02-28 2011-03-16 インテル・コーポレーション Enhanced reliability of multi-core processors
US7774590B2 (en) * 2006-03-23 2010-08-10 Intel Corporation Resiliently retaining state information of a many-core processor
US7802073B1 (en) * 2006-03-29 2010-09-21 Oracle America, Inc. Virtual core management
JP2007328461A (en) * 2006-06-06 2007-12-20 Matsushita Electric Ind Co Ltd Asymmetric multiprocessor
US20100325481A1 (en) * 2006-10-20 2010-12-23 Freescale Semiconductor, Inc. Device having redundant core and a method for providing core redundancy
CN101236515B (en) * 2007-01-31 2010-05-19 迈普通信技术股份有限公司 Multi-core system single-core abnormity restoration method
DE102007009909B4 (en) * 2007-02-28 2016-09-08 Globalfoundries Inc. A method of validating an atomic transaction in a multi-core microprocessor environment
ATE537502T1 (en) * 2007-03-29 2011-12-15 Fujitsu Ltd INFORMATION PROCESSING APPARATUS AND ERROR PROCESSING METHOD
US9207661B2 (en) 2007-07-20 2015-12-08 GM Global Technology Operations LLC Dual core architecture of a control module of an engine
US7797512B1 (en) 2007-07-23 2010-09-14 Oracle America, Inc. Virtual core management
CN101136729B (en) * 2007-09-20 2011-08-03 华为技术有限公司 Method, system and device for implementing high usability
KR101038464B1 (en) * 2007-09-25 2011-06-01 후지쯔 가부시끼가이샤 Information processing device and control method
KR100958303B1 (en) * 2007-12-12 2010-05-19 한국전자통신연구원 Load balancing system and method through dynamic loading and execution of module device using internal core communication channel in multi-core system environment
US7996663B2 (en) * 2007-12-27 2011-08-09 Intel Corporation Saving and restoring architectural state for processor cores
GB2458260A (en) 2008-02-26 2009-09-16 Advanced Risc Mach Ltd Selectively disabling error repair circuitry in an integrated circuit
US8015390B1 (en) * 2008-03-19 2011-09-06 Rockwell Collins, Inc. Dissimilar processor synchronization in fly-by-wire high integrity computing platforms and displays
US7941699B2 (en) * 2008-03-24 2011-05-10 Intel Corporation Determining a set of processor cores to boot
JP4876093B2 (en) * 2008-03-31 2012-02-15 株式会社日立製作所 Control device task management device and control device task management method
US8037350B1 (en) * 2008-04-30 2011-10-11 Hewlett-Packard Development Company, L.P. Altering a degree of redundancy used during execution of an application
JP5344936B2 (en) * 2009-01-07 2013-11-20 株式会社日立製作所 Control device
JP2010282296A (en) * 2009-06-02 2010-12-16 Sanyo Electric Co Ltd Data check circuit
JP4911372B2 (en) * 2009-10-06 2012-04-04 日本電気株式会社 Time-out prevention method at the time of CPU re-initialization accompanied by CPU re-reset, apparatus and program thereof
US20110191602A1 (en) * 2010-01-29 2011-08-04 Bearden David R Processor with selectable longevity
EP2367129A1 (en) 2010-03-19 2011-09-21 Nagravision S.A. Method for checking data consistency in a system on chip
JP5445669B2 (en) 2010-03-24 2014-03-19 富士通株式会社 Multi-core system and startup method
CN101807076B (en) * 2010-05-26 2011-11-09 哈尔滨工业大学 Duplication redundancy fault-tolerant high-reliability control system having synergistic warm standby function based on PROFIBUS field bus
US8412980B2 (en) * 2010-06-04 2013-04-02 International Business Machines Corporation Fault tolerant stability critical execution checking using redundant execution pipelines
US8522076B2 (en) * 2010-06-23 2013-08-27 International Business Machines Corporation Error detection and recovery in a shared pipeline
US9063730B2 (en) 2010-12-20 2015-06-23 Intel Corporation Performing variation-aware profiling and dynamic core allocation for a many-core processor
EP2525292A1 (en) * 2011-05-20 2012-11-21 ABB Technology AG System and method for using redundancy of controller operation
US9098561B2 (en) 2011-08-30 2015-08-04 Intel Corporation Determining an effective stress level on a processor
WO2013095470A1 (en) * 2011-12-21 2013-06-27 Intel Corporation Error framework for a microprocessor and system
CN102591763B (en) * 2011-12-31 2015-03-04 龙芯中科技术有限公司 System and method for detecting faults of integral processor on basis of determinacy replay
US9146835B2 (en) * 2012-01-05 2015-09-29 International Business Machines Corporation Methods and systems with delayed execution of multiple processors
US9058419B2 (en) 2012-03-14 2015-06-16 GM Global Technology Operations LLC System and method for verifying the integrity of a safety-critical vehicle control system
US11901088B2 (en) 2012-05-04 2024-02-13 Smr Inventec, Llc Method of heating primary coolant outside of primary coolant loop during a reactor startup operation
US10096389B2 (en) 2012-05-21 2018-10-09 Smr Inventec, Llc Loss-of-coolant accident reactor cooling system
US11935663B2 (en) 2012-05-21 2024-03-19 Smr Inventec, Llc Control rod drive system for nuclear reactor
EP2864845B1 (en) * 2012-07-17 2016-09-07 Siemens Aktiengesellschaft Automated reconfiguration of a discrete event control loop
US9317389B2 (en) 2013-06-28 2016-04-19 Intel Corporation Apparatus and method for controlling the reliability stress rate on a processor
US10203958B2 (en) * 2013-07-15 2019-02-12 Texas Instruments Incorporated Streaming engine with stream metadata saving for context switching
IN2013CH04831A (en) 2013-10-28 2015-08-07 Empire Technology Dev Llc
US9904339B2 (en) 2014-09-10 2018-02-27 Intel Corporation Providing lifetime statistical information for a processor
US9785446B2 (en) * 2014-12-10 2017-10-10 Dell Products L.P. Efficient boot from a connected device
US9704598B2 (en) 2014-12-27 2017-07-11 Intel Corporation Use of in-field programmable fuses in the PCH dye
US9842036B2 (en) * 2015-02-04 2017-12-12 Apple Inc. Methods and apparatus for controlled recovery of error information between independently operable processors
US10002056B2 (en) * 2015-09-15 2018-06-19 Texas Instruments Incorporated Integrated circuit chip with cores asymmetrically oriented with respect to each other
US10740167B2 (en) * 2016-12-07 2020-08-11 Electronics And Telecommunications Research Institute Multi-core processor and cache management method thereof
US10429919B2 (en) 2017-06-28 2019-10-01 Intel Corporation System, apparatus and method for loose lock-step redundancy power management
US10466702B1 (en) * 2018-02-14 2019-11-05 Rockwell Collins, Inc. Dual independent autonomous agent architecture for aircraft
US10831578B2 (en) 2018-09-28 2020-11-10 Nxp Usa, Inc. Fault detection circuit with progress register and status register
EP3719649A1 (en) * 2019-04-05 2020-10-07 Robert Bosch GmbH Clock fractional divider module, image and/or video processing module, and apparatus
US11841776B2 (en) * 2019-06-12 2023-12-12 Intel Corporation Single chip multi-die architecture having safety-compliant cross-monitoring capability
CN110806899B (en) * 2019-11-01 2021-08-24 西安微电子技术研究所 Assembly line tight coupling accelerator interface structure based on instruction extension
CN111104243B (en) * 2019-12-26 2021-05-28 江南大学 A low-latency dual-mode lockstep soft-error-tolerant processor system
US11892505B1 (en) * 2022-09-15 2024-02-06 Stmicroelectronics International N.V. Debug and trace circuit in lockstep architectures, associated method, processing system, and apparatus
US12332737B2 (en) * 2023-03-08 2025-06-17 Nxp B.V. Method and apparatus for fault indication propagation and fault masking in a hierarchical arrangement of systems
CN117539591B (en) * 2023-11-17 2024-09-27 天翼云科技有限公司 A method and device for high availability of virtual machines based on MCE panic in cloud computing scenarios

Family Cites Families (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US3701972A (en) * 1969-12-16 1972-10-31 Computer Retrieval Systems Inc Data processing system
FR2182259A5 (en) * 1972-04-24 1973-12-07 Cii
US4325120A (en) * 1978-12-21 1982-04-13 Intel Corporation Data processing system
JPH01307815A (en) * 1988-06-07 1989-12-12 Mitsubishi Electric Corp Reset system for information processor
EP0356538B1 (en) * 1988-08-27 1993-12-22 International Business Machines Corporation Arrangement in data processing system for system initialization and reset
US5260950A (en) * 1991-09-17 1993-11-09 Ncr Corporation Boundary-scan input circuit for a reset pin
US5758058A (en) * 1993-03-31 1998-05-26 Intel Corporation Apparatus and method for initializing a master/checker fault detecting microprocessor
US6061599A (en) * 1994-03-01 2000-05-09 Intel Corporation Auto-configuration support for multiple processor-ready pair or FRC-master/checker pair
FR2721122B1 (en) * 1994-06-14 1996-07-12 Commissariat Energie Atomique Calculation unit with plurality of redundant computers.
US5802132A (en) * 1995-12-29 1998-09-01 Intel Corporation Apparatus for generating bus clock signals with a 1/N characteristic in a 2/N mode clocking scheme
US5915082A (en) * 1996-06-07 1999-06-22 Lockheed Martin Corporation Error detection and fault isolation for lockstep processor systems
US5862373A (en) * 1996-09-06 1999-01-19 Intel Corporation Pad cells for a 2/N mode clocking scheme
US5935266A (en) * 1996-11-15 1999-08-10 Lucent Technologies Inc. Method for powering-up a microprocessor under debugger control
EP0884598A1 (en) * 1997-06-13 1998-12-16 BULL HN INFORMATION SYSTEMS ITALIA S.p.A. Integrated circuit with serial test interface and logic for loading a functional register using said interface
US6625756B1 (en) * 1997-12-19 2003-09-23 Intel Corporation Replay mechanism for soft error recovery
WO1999052033A1 (en) * 1998-04-03 1999-10-14 Hitachi, Ltd. Semiconductor device
US6173351B1 (en) * 1998-06-15 2001-01-09 Sun Microsystems, Inc. Multi-processor system bridge
US6357024B1 (en) * 1998-08-12 2002-03-12 Advanced Micro Devices, Inc. Electronic system and method for implementing functional redundancy checking by comparing signatures having relatively small numbers of signals
US6393582B1 (en) * 1998-12-10 2002-05-21 Compaq Computer Corporation Error self-checking and recovery using lock-step processor pair architecture
US6640313B1 (en) * 1999-12-21 2003-10-28 Intel Corporation Microprocessor with high-reliability operating mode
US6615366B1 (en) * 1999-12-21 2003-09-02 Intel Corporation Microprocessor with dual execution core operable in high reliability mode
US6625749B1 (en) * 1999-12-21 2003-09-23 Intel Corporation Firmware mechanism for correcting soft errors

Also Published As

Publication number Publication date
WO2004061666A3 (en) 2005-06-23
CN1729456A (en) 2006-02-01
AU2003287729A1 (en) 2004-07-29
US7055060B2 (en) 2006-05-30
CN100375050C (en) 2008-03-12
AU2003287729A8 (en) 2004-07-29
EP1573544A2 (en) 2005-09-14
HK1079316A1 (en) 2006-03-31
WO2004061666A2 (en) 2004-07-22
US20040123201A1 (en) 2004-06-24
JP2006510117A (en) 2006-03-23
ATE461484T1 (en) 2010-04-15
TW200416595A (en) 2004-09-01
DE60331771D1 (en) 2010-04-29
EP1573544B1 (en) 2010-03-17

Similar Documents

Publication Publication Date Title
TWI236620B (en) On-die mechanism for high-reliability processor
JP2552651B2 (en) Reconfigurable dual processor system
US6948094B2 (en) Method of correcting a machine check error
JP4073464B2 (en) Main memory system and checkpointing protocol for fault tolerant computer system using read buffer
US5958070A (en) Remote checkpoint memory system and protocol for fault-tolerant computer system
US7827443B2 (en) Processor instruction retry recovery
EP0479230B1 (en) Recovery method and apparatus for a pipelined processing unit of a multiprocessor system
TWI274991B (en) A method, apparatus, and system for buffering instructions
US10817369B2 (en) Apparatus and method for increasing resilience to faults
US20070038891A1 (en) Hardware checkpointing system
JP3301992B2 (en) Computer system with power failure countermeasure and method of operation
US7734949B2 (en) Information error recovery apparatus and methods
US20120233499A1 (en) Device for Improving the Fault Tolerance of a Processor
US7194671B2 (en) Mechanism handling race conditions in FRC-enabled processors
Mack et al. Ibm power6 reliability
US10289332B2 (en) Apparatus and method for increasing resilience to faults
Kleen Machine check handling on Linux
JP4531535B2 (en) Hardware error control method in instruction control device with instruction processing stop means
HK1079316B (en) On-die mechanism for high-reliability processor

Legal Events

Date Code Title Description
MM4A Annulment or lapse of patent due to non-payment of fees