
HK1096744B - System and method to determine a healthy group of processors and associated firmware for booting a system


Info

Publication number
HK1096744B
Authority
HK
Hong Kong
Prior art keywords
processor
health status
determining
sending
group
Prior art date
Application number
HK07104384.9A
Other languages
Chinese (zh)
Other versions
HK1096744A1 (en
Inventor
Schelling Todd
Original Assignee
Intel Corporation (英特尔公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US10/171,210 (US7350063B2)
Application filed by Intel Corporation (英特尔公司)
Publication of HK1096744A1
Publication of HK1096744B

Abstract

A system and method to determine a healthy group of processors and associated firmware for booting a system after a resetting event is disclosed. Redundant copies of processor specific firmware are examined for validity. Processors determine their own health status, and one processor determines a group of processors with the best available health status. Inter-processor interrupt messages provide the communication mechanism to allow an algorithm to determine a group of processors to continue booting the system.

Description

System and method for determining a healthy group of processors and associated firmware for booting a system
Technical Field
The present invention relates generally to microprocessor systems, and more particularly to microprocessor systems capable of multiprocessor operations using field upgradeable firmware.
Background
Processors within a microprocessor system may rely on firmware to perform self-test and boot operations after a reset event. In a multiprocessor system, even processors in the same general-purpose processor family may differ from each other in processor speed, stepping level, architectural revision, and many other parameters. Thus, firmware may contain several modules, each specific to a group of processors within a general processor family.
Additionally, field upgrades to these firmware modules may be required for several reasons. Flash memory or other field-upgradeable memory may initially contain original firmware modules that may be overwritten with updated firmware modules at a later date. However, because the flash memory can be written, its contents may also become corrupted. The firmware that controls the firmware update may itself be corrupted, in which case the system cannot be restored in the field. The system may then need to be returned to the manufacturer so that the flash memory can be physically replaced with a new flash memory module containing uncorrupted firmware.
Disclosure of Invention
According to an aspect of the invention, there is provided a system comprising: a first processor to determine a first processor health status; a second processor coupled to the first processor for determining a second processor health status; and a hardware semaphore register coupled to the first processor and the second processor, wherein either or both of the first processor and the second processor are operable to attempt a boot process, and the first processor and the second processor share control of a system boot operation when the first processor health state equals the second processor health state.
According to another aspect of the invention, there is provided a method comprising: determining a first processor health status; determining a second processor health status; sending the second processor health status to the first processor; determining a group health status from the first processor health status and the second processor health status; enabling the first processor to continue booting operations when the group health state corresponds to the first processor health state; enabling the second processor to continue booting operations when the group health state corresponds to the second processor health state; and enabling the first processor and the second processor to share control of a system boot operation when the first processor health state equals the second processor health state.
According to yet another aspect of the present invention, there is provided a method comprising: determining a first processor health status; determining a second processor health status; sending the second processor health status to the first processor; reading a hardware semaphore register by the first processor before reading the hardware semaphore register by the second processor; determining a group health status from the first processor health status and the second processor health status; enabling the second processor to continue booting operations by sending the group health status to the second processor in response to a health status request when the group health status corresponds to the second processor health status, wherein the step of determining the first processor health status comprises checking a first firmware interface table and a second firmware interface table using a general purpose processor abstraction layer.
According to yet another aspect of the invention, there is provided an apparatus comprising: means for determining a first processor health status; means for determining a second processor health status; means for sending the second processor health status to the first processor; means for determining a group health status based on the first processor health status and the second processor health status; means for enabling the first processor to continue boot operations when the group health state corresponds to the first processor health state; means for enabling the second processor to continue boot operations when the group health state corresponds to the second processor health state; and means for enabling the first processor and the second processor to share control of system boot operations when the first processor health state equals the second processor health state.
According to yet another aspect of the present invention, there is provided an apparatus comprising: means for determining a first processor health status; means for determining a second processor health status; means for sending the second processor health status to the first processor; means for reading a hardware semaphore register by the first processor before the hardware semaphore register is read by the second processor; means for determining a group health status based on the first processor health status and the second processor health status; means for enabling the second processor to continue boot operations by sending the group health status to the second processor in response to a health status request when the group health status corresponds to the second processor health status, wherein the means for determining the first processor health status comprises means for checking a first firmware interface table and a second firmware interface table using a general purpose processor abstraction layer.
Drawings
The present invention is illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to similar elements and in which:
FIG. 1 is a schematic diagram of system hardware components, according to one embodiment.
FIG. 2 illustrates software components in memory according to one embodiment.
FIG. 3 is an inter-component messaging diagram according to one embodiment of the invention.
FIG. 4 is a flow diagram illustrating derivation of local processor health, according to one embodiment of the invention.
FIG. 5 is a flow diagram illustrating the selection and initialization of healthy processors, according to one embodiment of the invention.
Detailed Description
The following description describes techniques for selecting and initializing processors in a multiprocessor system. In the following description, numerous specific details such as logic implementations, software module allocation, bus signaling techniques, and details of operation are set forth in order to provide a more thorough understanding of the present invention. However, it will be recognized by one of ordinary skill in the art that the present invention may be practiced without these specific details. In other instances, control structures, gate level circuits and full software instruction sequences have not been shown in detail in order not to obscure the invention. Those of ordinary skill in the art, with the included descriptions, will be able to implement appropriate functionality without undue experimentation. The present invention is disclosed in the form of hardware within a microprocessor system. However, the invention may also be embodied in other forms of processor, such as a digital signal processor, a microcomputer or a mainframe computer. Similarly, the present invention is disclosed using inter-processor interrupts as a method of signaling between processors. However, the invention may also be implemented using other forms of signaling.
In one embodiment, the selection and initialization of healthy processors in a multiprocessor system begins with each processor checking the firmware modules required for its own operation. Each processor then determines its own processor health status. A deterministic method then selects a temporary master processor, which collects the health status of every processor and determines the group of processors that all share the highest available processor health status. The temporary master processor then enables the processors in the group to continue boot operations and halts or disables the processors that are not members of the group (including the temporary master processor itself, if necessary).
Referring now to FIG. 1, a diagram of system hardware components is shown, according to one embodiment. Several processors are shown, namely central processing units CPU A 110, CPU B 114, CPU C 118, and CPU D 122. In other embodiments, there may be only one processor, a pair of processors, or more than four processors. In one embodiment, the processors may be compatible with the Itanium™ family of processors. Processors such as CPU A 110, CPU B 114, CPU C 118, and CPU D 122 may each include one or more interrupt request registers (IRRs), such as IRRs 112, 116, 120, and 124 as shown. A typical interrupt sent to a processor, such as CPU A 110, may write a value to IRR 112, where IRR 112 may contain a vector generally describing the memory locations required to service the interrupt. CPU A 110 may enable or disable interrupt servicing. When interrupt servicing is disabled, IRR 112 may still receive the vector, but CPU A 110 may not automatically service the interrupt. However, CPU A 110 may still read the vector contained in IRR 112. Operating with interrupt servicing disabled in this manner is often referred to as a "polling mode". In addition, each processor contains a unique identifier called a LID. The LID serves as the unique address of the processor on the system bus, so an interrupt may be directed specifically to a processor with a known LID. The LID values may be stored in LID registers, such as LID registers 102, 104, 106, and 108 for CPU A 110, CPU B 114, CPU C 118, and CPU D 122, respectively. In other embodiments, the LID value may be stored in circuit elements other than registers.
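The polling mode described above can be illustrated with a small software model. This is only a sketch under stated assumptions: the class names `Cpu` and `Bus`, the list standing in for the IRR, and the numeric LID and vector values are all hypothetical and are not taken from the specification.

```python
class Cpu:
    """Toy processor with a LID and a list standing in for its IRR."""

    def __init__(self, lid):
        self.lid = lid
        self.interrupts_enabled = False  # after reset, servicing is disabled
        self.pending = []                # vectors latched in the "IRR"
        self.serviced = []               # vectors handled automatically

    def deliver(self, vector):
        # The vector is latched whether or not servicing is enabled.
        self.pending.append(vector)
        if self.interrupts_enabled:
            self.serviced.append(self.pending.pop(0))

    def poll(self):
        # Polling mode: software reads the latched vector explicitly.
        return self.pending.pop(0) if self.pending else None


class Bus:
    """Routes an inter-processor interrupt to the CPU with a given LID."""

    def __init__(self, cpus):
        self.by_lid = {cpu.lid: cpu for cpu in cpus}

    def send_ipi(self, target_lid, vector):
        self.by_lid[target_lid].deliver(vector)


cpu_a = Cpu(lid=0x10)   # hypothetical LID values
cpu_b = Cpu(lid=0x20)
bus = Bus([cpu_a, cpu_b])
bus.send_ipi(0x10, 0x41)  # hypothetical vector value
polled = cpu_a.poll()
```

In this model the vector reaches `cpu_a` even though interrupt servicing is disabled, and it is only consumed when the processor polls for it.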
The processors CPU A 110, CPU B 114, CPU C 118, and CPU D 122 may be connected to each other and to a chipset 134 via a system bus 130. The connection via system bus 130 and chipset 134 allows the processors to access system random access memory (RAM) 136, basic input/output system (BIOS) flash memory 138, and various input/output (I/O) devices, such as graphics controller 140 and various program storage devices. These program storage devices may include a system fixed disk 144 and a drive for removable media 146. In various embodiments, the drive for the removable media 146 may be a magnetic tape, a removable magnetic disk, a floppy disk, an electro-optical disk, or an optical disk such as a compact disk-read only memory (CD-ROM) or digital versatile disk-read only memory (DVD-ROM). I/O devices may be connected to the chipset 134 through a dedicated interface, such as an Advanced Graphics Port (AGP) 142, or a general purpose interface, such as a Peripheral Component Interconnect (PCI) bus (not shown), a Universal Serial Bus (USB) (not shown), or an Integrated Drive Electronics (IDE) bus 148. Other I/O devices may include connections to a local area network (LAN) 150 or a wide area network (WAN) 152. In other embodiments, many other interfaces may be used.
The computer system 100 may contain a hardware semaphore register somewhere in its architecture. A hardware semaphore register may be defined as a register that returns one value on the first read after a reset event and another value on reads after the first. In one embodiment, chipset 134 may include one specific example of a hardware semaphore register, a boot flag (BOFL) register 154. BOFL register 154 may be used during system initialization to determine which of CPU A 110, CPU B 114, CPU C 118, and CPU D 122 may act as a temporary master processor. In one embodiment, BOFL register 154 may return one value on the first read after a reset event and another value on subsequent reads. In another embodiment, each read of BOFL register 154 after a reset event returns a different number in a predetermined order. The first processor to read BOFL register 154 receives the value 0. Subsequent reads of BOFL register 154 return a non-zero value. The master processor is the processor that reads the value 0 from BOFL register 154.
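The read-once behavior of such a register can be sketched in software. The following is a minimal model, not the chipset's actual implementation: the class name `BootFlagRegister`, the use of a lock to stand in for bus-level atomicity, and the choice of returning an incrementing count are all assumptions made for illustration.

```python
import threading


class BootFlagRegister:
    """Toy model of a read-once hardware semaphore (BOFL-style) register."""

    def __init__(self):
        self._lock = threading.Lock()  # stands in for atomic bus access
        self._reads = 0

    def reset(self):
        """Model a reset event: the next read returns 0 again."""
        with self._lock:
            self._reads = 0

    def read(self):
        """Return 0 on the first read after reset, non-zero afterwards."""
        with self._lock:
            value = self._reads
            self._reads += 1
            return value


def is_master(bofl):
    # The processor that reads 0 becomes the temporary master.
    return bofl.read() == 0


bofl = BootFlagRegister()
roles = ["master" if is_master(bofl) else "slave" for _ in range(4)]
```

Because the model returns 0, 1, 2, and so on, it covers both described embodiments: a single distinguished first read, and a different number on each read in a predetermined order.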
In one embodiment, an operating system may be installed on system fixed disk 144 and the kernel of the operating system may be loaded into system RAM 136. In other embodiments, an operating system may be loaded or executed over the LAN 150 or WAN 152.
Referring now to FIG. 2, software components in memory are shown, according to one embodiment. In one embodiment, the BIOS components are shown residing in the BIOS flash memory 138 of FIG. 1, but in other embodiments the BIOS may reside in other forms of non-volatile memory, or in other forms of volatile memory. When software components reside in non-volatile memory, they may be referred to as firmware.
The BIOS may include modules that are generally processor-related, such as Processor Abstraction Layer (PAL) firmware, and modules that are generally not processor-related, such as System Abstraction Layer (SAL) firmware. Different processors may require different revisions or types of PAL firmware, due in part to differences in processor revision level. It may be advantageous if the version of PAL or SAL firmware within the system can be updated and the flash memory modified to hold the updated version.
However, significant problems may arise when attempting to update the BIOS in flash memory. As a simple example, if power is interrupted during an update, the flash memory may contain corrupted copies of the BIOS, including the portion of the BIOS that controls flash memory writes. When this happens, there is no repair other than soldering in a new flash memory containing the correct code. To reduce the frequency of these problems, in one embodiment, the PAL and SAL code may be partitioned. PAL code can be divided into the minimum portion required for system initialization, called PAL-A, and the rest of the code, which in one embodiment may be referred to as PAL-B. Moreover, PAL-A may be further subdivided into basic PAL-A that is processor independent (PAL-A generic) and PAL-A code specific to a given processor revision (PAL-A specific). Since PAL-A generic is processor independent, it does not need to be updated and thus may reside in an area of flash memory where updating is prohibited. Similarly, SAL can be divided into SAL-A and SAL-B, where SAL-A is the minimal portion of SAL required for system initialization or recovery, including flash update. SAL-A may be further subdivided into basic SAL-A that will not be updated in the future (SAL-A generic) and SAL-A that may need to be updated from time to time (SAL-A specific). To avoid corruption during updating, in one embodiment, PAL-A generic and SAL-A generic may be located in a protected portion of the flash memory that cannot be modified.
To increase system availability and reliability, PAL-A specific and SAL-A specific may have multiple copies. Consider a system that may include processors of two revision levels, labeled for convenience as type 1 processors and type 2 processors. In other embodiments, there may be processors at other revision levels. In the embodiment of FIG. 2, there may be one copy of PAL-A generic 220, but two copies of PAL-A specific for a type 1 processor (type 1 primary PAL-A specific 230 and type 1 secondary PAL-A specific 240). Similarly, for a type 2 processor, there may be two copies of PAL-A specific (type 2 primary PAL-A specific 232 and type 2 secondary PAL-A specific 242). There may also be one copy of SAL-A generic 222, and two copies of SAL-A specific, namely primary SAL-A specific 246 and secondary SAL-A specific 260. In other embodiments, there may be other firmware copies and types in the flash memory. In one embodiment, the copies may be exact copies, but in other embodiments, the copies may have similar functionality without being exact copies.
When a processor (e.g., a type 1 processor) begins execution after a reset event, the processor begins execution at a predetermined location (referred to as a reset vector) in PAL-A generic 220. The processor executing PAL-A generic 220 may use the primary firmware interface table (FIT) 224 or the secondary FIT 234 to discover the location of other code modules. The PAL-A generic 220 code knows the entry points of primary FIT 224 and secondary FIT 234 through vectors located at fixed locations, namely primary FIT pointer 210 and secondary FIT pointer 212. PAL-A generic 220 executing on the processor may use these FIT pointers to locate the FITs, and then use the FITs to locate and validate other software modules. For example, the type 1 processor may use the primary FIT pointer 210 to find the location of the primary FIT 224. The type 1 processor may then use the location, size, checksum, and other parameters within the primary FIT 224 to locate and check the type 1 primary PAL-A specific 230. If the type 1 processor is unable to locate or verify type 1 primary PAL-A specific 230, it may use the secondary FIT pointer 212 and the secondary FIT 234 to locate and check type 1 secondary PAL-A specific 240.
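The primary-then-secondary lookup described above can be sketched as follows. This is an illustrative model only: the `FitEntry` fields, the 8-bit additive checksum, the module label string, and the flash layout are assumptions, since the text does not give the FIT entry format or the checksum algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FitEntry:
    module_type: str  # illustrative label for the module
    offset: int       # location of the module in flash
    size: int
    checksum: int     # expected checksum (assumed 8-bit additive)


def checksum8(data: bytes) -> int:
    return sum(data) & 0xFF


def locate_and_check(flash: bytes, fit: list, module_type: str) -> Optional[bytes]:
    """Return the module body if it is found in the FIT and its checksum matches."""
    for entry in fit:
        if entry.module_type != module_type:
            continue
        end = entry.offset + entry.size
        if entry.offset < 0 or end > len(flash):
            return None  # entry points outside flash
        body = flash[entry.offset:end]
        return body if checksum8(body) == entry.checksum else None
    return None


def find_module(flash, primary_fit, secondary_fit, module_type):
    """Try the primary FIT first, then fall back to the secondary FIT."""
    body = locate_and_check(flash, primary_fit, module_type)
    if body is not None:
        return "primary", body
    body = locate_and_check(flash, secondary_fit, module_type)
    if body is not None:
        return "secondary", body
    return None, None


MODULE = "PAL-A specific (type 1)"
good = bytes([1, 2, 3, 4, 5, 6, 7, 8])
image = bytearray(64)
image[0:8] = bytes([1, 2, 3, 4, 5, 6, 7, 9])  # corrupted primary copy
image[16:24] = good                           # intact secondary copy
primary_fit = [FitEntry(MODULE, 0, 8, checksum8(good))]
secondary_fit = [FitEntry(MODULE, 16, 8, checksum8(good))]
which, body = find_module(bytes(image), primary_fit, secondary_fit, MODULE)
```

Here the primary copy fails its checksum, so the lookup falls back to the secondary copy, mirroring the fallback from FIT 224 to FIT 234.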
If the type 1 processor locates and verifies type 1 primary PAL-A specific 230 or type 1 secondary PAL-A specific 240, the type 1 processor may then attempt to locate and check SAL-A. PAL-A generic 220 locates the entry point of either type 1 primary PAL-A specific 230 or type 1 secondary PAL-A specific 240 and begins execution there. Then, the type 1 primary PAL-A specific 230 or type 1 secondary PAL-A specific 240 locates SAL-A generic 222 and passes control to it, and SAL-A generic 222 then authenticates itself with either primary SAL-A specific 246 or secondary SAL-A specific 260. In one embodiment, the type 1 processor uses the primary FIT pointer 210 and the primary FIT 224 to locate and check the primary SAL-A specific 246. If the type 1 processor is unable to locate and verify primary SAL-A specific 246, the type 1 processor may use the secondary FIT pointer 212 and the secondary FIT 234 to locate and check secondary SAL-A specific 260.
After locating and verifying the portions of PAL and SAL required for initialization or recovery, SAL-A generic 222 executing on a processor may determine a processor health status associated with that processor. The health status may be computed by SAL-A generic 222 from various firmware validity checks, including checksums, and from a handoff status code provided by PAL-A generic when it hands off control to SAL-A generic 222. After determining which combination of firmware components has been validated, the processor health status may be ranked. In one embodiment, the processor health status may be determined to be highest if copies of primary PAL-A specific and primary SAL-A specific are found and validated. The processor health status may be determined to be slightly lower if copies of secondary PAL-A specific and secondary SAL-A specific are found and validated. An even lower processor health status may be determined if only copies of primary PAL-A specific and secondary SAL-A specific, or of secondary PAL-A specific and primary SAL-A specific, are found and validated. Finally, if no copies of PAL-A specific or SAL-A specific are found and validated, a minimum or "fatal" processor health status may be determined.
Referring now to FIG. 3, an inter-component messaging diagram is shown, according to one embodiment of the present disclosure. In one embodiment, each of the messages may be carried in an inter-processor interrupt (IPI). After a reset event that initiates a self-test of the processor, the processor may disable interrupts. Sending an IPI to a processor may still write a vector into the processor's IRR when the processor has disabled interrupts, that is, when the processor is in the polling mode. In this case, the vector placed in the processor's IRR may represent the message sender's LID, a relative health value, or other data. In other embodiments, other ways of carrying the message may be used, such as a dedicated inter-processor signal, or multiplexing multiple dedicated signals over a data bus. In the embodiment of FIG. 3, CPU A 302, CPU B 304, and CPU C 306 are shown, but in other embodiments more or fewer processors may participate in the process. After the reset event, each of the three processors (CPU A 302, CPU B 304, and CPU C 306) performs a self-test. In one embodiment, the self-test may include determining the processor health status, as described above in connection with FIG. 2. After each processor determines its own processor health status, it may be desirable to allow only those processors with the best available processor health status to continue with the boot operation. In other embodiments, the performance requirements may be such that it is desirable to allow the largest group of processors with an acceptable processor health status to continue boot operations.
In the embodiment of FIG. 3, all three processors have determined a non-fatal processor health status. Each processor first assumes that it is the master processor and assigns itself the master LID. This step is required to ensure that check-in vectors are not lost. Each processor then reads BOFL register 310 of chipset 308. The first processor to determine its processor health status (CPU B 304 in this embodiment) makes the first BOFL register read 312 of BOFL register 310 after the reset event. Thus, CPU B 304 becomes the master processor and continues to use the master LID as its identifier for inter-processor communication. In this embodiment, CPU A 302 is the second processor to determine its processor health status and makes the second BOFL register read 314. Thus, CPU A 302 becomes a slave processor and uses a unique non-master LID (slave 1 LID) as its identifier for inter-processor communication. Finally, in this embodiment, CPU C 306 is the third processor to determine its processor health status and makes the third BOFL register read 316. Thus, CPU C 306 becomes a slave processor and uses a unique non-master LID (slave 2 LID) as its identifier for inter-processor communication.
When a processor determines that it is a slave, it computes a unique slave LID and sends a check-in message, carrying its own LID, to the processor using the predetermined master LID. In one embodiment, a unique slave LID may be calculated using a geographically unique identifier passed from the PAL to the SAL. In one embodiment, the PAL may determine these identifiers from values read from one or more pins on the physical processor package. In the embodiment of FIG. 3, CPU A 302 and CPU C 306 send their check-in messages 320, 322, respectively, to CPU B 304. In one embodiment, CPU B 304 may respond immediately to each check-in message by issuing a corresponding health request message to the processor that sent it. In other embodiments, CPU B 304 may wait a predetermined period of time to receive all check-in messages before responding with the health request messages. For the embodiment of FIG. 3, CPU B 304 sends health request messages 330, 332 to CPU A 302 and CPU C 306, respectively. CPU A 302 and CPU C 306 then send copies of their processor health status to CPU B 304 in the form of health response messages 340, 342, respectively. In other embodiments, the actual health status may be replaced with a vector having a predetermined relationship to a particular health status value.
Once the processor with the master LID (CPU B 304 in this embodiment) receives the processor health status of all responding processors, it may determine the highest-ranked available processor health status and the group of processors that share it. The common processor health status of this group may be referred to as the group health status. In other embodiments, performance may be an issue and the determined processor group may be the group having the greatest number of processors with acceptable processor health. In both embodiments, the processor with the master LID (CPU B 304 in this embodiment) then sends a release semaphore message to all slave processors and to itself. The release semaphore message may include a copy of the group health status. In other embodiments, the actual group health status may be replaced with a vector having a predetermined relationship to the group health status.
In the embodiment of FIG. 3, CPU B 304 sends release semaphore messages 350, 352, and 354 to CPU A 302, CPU C 306, and itself, respectively. Each processor then compares the group health status carried by its release semaphore message with its own processor health status. If there is a match, the processor continues with the boot operation. However, if there is no match, the processor halts or becomes inactive and does not perform the boot operation.
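The two ways of choosing a group health status, and the comparison each processor makes on receiving the release semaphore message, can be sketched as below. The numeric health values (higher is healthier), the tie-break rule in `largest_acceptable`, and the function names are assumptions for illustration; the text does not define an encoding.

```python
from collections import Counter


def best_available(statuses):
    """Group health = the highest health status reported by any processor."""
    return max(statuses.values())


def largest_acceptable(statuses, minimum):
    """Group health = the acceptable status shared by the most processors.

    Ties are broken in favor of the healthier status (an assumption).
    """
    counts = Counter(s for s in statuses.values() if s >= minimum)
    if not counts:
        return None
    return max(counts, key=lambda s: (counts[s], s))


def decisions(statuses, group):
    """Each processor boots only if its own status matches the group status."""
    return {lid: ("boot" if s == group else "halt") for lid, s in statuses.items()}


# Hypothetical statuses collected by the master (CPU B in FIG. 3).
statuses = {"A": 4, "B": 3, "C": 3}

by_best = decisions(statuses, best_available(statuses))
by_size = decisions(statuses, largest_acceptable(statuses, minimum=2))
```

With these values, the first policy lets only A boot, so the master B halts itself. The second policy lets B and C boot and halts A.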
Referring now to FIG. 4, a flowchart for deriving local processor health is shown, according to one embodiment of the present disclosure. In other embodiments, other firmware tests, hardware tests, or some combination thereof may be performed to obtain another form of local processor health. When the process of FIG. 4 begins at block 410, the PAL gains control immediately after a reset event and, at block 412, computes a PAL handoff status, which is stored in a register for use by the SAL. The PAL then transfers control to the SAL. The PAL may provide this handoff status upon entry to a recovery check, which may include determining whether the PAL is compatible with the current processor or whether the processor is fully operational. The SAL then checks the previously stored PAL handoff status at block 414. The PAL handoff status is used along with other tests performed by the SAL to compute a composite local health. The PAL handoff status may convey information about a variety of possible errors. In one embodiment, the possible errors may be associated with a set of four status classes: normal operation with the primary firmware copy; failover operation with the secondary firmware copy; failures in non-redundant or non-critical firmware components; and fatal failures.
Then, at block 418, a boundary check of the primary FIT pointer and the secondary FIT pointer may be performed. This may be necessary to prevent accidental accesses to protected or reserved areas of the memory address space, which may result in a system hang. At block 422, a checksum test may be performed on the primary FIT and the secondary FIT. The results of these checksum tests may be used to prevent execution of corrupted code or searches of corrupted flash tables. At block 426, the primary FIT and the secondary FIT may then be checked to determine whether there is a corresponding primary SAL-A specific and secondary SAL-A specific, respectively. This test may ensure that appropriate firmware exists to support any necessary SAL-A tests. Then, at block 430, a checksum test may be performed on each copy of SAL-A specific detected at block 426. The results of these checksum tests may, in turn, be used to prevent execution of corrupted code or searches of corrupted flash tables.
At block 434, the results of the previous blocks may be used to form a composite local processor health status. In one embodiment, five levels of processor health status may be derived. In other embodiments, other levels of processor health may be derived. If a valid primary PAL-A specific and a valid primary SAL-A specific are found, the best processor health status may be determined. If only a valid secondary PAL-A specific and a valid secondary SAL-A specific are found, the second-best processor health status may be determined. The third-best processor health status may be determined if only a valid primary PAL-A specific and a valid secondary SAL-A specific are found. The fourth-best processor health status may be determined if only a valid secondary PAL-A specific and a valid primary SAL-A specific are found. Finally, if no valid combination of PAL-A and SAL-A is found, the worst processor health status may be determined, and this status may also be determined for other fatal error conditions.
Referring now to FIG. 5, a flow diagram of the selection and initialization of healthy processors is shown, according to one embodiment of the present disclosure. Each processor within the system may perform the process of FIG. 5. The process begins at block 510 after a reset event. After determining the local processor health status, the processor assigns itself the master LID value to ensure that subsequent check-in messages are not lost. The processor then reads the BOFL register at block 514. At decision block 518, the processor determines whether it has become the master processor based on the value read from the BOFL register. If so, the processor exits decision block 518 via the YES path and starts a check-in timeout period. At decision block 522, the processor determines whether the timeout period has ended. If not, the processor exits decision block 522 via the NO path and, at block 526, receives any check-in messages that are present. The processor determines the LID corresponding to the sender of each check-in message. At block 530, the processor responds to any check-in messages found at block 526 by sending a health request message to the corresponding slave processor. The processor then returns to decision block 522. When the timeout period has expired, the processor exits decision block 522 via the YES path. At block 534, the processor determines the group health status and sends a message containing the group health status to all LIDs identified from the received check-in messages. The processor then determines whether the group health status matches its own processor health status at decision block 538. If so, the process exits decision block 538 via the YES path and the processor continues with boot operations at block 540. If, however, there is no match, the process exits decision block 538 via the NO path and the processor is halted or otherwise becomes inactive at block 544.
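The ordering of the checks matters: a FIT pointer is bounds-checked before it is dereferenced, and the table it points to is checksummed before it is searched. The following sketch shows that staging for a single FIT. The `Region` type, the 8-bit additive checksum, and the addresses used are hypothetical; the text does not specify them.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int
    end: int  # exclusive


def in_bounds(ptr: int, size: int, allowed: Region) -> bool:
    """True if [ptr, ptr + size) lies entirely inside the allowed region."""
    return size > 0 and allowed.start <= ptr and ptr + size <= allowed.end


def checksum8(data: bytes) -> int:
    return sum(data) & 0xFF


def check_fit(flash: bytes, fit_ptr: int, fit_size: int,
              expected_sum: int, allowed: Region) -> str:
    """Return "ok" or the name of the first stage that failed."""
    # Stage 1: never touch memory through an unchecked pointer.
    if not in_bounds(fit_ptr, fit_size, allowed):
        return "pointer out of bounds"
    # Stage 2: only now read the table and verify it before searching it.
    if checksum8(flash[fit_ptr:fit_ptr + fit_size]) != expected_sum:
        return "checksum mismatch"
    return "ok"


flash = bytes(range(32))
allowed = Region(8, 32)          # bytes 0..7 model a reserved area
good_sum = checksum8(flash[8:12])

result_ok = check_fit(flash, 8, 4, good_sum, allowed)
result_bounds = check_fit(flash, 0, 4, good_sum, allowed)
result_sum = check_fit(flash, 8, 4, 0, allowed)
```

Returning the name of the failed stage, rather than a bare boolean, makes it easy to feed the outcome into a composite health computation.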
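The five-level ranking described above maps naturally onto an ordered enumeration. In this sketch the numeric values and member names are assumptions chosen so that a larger value means better health; only the relative order is taken from the text.

```python
from enum import IntEnum


class Health(IntEnum):
    FATAL = 0                      # no valid PAL-A/SAL-A combination
    SECONDARY_PAL_PRIMARY_SAL = 1  # fourth best
    PRIMARY_PAL_SECONDARY_SAL = 2  # third best
    SECONDARY_BOTH = 3             # second best
    PRIMARY_BOTH = 4               # best


def local_health(pal_primary_ok, pal_secondary_ok,
                 sal_primary_ok, sal_secondary_ok, fatal_error=False):
    """Combine per-copy validation results into one composite status."""
    if fatal_error:
        return Health.FATAL
    if pal_primary_ok and sal_primary_ok:
        return Health.PRIMARY_BOTH
    if pal_secondary_ok and sal_secondary_ok:
        return Health.SECONDARY_BOTH
    if pal_primary_ok and sal_secondary_ok:
        return Health.PRIMARY_PAL_SECONDARY_SAL
    if pal_secondary_ok and sal_primary_ok:
        return Health.SECONDARY_PAL_PRIMARY_SAL
    return Health.FATAL


all_valid = local_health(True, True, True, True)
mixed = local_health(True, False, False, True)
```

Because `IntEnum` members compare as integers, a master processor could pick the best reported status with a plain `max()`.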
However, if at decision block 518 the processor determines that it is a slave processor, the processor exits decision block 518 via the NO path. The processor then assigns itself a unique slave LID. At block 550, the processor sends a check-in message indicating its LID value to the processor with the master LID. The processor then waits for and receives a corresponding health request message at block 554. At block 558, the processor sends its own processor health status in a health response message. At block 560, the processor waits for and receives a release semaphore message. The processor then determines whether the group health status matches its processor health status at decision block 562. If so, the process exits decision block 562 via the YES path, and the processor continues with boot operations at block 566. If, however, there is no match, the process exits decision block 562 via the NO path and the processor is shut down or otherwise becomes inactive at block 544.
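The slave-side sequence can be viewed as a three-state machine. The following is a minimal sketch, assuming messages are modeled as plain tuples and that the release message carries the group health status; the class name `Slave`, the state names, and the message layout are hypothetical and do not represent the inter-processor interrupt encoding.

```python
class Slave:
    """Toy state machine for one slave processor."""

    def __init__(self, lid, health):
        self.lid = lid            # unique slave LID (assumed to be an integer)
        self.health = health      # local processor health status
        self.state = "CHECK_IN"

    def step(self, msg=None):
        """Advance one step; return the outgoing message, or None."""
        if self.state == "CHECK_IN":
            # Announce this LID to the master.
            self.state = "WAIT_HEALTH_REQUEST"
            return ("check_in", self.lid)
        if self.state == "WAIT_HEALTH_REQUEST" and msg == ("health_request",):
            # Answer the master's request with the local health status.
            self.state = "WAIT_RELEASE"
            return ("health_response", self.lid, self.health)
        if self.state == "WAIT_RELEASE" and msg is not None and msg[0] == "release":
            # Continue booting only if the group status matches the local one.
            group_health = msg[1]
            self.state = "BOOT" if group_health == self.health else "HALT"
        return None
```

A slave created as `Slave(3, 2)` that later receives `("release", 1)` ends in the `HALT` state, while the same slave receiving `("release", 2)` ends in `BOOT`.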
In the foregoing specification, the invention has been described with reference to specific exemplary embodiments thereof. It will, however, be evident that various modifications and changes may be made thereto without departing from the broader spirit and scope as set forth in the appended claims. The specification and drawings are, accordingly, to be regarded in an illustrative rather than a restrictive sense.

Claims (33)

1. A system, comprising:
a first processor to determine a first processor health status;
a second processor coupled to the first processor for determining a second processor health status; and
a hardware semaphore register coupled to the first processor and the second processor, wherein either or both of the first processor and the second processor are operable to attempt a boot process, and the first processor and the second processor continue system boot operations when the first processor health status and the second processor health status both correspond to a group health status.
2. The system of claim 1, wherein the first processor utilizes the first processor health status and the second processor health status to determine the group health status when the first processor reads the hardware semaphore register before the second processor.
3. The system of claim 2, wherein the first processor sends a release message to the second processor that includes the group health status.
4. The system of claim 3, wherein the second processor continues boot operations if the group health status corresponds to the second processor health status.
5. The system of claim 3, wherein the first processor reads a first value from the hardware semaphore register and the second processor reads a second value from the hardware semaphore register.
6. The system of claim 5, wherein the first processor comprises a first processor interrupt request register and the second processor comprises a second processor interrupt request register, wherein the second processor sends the second processor health status to the first processor interrupt request register.
7. The system of claim 6, wherein the first processor sends the group health status to the second processor interrupt request register.
8. A method, comprising:
determining a first processor health status;
determining a second processor health status;
sending the second processor health status to the first processor;
determining a group health status from the first processor health status and the second processor health status;
enabling the first processor to continue boot operations when the group health status corresponds to the first processor health status; and
enabling the second processor to continue boot operations when the group health status corresponds to the second processor health status.
9. The method of claim 8, wherein enabling the second processor comprises sending the group health status to the second processor.
10. The method of claim 9, wherein said step of sending said second processor health status is performed in response to a health status request.
11. The method of claim 10, further comprising causing the first processor to read a hardware semaphore register before the second processor reads the hardware semaphore register.
12. The method of claim 11, wherein the step of the first processor reading a hardware semaphore register comprises receiving a first value.
13. The method of claim 8, wherein the step of sending the second processor health status to a first processor comprises sending an inter-processor interrupt to the first processor.
14. The method of claim 13, wherein said step of sending an inter-processor interrupt to said first processor comprises sending said second processor health status to a first processor interrupt request register when said first processor has disabled interrupts.
15. The method of claim 14, wherein the step of determining the group health status comprises obtaining the second processor health status from the first processor interrupt request register.
16. The method of claim 15, further comprising enabling the second processor to continue boot operations by sending a second processor release message to the second processor.
17. The method of claim 16, wherein enabling the second processor comprises enabling the second processor when the second processor release message includes a group health status that matches the second processor health status.
18. A method, comprising:
determining a first processor health status;
determining a second processor health status;
sending the second processor health status to the first processor;
reading a hardware semaphore register by the first processor before reading the hardware semaphore register by the second processor;
determining a group health status from the first processor health status and the second processor health status;
enabling the second processor to continue boot operations by sending the group health status to the second processor in response to a health status request when the group health status corresponds to the second processor health status,
wherein the step of determining the first processor health status comprises checking a first firmware interface table and a second firmware interface table using a general purpose processor abstraction layer.
19. The method of claim 18, wherein said utilizing a general purpose processor abstraction layer includes examining a first copy of a first processor specific processor abstraction layer and a second copy of the first processor specific processor abstraction layer.
20. The method of claim 18, wherein the step of determining the first processor health status comprises determining whether the first copy of the first processor-specific processor abstraction layer has a first copy of an associated system abstraction layer, and further comprising determining whether the second copy of the first processor-specific processor abstraction layer has a second copy of an associated system abstraction layer.
21. An apparatus, comprising:
means for determining a first processor health status;
means for determining a second processor health status;
means for sending the second processor health status to the first processor;
means for determining a group health status based on the first processor health status and the second processor health status;
means for enabling the first processor to continue boot operations when the group health status corresponds to the first processor health status; and
means for enabling the second processor to continue boot operations when the group health status corresponds to the second processor health status.
22. The apparatus of claim 21, wherein the means for enabling the second processor comprises means for sending the group health status to the second processor.
23. The apparatus of claim 22 wherein the means for sending the second processor health status is responsive to a health status request.
24. The apparatus of claim 23, further comprising means for causing the first processor to read a hardware semaphore register before the second processor reads the hardware semaphore register.
25. The apparatus of claim 24, wherein the means for causing the first processor to read a hardware semaphore register comprises means for receiving a first value.
26. An apparatus, comprising:
means for determining a first processor health status;
means for determining a second processor health status;
means for sending the second processor health status to the first processor;
means for reading a hardware semaphore register by the first processor before the hardware semaphore register is read by the second processor;
means for determining a group health status based on the first processor health status and the second processor health status;
means for enabling the second processor to continue boot operations by sending the group health status to the second processor in response to a health status request when the group health status corresponds to the second processor health status,
wherein the means for determining the first processor health status comprises means for checking a first firmware interface table and a second firmware interface table using a general purpose processor abstraction layer.
27. The apparatus of claim 26, wherein the means for utilizing a general purpose processor abstraction layer comprises means for examining a first copy of a first processor-specific processor abstraction layer and a second copy of the first processor-specific processor abstraction layer.
28. The apparatus of claim 27, wherein the means for determining the first processor health status comprises means for determining whether the first copy of the first processor-specific processor abstraction layer has a first copy of an associated system abstraction layer, and further comprises means for determining whether the second copy of the first processor-specific processor abstraction layer has a second copy of an associated system abstraction layer.
29. The apparatus of claim 28, wherein the means for sending the second processor health status to a first processor comprises means for sending an inter-processor interrupt to the first processor.
30. The apparatus of claim 29, wherein the means for sending an inter-processor interrupt to the first processor comprises means for sending the second processor health status to a first processor interrupt request register when the first processor has disabled interrupts.
31. The apparatus of claim 30, wherein the means for determining the group health status comprises means for obtaining the second processor health status from the first processor interrupt request register.
32. The apparatus of claim 31, further comprising means for enabling the second processor to continue boot operations by sending a second processor release message to the second processor.
33. The apparatus of claim 32, wherein the means for enabling the second processor comprises means for enabling the second processor when the second processor release message includes a group health status that matches the second processor health status.
HK07104384.9A 2002-06-11 2003-05-09 System and method to determine a healthy group of processors and associated fireware for booting a system HK1096744B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US10/171,210 2002-06-11
US10/171,210 US7350063B2 (en) 2002-06-11 2002-06-11 System and method to filter processors by health during early firmware for split recovery architecture
PCT/US2003/014877 WO2003104994A2 (en) 2002-06-11 2003-05-09 System and method to filter processors by health during early firmware for split recovery architecture

Publications (2)

Publication Number Publication Date
HK1096744A1 (en) 2007-06-08
HK1096744B (en) 2009-12-24
