CN114706739A - Fault recording and locating method, device and server

Info

Publication number: CN114706739A
Application number: CN202210312149.8A
Authority: CN
Inventors: 林震华
Original assignee: New H3C Information Technologies Co Ltd
Current assignee: New H3C Information Technologies Co Ltd
Priority date: 2022-03-28
Filing date: 2022-03-28
Publication date: 2022-07-05

Abstract

This specification provides a fault recording and locating method, device, and server, and relates to the technical field of communications. The fault recording method is applied to a BIOS chip in a server and comprises: when it is determined that an uncorrectable fault occurs in the OS running in the server, acquiring a first fault parameter in the server that is associated with the uncorrectable fault; recording the first fault parameter to a storage medium, wherein the storage medium is connected to the BIOS chip and to the BMC through a first bus; and after the OS restarts, sending the first fault parameter recorded in the storage medium to the BMC. This method improves the efficiency of fault location in the server.

Description

Fault recording and locating method, device and server

Technical Field

This specification relates to the field of communication technologies, and in particular to a fault recording and locating method, device, and server.

Background

With the development of network technology, servers that carry network services and data storage are used more and more widely. A data center deploys a large number of servers, and in such an environment, enabling staff to operate and maintain the servers efficiently and conveniently is a problem that those skilled in the art urgently need to solve.

In existing servers, in-band and out-of-band management modes are usually adopted so that staff can view the server's operating data. In-band management is divided into two parts, the BIOS (Basic Input Output System) and the OS (Operating System); out-of-band management is implemented by the BMC (Baseboard Management Controller). When the server works normally, the OS sends the server's operating information to the BMC through the BIOS for recording. When the OS hangs because of an exception, the OS cannot send the operating information at the time of the fault (also called fault information) to the BIOS, so the BMC cannot obtain it either. This makes it difficult for staff to determine the cause of the fault and reduces the efficiency of server maintenance.

Summary of the Invention

To overcome the problems in the related art, this specification provides a fault recording and locating method, device, and server.

According to a first aspect of the embodiments of this specification, a fault recording method is provided, applied to a BIOS chip in a server, including:

when it is determined that an uncorrectable fault occurs in the OS running in the server, acquiring a first fault parameter in the server that is associated with the uncorrectable fault;

recording the first fault parameter to a storage medium, wherein the storage medium is connected to the BIOS chip and to the BMC through a first bus;

after the OS restarts, sending the first fault parameter recorded in the storage medium to the BMC.

Optionally, the method further includes:

when it is determined that a correctable fault occurs in the OS running in the server, acquiring a second fault parameter in the server that is associated with the correctable fault;

sending the second fault parameter to the BMC through a second bus.

Optionally, the method further includes:

when the server starts, detecting the various devices in the server and recording startup parameters;

sending the second fault parameter to the BMC through the second bus; or,

when the server works normally, acquiring and recording normal parameters of the various devices in the server at a preset time node;

sending the normal parameters to the BMC through the second bus.

Optionally, the second bus is a VGA bus.

Optionally, the first bus is an I2C (Inter-Integrated Circuit) bus.

According to a second aspect of the embodiments of this specification, a fault locating method is provided, applied to the BMC in a server, including:

acquiring and recording, through the first bus, a first fault parameter in the server, wherein the first fault parameter is an operating parameter that is written, when an uncorrectable fault occurs in the OS running in the server, to a storage medium connected to both the BMC and the BIOS chip in the server;

determining the faulty device in the server according to the first fault parameter and comparison parameters.

Optionally, the comparison parameters include startup parameters and/or normal parameters, wherein the startup parameters are operating parameters recorded when the server starts, and the normal parameters are operating parameters of the various devices in the server that are acquired and recorded at a preset time node while the server works normally;

the method further includes:

acquiring and recording, through the second bus, the startup parameters and/or normal parameters in the server;

determining the faulty device in the server according to the first fault parameter and the comparison parameters is specifically:

comparing the first fault parameter with the startup parameters and/or normal parameters to determine the faulty device in the server.

Optionally, the method further includes:

acquiring and recording, through the second bus, a second fault parameter in the server, wherein the second fault parameter is an operating parameter recorded when it is determined that a correctable fault occurs in the OS running in the server;

comparing the normal parameters with the second fault parameter to determine the faulty device in the server.

According to a third aspect of the embodiments of this specification, a fault recording device is provided, applied to a BIOS chip in a server, including:

an acquiring unit, configured to acquire, when it is determined that an uncorrectable fault occurs in the OS running in the server, a first fault parameter in the server that is associated with the uncorrectable fault;

a recording unit, configured to record the first fault parameter to a storage medium, wherein the storage medium is connected to the BIOS chip and to the BMC through a first bus;

a sending unit, configured to send the first fault parameter recorded in the storage medium to the BMC after the OS restarts.

According to a fourth aspect of the embodiments of this specification, a fault locating device is provided, applied to the BMC in a server, including:

a recording unit, configured to acquire and record, through the first bus, a first fault parameter in the server, wherein the first fault parameter is an operating parameter that is written, when an uncorrectable fault occurs in the OS running in the server, to a storage medium connected to both the BMC and the BIOS chip in the server;

a locating unit, configured to determine the faulty device in the server according to the first fault parameter and comparison parameters.

The recording unit is further configured to acquire and record, through the second bus, the startup parameters and/or normal parameters in the server;

the locating unit is specifically configured to compare the first fault parameter with the startup parameters and/or normal parameters to determine the faulty device in the server.

Optionally, the recording unit is further configured to acquire and record, through the second bus, a second fault parameter in the server, wherein the second fault parameter is an operating parameter recorded when it is determined that a correctable fault occurs in the OS running in the server;

the locating unit is further configured to compare the normal parameters with the second fault parameter to determine the faulty device in the server.

According to a fifth aspect of the embodiments of this specification, a server is provided, including a processor, a BIOS chip, a BMC, and a storage medium;

the storage medium is connected to the BIOS chip and to the BMC through a first bus, and the BIOS chip is connected to the processor and to the BMC through a second bus;

when it is determined that an uncorrectable fault occurs in the OS run by the processor in the server, the BIOS chip acquires a first fault parameter in the server that is associated with the fault;

the BIOS chip records the first fault parameter to the storage medium;

after the OS restarts, the BIOS chip sends the first fault parameter recorded in the storage medium to the BMC;

the BMC locates the uncorrectable fault that occurred in the server according to the first fault parameter and the comparison parameters already stored in the BMC.

The technical solutions provided by the embodiments of this specification may have the following beneficial effects:

In the embodiments of this specification, a shared storage medium is provided between the BIOS chip and the BMC. When the BIOS chip determines, based on hardware fault detection, that an uncorrectable fault occurs in the OS running in the server, it reads the relevant fault parameters from the server and writes them to the storage medium. After the OS restarts, the fault parameters saved in the storage medium are transferred to the BMC. This prevents the fault parameters that reflect the cause of the fault from being lost when the OS suffers an uncorrectable fault, which would otherwise make the fault hard to locate, and thus improves the efficiency of server fault location.

It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only and do not limit this specification.

Description of Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with this specification and, together with the description, serve to explain its principles.

FIG. 1 is a flowchart of a fault recording method according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a server according to an embodiment of the present application;

FIG. 3 is a flowchart of a fault locating method according to an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a fault recording device according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a fault locating device according to an embodiment of the present application.

Detailed Description

Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. Where the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of devices and methods consistent with some aspects of this specification as detailed in the appended claims.

The present application provides a fault recording method, applied to a BIOS chip in a server. As shown in FIG. 1, the method includes:

S100. When it is determined that an uncorrectable fault occurs in the OS running in the server, acquire a first fault parameter in the server that is associated with the uncorrectable fault.

As shown in FIG. 2, the server includes a processor, a BIOS chip, a BMC, and a storage medium, and the OS runs on the processor. The BIOS chip can be set to a firmware-first mode: after the processor detects an SMI (Serial Management Interface) interrupt, it coordinates the BIOS chip to collect operating parameters. A section of storage space can be reserved in the BIOS chip to temporarily hold the collected operating parameters, or a section of storage space can be set aside in memory for the same purpose.

The storage medium is connected to the BIOS chip and to the BMC through the first bus, and the BIOS chip is connected to the processor and to the BMC through the second bus. The first bus may be any one of the buses that comply with the IPMI standard, such as an I2C (Inter-Integrated Circuit) bus, an LPC (Low Pin Count) bus, or an eSPI (Enhanced Serial Peripheral Interface) bus, chosen according to actual needs. Specifically, the storage medium is a non-volatile storage medium, so its data is not lost after power-off.

Both correctable and uncorrectable faults may occur while the OS runs. In the case of an uncorrectable fault, the OS hangs and cannot continue to run, and the processor can no longer obtain the operating parameters after the fault; these operating parameters are referred to below as the first fault parameter.

It should be noted that the content of the first fault parameter can be selected according to the information reported by the SMI interrupt. For example, if the interrupt reports a processor fault, a memory fault, or a PCIE (Peripheral Component Interconnect Express) fault, the BIOS chip reads operating parameters from the registers corresponding to that device and uses them as the first fault parameter.
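The selection described above can be sketched as a lookup from the reported fault source to the registers worth reading. This is only an illustration: the register names, the `REGISTER_BANKS` table, and the `read_register` callback are hypothetical and do not come from the specification or from any real firmware interface.

```python
# Hypothetical mapping from the fault source reported by the interrupt to the
# registers the BIOS chip would read. All names here are illustrative.
REGISTER_BANKS = {
    "processor": ["cpu_error_status", "cpu_error_address"],
    "memory": ["mem_error_status", "mem_error_address"],
    "pcie": ["pcie_uncorrectable_status", "pcie_header_log"],
}


def collect_first_fault_parameter(source, read_register):
    """Read only the registers that belong to the device named by the interrupt.

    `read_register` is an assumed callback that returns the value of one
    register given its name.
    """
    names = REGISTER_BANKS.get(source, [])
    return {
        "source": source,
        "registers": {name: read_register(name) for name in names},
    }
```

Restricting the read to the reported device keeps the record small, which matters if the first bus is a low-bandwidth bus such as I2C.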

S101. Record the first fault parameter to the storage medium.

When an uncorrectable fault occurs in the OS, the OS is reloaded, and the operating parameters at the time of the fault would be lost, making fault location difficult. Therefore, after acquiring the first fault parameter, the BIOS chip can transfer it to the storage medium through the first bus, so that the operating parameters are not lost when the OS is reloaded.

S102. After the OS restarts, send the first fault parameter recorded in the storage medium to the BMC.

After it is determined that the OS has hung, a mechanism in the server, such as a watchdog, restarts the OS and restores its operation.

After the BIOS chip detects that the OS has restarted, it can check whether data is stored in the storage medium. If so, it can be concluded that an uncorrectable fault occurred before the OS restarted; the BIOS chip then outputs the first fault parameter stored in the storage medium to the BMC, or notifies the BMC to issue a read request to the storage medium so that the storage medium returns the first fault parameter to the BMC. If not, the OS restart can be considered normal and no handling is needed.
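Steps S101 and S102 can be modelled in a few lines. The sketch below is a simulation under stated assumptions, not firmware: `SharedStorage`, `Bmc`, and both functions are hypothetical names, and clearing the storage medium after forwarding is an assumption (the specification only says that an empty medium means a normal restart).

```python
class SharedStorage:
    """Stand-in for the non-volatile storage medium on the first bus."""

    def __init__(self):
        self._record = None

    def write(self, record):
        self._record = dict(record)

    def read(self):
        return self._record

    def clear(self):
        self._record = None


class Bmc:
    """Stand-in for the BMC; it only keeps a log of received fault records."""

    def __init__(self):
        self.fault_log = []

    def record_fault(self, record):
        self.fault_log.append(record)


def on_uncorrectable_fault(storage, first_fault_parameter):
    # S101: persist the parameter before the OS is reloaded.
    storage.write(first_fault_parameter)


def on_os_restart(storage, bmc):
    # S102: an empty medium means the restart was normal.
    record = storage.read()
    if record is None:
        return False
    bmc.record_fault(record)
    # Assumption: clear the medium so the next normal restart is not
    # mistaken for a fault.
    storage.clear()
    return True
```

For example, calling `on_os_restart` on a fresh `SharedStorage` returns `False`, while calling it after `on_uncorrectable_fault` forwards the record to the BMC exactly once.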

In addition, for correctable faults of the OS run by the processor in the server, the method further includes:

S103. When it is determined that a correctable fault occurs in the OS running in the server, acquire a second fault parameter in the server that is associated with the correctable fault.

S104. Send the second fault parameter to the BMC through the second bus.

Because the OS keeps running, the processor can record the operating parameters of these faults. Therefore, if the BIOS chip determines that the fault in the OS is a correctable fault, it can either actively read the second fault parameter from the processor or wait for the processor to send it; the choice depends on actual needs and is not limited here.

The second bus may be a VGA (Video Graphics Array) bus, a local bus, or the like. Specifically, because a VGA bus offers more transmission capacity, more operating parameters can be sent to the BMC, enabling more accurate fault location.
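The two paths (uncorrectable faults persisted over the first bus, correctable faults sent directly over the second bus) amount to a dispatch on fault severity. The following is a minimal sketch; `route_fault`, the severity strings, and the two lists standing in for the storage medium and the BMC are all hypothetical.

```python
def route_fault(severity, parameter, storage, bmc_inbox):
    """Dispatch a fault parameter according to the severity of the fault.

    `storage` models the shared storage medium (first bus) and `bmc_inbox`
    models data delivered to the BMC over the second bus. Both are plain
    lists in this sketch.
    """
    if severity == "uncorrectable":
        # The OS is about to be reloaded: persist now, forward after restart.
        storage.append(parameter)
        return "stored"
    if severity == "correctable":
        # The OS keeps running: send to the BMC immediately.
        bmc_inbox.append(parameter)
        return "sent"
    raise ValueError(f"unknown severity: {severity!r}")
```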

To obtain operating parameters over the full life cycle of the server, optionally, the method further includes:

S105A. When the server starts, detect the various devices in the server and record startup parameters.

S106A. Send the second fault parameter to the BMC through the second bus; or,

S105B. When the server works normally, acquire and record normal parameters of the various devices in the server at a preset time node;

S106B. Send the normal parameters to the BMC through the second bus.

When the server starts, the BIOS chip can run a self-test on the devices in the server, for example the hard disk, memory, processor, and graphics card. The BIOS chip can then send the resulting startup parameters to the BMC through the second bus so that the BMC records them. Startup here also includes a restart performed after an uncorrectable fault of the OS.

Alternatively, multiple time nodes can be set in the OS of the server, for example one time node every 12 hours, and a timer can be provided in the server. When the timer reaches a time node, the processor can trigger the BIOS chip to collect the operating parameters of the server, and the BIOS chip sends them to the BMC for storage.
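The timer-driven collection can be sketched as follows. `PeriodicCollector` and its `collect` and `send` callbacks are hypothetical; the sketch assumes evenly spaced time nodes (12 hours in the example from the text) and elapsed time measured in seconds since startup.

```python
class PeriodicCollector:
    """Collect normal parameters once for every time node that has passed."""

    def __init__(self, interval_s, collect, send):
        self._interval_s = interval_s
        self._collect = collect      # assumed callback: returns current parameters
        self._send = send            # assumed callback: delivers them to the BMC
        self._next_due = interval_s  # first time node

    def tick(self, elapsed_s):
        """Process all time nodes reached by `elapsed_s`; return how many fired."""
        fired = 0
        while elapsed_s >= self._next_due:
            self._send(self._collect())
            self._next_due += self._interval_s
            fired += 1
        return fired
```

With `interval_s = 12 * 3600`, a `tick` before the 12-hour mark sends nothing, and the first `tick` at or after it sends one snapshot.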

Correspondingly, the present application also provides a fault locating method, applied to the BMC in a server. As shown in FIG. 3, the method includes:

S200. Acquire and record, through the first bus, the first fault parameter in the server.

The first fault parameter is an operating parameter that is written, when an uncorrectable fault occurs in the OS running in the server, to the storage medium connected to both the BMC and the BIOS chip in the server.

S201. Determine the faulty device in the server according to the first fault parameter and the comparison parameters.

The comparison parameters are already recorded in the BMC and may include startup parameters, normal parameters, and so on; they may also include the first fault parameter generated during the previous startup.

By comparing the first fault parameter, stored to the storage medium when the server failed, with the comparison parameters already stored in the BMC, it is simpler and more efficient to determine in which device the fault occurred. The comparison can also reveal the specific cause of the fault, so that staff can clear the fault sooner.
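One simple way to picture the comparison in S201 is a key-by-key difference between the stored comparison parameters and the first fault parameter. The specification does not define the data format, so the flat dictionaries and the `changed_parameters` helper below are assumptions made for illustration.

```python
def changed_parameters(comparison, fault):
    """Return {name: (baseline_value, value_at_fault)} for values that differ.

    Only parameters present in both records are compared, because a value
    missing from the baseline gives nothing to compare against.
    """
    return {
        name: (comparison[name], value)
        for name, value in fault.items()
        if name in comparison and comparison[name] != value
    }
```

A parameter that appears in the result points staff at the device it belongs to; an empty result means this simple comparison found no difference.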

Optionally, the method further includes:

S202. Acquire and record, through the second bus, the startup parameters and/or normal parameters in the server.

The comparison parameters may include startup parameters and/or normal parameters; which comparison parameters to use and how to acquire them can be chosen according to actual needs. The startup parameters are operating parameters recorded when the server starts, and the normal parameters are operating parameters of the various devices in the server that are acquired and recorded at a preset time node while the server works normally.

Step S201 is specifically: comparing the startup parameters recorded at two adjacent startups of the server to determine the faulty device in the server.

For example, while the server works normally, the BMC can capture display information from the graphics card through the second bus at a fixed interval, such as every 5 minutes, and this display information can serve as a normal parameter. When an uncorrectable fault occurs in the server, the BIOS chip can capture the display information of the graphics card again and store it in the storage medium as the first fault parameter.

After the OS restarts, the BIOS chip detects that the storage medium holds the first fault parameter and transfers it to the BMC for recording. The BMC can then compare the two captures of display information. If their content is identical, the displayed content did not change while the server was still running, which indicates that the graphics card may be faulty, so staff can examine the graphics card more specifically.

As another example, when the server starts, the BIOS chip detects a memory size of 4 gigabits, and this memory size is stored in the BMC as part of the startup parameters. When the OS fails, the BIOS chip reads the memory size again, for example 2 gigabits, and stores it in the storage medium as part of the first fault parameter.

After the OS restarts, the BIOS chip sends the first fault parameter to the BMC for recording, so that during comparison staff can see the drop in memory size and conclude that the memory may be damaged.

Of course, the first fault parameter is not limited to display information and memory size; these are only examples.
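The two examples above can be written as explicit rules. This is a sketch under assumptions: the function name, the dictionary keys `display` and `memory_size`, and the device labels are hypothetical, and a real BMC would apply many more rules than these two.

```python
def suspect_devices(startup, normal, fault):
    """Apply the two example rules from the text and list suspect devices."""
    suspects = []
    # Rule 1: display content identical to the last normal capture suggests
    # the graphics card stopped updating.
    if "display" in normal and normal["display"] == fault.get("display"):
        suspects.append("graphics card")
    # Rule 2: memory size at fault time smaller than at startup suggests
    # damaged memory.
    if (
        "memory_size" in startup
        and "memory_size" in fault
        and fault["memory_size"] < startup["memory_size"]
    ):
        suspects.append("memory")
    return suspects
```

Using the figures from the text, a startup memory size of 4 and a fault-time memory size of 2 puts `"memory"` in the returned list.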

As another kind of operating parameter for locating faults, optionally, the method further includes:

S203. Acquire and record, through the second bus, the second fault parameter in the server.

The second fault parameter is an operating parameter recorded when it is determined that a correctable fault occurs in the OS running in the server.

Step S201 is specifically: comparing the normal parameters with the second fault parameter to determine the faulty device in the server.

Because the second fault parameter is the operating parameter corresponding to a correctable fault, which does not directly hang the OS, the processor can still transmit over the second bus. The second fault parameter is then written to the BMC via the BIOS chip.

The BMC can compare the normal parameters collected during normal operation of the server with the second fault parameter to locate the fault position corresponding to the second fault parameter.

Correspondingly, a fault recording device is provided, applied to a BIOS chip in a server. As shown in FIG. 4, the device includes:

an acquiring unit, configured to acquire, when it is determined that an uncorrectable fault occurs in the OS running in the server, a first fault parameter in the server that is associated with the uncorrectable fault;

Optionally, the acquiring unit is further configured to acquire, when it is determined that a correctable fault occurs in the OS running in the server, a second fault parameter in the server that is associated with the correctable fault;

the sending unit is further configured to send the second fault parameter to the BMC through the second bus.

Optionally, the acquiring unit is further configured to detect the various devices in the server when the server starts and to record startup parameters;

the sending unit is further configured to send the second fault parameter to the BMC through the second bus; or,

the acquiring unit is further configured to acquire and record normal parameters of the various devices in the server at a preset time node while the server works normally;

the recording unit is further configured to send the normal parameters to the BMC through the second bus.

Optionally, the second bus is a VGA (Video Graphics Array) bus.

Optionally, the first bus is an I2C (Inter-Integrated Circuit) bus.

Correspondingly, a fault locating device is provided, applied to the BMC in a server. As shown in FIG. 5, the device includes:

a recording unit, configured to acquire and record, through the first bus, a first fault parameter in the server, wherein the first fault parameter is an operating parameter that is written, when an uncorrectable fault occurs in the OS running in the server, to a storage medium connected to both the BMC and the BIOS chip in the server;

Correspondingly, a server is provided. As shown in FIG. 2, the server includes a processor, a BIOS chip, a BMC, and a storage medium;

Optionally, the second bus is a VGA bus.

Optionally, the first bus is an I2C bus.

In the embodiments of this specification, a shared storage medium is provided between the BIOS chip and the BMC. When the BIOS chip determines, based on hardware fault detection, that an uncorrectable fault occurs in the OS running in the server, it reads the relevant fault parameters from the server and writes them to the storage medium. After the OS restarts, the fault parameters saved in the storage medium are transferred to the BMC. This prevents the fault parameters that reflect the cause of the fault from being lost when the OS suffers an uncorrectable fault, which would otherwise make the fault hard to locate, and thus improves the efficiency of server fault location.

It should be understood that this specification is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes can be made without departing from its scope. The scope of this specification is limited only by the appended claims.

The above are only preferred embodiments of this specification and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of this specification shall fall within its scope of protection.

Claims

1. A fault recording method, applied to a Basic Input Output System (BIOS) chip in a server, comprising:

when an Operating System (OS) running in the server is determined to have an uncorrectable fault, acquiring a first fault parameter associated with the uncorrectable fault in the server;

recording the first fault parameter to a storage medium, wherein the storage medium is respectively connected to the BIOS chip and a Baseboard Management Controller (BMC) through a first bus;

and after the OS is restarted, sending the first fault parameter recorded in the storage medium to the BMC.

2. The method of claim 1, further comprising:

when it is determined that a correctable fault has occurred in the OS running in the server, acquiring a second fault parameter associated with the correctable fault in the server;

and sending the second fault parameter to the BMC through a second bus.

3. The method of claim 1, further comprising:

when the server is started, detecting various devices in the server and recording starting parameters;

sending the second fault parameter to the BMC through a second bus; or,

when the server works normally, acquiring and recording normal parameters of various devices in the server at a preset time node;

and sending the normal parameters to the BMC through a second bus.

4. The method according to claim 2 or 3, wherein the second bus is a Video Graphics Array (VGA) bus.

5. The method of any of claims 1-3, wherein the first bus is an Inter-Integrated Circuit (I2C) bus.

6. A fault locating method, applied to a BMC in a server, comprising:

acquiring and recording a first fault parameter in the server through a first bus, wherein the first fault parameter is an operating parameter written, when an uncorrectable fault occurs in an OS running in the server, into a storage medium connected to both the BMC and a BIOS chip in the server;

and determining a faulty device in the server according to the first fault parameter and a comparison parameter.

7. The method of claim 6, wherein the comparison parameter comprises starting parameters and/or normal parameters, wherein the starting parameters are operating parameters recorded when the server is started, and the normal parameters are operating parameters of various devices in the server that are acquired and recorded at a preset time node when the server works normally;

the method further comprises the following steps:

acquiring and recording starting parameters and/or normal parameters in the server through a second bus;

the determining, according to the first fault parameter and the comparison parameter, a device having a fault in the server specifically includes:

comparing the first fault parameter with the starting parameters and/or the normal parameters to determine a faulty device in the server.

8. The method of claim 7, further comprising:

acquiring and recording a second fault parameter in the server through the second bus, wherein the second fault parameter is an operating parameter recorded when a correctable fault occurs in the OS running in the server;

and comparing the normal parameters with the second fault parameter to determine a faulty device in the server.

9. A fault recording device, applied to a BIOS chip in a server, comprising:

an acquisition unit, configured to acquire a first fault parameter associated with an uncorrectable fault in the server when it is determined that the uncorrectable fault occurs in an OS running in the server;

a recording unit, configured to record the first fault parameter to a storage medium, wherein the storage medium is respectively connected to the BIOS chip and a BMC through a first bus;

and a sending unit, configured to send the first fault parameter recorded in the storage medium to the BMC after the OS is restarted.

10. A fault locating device, applied to a BMC in a server, comprising:

a recording unit, configured to acquire and record a first fault parameter in the server through a first bus, wherein the first fault parameter is an operating parameter written, when an uncorrectable fault occurs in an OS running in the server, into a storage medium connected to both the BMC and a BIOS chip in the server;

and a locating unit, configured to determine a faulty device in the server according to the first fault parameter.

11. A server, comprising a processor, a BIOS chip, a BMC and a storage medium;

the storage medium is respectively connected with the BIOS chip and the BMC through a first bus, and the BIOS chip is respectively connected with the processor and the BMC through a second bus;

when an uncorrectable fault occurs in an OS run by the processor in the server, the BIOS chip acquires a first fault parameter associated with the fault in the server;

the BIOS chip records the first fault parameter to a storage medium;

after the OS is restarted, the BIOS chip sends the first fault parameter recorded in the storage medium to the BMC;

and the BMC locates the uncorrectable fault in the server according to the first fault parameter and a comparison parameter stored in the BMC.
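The BMC-side comparison recited in claims 6 and 7 can be pictured with a short Python sketch. The claims do not state how the comparison is carried out, so everything here is an assumption made for illustration: the function name `locate_faulty_devices`, the representation of parameters as one numeric reading per device, and the relative tolerance used to decide that a reading deviates from its baseline.

```python
def locate_faulty_devices(fault_params, baseline_params, rel_tolerance=0.1):
    """Return the devices whose fault-time reading deviates from its baseline.

    fault_params:    device name -> reading captured at the uncorrectable fault
    baseline_params: device name -> starting or normal reading (the comparison
                     parameter)
    rel_tolerance:   hypothetical threshold; a reading is suspect when it
                     differs from the baseline by more than this fraction
    """
    suspects = []
    for device, fault_value in fault_params.items():
        baseline = baseline_params.get(device)
        if baseline is None:
            continue  # no baseline recorded for this device, nothing to compare
        allowed = abs(baseline) * rel_tolerance
        if abs(fault_value - baseline) > allowed:
            suspects.append(device)
    return sorted(suspects)


if __name__ == "__main__":
    # Hypothetical readings in millivolts and RPM.
    normal = {"CPU0": 1210, "DIMM_A1": 1200, "FAN1": 5100}
    at_fault = {"CPU0": 1200, "DIMM_A1": 900, "FAN1": 5000}
    print(locate_faulty_devices(at_fault, normal))  # ['DIMM_A1']
```

In this example only `DIMM_A1` moves by more than 10% of its baseline, so it is the only device reported. A real BMC would likely use per-device thresholds rather than a single tolerance.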