CN114826962A - Link fault detection method, device, equipment and machine readable storage medium - Google Patents
- Publication number
- CN114826962A (application CN202210327368.3A)
- Authority
- CN
- China
- Prior art keywords
- feedback message
- protocol
- state
- target pcie
- mctp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/26—Special purpose or proprietary protocols or architectures
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Computer Security & Cryptography (AREA)
- Environmental & Geological Engineering (AREA)
- Debugging And Monitoring (AREA)
Abstract
The present disclosure provides a link fault detection method, apparatus, device, and machine-readable storage medium. The method includes: determining whether a target PCIE device supports the MCTP protocol; sending a status acquisition command to a target PCIE device that supports the MCTP protocol; and attempting to receive a feedback message and determining, from the result of that attempt, the state of the PCIE link associated with the target PCIE device. The feedback message is the message that the target PCIE device sends in response to the status acquisition command after receiving it. With this technical solution, whether a PCIE link is faulty is determined through protocol communication: if the feedback message cannot be received normally, the PCIE link is abnormal. This simplifies the fault detection flow and makes it intuitive, convenient, and efficient.
Description
Technical Field
The present disclosure relates to the field of communication technologies, and in particular to a link fault detection method, apparatus, device, and machine-readable storage medium.
Background
A BMC (Baseboard Management Controller) can perform operations such as upgrading firmware and inspecting the machine's devices even while the machine is powered off.
The Management Component Transport Protocol (MCTP) is a media-independent protocol for communication between intelligent devices within the platform management subsystem of a managed computer system. The protocol is independent of the underlying physical bus and of the data-link-layer messages on that bus. It defines only transport-layer messages and treats whatever lies beneath as the carrier for those messages.
PCIE (PCI-Express, Peripheral Component Interconnect Express) is a high-speed serial computer expansion bus standard. It uses high-bandwidth, point-to-point serial links with separate transmit and receive channels; each connected device gets dedicated link bandwidth rather than sharing bus bandwidth. It supports features such as active power management, error reporting, end-to-end reliable transmission, hot plugging, and quality of service (QoS).
NCSI (Network Controller Sideband Interface) is an industry standard, defined by the Distributed Management Task Force (DMTF), for a sideband interface to network controllers that supports out-of-band management of servers.
One existing scheme for detecting PCIE anomalies on a server relies on the MCA (Machine Check Architecture) hardware mechanism, which raises an interrupt or exception when a hardware error is detected. Through MCA the system can detect hardware errors such as system bus errors, ECC errors, parity errors, cache errors, and TLB errors. Handling them, however, requires recording and parsing the error records, so the flow is complex and inefficient.
Summary of the Invention
In view of this, the present disclosure provides a link fault detection method, apparatus, electronic device, and machine-readable storage medium to address the low fault detection efficiency described above.
The specific technical solutions are as follows:
The present disclosure provides a link fault detection method applied to a BMC device. The method includes: determining whether a target PCIE device supports the MCTP protocol; sending a status acquisition command to a target PCIE device that supports the MCTP protocol; and attempting to receive a feedback message and determining, from the result of that attempt, the state of the PCIE link associated with the target PCIE device. The feedback message is the message that the target PCIE device sends in response to the status acquisition command after receiving it.
As one technical solution, sending the status acquisition command to the target PCIE device that supports the MCTP protocol, attempting to receive the feedback message, and determining the state of the PCIE link associated with the target PCIE device from the result includes: if receiving the feedback message fails, recording this failure and, after a preset delay, repeating the step of sending the status acquisition command to the target PCIE device; and if the number of recorded failures reaches a specified count, determining that the PCIE link associated with the target PCIE device is abnormal.
As one technical solution, sending the status acquisition command to the target PCIE device that supports the MCTP protocol includes: encapsulating, through the MCTP protocol, a message of a specified protocol that contains the status acquisition command, and sending that message to the target PCIE device. The specified protocol is the NCSI protocol, the PLDM protocol, or the VDM protocol.
As one technical solution, attempting to receive the feedback message and determining the state of the PCIE link associated with the target PCIE device from the result includes: if the PCIE link associated with the target PCIE device is determined to be abnormal, reporting an alarm.
The present disclosure also provides a link fault detection apparatus applied to a BMC device. The apparatus includes: a protocol module, configured to determine whether a target PCIE device supports the MCTP protocol; a command module, configured to send a status acquisition command to a target PCIE device that supports the MCTP protocol; and a processing module, configured to attempt to receive a feedback message and to determine, from the result of that attempt, the state of the PCIE link associated with the target PCIE device. The feedback message is the message that the target PCIE device sends in response to the status acquisition command after receiving it.
As one technical solution, sending the status acquisition command to the target PCIE device that supports the MCTP protocol, attempting to receive the feedback message, and determining the state of the PCIE link associated with the target PCIE device from the result includes: if receiving the feedback message fails, recording this failure and, after a preset delay, repeating the step of sending the status acquisition command to the target PCIE device; and if the number of recorded failures reaches a specified count, determining that the PCIE link associated with the target PCIE device is abnormal.
As one technical solution, sending the status acquisition command to the target PCIE device that supports the MCTP protocol includes: encapsulating, through the MCTP protocol, a message of a specified protocol that contains the status acquisition command, and sending that message to the target PCIE device. The specified protocol is the NCSI protocol, the PLDM protocol, or the VDM protocol.
As one technical solution, attempting to receive the feedback message and determining the state of the PCIE link associated with the target PCIE device from the result includes: if the PCIE link associated with the target PCIE device is determined to be abnormal, reporting an alarm.
The present disclosure also provides an electronic device including a processor and a machine-readable storage medium. The machine-readable storage medium stores machine-executable instructions that can be executed by the processor, and the processor executes those instructions to implement the link fault detection method described above.
The present disclosure also provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the link fault detection method described above.
The technical solutions provided by the present disclosure bring at least the following beneficial effects:
A status acquisition command is sent to a PCIE device that supports the MCTP protocol, and whether the PCIE link is faulty is determined through protocol communication. If the feedback message cannot be received normally, the PCIE link is abnormal. This simplifies the fault detection flow and makes it intuitive, convenient, and efficient.
Brief Description of the Drawings
To describe the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below are only some of the embodiments recorded in the present disclosure; a person of ordinary skill in the art can derive other drawings from them.
FIG. 1 is a flowchart of a link fault detection method in an embodiment of the present disclosure;
FIG. 2 is a structural diagram of a link fault detection apparatus in an embodiment of the present disclosure;
FIG. 3 is a hardware structure diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
The terminology used in the embodiments of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in this disclosure and the claims, the singular forms "a", "said", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in the embodiments of the present disclosure to describe various information, that information should not be limited by these terms. The terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of the present disclosure, first information may also be called second information, and similarly second information may also be called first information. In addition, depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
The present disclosure provides a link fault detection method, apparatus, electronic device, and machine-readable storage medium to address the low fault detection efficiency described above.
The specific technical solutions are described below.
In one embodiment, the present disclosure provides a link fault detection method applied to a BMC device. The method includes: determining whether a target PCIE device supports the MCTP protocol; sending a status acquisition command to a target PCIE device that supports the MCTP protocol; and attempting to receive a feedback message and determining, from the result of that attempt, the state of the PCIE link associated with the target PCIE device. The feedback message is the message that the target PCIE device sends in response to the status acquisition command after receiving it.
Specifically, as shown in FIG. 1, the method includes the following steps:
Step S11: determine whether the target PCIE device supports the MCTP protocol;
Step S12: send a status acquisition command to the target PCIE device that supports the MCTP protocol;
Step S13: attempt to receive a feedback message, and determine the state of the PCIE link associated with the target PCIE device from the result of that attempt.
A status acquisition command is sent to a PCIE device that supports the MCTP protocol, and whether the PCIE link is faulty is determined through protocol communication. If the feedback message cannot be received normally, the PCIE link is abnormal. This simplifies the fault detection flow and makes it intuitive, convenient, and efficient.
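Steps S11 to S13 can be illustrated with a minimal Python sketch. This is not the patent's implementation: the names `LinkState`, `check_link`, and the three callables are hypothetical, and the `NOT_CHECKED` outcome for devices without MCTP support is an assumption, since the text does not say what happens to them.

```python
from enum import Enum
from typing import Callable, Optional


class LinkState(Enum):
    NORMAL = "normal"
    ABNORMAL = "abnormal"
    NOT_CHECKED = "not_checked"  # assumption: devices without MCTP are skipped


def check_link(
    supports_mctp: Callable[[], bool],
    send_status_command: Callable[[], None],
    receive_feedback: Callable[[], Optional[bytes]],
) -> LinkState:
    """Run steps S11-S13 once for a single target PCIE device."""
    # S11: only MCTP-capable devices can be probed this way.
    if not supports_mctp():
        return LinkState.NOT_CHECKED
    # S12: send the status acquisition command.
    send_status_command()
    # S13: a missing feedback message is treated as a link abnormality.
    feedback = receive_feedback()
    return LinkState.NORMAL if feedback is not None else LinkState.ABNORMAL
```

Passing the three operations in as callables keeps the sketch independent of any particular BMC transport; a real BMC would supply functions backed by its own MCTP stack.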
In one embodiment, sending the status acquisition command to the target PCIE device that supports the MCTP protocol, attempting to receive the feedback message, and determining the state of the PCIE link associated with the target PCIE device from the result includes: if receiving the feedback message fails, recording this failure and, after a preset delay, repeating the step of sending the status acquisition command to the target PCIE device; and if the number of recorded failures reaches a specified count, determining that the PCIE link associated with the target PCIE device is abnormal.
In one embodiment, sending the status acquisition command to the target PCIE device that supports the MCTP protocol includes: encapsulating, through the MCTP protocol, a message of a specified protocol that contains the status acquisition command, and sending that message to the target PCIE device. The specified protocol is the NCSI protocol, the PLDM protocol, or the VDM protocol.
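The encapsulation step can be sketched as follows. The patent gives no byte layout, so everything here is an assumption: the four-byte transport header and the message type codes follow my reading of the DMTF MCTP specifications (DSP0236 and DSP0239) and should be checked against them, and `encapsulate` and `MCTP_MSG_TYPE` are hypothetical names. The sketch covers only a single-packet MCTP message, not the PCIE binding that carries it.

```python
import struct

# Message type codes as understood from DMTF DSP0239; treat them as assumptions.
MCTP_MSG_TYPE = {"PLDM": 0x01, "NCSI": 0x02, "VDM_PCI": 0x7E}
MCTP_HEADER_VERSION = 0x01


def encapsulate(protocol: str, payload: bytes, dest_eid: int, src_eid: int, tag: int = 0) -> bytes:
    """Wrap one upper-layer command in a single-packet MCTP message."""
    if protocol not in MCTP_MSG_TYPE:
        raise ValueError(f"unsupported protocol: {protocol}")
    # SOM=1 and EOM=1 (single packet), packet sequence 0, TO=1 (request), 3-bit tag.
    flags = 0x80 | 0x40 | (0 << 4) | 0x08 | (tag & 0x07)
    header = struct.pack("BBBB", MCTP_HEADER_VERSION, dest_eid, src_eid, flags)
    # First body byte: integrity-check bit (0 here) plus the 7-bit message type.
    return header + bytes([MCTP_MSG_TYPE[protocol] & 0x7F]) + payload
```

For example, `encapsulate("NCSI", command_bytes, dest_eid=0x09, src_eid=0x08)` would prefix a hypothetical NCSI command with the MCTP header and the NCSI message type byte.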
In one embodiment, attempting to receive the feedback message and determining the state of the PCIE link associated with the target PCIE device from the result includes: if the PCIE link associated with the target PCIE device is determined to be abnormal, reporting an alarm.
In one embodiment, the BMC is connected to the PCH through a PCIE bus, the PCH is connected to the PCIE bus controller of the CPU, and the PCIE devices hang off the CPU. The BMC communicates with the PCIE devices over the PCIE link. The BMC encapsulates an NCSI, PLDM, or VDM protocol message on top of the MCTP protocol message and sends it to the PCH over the PCIE link. The PCH forwards the message to the device through the PCIE controller. After receiving the message, the device processes it and returns a response to the BMC.
Take a network card that supports the NCSI over MCTP protocol as an example. First, the BMC assigns each network card an EID (endpoint ID) as its identity number. The BMC then sends the network card an NCSI over MCTP status acquisition command, for example a command to get the card's link status. After receiving the command, the network card responds to the BMC over the PCIE link, for example by reporting whether the port's link status is up or down. Once the BMC receives the response, it can conclude that both the PCIE link and the device are normal. If the PCIE link or the PCIE device is abnormal during this process, the command exchange fails, that is, no feedback message is received. To avoid misjudgment caused by link flapping, a link or device abnormality can be confirmed only after several MCTP command exchanges have failed. For example, after the first failure to receive a feedback message, the status acquisition command is sent again after a 5-second interval; this is repeated three times, and if the feedback message cannot be received normally all three times, the PCIE link is considered abnormal.
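The retry scheme in this example can be sketched in Python. The function name `link_is_abnormal` and its parameters are hypothetical, and the sketch assumes that "three times" means three failed attempts in total, which the text leaves ambiguous. The `probe` callable stands in for one send-and-receive exchange and returns `None` when no feedback message arrives.

```python
import time
from typing import Callable, Optional


def link_is_abnormal(
    probe: Callable[[], Optional[bytes]],
    max_failures: int = 3,
    retry_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True only after max_failures probes in a row get no feedback."""
    failures = 0
    while failures < max_failures:
        if probe() is not None:
            return False  # feedback received: link and device are reachable
        failures += 1  # record this failed reception
        if failures < max_failures:
            sleep(retry_delay_s)  # wait out possible link flapping before retrying
    return True
```

The `sleep` parameter is injectable so the logic can be exercised without waiting; a caller would normally leave it at the default and raise an alarm when the function returns `True`.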
In one embodiment, the present disclosure also provides a link fault detection apparatus, shown in FIG. 2, applied to a BMC device. The apparatus includes: a protocol module 21, configured to determine whether a target PCIE device supports the MCTP protocol; a command module 22, configured to send a status acquisition command to a target PCIE device that supports the MCTP protocol; and a processing module 23, configured to attempt to receive a feedback message and to determine, from the result of that attempt, the state of the PCIE link associated with the target PCIE device. The feedback message is the message that the target PCIE device sends in response to the status acquisition command after receiving it.
In one embodiment, sending the status acquisition command to the target PCIE device that supports the MCTP protocol, attempting to receive the feedback message, and determining the state of the PCIE link associated with the target PCIE device from the result includes: if receiving the feedback message fails, recording this failure and, after a preset delay, repeating the step of sending the status acquisition command to the target PCIE device; and if the number of recorded failures reaches a specified count, determining that the PCIE link associated with the target PCIE device is abnormal.
In one embodiment, sending the status acquisition command to the target PCIE device that supports the MCTP protocol includes: encapsulating, through the MCTP protocol, a message of a specified protocol that contains the status acquisition command, and sending that message to the target PCIE device. The specified protocol is the NCSI protocol, the PLDM protocol, or the VDM protocol.
In one embodiment, attempting to receive the feedback message and determining the state of the PCIE link associated with the target PCIE device from the result includes: if the PCIE link associated with the target PCIE device is determined to be abnormal, reporting an alarm.
The apparatus embodiments are the same as or similar to the corresponding method embodiments and are not described again here.
In one embodiment, the present disclosure provides an electronic device including a processor and a machine-readable storage medium. The machine-readable storage medium stores machine-executable instructions that can be executed by the processor, and the processor executes those instructions to implement the link fault detection method described above. At the hardware level, a schematic diagram of the hardware architecture is shown in FIG. 3.
In one embodiment, the present disclosure provides a machine-readable storage medium storing machine-executable instructions that, when invoked and executed by a processor, cause the processor to implement the link fault detection method described above.
Here, the machine-readable storage medium may be any electronic, magnetic, optical, or other physical storage device that can contain or store information such as executable instructions and data. For example, the machine-readable storage medium may be RAM (Random Access Memory), volatile memory, non-volatile memory, flash memory, a storage drive (such as a hard disk drive), a solid-state drive, any type of storage disc (such as an optical disc or DVD), a similar storage medium, or a combination of these.
The systems, apparatuses, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product having a certain function. A typical implementing device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or a combination of any of these devices.
For convenience of description, the above apparatus is described in terms of separate functional units. When implementing the present disclosure, the functions of the units may of course be implemented in one or more pieces of software and/or hardware.
本领域内的技术人员应明白,本公开的实施方式可提供为方法、系统、或计算机程序产品。因此,本公开可采用完全硬件实施方式、完全软件实施方式、或结合软件和硬件方面的实施方式的形式。而且,本公开实施方式可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. Furthermore, embodiments of the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therein, including but not limited to disk storage, CD-ROM, optical storage, and the like.
本公开是参照根据本公开实施方式的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可以由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其它可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其它可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each process and/or block in the flowchart illustrations and/or block diagrams, and combinations of processes and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.
而且,这些计算机程序指令也可以存储在能引导计算机或其它可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或者多个流程和/或方框图一个方框或者多个方框中指定的功能。Furthermore, these computer program instructions may also be stored in a computer readable memory capable of directing a computer or other programmable data processing apparatus to operate in a particular manner, such that the instructions stored in the computer readable memory result in an article of manufacture comprising the instruction means, The instruction means implements the functions specified in a flow or flows of the flowcharts and/or a block or blocks of the block diagrams.
这些计算机程序指令也可装载到计算机或其它可编程数据处理设备上,使得在计算机或者其它可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其它可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process such that The instructions provide steps for implementing the functions specified in the flow or blocks of the flowcharts and/or the block or blocks of the block diagrams.
本领域技术人员应明白,本公开的实施方式可提供为方法、系统或计算机程序产品。因此,本公开可以采用完全硬件实施方式、完全软件实施方式、或者结合软件和硬件方面的实施方式的形式。而且,本公开可以采用在一个或者多个其中包含有计算机可用程序代码的计算机可用存储介质(可以包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware implementation, an entirely software implementation, or an implementation combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (which may include, but are not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The above are merely embodiments of the present disclosure and are not intended to limit it. Those skilled in the art may make various modifications and changes to the present disclosure. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the scope of its claims.
Claims (10)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210327368.3A CN114826962A (en) | 2022-03-30 | 2022-03-30 | Link fault detection method, device, equipment and machine readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN114826962A (en) | 2022-07-29 |
Family
ID=82533610
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202210327368.3A Pending CN114826962A (en) | Link fault detection method, device, equipment and machine readable storage medium | 2022-03-30 | 2022-03-30 |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN114826962A (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2012100724A1 (en) * | 2011-01-28 | 2012-08-02 | 成都市华为赛门铁克科技有限公司 | Method, device, and system for transmitting packet on pcie bus |
| US20190361763A1 (en) * | 2018-05-25 | 2019-11-28 | Qualcomm Incorporated | Safe handling of link errors in a peripheral component interconnect express (pcie) device |
| CN110958132A (en) * | 2019-10-31 | 2020-04-03 | 苏州浪潮智能科技有限公司 | Method for monitoring network card device, baseboard management controller and network card device |
| CN113010381A (en) * | 2021-03-12 | 2021-06-22 | 山东英信计算机技术有限公司 | Method and equipment for managing components |
| CN113868058A (en) * | 2021-09-28 | 2021-12-31 | 新华三技术有限公司 | Method, device and server for fault detection of peripheral component high-speed interconnection equipment |
Cited By (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115941436A (en) * | 2022-09-29 | 2023-04-07 | 超聚变数字技术有限公司 | A kind of warning method, computing equipment and storage medium |
| CN116137603A (en) * | 2023-02-23 | 2023-05-19 | 苏州浪潮智能科技有限公司 | Link fault detection method and device, storage medium and electronic device |
| CN116723084A (en) * | 2023-06-14 | 2023-09-08 | 苏州浪潮智能科技有限公司 | PCIE link fault repair methods, devices, electronic equipment and storage media |
| CN116582471A (en) * | 2023-07-14 | 2023-08-11 | 珠海星云智联科技有限公司 | PCIE equipment, PCIE data capturing system and server |
| CN116582471B (en) * | 2023-07-14 | 2023-09-19 | 珠海星云智联科技有限公司 | PCIE equipment, PCIE data capturing system and server |
| CN118245295A (en) * | 2023-12-29 | 2024-06-25 | 河南昆仑技术有限公司 | PCIe link state detection method of server and server |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN114826962A (en) | | Link fault detection method, device, equipment and machine readable storage medium |
| US11994940B2 (en) | | Fault processing method, related device, and computer storage medium |
| JP6383839B2 (en) | | Method, storage device and system used for remote KVM session |
| CN100440157C (en) | | System and method for logging recoverable errors |
| CN105095001B (en) | | Virtual machine abnormal restoring method under distributed environment |
| CN106844162A (en) | | Storage server cabinet management system and method based on BMC |
| US10728086B2 (en) | | System and method for providing a redundant communication path between a server rack controller and one or more server controllers |
| WO2021027481A1 (en) | | Fault processing method, apparatus, computer device, storage medium and storage system |
| CN110740072A (en) | | A fault detection method, device and related equipment |
| CN112468361A (en) | | Network connection state monitoring method and device, electronic equipment and storage medium |
| CN116483613B (en) | | Processing method and device of fault memory bank, electronic equipment and storage medium |
| US9208039B2 (en) | | System and method for detecting server removal from a cluster to enable fast failover of storage |
| CN102983989B (en) | | Removing method, device and equipment of server virtual address |
| CN111147313B (en) | | Message abnormity monitoring method and device, storage medium and electronic equipment |
| CN103559124A (en) | | Fast fault detection method and device |
| CN115454705B (en) | | Troubleshooting methods, related devices, computer equipment, media and programs |
| CN107729190A (en) | | A kind of I/O path failure metastasis treating method and system |
| WO2022160308A1 (en) | | Data access method and apparatus, and storage medium |
| CN111459863A (en) | | NVME-MI-based chassis management system and method |
| CN115686951A (en) | | Method and device for troubleshooting a database server |
| CN103905264A (en) | | Monitoring system and monitoring method |
| US9588691B2 (en) | | Dynamically managing control information in a storage device |
| CN116302625A (en) | | Fault reporting method, device and storage medium |
| CN119806745A (en) | | Cloud platform virtual machine operating system anomaly detection and recovery method, device and medium |
| US8819481B2 (en) | | Managing storage providers in a clustered appliance environment |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20220729 |