CN116361088A - Node misplug detection method and server - Google Patents
- Publication number
- CN116361088A (CN202310207253.5A)
- Authority
- CN
- China
- Prior art keywords
- parameter
- chassis
- node
- server node
- server
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2205—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested
- G06F11/2221—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing using arrangements specific to the hardware being tested to test input/output devices or peripheral units
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/16—Constructional details or arrangements
- G06F1/18—Packaging or power distribution
- G06F1/181—Enclosures
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/16—Constructional details or arrangements
- G06F1/20—Cooling means
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2273—Test methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/22—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing
- G06F11/2289—Detection or location of defective computer hardware by testing during standby operation or during idle time, e.g. start-up testing by configuration test
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/3044—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is the mechanical casing of the computing system
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/32—Monitoring with visual or acoustical indication of the functioning of the machine
- G06F11/324—Display of status information
- G06F11/327—Alarm or error message display
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- Computer Hardware Design (AREA)
- Computing Systems (AREA)
- Databases & Information Systems (AREA)
- Human Computer Interaction (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Power Engineering (AREA)
- Cooling Or The Like Of Electrical Apparatus (AREA)
Abstract
The application discloses a node misplug detection method and a server. The method is applied to a first controller of a server node and comprises the following steps: acquiring identification information of a first chassis, wherein the first chassis is the chassis into which the server node is inserted; when the identification information exists in the white list of the server node, controlling the server node to enter a working state and executing a first operation; and when the identification information does not exist in the white list of the server node, executing a second operation and outputting alarm information. Because node misplug events are guarded against in software, chassis with different structures do not need to be designed each time the multi-node server moves to a new generation, which improves the chassis reuse rate and reduces the management cost of chassis products.
Description
Technical Field
The application relates to the technical field of server node management, in particular to a node misplug detection method and a server.
Background
With the growing demand for server computing power, high-density servers have emerged. Multiple nodes can be deployed on a high-density server, and each node can operate as an independent server; such servers are also called multi-node servers. Because a multi-node server can host multiple nodes, it has greater computing capability and higher power consumption, and therefore requires greater heat dissipation capability.
When a multi-node server is upgraded to a new generation, its computing capability usually improves and its power consumption increases accordingly, so the new multi-node server may require higher heat dissipation capability. Multi-node servers of different generations therefore have different heat dissipation requirements. For ease of understanding, in this specification the multi-node server before the upgrade is referred to as a lower-generation multi-node server relative to the one after the upgrade; conversely, the multi-node server after the upgrade is referred to as a higher-generation multi-node server.
Generally, a multi-node server is installed in a chassis and dissipates heat through the heat dissipation system of the chassis. When a chassis is dedicated to a multi-node server of a certain generation, the chassis matches the nodes of that multi-node server; specifically, when both operate at their respective rated power, the heat dissipation system of the chassis can fully dissipate the heat generated by each node of the multi-node server while the energy efficiency meets the requirements.
If a node of a multi-node server is mistakenly inserted into a chassis corresponding to a multi-node server of another generation, the heat dissipation requirement of the node may not match the heat dissipation capability of the chassis. Specifically, when a higher-generation node is inserted into a chassis corresponding to a lower-generation multi-node server, the insufficient heat dissipation capability of the chassis may cause heat accumulation and thus affect service performance; when a lower-generation node is inserted into a chassis corresponding to a higher-generation multi-node server, the excessive heat dissipation capability of the chassis may waste power, increasing power consumption and reducing energy efficiency.
Therefore, node misinsertion needs to be prevented. In the prior art, structural fool-proofing is generally used: chassis with different structures are designed for multi-node servers of different generations, and the correspondence between the chassis structural parts and the nodes of the multi-node server prevents nodes from being inserted into chassis of a different generation.
However, structural fool-proofing requires the chassis to be redesigned whenever the multi-node server moves to a new generation, so the chassis reuse rate is low.
Disclosure of Invention
The present application provides a node misplug detection method and a server. Node misplug is detected and alarmed by a controller inside the node, so node misplug does not need to be prevented through different chassis structures; multi-node servers of different generations can therefore use the same chassis, which improves the chassis reuse rate.
The first aspect of the present application provides a method for detecting node misinsertion, which is applied to a first controller in a server node; the method comprises the following steps:
acquiring identification information of a first chassis, wherein the first chassis is a chassis in which the server node is inserted; under the condition that the identification information exists in the white list of the server node, controlling the server node to enter a working state and executing a first operation; and executing a second operation and outputting alarm information when the identification information does not exist in the white list of the server node.
If the identification information exists in the white list of the server node, the server node can be determined to be inserted into the matched chassis; otherwise, it may be determined that the server node is plugged into a non-matching chassis.
In the method, after the server node is inserted into the first chassis and powered on, the identification information of the first chassis can be obtained and used to determine whether the server node is inserted into a matching chassis. When the server node does not match the first chassis, the second operation is executed and the corresponding alarm information is output, so that the alarm can be received as soon as a node misplug event occurs. Because node misplug events are guarded against in software, chassis with different structures do not need to be designed each time the multi-node server moves to a new generation, which improves the chassis reuse rate and reduces the management cost of chassis products.
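A minimal sketch of the white list check described above, assuming the chassis identification information is available as a string and the white list as a set of strings. The names `Decision`, `check_chassis` and the example identifiers are hypothetical and do not come from the application; the first and second operations are represented only as labels.

```python
from enum import Enum


class Decision(Enum):
    """Outcome of the white list check (names are illustrative)."""
    MATCHED = "enter working state and execute first operation"
    MISPLUG = "execute second operation and output alarm information"


def check_chassis(chassis_id: str, whitelist: set[str]) -> Decision:
    """Return MATCHED when the chassis ID is in the node's white list."""
    if chassis_id in whitelist:
        return Decision.MATCHED
    return Decision.MISPLUG


# Hypothetical chassis identifiers, for illustration only.
node_whitelist = {"CHASSIS-GEN2-A", "CHASSIS-GEN2-B"}
print(check_chassis("CHASSIS-GEN2-A", node_whitelist).name)
print(check_chassis("CHASSIS-GEN1-A", node_whitelist).name)
```

Keeping the white list on the node side means a chassis only has to expose an identifier; which chassis count as matching is decided entirely by the node.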
In one possible implementation, the performing the second operation and outputting the alert information includes: acquiring a first parameter indicating the heat dissipation requirement of the server node and a second parameter indicating the heat dissipation capability of the first chassis; when the first parameter is smaller than or equal to the second parameter and the difference value between the second parameter and the first parameter is smaller than or equal to a preset threshold value, controlling the server node to enter the working state and outputting first alarm information; when the first parameter is smaller than the second parameter and the difference value between the second parameter and the first parameter is larger than the preset threshold value, reducing the power of a heat dissipation part corresponding to the server node in the first chassis and outputting second alarm information; and under the condition that the first parameter is larger than the second parameter, limiting the real-time maximum power of the server node, and outputting third alarm information.
When the first parameter is smaller than or equal to the second parameter and the difference between the second parameter and the first parameter is smaller than or equal to the preset threshold, the server node does not accumulate heat when operating in the first chassis, and the heat dissipation capability of the first chassis is not so excessive as to waste power.
When the first parameter is smaller than the second parameter and the difference between the second parameter and the first parameter is larger than the preset threshold, the server node does not accumulate heat when operating in the first chassis, but the excessive heat dissipation capability of the first chassis wastes power and increases the power consumption of the server system in which the server node is located.
When the first parameter is greater than the second parameter, heat accumulates when the server node operates in the first chassis, so both the server node and the server system are exposed to heat accumulation risk.
In the present application, the first parameter indicating the heat dissipation requirement of the server node and the second parameter indicating the heat dissipation capability of the first chassis are obtained; the operation of the server node is then controlled, and alarm information is output in graded form, according to the magnitude relationship between the two parameters. The risk in the current operation of the server node can thus be judged accurately, node misplug events with different risks can be handled and alarmed in a fine-grained manner, and operators can manage the server more conveniently.
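The three-way grading described above can be sketched as follows. This is an illustration only: the names `Risk` and `classify`, the use of watts, and the example values are assumptions, and the sketch additionally assumes that the preset threshold is non-negative.

```python
from enum import Enum


class Risk(Enum):
    """Graded outcomes; each maps to one alarm level in the text above."""
    ACCEPTABLE = 1         # first alarm: node may enter the working state
    EXCESS_COOLING = 2     # second alarm: reduce power of the heat dissipation part
    HEAT_ACCUMULATION = 3  # third alarm: limit the node's real-time maximum power


def classify(first_param: float, second_param: float, threshold: float) -> Risk:
    """Compare heat dissipation demand (first) with capability (second)."""
    if threshold < 0:
        raise ValueError("threshold is assumed to be non-negative")
    if first_param > second_param:
        return Risk.HEAT_ACCUMULATION
    if second_param - first_param > threshold:
        return Risk.EXCESS_COOLING
    return Risk.ACCEPTABLE


# Hypothetical values in watts, with a 100 W threshold.
for demand in (350.0, 200.0, 450.0):
    print(demand, classify(demand, 400.0, 100.0).name)
```

Checking the heat accumulation case first keeps the remaining two branches a pure comparison of the surplus against the threshold.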
In one possible implementation, the first parameter comprises the thermal design power of the server node, and the second parameter comprises a first heat dissipation parameter; wherein the first heat dissipation parameter is the maximum amount of heat that the first chassis can dissipate per unit time for the space in which the server node is located, divided by the number of nodes contained in that space.
In the present application, using the thermal design power of the server node as the first parameter describes the heat dissipation requirement of the server node more accurately; combined with the first heat dissipation parameter, which is expressed as heat dissipated per unit time, it allows the first controller to accurately judge whether the server node and the server system in which it is located currently face a heat accumulation risk or a risk of increased power consumption.
In one possible implementation, the first parameter includes an operating power of the server node, and the second parameter includes a second heat dissipation parameter; the second heat dissipation parameter is the maximum power at which the server node can operate in the first chassis without accumulating heat, given the heat dissipation capability of the first chassis.
In the method, the operating power of the server node is used as the first parameter indicating its heat dissipation requirement and is compared against the second heat dissipation parameter as the benchmark, which makes the relationship between the operating power of the server node and the corresponding risk more intuitive; the method also applies when the thermal design power is unknown, so its range of application is wider.
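The first heat dissipation parameter can be written as a simple quotient. The symbols below are introduced here for illustration and do not appear in the application: $Q_{\max}$ is the maximum heat the first chassis can dissipate per unit time for the space containing the node, $n$ is the number of nodes in that space, and the numbers are hypothetical.

```latex
P_{\mathrm{diss}} = \frac{Q_{\max}}{n},
\qquad \text{e.g.}\quad
P_{\mathrm{diss}} = \frac{1600\ \mathrm{W}}{2} = 800\ \mathrm{W}
```

Under these assumed figures, a node with a thermal design power of 700 W would satisfy the condition that the first parameter does not exceed the second, with a difference of 100 W to compare against the preset threshold.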
In one possible implementation, the operating power is the rated power of the server node.
In the method, the rated power of the server node is used as the working power of the server node, so that the acquisition is easier.
In one possible implementation, the operating power is an average power of the server node over a preset period of time.
In the method, the average power of the server node in the preset time after power-on is used as the working power of the server node, the heat dissipation requirement of the server node is described, and the risk judgment result of the first controller according to the average power and the second heat dissipation parameter is more accurate.
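One way to obtain such an average is sketched below. The application does not say how the average is computed; this sketch assumes power readings sampled at uniform intervals over the preset period, and the function name and sample values are hypothetical.

```python
def average_power(samples_w: list[float]) -> float:
    """Arithmetic mean of power readings (watts) taken at uniform intervals."""
    if not samples_w:
        raise ValueError("at least one power sample is required")
    return sum(samples_w) / len(samples_w)


# Hypothetical readings collected during the preset period after power-on.
readings = [300.0, 340.0, 320.0, 360.0]
print(average_power(readings))
```

If the readings were not uniformly spaced, a time-weighted mean would be the more faithful choice; the plain mean is used here only to keep the sketch short.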
The second aspect of the present application provides a method for detecting node misinsertion, which is applied to a first controller in a server node; the method comprises the following steps:
acquiring a first parameter indicating the heat dissipation requirement of the server node and a second parameter indicating the heat dissipation capacity of a first chassis, wherein the first chassis is a chassis in which the server node is inserted; when the first parameter is smaller than or equal to the second parameter and the difference value between the second parameter and the first parameter is smaller than or equal to a preset threshold value, the server node is controlled to enter a working state and a first operation is executed; and executing a second operation and outputting alarm information when the first parameter is smaller than the second parameter and the difference between the second parameter and the first parameter is larger than the preset threshold value or when the first parameter is larger than the second parameter.
In one possible implementation, the performing the second operation and outputting the alert information includes: when the first parameter is smaller than the second parameter and the difference value between the second parameter and the first parameter is larger than the preset threshold value, reducing the power of a heat dissipation part corresponding to the server node in the first chassis and outputting second alarm information; and under the condition that the first parameter is larger than the second parameter, reducing the maximum power of the server node and outputting third alarm information.
In one possible implementation, the first parameter comprises the thermal design power of the server node, and the second parameter comprises a first heat dissipation parameter; wherein the first heat dissipation parameter is the maximum amount of heat that the first chassis can dissipate per unit time for the space in which the server node is located, divided by the number of nodes contained in that space.
In one possible implementation, the first parameter includes an operating power of the server node, and the second parameter includes a second heat dissipation parameter; the second heat dissipation parameter is the maximum power at which the server node can operate in the first chassis without accumulating heat, given the heat dissipation capability of the first chassis.
In one possible implementation, the operating power is the rated power of the server node.
In one possible implementation, the operating power is an average power of the server node over a preset period of time.
A third aspect of the present application provides a server comprising a chassis, and a server node disposed within the chassis, the server node comprising a first controller for performing the method of the first or second aspect.
In one possible implementation, the first controller is a baseboard management controller (BMC), also referred to as an out-of-band management controller.
It should be appreciated that the implementation and benefits of the various aspects described above may be referenced to one another.
Drawings
FIG. 1 is a system architecture diagram of a multi-node server according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a chassis space according to an embodiment of the present disclosure;
FIG. 3 is a schematic diagram of a connection relationship between a node and a fan module according to an embodiment of the present application;
FIG. 4 is a flow chart of a method for detecting node misinsertion according to an embodiment of the present application;
FIG. 5 is a flow chart of a method for detecting and alarming node risk according to an embodiment of the present application;
FIG. 6 is a flow chart of another method for detecting node misinsertion according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will now be described with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some, but not all embodiments of the present application. As a person of ordinary skill in the art can know, with the development of technology and the appearance of new scenes, the technical solutions provided in the embodiments of the present application are applicable to similar technical problems.
The terms first, second and the like in the description and in the claims of the present application and in the above-described figures, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In the prior art, multi-node servers of different generations use different chassis structures to accommodate their nodes, and physical fool-proofing prevents nodes of a different generation from being misplugged, which avoids the effects of a mismatch between the heat dissipation capability of the chassis and the heat dissipation requirement of the nodes during operation. Correspondingly, each time the multi-node server moves to a new generation, the chassis that accommodates it also needs to be redesigned, so the chassis reuse rate is low.
To solve the above problems, the embodiments of the application provide a node misplug detection method that detects and alarms node misplug in software, so that a node misplug event can be detected and operators can be prompted in time even if the same chassis is used. With this method, node misplug can be avoided, which prevents heat accumulation from affecting services because of poor heat dissipation and prevents power consumption from increasing because of excessive heat dissipation capability; and because physical fool-proofing through different chassis structures is not needed, the chassis reuse rate can be improved and the management cost of chassis products reduced.
Referring to fig. 1, fig. 1 is a system architecture diagram of a multi-node server according to an embodiment of the present application, where the architecture includes nodes 100A to 100D (hereinafter, collectively referred to as nodes 100), a middle module 110, a fan module 120, and a chassis 130.
The node 100 is a computing node and comprises a motherboard. A central processing unit (CPU), memory modules and other components are integrated on the motherboard to perform data operations and provide computing power for the multi-node server. A baseboard management controller (BMC), i.e., an out-of-band management controller, is also integrated on the motherboard to manage the hardware of the node 100 itself; it may also cooperate with the BMCs of other nodes in the multi-node server to control other hardware in the chassis 130, such as the fan module 120. In addition, a controller for processing hardware logic signals may be integrated on the motherboard of the node 100; the controller may be a complex programmable logic device (CPLD), a microcontroller unit (MCU), a field programmable gate array (FPGA), or another processor having data processing and data transmission capabilities.
The middle module 110 is used to connect the node 100 with other hardware in the chassis 130, for example, to connect the node 100 with the fan module 120 or with a hard disk module (not shown in the figure). Specifically, the middle module 110 includes a fan management board, and the devices integrated on the fan management board may include a controller, node connectors, a hard disk module connector, and fan module connectors; the node 100 is connected to the middle module 110 through a node connector, the hard disks in the hard disk module are connected to the middle module 110 through the hard disk module connector, and the fans in the fan module 120 are connected to the middle module 110 through the fan module connectors. It will be appreciated that each connector connects the fan management board to one component, such as a fan, a hard disk or a node. The controller on the fan management board may be of the same types as the controller on the motherboard of the node 100 and is not described again here.
That is, the node 100 may control the fan module 120 or read and write data in the hard disk of the hard disk module through the middle module 110.
The fan module 120 is used for dissipating heat generated during operation of the node 100.
In the specific example of fig. 1, nodes 100A and 100C are inserted into chassis 130 from the left, and nodes 100B and 100D are inserted into chassis 130 from the right. Referring to fig. 2, fig. 2 is a schematic diagram of a space of the chassis 130, and in the specific example of fig. 2, the chassis 130 is divided into two spaces, i.e., left and right, and each space may accommodate two nodes 100 arranged side by side. It will be appreciated that in the specific example of fig. 1, the left two fans are in the left space, and the heat dissipation capacity of the two fans is shared by node 100A and node 100C; the right space is the same.
The chassis 130 is used for accommodating the multi-node server, and may specifically be used for accommodating the above components, and the height specification of the chassis 130 may be 1U, 2U or 4U, which is not limited herein.
It will be appreciated that in other multi-node servers, all fans in the fan module 120 may collectively dissipate heat for all nodes.
It should be noted that, in a specific implementation, the multi-node server may be any device with a structure similar to that in fig. 1; the embodiments of the application do not limit the specific composition of the multi-node server. In addition, the structure shown in fig. 1 does not limit the multi-node server: the multi-node server may include more or fewer components than shown in fig. 1, may combine some components, or may arrange the components differently. For example, some multi-node servers further include a chassis management board for managing the internal hardware of the chassis, and the BMC of the node 100 controls the internal hardware of the chassis through the chassis management board; as another example, the middle module 110 may be provided with interfaces instead of connectors and connect the node 100 and the fan module 120 through the interfaces and cables.
It is understood that, in this document, a node being inserted into a chassis refers to the state in which the node is connected to the chassis after insertion. For example, when the node connects to the fan management board of the chassis through a connector, the node is in the inserted state once it has been pushed into the chassis to the preset position and mated with the management board; when the node connects to the management board through a cable, the node is in the inserted state once it has been connected to the management board through the cable and pushed into the chassis to the preset position.
Referring to fig. 3, fig. 3 is a schematic diagram of the connection relationship between nodes and the fan module provided in an embodiment of the present application. The diagram includes a plurality of nodes 100 of a multi-node server, a middle module 110, a fan module 120, and a chassis 130. The middle module 110 includes a fan management board 1101, node connectors 1102A to 1102D (hereinafter collectively referred to as node connectors 1102), and fan module connectors 1103A to 1103D (hereinafter collectively referred to as fan module connectors 1103); the fan module 120 includes fans 1201A to 1201D (hereinafter collectively referred to as fans 1201).
Each node 100 is connected to the fan management board 1101 through a respective node connector 1102. Specifically, each node 100 includes a first controller and a BMC, which are connected through a bus in the node; the fan management board 1101 includes a second controller, and the first controller and the second controller can communicate through the node connector 1102.
It should be noted that a motherboard connector (not shown) that mates with the node connector 1102 on the fan management board 1101 may be provided on the motherboard of each node; the motherboard connector is used for transmitting signals between the node and the fan management board.
Each fan 1201 in the fan module 120 is connected to the fan management board 1101 through a respective fan module connector 1103. That is, the BMC of the node 100 may control the rotational speed of any fan in the fan module 120 through the first controller and the second controller. Likewise, each fan of the fan module 120 may be provided with a fan connector (not shown) that mates with the fan module connector 1103 and is used for transmitting signals between the fan and the fan management board.
Specifically, the first controller and the second controller may each be a complex programmable logic device (CPLD), a microcontroller unit (MCU), a field programmable gate array (FPGA), or another processor with data processing and data transmission capabilities.
In one possible implementation, the first controller may be integrated in the BMC of the node 100, or only the BMC is provided in the node 100 without the first controller. In this possible implementation, the functions or steps that the first controller needs to perform may be performed by the BMC.
It will be appreciated that different server vendors may name the BMC differently; for example, some vendors call it iBMC, others call it iLO, and still others call it iDRAC.
It should be understood that the chassis 130 needs to accommodate a plurality of nodes in a compact space, so the heat dissipation system disposed in the chassis 130 is typically an air-cooled system, and the embodiments shown in fig. 1 and fig. 2 are both described with an air-cooled heat dissipation system. However, it should be noted that a liquid cooling system is generally provided for the cabinet in which the chassis is located. In some embodiments, a branch liquid cooling pipeline connected to the liquid cooling system of the cabinet may be disposed in the chassis 130; in this case, the fan management board may be replaced by a liquid cooling management board, which controls the temperature and flow of the cooling medium through a combination of temperature sensors and valves so as to dissipate heat from the nodes.
The fan management board 1101 is configured to connect the node 100 and the fan module 120, and connect the node 100 and the hard disk module through the connector, so that the node 100 can individually control or cooperatively control the fan module 120, and simultaneously can access and read/write to the hard disk in the hard disk module.
Example 1
Referring to fig. 4, fig. 4 is a flowchart of a method for detecting node misinsertion according to an embodiment of the present application, where the method is applied to a first controller of a server node, and the method specifically includes the following steps 401 to 403.
The server node may be any node in a multi-node server, or may be a server node that is inserted into a chassis and operates separately.
The server node may first obtain the identification information of the first chassis, and then confirm, according to the identification information, whether the server node is inserted into a matched chassis.
The first chassis refers to the chassis into which the server node is inserted. Specifically, after being inserted into the first chassis, the server node is connected to a management board of the first chassis, so that the first controller or BMC in the server node can access the second controller on the management board.
It should be understood that the management board of the first chassis may be a management board for overall management of chassis hardware, or may be a fan management board in the embodiment shown in fig. 2, or a circuit board with other similar functions; the management board is provided with a second controller and a memory unit (not shown in the figure) in which the identification information is pre-stored.
The identification information is used for identifying the model of the first chassis, and may be specifically the model information of the chassis or the serial number of the chassis. It is understood that the chassis may be of a model corresponding to the model of the multi-node server.
Specifically, after the server node is inserted into the first chassis and is powered on as a node of the multi-node server, the BMC or the first controller of the server node may request to obtain the identification information from the second controller, and the second controller obtains the identification information from the storage unit and returns the identification information to the BMC or the first controller of the server node.
For example, consider a node of the V7 version multi-node server (V7 node for short) inserted into a chassis corresponding to the V6 version multi-node server (V6 chassis for short). The CPLD on the management board of the V6 chassis may obtain in advance the identification information board ID = 0x69 from the storage unit of the management board and store it in the 0x105 register of that CPLD; this identification information identifies the V6 chassis. After the V7 node is inserted into the V6 chassis, the CPLD on the V7 node motherboard may read the board ID from the 0x105 register of the management board CPLD through the node connector.
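The register read in this example can be sketched as follows. This is only an illustrative model, not the actual CPLD firmware: the class `ManagementBoardCpld` and the function `get_chassis_board_id` are hypothetical names, and the node connector is modeled as a plain method call. The register address 0x105 and the value 0x69 are taken from the example above.

```python
# Register address taken from the example above.
BOARD_ID_REGISTER = 0x105


class ManagementBoardCpld:
    """Hypothetical stand-in for the second controller on the management board."""

    def __init__(self, board_id: int) -> None:
        # The board ID is copied from the storage unit into the register in advance.
        self._registers = {BOARD_ID_REGISTER: board_id}

    def read_register(self, address: int) -> int:
        return self._registers[address]


def get_chassis_board_id(management_cpld: ManagementBoardCpld) -> int:
    """Node-side read of the chassis identification information."""
    return management_cpld.read_register(BOARD_ID_REGISTER)


v6_chassis = ManagementBoardCpld(board_id=0x69)
print(hex(get_chassis_board_id(v6_chassis)))  # prints 0x69
```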
In one possible implementation, the first controller includes a BMC. The server node may not additionally provide the first controller, and the BMC may execute relevant functions and steps in the method embodiment of the present application.
The model of the multi-node server and the models of its nodes are encoded separately. A multi-node server may support nodes of one or more models, and the nodes of these models have a matching relationship with the chassis of the multi-node server. It can be appreciated that the matching relationship has been verified by strict industrial testing: when nodes of these models operate in the matched chassis, they can work at rated power without accumulating heat, while the power consumption of the chassis stays within a preset range, so that the energy efficiency meets the requirements of various policies.
Specifically, the nodes can generate heat during normal operation, and the case is provided with a heat dissipation system for dissipating heat generated by the nodes in the case; if the heat dissipation capability provided by the heat dissipation system of the chassis for one node can completely dissipate heat generated by the node in unit time, and meanwhile, the situation of excessive heat dissipation does not occur in the heat dissipation process, the node and the chassis have a matching relationship.
TABLE 1
For example, A-series nodes have a matching relationship with A-series chassis, and B-series nodes have a matching relationship with B-series chassis. Because the chassis for A-series nodes and the chassis for B-series nodes do not differ in appearance, a node of one series may be inserted into a chassis that does not match it. As shown in table 1, when an A-series node is inserted into a B-series chassis, the 600 W heat dissipation power that the B-series chassis provides for a single node is lower than the 800 W heat generating power of the A-series node, which results in heat accumulation in the B-series chassis. When a B-series node is inserted into an A-series chassis, the 800 W heat dissipation power that the A-series chassis provides for a single node is higher than the 600 W heat generating power of the B-series node, so the heat dissipation capacity of the A-series chassis may be excessive and the power consumption increases. Therefore, the node is considered to be inserted into an unmatched chassis in either of two cases: the heat generating power of the node is greater than the heat dissipation power of the chassis; or the heat generating power of the node is less than the heat dissipation power of the chassis and the difference between the two is greater than or equal to a preset threshold.
The heat generating power is the heat generated by the node in a unit time under the rated power state; the single-node heat dissipation power is the heat dissipated by a single node in a unit time of the chassis in a rated power state.
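The matching criterion above can be sketched as a small check. This is an illustration under stated assumptions: the function name `is_mismatched` and the 100 W threshold are hypothetical (the source does not give a threshold value), while the 800 W and 600 W figures come from the table 1 example.

```python
def is_mismatched(node_heat_w: float, chassis_dissipation_w: float,
                  threshold_w: float) -> bool:
    """Return True if the node counts as inserted into an unmatched chassis."""
    if node_heat_w > chassis_dissipation_w:
        # The chassis cannot remove all generated heat, so heat accumulates.
        return True
    # The chassis over-provisions cooling by at least the threshold.
    return chassis_dissipation_w - node_heat_w >= threshold_w


THRESHOLD_W = 100.0  # hypothetical value, not specified in the source

print(is_mismatched(800, 600, THRESHOLD_W))  # A-series node, B-series chassis: True
print(is_mismatched(600, 800, THRESHOLD_W))  # B-series node, A-series chassis: True
print(is_mismatched(800, 800, THRESHOLD_W))  # A-series node, A-series chassis: False
```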
The server node may store a white list in advance, where the white list is a preset identification information set of the second chassis; after the identification information of the first chassis is obtained, whether the server node is inserted into the matched chassis is judged according to the identification information and the white list, so that whether a node misplug event occurs is detected; if the identification information exists in the white list, it may be determined that the server node is inserted into the matched chassis, and the first controller executes step 402; if the identification information does not exist in the white list, it may be determined that the server node is inserted into a non-matching chassis, and the first controller performs step 403.
The white list may be stored in a memory of the server node or in a memory unit in the first controller. Specifically, if the comparison of the whitelist and the identification information is performed by the first controller, the whitelist may be stored in the memory unit of the first controller; if the comparison of the whitelist and the identification information is performed by the BMC, the whitelist may be stored in a memory location of the BMC.
The second chassis is a chassis that has a matching relationship with the server node. It can be understood that the data in the preset identification information set is the identification information of second chassis whose match has been confirmed by testing.
The first controller may compare the identification information with the white list to determine whether the identification information exists in the white list.
Specifically, when only one piece of identification information of the second chassis is in the white list, the first controller can compare the identification information of the first chassis with the identification information of the second chassis, and if the identification information is consistent with the identification information of the second chassis, the identification information is determined to exist in the white list; and if the identification information is inconsistent, determining that the identification information does not exist in the white list.
For example, when the V7 node is inserted into the V6 chassis, the CPLD of the V7 node, after obtaining the identification information board ID = 0x69 of the V6 chassis, may compare it with the node's own white list. The identification information in the white list of the V7 node is board ID = 0x96, so after the comparison the V7 node may determine that the identification information of the V6 chassis is not in the white list, that is, the V6 chassis does not match the V7 node, and the V7 node may execute step 403.
Specifically, when the white list has the identification information of a plurality of second chassis, the first controller may traverse the identification information of the first chassis in the white list, and if the data consistent with the identification information of the first chassis is traversed in the white list, determine that the server node is inserted into the matched chassis; if the traversing result is null, determining that the server node is inserted into a non-matched chassis.
For example, when the A5 node is inserted into the V6 chassis, the CPLD of the A5 node, after obtaining the identification information board ID = 0x69 of the V6 chassis, may compare it with the node's own second chassis white list. The identification information in the second chassis white list of the A5 node includes board ID1 = 0x66 and board ID2 = 0x69, so after the comparison the A5 node may determine that the identification information of the V6 chassis is in the white list, that is, the V6 chassis matches the A5 node, and the A5 node may execute step 402.
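The white list comparison in the two examples above can be sketched as a set membership test. The function and constant names are hypothetical; the board ID values and the step numbers 402 and 403 are taken from the text.

```python
from typing import AbstractSet

STEP_MATCHED = 402     # identification information found in the white list
STEP_MISMATCHED = 403  # identification information not in the white list


def select_step(chassis_board_id: int, whitelist: AbstractSet[int]) -> int:
    """Choose the next step from the white list lookup result."""
    return STEP_MATCHED if chassis_board_id in whitelist else STEP_MISMATCHED


V6_CHASSIS_BOARD_ID = 0x69
V7_NODE_WHITELIST = frozenset({0x96})
A5_NODE_WHITELIST = frozenset({0x66, 0x69})

print(select_step(V6_CHASSIS_BOARD_ID, V7_NODE_WHITELIST))  # prints 403
print(select_step(V6_CHASSIS_BOARD_ID, A5_NODE_WHITELIST))  # prints 402
```

A set gives the same result for a white list with one entry or many, so the single-entry comparison and the traversal described above collapse into one lookup.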
In one possible implementation, the first controller may be a CPLD.
Here, the server node entering the working state means that the server node operates at rated power.
When the first controller confirms that the server node is inserted into a matched chassis, it can conclude that, when the server node works in the chassis at rated power, service operation will not be affected by heat accumulation caused by insufficient heat dissipation capability of the chassis, and the power consumption of the chassis or of the whole server system will not increase because of excessive heat dissipation capability. The server node can therefore directly enter the working state.
The first operation is a preset operation, and the specifically executed first operation can be determined according to the service request received by the server node.
Specifically, when the first controller determines from the identification information that the server node is inserted into a matched chassis, it can confirm that the first chassis is a chassis whose match with the server node has been confirmed by industrial testing. The first controller can therefore execute the first operation to run the service according to the received service request, or can combine the real-time operating parameters of the server node to output information indicating that the server node is operating normally.
When the first controller confirms that the server node is inserted into the unmatched chassis, the first controller can confirm that the server node has a certain running risk at the moment, so that alarm information can be output through display equipment or equipment which can be perceived by other operators.
In one possible implementation, the first controller may suspend the start-up service and output alert information that the server node does not match the first chassis.
In another possible implementation, the first controller may further confirm the type of the running risk, and then execute the second operation corresponding to that risk type and output the corresponding alarm information.
Referring to fig. 5, fig. 5 is a flow chart of a method for detecting and alarming node risk according to an embodiment of the present application, where the method is applied to a first controller of a server node, and the server node is inserted into a first chassis; the method specifically includes the following steps 501 to 504.
In step 501, a first controller obtains a first parameter indicating a heat dissipation requirement of a server node, and a second parameter indicating a heat dissipation capability of a first chassis.
When the first controller determines that the identification information of the first chassis does not exist in the white list, it can determine that the server node is not matched with the first chassis, that is, a node misplug event occurs, and at the moment, alarm information of the node misplug needs to be output to an operator.
Specifically, the first controller may further detect whether a risk of heat accumulation or a risk of increased power consumption exists when the server node operates in the first chassis according to the heat dissipation capability of the first chassis and the heat dissipation requirement of the server node, and then execute a corresponding second operation according to the existing risk, and output corresponding alarm information.
Specifically, the first controller needs to obtain a first parameter indicating a heat dissipation requirement of the server node and a second parameter indicating a heat dissipation capability of the first chassis, and then determine risk according to the first parameter and the second parameter.
The first parameter indicating the heat dissipation requirement of the server node may be the actual working power of the server node, or may be the rated power or the thermal design power of the server node. Specifically, the BMC of the server node may obtain the actual operating power or rated power from the power module.
The thermal design power (TDP) is the heat generated by the server node per unit time at maximum load, in watts (W). The thermal design power can be pre-stored in the BMC of the server node or in a storage unit of the server node, or in a memory in the first controller of the server node.
The second parameter indicating the heat dissipation capacity of the first chassis may be pre-stored in a storage unit of the management board, or may be pre-stored in a storage unit of the server node or the BMC after being associated with the identification information of the first chassis.
Specifically, a storage unit or a BMC of the server node may store a corresponding relationship table of different chassis and second parameters of different chassis in advance, and when the first controller needs to obtain the second parameters, the corresponding second parameters may be obtained from the relationship table according to the identification information of the first chassis.
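The relationship table described above can be sketched as a dictionary keyed by chassis identification information. The table contents are hypothetical: the source does not say which heat dissipation capability belongs to which board ID, so the wattages below are placeholders.

```python
from typing import Optional

# Hypothetical relationship table: chassis board ID -> second parameter in watts.
CHASSIS_SECOND_PARAMETER_W = {
    0x66: 800.0,
    0x69: 600.0,
}


def lookup_second_parameter(board_id: int) -> Optional[float]:
    """Return the heat dissipation capability of a chassis, or None if unknown."""
    return CHASSIS_SECOND_PARAMETER_W.get(board_id)


print(lookup_second_parameter(0x69))  # prints 600.0
print(lookup_second_parameter(0x11))  # prints None
```

Returning `None` for an unknown chassis leaves it to the caller to fall back to the value stored on the management board.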
It is understood that the second parameter is the heat dissipated per unit time by the first chassis or by the heat dissipation components within the first chassis, measured in watts (W).
In one possible implementation, the first controller may obtain the thermal design power of the server node, as well as the first heat dissipation parameter. Wherein the second parameter includes a first heat dissipation parameter.
The first threshold is a preset threshold, and may be specifically determined according to actual requirements, for example, according to related requirements of the energy efficiency policy.
The first heat dissipation parameter is defined as follows: for the space in which the server node is located, it is the quotient of the maximum heat that the first chassis can dissipate from that space per unit time and the number of nodes contained in the space.
Specifically, the first chassis may be divided into one or more spaces, where each space is correspondingly provided with one or more heat dissipation components for dissipating heat.
It can be understood that when the server node solely shares the heat dissipation capability of one or more heat dissipation components, that is, only the server node itself is located in the space where the server node is located, the first heat dissipation parameter is the maximum heat that the one or more components can dissipate in a unit time; when the server node shares the heat dissipating capacity of the one or more heat dissipating components with another node, i.e. there are two nodes in the space, the first heat dissipation parameter is half the maximum value of the heat dissipated by the one or more components per unit time.
For example, when the server node is node 100A in the embodiment shown in fig. 1, the server node is located in the left space of the chassis 130, and shares the heat dissipation capability of the two fans on the left with node 100C; at this time, the first heat dissipation parameter is half of the maximum heat that the two fans can dissipate in a unit time.
Optionally, when all the nodes in the first chassis are in the same space, the first heat dissipation parameter is a quotient of a maximum heat dissipated by the first chassis in a unit time and the number of all the nodes.
Optionally, when the space where the server node is located only has the server node itself, that is, the server node monopolizes a space, the first heat dissipation parameter is the maximum heat that can be dissipated by one or more heat dissipation components corresponding to the space in a unit time.
The management board of the first chassis may pre-store a first heat dissipation parameter corresponding to each space of the first chassis, and the server node may determine a space where the server node is located according to a connector serial number or an interface serial number of the access management board, and then obtain the corresponding first heat dissipation parameter from the management board.
It will be appreciated that in the above implementation, the heat dissipating component may be considered to provide no heat dissipation to other spaces, except for the corresponding spaces.
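The per-space lookup can be sketched as follows. Everything concrete here is an assumption for illustration: the mapping from node connectors to spaces, the two-space layout, and the 1200 W maximum heat per space are hypothetical, and the names `Space` and `first_parameter_for_connector` do not come from the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Space:
    max_dissipation_w: float  # max heat the space's components remove per unit time
    node_count: int           # number of nodes contained in the space

    @property
    def first_heat_dissipation_parameter_w(self) -> float:
        # Quotient of the maximum heat and the number of nodes in the space.
        return self.max_dissipation_w / self.node_count


# Hypothetical layout: two spaces, each shared by two nodes.
SPACES = {"left": Space(1200.0, 2), "right": Space(1200.0, 2)}
CONNECTOR_TO_SPACE = {
    "1102A": "left", "1102C": "left",
    "1102B": "right", "1102D": "right",
}


def first_parameter_for_connector(connector_id: str) -> float:
    """Resolve the space from the connector, then return its per-node parameter."""
    return SPACES[CONNECTOR_TO_SPACE[connector_id]].first_heat_dissipation_parameter_w


print(first_parameter_for_connector("1102A"))  # prints 600.0
```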
When the thermal design power is less than or equal to the first heat dissipation parameter, the heat generated by the server node even under maximum load can be completely dissipated by the first chassis, and the service of the server node is not affected by accumulated heat; the server node can determine that it has no heat accumulation risk when operating in the first chassis. In this case, however, there may be a risk of increased power consumption due to excessive heat dissipation capacity. It is therefore necessary to determine whether the first chassis or the whole server system has a risk of increased power consumption by checking whether the difference between the first heat dissipation parameter and the thermal design power is greater than the first threshold.
On the contrary, when the thermal design power is greater than the first heat dissipation parameter, the heat generated by the server node cannot be completely dissipated by the first chassis, the part which is not dissipated by the first chassis is accumulated in the slot continuously along with the time, and the operation of the server node and the service thereof is affected after the accumulated heat reaches a certain degree.
Therefore, if the thermal design power is less than or equal to the first heat dissipation parameter and the difference between the first heat dissipation parameter and the thermal design power is less than or equal to the first threshold, the server node may determine that there is no risk of heat accumulation and no risk of power consumption increase; if the thermal design power is less than the first heat dissipation parameter and the difference between the first heat dissipation parameter and the thermal design power is greater than the first threshold, the server node may determine that there is a risk of increased power consumption; if the thermal design power is greater than the first heat dissipation parameter, the server node may determine that there is a risk of heat accumulation.
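The three-way judgment above can be sketched as a short classifier. The names `Risk` and `classify_by_tdp` and all wattage values are hypothetical and serve only to exercise each branch.

```python
from enum import Enum


class Risk(Enum):
    NONE = "no heat accumulation risk and no power consumption increase risk"
    POWER_INCREASE = "risk of increased power consumption"
    HEAT_ACCUMULATION = "risk of heat accumulation"


def classify_by_tdp(tdp_w: float, first_parameter_w: float,
                    first_threshold_w: float) -> Risk:
    """Compare thermal design power with the first heat dissipation parameter."""
    if tdp_w > first_parameter_w:
        return Risk.HEAT_ACCUMULATION
    if first_parameter_w - tdp_w > first_threshold_w:
        return Risk.POWER_INCREASE
    return Risk.NONE


# Hypothetical values in watts.
print(classify_by_tdp(800, 600, 100))  # Risk.HEAT_ACCUMULATION
print(classify_by_tdp(600, 800, 100))  # Risk.POWER_INCREASE
print(classify_by_tdp(750, 800, 100))  # Risk.NONE
```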
It can be understood that when the heat dissipation system of the first chassis is air-cooled, the heat dissipation fans of the air-cooled system may correspond to the space of the first chassis, that is, one or more heat dissipation fans are disposed for each space to dissipate heat of the nodes in the slot, where the first heat dissipation parameter may be obtained according to the maximum heat that can be dissipated by the heat dissipation fans of the space in unit time and the number of nodes in the space; when the heat dissipation system of the first chassis is liquid-cooled, if the liquid-cooled system is provided with branch liquid-cooled pipelines corresponding to each space, the first heat dissipation parameter can be obtained according to the maximum heat which can be dissipated by the branch liquid-cooled pipelines of the space in unit time and the number of nodes accommodated in the space; if the liquid cooling system sets a shared liquid cooling pipeline for a plurality of slots of the first chassis, the first heat dissipation parameter can be regarded as a quotient of a maximum heat dissipated by the shared liquid cooling pipeline in a unit time and the number of all nodes in the plurality of slots.
In another possible implementation, the first controller may obtain an average power of the server node for a preset period of time, and the second heat dissipation parameter. Wherein the second parameter includes a second heat dissipation parameter.
The second threshold is a preset threshold, and may be specifically determined according to actual requirements, for example, according to related requirements of the energy efficiency policy.
The server node can record the real-time power of the server node after being electrified, and when the average power needs to be calculated, the real-time power data in the last preset time period is obtained through the BMC to carry out corresponding calculation. It can be understood that the preset time period is a preset time period after the server node is powered on.
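Averaging the recorded real-time power over the most recent preset time period can be sketched as below. The sample format (timestamp in seconds, power in watts), the function name, and the sample values are assumptions made for illustration.

```python
from typing import Iterable, Tuple


def average_power_w(samples: Iterable[Tuple[float, float]],
                    now_s: float, window_s: float) -> float:
    """Average the power samples recorded in the last `window_s` seconds."""
    recent = [power for timestamp, power in samples
              if now_s - window_s <= timestamp <= now_s]
    if not recent:
        raise ValueError("no power samples in the requested time period")
    return sum(recent) / len(recent)


# Hypothetical samples: (seconds since power-on, real-time power in watts).
recorded = [(0, 500.0), (10, 600.0), (20, 800.0), (30, 700.0)]
print(average_power_w(recorded, now_s=30, window_s=20))  # prints 700.0
```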
The second heat dissipation parameter is defined as follows: it is the maximum power, corresponding to the heat dissipation capability of the first chassis, at which the server node can work normally in the first chassis without accumulating heat. That is, when the working power of the server node in the first chassis is less than the second heat dissipation parameter, no heat accumulation occurs.
Optionally, the second heat dissipation parameter is a quotient of a maximum heat dissipation capacity of the first chassis and a number of nodes of the multi-node server. For example, the heat dissipation system of the first chassis can completely dissipate the heat generated by the 4-node server with the working power of 3200W at most, and at this time, the second heat dissipation parameter is 800W.
Optionally, the second heat dissipation parameter is a quotient of a maximum heat dissipation capability that can be provided by a heat dissipation component corresponding to a space where the server node is located and a number of nodes accommodated in the space. For example, the space where the server node is located can only accommodate one node, and the first chassis is provided with only one cooling fan for the space, and the cooling fan can completely dissipate heat generated by the node with the maximum working power of 1000W, and at this time, the second cooling parameter provided by the first chassis for the server node is 1000W; if two server nodes are accommodated in the space, the second heat dissipation parameter provided by the first chassis for each server node is 500W.
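The quotients in the two examples above can be checked with a few lines of arithmetic. The function name is hypothetical; the figures (3200 W over 4 nodes, 1000 W over 1 or 2 nodes) are the ones given in the text.

```python
def second_heat_dissipation_parameter_w(max_dissipation_w: float,
                                        node_count: int) -> float:
    """Quotient of the maximum heat dissipation capability and the node count."""
    if node_count <= 0:
        raise ValueError("node_count must be positive")
    return max_dissipation_w / node_count


print(second_heat_dissipation_parameter_w(3200, 4))  # prints 800.0
print(second_heat_dissipation_parameter_w(1000, 1))  # prints 1000.0
print(second_heat_dissipation_parameter_w(1000, 2))  # prints 500.0
```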
When the average power is less than or equal to the second heat dissipation parameter, the heat generated by the operation of the server node can be completely dissipated by the first chassis: heat accumulated under peak load can be dissipated under low load, and the service of the server node is not affected by accumulated heat. The server node can then determine that it has no heat accumulation risk when operating in the first chassis. In this case, however, there may be a risk of increased power consumption due to excessive heat dissipation capacity, so it is necessary to determine whether the first chassis or the whole server system has a risk of increased power consumption by checking whether the difference between the second heat dissipation parameter and the average power is greater than the second threshold.
On the contrary, when the average power is greater than the second heat dissipation parameter, the heat generated by the operation of the server node cannot be completely dissipated by the first chassis, the part which is not dissipated by the first chassis is accumulated in the slot continuously along with the time, and the operation of the server node and the service thereof is affected after the accumulated heat reaches a certain degree.
Therefore, if the average power is less than or equal to the second heat dissipation parameter, and the difference between the second heat dissipation parameter and the average power is less than or equal to a second threshold, the server node may determine that there is no risk of heat accumulation and no risk of power consumption increase; if the average power is smaller than the second heat dissipation parameter and the difference between the second heat dissipation parameter and the average power is larger than a second threshold, the server node may determine that there is an increased risk of power consumption; if the average power is greater than the second heat dissipation parameter, the server node may determine that there is a risk of heat accumulation.
In another possible implementation, the server node may obtain the power rating of the server node, as well as the second heat dissipation parameter.
It can be understood that the implementation principle of judging the node operation risk by the rated power and the second heat dissipation parameter of the server node is similar to the aforementioned implementation principle of judging the node operation risk by the average power and the second heat dissipation parameter of the server node within the preset time period, and will not be repeated here.
The first controller can acquire the first parameter and the second parameter through the above implementations, and judge from them whether a heat accumulation risk or a power consumption increase risk exists. If the first parameter is less than or equal to the second parameter and the difference between the second parameter and the first parameter is less than or equal to a preset threshold, the first controller may determine that there is currently no risk of heat accumulation and no risk of power consumption increase, and execute step 502; if the first parameter is less than the second parameter and the difference between the second parameter and the first parameter is greater than the preset threshold, the first controller may determine that there is currently a risk of increased power consumption, and execute step 503; if the first parameter is greater than the second parameter, the first controller may determine that there is currently a risk of heat accumulation, and execute step 504.
When the first parameter is the heat design power and the second parameter is the first heat dissipation parameter, the preset threshold is the first threshold; when the first parameter is the rated power of the server node or the average power within the preset time period and the second parameter is the second heat dissipation parameter, the preset threshold is the second threshold.
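The choice of threshold and the resulting step can be sketched together. The enum, the function names, and the example values are hypothetical; the step numbers 502 to 504 and the pairing of thresholds with parameter types follow the text, with heat accumulation judged when the first parameter exceeds the second parameter.

```python
from enum import Enum


class FirstParameterKind(Enum):
    THERMAL_DESIGN_POWER = "thermal design power"
    RATED_POWER = "rated power"
    AVERAGE_POWER = "average power in the preset time period"


def preset_threshold_w(kind: FirstParameterKind, first_threshold_w: float,
                       second_threshold_w: float) -> float:
    """The first threshold pairs with TDP, the second with rated or average power."""
    if kind is FirstParameterKind.THERMAL_DESIGN_POWER:
        return first_threshold_w
    return second_threshold_w


def select_step(first_parameter_w: float, second_parameter_w: float,
                threshold_w: float) -> int:
    if first_parameter_w > second_parameter_w:
        return 504  # heat accumulation risk
    if second_parameter_w - first_parameter_w > threshold_w:
        return 503  # power consumption increase risk
    return 502      # neither risk


# Hypothetical thresholds and powers in watts.
threshold = preset_threshold_w(FirstParameterKind.AVERAGE_POWER, 100.0, 150.0)
print(select_step(700.0, 800.0, threshold))  # prints 502
```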
When the server node runs in the first chassis without heat accumulation risk and power consumption increase risk, the first controller can control the server node to enter a working state; for safety, the server node needs to output the first alarm information to the audio/video output device which can be perceived by the operator when entering the working state. Specifically, the alarm degree of the first alarm information is a slight alarm, the content of the first alarm information comprises the current occurrence of a node misplug event, the identification information of the first chassis is not in a white list of the server node, and the service running has a certain risk.
In one possible implementation, before outputting the first alarm information, the first controller may detect the degree of adaptation between the server node and other hardware in the first chassis, so as to detect whether the server node faces other running risks in the first chassis. If other running risks are detected, the BMC adjusts the corresponding hardware in the server node or the first chassis to reduce those risks, and the corresponding information is added to the first alarm information; if no running risk is detected, the first controller may add the identification information of the first chassis to the white list of the server node.
In step 503, when the first parameter is smaller than the second parameter and the difference between the second parameter and the first parameter is greater than a preset threshold, the first controller reduces the power of the heat dissipation component corresponding to the server node and outputs second alarm information.
The heat dissipation component corresponding to the server node is a heat dissipation component corresponding to a space where the server node is located in the first chassis.
When the server node runs in the first chassis without heat accumulation risk, but with increased power consumption, the situation of excessive heat dissipation capacity may occur, and at this time, the first controller may reduce the power of the heat dissipation component corresponding to the server node so as to reduce the power consumption, and output second alarm information to the audio/video output device that can be perceived by the operator. The alarm degree of the second alarm information is a moderate alarm, the content of the second alarm information comprises a node misplug event which occurs currently, and the risk of power consumption increase exists.
Specifically, the BMC of the server node may obtain real-time power of the server node from the power module, determine a real-time heat dissipation requirement of the server node from a real-time power-heat dissipation requirement table preset in the server node, and finally control power of a heat dissipation component corresponding to the server node according to the heat dissipation requirement.
It is understood that when the heat dissipation component corresponding to the server node also provides heat dissipation for other nodes, the BMC of the server node may control the power of the heat dissipation component cooperatively with the BMCs of the other nodes.
When a heat accumulation risk exists while the server node runs in the first chassis, the first chassis cannot dissipate the heat generated by the server node. The first controller therefore needs to limit the real-time maximum power of the server node so that it is always less than or equal to the first heat dissipation parameter or the second heat dissipation parameter, and at the same time output third alarm information through audio/video equipment that operators can perceive.
Specifically, the alarm level of the third alarm information is serious. Its content includes that a node misplug event has occurred, that the first chassis into which the node was mistakenly inserted cannot meet the heat dissipation requirement of the server node, that a large heat accumulation risk exists, and that the node corresponding to the chassis needs to be replaced immediately.
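The power limitation and the third alarm information can be sketched as a simple clamp. The names `Alarm` and `limit_node_power`, the use of watts, and the alarm wording are hypothetical and serve only as an illustration; the source does not specify how the limit is enforced.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alarm:
    level: str
    message: str


def limit_node_power(requested_w: float, heat_limit_w: float) -> Tuple[float, Alarm]:
    """Clamp the node's allowed maximum power to the chassis heat limit.

    heat_limit_w stands for the first or second heat dissipation parameter,
    expressed here in watts (an assumption made for this sketch).
    """
    if heat_limit_w < 0:
        raise ValueError("heat_limit_w must be non-negative")
    allowed_w = min(requested_w, heat_limit_w)
    alarm = Alarm(
        level="serious",
        message=(
            "Node misplug event: the first chassis cannot meet the heat "
            "dissipation requirement of the server node; heat accumulation risk."
        ),
    )
    return allowed_w, alarm
```

For example, a node requesting 600 W in a chassis whose limit is 450 W would be held at 450 W, while a node requesting less than the limit is unaffected by the clamp.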
In this embodiment of the application, after the server node is inserted into the first chassis and powered on, the first controller can acquire the identification information of the first chassis and judge, according to the identification information, whether the server node is inserted into a matching chassis. When the server node does not match the first chassis, the second operation is executed to confirm the current running risk and output corresponding alarm information, so that operators receive the refined alarm information output by the server node as soon as a node misplug event occurs. Because node misplug events are guarded against in software, chassis with different structures do not need to be replaced when the multi-node server is replaced, which improves the reuse rate of the chassis and saves the management cost of chassis products.
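The white list decision at the heart of this embodiment can be illustrated with a minimal sketch. The function name `select_operation`, the string return values, and the example chassis identifiers are hypothetical; how the identification information is actually read from the chassis is outside the scope of this sketch.

```python
from typing import Set


def select_operation(chassis_id: str, whitelist: Set[str]) -> str:
    """Return which operation the first controller would run for this chassis."""
    if chassis_id in whitelist:
        # Matching chassis: enter the working state and run the first operation.
        return "first_operation"
    # Unknown chassis: assess the running risk and output alarm information.
    return "second_operation"


# Hypothetical identifiers, used only to show the call.
known_chassis = {"CHASSIS-A", "CHASSIS-B"}
print(select_operation("CHASSIS-A", known_chassis))
print(select_operation("CHASSIS-X", known_chassis))
```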
Example two
Referring to fig. 6, fig. 6 is a flowchart illustrating another method for detecting misinsertion of a node according to an embodiment of the present disclosure, where the method is applied to a first controller in a server node; the method specifically includes the following steps 601 to 603.
The server node may be any node in a multi-node server, or may be a server node that is inserted into a chassis and operates separately.
In step 601, a first controller obtains a first parameter indicating a heat dissipation requirement of a server node, and a second parameter indicating a heat dissipation capability of a first chassis.
After the server node is inserted into the first chassis, the first controller may obtain the first parameter from a power module, a BMC, or a storage unit of the server node, and obtain the second parameter from a management board of the first chassis. The first controller then detects, according to the first parameter and the second parameter, whether a heat accumulation risk or a power consumption increase risk exists when the server node runs in the first chassis. If neither risk exists, the server node may determine that it is inserted into a matching chassis and execute step 602; if either risk exists, the server node may determine that it is inserted into a non-matching chassis and execute step 603.
In this embodiment, the second parameter is preset in the management board of the first chassis. The manner in which the first controller obtains the second parameter is similar to the manner in which it obtains the identification information of the first chassis in step 401 of the embodiment shown in fig. 4, with the second parameter taking the place of the identification information; reference may be made to the relevant parts of the embodiment shown in fig. 4, which are not repeated herein.
The manner in which the first controller detects, according to the first parameter and the second parameter, whether a heat accumulation risk or a power consumption increase risk exists when the server node runs in the first chassis is similar to the manner in which it determines these risks in step 501 of the embodiment shown in fig. 5; reference may be made to the relevant parts of the embodiment shown in fig. 5, which are not repeated herein.
The first controller determines whether the server node is inserted into a matching chassis according to whether the heat dissipation capacity of the first chassis can meet the heat dissipation requirement of the server node. When it can, the first controller can confirm that the server node runs in the first chassis without heat dissipation problems, and the first operation it performs may be to detect the compatibility between the server node and the hardware in the first chassis, so as to confirm whether other running risks exist.
Specifically, when other running risks are detected, hardware parameters in the server node or the first chassis can be adjusted through the BMC to reduce those risks, and corresponding alarm information is output at the same time. For example, when the power supply voltage of the power supply system is detected to be too high, the working parameters of the voltage converter of the chassis can be controlled to reduce the input voltage of the server node; or, when it is detected that cooperation with the BMCs of other nodes in the multi-node server is not possible, the cooperating nodes can be requested to suspend cooperative control of the hardware in the first chassis, while the hardware is controlled to work at a preset power. When no such risk is detected, the server node can run services according to the received service requests, and can also output information indicating that the server node is operating normally, based on its real-time operating parameters.
When the server node determines that it is inserted into a non-matching chassis because a heat accumulation risk or a power consumption increase risk exists, the first controller may further confirm, according to the magnitude relation between the first parameter and the second parameter determined in step 601, which of the two risks the server node faces when running in the first chassis, so as to output corresponding alarm information.
If the first parameter is smaller than the second parameter and the difference between the second parameter and the first parameter is larger than a preset threshold, the first controller can confirm that the power consumption increasing risk exists currently, so that the power of a heat dissipation component corresponding to the server node in the first chassis can be reduced, and second alarm information can be output.
If the first parameter is greater than the second parameter, the first controller can confirm that the heat accumulation risk exists currently, so that the real-time maximum power of the server node can be limited, and third alarm information can be output.
The specific implementation of step 603 is similar to steps 503 to 504 in the embodiment shown in fig. 5; reference may be made to the relevant parts of that embodiment, which are not repeated herein.
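The comparison of the first parameter with the second parameter can be sketched as a three-way classification. The names `Risk` and `classify_risk` are hypothetical, and the sketch assumes both parameters and the preset threshold are expressed in the same unit.

```python
from enum import Enum


class Risk(Enum):
    NONE = "none"
    POWER_INCREASE = "power consumption increase"
    HEAT_ACCUMULATION = "heat accumulation"


def classify_risk(first_param: float, second_param: float, threshold: float) -> Risk:
    """Classify the running risk from the heat dissipation requirement
    (first_param) and the heat dissipation capacity (second_param)."""
    if first_param > second_param:
        # The chassis cannot dissipate what the node produces.
        return Risk.HEAT_ACCUMULATION
    if second_param - first_param > threshold:
        # Capacity exceeds the requirement by more than the preset threshold.
        return Risk.POWER_INCREASE
    return Risk.NONE
```

Under this sketch, `Risk.NONE` corresponds to the matching-chassis branch (step 602), `Risk.POWER_INCREASE` to reducing the power of the heat dissipation component and outputting second alarm information, and `Risk.HEAT_ACCUMULATION` to limiting the real-time maximum power of the server node and outputting third alarm information.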
In other embodiments, the first controller may suspend the startup service and output an alert that the server node does not match the first chassis.
According to the embodiment of the application, the first parameter indicating the heat dissipation requirement of the server node and the second parameter indicating the heat dissipation capacity of the first chassis are obtained, then the risk existing in the operation of the server node is determined according to the first parameter and the second parameter, and finally the operation and the classified output of the alarm information of the server node are controlled according to different risks, so that the node misplug event can be processed and alarmed more finely, and the management of the multi-node server by operators is facilitated.
The embodiment of the application also provides a server, which comprises a first controller, wherein the first controller is used for executing the method in the embodiment shown in fig. 4 to 6.
The embodiment of the application also provides a server system, which comprises a case and a plurality of servers arranged on the case; wherein the first controller in each server is adapted to perform the method as described in the embodiments shown in fig. 4-6.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the embodiments of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the embodiments of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing program code.
Claims (10)
1. A node misplug detection method, characterized in that the method is applied to a first controller in a server node, and the method comprises the following steps:
acquiring identification information of a first chassis, wherein the first chassis is a chassis in which the server node is inserted;
controlling the server node to enter a working state and executing a first operation under the condition that the identification information exists in the white list of the server node;
and executing a second operation and outputting alarm information when the identification information does not exist in the white list of the server node.
2. The method of claim 1, wherein performing the second operation and outputting the alert information comprises:
acquiring a first parameter indicating the heat dissipation requirement of the server node and a second parameter indicating the heat dissipation capacity of the first chassis;
controlling the server node to enter the working state and outputting first alarm information under the condition that the first parameter is smaller than or equal to the second parameter and the difference value between the second parameter and the first parameter is smaller than or equal to a preset threshold value;
when the first parameter is smaller than or equal to the second parameter and the difference value between the second parameter and the first parameter is larger than the preset threshold value, reducing the power of a heat dissipation part corresponding to the server node in the first chassis and outputting second alarm information;
and limiting the real-time maximum power of the server node and outputting third alarm information under the condition that the first parameter is larger than the second parameter.
3. The method of claim 2, wherein the first parameter comprises a thermal design power of the server node and the second parameter comprises a first heat dissipation parameter; wherein the first heat dissipation parameter is the quotient of the maximum amount of heat that the first chassis can dissipate per unit time for the space where the server node is located, divided by the number of nodes contained in that space.
4. The method of claim 2, wherein the first parameter comprises an operating power of the server node and the second parameter comprises a second heat dissipation parameter; wherein the second heat dissipation parameter is the maximum power, corresponding to the heat dissipation capacity of the first chassis, at which the server node can operate in the working state in the first chassis without accumulating heat.
5. The method of claim 4, wherein the operating power is a power rating of the server node.
6. The method of claim 4, wherein the operating power is an average power of the server node over a preset period of time.
7. A node misplug detection method, characterized in that the method is applied to a first controller in a server node, and the method comprises the following steps:
acquiring a first parameter indicating the heat dissipation requirement of the server node and a second parameter indicating the heat dissipation capacity of a first chassis, wherein the first chassis is a chassis into which the server node is inserted;
controlling the server node to enter a working state and executing a first operation under the condition that the first parameter is smaller than or equal to the second parameter and the difference value between the second parameter and the first parameter is smaller than or equal to a preset threshold value;
and executing a second operation and outputting alarm information in the case that the first parameter is smaller than the second parameter and the difference value between the second parameter and the first parameter is larger than the preset threshold value or in the case that the first parameter is larger than the second parameter.
8. The method of claim 7, wherein performing the second operation and outputting the alert information comprises:
when the first parameter is smaller than the second parameter and the difference value between the second parameter and the first parameter is larger than the preset threshold value, reducing the power of a heat dissipation part corresponding to the server node in the first chassis and outputting second alarm information;
and limiting the real-time maximum power of the server node and outputting third alarm information under the condition that the first parameter is larger than the second parameter.
9. A server comprising a chassis and a server node disposed within the chassis, the server node comprising a first controller for performing the method of any one of claims 1-8.
10. The server of claim 9, wherein the first controller is an out-of-band management controller (BMC).
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202310207253.5A CN116361088A (en) | 2023-03-06 | 2023-03-06 | Node misplug detection method and server |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN116361088A true CN116361088A (en) | 2023-06-30 |
Family
ID=86926845
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202310207253.5A Pending CN116361088A (en) | 2023-03-06 | 2023-03-06 | Node misplug detection method and server |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN116361088A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2025102717A1 (en) * | 2023-11-15 | 2025-05-22 | 超聚变数字技术有限公司 | Misinsertion detection method for external device, and computing device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |