CN119690754A

CN119690754A - A memory fault processing method, device, medium and server

Info

Publication number: CN119690754A
Application number: CN202411950702.6A
Authority: CN
Inventors: 孔祥宇
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2024-12-27
Filing date: 2024-12-27
Publication date: 2025-03-25

Abstract

The present invention discloses a memory fault management method, device, medium and server, which relate to the field of servers and address the system instability and reliability problems caused by memory faults. After receiving the fault alarm information of the target memory, the processor first parses the alarm information and verifies its validity. If the fault alarm information is valid, the target physical address and target power supply corresponding to the target memory are determined according to the fault alarm information; the target memory is isolated according to the target physical address, and a power-off instruction is sent to the target power supply to power off the target memory. This allows the faulty memory to be isolated and replaced while the server is running, which improves the maintainability of the memory and reduces the business interruptions, maintenance costs and downtime caused by memory failures.

Description

Memory fault processing method, device, medium and server

Technical Field

The present invention relates to the field of servers, and in particular, to a method, an apparatus, a medium, and a server for managing a memory failure.

Background

With the rapid development of information technology and the continuous progress of server technology, servers play a core role in key fields such as cloud computing, big data, artificial intelligence and the Internet of Things, which places unprecedented demands on server performance, stability and scalability. In the complex architecture of a server, memory acts as a high-speed bridge between the CPU and the storage devices: it temporarily stores data and accelerates program execution, so its importance is self-evident.

However, as server workloads grow and memory capacity and speed increase, memory faults have become a key factor affecting system stability and reliability. They may cause data loss and service interruption, and may even trigger a chain reaction that affects the normal operation of the whole system. Conventional memory fault handling generally requires powering off the server before replacing the memory, which interrupts service and harms its continuity and stability.

Disclosure of Invention

The invention aims to provide a memory fault management method, device, medium and server that allow a faulty memory to be isolated and replaced while the server is running, which improves the maintainability of the memory and reduces the service interruptions, maintenance costs and downtime caused by memory faults.

In a first aspect, the present application provides a memory fault management method, applied to a processor in a server, where the server further includes a plurality of memories and a plurality of power supplies in one-to-one correspondence with the memories, each of the memories is connected to a power source through its corresponding power supply, and the memory fault management method includes:

when fault alarm information of a target memory is received, parsing the fault alarm information;

determining whether the fault alarm information is valid according to the parsing result, and if the fault alarm information is valid, determining a target physical address and a target power supply corresponding to the target memory according to the fault alarm information;

and isolating the target memory according to the target physical address, and sending a power-down instruction to the target power supply to power off the target memory.

Optionally, the fault alarm information includes an identifier of a target central processing unit to which the target memory belongs, a sequence number of the target memory under the target central processing unit, and a value of a target register corresponding to the target memory, where the value in the target register indicates whether the target memory has a fault;

parsing the fault alarm information when the fault alarm information of the target memory is received includes:

parsing, from the fault alarm information, the identifier of the target central processing unit to which the target memory belongs, the sequence number of the target memory under the target central processing unit, and the value of the target register;

determining whether the fault alarm information is valid according to the parsing result includes:

determining whether the parsed identifier of the target central processing unit is within a first preset range;

determining whether the parsed sequence number of the target memory under the target central processing unit is within a second preset range;

determining whether the parsed value of the target register is a preset value;

if the identifier of the target central processing unit is within the first preset range, the sequence number of the target memory under the target central processing unit is within the second preset range, and the value of the target register is the preset value, determining that the fault alarm information is valid;

and if the identifier of the target central processing unit is outside the first preset range, or the sequence number of the target memory under the target central processing unit is outside the second preset range, or the value of the target register is not the preset value, determining that the fault alarm information is invalid.

Optionally, the fault alarm information includes a value of a target register corresponding to the target memory, where the value in the target register is used to characterize whether the target memory has a fault, and after determining a target physical address and a target power supply corresponding to the target memory according to the fault alarm information, the method further includes:

Determining a register address of a target register corresponding to the target memory;

isolating the target memory according to the target physical address, including:

and writing the target physical address and the register address of the target register into a memory management register so that a basic input/output system isolates the target memory based on information in the memory management register.

Optionally, the server further includes a plurality of fault indication devices corresponding to the memories one by one, each fault indication device is disposed on a memory slot of the memory corresponding to the fault indication device, and after determining, according to the fault alarm information, a target physical address and a target power supply corresponding to the target memory, the method further includes:

determining, according to the fault alarm information, the target memory slot in which the target memory is located and the fault type of the target memory;

after isolating the target memory according to the target physical address and sending a power-down instruction to the target power supply to power off the target memory, the method further includes:

determining the target state of the target fault indication device corresponding to the target memory by looking up a preset correspondence according to the fault type, where the preset correspondence is a one-to-one correspondence between fault types and states of the fault indication device;

and sending a control signal to the target fault indication device to update the state of the target fault indication device to the target state.

Optionally, after isolating the target memory according to the target physical address and sending a power-down instruction to the target power supply to power off the target memory, the method further includes:

loading a standby memory, and sending a power-on instruction to a standby power supply corresponding to the standby memory, so that the standby power supply drives its internal charge pump inverter and negative linear regulator to start working, and when the negative voltage output by the negative linear regulator reaches a preset percentage of a set value, the standby power supply closes the path between the power source and the standby memory.

Optionally, the fault alarm information includes a value of a register corresponding to the memory, where the value in the register indicates whether the memory has a fault, and after the standby power supply closes the path between the power source and the standby memory, the method further includes:

performing a hardware self-check on the standby memory, where the hardware self-check includes at least detection of the physical connection state between the standby memory and its corresponding central processing unit, detection of the data transmission rate of the standby memory, and detection of the error rate of the standby memory;

and when the result of the hardware self-check meets the preset requirement, resetting the value of the register corresponding to the standby memory to a second value and triggering the basic input/output system to load the standby memory, where the second value indicates that the memory has no fault.

Optionally, the server further includes a plurality of fault recording modules in one-to-one correspondence with the memories, where each fault recording module records the running state and fault information of its memory, and after triggering the basic input/output system to load the standby memory, the method further includes:

updating the fault recording module corresponding to the target memory according to the fault alarm information, and recording the time at which the fault of the target memory occurred, the fault type, the target physical address and the loading state of the standby memory;

generating log information covering the fault handling flow of the target memory according to the updated fault record data in the fault recording module, and sending a fault prompt through a management console or a remote interface.

In a second aspect, the present application provides a memory failure management apparatus, including:

A memory for storing a computer program;

A processor for implementing the steps of the memory fault management method as described above when executing the computer program.

In a third aspect, the present application provides a non-volatile storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a memory failure management method as described above.

In a fourth aspect, the present application provides a server, including the memory fault management device as described above, and further including a plurality of memories and a plurality of power supplies in one-to-one correspondence with the memories, where each of the memories is connected to a power source through its corresponding power supply, and the memory fault management device is connected to each of the memories and each of the power supplies, respectively.

The application provides a memory fault management method, device, medium and server, which address the system instability and reliability problems caused by memory faults. After receiving the fault alarm information of the target memory, the processor first parses the alarm information and verifies its validity. If the fault alarm information is valid, the processor determines the target physical address and the target power supply corresponding to the target memory according to the fault alarm information, isolates the target memory according to the target physical address, and sends a power-down instruction to the target power supply to power off the target memory. The faulty memory can thus be isolated and replaced while the server is running, which improves the maintainability of the memory and reduces the service interruptions, maintenance costs and downtime caused by memory faults.

Drawings

For a clearer description of embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.

FIG. 1 is a flow chart of a memory fault management method according to the present invention;

FIG. 2 is a schematic diagram of a power supply according to the present invention;

FIG. 3 is a schematic diagram of a memory fault management device according to the present invention;

FIG. 4 is a schematic diagram of a nonvolatile storage medium according to the present invention.

Detailed Description

The core of the invention is to provide a memory fault management method, device, medium and server that allow a faulty memory to be isolated and replaced while the server is running, which improves the maintainability of the memory and reduces the service interruptions, maintenance costs and downtime caused by memory faults.

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

In a first aspect, the present application provides a memory fault management method, applied to a processor in a server, where the server further includes a plurality of memories and a plurality of power supplies in one-to-one correspondence with the plurality of memories, each memory is connected to a power source through its corresponding power supply, and the memory fault management method includes:

S11: when fault alarm information of a target memory is received, parsing the fault alarm information;

In this step, receiving the fault alarm information of the target memory first triggers the parsing flow, which identifies the key data fields contained in the alarm information and their meanings. The fault alarm information may come from, but is not limited to, a hardware monitoring module (such as an ECC check unit or a memory controller), and may include the identifier of the target CPU to which the target memory belongs, the sequence number of the target memory under that CPU, and the value of the register corresponding to the target memory. Parsing first extracts and separates these core fields from the received alarm information, ensuring that each field is complete and identifiable. The extracted fields are then checked semantically: whether the target CPU identifier belongs to the CPU range supported by the current server, so that the identifier is valid and matches the actual hardware configuration; whether the sequence number of the target memory under the target CPU is within the legal range, to avoid misoperation caused by a wrong number; and whether the register value equals a preset fault identifier value (such as an abnormality flag in a specific bit), to further confirm that the memory fault is real.

False alarms caused by hardware noise or other interference factors can be effectively filtered through a multi-level analysis and verification mechanism, so that the accuracy and reliability of analysis results are ensured, and an accurate input basis is provided for subsequent fault processing steps.
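The field-extraction part of S11 can be sketched as follows. The source does not specify the wire format of the alarm, so the 4-byte layout and the names `FaultAlarm` and `parse_alarm` are assumptions made purely for illustration.

```python
import struct
from dataclasses import dataclass

# Hypothetical layout (not given in the source), little-endian:
# 1 byte CPU identifier, 1 byte memory sequence number, 2 bytes register value.
_ALARM_LAYOUT = struct.Struct("<BBH")


@dataclass(frozen=True)
class FaultAlarm:
    cpu_id: int      # identifier of the CPU that owns the memory
    mem_seq: int     # sequence number of the memory under that CPU
    reg_value: int   # value read from the memory's status register


def parse_alarm(raw: bytes) -> FaultAlarm:
    """Split a raw alarm record into its fields, rejecting truncated input."""
    if len(raw) != _ALARM_LAYOUT.size:
        raise ValueError(f"expected {_ALARM_LAYOUT.size} bytes, got {len(raw)}")
    cpu_id, mem_seq, reg_value = _ALARM_LAYOUT.unpack(raw)
    return FaultAlarm(cpu_id, mem_seq, reg_value)
```

Rejecting records of the wrong length corresponds to the integrity check on each field; the semantic range checks are a separate step.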

S12: determining whether the fault alarm information is valid according to the parsing result, and if the fault alarm information is valid, determining a target physical address and a target power supply corresponding to the target memory according to the fault alarm information;

In this step, the validity of the fault alarm information is further verified based on the parsing result, so that subsequent processing targets only real, accurately identified memory faults and misoperation is avoided. For example, the memory address mapping table of the server is consulted to determine whether the target memory indicated by the alarm exists and has been allocated to a current task or process. Real-time status data of the memory (such as temperature, supply voltage, current fluctuation or ECC error count) is checked to evaluate whether there is an anomaly consistent with the alarm description, which rules out false alarms caused by short-term interference. In addition, the processor can combine the alarm timestamp with the system log to analyze whether the fault is related to recent operations (such as frequent reads and writes or power supply fluctuation), which improves the reliability of the alarm.

If the fault alarm information is confirmed to be valid, the processor extracts the physical address of the target memory from the memory address mapping table and looks up the connection information between that physical address and the power supplies, thereby locating the target power supply corresponding to the memory.

This process ensures that subsequent isolation and power down operations can be accurately performed on the target memory and associated power supplies without affecting other normally operating memory modules.
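The lookup in S12 can be illustrated with a small table keyed by CPU identifier and memory sequence number. The table contents, the `MemoryEntry` fields and the function name `locate_target` are hypothetical; the source only states that a memory address mapping table and power supply connection information exist.

```python
from typing import Dict, NamedTuple, Optional, Tuple


class MemoryEntry(NamedTuple):
    phys_base: int   # base physical address of the memory module
    size: int        # size of the module in bytes
    supply_id: int   # identifier of the power supply feeding this module


# Hypothetical mapping table: (cpu_id, mem_seq) -> entry.
ADDRESS_MAP: Dict[Tuple[int, int], MemoryEntry] = {
    (0, 0): MemoryEntry(0x0000_0000, 0x4000_0000, 0),
    (0, 1): MemoryEntry(0x4000_0000, 0x4000_0000, 1),
    (1, 0): MemoryEntry(0x8000_0000, 0x4000_0000, 2),
}


def locate_target(cpu_id: int, mem_seq: int) -> Optional[MemoryEntry]:
    """Return the physical address and power supply of the target memory,
    or None when the alarm points at a module the table does not know."""
    return ADDRESS_MAP.get((cpu_id, mem_seq))
```

Returning `None` for an unknown module lets the caller treat the alarm as invalid instead of acting on a memory that does not exist.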

S13: isolating the target memory according to the target physical address, and sending a power-down instruction to the target power supply to power off the target memory.

In this step, the target memory is isolated according to the target physical address and a power-down instruction is sent to the target power supply to cut its power. The principle is to confine the influence of the faulty memory to the smallest possible range through precise hardware control and system resource management.

Specifically, using the memory address mapping table and its control authority over the system bus, the processor removes the physical address of the target memory from the system address space so that no process or task can access the memory any further, which prevents the fault or data corruption from spreading to other modules. The processor then locates the target power supply and sends it a power-down instruction to shut it down, which disconnects the power supply circuit of the faulty memory and completely cuts the electrical connection, preventing current surges, overheating or other secondary faults caused by continued power supply.

In addition, the triggering of the power-off signal and the memory isolation step are completed cooperatively, so that system abnormality caused by power off without isolation is avoided. In the whole process, detailed information of isolation and power failure is recorded, and an operation and maintenance management system is notified, so that basis is provided for subsequent fault analysis and processing.
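The ordering constraint in S13 (isolate first, then cut power, and record both) can be sketched as below. The `isolate` and `power_down` callables stand in for hardware operations the source does not detail; they and the function name are assumptions.

```python
from typing import Callable, List


def isolate_then_power_down(
    phys_addr: int,
    supply_id: int,
    isolate: Callable[[int], bool],      # hypothetical: True if isolation succeeded
    power_down: Callable[[int], None],   # hypothetical: sends the power-down instruction
    log: List[str],
) -> bool:
    """Isolate the memory first; cut power only if isolation succeeded."""
    if not isolate(phys_addr):
        # Never power off a module that the system may still be accessing.
        log.append(f"isolation failed at {phys_addr:#x}; supply {supply_id} left on")
        return False
    log.append(f"isolated {phys_addr:#x}")
    power_down(supply_id)
    log.append(f"powered down supply {supply_id}")
    return True
```

Keeping the log as an explicit argument mirrors the requirement that isolation and power-off details be recorded for later analysis.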

Based on the above embodiments:

As an alternative embodiment, the fault alarm information includes an identifier of the target central processing unit to which the target memory belongs, the sequence number of the target memory under the target central processing unit, and the value of a target register corresponding to the target memory, where the value in the target register indicates whether the target memory has a fault. Parsing the fault alarm information when the fault alarm information of the target memory is received includes: parsing, from the fault alarm information, the identifier of the target central processing unit to which the target memory belongs, the sequence number of the target memory under the target central processing unit, and the value of the target register. Determining whether the fault alarm information is valid according to the parsing result includes: determining whether the parsed identifier of the target central processing unit is within a first preset range; determining whether the parsed sequence number of the target memory under the target central processing unit is within a second preset range; determining whether the parsed value of the target register is a preset value; if the identifier is within the first preset range, the sequence number is within the second preset range, and the value of the target register is the preset value, determining that the fault alarm information is valid; and if the identifier is outside the first preset range, or the sequence number is outside the second preset range, or the value of the target register is not the preset value, determining that the fault alarm information is invalid.

In this embodiment, the validity of the alarm information is accurately determined by performing multi-level verification on the key field of the fault alarm information, so as to avoid the interference of the error alarm on the system operation.

Specifically, when fault alarm information is received, the identifier of the target central processing unit (CPU) to which the target memory belongs, the sequence number of the target memory under the target CPU, and the value of the target register are first extracted from it. This information is the core data for identifying the memory module and characterizing its fault state.

The analyzed target CPU identification can be compared with a first preset range, so that the CPU pointed by the alarm information is ensured to belong to the range of the processor cluster actually deployed by the system, and the irrelevant or misreported processor identification is filtered.

Meanwhile, the sequence number of the target memory is checked to ensure that the sequence number is in the effective address space (the second preset range) of the target CPU, so that the validity of the memory position is verified, and misoperation caused by address boundary crossing or false data is avoided.

Meanwhile, the value of the target register is judged, and whether the state recorded in the register represents that the memory is actually faulty (such as ECC error overrun, temperature abnormality and the like) is judged by comparing the value with a preset value.

And judging that the alarm information is valid only when the three conditions are met simultaneously, namely the target CPU identification is in the valid range, the memory sequence number is legal, and the register value meets the expectations, so that the subsequent isolation and power-off operation is started, otherwise, the alarm information is regarded as invalid and discarded.

The multi-layer checking mechanism utilizes specific information of a hardware layer, combines with preset operation rules of a system, effectively improves the accuracy and reliability of alarm judgment, avoids error isolation or power-off operation of a non-fault memory, and ensures the overall stability and safety of the system.
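The three-condition check described above reduces to a single conjunction. In this minimal sketch the concrete ranges and the fault value are placeholders; the source only says they are preset.

```python
def is_alarm_valid(
    cpu_id: int,
    mem_seq: int,
    reg_value: int,
    cpu_range: range = range(0, 2),    # first preset range (placeholder values)
    seq_range: range = range(0, 16),   # second preset range (placeholder values)
    fault_value: int = 0x1,            # preset register value meaning "fault" (placeholder)
) -> bool:
    """Valid only if all three conditions hold; any single failure invalidates."""
    return (
        cpu_id in cpu_range
        and mem_seq in seq_range
        and reg_value == fault_value
    )
```

For example, `is_alarm_valid(1, 3, 0x1)` passes all three checks, while an out-of-range CPU identifier such as `is_alarm_valid(5, 3, 0x1)` causes the alarm to be discarded.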

As an alternative embodiment, the fault alarm information includes the value of a target register corresponding to the target memory, where the value in the target register indicates whether the target memory has a fault, and after determining the target physical address and the target power supply corresponding to the target memory according to the fault alarm information, the method further includes: determining the register address of the target register corresponding to the target memory. Isolating the target memory according to the target physical address includes:

writing the target physical address and the register address of the target register into a memory management register, so that the basic input/output system isolates the target memory based on the information in the memory management register.

In this embodiment, by precisely positioning the register address and effectively configuring the memory management register, and combining with the isolation mechanism of the Basic Input Output System (BIOS), the fault memory is safely and efficiently isolated from the system.

Specifically, after the target physical address and the target power supply corresponding to the target memory are determined according to the fault alarm information, the register address of the target register associated with the target memory is further obtained; this register address is the control entry point for accessing and operating on the state of the target memory. Then, to isolate the target memory, the processor writes both the physical address of the target memory and the register address of the target register into the memory management register, which marks the faulty memory for the system hardware and the BIOS. The memory management register is a key component for interaction between the BIOS and hardware resources: it passes the stored physical address and register address to the BIOS so that the BIOS can identify the specific faulty memory module. Based on this information, the BIOS updates the memory access policy, removing the target memory from the system address space or marking it as unavailable, so that the processor, the operating system and applications can no longer access the faulty memory region. This avoids the system instability or data corruption that the faulty memory would otherwise cause.

The isolation mechanism based on the cooperative work of the register and the memory management fully utilizes the capability of hardware resources and the underlying logic of the system, ensures the accuracy and instantaneity of fault processing, and simultaneously provides a stable environment for the subsequent operation of the system.
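The handoff between the processor and the BIOS can be modeled in software as follows. This is a toy model under stated assumptions: real memory management registers and BIOS behavior are platform-specific, and the classes `MemoryManagementRegister` and `BiosModel` are invented here only to show the data flow.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass
class MemoryManagementRegister:
    # Pending request: (physical address, address of the memory's status register).
    pending: Optional[Tuple[int, int]] = None

    def write(self, phys_addr: int, reg_addr: int) -> None:
        """Processor side: mark a faulty memory for the BIOS."""
        self.pending = (phys_addr, reg_addr)


@dataclass
class BiosModel:
    usable: Set[int]  # base addresses of modules currently in the address space

    def service(self, mmr: MemoryManagementRegister) -> None:
        """BIOS side: remove the marked module from the usable address space."""
        if mmr.pending is None:
            return
        phys_addr, _reg_addr = mmr.pending
        self.usable.discard(phys_addr)
        mmr.pending = None  # request consumed
```

After `mmr.write(...)` followed by `bios.service(mmr)`, the marked base address no longer appears in `bios.usable`, which is the software analogue of removing the module from the system address space.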

As an optional embodiment, the server further includes a plurality of fault indication devices in one-to-one correspondence with the plurality of memories, each fault indication device is disposed on the memory slot of its corresponding memory, and after determining the target physical address and the target power supply corresponding to the target memory according to the fault alarm information, the method further includes: determining, according to the fault alarm information, the target memory slot in which the target memory is located and the fault type of the target memory.

After isolating the target memory according to the target physical address and sending a power-down instruction to the target power supply to power off the target memory, the method further includes:

determining the target state of the target fault indication device corresponding to the target memory by looking up a preset correspondence according to the fault type, where the preset correspondence is a one-to-one correspondence between fault types and states of the fault indication device; and sending a control signal to the target fault indication device to update its state to the target state.

In this embodiment, the fault information of the target memory is transmitted to the maintainer in an intuitive physical form through the state update of the fault indication device, so as to implement more efficient fault positioning and processing.

Specifically, after the fault alarm information is analyzed, the physical slot position of the target memory and the corresponding fault type thereof are further determined, and the information is important for realizing accurate isolation and management of alarm states. After the isolation and power-off operation is performed on the target memory, determining a target state to be displayed by the corresponding target fault indication device according to the analyzed fault type and through a one-to-one correspondence between the preset fault type and the state of the fault indication device. For example, different fault types such as memory overheating, data verification failure or power abnormality may correspond to different states of the indication device, such as a change in light color (red, yellow, green) or a change in flashing frequency. And then, the processor sends a control signal to the target fault indicating device to trigger the state of the indicating device to be updated to the target state, so that the specific fault information of the memory is visually displayed on the physical position of the memory slot.

In this way, maintenance personnel can quickly locate the fault memory and the fault cause thereof without additional complex diagnostic tools, thereby improving maintenance efficiency and reducing system downtime. The dynamic updating method combined with the hardware indicating device not only enhances the visualization and automation capacity of fault management, but also further optimizes the maintenance experience and operation reliability of the system.
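The preset correspondence between fault types and indicator states can be represented as a lookup table. The three fault types come from the examples in the text; the specific colors and blink rates assigned to them are assumptions, as are all names in the sketch.

```python
from enum import Enum
from typing import Dict, Tuple


class FaultType(Enum):
    OVERHEAT = "overheat"
    DATA_CHECK_FAILURE = "data_check_failure"
    POWER_ABNORMAL = "power_abnormal"


# Indicator state = (color, blink frequency in Hz); 0.0 means steadily lit.
# The particular assignments below are illustrative, not taken from the source.
INDICATOR_STATES: Dict[FaultType, Tuple[str, float]] = {
    FaultType.OVERHEAT: ("red", 0.0),
    FaultType.DATA_CHECK_FAILURE: ("yellow", 1.0),
    FaultType.POWER_ABNORMAL: ("red", 2.0),
}


def is_one_to_one(mapping: Dict[FaultType, Tuple[str, float]]) -> bool:
    """A one-to-one correspondence has no two fault types sharing a state."""
    return len(set(mapping.values())) == len(mapping)


def target_state(fault: FaultType) -> Tuple[str, float]:
    """Look up the state the slot's indicator should be driven to."""
    return INDICATOR_STATES[fault]
```

Checking `is_one_to_one(INDICATOR_STATES)` when the table is defined guards against two fault types accidentally mapping to the same light pattern, which would make them indistinguishable at the slot.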

As an optional embodiment, after isolating the target memory according to the target physical address and sending a power-down instruction to the target power supply to power off the target memory, the method further includes:

loading the standby memory, and sending a power-on instruction to the standby power supply corresponding to the standby memory, so that the standby power supply drives its internal charge pump inverter and negative linear regulator to start working, and when the negative voltage output by the negative linear regulator reaches a preset percentage of a set value, the standby power supply closes the path between the power source and the standby memory.

In this embodiment, by introducing the standby memory and the standby power supply, after the target memory fails and is powered off, it is ensured that the system can be seamlessly switched to the standby memory, so as to ensure continuous operation of the system.

Specifically, after the target memory is isolated and powered down, the processor or system control unit starts loading the standby memory. It first sends a power-on instruction to the standby power supply to activate it. The standby power supply contains an internal charge pump inverter and a negative linear regulator: the charge pump inverter converts the input voltage into the required stable voltage, and the negative linear regulator then generates the negative-voltage supply. The negative linear regulator ensures that the negative output ramps gradually during power-up, and when the output negative voltage reaches the preset percentage of the set value, the standby power supply delivers power to the standby memory through its output terminal, establishing the supply path between the standby power supply and the standby memory.

This process ensures that the standby memory starts to work only when the power supply is stable and the voltage has reached the required level, which avoids a failed start caused by unstable or insufficient voltage. Through this fine-grained power management and memory switching mechanism, the load can be switched smoothly to the standby memory after the target memory fails, so that the system is not interrupted and no data is lost, which improves the reliability and fault tolerance of the system.

As shown in fig. 2, the power supply works as follows. In the power-on stage, the input voltage (Vin) is applied to the input pin of the power supply; the input voltage should be in the range of 2.5 V to 5.5 V. When the input voltage stabilizes and reaches the start-up threshold of the power supply, the power supply begins its internal initialization. The charge pump inverter inside the power supply then starts to operate, and the negative linear regulator generates the desired negative output voltage. When the negative output is stable and stays within ±7.5% of the set value, the power-OK (POK) output pin of the power supply asserts a signal, current is delivered to the corresponding memory slot, the memory starts, and the server starts normally. Similarly, when the standby memory is loaded, a low signal is sent via an I2C instruction to the shutdown pin (SHDN) of the power supply of the failed memory. When the SHDN pin receives the low signal, the power supply begins its shutdown process: the charge pump inverter stops working and no longer generates a negative output, so the output voltage (NEGOUT) of the power supply gradually decreases, which ensures that the GaAsFET power amplifier is not damaged during shutdown. When the output voltage drops to a safe level (near 0 V), the power supply is completely disconnected from the GaAsFET power amplifier. After it is confirmed that the output has been completely turned off, the input voltage (Vin) can be safely removed. The power-down of the failed memory is then complete, and operation and maintenance personnel can replace the device.
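The power-OK window and the shutdown ordering in this paragraph can be expressed as a short Python sketch. The Vin range and the ±7.5% window come from the text; the 0.1 V "near 0 V" threshold, the function names, and the action strings are assumptions made only for illustration.

```python
from typing import List, Sequence

VIN_MIN_V, VIN_MAX_V = 2.5, 5.5  # input range stated in the text
POK_TOLERANCE = 0.075            # +/-7.5% window stated in the text
SAFE_OUTPUT_V = 0.1              # assumed threshold for "near 0 V"; not from the source


def vin_in_range(vin_v: float) -> bool:
    """Check that the input voltage lies in the permitted range."""
    return VIN_MIN_V <= vin_v <= VIN_MAX_V


def power_ok(negout_v: float, set_value_v: float) -> bool:
    """POK is asserted while NEGOUT stays within +/-7.5% of the set value."""
    return abs(negout_v - set_value_v) <= POK_TOLERANCE * abs(set_value_v)


def shutdown_steps(negout_readings_v: Sequence[float]) -> List[str]:
    """Order the power-down actions given NEGOUT samples taken after SHDN goes low."""
    actions = ["drive SHDN low"]
    for reading in negout_readings_v:
        if abs(reading) <= SAFE_OUTPUT_V:
            # Output has decayed to a safe level: disconnect first, then remove Vin.
            actions += ["disconnect output", "remove Vin"]
            return actions
        actions.append("wait")
    return actions  # output never reached the safe level, so Vin stays applied


steps = shutdown_steps([-3.0, -1.0, -0.05])
```

The point of the ordering is that Vin is removed only after the output has been observed at a safe level; if the samples never reach it, the sketch keeps waiting rather than cutting the input.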

As an optional embodiment, the fault alarm information comprises the value of a register corresponding to the memory, wherein the value in the register is used to represent whether the memory has a fault, and after the standby power supply conducts the path between the power source and the standby memory, the method further comprises the following steps:

Performing hardware self-checking on the standby memory, wherein the hardware self-checking at least comprises detection of the physical connection state of the standby memory and a corresponding central processing unit, detection of the data transmission rate of the standby memory and detection of the error rate of the standby memory;

When the detection result of the hardware self-check meets the preset requirement, resetting the value of the register corresponding to the standby memory to a second value and triggering the basic input output system to load the standby memory, wherein the second value represents that the memory has no fault.

In this embodiment, the health of the standby memory is verified by a series of hardware self-checking procedures before it is put into use, and the standby memory is formally brought into the system only after it is confirmed to be normal.

Specifically, after the standby power supply is successfully turned on and provides a stable supply to the standby memory, the system performs a series of hardware self-checking operations on the standby memory. The hardware self-check first checks the physical connection state between the standby memory and the central processing unit (CPU), to ensure that there is no signal loss or transmission interruption caused by hardware failure or a loose connection. Next, the data transmission rate of the standby memory is measured to verify that the data transmission capacity between the memory and the CPU meets the performance requirement of the system, so that a slow transfer speed does not degrade overall operating efficiency. Finally, the error rate of the standby memory is measured to ensure that read and write operations do not cause data errors or losses. Each check is evaluated against a preset standard, so that the working state of the standby memory is confirmed to be normal.

Once the hardware self-check result meets the preset standard, that is, the standby memory has been fully checked and runs normally, the value in the register corresponding to the standby memory is reset to the second value, which represents that the memory has no fault. The basic input output system (BIOS) is then triggered to load the standby memory, which is brought into the system to replace the failed target memory, so that the system can continue to operate without being affected by the failed memory.

Through the verification process, the reliability and stability of the standby memory can be effectively ensured, and meanwhile, unqualified standby memory is prevented from being wrongly started under the fault condition, so that the fault tolerance and stability of the system are further improved.
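The three checks and the register reset can be sketched as follows. This is an illustration under stated assumptions: the register encoding (`FAULT = 1`, `NO_FAULT = 0`), the thresholds, the slot name, and all function names are hypothetical, and the BIOS trigger is left to the caller because the text does not specify its interface.

```python
from dataclasses import dataclass
from typing import Dict

FAULT = 1     # assumed encoding of a faulty memory
NO_FAULT = 0  # assumed encoding of the "second value" (no fault)


@dataclass
class SelfCheckResult:
    link_up: bool             # physical connection between standby memory and CPU
    transfer_rate_mts: float  # measured data transmission rate (assumed unit: MT/s)
    error_rate: float         # measured error rate


def passes(result: SelfCheckResult, min_rate_mts: float, max_error_rate: float) -> bool:
    """All three checks must meet the preset requirement."""
    return (
        result.link_up
        and result.transfer_rate_mts >= min_rate_mts
        and result.error_rate <= max_error_rate
    )


def admit_standby(result: SelfCheckResult, registers: Dict[str, int], slot: str,
                  min_rate_mts: float = 4800.0, max_error_rate: float = 1e-12) -> bool:
    """Reset the register to NO_FAULT only if the self-check passes."""
    if not passes(result, min_rate_mts, max_error_rate):
        return False           # register is left unchanged; memory is not loaded
    registers[slot] = NO_FAULT
    return True                # the caller would now trigger the BIOS load


registers = {"standby0": FAULT}
admitted = admit_standby(SelfCheckResult(True, 5600.0, 0.0), registers, "standby0")
```

Leaving the register untouched on failure keeps an unqualified standby memory from being presented to the BIOS as healthy.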

As an optional embodiment, the server further comprises a plurality of fault recording modules corresponding to the memories one by one, wherein each fault recording module is used for recording the running state and fault information of its memory, and after triggering the basic input output system to load the standby memory, the method further comprises:

Updating a fault recording module corresponding to the target memory according to the fault alarm information, and recording the time of occurrence of the fault of the target memory, the fault type, the target physical address and the loading state of the standby memory;

In this embodiment, the fault recording module systematically records and tracks the memory fault event, and generates detailed fault processing log information to provide a transparent and traceable fault management process.

Specifically, after a Basic Input Output System (BIOS) is triggered to load the standby memory, a fault record module associated with the target memory is updated according to the received fault alarm information. The fault recording module is responsible for recording various information of the fault occurrence of the target memory, including specific time of the fault occurrence, fault type (such as read-write error, overtime, etc.), physical address of the target memory, and loading state of the standby memory (such as whether the standby memory is loaded successfully or not, whether self-checking is passed or not, etc.). This information helps the administrator to understand the detailed context of the fault and provides valuable data support for subsequent fault analysis and handling.

Meanwhile, according to the updated fault record data, log information comprising the target memory fault processing flow is generated. The log information not only records the occurrence time and type of the fault, but also describes specific fault processing measures adopted by the system, such as steps of memory isolation, loading of standby memory, hardware self-checking and the like. The generation of the log information is not only convenient for analyzing and tracking the faults in the future, but also provides clues for diagnosing and repairing the faults for system maintenance personnel.

Finally, a fault prompt is sent through a management console or a remote interface, so that a system administrator can be ensured to obtain fault information in time. Such hints typically include fault type, affected memory, load status of spare memory, and other relevant data, facilitating quick response by administrators and efficient maintenance operations.

Through the fault record and log management, the system realizes comprehensive monitoring, recording and transparent management of the memory faults, and further improves the fault processing efficiency and the reliability of the system.
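A fault record and the log entry derived from it might look like the following Python sketch. The field names, the JSON layout, and the example values are assumptions chosen for illustration; the text only lists which items are recorded.

```python
import json
from dataclasses import asdict, dataclass
from typing import Sequence


@dataclass
class FaultRecord:
    """Hypothetical record kept by the fault recording module of one memory."""
    occurred_at: str         # time of the fault, ISO 8601 string (assumed format)
    fault_type: str          # e.g. "read-write error" or "timeout"
    physical_address: int    # target physical address
    standby_loaded: bool     # whether the standby memory was loaded successfully
    self_check_passed: bool  # whether the hardware self-check passed


def build_log(record: FaultRecord, handling_steps: Sequence[str]) -> str:
    """Serialize the record together with the handling steps that were taken."""
    entry = asdict(record)
    entry["physical_address"] = hex(record.physical_address)
    entry["handling_steps"] = list(handling_steps)
    return json.dumps(entry, sort_keys=True)


record = FaultRecord("2024-12-27T08:00:00+00:00", "read-write error",
                     0x1000, True, True)
log_line = build_log(record, ["isolate memory", "power off", "load standby memory",
                              "hardware self-check"])
```

A single structured line per fault keeps the record easy to forward to a management console or remote interface as the fault prompt.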

As an optional embodiment, after the standby power supply outputs power to the standby memory and the hardware self-check is performed on the standby memory, the method further comprises:

After the standby memory passes the hardware self-check and is loaded, dynamically migrating the high-priority tasks associated with the target memory to the standby memory according to task priority and memory load conditions;

After migration of the high-priority tasks is completed, redistributing the low-priority tasks to the remaining effective memory according to a preset task migration policy, so as to balance the utilization of memory resources;

After migration of all tasks is completed, updating the memory allocation table in the task scheduler and sending a memory state update notification to the operating system, so that the system can perform task scheduling based on the latest memory state.

In this embodiment, after the standby memory is loaded and verified as normal by the hardware self-check, tasks are migrated dynamically according to their priority and the current memory load, so as to maximize resource utilization and keep the system running stably. Once the standby memory passes the hardware self-check and is loaded successfully, the system first identifies the high-priority tasks associated with the target memory. Because the standby memory is ready, the system migrates these high-priority tasks to it, so that they are handled in time and are not affected by the target memory failure.

After the high-priority task migration is completed, resources are reallocated for the low-priority tasks according to the preset task migration policy. Low-priority tasks typically do not need to be executed immediately, so they can be migrated to the remaining effective memory, leaving room in the standby memory for more important tasks. This keeps memory resources evenly distributed, avoids overloaded or idle memory, and improves the utilization efficiency of system resources.

Once all tasks have been migrated, the memory allocation table in the task scheduler is updated. The memory allocation table records the relationship between each task and the memory, including the memory location of each task after migration. The updated memory allocation table is passed to the operating system, which schedules tasks based on the latest memory state. After receiving the memory state update notification, the operating system re-evaluates the availability of memory resources and adjusts its task scheduling strategy accordingly, so that the system still operates efficiently after the memory load changes.

The dynamic task migration and memory management strategy enhances the flexibility and reliability of the system under the condition of memory faults or load fluctuation, and improves the timeliness of task processing and the overall performance of the system.
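The two-phase migration can be sketched in Python. The text does not define the "preset task migration policy", so this sketch assumes one: each low-priority task goes to the remaining memory with the most free space. The priority threshold, the names, and the omission of a capacity check on the standby memory are likewise simplifying assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Task:
    name: str
    priority: int  # higher number means more important (assumed convention)
    size_mb: int
    memory: str    # identifier of the memory currently holding the task


def migrate(tasks: List[Task], failed: str, standby: str,
            remaining_free_mb: Dict[str, int], high_priority: int = 5) -> Dict[str, str]:
    """Return the updated allocation table mapping task name to memory."""
    table = {t.name: t.memory for t in tasks}
    affected = sorted((t for t in tasks if t.memory == failed),
                      key=lambda t: t.priority, reverse=True)
    free = dict(remaining_free_mb)  # work on a copy of the free-space map

    # Phase 1: high-priority tasks move to the standby memory.
    for task in affected:
        if task.priority >= high_priority:
            table[task.name] = standby

    # Phase 2: low-priority tasks go to the remaining memory with the most free space.
    for task in affected:
        if task.priority < high_priority:
            target = max(free, key=free.get)
            if free[target] < task.size_mb:
                raise RuntimeError(f"no remaining memory can hold {task.name}")
            free[target] -= task.size_mb
            table[task.name] = target
    return table


tasks = [Task("db", 9, 512, "dimm0"), Task("batch", 1, 256, "dimm0"),
         Task("web", 5, 128, "dimm1")]
table = migrate(tasks, failed="dimm0", standby="standby0",
                remaining_free_mb={"dimm1": 1024, "dimm2": 2048})
```

The returned dictionary plays the role of the updated memory allocation table; tasks that were not on the failed memory keep their original placement.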

In a second aspect, as shown in fig. 3, the present application provides a memory failure management apparatus, including:

a memory 31 for storing a computer program;

a processor 32, configured to implement the steps of the memory failure management method described above when executing the computer program. For other descriptions of the memory failure management apparatus, please refer to the above embodiments; they are not repeated here.

In a third aspect, as shown in fig. 4, the present application provides a nonvolatile storage medium 41, where a computer program 42 is stored on the nonvolatile storage medium 41, and the computer program 42 implements the steps of the memory failure management method described above when executed by a processor. For other descriptions of the nonvolatile storage medium 41, please refer to the above embodiment, and the disclosure is not repeated here.

In a fourth aspect, the present application provides a server, including the memory failure management apparatus described above, and further including a plurality of memories and a plurality of power supplies corresponding to the memories one by one, where each memory is connected to a power source through its corresponding power supply, and the memory failure management apparatus is connected to each memory and each power supply respectively. For other descriptions of the server, please refer to the above embodiments; they are not repeated here.

In a fifth aspect, the present application provides a computer program product comprising computer programs/instructions which when executed by a processor implement the steps of a memory failure management method as described above. For other descriptions of computer program products, please refer to the above-mentioned embodiments, and the disclosure is not repeated here.

It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A memory fault management method, characterized by being applied to a processor in a server, wherein the server further comprises a plurality of memories and a plurality of power supplies corresponding to the memories one by one, and each memory is connected to a power source through its corresponding power supply, the memory fault management method comprising the following steps:

2. The memory fault management method as claimed in claim 1, wherein the fault alarm information includes an identifier of a target central processing unit to which the target memory belongs, a sequence number under the target central processing unit, and a value of a target register corresponding to the target memory, where the value in the target register is used to represent whether the target memory has a fault;

3. The memory fault management method as claimed in claim 1, wherein the fault alert information includes a value of a target register corresponding to the target memory, the value in the target register being used to indicate whether the target memory has a fault, and after determining a target physical address and a target power supply corresponding to the target memory according to the fault alert information, further comprising:

4. The memory fault management method as claimed in claim 1, wherein the server further comprises a plurality of fault indication devices corresponding to the memories one by one, each fault indication device is disposed on a memory slot of the memory corresponding to the fault indication device, and after determining the target physical address and the target power supply corresponding to the target memory according to the fault alarm information, the method further comprises:

5. The memory fault management method according to any one of claims 1 to 4, wherein isolating the target memory according to the target physical address, and sending a power-down instruction to the target power supply to power down the target memory, further comprises:

6. The memory fault management method as claimed in claim 5, wherein said fault alert information includes a value of a register corresponding to said memory, said value in said register being used to indicate whether said memory has failed, and after said standby power supply turns on the path between said power source and said standby memory, the method further includes:

7. The memory fault management method as claimed in claim 6, wherein said server further comprises a plurality of fault recording modules corresponding to the memories one by one, each of said fault recording modules being configured to record an operation state and fault information of its memory, and after triggering the basic input output system to load said standby memory, the method further comprises:

8. A memory fault management device, comprising:

a memory for storing a computer program;

a processor for implementing the steps of the memory fault management method according to any one of claims 1-7 when executing the computer program.

9. A non-volatile storage medium, wherein a computer program is stored on the non-volatile storage medium, which when executed by a processor, implements the steps of the memory fault management method according to any one of claims 1-7.

10. A server, comprising the memory fault management device according to claim 8, and further comprising a plurality of memories and a plurality of power supplies corresponding to the memories one by one, wherein each of the memories is connected to a power source through its corresponding power supply, and the memory fault management device is connected to each of the memories and each of the power supplies.