CN120956636A - A method and electronic device for monitoring network card operation - Google Patents

A method and electronic device for monitoring network card operation

Info

Publication number
CN120956636A
Authority
CN
China
Prior art keywords
network card
target
mode
target network
error event
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202511203656.8A
Other languages
Chinese (zh)
Inventor
冯洁
彭笑笑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202511203656.8A
Publication of CN120956636A
Legal status: Pending

Landscapes

  • Computer And Data Communications (AREA)

Abstract

This application discloses a network card operation monitoring method and an electronic device, relating to the field of network card management technology. The method includes: monitoring the running state of a target network card; if the running state indicates that the target network card currently has a target error event and the current running mode of the target network card is data processor mode, obtaining the context data of the target network card running in data processor mode, saving it to a preset shared storage area, activating a first function stack, and switching the current running mode to network interface card mode; and, based on the context data in the preset shared storage area, establishing the network connection of the target network card running in network interface card mode and repairing the target error event of the target network card. With this application, the network card can switch quickly to the stable network interface card working mode when the data processor mode fails, and self-repair by the network card reduces service interruption time and significantly reduces operation and maintenance complexity and cost.

Description

Network card operation monitoring method and electronic equipment
Technical Field
The present application relates to the field of network card management technologies, and in particular, to a network card operation monitoring method and an electronic device.
Background
The intelligent network card (SmartNIC, Smart Network Interface Controller) has evolved from the conventional NIC (Network Interface Card) into a device that integrates substantial processing capability and can offload network, storage, and security processing tasks from the host CPU (Central Processing Unit). NIC mode mainly provides basic network connectivity and packet transmit/receive functions, and its processing logic is relatively simple and direct. DPU (Data Processing Unit) mode deeply integrates computing capability and can perform advanced tasks such as complex network function virtualization, storage acceleration, security processing, and virtual machine/container network offload. However, an intelligent network card is generally configured statically in either DPU mode or NIC mode at initialization and cannot switch dynamically. When complex tasks run in DPU mode and an error occurs in the software stack or a hardware acceleration unit, the service interruption is long and service continuity suffers.
Disclosure of Invention
The application provides a network card operation monitoring method and an electronic device. They address the data loss that occurs when a network card running in the data-processor-based mode recovers from a fault by resetting the whole network card or by relying on host system intervention. They enable the network card to switch its running mode and recover from problems autonomously, improving the operating stability of the device.
The application provides a network card operation monitoring method, which comprises the following steps:
monitoring the running state of a target network card, and judging, according to the running state, whether the target network card currently has a target error event;
if the target network card has a target error event and the current running mode of the target network card is a data processor mode, acquiring context data of the target network card running in the data processor mode;
storing the context data in a preset shared memory area, activating a first function stack of the target network card, and switching the current running mode of the target network card to a network interface card mode;
establishing, based on the context data in the preset shared memory area, a network connection of the target network card running in the network interface card mode, and repairing the target error event of the target network card based on the context data.
The application also provides electronic equipment which comprises a memory and a processor, wherein the memory is used for storing a computer program, and the processor is used for realizing the steps of any network card operation monitoring method when executing the computer program.
The application monitors the running state of the target network card. If the current running mode of the target network card is the data processor mode and the running state indicates that the target network card has a target error event, the context data of the target network card running in the data processor mode is stored in the preset shared memory area, the first function stack of the target network card is activated, and the current running mode is switched to the network interface card mode. The network connection of the target network card running in the network interface card mode is then established based on the context data in the preset shared memory area, and the target error event is repaired based on the context data.
In this technical solution, when the target network card has a target error event, the context data from its current working mode is first stored in the shared storage area, and the current running mode is then switched to the network-interface-card-based working mode, so that the network connection of the target network card is reestablished from the context data in the shared storage area. When the data-processor-based mode fails, the network card can therefore switch quickly and autonomously to the stable network-interface-card-based working mode, which greatly reduces service interruption time so that service is uninterrupted or recovers quickly. This solves the long interruption and data loss that occur when a network card running in the data-processor-based mode recovers from a fault by resetting the whole network card or by relying on host system intervention. It also significantly reduces dependence on manual intervention, lets the network card repair itself, significantly reduces operation and maintenance complexity and cost, and improves the operating stability of the device in which the network card is located.
Drawings
For a clearer description of embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described, it being apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
Fig. 1 is a flowchart of a network card operation monitoring method provided in an embodiment of the present application;
Fig. 2 is a diagram of a network card architecture according to an embodiment of the present application;
Fig. 3 is a flowchart of a specific network card operation monitoring method according to an embodiment of the present application;
Fig. 4 is a block diagram of an electronic device according to the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
At present, an intelligent network card is generally configured statically in DPU mode or NIC mode at initialization and cannot switch dynamically. When complex tasks run in DPU mode and an error occurs in the software stack or a hardware acceleration unit, the service interruption is long and service continuity suffers. In the present application, when the data-processor-based mode fails, the network card switches quickly and autonomously to the stable network-interface-card-based working mode, which greatly reduces service interruption time so that service is uninterrupted or recovers quickly. Autonomous switching of the running mode and autonomous problem repair also significantly reduce operation and maintenance complexity and cost and improve the operating stability of the device in which the network card is located.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
As shown in fig. 1, an embodiment of the present application provides a network card operation monitoring method, which is described in detail below in conjunction with an execution flow of the network card operation monitoring method, including:
Step S11: monitoring the running state of the target network card, and judging, according to the running state, whether the target network card currently has a target error event.
It can be understood that, as shown in Fig. 2, the firmware/software architecture of the intelligent network card in this embodiment supports a complete NIC mode function stack and a complete DPU mode function stack (including an operating system, drivers, a management program, an application framework, and so on). A mode state machine records the current running mode (DPU Active, NIC Active) and possible transitional states (such as Switching, Error Recovery). A shared context storage area saves and restores critical state information (such as network connection state, secure session keys, and offload table entries) during mode switching, ensuring service continuity after the switch. The intelligent network card can be controlled through a target control platform, which may be a host, a control plane, or the like. Specifically, the hardware layer monitors processor (such as Arm cores, x86 cores) temperature and voltage, ECC (Error Checking and Correcting) memory errors, ASIC (Application-Specific Integrated Circuit)/FPGA (Field Programmable Gate Array) status registers, the watchdog timer, and the PHY (physical layer)/SerDes (Serializer/Deserializer) link state. The firmware/software layer monitors operating system kernel crashes (Panic/Oops), crashes of critical services (e.g., vSwitch, storage target), application crashes, resource exhaustion (memory, threads), protocol stack exceptions, and data consistency check failures (e.g., DMA (Direct Memory Access)). The functional layer monitors key service indicators (such as a sudden increase in packet loss rate, excessive processing delay, and encryption/decryption failure rate).
The error classifier classifies each detected error event (e.g., transient error, isolatable software error, firmware logic error, critical hardware error) and evaluates its scope of impact and severity.
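The mode state machine described above can be sketched as follows. This is a minimal illustration rather than the firmware of the application: the four state names come from the description, while the `ALLOWED` transition table and the `ModeStateMachine` class are hypothetical, since the description does not enumerate which transitions are legal.

```python
from enum import Enum


class Mode(Enum):
    DPU_ACTIVE = "DPU Active"
    NIC_ACTIVE = "NIC Active"
    SWITCHING = "Switching"
    ERROR_RECOVERY = "Error Recovery"


# Hypothetical transition table (an assumption, not taken from the source).
ALLOWED = {
    Mode.DPU_ACTIVE: {Mode.ERROR_RECOVERY, Mode.SWITCHING},
    Mode.ERROR_RECOVERY: {Mode.DPU_ACTIVE, Mode.SWITCHING},
    Mode.SWITCHING: {Mode.NIC_ACTIVE, Mode.DPU_ACTIVE},
    Mode.NIC_ACTIVE: {Mode.SWITCHING},
}


class ModeStateMachine:
    """Records the current running mode and rejects unlisted transitions."""

    def __init__(self, initial=Mode.DPU_ACTIVE):
        self.state = initial
        self.history = [initial]

    def transition(self, target):
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
        self.history.append(target)
        return self.state
```

Routing every mode change through a transitional state such as `SWITCHING` gives the rest of the firmware a single place to check whether a switch is in progress.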
Based on the intelligent network card described above, and as shown in Fig. 3, monitoring the running state of the target network card in this embodiment may include monitoring its hardware layer running state, software layer running state, and functional layer running state, and judging from these whether the target network card currently has an initial error event. If an initial error event exists, a preset error classifier classifies it to obtain its event category, and the initial error event is taken as the target error event when the event category meets a preset mode switching condition. Multi-level error monitoring of the running state thus identifies the type of error event the network card currently has, so that targeted measures can be taken and the operating stability of the intelligent network card is ensured.
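The monitor, classify, and select flow of this paragraph can be illustrated with a small sketch. The four category names come from the description; the event kinds in `RULES`, the default for unknown kinds, and the choice of which categories satisfy the mode switching condition (`SWITCH_CATEGORIES`) are assumptions made only for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TRANSIENT = "transient error"
    ISOLATABLE_SOFTWARE = "isolatable software error"
    FIRMWARE_LOGIC = "firmware logic error"
    CRITICAL_HARDWARE = "critical hardware error"


@dataclass(frozen=True)
class ErrorEvent:
    layer: str  # "hardware", "software", or "functional"
    kind: str   # hypothetical event identifier


# Hypothetical rules mapping an event kind to a category.
RULES = {
    "ecc_correctable": Category.TRANSIENT,
    "app_crash": Category.ISOLATABLE_SOFTWARE,
    "firmware_state_corrupt": Category.FIRMWARE_LOGIC,
    "asic_fatal": Category.CRITICAL_HARDWARE,
}

# Assumed mode switching condition: only these categories force a switch.
SWITCH_CATEGORIES = {Category.FIRMWARE_LOGIC}


def detect_initial_events(layer_states):
    """Flatten per-layer findings into a list of initial error events."""
    return [ErrorEvent(layer, kind)
            for layer, kinds in layer_states.items()
            for kind in kinds]


def classify(event):
    """Return the category; unknown kinds are treated as transient (assumption)."""
    return RULES.get(event.kind, Category.TRANSIENT)


def select_target_events(layer_states):
    """Keep only initial events whose category meets the switching condition."""
    return [e for e in detect_initial_events(layer_states)
            if classify(e) in SWITCH_CATEGORIES]
```

Events that are filtered out here are not discarded in the described method; they are candidates for repair within the current mode.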
In another specific embodiment, when the event category does not meet the preset mode switching condition and the current running mode of the target network card is the data processor mode, a network card repair rule for the target network card running in the data processor mode is determined directly, and the initial error event is repaired according to that rule. As shown in Fig. 3, error events that do not meet the preset mode switching condition may be handled by lightweight repair (In-place Recovery), that is, an attempt to repair within the current mode (typically DPU mode), for example for transient errors or local software errors (such as a single application crash). Specific repair operations include, but are not limited to, restarting the crashed application or service, isolating the faulty module, resetting the associated hardware unit (e.g., a particular acceleration engine), and rolling back to a known good configuration or software version.
As shown in Fig. 3, after the initial error event is repaired according to the network card repair rule, it must further be determined whether the repair succeeded. If the repair failed, the initial error event is taken as the target error event so that the current running mode of the target network card is switched to the network interface card mode. The data processor mode is the working mode in which the target network card operates based on the data processor, and the network interface card mode is the working mode in which the target network card operates based on the network interface card.
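The decision between lightweight repair and mode switching described above might look like the following sketch. The function name, the return strings, and the idea of trying several repair actions in order are illustrative assumptions; the description only requires that a failed lightweight repair escalates the event to a target error event.

```python
def handle_initial_event(meets_switch_condition, repair_actions):
    """Return "repaired" if a lightweight repair succeeds, else "failover".

    meets_switch_condition: True if the event category already satisfies
        the preset mode switching condition.
    repair_actions: callables (hypothetical) that each attempt one repair in
        the current mode and return True on success.
    """
    if meets_switch_condition:
        return "failover"
    for action in repair_actions:
        try:
            if action():
                return "repaired"
        except Exception:
            # A repair action that raises counts as a failed attempt.
            continue
    # Every lightweight repair failed: escalate to mode switching repair.
    return "failover"
```

For example, `handle_initial_event(False, [restart_service, reset_engine])` would try a service restart first and fall back to resetting an acceleration engine, where both callables are placeholders supplied by the caller.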
Step S12: if the target network card has a target error event and the current running mode of the target network card is the data processor mode, acquiring the context data of the target network card running in the data processor mode.
In this embodiment, if the target network card has a target error event and its current running mode is the data processor mode, the context data of the target network card running in the data processor mode is obtained. The data processor mode is the working mode in which the target network card operates based on the data processor, namely DPU mode. That is, when the lightweight repair fails or the current error event satisfies the preset mode switching condition, mode switching repair (Failover Recovery) may be performed. Specifically, in this embodiment, when the lightweight repair fails, or the error is classified as requiring a more thorough environment reset (e.g., kernel crash, firmware logic corruption), an autonomous switch to NIC mode (network interface card mode) may be triggered.
Step S13: storing the context data in the preset shared memory area, activating the first function stack of the target network card, and switching the current running mode of the target network card to the network interface card mode.
When mode switching repair is executed in this embodiment, as shown in Fig. 3, the context data is first saved to the preset shared memory area, and the first function stack of the target network card is activated to switch the current running mode to the network interface card mode. In this process, after the context data is saved to the preset shared memory area, the second function stack and the hardware resources on which the target network card relies when running in the data processor mode may also be determined, and the second function stack and the hardware resources are unloaded or reset. In addition, a target function accelerator in the target network card may be determined and either kept running or directly restarted, so that the network connection of the target network card is maintained by means of the target function accelerator.
Based on the above technical solution, mode switching repair in this embodiment first saves the network card state: the current DPU mode activity can be frozen and the key context (network connections, secure sessions, and so on) saved to the shared context storage area. Mode unload/reset is then performed: the software stack and hardware resources associated with DPU mode are safely unloaded or reset, which may involve restarting the OS (Operating System) or hypervisor on the network card. NIC mode is then activated: the NIC mode function stack is activated, the necessary configuration is loaded, and basic network connections are quickly reestablished using the saved context. Meanwhile, while the control plane performs the switch, the intelligent network card can keep the hardware data plane (such as a fixed-function accelerator) partially forwarding, or restart it quickly, so that basic traffic is not interrupted and basic network connectivity is maintained in NIC mode; this is data plane keep-alive for the network card. For example, for network processing, basic packet forwarding (L2-L4), traffic classification, checksum calculation, and the like are retained, so that basic hardware-level data transmission is not interrupted during mode switching. Hardware offload can optionally be supported during this process to accelerate the transmission of and access to stored data, and hardware logic such as integrated encryption/decryption and firewall rule matching is retained so that security-related tasks can be processed quickly, further improving the reliability of the network card.
Based on this technical solution, the network card need not drop network connections while DPU mode activity is frozen; it maintains them and ensures service continuity by constructing the shared context storage area and applying measures such as data plane keep-alive. The shared context storage area stores key state information such as network connection state, secure session keys, and offload flow table entries, which provides the basis for quickly reestablishing network connections after the switch. The key parameters of each network connection are therefore retained during the short period in which DPU mode activity is frozen, and connections can be restored from this information rather than being dropped. In addition, while the control plane performs the switch (including freezing DPU mode activity), the hardware data plane (such as a fixed-function accelerator) may keep partial forwarding capability or restart quickly so that basic traffic is not interrupted and network connections are maintained to a certain extent. If there is a real-time data transmission requirement, the data plane keep-alive mechanism lets the network card continue to process part of the traffic and prevents connections from being interrupted by the DPU mode freeze.
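The freeze, save, unload, activate, and restore sequence described above can be walked through with a toy simulation. Everything here is hypothetical: `SimulatedNic`, its fields, and the step names only model the ordering of the steps and do not touch any real device or driver API. Clearing `connections` when the DPU stack is unloaded models the loss of control-plane state, which is then rebuilt from the shared context.

```python
import copy


class SimulatedNic:
    """Toy model of a dual-mode network card (illustrative only)."""

    def __init__(self, connections):
        self.mode = "DPU"
        self.frozen = False
        self.dpu_stack_loaded = True
        self.nic_stack_active = False
        self.connections = dict(connections)  # control-plane connection state
        self.shared_context = {}              # stands in for the shared area
        self.log = []

    def failover_to_nic(self):
        # 1. Freeze DPU mode activity so the state stops changing.
        self.frozen = True
        self.log.append("freeze_dpu")
        # 2. Save the key context to the shared context storage area.
        self.shared_context = {"connections": copy.deepcopy(self.connections)}
        self.log.append("save_context")
        # 3. Unload or reset the DPU stack; its in-memory state is lost.
        self.dpu_stack_loaded = False
        self.connections.clear()
        self.log.append("unload_dpu_stack")
        # 4. Activate the NIC mode function stack.
        self.nic_stack_active = True
        self.mode = "NIC"
        self.log.append("activate_nic_stack")
        # 5. Reestablish connections from the saved context.
        self.connections = copy.deepcopy(self.shared_context["connections"])
        self.frozen = False
        self.log.append("restore_connections")
        return self.log
```

Saving the context before the unload step is the ordering constraint that matters: once the DPU stack is reset, the shared area is the only remaining copy of the connection state.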
In this embodiment, after the current running mode of the target network card is switched to the network interface card mode, a mode switching signal of the target network card may also be generated and sent to the target control platform through the management controller or through the target interface of the target network card. The target interface is the data transmission interface of the target network card when it runs in the network interface card mode. That is, when the intelligent network card switches mode, the host must be notified; specifically, the host and the management system may be notified through an out-of-band management channel (such as the BMC (Baseboard Management Controller)) or through the keep-alive NIC channel.
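A sketch of the notification step, assuming the two channels are tried in a fixed order and that a failed send raises `OSError`. The message fields, the `notify_mode_switch` name, and the channel callables are all hypothetical; the description does not define a message format.

```python
import json


def notify_mode_switch(channels, old_mode, new_mode, reason):
    """Send a mode switching signal over the first channel that works.

    channels: ordered list of (name, send) pairs, e.g. an out-of-band BMC
        channel followed by the keep-alive NIC channel (both hypothetical).
    Returns the name of the channel that delivered the message, or None.
    """
    message = json.dumps(
        {"event": "mode_switch", "from": old_mode, "to": new_mode,
         "reason": reason},
        sort_keys=True,
    )
    for name, send in channels:
        try:
            send(message)
            return name
        except OSError:
            # Channel unavailable; fall through to the next one.
            continue
    return None
```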
Step S14: establishing, based on the context data in the preset shared memory area, the network connection of the target network card running in the network interface card mode, and repairing the target error event of the target network card based on the context data.
It will be appreciated that DPU mode and NIC mode rely on different software function stacks (DPU mode contains complex advanced functions, while NIC mode provides only basic L2-L4 forwarding). When NIC mode is activated, the network card must switch to the dedicated NIC function stack and reload configuration from the saved context to fit the simplified function logic of NIC mode. Therefore, in this embodiment, the network connection of the target network card running in the network interface card mode can be established based on the context data in the preset shared memory area, and the target error event can be repaired based on the context data. After the target error event is repaired, the current running mode of the target network card can be switched back to the data processor mode, the second function stack and the context data in the preset shared memory area are loaded, and the target network card continues to run in the data processor mode according to the second function stack and the context data. Combining the above steps with an efficient context save/restore mechanism minimizes connection interruption during switching and provides state snapshot and restoration for the network card.
Specifically, when the target error event is repaired, the network connection of the target network card running in the network interface card mode can be maintained while the target to be repaired that corresponds to the target error event is diagnosed in the target running environment; the target to be repaired is then repaired based on the diagnosis result and a preset repair operation. The target to be repaired is the software and firmware of the target network card used when running in the data processor mode, and the target running environment is the environment of the target network card running in the network interface card mode. The preset repair operation is any one or a combination of reloading the firmware, patching the target application corresponding to the target to be repaired, and restoring the configuration of the target to be repaired from a preset backup.
In addition, this embodiment is not limited to specific repair operations; other operations capable of repairing the network card can be selected as the situation requires, and repair is not limited to the software or firmware of the network card. For example, hardware-layer repair can address local hardware abnormalities: an independent hardware unit in the network card can be reset locally, hardware resources can be reallocated when errors or exhaustion are detected in part of the hardware resources, or abnormal hardware operating parameters can be adjusted automatically. Firmware/software-layer repair can address an abnormal software stack: crashed critical services can be restarted, stale processes can be cleaned up automatically, or the software version can be rolled back. Meanwhile, after the NIC mode function stack is activated and the network card runs in NIC mode, the network card can also wait for an external management instruction while staying in NIC mode, and then perform the corresponding operation to repair itself. That is, the intelligent network card in this embodiment can use the stable environment of NIC mode to perform deeper diagnosis and repair of the DPU mode software/firmware (such as reloading firmware, applying patches, or restoring configuration from backup), or to wait for external management instructions. After the repair is confirmed, the network card automatically switches back to DPU mode according to policy, restores the advanced functions, and loads the previously saved context.
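The diagnose, repair, and switch-back flow can be sketched as below, under the assumption that the network card state is a plain dictionary and that diagnosis returns a string key selecting one repair operation. The function and key names are hypothetical. When no repair applies or the repair fails, the sketch stays in NIC mode and flags that it is waiting for an external management instruction.

```python
def repair_in_nic_mode(nic, diagnose, repairs):
    """Diagnose and repair the DPU-mode software/firmware from NIC mode.

    nic: dict with at least "mode" and "shared_context" (hypothetical layout).
    diagnose: callable returning a finding string for the given nic.
    repairs: dict mapping a finding to a callable that returns True on success.
    Returns True if the card was repaired and switched back to DPU mode.
    """
    if nic["mode"] != "NIC":
        raise RuntimeError("repair is expected to run in NIC mode")
    finding = diagnose(nic)
    repair = repairs.get(finding)
    if repair is None or not repair(nic):
        # Stay in NIC mode and wait for an external management instruction.
        nic["awaiting_management"] = True
        return False
    # Repair confirmed: switch back and reload the previously saved context.
    nic["mode"] = "DPU"
    nic["dpu_context"] = dict(nic["shared_context"])
    nic["awaiting_management"] = False
    return True
```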
In one specific embodiment, if the target network card receives a mode switching instruction from the target control platform, the current running mode of the target network card is switched directly according to that instruction. In another specific embodiment, if the current running mode of the target network card is the data processor mode and no heartbeat signal from the target control platform is received within a preset time period, the current running mode is switched directly to the network interface card mode. That is, in this embodiment a configurable policy engine is built into the intelligent network card, and mode switching is triggered by preset or dynamically learned rules. The trigger conditions include, but are not limited to: load conditions (DPU processing load too high or too low, or the appearance or disappearance of specific types of traffic, such as traffic requiring deep inspection); energy efficiency requirements (the network card needs to enter a low-power state, which favors NIC mode); task requirements (the host or management plane issues a task instruction that requires a specific mode); error detection signals (the network card detects a recoverable error in DPU mode, triggering a switch to NIC mode for degraded operation or repair); and heartbeat/health check failure (communication between the network card and the host or control plane is interrupted, triggering a switch to the more reliable NIC mode as a fallback). The running mode can therefore be adjusted dynamically according to load and demand to optimize performance and energy efficiency, and the fast error detection and response mechanism improves the overall operating reliability of the intelligent network card.
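The heartbeat rule in this paragraph is simple enough to sketch directly. The `HeartbeatPolicy` class and its injectable clock are assumptions made for illustration and for testability; the description only states that a missing heartbeat within a preset period triggers a switch from the data processor mode to the network interface card mode.

```python
import time


class HeartbeatPolicy:
    """Switch DPU -> NIC when no heartbeat arrives within timeout_s seconds."""

    def __init__(self, timeout_s, now=time.monotonic):
        self.timeout_s = timeout_s
        self._now = now            # injectable clock (assumption, for tests)
        self.last_beat = now()

    def beat(self):
        """Record a heartbeat from the target control platform."""
        self.last_beat = self._now()

    def decide(self, mode):
        """Return the mode the card should run in, given its current mode."""
        expired = self._now() - self.last_beat > self.timeout_s
        if mode == "DPU" and expired:
            return "NIC"
        return mode
```

Using a monotonic clock avoids spurious timeouts when the wall clock is adjusted; a fuller policy engine would combine this rule with the load, energy, task, and error triggers listed above.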
In this embodiment, the running state of the target network card is monitored. If the running state indicates that the target network card currently has a target error event and its current running mode is the data processor mode, the context data of the target network card running in the data processor mode is saved to a preset shared memory area, a first function stack of the target network card is activated, and the current running mode is switched to the network interface card mode. A network connection of the target network card running in the network interface card mode is then established based on the context data in the preset shared memory area, and the target error event is repaired based on the context data. In this technical solution, the network card builds a dual-mode running environment and uses multi-level intelligent error detection and diagnosis. When the data-processor-based mode fails, an automatic switching trigger mechanism switches the network card quickly and autonomously to the stable NIC mode, and both lightweight repair and mode switching repair are available, which greatly reduces service interruption time. This solves the long interruption and data loss that occur when a network card running in the data-processor-based mode recovers from a fault by resetting the whole network card or by relying on host system intervention. It also significantly reduces dependence on manual intervention: the software/firmware of DPU mode can be updated or maintained in the background while the network card runs in NIC mode, so the network card repairs itself, operation and maintenance complexity and cost fall significantly, and the operating stability of the device in which the network card is located improves.
As shown in fig. 4, an embodiment of the present application further provides a network card operation monitoring device, including:
The state monitoring module 11 is configured to monitor an operation state of the target network card, and determine whether the target network card currently has a target error event according to the operation state;
The data obtaining module 12 is configured to obtain context data when the target network card runs based on the data processor mode if the target network card has a target error event and the current running mode of the target network card is the data processor mode;
The mode switching module 13 is configured to store the context data in a preset shared memory area, activate a first function stack of the target network card, and switch a current operation mode of the target network card to a network interface card mode;
The error repairing module 14 is configured to establish a network connection of the target network card when the target network card operates in the network interface card mode based on the context data in the preset shared memory area, and repair a target error event of the target network card based on the context data.
The description of the features in the embodiment corresponding to the network card operation monitoring device may refer to the related description of the embodiment corresponding to the network card operation monitoring method, which is not described in detail herein.
In some embodiments, the state monitoring module specifically includes:
a state monitoring unit, configured to monitor the hardware layer running state, the software layer running state and the functional layer running state of the target network card;
an event judging unit, configured to judge whether the target network card currently has an initial error event according to the hardware layer running state, the software layer running state and the functional layer running state;
an event classification unit, configured to, if the target network card has the initial error event, classify the initial error event by using a preset error classifier to obtain the event category of the initial error event, and to take the initial error event as the target error event when the event category meets the preset mode switching condition.
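As a non-limiting illustration, the three-layer detection and classification described above may be sketched as follows. The check names (`pcie_link`, `firmware_alive`, `packet_forwarding`), the category labels `fatal` and `recoverable`, and the lookup table standing in for the preset error classifier are hypothetical; the application does not specify which events fall into which category.

```python
from typing import Optional

# Hypothetical mapping used as a stand-in for the preset error classifier.
CATEGORY_BY_EVENT = {
    "hardware:pcie_link": "fatal",
    "software:firmware_alive": "fatal",
}
# Hypothetical categories that meet the preset mode switching condition.
SWITCH_CATEGORIES = {"fatal"}


def detect_initial_error(hardware: dict, software: dict,
                         functional: dict) -> Optional[str]:
    """Return the first failed check across the three layers, or None."""
    layers = (("hardware", hardware), ("software", software),
              ("functional", functional))
    for layer_name, state in layers:
        for check, ok in state.items():
            if not ok:
                return f"{layer_name}:{check}"
    return None


def classify(event: str) -> str:
    """Look up the event category; unknown events count as recoverable."""
    return CATEGORY_BY_EVENT.get(event, "recoverable")


def is_target_error(event: Optional[str]) -> bool:
    """An initial error becomes a target error only if its category
    meets the mode switching condition."""
    return event is not None and classify(event) in SWITCH_CATEGORIES
```

An initial error event whose category does not meet the switching condition is left for lightweight repair in the data processor mode rather than triggering a mode switch.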
In some embodiments, the network card operation monitoring device further includes:
a rule determining module, configured to directly determine a network card repairing rule for the target network card running in the data processor mode when the event category does not meet the preset mode switching condition and the current running mode of the target network card is the data processor mode;
an event repairing module, configured to repair the initial error event according to the network card repairing rule.
In some embodiments, the event repairing module is further configured to judge whether the initial error event is successfully repaired; if the repair succeeds, to continue controlling the target network card to operate in the data processor mode; and if the repair fails, to take the initial error event as the target error event so as to switch the current running mode of the target network card to the network interface card mode.
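As a non-limiting illustration, the "repair first, switch on failure" behaviour described above may be sketched as follows. The function `repair_or_escalate`, the representation of repairing rules as callables returning a success flag, and the treatment of an event without a rule as a failed repair are assumptions made for this sketch.

```python
from typing import Callable, Dict


def repair_or_escalate(event: str,
                       repair_rules: Dict[str, Callable[[], bool]],
                       switch_to_nic: Callable[[str], None]) -> str:
    """Try a lightweight repair in DPU mode; escalate to a mode switch
    if the repair fails. Returns the resulting running mode."""
    rule = repair_rules.get(event)
    # Assumption: an event with no matching rule counts as a failed repair.
    repaired = rule() if rule is not None else False
    if repaired:
        return "data_processor"
    # The initial error event is now treated as the target error event.
    switch_to_nic(event)
    return "network_interface_card"
```

For example, a rule table such as `{"functional:packet_forwarding": restart_forwarding}` (with `restart_forwarding` a hypothetical callable) would keep the card in the data processor mode whenever the restart reports success.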
In some embodiments, the network card operation monitoring device further includes:
a resource determining unit, configured to determine a second function stack and hardware resources used by the target network card when running in the data processor mode;
a resource resetting unit, configured to unload or reset the second function stack and the hardware resources.
In some embodiments, the network card operation monitoring device further includes:
a mode resetting module, configured to switch the current running mode of the target network card back to the data processor mode;
a data loading module, configured to load the second function stack and the context data in the preset shared memory area, and to continue operating the target network card in the data processor mode according to the second function stack and the context data.
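As a non-limiting illustration, unloading the second function stack on failover and reloading it together with the saved context when switching back may be sketched as follows. The class `DualModeCard`, its attributes and the `flows` context field are hypothetical names chosen for this sketch.

```python
import copy


class DualModeCard:
    def __init__(self):
        self.mode = "data_processor"
        self.second_stack_loaded = True        # DPU-mode function stack
        self.context = {"flows": ["flow-1"]}   # hypothetical context data
        self.shared_area = {}                  # preset shared memory area

    def fail_over(self):
        """Save context, unload the DPU-mode stack, enter NIC mode."""
        self.shared_area["context"] = copy.deepcopy(self.context)
        self.second_stack_loaded = False
        self.context = {}
        self.mode = "network_interface_card"

    def switch_back(self):
        """Return to DPU mode and resume from the saved context."""
        if "context" not in self.shared_area:
            raise RuntimeError("no saved context in shared memory area")
        self.mode = "data_processor"
        self.second_stack_loaded = True
        self.context = copy.deepcopy(self.shared_area["context"])
```

A deep copy is used in both directions so that later changes to the live context do not alter the saved snapshot.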
In some embodiments, the network card operation monitoring device further includes:
an instruction execution module, configured to directly switch the current running mode of the target network card according to a mode switching instruction if the target network card receives the mode switching instruction from a target control platform;
a signal receiving module, configured to directly switch the current running mode of the target network card to the network interface card mode if the current running mode of the target network card is the data processor mode and no heartbeat signal of the target control platform is received within a preset time period.
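As a non-limiting illustration, the instruction-driven switch and the heartbeat timeout described above may be sketched as follows. The class `HeartbeatWatchdog`, the function `next_mode`, and the use of explicit timestamps in seconds are assumptions made for this sketch; giving an explicit instruction precedence over the timeout is likewise an assumption.

```python
from typing import Optional


class HeartbeatWatchdog:
    """Tracks the last heartbeat received from the control platform."""

    def __init__(self, timeout_s: float, now: float = 0.0):
        self.timeout_s = timeout_s
        self.last_beat = now

    def beat(self, now: float) -> None:
        self.last_beat = now

    def expired(self, now: float) -> bool:
        return now - self.last_beat > self.timeout_s


def next_mode(current_mode: str, watchdog: HeartbeatWatchdog, now: float,
              instruction: Optional[str] = None) -> str:
    """Decide the running mode for the next monitoring cycle."""
    # An explicit mode switching instruction is followed directly.
    if instruction is not None:
        return instruction
    # A missing heartbeat only forces a switch out of DPU mode.
    if current_mode == "data_processor" and watchdog.expired(now):
        return "network_interface_card"
    return current_mode
```

Passing the current time as a parameter keeps the decision deterministic; a deployment might instead read a monotonic clock.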
In some embodiments, the network card operation monitoring device further includes:
a device determining unit, configured to determine a target function accelerator in the target network card;
a device maintaining unit, configured to maintain the operation of the target function accelerator or directly restart the target function accelerator, so as to maintain the network connection of the target network card by using the target function accelerator.
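As a non-limiting illustration, keeping the target function accelerators in service across a mode switch may be sketched as follows. The class `Accelerator`, the function `hold_accelerators` and the policy of restarting only an unhealthy accelerator are assumptions made for this sketch; the application only states that the accelerator is either kept running or restarted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Accelerator:
    name: str
    healthy: bool = True
    restarts: int = 0

    def restart(self) -> None:
        self.restarts += 1
        self.healthy = True


def hold_accelerators(accelerators: Iterable[Accelerator],
                      targets: Set[str]) -> List[str]:
    """Keep each target accelerator running, restarting it if needed.
    Returns the names of the accelerators kept in service."""
    kept = []
    for acc in accelerators:
        if acc.name not in targets:
            continue
        if not acc.healthy:
            acc.restart()   # assumption: restart only when unhealthy
        kept.append(acc.name)
    return kept
```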
In some embodiments, the network card operation monitoring device further includes:
a signal generation module, configured to generate a mode switching signal of the target network card;
a signal sending module, configured to send the mode switching signal to the target control platform through a management controller or through a target interface of the target network card, wherein the target interface is a data transmission interface of the target network card when operating in the network interface card mode.
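As a non-limiting illustration, generating the mode switching signal and sending it over one of the two channels may be sketched as follows. The JSON message layout, the function names and the preference for the management controller over the target interface are assumptions made for this sketch; the application does not define a message format or a channel priority.

```python
import json
from typing import Callable, Optional


def build_mode_switch_signal(card_id: str, old_mode: str,
                             new_mode: str, reason: str) -> str:
    """Serialize a mode switching signal (hypothetical field names)."""
    return json.dumps({"card_id": card_id, "old_mode": old_mode,
                       "new_mode": new_mode, "reason": reason},
                      sort_keys=True)


def send_signal(signal: str,
                management_controller: Optional[Callable[[str], None]] = None,
                target_interface: Optional[Callable[[str], None]] = None) -> str:
    """Send via the first available channel and return its name."""
    channels = (("management_controller", management_controller),
                ("target_interface", target_interface))
    for name, channel in channels:
        if channel is not None:
            channel(signal)
            return name
    raise RuntimeError("no channel available for the mode switching signal")
```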
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general-purpose hardware platform, or by means of hardware alone, although in many cases the former is the preferred implementation.
The embodiment of the application also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor being arranged to run the computer program to perform the steps of any of the network card operation monitoring method embodiments described above.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the network card operation monitoring method embodiments described above when run.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
The embodiment of the application also provides a computer program product, which comprises a computer program, and the computer program realizes the steps in any one of the network card operation monitoring method embodiments when being executed by a processor.
Embodiments of the present application also provide another computer program product, including a non-volatile computer readable storage medium, where the non-volatile computer readable storage medium stores a computer program, where the computer program when executed by a processor implements the steps of any of the above embodiments of a network card operation monitoring method.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The network card operation monitoring method and the electronic equipment provided by the application are described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.

Claims (11)

1. A network card operation monitoring method, characterized by comprising:
monitoring the running state of a target network card, and judging whether a target error event exists in the target network card currently according to the running state;
If the target network card has the target error event and the current running mode of the target network card is a data processor mode, acquiring context data of the target network card running based on the data processor mode;
storing the context data to a preset shared memory area, activating a first function stack of the target network card, and switching the current running mode of the target network card to a network interface card mode;
And establishing network connection of the target network card when the target network card operates based on the network interface card mode based on the context data in the preset shared memory area, and repairing the target error event of the target network card based on the context data.
2. The network card operation monitoring method according to claim 1, wherein the monitoring the operation state of the target network card includes:
Monitoring the hardware layer running state, the software layer running state and the functional layer running state of the target network card;
Correspondingly, the judging whether the target network card currently has the target error event according to the running state includes:
judging whether the target network card currently has an initial error event according to the hardware layer running state, the software layer running state and the functional layer running state;
If the initial error event exists in the target network card, classifying the initial error event by using a preset error classifier to obtain the event category of the initial error event;
And when the event category meets a preset mode switching condition, taking the initial error event as the target error event.
3. The network card operation monitoring method according to claim 2, wherein after classifying the initial error event by using a preset error classifier to obtain an event class of the initial error event, further comprising:
When the event category does not meet the preset mode switching condition and the current running mode of the target network card is a data processor mode, directly determining a network card repairing rule of the target network card when the target network card runs based on the data processor mode;
And repairing the initial error event according to the network card repairing rule.
4. The network card operation monitoring method according to claim 3, wherein after repairing the initial error event according to the network card repairing rule, further comprising:
Judging whether the initial error event is successfully repaired or not;
if the initial error event is successfully repaired, continuing to control the target network card to operate in the data processor mode;
and if the initial error event repair fails, taking the initial error event as the target error event so as to switch the current running mode of the target network card to the network interface card mode.
5. The network card operation monitoring method according to claim 1, wherein after the storing the context data in the preset shared memory area, further comprising:
determining a second function stack and hardware resources when the target network card runs based on the data processor mode;
unloading or resetting the second function stack and the hardware resources.
6. The network card operation monitoring method according to claim 5, wherein after repairing the target error event of the target network card based on the context data, further comprising:
switching the current running mode of the target network card back to the data processor mode;
loading the second function stack and the context data in the preset shared memory area, and continuing to operate the target network card in the data processor mode according to the second function stack and the context data.
7. The network card operation monitoring method according to claim 1, wherein the repairing the target error event of the target network card based on the context data comprises:
Maintaining network connection of the target network card based on the network interface card mode operation, and diagnosing a target to be repaired corresponding to the target error event in a target operation environment to obtain a corresponding diagnosis result, wherein the target to be repaired is software and firmware of the target network card based on the data processor mode operation;
repairing the target to be repaired based on the diagnosis result and a preset repair operation;
wherein the preset repair operation is any one or a combination of: reloading firmware, patching the target application corresponding to the target to be repaired, and restoring the configuration of the target to be repaired from a preset backup.
8. The network card operation monitoring method according to claim 1, further comprising:
if the target network card receives a mode switching instruction from a target control platform, directly switching the current running mode of the target network card according to the mode switching instruction;
if the current running mode of the target network card is the data processor mode and no heartbeat signal of the target control platform is received within a preset time period, directly switching the current running mode of the target network card to the network interface card mode.
9. The network card operation monitoring method according to claim 1, wherein the switching the current operation mode of the target network card to the network interface card mode further comprises:
Determining a target function accelerator in the target network card;
Maintaining the operation of the target function accelerator or directly restarting the target function accelerator so as to maintain the network connection of the target network card by using the target function accelerator.
10. The network card operation monitoring method according to any one of claims 1 to 9, further comprising, after the switching of the current operation mode of the target network card to the network interface card mode:
generating a mode switching signal of the target network card;
sending the mode switching signal to a target control platform through a management controller or through a target interface of the target network card, wherein the target interface is a data transmission interface of the target network card when the target network card operates in the network interface card mode.
11. An electronic device, comprising:
A memory for storing a computer program;
a processor for implementing the steps of the network card operation monitoring method according to any one of claims 1 to 10 when executing the computer program.
CN202511203656.8A 2025-08-26 2025-08-26 A method and electronic device for monitoring network card operation Pending CN120956636A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511203656.8A CN120956636A (en) 2025-08-26 2025-08-26 A method and electronic device for monitoring network card operation

Publications (1)

Publication Number Publication Date
CN120956636A (en) 2025-11-14

Family

ID=97620232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511203656.8A Pending CN120956636A (en) 2025-08-26 2025-08-26 A method and electronic device for monitoring network card operation

Country Status (1)

Country Link
CN (1) CN120956636A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination