Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. Based on the embodiments of the present application, all other embodiments obtained by a person of ordinary skill in the art without making any inventive effort are within the scope of the present application.
It should be noted that in the description of the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. The terms "first," "second," and the like in this specification are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
At present, an intelligent network card is generally statically configured into either DPU mode or NIC mode at initialization and cannot be switched dynamically. When complex tasks run in DPU mode and the software stack or a hardware acceleration unit fails, the service interruption is long and service continuity is affected. In the present application, when the data-processor-based mode fails, the network card switches rapidly and autonomously to the stable network-interface-card-based operation mode, which greatly reduces the service interruption time so that service is either not interrupted or is recovered rapidly. Through autonomous switching of the network card operation mode and autonomous problem repair, the complexity and cost of operation and maintenance are significantly reduced, and the operational stability of the equipment in which the network card is located is improved.
The present application will be further described in detail below with reference to the drawings and detailed description for the purpose of enabling those skilled in the art to better understand the aspects of the present application.
As shown in fig. 1, an embodiment of the present application provides a network card operation monitoring method, which is described in detail below in conjunction with an execution flow of the network card operation monitoring method, including:
Step S11, monitoring the running state of the target network card, and judging whether the target network card currently has a target error event according to the running state.
It can be understood that, as shown in fig. 2, the firmware/software architecture of the intelligent network card in this embodiment supports a complete NIC mode function stack and a complete DPU mode function stack (including an operating system, drivers, a management program, an application framework, etc.). A mode state machine is provided to record the current running mode (DPU Active, NIC Active) and possible transitional states (such as Switching and Error Recovery), and a shared context storage area is provided to save and restore critical state information (such as network connection state, secure session keys, and offloaded flow table entries) during mode switching, which ensures service continuity after switching. The intelligent network card can be controlled through a target control platform, which may be a host, a control plane, or the like. Specifically, the hardware layer monitors the temperature and voltage of the processor (such as Arm cores or x86 cores), ECC (Error Checking and Correcting) memory errors, FPGA (Field Programmable Gate Array)/ASIC (Application Specific Integrated Circuit) status registers, the watchdog timer, and the link state of the PHY (physical layer)/SerDes (Serializer/Deserializer). The firmware/software layer monitors operating system kernel crashes (Panic/Oops), crashes of critical services (e.g., vSwitch, storage target), application crashes, resource exhaustion (memory, threads), protocol stack exceptions, and data consistency check failures (e.g., for DMA (Direct Memory Access)). The functional layer monitors key service indicators (such as a sudden increase in packet loss rate, or a processing delay or encryption/decryption failure rate exceeding its limit).
The error classifier is used to classify each detected error event (e.g., transient error, isolatable software error, firmware logic error, critical hardware error) and to evaluate its scope of impact and severity.
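The mode state machine described above can be illustrated with a minimal sketch. The four state names come from the description; the table of permitted transitions and the names `ALLOWED` and `ModeStateMachine` are assumptions made for illustration, since the description does not enumerate the transitions.

```python
from enum import Enum


class Mode(Enum):
    DPU_ACTIVE = "DPU Active"
    NIC_ACTIVE = "NIC Active"
    SWITCHING = "Switching"
    ERROR_RECOVERY = "Error Recovery"


# Hypothetical transition table: the description names the states,
# but which edges are legal is an assumption of this sketch.
ALLOWED = {
    Mode.DPU_ACTIVE: {Mode.SWITCHING, Mode.ERROR_RECOVERY},
    Mode.ERROR_RECOVERY: {Mode.DPU_ACTIVE, Mode.SWITCHING},
    Mode.SWITCHING: {Mode.NIC_ACTIVE, Mode.DPU_ACTIVE},
    Mode.NIC_ACTIVE: {Mode.SWITCHING},
}


class ModeStateMachine:
    """Records the current running mode and rejects undefined transitions."""

    def __init__(self, initial=Mode.DPU_ACTIVE):
        self.state = initial
        self.history = [initial]

    def transition(self, target):
        if target not in ALLOWED[self.state]:
            raise ValueError(
                f"illegal transition {self.state.name} -> {target.name}"
            )
        self.state = target
        self.history.append(target)
        return self.state
```

In this sketch a direct jump from NIC Active to DPU Active is rejected, so every mode change must pass through the Switching state, where context save and restore would take place.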
Based on the above intelligent network card, as shown in fig. 3, when the running state of the target network card is monitored in this embodiment, the hardware layer running state, the software layer running state and the functional layer running state of the target network card may be monitored, so as to determine whether an initial error event exists in the target network card currently according to the hardware layer running state, the software layer running state and the functional layer running state. If the target network card has the initial error event, classifying the initial error event by using a preset error classifier to obtain an event category of the initial error event, and taking the initial error event as the target error event when the event category meets a preset mode switching condition. The current error event type of the network card can be further determined by carrying out multi-level error monitoring on the running state of the intelligent network card, so that corresponding measures are taken in a targeted manner, and the running stability of the intelligent network card is ensured.
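As a minimal sketch of this detection and classification flow: the four categories are those named in the description, while the event names, the mapping `KIND_TO_CATEGORY`, and the choice of which categories satisfy the mode switching condition (`SWITCH_CATEGORIES`) are illustrative assumptions rather than part of the described method.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Category(Enum):
    TRANSIENT = "transient error"
    ISOLATABLE_SOFTWARE = "isolatable software error"
    FIRMWARE_LOGIC = "firmware logic error"
    CRITICAL_HARDWARE = "critical hardware error"


@dataclass(frozen=True)
class ErrorEvent:
    layer: str  # "hardware", "software" or "functional"
    kind: str   # hypothetical event name reported by a monitor


# Assumed mapping from event names to categories (illustrative only).
KIND_TO_CATEGORY = {
    "link_flap": Category.TRANSIENT,
    "app_crash": Category.ISOLATABLE_SOFTWARE,
    "firmware_logic_fault": Category.FIRMWARE_LOGIC,
    "ecc_uncorrectable": Category.CRITICAL_HARDWARE,
}

# Assumed mode switching condition: categories that need an environment reset.
SWITCH_CATEGORIES = {Category.FIRMWARE_LOGIC, Category.CRITICAL_HARDWARE}


def detect_initial_event(layer_states: Dict[str, List[str]]) -> Optional[ErrorEvent]:
    """Return the first error reported by any layer, checking hardware first."""
    for layer in ("hardware", "software", "functional"):
        for kind in layer_states.get(layer, []):
            return ErrorEvent(layer, kind)
    return None


def classify(event: ErrorEvent) -> Category:
    """Unknown event names are treated as transient in this sketch."""
    return KIND_TO_CATEGORY.get(event.kind, Category.TRANSIENT)


def select_target_event(event: Optional[ErrorEvent]) -> Optional[ErrorEvent]:
    """Promote an initial event to a target event if it meets the switch condition."""
    if event is not None and classify(event) in SWITCH_CATEGORIES:
        return event
    return None
```

For example, an `app_crash` reported by the software layer is classified as an isolatable software error and is not promoted to a target error event, whereas a `firmware_logic_fault` is.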
In another specific embodiment, when the event category does not meet the preset mode switching condition and the current running mode of the target network card is the data processor mode, a network card repair rule for the target network card running in the data processor mode is determined directly, and the initial error event is repaired according to the network card repair rule. As shown in fig. 3, error events that do not meet the preset mode switching condition may be handled by lightweight repair (In-place Recovery): for transient errors or local software errors (e.g., a single application crash), repair is attempted in the current mode (typically DPU mode). Specific repair operations include, but are not limited to, restarting the crashed application or service, isolating the faulty module, resetting the associated hardware unit (e.g., a particular acceleration engine), and rolling back to a known good configuration or software version.
As shown in fig. 3, after the initial error event is repaired according to the network card repair rule, it is further necessary to determine whether the repair succeeded. If the repair fails, the initial error event is taken as the target error event, so that the current running mode of the target network card is switched to the network interface card mode. The data processor mode is a working mode of the target network card based on the data processor, and the network interface card mode is a working mode of the target network card based on the network interface card.
Step S12, if the target network card has a target error event and the current running mode of the target network card is a data processor mode, acquiring the context data of the target network card running based on the data processor mode.
In this embodiment, if the target network card has a target error event and the current running mode of the target network card is the data processor mode, the context data of the target network card running in the data processor mode is obtained. The data processor mode is the working mode of the target network card based on the data processor, namely DPU mode. That is, when the lightweight repair fails or the current error event satisfies the preset mode switching condition, mode switching repair (Failover Recovery) may be performed. Specifically, in this embodiment, when the lightweight repair fails, or the error is classified as requiring a more thorough environment reset (e.g., kernel crash, firmware logic disorder), an autonomous switch to NIC mode (network interface card mode) may be triggered.
Step S13, storing the context data into a preset shared memory area, activating a first function stack of the target network card, and switching the current running mode of the target network card to a network interface card mode.
When the mode switching repair is executed in this embodiment, as shown in fig. 3, the context data is first saved to the preset shared memory area, and the first function stack of the target network card is activated to switch the current running mode of the target network card to the network interface card mode. In this process, after the context data is saved to the preset shared memory area, the second function stack and the hardware resources used by the target network card when running in the data processor mode may also be determined, and the second function stack and the hardware resources may be unloaded or reset. In addition, a target function accelerator in the target network card can be determined, and the target function accelerator can either be kept running or be restarted directly, so that the network connection of the target network card is maintained by the target function accelerator.
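The decision between lightweight repair and mode switching can be sketched as follows. This is a sketch under stated assumptions: the repair rule is modeled as an ordered list of callables that each return `True` on success, and trying them in order until one succeeds is an illustrative policy, not something the description prescribes. The function name and return strings are hypothetical.

```python
from typing import Callable, List


def handle_initial_event(meets_switch_condition: bool,
                         repair_rule: List[Callable[[], bool]]) -> str:
    """Decide what to do with an initial error event detected in DPU mode.

    Returns "stay_in_dpu" if lightweight repair succeeds, otherwise
    "switch_to_nic" (the event is then treated as a target error event).
    """
    if meets_switch_condition:
        return "switch_to_nic"
    for action in repair_rule:
        try:
            if action():
                return "stay_in_dpu"
        except Exception:
            # A repair action that raises is treated as a failed attempt.
            continue
    return "switch_to_nic"


# Hypothetical repair actions for illustration.
def restart_service() -> bool:
    return False  # pretend the restart did not fix the problem


def reset_acceleration_engine() -> bool:
    return True   # pretend the hardware unit reset succeeded
```

With the rule `[restart_service, reset_acceleration_engine]` the card stays in DPU mode because the second action succeeds; with `[restart_service]` alone the repair fails and the event escalates to a mode switch.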
Based on the above technical solution, when the mode switching repair is executed in this embodiment, the network card state is saved first: the current DPU mode activity is frozen, and the key context (network connections, secure sessions, etc.) is saved to the shared context storage area. The mode unload/reset is then performed: the software stack and hardware resources associated with DPU mode are safely unloaded or reset. It may be appreciated that this step may involve restarting the OS (Operating System) or hypervisor on the network card. The NIC mode is then activated: the NIC mode function stack is activated, the necessary configuration is loaded, and basic network connections are rapidly reestablished using the saved context. Meanwhile, while the control plane performs the switching operation, the intelligent network card can keep the hardware data plane (such as a fixed-function accelerator) partially forwarding, or restart it quickly, so that basic traffic is not interrupted and basic network connectivity is maintained in NIC mode, which realizes data plane keep-alive for the network card. For example, for network processing, basic packet forwarding (layers L2-L4), traffic classification, checksum calculation and the like are retained, so that basic data transmission at the hardware level is not interrupted during mode switching. Hardware offload can optionally be supported in this process to accelerate the transmission of and access to stored data, and hardware logic such as integrated encryption/decryption and firewall rule matching can be retained so that security-related tasks are processed quickly, which further improves the reliability of the network card.
Based on the above technical solution, the network card does not need to disconnect its network connections when the current DPU mode is frozen; it maintains network connections and ensures service continuity through measures such as the shared context storage area and data plane keep-alive. The shared context storage area stores key state information such as network connection state, secure session keys, and offloaded flow table entries, which provides the basis for quickly reestablishing network connections after switching. The key parameters of the network connections are therefore retained during the short period in which DPU mode activity is frozen, and the connections can be restored from this information instead of being torn down. In addition, while the control plane performs the switching operation (including freezing DPU mode activity), the hardware data plane (such as a fixed-function accelerator) can retain partial forwarding capability or restart quickly, so that basic traffic is not interrupted and network connections are maintained to a certain extent. If there is a real-time data transmission requirement, the data plane keep-alive mechanism allows the network card to continue processing part of the traffic, which prevents network connections from being interrupted by the DPU mode freeze.
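The save, unload, and activate sequence can be sketched as below. This is a simulation under stated assumptions: `SharedContextStore` stands in for the preset shared memory area using an ordinary in-process object, `SmartNic` and its attributes are hypothetical, and the connection entry `10.0.0.2:443` is made-up sample data.

```python
import copy


class SharedContextStore:
    """Stand-in for the preset shared memory area (a plain object here)."""

    def __init__(self):
        self._snapshot = None

    def save(self, context):
        self._snapshot = copy.deepcopy(context)

    def load(self):
        if self._snapshot is None:
            raise LookupError("no context has been saved")
        return copy.deepcopy(self._snapshot)


class SmartNic:
    """Hypothetical model of a card that can fail over from DPU to NIC mode."""

    def __init__(self):
        self.mode = "DPU"
        self.dpu_stack_loaded = True
        self.nic_stack_active = False
        self.data_plane_forwarding = True  # fixed-function accelerator
        self.connections = {"10.0.0.2:443": "ESTABLISHED"}  # sample data
        self.steps = []

    def failover_to_nic(self, store):
        # 1. Freeze DPU activity and save the key context.
        self.steps.append("save_context")
        store.save({"connections": self.connections})
        # 2. Unload or reset the DPU-mode stack; its in-memory state is lost.
        self.steps.append("unload_dpu_stack")
        self.dpu_stack_loaded = False
        self.connections = {}
        # The data plane flag is deliberately untouched: forwarding continues.
        # 3. Activate the NIC-mode stack and rebuild connections from the snapshot.
        self.steps.append("activate_nic_stack")
        self.nic_stack_active = True
        self.connections = store.load()["connections"]
        self.mode = "NIC"
        return self.mode
```

The ordering matters in this sketch: the snapshot is taken before the DPU stack is torn down, and the data plane flag is never cleared, which mirrors the keep-alive behavior described above.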
In this embodiment, after the current running mode of the target network card is switched to the network interface card mode, a mode switching signal of the target network card may also be generated, and the mode switching signal may be sent to the target control platform through the management controller or through the target interface of the target network card. The target interface is the data transmission interface of the target network card when it runs in the network interface card mode. That is, when a mode switch of the intelligent network card occurs, the host needs to be notified; specifically, the host and the management system may be notified through an out-of-band management channel (such as a BMC (Baseboard Management Controller)) or through the keep-alive NIC channel.
Step S14, based on the context data in the preset shared memory area, establishing the network connection of the target network card running in the network interface card mode, and repairing the target error event of the target network card based on the context data.
It will be appreciated that the DPU mode and the NIC mode rely on different software function stacks (DPU mode contains complex advanced functions, whereas NIC mode provides only basic L2-L4 forwarding). When NIC mode is activated, the card must switch to the dedicated NIC function stack and reload the configuration from the saved context so that it fits the simplified function logic of NIC mode. Therefore, in this embodiment, based on the context data in the preset shared memory area, the network connection of the target network card running in the network interface card mode can be established, and the target error event of the target network card can be repaired based on the context data. After the target error event is repaired based on the context data, the current running mode of the target network card can be switched back to the data processor mode, the second function stack and the context data in the preset shared memory area are loaded, and the target network card continues to run in the data processor mode according to the second function stack and the context data. Therefore, by combining the above steps with an efficient context save/restore mechanism, connection interruption during switching is minimized, and state snapshot and restoration of the network card are realized.
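A minimal sketch of generating and sending the mode switching signal follows. The JSON message layout, the function names, and the policy of trying the out-of-band channel first and then falling back to the NIC channel are all assumptions for illustration; the description only states that either channel may be used.

```python
import json


def build_mode_switch_signal(old_mode, new_mode, reason):
    """Serialize a mode switching signal (hypothetical message layout)."""
    return json.dumps(
        {"event": "mode_switch", "from": old_mode, "to": new_mode, "reason": reason},
        sort_keys=True,
    )


def send_mode_switch_signal(payload, channels):
    """Try each (name, send) channel in order; return the name that succeeded.

    `send` is any callable that accepts the payload and raises OSError
    when the channel is unavailable. Returns None if every channel fails.
    """
    for name, send in channels:
        try:
            send(payload)
            return name
        except OSError:
            continue
    return None
```

A caller would pass something like `[("bmc", bmc_send), ("nic", nic_send)]`, where both send functions are supplied by the platform; if the management controller is unreachable, the signal goes out over the keep-alive NIC channel instead.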
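Adapting the saved context to the simpler NIC mode stack can be sketched as a filtering step. The key names in `NIC_CONTEXT_KEYS` and the idea of keeping the remaining fields aside for a later return to DPU mode are assumptions of this sketch, not details given in the description.

```python
# Assumption: NIC mode only understands these context fields.
NIC_CONTEXT_KEYS = ("connections", "l2_l4_flows")


def adapt_context_for_nic(saved_context):
    """Split saved context into what NIC mode can load and what it cannot.

    Returns (nic_config, deferred). `deferred` holds DPU-only state that
    stays in the shared area until DPU mode is restored.
    """
    nic_config = {k: v for k, v in saved_context.items() if k in NIC_CONTEXT_KEYS}
    deferred = {k: v for k, v in saved_context.items() if k not in NIC_CONTEXT_KEYS}
    return nic_config, deferred
```

For example, a saved context containing `connections` and a DPU-only `deep_inspection_rules` entry would yield a NIC configuration with only `connections`, while the inspection rules remain available for reloading when the card switches back.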
Specifically, when repairing the target error event of the target network card, network connection of the target network card based on the network interface card mode operation can be maintained, diagnosis is performed on the target to be repaired corresponding to the target error event in the target operation environment, a corresponding diagnosis result is obtained, and the target to be repaired is repaired based on the diagnosis result and a preset repair operation. The target to be repaired is software and firmware of the target network card running based on the data processor mode, and the target running environment is an environment corresponding to the target network card running based on the network interface card mode. And it should be noted that, the above-mentioned preset repair operation is any one or a combination of several of reloading the firmware, patching the target application corresponding to the target to be repaired, and recovering the configuration of the target to be repaired based on the preset backup. 
In addition, this embodiment is not limited to the specific repair operations above; other operations capable of repairing the network card can be selected according to the actual situation, and the repair is not limited to the software or firmware of the network card. For example, hardware-layer repair can be performed for a local hardware abnormality: an independent hardware unit in the network card can be locally reset, hardware resources can be reassigned when an error or resource exhaustion is detected in part of the hardware resources, or parameter configuration can be adjusted automatically when hardware operating parameters are abnormal. Firmware/software-layer repair can be performed when the software stack of the network card is abnormal: crashed critical services in the software stack can be restarted, garbage processes can be cleaned up automatically, or the software version can be rolled back. Meanwhile, after the NIC mode function stack is activated so that the network card runs in NIC mode, the network card can also wait for an external management instruction while remaining in NIC mode, and then perform the corresponding operation according to the received instruction to repair the network card. That is, the intelligent network card in this embodiment can use the stable environment of NIC mode to perform deeper diagnosis and repair of the DPU mode software/firmware (such as reloading firmware, applying patches, or restoring configuration from backup), or to wait for external management instructions. After the repair is confirmed, the network card automatically switches back to DPU mode according to the policy, restores the advanced functions, and loads the previously saved context.
In one specific embodiment, if the target network card receives a mode switching instruction from the target control platform, the current running mode of the target network card is switched directly according to the mode switching instruction. In another specific embodiment, if the current running mode of the target network card is the data processor mode and no heartbeat signal from the target control platform is received within a preset time period, the current running mode of the target network card is switched directly to the network interface card mode. That is, in this embodiment a configurable policy engine is built into the intelligent network card, and mode switching is triggered based on preset or dynamically learned rules. The trigger conditions include, but are not limited to: load conditions, where the DPU processing load is too high or too low, or a specific type of traffic (such as traffic requiring deep inspection) appears or disappears; energy efficiency requirements, where the network card needs to enter a low power consumption state (favoring NIC mode); task requirements, where the host or the management plane issues a task instruction that requires a specific mode; error detection signals, where the network card detects a recoverable error in DPU mode and triggers a switch to NIC mode for degraded operation or repair; and heartbeat/health check failure, where communication between the network card and the host or control plane is interrupted and a switch to the more reliable NIC mode is triggered as a fallback. Therefore, the running mode can be adjusted dynamically according to load and demand, which optimizes performance and energy efficiency, and the rapid error detection and response mechanism improves the overall operational reliability of the intelligent network card.
In this embodiment, the running state of the target network card is monitored. If it is determined according to the running state that the target network card currently has a target error event, and the current running mode of the target network card is the data processor mode, the context data of the target network card running in the data processor mode is saved to a preset shared memory area, a first function stack of the target network card is activated, and the current running mode of the target network card is switched to the network interface card mode, so that the network connection of the target network card running in the network interface card mode is established based on the context data in the preset shared memory area and the target error event is repaired based on the context data. According to this technical solution, when the target network card has a target error event, the context data of the current working mode is saved to the shared memory area, the current running mode is switched to the network-interface-card-based working mode, and the network connection is reestablished from the context data in the shared memory area. The network card thus constructs a dual-mode running environment with multi-level intelligent error detection and diagnosis. When the data-processor-based mode fails, an autonomous switching trigger mechanism switches the card rapidly to the stable NIC mode, and both lightweight repair and mode switching repair are available, which greatly reduces service interruption time. This avoids the long service interruptions and data loss that arise when a network card running in the data-processor-based mode fails and can only be recovered by resetting the whole network card or by relying on host system intervention. Meanwhile, the dependence on manual intervention is significantly reduced, the DPU mode software/firmware can be updated or maintained in the background while the card runs in NIC mode, self-maintenance capability and running stability are improved, and operating cost is significantly lowered.
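The diagnose, repair, and switch-back flow in NIC mode can be sketched as follows. The function name, the convention that diagnosis returns a list of repair operation names, and the behavior of staying in NIC mode when a required operation is unavailable are assumptions made for illustration.

```python
def repair_in_nic_mode(diagnose, repair_ops, verify):
    """Run the diagnosed repairs and report which mode to run in next.

    diagnose:   callable returning a list of repair operation names
    repair_ops: dict mapping operation name to a zero-argument callable
    verify:     callable returning True once the DPU software/firmware is healthy

    Returns (next_mode, applied_operations).
    """
    needed = diagnose()
    applied = []
    for name in needed:
        op = repair_ops.get(name)
        if op is None:
            # No local repair is available: remain in NIC mode and wait
            # for an external management instruction.
            return "NIC", applied
        op()
        applied.append(name)
    return ("DPU" if verify() else "NIC"), applied
```

Returning to DPU mode only after `verify()` succeeds reflects the statement above that the switch back happens after the repair is confirmed.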
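Two of the triggers above, an explicit mode switching instruction and a missed heartbeat, can be sketched in a small policy engine. The class name, the use of caller-supplied timestamps in seconds, and the rule that an explicit instruction takes priority over the heartbeat check are assumptions of this sketch.

```python
class PolicyEngine:
    """Decides the next running mode from instructions and heartbeat age."""

    def __init__(self, heartbeat_timeout_s, start):
        self.heartbeat_timeout_s = heartbeat_timeout_s
        self.last_heartbeat = start  # time of the most recent heartbeat

    def on_heartbeat(self, now):
        self.last_heartbeat = now

    def decide(self, mode, now, instruction=None):
        """Return the mode ("DPU" or "NIC") the card should run in next."""
        # An explicit instruction from the control platform wins.
        if instruction in ("DPU", "NIC"):
            return instruction
        # In DPU mode, a missed heartbeat triggers the fallback to NIC mode.
        if mode == "DPU" and now - self.last_heartbeat > self.heartbeat_timeout_s:
            return "NIC"
        return mode
```

Passing the current time in as an argument keeps the sketch deterministic; a real implementation would more likely read a monotonic clock. The other triggers listed above (load, energy efficiency, error signals) could be added as further checks in `decide`.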
As shown in fig. 4, an embodiment of the present application further provides a network card operation monitoring device, including:
The state monitoring module 11 is configured to monitor an operation state of the target network card, and determine whether the target network card currently has a target error event according to the operation state;
The data obtaining module 12 is configured to obtain context data when the target network card runs based on the data processor mode if the target network card has a target error event and the current running mode of the target network card is the data processor mode;
The mode switching module 13 is configured to store the context data in a preset shared memory area, activate a first function stack of the target network card, and switch a current operation mode of the target network card to a network interface card mode;
The error repairing module 14 is configured to establish a network connection of the target network card when the target network card operates in the network interface card mode based on the context data in the preset shared memory area, and repair a target error event of the target network card based on the context data.
The description of the features in the embodiment corresponding to the network card operation monitoring device may refer to the related description of the embodiment corresponding to the network card operation monitoring method, which is not described in detail herein.
In some embodiments, the status monitoring module specifically includes:
the state monitoring unit is used for monitoring the hardware layer running state, the software layer running state and the functional layer running state of the target network card;
the event judging unit is used for judging whether the target network card currently has an initial error event according to the hardware layer running state, the software layer running state and the functional layer running state;
The event classification unit is used for classifying the initial error event by using a preset error classifier if the initial error event exists in the target network card, so as to obtain the event category of the initial error event, and taking the initial error event as the target error event when the event category meets the preset mode switching condition.
In some embodiments, the network card operation monitoring device further includes:
the rule determining module is used for directly determining the network card repairing rule of the target network card when the event type does not meet the preset mode switching condition and the current running mode of the target network card is the data processor mode;
and the event repairing module is used for repairing the initial error event according to the network card repairing rule.
In some embodiments, the event repair module is further configured to determine whether the initial error event is successfully repaired; if the repair succeeds, continue to control the target network card to operate in the data processor mode; and if the repair fails, take the initial error event as the target error event so as to switch the current operation mode of the target network card to the network interface card mode.
In some embodiments, the network card operation monitoring device further includes:
the resource determining unit is used for determining a second function stack and hardware resources of the target network card when the target network card runs based on the data processor mode;
and the resource resetting unit is used for unloading or resetting the second functional stack and the hardware resources.
In some embodiments, the network card operation monitoring device further includes:
The mode resetting module is used for switching the current running mode of the target network card to the data processor mode again;
and the data loading module is used for loading the second functional stack and the context data in the preset shared storage area, and continuously operating the target network card according to the second functional stack and the context data based on the data processor mode.
In some embodiments, the network card operation monitoring device further includes:
the instruction execution module is used for directly switching the current running mode of the target network card according to the mode switching instruction if the target network card receives the mode switching instruction of the target control platform;
and the signal receiving module is used for directly switching the current running mode of the target network card to the network interface card mode if the current running mode of the target network card is a data processor mode and the heartbeat signal of the target control platform is not received within a preset time period.
In some embodiments, the network card operation monitoring device further includes:
The device determining unit is used for determining a target function accelerator in the target network card;
and the device holding unit is used for holding the operation of the target function accelerator or directly restarting the target function accelerator so as to hold the network connection of the target network card by using the target function accelerator.
In some embodiments, the network card operation monitoring device further includes:
The signal generation module is used for generating a mode switching signal of the target network card;
the signal sending module is used for sending a mode switching signal to the target control platform through a target interface of the management controller or the target network card; the target interface is a data transmission interface of the target network card when operating based on the network interface card mode.
From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment.
The embodiment of the application also provides an electronic device comprising a memory and a processor, the memory storing a computer program, the processor being arranged to run the computer program to perform the steps of any of the network card operation monitoring method embodiments described above.
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps of any of the network card operation monitoring method embodiments described above when run.
In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other media in which a computer program may be stored.
The embodiment of the application also provides a computer program product, which comprises a computer program, and the computer program realizes the steps in any one of the network card operation monitoring method embodiments when being executed by a processor.
Embodiments of the present application also provide another computer program product, including a non-volatile computer readable storage medium, where the non-volatile computer readable storage medium stores a computer program, where the computer program when executed by a processor implements the steps of any of the above embodiments of a network card operation monitoring method.
Those of skill in the art would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative elements and steps are described above generally in terms of functionality in order to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The network card operation monitoring method and the electronic equipment provided by the application are described in detail. The principles and embodiments of the present application have been described herein with reference to specific examples, the description of which is intended only to facilitate an understanding of the method of the present application and its core ideas. It should be noted that it will be apparent to those skilled in the art that various modifications and adaptations of the application can be made without departing from the principles of the application and these modifications and adaptations are intended to be within the scope of the application as defined in the following claims.