
WO2025003737A1 - Observability-based cloud service level agreement enforcement - Google Patents

Observability-based cloud service level agreement enforcement

Info

Publication number
WO2025003737A1
Authority
WO
WIPO (PCT)
Prior art keywords
sla
service
node
local
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/IB2023/056849
Other languages
English (en)
Inventor
Timo SIMANAINEN
Miika KOMU
Tero Kauppinen
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Telefonaktiebolaget LM Ericsson AB
Original Assignee
Telefonaktiebolaget LM Ericsson AB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Telefonaktiebolaget LM Ericsson AB filed Critical Telefonaktiebolaget LM Ericsson AB
Priority to PCT/IB2023/056849 priority Critical patent/WO2025003737A1/fr
Publication of WO2025003737A1 publication Critical patent/WO2025003737A1/fr
Anticipated expiration legal-status Critical
Pending legal-status Critical Current


Classifications

    • H04L41/5025: Ensuring fulfilment of SLA by proactively reacting to service quality change, e.g. by reconfiguration after service quality degradation or upgrade
    • H04L41/046: Network management architectures or arrangements comprising network management agents or mobile agents therefor
    • H04L41/0816: Configuration setting characterised by the conditions triggering a change of settings, the condition being an adaptation, e.g. in response to network events
    • H04L41/0823: Configuration setting characterised by the purposes of a change of settings, e.g. optimising configuration for enhancing reliability
    • H04L41/0893: Assignment of logical groups to network elements
    • H04L41/40: Arrangements for maintenance, administration or management of data switching networks using virtualisation of network functions or resources, e.g. SDN or NFV entities
    • H04L41/5009: Determining service level performance parameters or violations of service level contracts, e.g. violations of agreed response time or mean time between failures [MTBF]
    • H04L43/091: Measuring contribution of individual network components to actual service level
    • H04L43/20: Arrangements for monitoring or testing data switching networks, the monitoring system or the monitored elements being virtualised, abstracted or software-defined entities, e.g. SDN or NFV

Definitions

  • the various embodiments described herein relate to the field of communication networks and, more specifically, to observability-based cloud service level agreement enforcement.
  • Observability is a measure of how well the internal states of the system can be inferred from knowledge of its external outputs. Maintenance of a large and geographically distributed system with a large number of nodes is difficult without using an observability system.
  • various nodes include observation data collectors that collect observation data associated with the various nodes. The nodes provide the collected observation data to a central location, such as a controller node, for analysis.
  • observation data collectors are typically configured to collect a pre-defined set of data and to transmit collected data to the controller node, but do not receive data or other communications from the controller node during operation.
  • an observation system could be configured to monitor the performance of the system.
  • the observation data collectors could each be configured to collect performance information for the corresponding node.
  • the controller node could execute analysis software that receives performance information from different nodes in the system and analyzes the performance information to determine performance of the system.
  • observation system cannot be easily adapted to analyze other types of data and/or perform other types of determinations.
  • an observation system could be configured to collect and analyze many different types of data.
  • the observation data collectors could each be configured to collect a variety of data associated with the corresponding node.
  • the controller node could receive the data and perform different data analysis on different portions of the data as desired.
  • However, as more observation data is collected, transmitted, and processed, more central processing unit (CPU), memory, and network resources are consumed. Because such an approach requires the observability system to collect, transmit, and process large amounts of data, the performance of the system suffers.
  • One embodiment of the present application sets forth a method performed by a first node in a cluster for enforcing service level agreements for services deployed within the cluster.
  • the method includes collecting local observation data associated with the first node.
  • the method further includes determining, based on the local observation data, that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being fulfilled by the first node.
  • the method further includes determining one or more local SLA enforcement operations based on the local observation data and the first SLA.
  • the method further includes performing the one or more local SLA enforcement operations.
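As a purely illustrative sketch of the agent-side steps above, the following Go snippet checks a locally measured latency sample against an assumed SLA threshold and selects a node-level enforcement operation; the localObservation type, the threshold, and the operation name are hypothetical and not part of the claimed method.

```go
package main

import "fmt"

// localObservation is a hypothetical sample of local observation data.
type localObservation struct {
	service        string
	responseTimeMs float64
}

// slaFulfilled checks a single latency requirement against a local sample.
func slaFulfilled(obs localObservation, maxLatencyMs float64) bool {
	return obs.responseTimeMs <= maxLatencyMs
}

// determineLocalEnforcement picks a node-level action based on the violation.
func determineLocalEnforcement(obs localObservation) string {
	// Illustrative mapping: a latency violation is addressed by raising the
	// processing priority of the service instance on this node.
	return "increase-processing-priority:" + obs.service
}

func main() {
	obs := localObservation{service: "checkout", responseTimeMs: 250}
	const maxLatencyMs = 100 // assumed SLA requirement

	if !slaFulfilled(obs, maxLatencyMs) {
		op := determineLocalEnforcement(obs)
		fmt.Println("local SLA enforcement operation:", op) // perform locally
	}
}
```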
  • One embodiment of the present application sets forth a method performed by a first node in a cluster for enforcing service level agreements for services deployed within the cluster.
  • the method includes receiving observation data collected by a plurality of nodes in the cluster.
  • the method further includes receiving, from a second node, data indicating that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being fulfilled.
  • the method further includes in response to receiving data indicating that the first SLA is not being fulfilled, determining one or more SLA enforcement operations based on the observation data and the first SLA.
  • the method further includes causing the one or more SLA enforcement operations to be performed in the cluster.
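A comparable, purely illustrative sketch of the controller-side steps above is given below; it receives per-node observation summaries and a violation notification, then derives a cluster-level operation (here, migrating the service to the least loaded node). The nodeReport and slaViolation types and the migration heuristic are assumptions.

```go
package main

import "fmt"

// nodeReport is a hypothetical per-node observation summary.
type nodeReport struct {
	node    string
	cpuLoad float64 // 0.0 - 1.0
}

// slaViolation is a hypothetical notification from an agent node.
type slaViolation struct {
	service string
	node    string
}

// determineClusterEnforcement picks cluster-level operations, e.g. moving the
// service instance to the least loaded node in the received observation data.
func determineClusterEnforcement(v slaViolation, reports []nodeReport) []string {
	best := reports[0]
	for _, r := range reports[1:] {
		if r.cpuLoad < best.cpuLoad {
			best = r
		}
	}
	return []string{fmt.Sprintf("migrate %s from %s to %s", v.service, v.node, best.node)}
}

func main() {
	reports := []nodeReport{{"node-a", 0.95}, {"node-b", 0.30}, {"node-c", 0.60}}
	violation := slaViolation{service: "checkout", node: "node-a"}

	for _, op := range determineClusterEnforcement(violation, reports) {
		fmt.Println("cluster SLA enforcement operation:", op) // cause to be performed
	}
}
```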
  • One embodiment of the present application includes a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for enforcing service level agreements for services deployed within a cluster.
  • the operations include collecting local observation data associated with a first node.
  • the operations further include determining, based on the local observation data, that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being fulfilled by the first node.
  • the operations further include determining one or more local SLA enforcement operations based on the local observation data and the first SLA.
  • the operations further include performing the one or more local SLA enforcement operations.
  • One embodiment of the present application includes a non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for enforcing service level agreements for services deployed within a cluster.
  • the operations include receiving observation data collected by a plurality of nodes in the cluster.
  • the operations further include receiving, from a second node, data indicating that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being fulfilled.
  • the operations further include in response to receiving data indicating that the first SLA is not being fulfilled, determining one or more SLA enforcement operations based on the observation data and the first SLA.
  • the operations further include causing the one or more SLA enforcement operations to be performed in the cluster.
  • One embodiment of the present application includes an agent computing node that includes one or more processors.
  • the agent computing node further includes a memory storing instructions which, when executed by the one or more processors, cause the agent computing node to carry out operations for enforcing service level agreements for services deployed within a cluster.
  • the operations include collecting local observation data associated with a first node.
  • the operations further include determining, based on the local observation data, that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being fulfilled by the first node.
  • the operations further include determining one or more local SLA enforcement operations based on the local observation data and the first SLA.
  • the operations further include performing the one or more local SLA enforcement operations.
  • One embodiment of the present application includes a controller computing node that includes one or more processors.
  • the controller computing node further includes a memory storing instructions which, when executed by the one or more processors, cause the controller computing node to carry out operations for enforcing service level agreements for services deployed within a cluster.
  • the operations include receiving observation data collected by a plurality of nodes in the cluster.
  • the operations further include receiving, from a second node, data indicating that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being fulfilled.
  • the operations further include in response to receiving data indicating that the first SLA is not being fulfilled, determining one or more SLA enforcement operations based on the observation data and the first SLA.
  • the operations further include causing the one or more SLA enforcement operations to be performed in the cluster.
  • Figure 1 illustrates an observability system configured to implement one or more aspects of the various embodiments
  • Figure 2 is a diagram illustrating interactions between components of the system of Figure 1 to configure resources and data collection for a new service, according to various embodiments;
  • Figure 3 is a diagram illustrating interactions between components of the system of Figure 1 to determine whether the system has sufficient resources to fulfill an SLA for a service, according to various embodiments;
  • Figure 4 is a diagram illustrating interactions between components of the system of Figure 1 to enforce an SLA at an agent computing node, according to various embodiments;
  • Figure 5 is a diagram illustrating interactions between components of the system of Figure 1 to enforce an SLA at a controller computing node, according to various embodiments;
  • Figure 6 is a flowchart of method steps for enforcing an SLA at an agent computing node, according to various embodiments;
  • Figure 7 is a flowchart of method steps for enforcing an SLA at a controller computing node, according to various embodiments.
  • Figure 8 illustrates network devices in an exemplary network, according to various embodiments.
  • references in the specification to “one embodiment,” “an embodiment,” “an example embodiment,” etc., indicate that the embodiment described may include a particular feature, structure, or characteristic, but every embodiment may not necessarily include the particular feature, structure, or characteristic. Moreover, such phrases are not necessarily referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is submitted that it is within the knowledge of one skilled in the art to affect such feature, structure, or characteristic in connection with other embodiments whether or not explicitly described.
  • Bracketed text and blocks with dashed borders may be used herein to illustrate optional operations that add additional features to embodiments of the invention. However, such notation should not be taken to mean that these are the only options or optional operations, and/or that blocks with solid borders are not optional in certain embodiments of the invention.
  • Coupled is used to indicate that two or more elements, which may or may not be in direct physical or electrical contact with each other, co-operate or interact with each other.
  • Connected is used to indicate the establishment of communication between two or more elements that are coupled with each other.
  • An electronic device stores and transmits (internally and/or with other electronic devices over a network) code (which is composed of software instructions and which is sometimes referred to as computer program code or a computer program) and/or data using machine-readable media (also called computer-readable media), such as machine-readable storage media (e.g., magnetic disks, optical disks, solid state drives, read only memory (ROM), flash memory devices, phase change memory) and machine-readable transmission media (also called a carrier) (e.g., electrical, optical, radio, acoustical or other form of propagated signals - such as carrier waves, infrared signals).
  • An electronic device (e.g., a computer) includes hardware and software, such as a set of one or more processors (e.g., wherein a processor is a microprocessor, controller, microcontroller, central processing unit, digital signal processor, application specific integrated circuit, field programmable gate array, other electronic circuitry, or a combination of one or more of the preceding) coupled to one or more machine-readable storage media to store code for execution on the set of processors and/or to store data.
  • an electronic device may include non-volatile memory containing the code since the non-volatile memory can persist code/data even when the electronic device is turned off (when power is removed), and while the electronic device is turned on that part of the code that is to be executed by the processor(s) of that electronic device is typically copied from the slower nonvolatile memory into volatile memory (e.g., dynamic random access memory (DRAM), static random access memory (SRAM)) of that electronic device.
  • Typical electronic devices also include a set of one or more physical network interface(s) (NI(s)) to establish network connections (to transmit and/or receive code and/or data using propagating signals) with other electronic devices.
  • a physical NI may comprise radio circuitry capable of receiving data from other electronic devices over a wireless connection and/or sending data out to other devices via a wireless connection.
  • This radio circuitry may include transmitter(s), receiver(s), and/or transceiver(s) suitable for radiofrequency communication.
  • the radio circuitry may convert digital data into a radio signal having the appropriate parameters (e.g., frequency, timing, channel, bandwidth, etc.). The radio signal may then be transmitted via antennas to the appropriate recipient(s).
  • the set of physical NI(s) may comprise network interface controller(s) (NICs), also known as a network interface card, network adapter, or local area network (LAN) adapter.
  • the NIC(s) may facilitate in connecting the electronic device to other electronic devices allowing them to communicate via wire through plugging in a cable to a physical port connected to a NIC.
  • One or more parts of an embodiment of the invention may be implemented using different combinations of software, firmware, and/or hardware.
  • a network device is an electronic device that communicatively interconnects other electronic devices on the network (e.g., other network devices, end-user devices).
  • Some network devices are “multiple services network devices” that provide support for multiple networking functions (e.g., routing, bridging, switching, Layer 2 aggregation, session border control, Quality of Service, and/or subscriber management), and/or provide support for multiple application services (e.g., data, voice, and video).
  • Service level agreements describe a level of service that will be provided to a client by a service provider.
  • a service level agreement describes a level of service that a service or application is expected to provide, such as minimum and/or average values of various network characteristics (e.g., throughput), compute characteristics (e.g., processing time), service response time (e.g., time to respond to a user request), and/or the like.
  • an observability system is typically configured to only monitor a given system and provide reports to users (e.g., via a graphical user interface). As a result, these systems are not configured to enforce service level agreements (i.e., take actions to ensure that the system meets the service level agreements). At best, an observability system could be configured to collect observation data associated with service level agreements and provide analytics relating to the service level agreements.
  • the observability system is able to use collected observation data to perform SLA enforcement operations.
  • the node also analyzes the data to detect when an SLA is not being met or fulfilled.
  • the node attempts node-level operations to address the SLA issue locally at the node. Additionally, if the node cannot address the SLA issue locally, a central service can attempt system-level operations to address the SLA issue globally.
  • Because the observability system is able to monitor the SLAs of multiple different services, if the characteristics of a service do not reach the agreed level and the system is running low on resources, the observability system can re-assign resources from low-priority services to services with high SLA requirements. Further, because the observability system monitors resource usage, the system can decide whether a newly requested SLA can be fulfilled with the existing resources of the monitored system. If the new request would cause the system to run out of allocated resources, the observability system could reject the SLA request or cause the system to scale out in order to fulfill the SLA request. A sketch of this admission decision follows below.
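The admission decision described above can be sketched as follows, assuming abstract capacity units and an illustrative admitSLA helper that is not part of the described system.

```go
package main

import "fmt"

// admitSLA decides whether a new SLA request can be fulfilled. free is the
// currently unallocated capacity, reclaimable is capacity that could be
// re-assigned from low-priority services, and required is what the new SLA
// needs. Units are abstract capacity units; all values are illustrative.
func admitSLA(required, free, reclaimable float64) string {
	switch {
	case required <= free:
		return "accept"
	case required <= free+reclaimable:
		return "accept-after-reassigning-low-priority-resources"
	default:
		return "reject-or-scale-out"
	}
}

func main() {
	fmt.Println(admitSLA(10, 25, 5)) // accept
	fmt.Println(admitSLA(28, 25, 5)) // accept-after-reassigning-low-priority-resources
	fmt.Println(admitSLA(40, 25, 5)) // reject-or-scale-out
}
```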
  • One benefit of the disclosed techniques is that the system is able to more quickly and efficiently enforce the SLA for a service by detecting and addressing SLA issues at individual nodes.
  • the node does not need to wait for a central analytical service to receive observation data, analyze the data, and then provide instructions to the node. Therefore, the system is able to meet SLA requirements for a greater amount of time compared to other approaches.
  • the node first addresses SLA issues locally, less data needs to be transmitted to other nodes (e.g., to a central analytical service) in order to enforce SLAs.
  • the system also reduces or minimizes bandwidth usage and processing time/power compared to other systems that rely on transmitting data to an analysis software for processing.
  • Figure 1 illustrates an observability system configured to implement one or more aspects of the various embodiments.
  • the observability system includes a controller computing node 110 and one or more agent computing nodes, such as agent computing nodes 140(1)-(N).
  • the controller computing node 110 communicates with the agent computing nodes 140(1)-(N) over a network 130.
  • Network 130 can include any suitable combination of wired or wireless communication over local area networks and wide area networks.
  • a node can be any suitable type of node, including computing nodes, network nodes, storage nodes, and/or the like.
  • the agent computing nodes 140(1)-(N), or a portion thereof, collectively implement a distributed system (e.g., an application composed of a number of microservices).
  • the controller computing node 110 is one of the nodes in the distributed system and/or is configured as a controller for the distributed system.
  • the agent computing nodes 140 and the controller computing node 110 could implement a Kubernetes system.
  • Agent computing nodes 140 collect local observation data and transmit collected data to controller computing node 110.
  • Local observation data comprises data (e.g., metrics, system information, logs, traces, and/or the like) that is associated with and collected at the agent computing node 140.
  • local observation data comprises local metrics that are measured at the agent computing node 140.
  • Local observation data includes, for example and without limitation, device statistics/measurements, network statistics/measurements, interrupt information, kernel statistics/measurements, file system statistics/measurements, session information/measurements, routing information, device logs, network logs, hardware/software/storage status information, service request status information, and/or the like.
  • Controller computing node 110 receives the local observation data from various agent computing nodes (e.g., agent computing nodes 140(1)-(N)). Controller computing node 110 stores, processes, and/or analyzes the received observation data. In some embodiments, controller computing node 110 generates global observation data based on the local observation data received from the agent computing nodes 140. Global observation data comprises data (e.g., metrics, analytical data, system-wide service information, and/or the like) associated with the system being observed. In some embodiments, local observation data includes values (i.e., metrics) that are measured at an agent computing node 140. Global observation data includes values that cannot be directly measured and are instead computed based on the local observation data.
  • controller computing node 110 computes various global metrics from multiple locally measured values.
  • controller computing node 110 could receive, from various agent computing nodes 140, average latencies between a given computing node and a neighboring computing node.
  • the controller computing node 110 could determine, based on the average latencies, the total (average) latency for a service chain across multiple service chain components (i.e., total latency across a plurality of computing nodes).
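The global-metric example above can be made concrete with a small sketch: per-hop average latencies reported by agent nodes are summed into a total service-chain latency at the controller. The hopLatency type and the numeric values are illustrative assumptions.

```go
package main

import "fmt"

// hopLatency is a hypothetical per-hop average latency reported by an agent node.
type hopLatency struct {
	from, to string
	avgMs    float64
}

// chainLatency computes the total (average) latency of a service chain from the
// locally measured per-hop averages, i.e. a global metric derived by the controller.
func chainLatency(hops []hopLatency) float64 {
	total := 0.0
	for _, h := range hops {
		total += h.avgMs
	}
	return total
}

func main() {
	hops := []hopLatency{
		{"ingress", "frontend", 2.1},
		{"frontend", "backend", 3.4},
		{"backend", "database", 1.8},
	}
	fmt.Printf("total service chain latency: %.1f ms\n", chainLatency(hops)) // 7.3 ms
}
```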
  • Global observation data includes, for example and without limitation, power consumption of a service, response latency of a service, service availability (e.g., as a percentage over a period of time), service reliability, service throughput, total system utilization, parallel active (e.g., not yet replied) requests to a service, and/or the like.
  • the controller computing node 110 includes a service level agreement (SLA) operator 122, a central policy controller 124, a central observability controller 126, and a central network controller 128.
  • one or more components of controller computing node 110 could be included in a different node, such as one or more of the agent computing nodes 140.
  • Agent computing node 140(1) includes a local policy controller 152, a local observability controller 154, an observation data collector 156, and a local network controller 158.
  • agent computing nodes 140(2)-(N) may include the same or similar components as agent computing node 140(1) (which are not shown in the diagram to reduce clutter) and can operate in a similar manner to that discussed herein with reference to agent computing node 140(1). Any given agent computing node 140 could include more or fewer components than that shown in Figure 1, depending on the implementation.
  • In some embodiments, one or more components of controller computing node 110 (e.g., SLA operator 122, central policy controller 124, central observability controller 126, and/or central network controller 128) and/or of an agent computing node 140 (e.g., local policy controller 152, local observability controller 154, observation data collector 156, and/or local network controller 158) are virtualized.
  • the controller computing node 110 could implement a virtual computing node 120 that implements the SLA operator 122, central policy controller 124, central observability controller 126, and central network controller 128.
  • agent computing node 140(1) could implement a virtual computing node 150(1) that implements the local policy controller 152, local observability controller 154, observation data collector 156, and local network controller 158.
  • SLA operator 122 receives an SLA manifest that describes the service (i.e., a service definition) and the SLA requirements associated with the service.
  • the SLA manifest is first sent to a Kubernetes API server, and the Kubernetes API server forwards the SLA definition within the manifest to SLA operator 122.
  • SLA operator 122 transmits the SLA manifest and/or SLA description to the central policy controller 124. Additionally, in some embodiments, SLA operator 122 transmits the SLA manifest and/or SLA description to the local policy controller 152 of one or more agent computing nodes 140 (e.g., the agent computing nodes 140 that are assigned to execute the service). In some embodiments, central policy controller 124 transmits the SLA manifest and/or SLA description, or a subset thereof, to the local policy controller(s) 152. The central policy controller 124 and local policy controller(s) 152 use the SLA manifest and/or SLA description to determine the SLA(s) associated with the service.
  • the central policy controller 124 and/or local policy controller(s) 152 store the SLA manifest and/or SLA description in association with the service. Subsequently, the central policy controller 124 and/or local policy controller(s) 152 retrieve the SLA manifest and/or SLA description when determining the SLA(s) associated with the service. In some embodiments, the central policy controller 124 and/or local policy controller(s) 152 determine the SLA(s) associated with the service and generate information (e.g., configuration information) that indicate the determined SLA(s). The central policy controller 124 and/or local policy controller(s) 152 use the generated information when determining the SLA(s) associated with the service.
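As an illustration only, the following sketch shows one way a policy controller might represent and store SLA descriptions derived from such a manifest; the slaDescription and policyStore types and their field names are assumptions rather than the actual manifest schema.

```go
package main

import "fmt"

// slaDescription is a hypothetical in-memory form of an SLA manifest entry.
type slaDescription struct {
	service         string
	maxLatencyMs    float64
	minRequestsPerS float64
}

// policyStore keeps SLA descriptions keyed by service, so a policy controller
// can later retrieve the SLA(s) associated with a service.
type policyStore struct {
	byService map[string][]slaDescription
}

func (p *policyStore) store(d slaDescription) {
	p.byService[d.service] = append(p.byService[d.service], d)
}

func (p *policyStore) slasFor(service string) []slaDescription {
	return p.byService[service]
}

func main() {
	store := &policyStore{byService: map[string][]slaDescription{}}
	store.store(slaDescription{service: "checkout", maxLatencyMs: 100, minRequestsPerS: 50})

	for _, sla := range store.slasFor("checkout") {
		fmt.Printf("%s: latency <= %.0f ms, >= %.0f req/s\n",
			sla.service, sla.maxLatencyMs, sla.minRequestsPerS)
	}
}
```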
  • the central policy controller 124 uses the SLA manifest and/or SLA description to determine initial resource configuration(s) for the system.
  • a resource configuration for the system indicates how to configure or allocate resources included in the system, such as network resources, processing resources, and/or the like, to the service.
  • Example network resource configurations include bandwidth allocation, network traffic routing, network traffic priority level, and/or the like.
  • Example processing resource configurations include nodes in the system that are assigned to the service, load balancer configurations for determining which assigned node(s) to send requests to, and/or the like.
  • the local policy controller(s) 152 use the SLA manifest and/or SLA description to determine initial resource configuration(s) for each agent computing node 140 that is executing the service.
  • A local resource configuration (i.e., a resource configuration for the node) indicates how to configure or allocate resources of the node, such as network resources, processing resources, memory resources, and/or the like.
  • Example network resource configurations include bandwidth allocations for the service, what connection(s) to use for communications associated with the service, which other node(s) to communicate with to handle requests associated with the service, and/or the like.
  • Example processing and memory resource allocations include the amount of processing capacity and memory, respectively, allocated to the service.
  • the central policy controller 124 determines initial resource configurations for both the system and for each agent computing node 140 that is executing the service. The central policy controller 124 transmits the local resource configuration for a given agent computing node 140 to the local policy controller 152 of the agent computing node 140.
  • SLA operator 122 determines, based on the SLA manifest and/or service definition, observation data that should be collected by the system in association with the SLA requirements of the service.
  • determining the observation data that should be collected in association with the SLA requirements for a service includes determining one or more metrics, traces, logs, and/or the like that are associated with the SLA requirements.
  • the SLA manifest for a service could indicate that a service should have a specified minimum response time and a specified minimum number of requests handled per second.
  • SLA operator 122 determines, based on the indication that the service should have the specified minimum response time and specified minimum number of requests handled per second, that the observation data collected by each agent computing node 140 should include a response latency for the service (e.g., response latency for each request associated with the service) and a number of requests received for the service (e.g., number of requests received per second).
  • SLA operator 122 could determine that the global observation data generated by the controller computing node 110 should include an overall response latency for the service (e.g., an average response latency across multiple requests and multiple nodes, a minimum response latency from a plurality of response latencies, and/or the like) and overall number of requests received by the system for the service (e.g., total requests received, average number of requests received, and/or the like).
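One way to picture the mapping described above, from SLA requirements to the local and global observation data to collect, is sketched below; the requirement kinds and metric names are hypothetical.

```go
package main

import "fmt"

// requirementKind names a type of SLA requirement; values are illustrative.
type requirementKind string

const (
	maxResponseTime requirementKind = "max-response-time"
	minRequestsPerS requirementKind = "min-requests-per-second"
)

// localMetricsFor maps an SLA requirement to the local observation data each
// agent node should collect for it (an assumed mapping, not the claimed one).
func localMetricsFor(r requirementKind) []string {
	switch r {
	case maxResponseTime:
		return []string{"response_latency_per_request"}
	case minRequestsPerS:
		return []string{"requests_received_per_second"}
	default:
		return nil
	}
}

// globalMetricsFor maps the same requirement to the global observation data the
// controller node should derive from the local data.
func globalMetricsFor(r requirementKind) []string {
	switch r {
	case maxResponseTime:
		return []string{"avg_response_latency_across_nodes"}
	case minRequestsPerS:
		return []string{"total_requests_received"}
	default:
		return nil
	}
}

func main() {
	for _, r := range []requirementKind{maxResponseTime, minRequestsPerS} {
		fmt.Println(r, "-> local:", localMetricsFor(r), "global:", globalMetricsFor(r))
	}
}
```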
  • SLA operator 122 causes the observability system to collect the determined observation data.
  • causing the observability system to collect the determined observation data comprises transmitting a data collection request to the central observability controller 126.
  • the data collection request indicates the observation data that should be collected by the system.
  • the data collection request indicates both the global observation data and the local observation data that should be collected.
  • the central observability controller 126 causes one or more agent computing nodes 140 (e.g., the agent computing nodes 140 that are assigned to execute the service) to collect the specified local observation data.
  • causing the one or more agent computing nodes 140 to collect the specified local observation data includes transmitting a data collection request to the one or more agent computing nodes 140. In some embodiments, causing the one or more agent computing nodes 140 to collect the specified local observation data includes transmitting data collection configuration information to the local observability controllers 154 of the one or more agent computing nodes 140. Each local observability controller 154 configures the corresponding observation data collector 156 of the agent computing node 140 to collect the specified data.
  • SLA operator 122 sends a first data collection request to the central observability controller 126 indicating the global observation data that should be collected or generated by the central observability controller 126, and sends a second data collection request to each of the one or more agent computing nodes 140 indicating the local observation data that should be collected by the observation data collector 156 at the agent computing node 140.
  • In some embodiments, after receiving the SLA manifest, SLA operator 122 first determines whether the user requesting the service has the needed authority (e.g., access rights, privileges, and/or the like) to request the resources described in the SLA manifest. In some embodiments, determining whether the user requesting the service has the needed authority comprises transmitting an access verification check request to an access control service (not shown). The access verification check request specifies the user requesting the service and the requested resources. The access control service performs the requested check and transmits a response indicating whether the user has permission to request the resources.
  • If the user is not authorized to request the resources, then SLA operator 122 denies the SLA request and does not send any data collection requests to the central observability controller 126 or the local observability controller(s) 154. If the user is authorized to request the resources, then SLA operator 122 transmits a data collection request to the central observability controller 126 and/or local observability controller(s) 154.
  • In some embodiments, if SLA operator 122 determines that the user lacks the needed authority to request the resources described in the SLA manifest, SLA operator 122 does not transmit the SLA manifest/description to central policy controller 124 and local policy controller 152. In other embodiments, SLA operator 122 transmits the SLA manifest/description to central policy controller 124, and central policy controller 124 determines whether the user has permission to request the resources described in the SLA manifest/description.
  • In some embodiments, an observation data collector 156 (e.g., node_exporter or opentelemetry_exporter) collects the specified observation data for the corresponding agent computing node 140.
  • the observation data collector 156 collects observation data related to the execution environment (e.g., observation data associated with the execution environment executing a given service).
  • the observation data includes, for example and without limitation, measurement data and/or trace data.
  • Measurement data includes numeric information such as the number of received/sent network packages per second, CPU utilization percentage, or the like.
  • Trace data includes information regarding events that are determined to belong together.
  • trace data can include logs/information that follow a particular Hypertext Transfer Protocol (HTTP) session.
  • this information could include information regarding a connection request received event (SYN), a connection request response sent event (SYN ACK), an HTTP GET request received event, an HTTP 200 OK response sent event, and a connection closed event (RST).
  • the trace data includes copies of actual network traffic that was sent, received, and/or processed by agent computing node 140 (e.g., the packets that were sent, received, and processed by agent computing node 140 or portions thereof).
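The trace-data example above can be sketched as a grouped sequence of events for one HTTP session; the traceEvent type and timestamps are illustrative, while the event names follow the example in the description.

```go
package main

import (
	"fmt"
	"time"
)

// traceEvent is a hypothetical record of one event belonging to a trace.
type traceEvent struct {
	sessionID string
	name      string
	at        time.Time
}

func main() {
	start := time.Now()
	// Events that an observation data collector might group into one trace for
	// a single HTTP session (names follow the example in the description).
	session := []traceEvent{
		{"sess-42", "SYN received", start},
		{"sess-42", "SYN ACK sent", start.Add(1 * time.Millisecond)},
		{"sess-42", "HTTP GET received", start.Add(5 * time.Millisecond)},
		{"sess-42", "HTTP 200 OK sent", start.Add(12 * time.Millisecond)},
		{"sess-42", "RST (connection closed)", start.Add(13 * time.Millisecond)},
	}
	for _, e := range session {
		fmt.Printf("%s %-25s %s\n", e.sessionID, e.name, e.at.Format(time.RFC3339Nano))
	}
}
```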
  • the observation data collector 156 transmits the collected observation data to local observability controller 154.
  • Local observability controller 154 is responsible for receiving collected data from the observation data collector 156 and determining where to transmit and/or store the collected data. In some embodiments, local observability controller 154 stores received data into a persistent storage (not shown). Additionally or alternatively, local observability controller 154 caches received data in temporary storage (not shown).
  • local observability controller 154 transmits the received data, or a portion thereof, to central observability controller 126 (e.g., over network 130) and local policy controller 152.
  • the set of observation data transmitted to central observability controller 126 can differ from the set of observation data transmitted to local policy controller 152.
  • In some embodiments, the local observability controller 154 transmits the collected observation data, or a portion thereof, to the central observability controller 126 and/or local policy controller 152 via an observability application programming interface (API)/framework such as OpenTelemetry. Additionally, local observability controller 154 could transmit observation data using a “push” mechanism (e.g., transmit observation data when the observation data is available) and/or a “pull” mechanism (e.g., transmit the observation data in response to receiving a request for the observation data).
  • central observability controller 126 and/or local policy controller 152 can receive observation data without first requesting the data (e.g., when local observability controller 154 pushes the data) and/or can request observation data from local observability controller 154.
  • central observability controller 126 and/or local policy controller 152 could request observation data periodically (e.g., every given number of minutes, hour(s), day(s), week(s), and/or the like).
  • central policy controller 124 could receive a notification that an SLA is not being fulfilled and, in response, request observation data from central observability controller 126.
  • central observability controller 126 could request the observation data from local observability controller 154.
  • local observability controller 154 transmits the same set of observation data (e.g., all collected data) to both the controller computing node 110 and local policy controller 152. In some embodiments, local observability controller 154 identifies the set of observation data that should be transmitted to each of the controller computing node 110 and local policy controller 152. For example, based on the SLA manifest and/or SLA description, central policy controller 124 could determine a first set of observation data that should be collected to determine whether the SLA is being met by the system, and central policy controller 124 could determine a second set of observation data that should be collected to determine whether the SLA is being met by the corresponding agent computing node 140. The first set of observation data may be different from the second set of observation data. Local observability controller 154 filters the collected observation data to generate the first set of observation data and the second set of observation data. Local observability controller 154 transmits the first set of observation data to controller computing node 110 and the second set of observation data to local policy controller 152.
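A minimal sketch of the filtering described above, assuming a hypothetical splitForDestinations helper and per-destination metric lists derived from the SLA manifest:

```go
package main

import "fmt"

// sample is a hypothetical piece of collected observation data.
type sample struct {
	metric string
	value  float64
}

// splitForDestinations filters collected data into the set sent to the central
// observability controller and the set sent to the local policy controller,
// based on per-destination metric lists (assumed to come from the SLA manifest).
func splitForDestinations(collected []sample, central, local map[string]bool) (toCentral, toLocal []sample) {
	for _, s := range collected {
		if central[s.metric] {
			toCentral = append(toCentral, s)
		}
		if local[s.metric] {
			toLocal = append(toLocal, s)
		}
	}
	return toCentral, toLocal
}

func main() {
	collected := []sample{
		{"response_latency_ms", 42},
		{"requests_per_second", 120},
		{"cpu_utilization", 0.7},
	}
	central := map[string]bool{"response_latency_ms": true, "requests_per_second": true}
	local := map[string]bool{"response_latency_ms": true, "cpu_utilization": true}

	toCentral, toLocal := splitForDestinations(collected, central, local)
	fmt.Println("to central observability controller:", toCentral)
	fmt.Println("to local policy controller:", toLocal)
}
```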
  • local observability controller 154 collects more observation data than it sends to the central observability controller 126 and local policy controller 152.
  • local observability controller 154 could send some of the observation data that it collected to the central observability controller 126 (e.g., observation data that is associated with SLA enforcement for a service) but temporarily store the other collected observation data (e.g., observation data that is not associated with SLA enforcement for the service) in a non-persistent storage (not shown).
  • the non-persistent storage is an in-memory database. In general, non-persistent storage allows for faster storage/access compared to the persistent storage but is more expensive (and thus typically has less storage capacity).
  • the central observability controller 126 could send a request to local observability controller 154 for observation data stored in non-persistent storage when needed (e.g., in response to receiving a notification that an SLA was not met).
  • Local observability controller 154 could provide observation data stored in the non-persistent storage to the central observability controller 126 upon receiving such a request from the central observability controller 126.
  • the local policy controller 152 receives the observation data associated with a service from local observability controller 154. Local policy controller 152 determines whether the SLA(s) for the service are being met based on the observation data. In some embodiments, determining whether an SLA is being met includes determining one or more SLAs associated with a service based on SLA information received from the SLA operator 122 and/or central policy controller 124, such as the SLA manifest or SLA description(s) associated with the service. In some embodiments, determining whether the SLA is being met includes determining one or more metrics to generate in association with the SLA, based on the SLA information (e.g., SLA manifest or SLA description).
  • For example, local policy controller 152 could determine the number of requests received by the agent computing node 140 during a given time period based on the observation data, and compute the number of requests received per second. Further, local policy controller 152 could determine how to generate the metric (e.g., how to aggregate or process the local observation data). The specific operations performed on the observation data could differ based on the type of metric being generated and the type of observation data received. A sketch of this computation follows below.
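A sketch of that computation, with hypothetical numbers and an assumed SLA threshold:

```go
package main

import (
	"fmt"
	"time"
)

// requestsPerSecond computes the metric described above: the number of requests
// received during an observation window, normalized to one second.
func requestsPerSecond(requestCount int, window time.Duration) float64 {
	return float64(requestCount) / window.Seconds()
}

func main() {
	// Hypothetical local observation data: 1800 requests seen in a 30 s window.
	rps := requestsPerSecond(1800, 30*time.Second)
	fmt.Printf("requests per second: %.1f\n", rps) // 60.0

	const minRequiredRPS = 50.0 // assumed SLA requirement
	fmt.Println("SLA requirement met:", rps >= minRequiredRPS)
}
```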
  • If local policy controller 152 determines that an SLA is not being met, local policy controller 152 determines one or more local SLA enforcement operations that should be performed.
  • Local SLA enforcement operations are operations that can effect changes in how the agent computing node 140 is executing the service, i.e., that potentially affect the SLA of the service, such as modifying or configuring node parameters or node component parameters (e.g., scheduling, traffic shaping, queue prioritization, and/or the like); modifying node resource configurations (e.g., from the initial resource configurations of the service instance); modifying or configuring node resources (e.g., computing resources, network resources, storage resources, and/or the like of the node) that are assigned to the service; modifying or configuring node resources that are assigned to other services (e.g., so that more resources are available to the service); reassigning resources from another service to the service; and/or the like.
  • determining the one or more local SLA enforcement operations includes determining a cause of the SLA not being met/fulfilled based on the observation data.
  • The method or approach by which the cause of an SLA not being met/fulfilled is determined can vary depending on the implementation. In general, any suitable approach or technique for determining the cause of an SLA not being met/fulfilled can be used.
  • In some embodiments, determining the cause(s) of an SLA not being met/fulfilled includes requesting additional observation data.
  • local policy controller 152 could determine that the SLA not being met is caused by either compute issues (e.g., insufficient computational resources, scheduling, too many requests received, and/or the like) or networking issues (e.g., pathing, routing, packet priority, traffic rules, and/or the like). In response, local policy controller 152 could analyze local observation data to identify any network and/or compute issues. Additionally, local policy controller 152 could request additional observation data associated with network and/or compute operations of the node (e.g., that was not initially transmitted to local policy controller 152 in association with evaluating the SLA).
  • local policy controller 152 determines the one or more local SLA enforcement operations based on the cause (or causes) of the SLA not being met/fulfilled. For example, a latency-based SLA requirement not being met could be caused by the agent computing node 140 being overloaded (e.g., too many services using too much of the node’s resources), the node receiving too many requests for the service, a waiting scenario where the service was unable to reply in time (e.g., due to scheduling), and/or the like. A different local SLA enforcement operation could be used to address each cause. In response to determining which cause resulted in the latency-based SLA requirement not being met, local policy controller 152 identifies the corresponding local SLA enforcement operation.
  • SLA enforcement operations associated with a networking issue could include, for example, selecting different traffic rules, modifying/configuring packet priority, changing pathing decisions, and/or the like.
  • SLA enforcement operations associated with computing issues could include, for example, modifying scheduling rules or queues, scaling out resources, moving computation to other resources, migrating to other resources, and/or the like.
  • In some embodiments, local policy controller 152 determines the one or more local SLA enforcement operations by executing a plurality of local SLA enforcement operations in a simulation of the execution system (i.e., a digital twin). Local policy controller 152 determines, based on executing the plurality of local SLA enforcement operations in the digital twin, which local SLA enforcement operations caused the SLA requirement(s) to be met/fulfilled in the simulated environment, if any. Local policy controller 152 selects the local SLA enforcement operation(s) that caused the SLA requirement(s) to be met/fulfilled. A sketch of this selection is shown below.
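The digital-twin selection described above might look roughly like the following, where simulate is a stand-in for the actual simulation of the execution system and the latency effects are invented for illustration:

```go
package main

import "fmt"

// simulate is a stand-in for executing an enforcement operation in a digital
// twin of the execution system; it returns the simulated response latency.
// The fixed effects below are illustrative, not a real simulation model.
func simulate(currentLatencyMs float64, op string) float64 {
	switch op {
	case "increase-priority":
		return currentLatencyMs * 0.8
	case "add-bandwidth":
		return currentLatencyMs * 0.9
	case "reroute-traffic":
		return currentLatencyMs * 0.6
	default:
		return currentLatencyMs
	}
}

// selectOperations keeps only the candidate operations whose simulated outcome
// satisfies the SLA requirement, as described for the digital-twin approach.
func selectOperations(currentLatencyMs, maxLatencyMs float64, candidates []string) []string {
	var selected []string
	for _, op := range candidates {
		if simulate(currentLatencyMs, op) <= maxLatencyMs {
			selected = append(selected, op)
		}
	}
	return selected
}

func main() {
	candidates := []string{"increase-priority", "add-bandwidth", "reroute-traffic"}
	fmt.Println(selectOperations(150, 100, candidates)) // [reroute-traffic]
}
```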
  • In some embodiments, local policy controller 152 maps different types of SLA requirements to different types of SLA enforcement operations. Local policy controller 152 determines the one or more SLA enforcement operations based on the SLA requirement that is not being met. In some embodiments, determining the one or more local SLA enforcement operations includes determining the available resources of the agent computing node 140, the current resource configuration of the service, and/or the current resource configuration of one or more other services. Local policy controller 152 determines whether any available resources and/or resources assigned/allocated to other services can be assigned/allocated to the service.
  • After determining the one or more local SLA enforcement operations, local policy controller 152 causes the one or more local SLA enforcement operations to be performed. In some embodiments, causing an SLA enforcement operation to be performed includes the local policy controller 152 performing the SLA enforcement operation.
  • causing the SLA enforcement operation to be performed includes the local policy controller 152 requesting that another application, module, component, computing device/node, and/or the like perform the SLA enforcement operation.
  • the node includes various controllers, services, and other software/hardware components that manage various system resources. If an SLA enforcement operation reconfigures or reallocates a given resource, local policy controller 152 transmits a request or otherwise communicates with the controller corresponding to the given resource to reconfigure or reallocate the resource.
  • For example, agent computing node 140 includes a local network controller 158.
  • the local network controller 158 controls network resources of the node.
  • local network controller 158 could be responsible for managing packet queue prioritization, traffic shaping, scheduling, and/or the like.
  • local policy controller 152 could transmit a request to the local network controller 158.
  • the request could indicate the network resources that should be reconfigured, reallocated, and/or adjusted, and the specific changes that should be made to the indicated resources.
  • the target(s) and content of a request can vary depending on the particular SLA enforcement operation being performed.
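As an illustration of such a request, the sketch below models a local policy controller asking a local network controller to reconfigure a network resource; the reconfigureRequest shape and the priority change are assumptions.

```go
package main

import "fmt"

// reconfigureRequest is a hypothetical message from the local policy controller
// to the controller that manages the affected resource (here, network resources).
type reconfigureRequest struct {
	resource string // which resource to reconfigure
	change   string // what change to apply
}

// localNetworkController models the component that manages node network resources.
type localNetworkController struct{}

func (localNetworkController) apply(r reconfigureRequest) {
	fmt.Printf("local network controller: applying %q to %q\n", r.change, r.resource)
}

func main() {
	nc := localNetworkController{}
	// Example: an SLA enforcement operation that raises the packet queue priority
	// of traffic belonging to the service (values are illustrative).
	nc.apply(reconfigureRequest{
		resource: "packet-queue:checkout",
		change:   "priority=high",
	})
}
```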
  • local policy controller 152 determines whether the SLA can be enforced locally at the agent computing node 140. If the SLA not being met/fulfilled is caused by a local problem (i.e., a problem at the agent computing node 140), then the SLA can be enforced locally by the agent computing node 140. As an example, local policy controller 152 could determine that the node has too high of a latency when responding to a service request. Because the SLA issue relates to the node, local policy controller 152 could determine that the SLA issue can be addressed (i.e., the SLA enforced) at the node itself.
  • the local policy controller 152 could, for example, increase the processing priority of the service instance running in the agent computing node 140. In contrast, if the SLA not being met/fulfilled is caused by a global problem, then the local policy controller 152 determines that the SLA cannot be enforced locally by the agent computing node 140 (i.e., the SLA issue cannot be fixed/addressed locally).
  • a global problem is a problem that exists in several nodes or across the system, where addressing the problem in a single node (e.g., at the agent computing node 140) would not fix the problem.
  • local policy controller 152 could determine that the SLA issue cannot be addressed at the node itself.
  • If local policy controller 152 determines that the SLA can be enforced locally, then the local policy controller 152 determines the one or more local SLA enforcement operations that should be performed. If local policy controller 152 determines that the SLA cannot be enforced locally, then local policy controller 152 transmits a notification to central policy controller 124. The notification indicates that the SLA for the service is not being met/fulfilled. Additionally, the notification could indicate the specific metric(s) or other condition(s) that are causing the SLA to not be met/fulfilled.
  • In some embodiments, local policy controller 152 attempts to perform one or more local SLA enforcement operations regardless of whether the problem is a global or local problem. In such embodiments, local policy controller 152 could determine the one or more local SLA enforcement operations without determining whether the SLA can be enforced locally.
  • local policy controller 152 transmits a notification to central policy controller 124 indicating that the SLA for the service is not being met/fulfilled, regardless of whether local policy controller 152 attempted to fix the SLA issue locally.
  • local policy controller 152 transmits the notification to central policy controller 124 only if local policy controller 152 determines that the SLA issue cannot be fixed locally and/or if local policy controller 152 determines that the SLA issue was not fixed after performing the one or more SLA enforcement operations (or causing the one or more SLA enforcement operations to be performed).
  • local policy controller 152 could initially not transmit a notification if it determines that the SLA issue can be fixed locally.
  • Local policy controller 152 receives additional observation data subsequent to performing one or more local SLA enforcement operations. Local policy controller 152 determines whether the SLA is being met/fulfilled following performance of the one or more local SLA enforcement operations. If local policy controller 152 determines that the SLA is still not met/fulfilled, then local policy controller 152 transmits the notification to the central policy controller 124.
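  • By way of a non-limiting illustration, the following Python sketch shows how a local policy controller could re-check an SLA after performing a local enforcement operation and notify the central policy controller only if the SLA is still not fulfilled. The function names, the latency metric, and the 50 ms threshold are assumptions introduced for this sketch only.

# Illustrative sketch only: collect_observation_data(), apply_local_enforcement()
# and notify_central_policy_controller() are hypothetical placeholders.
import time

LATENCY_SLA_MS = 50.0  # assumed SLA requirement: p95 latency below 50 ms

def collect_observation_data():
    # Placeholder: would query the local observability controller 154 for metrics.
    return {"p95_latency_ms": 72.0}

def apply_local_enforcement():
    # Placeholder: e.g., raise the processing priority of the service instance.
    print("raising local processing priority of the service instance")

def notify_central_policy_controller(metrics):
    # Placeholder: report that the SLA is still not fulfilled after local action.
    print("notifying central policy controller, metrics:", metrics)

def enforce_sla_locally_then_escalate():
    metrics = collect_observation_data()
    if metrics["p95_latency_ms"] <= LATENCY_SLA_MS:
        return  # SLA fulfilled, nothing to do
    apply_local_enforcement()
    time.sleep(1.0)  # allow the change to take effect before re-checking
    metrics = collect_observation_data()
    if metrics["p95_latency_ms"] > LATENCY_SLA_MS:
        notify_central_policy_controller(metrics)

if __name__ == "__main__":
    enforce_sla_locally_then_escalate()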
  • central observability controller 126 receives observation data collected by various observation data collectors 156 from the local observability controllers 154 of the corresponding agent computing nodes 140. In some embodiments, central observability controller 126 transmits the received observation data to other applications, modules, computing devices, and/or the like for further processing. For example, central observability controller 126 could transmit observation data to an analysis application that analyzes the data and displays results to a user via a user interface. As another example, central observability controller 126 could transmit observation data associated with a service to a central policy controller 124 that determines whether an SLA for the service is being satisfied based on the observation data.
  • central observability controller 126 identifies a subset of observation data that is requested by a target destination (e.g., application, software module, computing device, etc.). For example, if central policy controller 124 is configured to determine whether the SLA for a given service is being met, central policy controller 124 could request observation data that is associated with the service and/or with the SLA from central observability controller 126. Central observability controller 126 transmits the identified subset of observation data to central policy controller 124. [0069] In some embodiments, central observability controller 126 stores the received observation data in a persistent storage (not shown). For example, central observability controller 126 could store the observation data in a persistent database.
  • Central policy controller 124 could retrieve the observation data, or a portion thereof, from the persistent storage instead of or in addition to receiving the observation data directly from central observability controller 126. For example, central policy controller 124 could query a persistent database for observation data associated with the SLA(s) for a given service.
  • the central policy controller 124 receives the observation data associated with the service from central observability controller 126. In some embodiments, central policy controller 124 determines whether the SLA(s) for the service are being met based on the observation data. In some embodiments, determining whether an SLA is being met includes determining one or more SLAs associated with a service based on SLA information received from the SLA operator 122, such as the SLA manifest or SLA description(s) associated with the service. In some embodiments, determining whether the SLA is being met includes determining global observation data (e.g., metrics) to generate in association with the SLA, based on the SLA information (e.g., SLA manifest or SLA description).
  • central policy controller 124 could determine that the number of requests per second that is received by the system should be generated. Further, central policy controller 124 could determine how to generate the global observation data (e.g., how to aggregate or process the local observation data to generate the global observation data). Referring to the above example, central policy controller 124 could determine that the number of requests per second should be generated by aggregating the number of requests received at each node that is executing the service.
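  • As a non-limiting illustration of generating global observation data, the following Python sketch aggregates hypothetical per-node "requests per second" metrics into a single cluster-wide metric; the node names and the metric layout are assumptions made for this example.

# Illustrative sketch only: the per-node metric layout is an assumption.
def aggregate_requests_per_second(per_node_observations):
    """Aggregate local 'requests per second' metrics into a global metric."""
    return sum(node["requests_per_second"] for node in per_node_observations.values())

if __name__ == "__main__":
    local_observation_data = {
        "agent-node-1": {"requests_per_second": 120},
        "agent-node-2": {"requests_per_second": 95},
        "agent-node-3": {"requests_per_second": 143},
    }
    print("global requests per second:", aggregate_requests_per_second(local_observation_data))  # 358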
  • central policy controller 124 determines that an SLA is not being met based on receiving a notification from an agent computing node 140.
  • the notification indicates that the SLA for a given service is not being met/fulfilled.
  • a local policy controller 152 of the agent computing node 140 analyzes observation data collected locally at the agent computing node 140 and determines whether the SLA for the given service is being met.
  • the local policy controller 152 transmits a notification to the central policy controller 124.
  • central policy controller 124 determines that the SLA for the service is not being met.
  • central policy controller 124 determines one or more global SLA enforcement operations that should be performed.
  • Global SLA enforcement operations are operations that can effect changes in how the system is executing the service, i.e., that potentially affect the SLA of the service, such as modifying or configuring system or system component parameters (e.g., load balancer(s), router(s), SDN(s), and/or the like); modifying resource configurations (e.g., from the initial resource configurations); modifying or configuring resources (e.g., computing resources, network resources, storage resources, and/or the like) that are assigned to the service; modifying or configuring resources assigned to other services (e.g., so that more resources are available to the service); reassigning resources from another service to the service; and/or the like.
  • the system resources include one or more computing nodes (e.g., agent computing nodes 140), one or more storage nodes, and/or the like.
  • determining the one or more SLA enforcement operations includes determining a cause of the SLA not being met/fulfilled based on the observation data.
  • the method or approach by which the cause of an SLA not being met/fulfilled is determined can vary depending on the implementation. In general, any suitable approach or technique for determining the cause of an SLA not being met/fulfilled can be used.
  • determining the cause(s) of an SLA not being met/fulfilled includes requesting additional observation data.
  • central policy controller 124 could determine that the SLA not being met is caused by either compute issues (e.g., insufficient computational resources, scheduling, too many requests received, and/or the like) or networking issues (e.g., pathing, routing, packet priority, traffic rules, and/or the like). In response, central policy controller 124 could analyze the observation data received from various agent computing nodes to identify any network and/or compute issues. Additionally, central policy controller 124 could request additional observation data associated with network and/or compute operations (e.g., that was not initially transmitted to central policy controller 124 in association with evaluating the SLA).
  • central policy controller 124 determines the one or more global SLA enforcement operations based on the cause (or causes) of the SLA not being met/fulfilled. For example, a latency-based SLA requirement not being met could be caused by the system being overloaded (e.g., not enough nodes or other resources assigned to the service), the system routing too many requests to an overloaded node, traffic associated with the service not being sufficiently prioritized, and/or the like. A different global SLA enforcement operation could be used to address each cause. In response to determining which cause resulted in the latency-based SLA requirement not being met, central policy controller 124 identifies the corresponding global SLA enforcement operation.
  • SLA enforcement operations associated with a networking issue could include, for example, selecting different traffic rules, modifying/configuring packet priority, changing pathing decisions, and/or the like.
  • SLA enforcement operations associated with computing issues could include, for example, modifying scheduling rules or queues, scaling out resources, moving computation to other resources, migrating to other resources, and/or the like.
  • central policy controller 124 determines the one or more global SLA enforcement operations by executing a plurality of global SLA enforcement operations in a simulation of the execution system (i.e., a digital twin of the execution system).
  • Central policy controller 124 determines, based on executing the plurality of global SLA enforcement operations in the digital twin, which global SLA enforcement operations caused the SLA requirement(s) to be met/fulfilled in the simulated environment, if any.
  • Central policy controller 124 selects the global SLA enforcement operation(s) that caused the SLA requirement(s) to be met/fulfilled.
  • central policy controller 124 maps different types of SLA requirements to different types of SLA enforcement operations. Central policy controller 124 determines the one or more SLA enforcement operations based on the SLA requirement that is not being met.
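  • By way of illustration, such a mapping could be sketched in Python as follows; the requirement categories and operation names are hypothetical examples and do not limit the embodiments.

# Illustrative sketch only: requirement categories and operation names are assumptions.
ENFORCEMENT_OPERATIONS = {
    "latency":     ["reprioritize_traffic", "scale_out_service", "rebalance_load"],
    "throughput":  ["increase_bandwidth", "scale_out_service"],
    "packet_loss": ["change_routing_path", "adjust_traffic_rules"],
}

def select_enforcement_operations(unmet_requirement_type):
    """Return candidate SLA enforcement operations for an unmet requirement type."""
    return ENFORCEMENT_OPERATIONS.get(unmet_requirement_type, [])

if __name__ == "__main__":
    print(select_enforcement_operations("latency"))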
  • determining the one or more SLA enforcement operations includes determining the available resources of the system, the current resource configurations of the system (e.g., load balancer, router, SDN, various nodes in the system, and/or the like), and/or the resource configurations associated with one or more other services.
  • Central policy controller 124 determines whether any available resources and/or resources assigned/allocated to other services can be assigned/allocated to the service. In some embodiments, if a resource is assigned/allocated to a second service whose SLA is being exceeded or that has a lower priority, then the resource is reassigned, re-allocated, or priority is otherwise given to the service whose SLA is being enforced.
  • the central policy controller 124 could relocate one or more services (e.g., the service whose SLA is not met and/or other services executing on the overloaded node) from the overloaded node to a different node that has free resources. Additionally or alternatively, central policy controller 124 could reconfigure the associated load balancer to avoid sending traffic to the overloaded node or send less traffic towards the overloaded node. In some embodiments, if central policy controller 124 determines that all service instances are on overloaded nodes, then instead of moving services and/or load balancing, central policy controller 124 scales out the service whose SLA is not being met.
  • central policy controller 124 could assign additional nodes that have free resources to the service and/or allocate new/additional resources on nodes with free resources to the service. [0079] After determining the one or more SLA enforcement operations, central policy controller 124 causes the one or more SLA enforcement operations to be performed. In some embodiments, causing a SLA enforcement operation to be performed includes the central policy controller 124 performing the SLA enforcement operation. In some embodiments, causing the SLA enforcement operation to be performed includes the central policy controller 124 requesting that another application, module, component, computing device/node, and/or the like perform the SLA enforcement operation. The system includes various controllers, services, and other software/hardware components that manage various system resources. If an SLA enforcement operation reconfigures or reallocates a given resource, central policy controller 124 transmits a request or otherwise communicates with the controller corresponding to the given resource to reconfigure or reallocate the resource.
  • controller computing node 110 includes a central network controller 128 and agent computing node 140 includes a local network controller 158.
  • the central network controller 128 controls system network resources.
  • central network controller 128 could comprise one or more of a software defined network (SDN) controller, load balancer (LB), router, and/or the like.
  • local network controller 158 controls network resources of the agent computing node 140.
  • central policy controller 124 could transmit a request to the central network controller 128 and/or to one or more local network controllers 158.
  • the target(s) and content of a request can vary depending on the particular SLA enforcement operation being performed.
  • the system includes an SDN controller (e.g., at central network controller 128).
  • the central policy controller 124 could request that the SDN controller prioritize or increase bandwidth of a poorly performing connection (i.e., a connection for which the SLA for a service is not met), or limit the bandwidth of a connection if the SLA is being exceeded and the resources are needed for other connections (i.e., limit bandwidth for a first service whose SLA is exceeded and allocate bandwidth for a second service whose SLA is not fulfilled).
  • the system could include a load balancer (e.g., at central network controller 128).
  • the load balancer could be initially configured by the central policy controller 124 based on SLA requirements, and subsequently reconfigured based on the SLA requirements and collected observation data, to direct traffic to node(s) that have the resources to fulfill the SLA. Therefore, the system can configure the load balancer to act as a bandwidth limiter to allow only a configured amount of traffic to be passed to a service instance in order to enforce the SLA for the service.
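  • As a non-limiting illustration, the following Python sketch models a weight-based load balancer reconfiguration that steers traffic away from overloaded nodes; the weight scheme and the 0.8 overload threshold are assumptions made for this example rather than a description of any particular load balancer API.

# Illustrative sketch only: the weight-based load balancer model is an assumption.
def reconfigure_load_balancer_weights(node_load, overload_threshold=0.8):
    """Assign lower weights to overloaded nodes so less traffic is sent to them."""
    weights = {}
    for node, load in node_load.items():
        # Nodes at or above the threshold receive a minimal share of new traffic.
        weights[node] = 1 if load >= overload_threshold else 10
    return weights

if __name__ == "__main__":
    observed_load = {"agent-node-1": 0.95, "agent-node-2": 0.40, "agent-node-3": 0.55}
    print(reconfigure_load_balancer_weights(observed_load))
    # {'agent-node-1': 1, 'agent-node-2': 10, 'agent-node-3': 10}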
  • the system could include a scheduling control (e.g., at a local network controller 158) that manages kernel scheduling (e.g., using a kernel extension mechanism such as an eBPF module).
  • the scheduling control could receive a resource reconfiguration request from local policy controller 152 and/or central policy controller 124 to prioritize a critical SLA connection.
  • the scheduling control requests the kernel to immediately execute a service instance that handles a critical SLA connection.
  • the scheduling control could also receive a request to prioritize critical packet(s). Scheduling control can request that the kernel modify the incoming packet queue to move a critical packet to the beginning of the queue, so that the critical packet is handled before non-critical packets.
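  • By way of illustration, the queue reordering described above could be sketched in user-space Python as follows; a kernel-level implementation (e.g., via an eBPF module) would differ, and the packet representation used here is an assumption for this example.

# Illustrative sketch only: a user-space model of critical-packet prioritization.
from collections import deque

def prioritize_critical_packets(packet_queue):
    """Move packets marked as critical to the front of the incoming queue."""
    critical = [p for p in packet_queue if p.get("critical")]
    non_critical = [p for p in packet_queue if not p.get("critical")]
    return deque(critical + non_critical)

if __name__ == "__main__":
    queue = deque([
        {"id": 1, "critical": False},
        {"id": 2, "critical": True},
        {"id": 3, "critical": False},
    ])
    print(prioritize_critical_packets(queue))  # packet 2 is handled first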
  • the system could include an application scheduler (e.g., at each agent computing node 140).
  • the application scheduler can be initially configured to not prioritize any connections within an instance of a service or to prioritize a first set of one or more connections. Subsequently, central policy controller 124 and/or local policy controller 152 could reconfigure the application scheduler to prioritize a second set of one or more connections (e.g., by transmitting a request that indicates the connection(s) that should be prioritized).
  • Example networking SLAs include, without limitation, latency, round-trip time, response time, jitter, throughput, transactions per second, link error rates, packet drop rates, and/or the like.
  • the observability system could collect metrics associated with the network 130. The system uses the collected metrics to determine whether various network SLAs are being fulfilled.
  • central policy controller 124 and/or local policy controller 152 can configure routers included in the system (e.g., via central network controller 128 or local network controller 158) to prioritize or reroute network traffic to fulfill the SLA.
  • the observability system can analyze the observation data to determine whether the SLAs for various services are being met or fulfilled. If the SLA is not being fulfilled, the observability system can take action(s) to resolve the issue. In particular, by detecting and resolving SLA issues at individual nodes, the observability system can quickly resolve single-node issues (i.e., local SLA problems) without waiting for a central service to receive observation data, perform data analysis, and detect the SLA issues before the issues can be resolved.
  • FIG. 2 is a diagram illustrating interactions between components of the system of Figure 1 to configure resources and data collection for a new service, according to various embodiments.
  • the SLA operator 122 receives a SLA manifest for a new service that is being deployed within the system under observation.
  • the system under observation could be the observability system itself (e.g., the observability system is also the execution system) or could be a system that is separate from the observability system.
  • the SLA manifest describes the service (i.e., a service definition) and the SLA requirements associated with the service.
  • SLA operator 122 determines whether the user requesting the service has permission to request the specified SLA (e.g., permission to request the resources described in the SLA manifest and/or the specified quantity of resources). If SLA operator 122 determines that the user has permission to request the specified SLA, then SLA operator 122 proceeds with operation 204 below. If SLA operator 122 determines that the user does not have permission to request the specified SLA, then SLA operator 122 instead denies the SLA request and does not proceed. In such cases, the service could execute on the execution system without enforcement of the SLA by the observability system.
  • SLA operator 122 transmits an SLA configuration for the service to the central policy controller 124.
  • the SLA configuration indicates the resources that need to be allocated or assigned to the service in order to meet the SLA requirements associated with the service.
  • transmitting the SLA configuration comprises determining the set of resources that need to be allocated/assigned to the service based on the SLA manifest or the SLA description(s) included in the SLA manifest.
  • SLA operator 122 transmits data indicating the set of resources to central policy controller 124.
  • transmitting the SLA configuration comprises transmitting the SLA manifest and/or a portion of data included in the SLA manifest, such as the SLA description(s) or service description.
  • central policy controller 124 checks whether the resources specified in the SLA configuration are available after receiving the SLA configuration. That is, central policy controller 124 determines whether the execution system has sufficient resources available for meeting the SLA requirements of the service. In some embodiments, if central policy controller 124 received data indicating a set of resources that need to be allocated/assigned to the service from SLA operator 122, checking the available resources to determine whether sufficient resources are available includes comparing the set of resources with the set of available resources of the execution system.
  • checking the available resources includes determining the set of resources that need to be allocated/assigned to the service based on the SLA manifest and/or SLA description.
  • Central policy controller 124 compares the determined set of resources with the set of available resources of the execution system.
  • central policy controller 124 rejects the SLA request. In such embodiments, the service could execute without enforcement of the SLA by the observability system. In some embodiments, if the execution system does not have sufficient available resources, central policy controller 124 requests additional resources be made available. As discussed in further detail below with respect to Figure 3, if additional resources are made available, then central policy controller 124 determines that the system includes sufficient available resources to meet the SLA requirements. If additional resources are not made available, then central policy controller 124 determines that the system includes insufficient resources to meet the SLA requirements.
  • the central policy controller 124 sets an initial resource configuration for the system based on the SLA requirements of the service.
  • setting the initial resource configuration includes determining initial parameters and other configuration information for one or more system resources, such as assigning nodes to the service, determining load balancing settings, traffic routing configurations for the service, and/or the like.
  • setting the initial resource configuration includes transmitting resource assignments/requests to one or more other controllers, services, and/or the like.
  • central policy controller 124 could communicate with a central network controller 128 to set initial network resource configurations.
  • the central policy controller 124 transmits a resource configuration to the local policy controller 152.
  • the resource configuration indicates the resources that need to be allocated/assigned to the service at the agent computing node 140 in order to meet/fulfill the SLA.
  • transmitting the resource configuration comprises transmitting the SLA configuration received from SLA operator 122 (e.g., data indicating a set of resources, a SLA manifest, or a SLA definition).
  • central policy controller 124 determines a subset of resources that are associated with the agent computing node 140 and configuration(s) corresponding to the subset of resources.
  • Central policy controller 124 transmits data indicating the subset of resources and their corresponding configurations.
  • the local policy controller 152 sets initial resource configurations based on the resource configuration received from the central policy controller 124.
  • setting the initial resource configuration includes determining initial parameters and other configuration information for one or more node resources, such as assigning or allocating resources (e.g., hardware, software, and/or network resources of the node) to the service, determining scheduling configurations for the service, determining priorities associated with the service, and/or the like.
  • setting the initial resource configuration includes transmitting resource assignments/requests to one or more other controllers, services, and/or the like. For example, local policy controller 152 could transmit traffic rules to a local network controller 158 to set initial network resource configurations.
  • the SLA operator 122 determines observation data associated with the SLA of the service.
  • the observation data comprises data that is needed to determine whether the SLA is being met. Such data could include, for example and without limitation, one or more metrics, measurements, logs, traces, and/or the like. Additionally, SLA operator 122 could determine parameters associated with collection of the observation data, such as where the observation data should be collected (e.g., at which nodes), how often the observation data should be collected, and/or the like.
  • the SLA operator 122 transmits a data collection request to central observability controller 126. The data collection request specifies the types of observation data determined in operation 214. Additionally, the data collection request could specify other data collection parameters that were determined by SLA operator 122 (e.g., where to collect the data and/or how often to collect the data).
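  • As a non-limiting illustration, the following Python sketch assembles such a data collection request; the field names and example values are assumptions introduced for this sketch.

# Illustrative sketch only: the request fields merely show the kind of parameters
# a data collection request could carry.
def build_data_collection_request(service_name, metrics, nodes, interval_seconds):
    """Assemble a data collection request for the central observability controller."""
    return {
        "service": service_name,
        "metrics": metrics,                          # observation data to collect
        "target_nodes": nodes,                       # where to collect it
        "collection_interval_s": interval_seconds,   # how often to collect it
    }

if __name__ == "__main__":
    request = build_data_collection_request(
        service_name="payment-api",
        metrics=["p95_latency_ms", "requests_per_second"],
        nodes=["agent-node-1", "agent-node-2"],
        interval_seconds=10,
    )
    print(request)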
  • central observability controller 126 transmits data collection configuration information in a corresponding data collection request to the local observability controller 154 of each node that is executing the service.
  • the data collection configuration information for a given node is used to configure the observation data collector 156 of the given node to collect the specified observation data.
  • central observability controller 126 could transmit a data collection request that specifies the observation data that should be collected at the node and, optionally, other parameters associated with the data collection.
  • central observability controller 126 determines, for a given agent computing node 140, the observation data that should be collected at the given agent computing node 140 based on the data collection request from SLA operator 122. Central observability controller 126 generates data collection configuration information for the given node based on the observation data that should be collected at that node.
  • instead of transmitting the data collection configuration information to the local observability controller 154 of a node, central observability controller 126 could transmit the data collection configuration information directly to the observation data collector 156 of the node.
  • the observation data collector 156 receives the data collection configuration information and collects observation data in accordance with the data collection configuration.
  • local observability controller 154 configures the observation data collector 156 to collect the specified data.
  • the observation data collector 156 collects observation data in accordance with the data collection configuration.
  • the observation data collector 156 transmits the collected observation data to the local observability controller 154.
  • the local observability controller 154 transmits the observation data received from the observation data collector 156 to the central observability controller 126. In some embodiments, local observability controller 154 transmits all of the observation data received from observation data collector 156. In some embodiments, local observability controller 154 transmits a portion of the observation data received from observation data collector 156. For example, central observability controller 126 could request specific observation data, such as the observation data that is needed to determine whether the SLA for a given service is being met. As another example, local observability controller 154 could determine which observation data was specified by the data collection configuration information transmitted by central observability controller 126. [0103] As shown in Figure 2, operations 202-224 are performed in order.
  • one or more of the operations can be performed in a different order and/or performed in parallel.
  • operations 204-212 could be performed before operations 214-218.
  • one or more of operations 204-212 could be performed in parallel with one or more of operations 214-218.
  • SLA operator 122 could perform operation 206 (i.e., determine whether the system has sufficient resources) prior to transmitting the SLA configuration to central policy controller 124. In such cases, central policy controller 124 does not need to perform operation 206.
  • SLA operator 122 could transmit a SLA configuration to both central policy controller 124 and to local policy controller 152. In such cases, central policy controller 124 could omit operation 210.
  • Figure 3 is a diagram illustrating interactions between components of the system of Figure 1 to determine whether the system has sufficient resources to fulfill an SLA for a service, according to various embodiments.
  • the SLA operator 122 receives an SLA manifest for a new service that is being deployed within the system under observation.
  • the system under observation could be the observability system itself (e.g., the observability system is also the execution system) or could be a system that is separate from the observability system.
  • the SLA manifest describes the service (i.e., a service definition) and the SLA requirements associated with the service.
  • the SLA operator 122 receives a SLA description rather than the SLA manifest.
  • SLA operator 122 determines whether the user requesting the service has permission to request the specified SLA (e.g., permission to request the resources described in the SLA manifest and/or the specified quantity of resources). If SLA operator 122 determines that the user has permission to request the specified SLA, then SLA operator 122 proceeds with operation 204 below. If SLA operator 122 determines that the user does not have permission to request the specified SLA, then SLA operator 122 instead denies the SLA request and does not proceed with checking the available resources.
  • the SLA operator 122 transmits an SLA configuration for the service to the central policy controller 124.
  • the SLA configuration indicates the resources that need to be allocated or assigned to the service in order to meet the SLA requirements associated with the service.
  • transmitting the SLA configuration comprises determining the set of resources that need to be allocated/assigned to the service based on the SLA manifest or the SLA description(s) included in the SLA manifest.
  • SLA operator 122 transmits data indicating the set of resources to central policy controller 124.
  • transmitting the SLA configuration comprises transmitting the SLA manifest and/or a portion of data included in the SLA manifest, such as the SLA description(s) or service description.
  • central policy controller 124 determines whether the system has sufficient resources to meet the SLA. In some embodiments, central policy controller 124 determines, based on the SLA configuration, a set of resources that need to be allocated/assigned to the service. Central policy controller 124 further determines the available resources in the system. Central policy controller 124 compares the set of resources with the available resources in the system. For example, the set of resources could include a first number of computing nodes and the available resources could include a second number of computing nodes. Central policy controller 124 determines whether the first number is less than or equal to the second number (i.e., whether the number of available computing nodes is greater than or equal to the number of computing nodes that need to be allocated to meet the SLA).
  • if central policy controller 124 determines that the available resources are not sufficient to meet the SLA, then at operation 308, central policy controller 124 transmits a request for additional resources to cluster orchestrator 320.
  • the request indicates the amount of remaining resources needed to meet the SLA (i.e., the difference between the available resources and the needed resources). In some embodiments, the request indicates the total amount of resources needed to meet the SLA.
  • cluster orchestrator 320 determines whether the system can provide additional resources to the service. For example, if central policy controller 124 requested additional computing nodes, cluster orchestrator 320 determines whether the number of computing nodes in the system could be increased (i.e., whether additional computing nodes can be added to the cluster). [0112] At operation 312, the cluster orchestrator 320 transmits a response to the resource request to central policy controller 124. The response indicates whether cluster orchestrator 320 determined that the system can provide additional resources to the service.
  • the central policy controller 124 rejects the SLA request if the system has insufficient resources to meet the SLA. That is, if at operation 306, central policy controller 124 determined that the system does not have sufficient resources and the response received from cluster orchestrator 320 indicated that additional resources cannot be provided, then central policy controller 124 rejects the SLA request.
  • the central policy controller 124 configures cluster resources based on the SLA request. That is, if at operation 306, central policy controller 124 determined that the system has sufficient resources, or if central policy controller 124 determined that the system does not have sufficient resources but cluster orchestrator 320 indicated that additional resources can be provided, then central policy controller 124 configures cluster resources based on the SLA request.
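  • By way of a non-limiting illustration of the flow of Figure 3, the following Python sketch admits or rejects an SLA request based on a computing-node count and, if needed, asks a (simulated) cluster orchestrator for additional nodes; the function names and the scaling limit are assumptions made for this example.

# Illustrative sketch only: the node-count comparison and the orchestrator call
# are hypothetical placeholders for the Figure 3 flow.
def request_additional_nodes(extra_nodes_needed):
    # Placeholder for operation 308: ask the cluster orchestrator for more nodes.
    can_scale = extra_nodes_needed <= 2  # assumed orchestrator answer
    return can_scale

def admit_sla_request(required_nodes, available_nodes):
    """Accept or reject an SLA request based on available computing nodes."""
    if required_nodes <= available_nodes:
        return "accepted"
    if request_additional_nodes(required_nodes - available_nodes):
        return "accepted after scaling"
    return "rejected"

if __name__ == "__main__":
    print(admit_sla_request(required_nodes=5, available_nodes=4))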
  • Figure 4 is a diagram illustrating interactions between components of the system of Figure 1 to enforce an SLA at an agent computing node 140, according to various embodiments.
  • local observability controller 154 transmits requested observation data to local policy controller 152.
  • the requested observation data includes observation data that is needed for the local policy controller 152 to determine whether the SLA for a service is being fulfilled.
  • local policy controller 152 requests observation data that is associated with one or more services for which local policy controller 152 is evaluating the SLA.
  • local policy controller 152 determines the observation data that is needed to determine whether the SLA is being fulfilled.
  • Local policy controller 152 requests the data from local observability controller 154 (e.g., periodically) for evaluating the SLA for the given service.
  • local policy controller 152 determines that the SLA for a service is not fulfilled based on the observation data. In some embodiments, determining whether the SLA for a service is being fulfilled includes determining the SLA requirements for the service. For example, local policy controller 152 could retrieve data indicating the SLA requirements associated with the service. Local policy controller 152 analyzes the information included in the observation data based on the SLA requirements for the service to determine whether the SLA is being met/fulfilled.
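  • As a non-limiting illustration, the following Python sketch evaluates locally collected observation data against a set of SLA requirements and returns the requirements that are not fulfilled; the metric names, comparison directions, and threshold values are assumptions introduced for this sketch.

# Illustrative sketch only: requirement names, directions and limits are assumptions.
SLA_REQUIREMENTS = {
    "p95_latency_ms": ("max", 50.0),    # must stay below 50 ms
    "throughput_rps": ("min", 200.0),   # must stay above 200 requests/s
}

def unmet_requirements(observation_data):
    """Return the SLA requirements that the observed metrics do not fulfil."""
    unmet = []
    for metric, (direction, limit) in SLA_REQUIREMENTS.items():
        value = observation_data.get(metric)
        if value is None:
            continue  # metric not collected; cannot evaluate this requirement
        if direction == "max" and value > limit:
            unmet.append(metric)
        elif direction == "min" and value < limit:
            unmet.append(metric)
    return unmet

if __name__ == "__main__":
    print(unmet_requirements({"p95_latency_ms": 64.0, "throughput_rps": 250.0}))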
  • after determining that the SLA is not fulfilled, local policy controller 152 also determines whether the SLA should be enforced locally. If the SLA should be enforced locally, then local policy controller 152 proceeds to perform operation 406 below. If local policy controller 152 determines that the SLA should not be enforced locally (i.e., should be enforced centrally/globally), then local policy controller 152 instead proceeds to perform operation 412 below. [0119] At operation 406, local policy controller 152 determines an updated resource configuration for the node based on the SLA requirements of the service. In some embodiments, determining the updated resource configuration is based on which SLA requirements are not being met and the observation data associated with the SLA requirements. For example, if an SLA requirement that is not being met relates to the availability of a particular resource, then local policy controller 152 updates the resource configuration of the particular resource to allocate, prioritize, or otherwise make additional units of the particular resource available to the service.
  • determining the updated resource configuration includes determining a cause of the SLA for the service not being met based on the observation data.
  • local policy controller 152 stores a mapping of different types of SLA requirements to different SLA enforcement operations (i.e., to different resource configuration changes). Based on the mapping and the specific SLA requirements that are not being met, local policy controller 152 determines the SLA enforcement operations that should be performed.
  • the updated resource configuration corresponds to a network configuration.
  • local policy controller 152 transmits the updated resource configuration to local network controller 158.
  • the local network controller 158 updates the network resource configuration in accordance with the updated resource configuration.
  • local policy controller 152 transmits a notification to central policy controller 124.
  • the notification indicates that local policy controller 152 determined that the SLA requirements for the service were not being met/fulfilled.
  • Figure 5 is a diagram illustrating interactions between components of the system of Figure 1 to enforce an SLA at a controller computing node 110, according to various embodiments.
  • local policy controller 152 transmits a notification to central policy controller 124.
  • the notification indicates that local policy controller 152 determined that the SLA for a given service was not being met/fulfilled.
  • local policy controller 152 transmits a notification each time local policy controller 152 determines that the SLA for a service is not met.
  • local policy controller 152 transmits the notification in response to determining that the SLA for the service is not met and that the SLA cannot be enforced locally.
  • local policy controller 152 transmits the notification in response to determining that the SLA is still not met after performing one or more local SLA enforcement operations.
  • central observability controller 126 transmits requested observation data to central policy controller 124.
  • the requested observation data includes observation data that is needed for the central policy controller 124 to determine whether the SLA for the service is being fulfilled.
  • central policy controller 124 transmits a request for data associated with the service in response to receiving the notification in operation 502.
  • central policy controller 124 requests or receives the data independent of receiving a notification from local policy controller 152.
  • central policy controller 124 could periodically request observation data associated with one or more services from central observability controller 126 (i.e., via a pull mechanism).
  • central observability controller 126 could periodically transmit observation data associated with one or more services to central policy controller 124 without central policy controller 124 transmitting a request for the data (i.e., via a push mechanism).
  • central policy controller 124 determines an updated resource configuration based on the SLA requirements of the service. In some embodiments, determining the updated resource configuration is based on which SLA requirements are not being met and the observation data associated with the SLA requirements. For example, if an SLA requirement that is not being met relates to the availability of a particular resource, then central policy controller 124 updates the resource configuration of the particular resource to allocate, prioritize, or otherwise make additional units of the particular resource available to the service.
  • determining the updated resource configuration includes determining a cause of the SLA for the service not being met based on the observation data. In some embodiments, determining the updated resource configuration is based on a mapping between different types of SLA requirements and different SLA enforcement operations (i.e., different resource configuration changes). Central policy controller 124 determines an updated resource configuration in line with the SLA enforcement operation that should be performed.
  • the updated resource configuration corresponds to a network resource configuration (e.g., routing, load balancing, SDN, and/or the like).
  • central policy controller 124 transmits the updated resource configuration to central network controller 128.
  • central network controller 128 updates the resource configuration for the cluster and/or for multiple nodes in the cluster in accordance with the updated resource configuration.
  • the updated resource configuration includes an updated network resource configuration for a given node
  • central policy controller 124 transmits the updated resource configuration of the given node to the local policy controller 152 of the node.
  • the local policy controller 152 transmits the updated resource configuration to local network controller 158.
  • local network controller 158 updates the network resource configuration for the node in accordance with the updated network resource configuration.
  • instead of central policy controller 124 transmitting the updated network resource configuration to local policy controller 152, central policy controller 124 or central network controller 128 transmits the updated network resource configuration directly to local network controller 158.
  • FIG. 6 is a flowchart of method steps for enforcing an SLA at an agent computing node, according to various embodiments. Although the method steps are described with reference to the system of Figure 1, persons skilled in the art will understand that any system configured to implement the method steps, in any order, falls within the scope of the present disclosure.
  • method 600 begins at step 602, where an agent computing node collects local observation data.
  • Local observation data comprises data that is measured at, generated by, received by, transmitted from, or otherwise associated with the agent computing node itself.
  • the agent computing node determines a service level agreement (SLA) associated with a service that is deployed in a cluster.
  • the cluster could be, for example, the system illustrated in Figure 1.
  • determining the SLA associated with the service is based on a SLA manifest, SLA description, or other information that describes the service and its SLA requirements.
  • the agent computing node stores information indicating the corresponding SLA requirements for different services deployed in the cluster. For example, the agent computing node could store SLA information for each service for which it previously received an SLA configuration.
  • the agent computing node determines, based on the local observation data, whether the SLA for the service is being fulfilled. If the SLA is being met/fulfilled, the method returns to step 602, where the agent computing node continues to collect local observation data. If the SLA is not being met/fulfilled, the method proceeds to step 608 below.
  • the agent computing node determines whether the node should enforce the SLA locally. In some embodiments, determining whether the node should enforce the SLA locally includes determining whether the cause of the SLA not being met is local to the agent computing node. In some embodiments, the agent computing node always attempts to enforce the SLA locally. In such embodiments, step 608 does not need to be explicitly performed (i.e., is optional or is skipped).
  • if the agent computing node determines that the node should enforce the SLA locally, then the method proceeds to step 610.
  • the agent computing node updates one or more local resource configuration(s) of the node.
  • updating the one or more local resource configurations is based on the SLA requirements of the service.
  • the agent computing node identifies one or more local resources that can affect the SLA requirements and determines changes to make to the one or more local resources, if any. In some cases, the agent computing node may not be able to make changes to a given local resource, for example, if no additional units of the local resource are available (i.e., the resource is overloaded) or the resource is being used by a service with a higher priority.
  • After updating the one or more local resource configurations, the method returns to step 602, where the agent computing node continues to collect local observation data. In some embodiments, prior to returning to step 602, the agent also performs step 612 below.
  • if the agent computing node determines that the node should not enforce the SLA locally, the method proceeds to step 612.
  • the agent computing node transmits a notification to another node.
  • the notification indicates that the SLA for the service is not being met/fulfilled.
  • the agent computing node transmits the notification to a controller computing node or other node that is configured to perform central/global SLA enforcement.
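  • By way of a non-limiting illustration of method 600, the following Python sketch shows one iteration of the agent-side flow (collect local observation data, evaluate the SLA, enforce locally or notify the controller computing node); all function names and the heuristic used to decide whether the cause is local are assumptions made for this sketch.

# Illustrative sketch only: a simplified loop corresponding to steps 602-612.
def collect_local_observation_data():
    return {"p95_latency_ms": 80.0}                # step 602 (placeholder data)

def sla_fulfilled(data):
    return data["p95_latency_ms"] <= 50.0          # step 606 (assumed threshold)

def cause_is_local(data):
    return data["p95_latency_ms"] < 200.0          # step 608 (assumed heuristic)

def update_local_resource_configuration():
    print("step 610: updating local resource configuration")

def notify_controller_node():
    print("step 612: notifying controller computing node")

def agent_iteration():
    data = collect_local_observation_data()        # step 602
    if sla_fulfilled(data):                        # steps 604-606
        return
    if cause_is_local(data):
        update_local_resource_configuration()
    else:
        notify_controller_node()

if __name__ == "__main__":
    agent_iteration()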
  • FIG. 7 is a flowchart of method steps for enforcing an SLA at a controller computing node, according to various embodiments. Although the method steps are described with reference to the system of Figure 1, persons skilled in the art will understand that any system configured to implement the method steps, in any order, falls within the scope of the present disclosure.
  • method 700 begins at step 702, where a controller computing node receives observation data from a plurality of agent computing nodes.
  • the controller computing node receives, from a given agent computing node, observation data that was collected locally at the agent computing node.
  • the controller computing node receives data indicating that a service level agreement (SLA) for a service is not being fulfilled.
  • the controller computing node could receive a notification from an agent computing node included in the plurality of agent computing nodes (from which the controller computing node received observation data). After receiving the notification, the method proceeds to step 708 below.
  • at step 706, the controller computing node determines whether an SLA for a service is being fulfilled based on the observation data.
  • step 706 can be performed independent of step 704.
  • the controller computing node could evaluate the SLAs for one or more services upon receipt of observation data associated with the one or more services.
  • the SLA indicated at step 704 can be different from the SLA being evaluated at step 706 (e.g., correspond to different services).
  • the controller computing node is able to detect that an SLA is not being met without having received any notifications associated with the SLA (i.e., when no agent computing nodes detected that the SLA is not being met).
  • if the controller computing node determines that the SLA is being met/fulfilled, the method returns to step 702, where the controller computing node continues to receive observation data. If the controller computing node determines that an SLA is not being met/fulfilled, the method proceeds to step 708.
  • the controller computing node determines one or more SLA enforcement operations that should be performed. Determining the one or more SLA enforcement operations is based on the SLA requirements that are not being met. The controller computing node determines SLA enforcement operations that can affect the SLA requirements. For example, if an SLA requirement that is not being met relates to the availability of a particular resource, then the controller computing node identifies operations that allocate, prioritize, or otherwise make the particular resource, or more of the particular resource, available to the service.
  • the controller computing node may not be able to make changes to a given resource associated with the SLA requirements. For example, if the resource has reached maximum capacity, if the resource is being used by another service with a higher priority, if another service is already being given higher prioritization, and so on, then the controller computing node does not select SLA enforcement operation(s) that affect the given resource.
  • the controller computing node causes the one or more SLA enforcement operations to be performed in the cluster.
  • causing an SLA enforcement operation to be performed could include transmitting a request to perform the operation to another node (e.g., an agent computing node), transmitting a request to perform the operation to another component of the system (e.g., a load balancer, a router, an SDN controller, and/or the like), or performing the operation at the controller computing node.
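  • As a non-limiting illustration of steps 708 and 710 of method 700, the following Python sketch selects candidate enforcement operations for an unmet requirement and dispatches them to the corresponding system components; the dispatch targets and operation names are hypothetical placeholders.

# Illustrative sketch only: dispatch targets and operation names are assumptions.
def determine_enforcement_operations(unmet_requirement):
    # Step 708: choose operations that can affect the unmet requirement.
    mapping = {
        "latency": [("load_balancer", "redirect_traffic"),
                    ("cluster_orchestrator", "scale_out_service")],
        "throughput": [("sdn_controller", "increase_bandwidth")],
    }
    return mapping.get(unmet_requirement, [])

def cause_operations_to_be_performed(operations):
    # Step 710: request each target component to perform its operation.
    for target, operation in operations:
        print(f"requesting {target} to perform {operation}")

if __name__ == "__main__":
    cause_operations_to_be_performed(determine_enforcement_operations("latency"))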
  • FIG. 8 illustrates network devices (NDs) 800A-H in an exemplary network, according to various embodiments.
  • connectivity between the NDs 800A-H is illustrated by way of lines between various NDs.
  • the NDs 800A-H are physical devices, and the connections between any two NDs can be a wireless connection or a wired connection (often referred to as a link).
  • An additional line extending from NDs 800A, 800E, and 800F illustrates that these NDs connect the network to other network(s) and/or devices, and therefore, can act as ingress and egress points for the network.
  • NDs 800A, 800E, and 800F can be referred to as edge NDs
  • NDs 800B-D and 800G-H are referred to as core NDs.
  • Figure 8 further illustrates three exemplary implementations of a network device 800: a special-purpose network device 802, a general purpose network device 804, and a hybrid network device 806.
  • a special-purpose network device 802 uses custom application-specific integrated circuits (ASICs) and a special-purpose operating system (OS).
  • special-purpose network device 802 includes networking hardware 810 comprising a set of one or more processor(s) 812, forwarding resource(s) 814 (which typically include one or more ASICs and/or network processors), and physical network interfaces (NIs) 816 (through which network connections are made, such as those shown by the connectivity between NDs 800 A-H).
  • Special-purpose network device 802 also includes non-transitory machine-readable storage media 818, which stores networking software 820.
  • network hardware 810 executes the networking software 820 to instantiate a set of one or more networking software instance(s) 822.
  • Each of the networking software instance(s) 822, and the portion of the networking hardware 810 that is executing that network software instance form a separate virtual network element 830A-R.
  • Each of the virtual network element(s) (VNEs) 830A-R includes a control communication and configuration module 832A-R (sometimes referred to as a local control module or control communication module) and forwarding table(s) 834A-R, such that a given virtual network element (e.g., 830A) includes the control communication and configuration module (e.g., 832A), a set of one or more forwarding table(s) (e.g., 834A), and the portion of the networking hardware 810 that executes the virtual network element (e.g., 830A).
  • networking software 820 includes observability component 823, which, when executed by networking hardware 810, causes the special-purpose network device 802 to perform one or more of the operations described above (e.g., to collect observation data, receive collected observation data, determine whether an SLA is being met based on collected observation data, update resource configurations to enforce SLA, and/or the like).
  • the special-purpose network device 802 can be physically and/or logically considered to include: 1) a ND control plane 824 (sometimes referred to as a control plane) comprising the processor(s) 812 that execute the control communication and configuration module(s) 832A-R; and 2) a ND forwarding plane 826 (sometimes referred to as a forwarding plane, a data plane, or a media plane) comprising forwarding resource(s) 814 that utilize the forwarding table(s) 834A-R and the physical NIs 816.
  • the ND control plane 824 (the processor(s) 812 executing the control communication and configuration module(s) 832A-R) is typically responsible for participating in controlling how data (e.g., packets) is to be routed (e.g., the next hop for the data and the outgoing physical NI for that data) and storing that routing information in the forwarding table(s) 834A-R, and the ND forwarding plane 826 is responsible for receiving that data on the physical NIs 816 and forwarding that data out the appropriate ones of the physical NIs 816 based on the forwarding table(s) 834A-R.
  • General-purpose network device 804 uses common off-the-shelf (COTS) processors and a standard OS. As shown in Figure 8, general-purpose network device 804 includes hardware 840 comprising a set of one or more processor(s) 842 (which are often COTS processors) and physical NIs 846, as well as non-transitory machine readable storage media 848 having stored therein software 850. During operation, the processor(s) 842 execute the software 850 to instantiate one or more sets of one or more applications 864A-R.
  • general-purpose network device 804 does not utilize any virtualization.
  • general-purpose network device 804 uses one or more forms of virtualization.
  • virtualization layer 854 represents the kernel of an operating system (or shim executing on a base operating system) that allows for the creation of multiple instances 862A-R that can each be used to execute one or more of the sets of applications 864A-R.
  • the multiple instances may also be referred to as software containers, virtualization engines, virtual private servers, jails, and/or the like.
  • the multiple instances 862A-R are user spaces (e.g., a virtual memory space) that are separate from each other and/or separate from the kernel space in which the operating system is run. The set of applications running in a given user space, unless explicitly allowed, cannot access the memory of the other processes (e.g., other user spaces).
  • the virtualization layer 854 represents a hypervisor (sometimes referred to as a virtual machine monitor (VMM)) or a hypervisor executing on top of a host operating system.
  • Each of the sets of applications 864A-R is run on top of a guest operating system within a corresponding instance 862A-R (i.e., virtual machine) that is run on top of the hypervisor.
  • the guest operating system and/or application(s) do not know that they are running on a virtual machine as opposed to running on a “bare metal” host electronic device.
  • the guest operating system and/or application(s), through para-virtualization, are aware of the presence of virtualization.
  • one, some, or all of the applications are implemented as unikernel(s).
  • a unikernel can be generated by compiling directly with an application only a limited set of libraries (e.g., from a library operating system (LibOS) including drivers/libraries of OS services) that provide the particular OS services needed by the application.
  • a unikernel can be implemented to run directly on hardware 840, directly on a hypervisor (e.g., running within a LibOS virtual machine), in a software container, and/or the like.
  • embodiments can be implemented with unikernels running directly on a hypervisor (represented by virtualization layer 854), with unikernels running within software containers (represented by instances 862A-R), or with a combination of unikernels and the above-described techniques (e.g., unikernels and virtual machines both run directly on a hypervisor, unikernels and sets of applications that are run in different software containers).
  • the instantiation of the one or more sets of one or more applications 864A-R, as well as virtualization if implemented, are collectively referred to as software instance(s) 852.
  • the virtual network element(s) 860A-R perform similar functionality to the virtual network element(s) 830A-R, e.g., similar to the control communication and configuration module(s) 832A and forwarding table(s) 834A.
  • This virtualization of the hardware 840 is sometimes referred to as network function virtualization (NFV).
  • NFV can be used to consolidate many network equipment types onto industry standard high volume server hardware, physical switches, and physical storage, which could be located in data centers, NDs, customer premise equipment (CPE), and/or the like.
  • while embodiments are illustrated with each instance 862A-R corresponding to one VNE 860A-R, other embodiments may implement this correspondence at a finer level of granularity (e.g., line card virtual machines virtualize line cards, control card virtual machines virtualize control cards, etc.).
  • the techniques described herein with reference to a correspondence of instances 862A-R to VNEs also apply to embodiments where such a finer level of granularity and/or unikernels are used.
  • the virtualization layer 854 includes a virtual switch that provides similar forwarding services as a physical Ethernet switch. Specifically, this virtual switch forwards traffic between instances 862A-R and the physical NI(s) 846, as well as optionally between the instances 862A-R. In addition, this virtual switch can enforce network isolation between the VNEs 860A-R that, by policy, are not permitted to communicate with each other (e.g., by honoring virtual local area networks (VLANs)).
  • software 850 includes an observability component 853, which when executed by processor(s) 842, causes the general-purpose network device 804 to perform one or more of the operations described above (e.g., to collect observation data, receive collected observation data, determine whether an SLA is being met based on collected observation data, update resource configurations to enforce the SLA, and/or the like); a minimal sketch of such a node-local control loop follows this list.
  • Hybrid network device 806 includes a combination of special-purpose and general-purpose hardware and/or software.
  • hybrid network device 806 could include custom ASICs/special-purpose OS and COTS processors/standard OS in a single ND or a single card within an ND.
  • a platform VM (i.e., a VM that implements the functionality of the special-purpose network device 802) provides para-virtualization to the networking hardware present in the hybrid network device 806.
  • each of the VNEs receives data on the physical NIs (e.g., 816, 846) and forwards that data out the appropriate ones of the physical NIs (e.g., 816, 846).
  • a VNE implementing IP router functionality could forward IP packets on the basis of some of the IP header information in the IP packet.
  • IP header information includes, for example, source IP address, destination IP address, source port, destination port (where “source port” and “destination port” refer herein to protocol ports, as opposed to physical ports of a ND), transport protocol (e.g., user datagram protocol (UDP), Transmission Control Protocol (TCP)), differentiated services code point (DSCP) values, and/or the like; a simplified forwarding-table lookup is sketched after this list.
  • a network interface may be physical or virtual, depending on the given implementation.
  • an interface address is an IP address assigned to a NI, be it a physical NI or virtual NI.
  • a virtual NI may be associated with a physical NI, with another virtual interface, or stand on its own (e.g., a loopback interface, a point-to-point protocol interface).
  • a NI (physical or virtual) may be numbered (a NI with an IP address) or unnumbered (a NI without an IP address).
  • a loopback interface (and its loopback address) is a specific type of virtual NI (and IP address) of a NE/VNE (physical or virtual) often used for management purposes.
  • the IP address(es) assigned to the loopback interface of a NE/VNE is referred to as the nodal loopback address.
  • the IP address(es) assigned to the NI(s) of a ND are referred to as IP addresses of that ND.
  • At a more granular level, the IP address(es) assigned to NI(s) assigned to a NE/VNE implemented on a ND can be referred to as IP addresses of that NE/VNE.
  • An embodiment may be an article of manufacture in which a non-transitory machine-readable storage medium (such as microelectronic memory) has stored thereon instructions (e.g., computer code) which program one or more data processing components (generically referred to here as a “processor”) to perform the operations described above.
  • some of these operations might be performed by specific hardware components that contain hardwired logic (e.g., dedicated digital filter blocks and state machines). Those operations might alternatively be performed by any combination of programmed data processing components and fixed hardwired circuit components.
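
To make the role of the observability component 853 more concrete, the following is a minimal, hypothetical sketch (in Go) of a node-local control loop that collects an observation, checks it against an SLA objective, and applies a local enforcement action when the objective is missed. The type names, metric, threshold, and enforcement action are illustrative assumptions only, not the data structures or interfaces of the embodiments above; a real implementation would read metrics from the node's telemetry pipeline and act through the platform's resource-management APIs.

```go
package main

import (
	"fmt"
	"time"
)

// SLA captures one illustrative service-level objective for a service
// deployed in the cluster (names and fields are assumptions for this sketch).
type SLA struct {
	Service      string
	MaxLatencyMs float64 // latency the service must stay under
}

// Observation is a single node-local measurement for a service.
type Observation struct {
	Service   string
	LatencyMs float64
}

// collect stands in for reading the node's telemetry pipeline; it returns a
// fabricated sample so the sketch stays self-contained.
func collect(service string) Observation {
	return Observation{Service: service, LatencyMs: 120.0}
}

// enforceLocally stands in for a node-local SLA enforcement operation, e.g.
// raising a CPU quota or re-prioritizing the service's workload on this node.
func enforceLocally(sla SLA, obs Observation) {
	fmt.Printf("SLA for %s violated (%.1f ms > %.1f ms): applying local enforcement\n",
		sla.Service, obs.LatencyMs, sla.MaxLatencyMs)
}

func main() {
	sla := SLA{Service: "first-service", MaxLatencyMs: 100.0}
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()

	for i := 0; i < 3; i++ { // bounded so the example terminates
		<-ticker.C
		obs := collect(sla.Service) // collect local observation data
		if obs.LatencyMs > sla.MaxLatencyMs { // is the SLA being met?
			enforceLocally(sla, obs) // perform a local SLA enforcement operation
		}
	}
}
```

A loop of this kind corresponds to node-local SLA enforcement; how violations that cannot be resolved locally are handled is defined by the embodiments themselves, not by this sketch.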

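The IP-forwarding behaviour attributed to NEs/VNEs above can be illustrated with a small assumed example: a destination-address lookup against a few static routes standing in for the forwarding table a VNE would consult. The route entries and the linear longest-prefix search are simplifications for readability; they are not the forwarding table structures 834A-R of the figures, and a real forwarding plane would also consider other header fields and use optimized lookup structures.

```go
package main

import (
	"fmt"
	"net/netip"
)

// route maps a destination prefix to an outgoing interface, a much-reduced
// stand-in for the forwarding tables a VNE would consult.
type route struct {
	prefix netip.Prefix
	outIf  string
}

// lookup returns the interface of the longest matching prefix, mimicking the
// basic decision an IP-forwarding VNE makes from the destination IP address.
func lookup(table []route, dst netip.Addr) (string, bool) {
	best, bestLen, found := "", -1, false
	for _, r := range table {
		if r.prefix.Contains(dst) && r.prefix.Bits() > bestLen {
			best, bestLen, found = r.outIf, r.prefix.Bits(), true
		}
	}
	return best, found
}

func main() {
	table := []route{
		{netip.MustParsePrefix("10.0.0.0/8"), "eth0"},
		{netip.MustParsePrefix("10.1.0.0/16"), "eth1"},
	}
	if outIf, ok := lookup(table, netip.MustParseAddr("10.1.2.3")); ok {
		fmt.Println("forward out", outIf) // longest match wins: eth1
	}
}
```
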
Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Embodiments of the present application relate to techniques for enforcing service level agreements for services deployed within a cluster. A method includes collecting local observation data associated with a first node. The method further includes determining, based on the local observation data, that a first service level agreement (SLA) associated with a first service deployed within the cluster is not being met by the first node. The method further includes determining one or more local SLA enforcement operations based on the local observation data and the first SLA. The method further includes performing the one or more local SLA enforcement operations.
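
Purely as an illustration of the abstract above, the selection of local SLA enforcement operations from collected observation data could be structured as a simple decision function such as the hypothetical Go sketch below; the metric names, thresholds, and operation names are assumptions made for the example and are not defined by the application.

```go
package main

import "fmt"

// NodeObservations bundles hypothetical node-local metrics relevant to an SLA.
type NodeObservations struct {
	CPUThrottledShare float64 // fraction of periods the service was CPU-throttled
	MemoryPressure    bool    // node reported memory pressure for the service
}

// EnforcementOp names a local SLA enforcement operation (illustrative only).
type EnforcementOp string

const (
	IncreaseCPUQuota  EnforcementOp = "increase-cpu-quota"
	ReclaimMemory     EnforcementOp = "reclaim-memory"
	EscalateToCluster EnforcementOp = "escalate-to-cluster-controller"
)

// chooseOps maps observations to local enforcement operations; when nothing
// can be fixed locally, the violation is handed off beyond the node.
func chooseOps(obs NodeObservations) []EnforcementOp {
	var ops []EnforcementOp
	if obs.CPUThrottledShare > 0.2 {
		ops = append(ops, IncreaseCPUQuota)
	}
	if obs.MemoryPressure {
		ops = append(ops, ReclaimMemory)
	}
	if len(ops) == 0 {
		ops = append(ops, EscalateToCluster)
	}
	return ops
}

func main() {
	obs := NodeObservations{CPUThrottledShare: 0.35, MemoryPressure: false}
	fmt.Println("selected operations:", chooseOps(obs))
}
```
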
PCT/IB2023/056849 2023-06-30 2023-06-30 Observability-based cloud service level agreement enforcement Pending WO2025003737A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2023/056849 WO2025003737A1 (fr) 2023-06-30 2023-06-30 Observability-based cloud service level agreement enforcement

Publications (1)

Publication Number Publication Date
WO2025003737A1 true WO2025003737A1 (fr) 2025-01-02

Family

ID=87158383

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2023/056849 Pending WO2025003737A1 (fr) 2023-06-30 2023-06-30 Application d'accord de niveau de service infonuagique à base d'observabilité

Country Status (1)

Country Link
WO (1) WO2025003737A1 (fr)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160179560A1 (en) * 2014-12-22 2016-06-23 Mrittika Ganguli CPU Overprovisioning and Cloud Compute Workload Scheduling Mechanism
EP3327990A1 * 2016-11-28 2018-05-30 Deutsche Telekom AG Radio communication network with multi-threshold-based monitoring for radio resource management
EP3929745A1 * 2020-06-27 2021-12-29 INTEL Corporation Apparatus and method for a closed-loop dynamic resource allocation control framework
US20220417115A1 (en) * 2021-06-23 2022-12-29 Microsoft Technology Licensing, Llc End-to-end service level metric approximation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; Telecommunication management; Study on management aspects of communication services (Release 16)", vol. SA WG5, 9 July 2019 (2019-07-09), XP051757373, Retrieved from the Internet <URL:http://www.3gpp.org/ftp/tsg_sa/WG5_TM/TSGS5_125AH/Docs/S5-194531.zip> [retrieved on 20190709] *

Similar Documents

Publication Publication Date Title
CN110463140B Network service level agreements for computer data centers
EP3624400B1 Technologies for deploying virtual machines in a virtual network function infrastructure
US11706088B2 (en) Analyzing and configuring workload distribution in slice-based networks to optimize network performance
EP3053041B1 Method, system, computer program and computer program product for monitoring data packet flows between virtual machines (VMs) in a data center
Lam et al. Netshare and stochastic netshare: predictable bandwidth allocation for data centers
JP7623432B2 Congestion avoidance in slice-based networks
US9882832B2 (en) Fine-grained quality of service in datacenters through end-host control of traffic flow
WO2021101602A1 System and method for supporting the use of forward and backward congestion notifications in a private switch fabric in a high-performance computing environment
CN112035395B Handling tenant requirements in a system using acceleration components
US20170371692A1 (en) Optimized virtual network function service chaining with hardware acceleration
US20150350102A1 (en) Method and System for Integrated Management of Converged Heterogeneous Resources in Software-Defined Infrastructure
US10892994B2 (en) Quality of service in virtual service networks
EP3283953B1 Providing services in a system having a hardware acceleration plane and a software plane
US11144423B2 (en) Dynamic management of monitoring tasks in a cloud environment
CN108667777A Service chain generation method and network function orchestrator (NFVO)
JP7769135B2 Determining a machine learning model to be used for a given prediction purpose relating to a communication system
WO2025003737A1 Observability-based cloud service level agreement enforcement
JP7716598B2 Determining a machine learning model to be used for a given prediction purpose relating to a communication system
WO2024111027A1 Controlling display of a monitoring screen on which a performance index value of an element included in a communication system is indicated
Blenk et al. SDN-enabled Application-aware Network Control Architectures and their Performance Assessment
WO2024069219A1 Receive-side application auto-scaling

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23738909

Country of ref document: EP

Kind code of ref document: A1