WO2024227123A1 - Load-aware packet size distribution measurement in a network device - Google Patents
Load-aware packet size distribution measurement in a network device
- Publication number
- WO2024227123A1 (PCT/US2024/026724)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- network device
- packets
- threshold
- distribution information
- load
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L47/00—Traffic control in data switching networks
- H04L47/10—Flow control; Congestion control
- H04L47/36—Flow control; Congestion control by determining packet size, e.g. maximum transfer unit [MTU]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/16—Threshold monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/65—Re-configuration of fast packet switches
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/08—Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
- H04L43/0876—Network utilisation, e.g. volume of load or congestion level
- H04L43/0894—Packet rate
Definitions
- the present disclosure relates generally to communication networks, and more particularly to power saving techniques for use within a network device.
- a computer network is a set of computing components interconnected by communication links.
- Each computing component may be a separate computing device, such as, without limitation, a hub, a network switch, a bridge, a router, a server, a gateway, or a personal computer, or a component thereof.
- Each computing component, or “network device,” is considered to be a node within the network.
- a communication link is a mechanism of connecting at least two nodes such that each node may transmit data to and receive data from the other node. Such data may be transmitted in the form of signals over transmission media such as, without limitation, electrical cables, optical cables, or wireless media.
- the structure and transmission of data between nodes is governed by a number of different protocols. There may be multiple layers of protocols, typically beginning with a lowest layer, such as a “physical” layer that governs the transmission and reception of raw bit streams as signals over a transmission medium. Each layer defines a data unit (the protocol data unit, or “PDU”), with multiple data units at one layer combining to form a single data unit in another.
- Additional examples of layers may include, for instance, a data link layer in which bits defined by a physical layer are combined to form a frame or cell, a network layer in which frames or cells defined by the data link layer are combined to form a packet, and a transport layer in which packets defined by the network layer are combined to form a Transmission Control Protocol (TCP) segment or a User Datagram Protocol (UDP) datagram.
- a given node in a network may not necessarily have a link to each other node in the network, particularly in more complex networks.
- each node may only have a limited number of physical ports into which cables may be plugged to create links.
- Certain “terminal” nodes, often servers or end-user devices, may only have one or a handful of ports.
- Other nodes, such as switches, hubs, or routers, may have many more ports and typically are used to relay information between the terminal nodes.
- the arrangement of nodes and links in a network is said to be the topology of the network, and is typically visualized as a network graph or tree.
- a given node in the network may communicate with another node in the network by sending data units along one or more different “paths” through the network that lead to the other node, each path including any number of intermediate nodes.
- the transmission of data across a computing network typically involves sending units of data, such as packets, cells, or frames, along paths through intermediary networking devices, such as switches or routers, that direct or redirect each data unit towards a corresponding destination.
- an intermediary networking device may perform any of a variety of actions, or processing steps, with the data unit.
- the exact set of actions taken will depend on a variety of characteristics of the data unit, such as metadata found in the header of the data unit, and in many cases the context or state of the network device.
- address information specified by or otherwise associated with the data unit such as a source address, destination address, a virtual local area network (VLAN) identifier, path information, etc., is typically used to determine how to handle a data unit (i.e., what actions to take with respect to the data unit).
- an IP data packet may include a destination IP address field within the header of the IP data packet, based upon which a network router may determine one or more other networking devices, among a number of possible other networking devices, to which the IP data packet is to be forwarded.
- Control packets, such as packets for setting up a connection, tearing down a connection, acknowledgment packets, etc., tend to be relatively small, such as less than 100 bytes, whereas data packets tend to be significantly larger than control packets, often exceeding 1000 bytes.
- Data packets with video data are typically more than 1000 bytes, whereas data packets with audio data are typically smaller, e.g., several hundred bytes.
- When a connection is being established, the percentage of packets with relatively small packet sizes tends to be high, whereas when the connection is up and running the percentage of packets with relatively small packet sizes tends to be low.
- a network device analyzes headers of data units (e.g., packets) to determine how to handle the data units. For example, a network device having multiple ports coupled to multiple network links, such as a network switch, a bridge, a router, a gateway, etc., will analyze a header of a received data unit to determine one or more ports via which the data unit is to be transmitted. For a given data rate, the processing load of a network device is higher for small data units as compared to large data units because a rate at which the network device receives packet headers is higher with small data units as compared to large data units (for a given data rate). In other words, given a same data rate, the processing load of a network device varies depending on the relative amounts of packets with small sizes and packets with large sizes.
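- As a rough numeric illustration of this point, the sketch below computes the packet (header) rate implied by a fixed line rate for small and large packets; the 100 Gb/s line rate and the 20-byte Ethernet wire overhead (preamble plus inter-frame gap) are assumptions for the example, not values from the disclosure.
```python
# At a fixed line rate, the packet rate, and hence the header-processing load,
# grows sharply as packets shrink.

LINE_RATE_BPS = 100e9          # 100 Gb/s, assumed for illustration
WIRE_OVERHEAD_BYTES = 20       # Ethernet preamble + inter-frame gap, assumed

def packets_per_second(packet_size_bytes):
    bits_per_packet = (packet_size_bytes + WIRE_OVERHEAD_BYTES) * 8
    return LINE_RATE_BPS / bits_per_packet

print(f"{packets_per_second(64):,.0f} pps at 64-byte packets")     # ~148,809,524
print(f"{packets_per_second(1500):,.0f} pps at 1500-byte packets") # ~8,223,684
```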
- a method for controlling operation of a network device includes: determining, at the network device, a load metric corresponding to a processing load of the network device; in response to determining that the load metric meets a first threshold, beginning, at the network device, measuring distribution information regarding a distribution of sizes of packets processed by the network device; ending, at the network device, measuring the distribution information regarding the distribution of sizes of packets processed by the network device; and using, at the network device, the distribution information to control the network device.
- a network device comprises: a plurality of network interfaces; a packet processor configured to process data units received via the plurality of network interfaces to determine network interfaces, among the plurality of network interfaces, that are to transmit the data units; first circuitry that is configured to determine a load metric corresponding to a processing load of the network device; second circuitry that is configured to: in response to determining that the load metric meets a first threshold, begin measuring distribution information regarding a distribution of sizes of packets processed by the network device, and end measuring the distribution information regarding the distribution of sizes of packets processed by the network device; and a controller configured to use the distribution information to control the network device.
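- The following is a minimal software sketch of the control flow summarized above, assuming hypothetical helper names (`read_load_metric`, `apply_policy`), illustrative thresholds, and a two-threshold start/stop rule; it is not the patented implementation, which is described below as hardware circuitry.
```python
# Illustrative sketch: begin measuring packet size distribution (PSD) when a load
# metric meets a first threshold, stop when it drops below a second threshold,
# then use the measured distribution to control the device.

def control_loop(read_load_metric, packet_size_source, apply_policy,
                 start_threshold=0.8, stop_threshold=0.6):
    histogram = {}            # packet-size bin -> count
    measuring = False
    for sizes_this_interval in packet_size_source:   # packet sizes seen per interval
        load = read_load_metric()
        if not measuring and load >= start_threshold:
            measuring, histogram = True, {}           # begin measurement on high load
        elif measuring and load < stop_threshold:
            measuring = False                         # end measurement
            apply_policy(histogram)                   # use the distribution information
        if measuring:
            for size in sizes_this_interval:
                bin_key = min(size // 256, 8)         # coarse 256-byte bins, capped
                histogram[bin_key] = histogram.get(bin_key, 0) + 1
```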
- FIG. 1 is a simplified diagram of an example networking system in which load-aware packet size distribution (PSD) information generation techniques described herein are practiced, according to an embodiment.
- FIG. 2A is a simplified diagram of an example network device in which PSD information generation techniques are utilized, according to an embodiment.
- FIG. 2B is another simplified diagram of the example network device of Fig. 2A, according to an embodiment.
- FIG. 3 is a simplified block diagram of an example PSD information generation circuitry, according to an embodiment.
- Fig. 4 is a graph showing an illustrative example of PSD information measured at different processing load levels of a network device, according to an embodiment.
- Fig. 5 is a simplified example state diagram for circuitry that controls generation of PSD information in a network device, according to an embodiment.
- Fig. 6 is another simplified example state diagram for circuitry that controls generation of PSD information in a network device, according to another embodiment.
- FIG. 7 is a simplified flow diagram of an example method for controlling a network device based on PSD measurements, according to an embodiment.
Detailed Description
- Network device power consumption is typically most critical when the network device (or a portion thereof) is fully loaded. During time periods when the network device is experiencing high loading, packet processors and data paths of the network device are heavily stressed. Often, the percentage of small-sized packets is larger during time periods of low loading and smaller during time periods of high loading. However, current network devices do not differentiate the collection of packet size distribution information between periods of high loading and periods of low loading. As a result, the packet size distribution information generated by current network devices is typically measured across periods of both high and low loading and therefore often does not accurately portray packet size distribution at times of high loading. Thus, the packet size distribution information generated by current network devices has low utility for reducing power consumption of a network device during periods of high loading.
- Packet size distribution information is useful for controlling the network device to more optimally adjust operation of the network device, at least in some embodiments.
- the network device may identify packet flows contributing high amounts of packets with small packet sizes, and may adjust the processing of those flows to redistribute processing and/or memory resources amongst packet flows, and/or to reduce power consumption of the network device.
- Networking system 100 comprises a plurality of interconnected nodes 110a-110n (collectively, nodes 110), each implemented by a different computing device.
- a node 110 may be a single networking computing device, such as a router or switch, in which some or all of the processing components described herein are implemented in application-specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other integrated circuit(s).
- a node 110 may include one or more memories storing machine-readable instructions for implementing various components described herein, one or more hardware processors configured to execute the instructions stored in the one or more memories, and various data repositories in the one or more memories for storing data structures utilized and manipulated by the various components.
- Each node 110 is connected to one or more other nodes 110 in network 100 by one or more communication links, depicted as lines between nodes 110.
- the communication links may be any suitable wired cabling or wireless links. Note that system 100 illustrates only one of many possible arrangements of nodes within a network. Other networks may include fewer or additional nodes 110 having any number of links between them.
- While each node 110 may or may not have a variety of other functions, in an embodiment, each node 110 is configured to send, receive, and/or relay data to one or more other nodes 110 via communication links.
- data is communicated as a series of discrete units or structures of data represented by signals transmitted over the communication links.
- Different nodes 110 within a network 100 may send, receive, and/or relay data units at different communication levels, or layers.
- a first node 110 may send a data unit at the network layer (e.g., a TCP segment) to a second node 110 over a path that includes an intermediate node 110.
- the data unit may be broken into smaller data units (“subunits”) at various sublevels before it is transmitted from the first node 110.
- the data unit may be broken into packets, then cells, and eventually sent out as a collection of signal-encoded bits to the intermediate device.
- the intermediate node 110 may rebuild the entire original data unit before routing the information to the second node 110, or the intermediate node 110 may simply rebuild the subunits (e.g., packets or frames) and route those subunits to the second node 110 without ever composing the entire original data unit.
- When a node 110 receives a data unit, it typically examines addressing information within the data unit (and/or other information within the data unit) to determine how to process the data unit.
- the addressing information may include, for instance, a media access control (MAC) address, an IP address, a VLAN identifier, information within a multi-protocol label switching (MPLS) label, or any other suitable information. If the addressing information indicates that the receiving node 110 is not the destination for the data unit, the node may look up forwarding information within a forwarding database of the receiving node 110 and forward the data unit to one or more other nodes 110 connected to the receiving node 110 based on the forwarding information.
- the forwarding information may indicate, for instance, an outgoing port over which to send the data unit, a header to attach to the data unit, a new destination address to overwrite in the data unit, etc.
- the forwarding information may include information indicating a suitable approach for selecting one of those paths, or a path deemed to be the best path may already be defined.
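- As a toy illustration of the forwarding lookup described above, the sketch below maps a destination address to an egress port and header rewrites via a longest-prefix match; the table contents, field names, and the use of Python's standard `ipaddress` module are assumptions for the example only.
```python
import ipaddress

# Hypothetical forwarding database: destination prefix -> (egress port, header rewrites).
FORWARDING_DB = {
    ipaddress.ip_network("10.0.0.0/24"): ("port_3", {"next_hop_mac": "aa:bb:cc:dd:ee:01"}),
    ipaddress.ip_network("10.0.1.0/24"): ("port_7", {"next_hop_mac": "aa:bb:cc:dd:ee:02"}),
}

def forward(dst_ip):
    """Return (egress_port, header_rewrites) for the longest matching prefix, or None."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [(net, action) for net, action in FORWARDING_DB.items() if addr in net]
    if not matches:
        return None                                        # no route: drop or trap to CPU
    best = max(matches, key=lambda m: m[0].prefixlen)      # longest-prefix match wins
    return best[1]

print(forward("10.0.1.42"))   # ('port_7', {'next_hop_mac': 'aa:bb:cc:dd:ee:02'})
```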
- Addressing information, flags, labels, and other metadata used for determining how to handle a data unit are typically embedded within a portion of the data unit known as the header.
- One or more headers are typically at the beginning of the data unit, and are followed by the payload of the data unit.
- a first data unit having a first header corresponding to a first communication protocol may be encapsulated in a second data unit at least by appending a second header to the first data unit, the second header corresponding to a second communication protocol.
- the second communication protocol is below the first communication protocol in a protocol stack, in some embodiments.
- a header has a structure defined by a communication protocol and comprises fields of different types, such as a destination address field, a source address field, a destination port field, a source port field, and so forth, according to some embodiments.
- In some communication protocols, the number and the arrangement of fields are fixed.
- Other protocols allow for variable numbers of fields and/or variable length fields with some or all of the fields being preceded by type information that indicates to a node the meaning of the field and/or length information that indicates a length of the field.
- a communication protocol defines a header having multiple different formats and one or more values of one or more respective fields in the header indicate to a node the format of the header. For example, a header includes a type field, a version field, etc., that indicates to which one of multiple formats that header conforms.
- Different communication protocols typically define respective headers having respective formats.
- Data units are sometimes referred to herein as “packets,” which is a term often used to refer to data units defined by the IP.
- the approaches, techniques, and mechanisms described herein, however, are applicable to data units defined by suitable communication protocols other than the IP.
- The term “packet,” as used herein, should be understood to refer to any type of data structure communicated across a network, including packets as well as segments, cells, data frames, datagrams, and so forth.
- Any node in the depicted network 100 may communicate with any other node in the network 100 by sending packets through a series of nodes 110 and links, referred to as a path.
- For example, one path from Node B (110b) to Node H (110h) is from Node B to Node D to Node G to Node H.
- a node 110 does not actually need to specify a full path for a packet that it sends. Rather, the node 110 may simply be configured to calculate the best path for the packet out of the device (e.g., determining via which one or more egress ports the packet should be transmitted).
- When the node 110 receives a packet that is not addressed directly to the node 110, based on header information associated with the packet, such as path and/or destination information, the node 110 relays the packet along to either the destination node 110, or a “next hop” node 110 that the node 110 calculates is in a better position to relay the packet to the destination node 110, according to some embodiments.
- the actual path of a packet is a product of each node 110 along the path making routing decisions about how best to move the packet along to the destination node 110 identified by the packet, according to some embodiments.
- the nodes may, on occasion, discard, fail to send, or fail to receive data units, thus resulting in the data units failing to reach their intended destination.
- The act of discarding a data unit, or failing to deliver a data unit, is typically referred to as “dropping” the data unit. Instances of dropping a data unit, referred to herein as “drops” or “packet loss,” may occur for a variety of reasons, such as resource limitations, errors, or deliberate policies.
- One or more of the nodes 110 utilize load-aware packet size distribution (PSD) measurement techniques, examples of which are described below.
- Fig. 1 depicts node 110d and node 110g as having load-aware PSD measurement modules that utilize PSD measurement techniques, such as described below, that involve initiating PSD measurements in response to a processing load of the node 110 meeting a condition.
- Fig. 2A is a simplified diagram of an example network device 200 in which load- aware PSD measurement techniques are utilized, according to an embodiment.
- the network device 200 is a computing device comprising any combination of i) hardware and/or ii) one or more processors executing machine-readable instructions, configured to implement the various logical components described herein.
- the node 110d and node 110g of Fig. 1 have a structure the same as or similar to the network device 200.
- the network device 200 may be one of a number of components within a node 110.
- network device 200 may be implemented on one or more integrated circuits, or “chips,” configured to perform switching and/or routing functions within a node 110, such as a network switch, a router, etc.
- the node 110 may further comprise one or more other components, such as one or more central processor units, storage units, memories, physical interfaces, LED displays, or other components external to the one or more chips, some or all of which may communicate with the one or more chips.
- the node 110 comprises multiple network devices 200.
- the network device 200 is utilized in a suitable networking system different than the example networking system 100 of Fig. 1.
- the network device 200 includes a plurality of packet processing modules 204, with each packet processing module being associated with a respective plurality of ingress network interfaces 208 (sometimes referred to herein as “ingress ports” for purposes of brevity) and a respective plurality of egress network interfaces 212 (sometimes referred to herein as “egress ports” for purposes of brevity).
- the ingress ports 208 are ports by which packets are received via communication links in a communication network
- the egress ports 212 are ports by which at least some of the packets are transmitted via the communication links after having been processed by the network device 200.
- the data units may be packets, cells, frames, or other suitable structures.
- the individual atomic data units upon which the depicted components operate are cells or frames. That is, data units are received, acted upon, and transmitted at the cell or frame level, in some such embodiments.
- These cells or frames are logically linked together as the packets to which they respectively belong for purposes of determining how to handle the cells or frames, in some embodiments.
- the cells or frames are not actually assembled into packets within device 200, particularly if the cells or frames are being forwarded to another destination through device 200, in some embodiments.
- Ingress ports 208 and egress ports 212 are depicted as separate ports for illustrative purposes, but typically correspond to the same physical network interfaces of the network device 200. That is, a single network interface acts as both an ingress port 208 and an egress port 212, in some embodiments. Nonetheless, for various functional purposes, certain logic of the network device 200 may view a single physical network interface as logically being a separate ingress port 208 and egress port 212.
- certain logic of the network device 200 may subdivide a single physical network interface into multiple ingress ports 208 or egress ports 212 (e.g., “virtual ports”), or aggregate multiple physical network interfaces into a single ingress port 208 or egress port 212 (e.g., a trunk, a link aggregate group (LAG), an equal cost multipath (ECMP) group, etc.).
- ingress ports 208 and egress ports 212 are considered distinct logical constructs that are mapped to physical network interfaces rather than simply as distinct physical constructs.
- At least some ports 208/212 are coupled to one or more transceivers (not shown in Fig. 2A), such as Serializer/Deserializer (“SerDes”) blocks.
- ingress ports 208 provide serial inputs of received data units into a SerDes block, which then outputs the data units in parallel into a packet processing module 204.
- a packet processing module 204 provides data units in parallel into another SerDes block, which outputs the data units serially to egress ports 212.
- Each packet processing module 204 comprises an ingress portion 204-xa and an egress portion 204-xb.
- the ingress portion 204-xa generally performs ingress processing operations for packets such as one of, or any suitable combination of two or more of: packet classification, tunnel termination, Layer-2 (L2) forwarding lookups, Layer-3 (L3) forwarding lookups, etc.
- the egress portion 204-xb generally performs egress processing operations for packets such as one of, or any suitable combination of two or more of: packet duplication (e.g., for multicast packets), header alteration, rate limiting, traffic shaping, egress policing, flow control, maintaining statistics regarding packets, etc.
- Each ingress portion 204-xa is communicatively coupled to multiple egress portions 204-xb via an interconnect 216.
- each egress portion 204-xb is communicatively coupled to multiple ingress portions 204-xa via the interconnect 216.
- the interconnect 216 comprises one or more switching fabrics, one or more crossbars, etc., according to various embodiments.
- an ingress portion 204-xa receives a packet via an associated ingress port 208 and performs ingress processing operations for the packet, including determining one or more egress ports 212 via which the packet is to be transmitted (sometimes referred to herein as “target ports”). The ingress portion 204-xa then transfers the packet, via the interconnect 216, to one or more egress portion 204-xb corresponding to the determined one or more target ports 212. Each egress portion 204-xb that receives the packet performs egress processing operations for the packet and then transfers the packet to one or more determined target ports 212 associated with the egress portion 204-xb for transmission from the network device 200.
- the ingress portion 204-xa determines a virtual target port and one or more egress portions 204-xb corresponding to the virtual target port map the virtual target port to one or more physical egress ports 212. In some embodiments, the ingress portion 204-xa determines a group of target ports 212 (e.g., a trunk, a LAG, an ECMP group, etc.) and one or more egress portions 204-xb corresponding to the group of target ports select one or more particular target egress ports 212 within the group of target ports.
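- A minimal sketch of the group-to-member resolution described above is shown below; the group contents and the choice of a CRC hash over flow fields to pin a flow to one member port are assumptions for the example, not the patented mechanism.
```python
import zlib

# Hypothetical port groups: a LAG/ECMP group name maps to its physical member ports.
PORT_GROUPS = {"lag_1": ["port_4", "port_5", "port_6"]}

def resolve_target_port(target, flow_key):
    """Map a group target (trunk/LAG/ECMP) to one physical egress port; pass others through."""
    members = PORT_GROUPS.get(target)
    if not members:
        return target                                        # already a physical/virtual port
    index = zlib.crc32(flow_key.encode()) % len(members)     # keep a flow on one member port
    return members[index]

print(resolve_target_port("lag_1", "10.0.0.1->10.0.1.2:443"))
print(resolve_target_port("port_9", "any"))                  # port_9
```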
- As used herein, the term “target port” refers to a physical port, a virtual port, a group of target ports, etc., unless otherwise stated or apparent.
- Each packet processing module 204 is implemented using any suitable combination of fixed circuitry and/or a processor executing machine-readable instructions, such as specific logic components implemented by one or more FPGAs, ASICs, or one or more processors executing machine-readable instructions, according to various embodiments.
- At least respective portions of multiple packet processing modules 204 are implemented on a single IC (or “chip”). In some embodiments, respective portions of multiple packet processing modules 204 are implemented on different respective chips.
- components of each ingress portion 204-xa are arranged in a pipeline such that outputs of one or more components are provided as inputs to one or more other components.
- In some embodiments in which the components are arranged in a pipeline, one or more components of the ingress portion 204-xa are skipped or bypassed for certain packets.
- the components are arranged in a suitable manner that is not a pipeline.
- The exact set and/or sequence of components that process a given packet may vary, in some embodiments, depending on the attributes of the packet and/or the state of the network device 200.
- components of each egress portion 204-xb are arranged in a pipeline such that outputs of one or more components are provided as inputs to one or more other components.
- In some embodiments in which the components are arranged in a pipeline, one or more components of the egress portion 204-xb are skipped or bypassed for certain packets.
- In other embodiments, the components are arranged in a suitable manner that is not a pipeline. The exact set and/or sequence of components that process a given packet may vary, in some embodiments, depending on the attributes of the packet and/or the state of the network device 200.
- Each ingress portion 204-xa includes circuitry 220 (sometimes referred to herein as “ingress arbitration circuitry”) that is configured to reduce traffic loss during periods of bursty traffic and/or other congestion.
- the ingress arbitration circuitry 220 is configured to function in a manner that facilitates economization of the sizes, numbers, and/or qualities of downstream components within the packet processing module 204 by more intelligently controlling the release of data units to these components.
- the ingress arbitration circuitry 220 is further configured to support features such as lossless protocols and cut-through switching while still permitting high rate bursts from ports 208.
- the ingress arbitration circuitry 220 is coupled to an ingress buffer memory 224 that is configured to temporarily store packets that are received via the ports 208 while components of the packet processing module 204 process the packets.
- Each data unit received by the ingress portion 204-xa is stored in one or more entries within one or more buffers, which entries are marked as utilized to prevent newly received data units from overwriting data units that are already buffered in the buffer memory 224.
- the one or more entries in which a data unit is buffered in the ingress buffer memory 224 are then marked as available for storing newly received data units, in some embodiments.
- Each buffer may be a portion of any suitable type of memory, including volatile memory and/or non-volatile memory.
- the ingress buffer memory 224 comprises a single-ported memory that supports only a single input/output (I/O) operation per clock cycle (i.e., either a single read operation or a single write operation). Single-ported memories are utilized for higher operating frequency, though in other embodiments multi-ported memories are used instead.
- the ingress buffer memory 224 comprises multiple physical memories that are capable of being accessed concurrently in a same clock cycle, though full realization of this capability is not necessary.
- each buffer is a distinct memory bank, or set of memory banks.
- different buffers are different regions within a single memory bank.
- each buffer comprises many addressable “slots” or “entries” (e.g., rows, columns, etc.) in which data units, or portions thereof, may be stored.
- buffers in the ingress buffer memory 224 comprises a variety of buffers or sets of buffers, each utilized for varying purposes and/or components within the ingress portion 204-xa.
- the ingress portion 204-xa comprises a buffer manager (not shown) that is configured to manage use of the ingress buffers 224.
- the buffer manager performs, for example, one of or any suitable combination of the following: allocates and deallocates specific segments of memory for buffers, creates and deletes buffers within that memory, identifies available buffer entries in which to store a data unit, maintains a mapping of buffer entries to data units stored in those buffer entries (e.g., by a packet sequence number assigned to each packet when the first data unit in that packet was received), marks a buffer entry as available when a data unit stored in that buffer is dropped, sent, or released from the buffer, determines when a data unit is to be dropped because it cannot be stored in a buffer, performs garbage collection on buffer entries for data units (or portions thereof) that are no longer needed, etc., in various embodiments.
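- The sketch below models the buffer-manager bookkeeping described above in software: allocating entries for arriving data units, mapping entries to packet sequence numbers, and releasing entries when a data unit is sent or dropped. The class and method names are hypothetical, and a real buffer manager is implemented in hardware.
```python
class BufferManager:
    """Toy model of buffer-entry allocation and release."""

    def __init__(self, num_entries):
        self.free_entries = list(range(num_entries))   # entries available for new data units
        self.entry_to_packet = {}                      # entry index -> packet sequence number

    def allocate(self, packet_seq_num):
        """Mark an entry as utilized for a data unit; return None if the buffer is full."""
        if not self.free_entries:
            return None                                # caller may decide to drop the data unit
        entry = self.free_entries.pop()
        self.entry_to_packet[entry] = packet_seq_num
        return entry

    def release(self, entry):
        """Mark an entry as available once its data unit is dropped, sent, or released."""
        self.entry_to_packet.pop(entry, None)
        self.free_entries.append(entry)
```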
- the buffer manager includes buffer assignment logic (not shown) that is configured to identify which buffer, among multiple buffers in the ingress buffer memory 224, should be utilized to store a given data unit, or portion thereof, according to an embodiment.
- each packet is stored in a single entry within its assigned buffer.
- a packet is received as, or divided into, constituent data units such as fixed-size cells or frames, and the constituent data units are stored separately (e.g., not in the same location, or even the same buffer).
- the ingress arbitration circuitry 220 is also configured to maintain ingress queues 228, according to some embodiments, which are used to manage the order in which data units are processed from the buffers in the ingress buffer memory 224.
- Each data unit, or the buffer location(s) in which the data unit is stored, is said to belong to one or more constructs referred to as queues.
- a queue is a set of memory locations (e.g., in the ingress buffer memory 224) arranged in some order by metadata describing the queue.
- the memory locations may (and often are) non-contiguous relative to their addressing scheme and/or physical or logical arrangement.
- the sequence of constituent data units as arranged in a queue generally corresponds to an order in which the data units or data unit portions in the queue will be released and processed.
- Such queues are known as first-in-first-out (“FIFO”) queues, though in other embodiments other types of queues may be utilized.
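- A compact software model of the queueing behavior described above is shown below: a queue is an ordered list of buffer-entry references (not necessarily contiguous in memory) released in FIFO order. The class name and the use of Python's `collections.deque` are assumptions for the illustration.
```python
from collections import deque

class IngressQueueModel:
    """Toy FIFO queue of buffer-entry indices."""

    def __init__(self):
        self.entries = deque()             # buffer-entry indices, oldest at the left

    def enqueue(self, buffer_entry):
        self.entries.append(buffer_entry)

    def dequeue(self):
        """Release the entry that has been queued the longest (FIFO order)."""
        return self.entries.popleft() if self.entries else None

q = IngressQueueModel()
q.enqueue(42)
q.enqueue(7)
print(q.dequeue(), q.dequeue())            # 42 7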
- the ingress portion 204-xa also includes an ingress packet processor 232 that is configured to perform ingress processing operations for packets such as one of, or any suitable combination of two or more of packet classification, tunnel termination, L2 forwarding lookups, L3 forwarding lookups, etc., according to various embodiments.
- the ingress packet processor 232 includes an L2 forwarding database and/or an L3 forwarding database, and the ingress packet processor 232 performs L2 forwarding lookups and/or L3 forwarding lookups to determine target ports for packets. In some embodiments, the ingress packet processor 232 uses header information in packets to perform L2 forwarding lookups and/or L3 forwarding lookups.
- the ingress arbitration circuitry 220 is configured to release a certain number of data units (or portions of data units) from ingress queues 228 for processing (e.g., by the ingress packet processor 232) or for transfer (e.g., via the interconnect 216) each clock cycle or other defined period of time.
- the next data unit (or portion of a data unit) to release may be identified using one or more ingress queues 228.
- Respective ingress ports 208 (or respective groups of ingress ports 208) are assigned to respective ingress queues 228, and the ingress arbitration circuitry 220 selects queues 228 from which to release one or more data units (or portions of data units) according to a selection scheme, such as a round-robin scheme or another suitable selection scheme, in some embodiments.
- the ingress arbitration circuitry 220 selects a data unit (or a portion of a data unit) from a head of a FIFO ingress queue 228, which corresponds to a data unit (or portion of a data unit) that has been in the FIFO ingress queue 228 for a longest time, in some embodiments.
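- A small sketch of the arbitration step described above follows: each cycle, queues are visited round-robin and up to a fixed number of data units is released from the head of the selected queue. The function name, the per-cycle release count, and the Python generator formulation are assumptions for the example.
```python
from itertools import cycle

def arbitrate(queues, releases_per_cycle=1):
    """Yield (queue_name, data_unit) pairs, visiting queues round-robin, FIFO within a queue."""
    order = cycle(list(queues))
    while any(queues.values()):
        name = next(order)
        for _ in range(releases_per_cycle):
            if queues[name]:
                yield name, queues[name].pop(0)   # release from the head of the queue

queues = {"q0": ["a", "b"], "q1": ["c"]}
print(list(arbitrate(queues)))   # [('q0', 'a'), ('q1', 'c'), ('q0', 'b')]
```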
- Transferring a data unit from an ingress portion 204-xa to an egress portion 204-xb comprises releasing (or dequeuing) the data unit and transferring the data unit to the egress portion 204-xb via the interconnect 216, according to an embodiment.
- the egress portion 204-xb comprises circuitry 248 (sometimes referred to herein as “traffic manager circuitry 248”) that is configured to control the flow of data units from the ingress portions 204-xa to one or more other components of the egress portion 204-xb.
- the egress portion 204-xb is coupled to an egress buffer memory 252 that is configured to store egress buffers.
- a buffer manager (not shown) within the traffic manager circuitry 248 temporarily stores data units received from one or more ingress portions 204-xa in egress buffers as they await processing by one or more other components of the egress portion 204-xb.
- the buffer manager of the traffic manager circuitry 248 is configured to operate in a manner similar to the buffer manager of the ingress arbitration circuitry 220 discussed above.
- the egress buffer memory 252 (and buffers of the egress buffer memory 252) is structured the same as or similar to the ingress buffer memory 224 (and buffers of the ingress buffer memory 224) discussed above.
- each data unit received by the egress portion 204-xb is stored in one or more entries within one or more buffers, which entries are marked as utilized to prevent newly received data units from overwriting data units that are already buffered in the egress buffer memory 252.
- the one or more entries in which the data unit is buffered in the egress buffer memory 252 are then marked as available for storing newly received data units, in some embodiments.
- buffers in the egress buffer memory 252 comprises a variety of buffers or sets of buffers, each utilized for varying purposes and/or components within the egress portion 204-xb.
- the buffer manager (not shown) is configured to manage use of the egress buffers 252.
- the buffer manager performs, for example, one of or any suitable combination of the following: allocates and deallocates specific segments of memory for buffers, creates and deletes buffers within that memory, identifies available buffer entries in which to store a data unit, maintains a mapping of buffer entries to data units stored in those buffer entries (e.g., by a packet sequence number assigned to each packet when the first data unit in that packet was received), marks a buffer entry as available when a data unit stored in that buffer is dropped, sent, or released from the buffer, determines when a data unit is to be dropped because it cannot be stored in a buffer, performs garbage collection on buffer entries for data units (or portions thereof) that are no longer needed, etc., in various embodiments.
- the traffic manager circuitry 248 is also configured to maintain egress queues 256, according to some embodiments, that are used to manage the order in which data units are processed from the egress buffers 252.
- the egress queues 256 are structured the same as or similar to the ingress queues 228 discussed above.
- different egress queues 256 may exist for different destinations. For example, each port 212 is associated with a respective set of one or more egress queues 256.
- the egress queue 256 to which a data unit is assigned may, for instance, be selected based on forwarding information indicating the target port determined for the packet.
- different egress queues 256 correspond to respective flows or sets of flows. That is, packets for each identifiable traffic flow or group of traffic flows are assigned a respective set of egress queues 256. In some embodiments, different egress queues 256 correspond to different classes of traffic, QoS levels, etc.
- egress queues 256 correspond to respective egress ports 212 and/or respective priority sets.
- a respective set of multiple queues 256 corresponds to each of at least some of the egress ports 212, with respective queues 256 in the set of multiple queues 256 corresponding to respective priority sets.
- the traffic manager circuitry 248 stores (or “enqueues”) the packets in egress queues 256.
- the ingress buffer memory 224 corresponds to a same or different physical memory as the egress buffer memory 252, in various embodiments. In some embodiments in which the ingress buffer memory 224 and the egress buffer memory 252 correspond to a same physical memory, ingress buffers 224 and egress buffers 252 are stored in different portions of the same physical memory, allocated to ingress and egress operations, respectively.
- ingress buffers 224 and egress buffers 252 include at least some of the same physical buffers, and are separated only from a logical perspective.
- metadata or internal markings may indicate whether a given individual buffer entry belongs to an ingress buffer 224 or egress buffer 252.
- ingress buffers 224 and egress buffers 252 may be allotted a certain number of entries in each of the physical buffers that they share, and the number of entries allotted to a given logical buffer is said to be the size of that logical buffer.
- In some embodiments, when a packet is transferred from the ingress portion 204-xa to the egress portion 204-xb within a same packet processing module 204, instead of copying the packet from an ingress buffer entry to an egress buffer, the data unit remains in the same buffer entry, and the designation of the buffer entry (e.g., as belonging to an ingress queue versus an egress queue) changes with the stage of processing.
- the egress portion 204-xb also includes an egress packet processor 268 that is configured to perform egress processing operations for packets such as one of, or any suitable combination of two or more of: packet duplication (e.g., for multicast packets), header alteration, rate limiting, traffic shaping, egress policing, flow control, maintaining statistics regarding packets, etc., according to various embodiments.
- the egress packet processor 268 modifies header information in the egress buffers 252, in some embodiments.
- the egress packet processor 268 is coupled to a group of egress ports 212 via egress arbitration circuitry 272 that is configured to regulate access to the group of egress ports 212 by the egress packet processor 268.
- the egress packet processor 268 is additionally or alternatively coupled to suitable destinations for packets other than egress ports 212, such as one or more internal central processing units (not shown), one or more storage subsystems, etc.
- the egress packet processor 268 may replicate a data unit one or more times.
- a data unit may be replicated for purposes such as multicasting, mirroring, debugging, and so forth.
- a single data unit may be replicated, and stored in multiple egress queues 256.
- Although certain techniques described herein may refer to the original data unit that was received by the network device 200, it will be understood that those techniques equally apply to copies of the data unit that have been generated by the network device 200 for various purposes.
- a copy of a data unit may be partial or complete.
- For example, there may be an actual physical copy of the data unit in the egress buffers 252, or a single copy of the data unit may be linked from a single buffer location (or single set of locations) in the egress buffers 252 to multiple egress queues 256.
- Fig. 2B is another simplified block diagram of the network device 200, according to an embodiment.
- the network device 200 also includes one or more central processing units (CPUs) 276.
- the one or more CPUs 276 are configured to perform management functions for the network device 200, such as configuration of the packet processing modules 204, optimization of the network device 200, data collection, statistics collection, etc.
- the CPU(s) 276 are coupled to one or more memories 278 that store machine-readable instructions, and the CPU(s) 276 are configured to execute the machine-readable instructions.
- the ingress arbitration circuitry 220 includes one or more load-aware PSD modules 280.
- Each load-aware PSD module 280 is configured to initiate measuring PSD information regarding a distribution of sizes of packets processed by the network device in response to determining that a processing load of the network device meets a condition.
- the processing load is represented by a suitable load metric.
- the load metric corresponds to an individual entity corresponding to the ingress portion 204-xa, such as i) a rate at which data is being received at a port 208, ii) a fill level of an ingress queue 228, iii) a length of an ingress queue 228, iv) a time delay between when a packet is added to an ingress queue 228 and when the packet is dequeued from the ingress queue 228, v) an occupancy level of an ingress buffer 224, etc.
- the load-aware PSD module 280 is configured to determine when a processing load of the network device meets a condition at least by comparing an individual load metric to a threshold.
- the load-aware PSD module 280 is configured to determine when a processing load of the network device meets a condition at least by comparing multiple individual load metrics to a threshold (or multiple respective different thresholds), and determining whether any of the multiple individual load metrics meet the corresponding threshold(s). In other embodiments, the load-aware PSD module 280 is configured to determine when a processing load of the network device meets a condition at least by comparing multiple individual load metrics to a threshold (or multiple respective different thresholds), and determining whether all of the multiple individual load metrics meet the corresponding threshold(s).
- the load metric is a suitable mathematical combination of two or more suitable individual load metrics such as described above, and the load-aware PSD module 280 is configured to determine when a processing load of the network device meets a condition at least by comparing the mathematical combination of two or more suitable individual load metrics to a threshold.
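- The sketch below illustrates the three load-condition checks described above (any metric meets its threshold, all metrics meet their thresholds, or a mathematical combination meets a single threshold); the metric names, weights, and threshold values are assumptions for the example, not values from the disclosure.
```python
def any_metric_meets(metrics, thresholds):
    """True if any individual load metric meets its corresponding threshold."""
    return any(metrics[name] >= thresholds[name] for name in metrics)

def all_metrics_meet(metrics, thresholds):
    """True if all individual load metrics meet their corresponding thresholds."""
    return all(metrics[name] >= thresholds[name] for name in metrics)

def combined_metric_meets(metrics, weights, threshold):
    """True if a weighted combination of individual load metrics meets a single threshold."""
    combined = sum(weights[name] * metrics[name] for name in metrics)
    return combined >= threshold

# Example: normalized queue fill level and ingress data rate as individual load metrics.
metrics = {"queue_fill": 0.9, "port_rx_rate": 0.7}
print(any_metric_meets(metrics, {"queue_fill": 0.8, "port_rx_rate": 0.8}))               # True
print(all_metrics_meet(metrics, {"queue_fill": 0.8, "port_rx_rate": 0.8}))               # False
print(combined_metric_meets(metrics, {"queue_fill": 0.5, "port_rx_rate": 0.5}, 0.75))    # True
```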
- the load-aware PSD module 280 is configured to measure PSD information corresponding to an individual entity corresponding to the ingress portion 204-xa, such as PSD information regarding packets received by a port 208, packets stored in an ingress queue 228, packets stored in an ingress buffer 224, etc. For example, when the load metric corresponds to an individual entity, the load-aware PSD module 280 measures PSD information regarding the individual entity, in an embodiment.
- the load-aware PSD module 280 is configured, additionally or alternatively, to measure PSD information corresponding to a group of entities corresponding to the ingress portion 204-xa, such as PSD information regarding packets received by a set of multiple ports 208, packets stored in a set of multiple ingress queues 228, packets stored in a set of multiple ingress buffers 224, etc. For example, when the load-aware PSD module 280 determines whether any of multiple individual load metrics corresponding to multiple entities meet a corresponding threshold(s), the load-aware PSD module 280 measures PSD information regarding all of the multiple entities, in an embodiment.
- each ingress arbitration circuitry 220 includes one load-aware PSD module 280
- each of at least one ingress arbitration circuitry 220 includes multiple load-aware PSD modules 280, in some embodiments.
- Two or more ingress arbitration circuitry 220 include different numbers of multiple load-aware PSD modules 280, in some embodiments.
- At least one ingress arbitration circuitry 220 does not include any load-aware PSD modules 280, in some embodiments.
- the traffic manager circuitry 248 includes one or more load-aware PSD modules 284.
- the load-aware PSD modules 284 are similar to the load-aware PSD modules 280, but measure load metrics and PSD information regarding entities of the egress portion 204-xb.
- the load-aware PSD module 284 uses load metrics such as i) a rate at which data is being transmitted via a port 212, ii) a fill level of an egress queue 256, iii) a length of an egress queue 256, iv) a time delay between when a packet is added to an egress queue 256 and when the packet is dequeued from the egress queue 256, v) an occupancy level of an egress buffer 252, etc.
- the load-aware PSD module 284 is configured to measure PSD information corresponding to an individual entity corresponding to the egress portion 204-xb, such as PSD information regarding packets transmitted by a port 212, packets stored in an egress queue 256, packets stored in an egress buffer 252, etc. For example, when the load metric corresponds to an individual entity, the load-aware PSD module 284 measures PSD information regarding the individual entity, in an embodiment.
- the load-aware PSD module 284 is configured, additionally or alternatively, to measure PSD information corresponding to a group of entities corresponding to the egress portion 204-xb, such as PSD information regarding packets transmitted by a set of multiple ports 212, packets stored in a set of multiple egress queues 256, packets stored in a set of multiple egress buffers 252, etc. For example, when the load-aware PSD module 284 determines whether any of multiple individual load metrics corresponding to multiple entities meet a corresponding threshold(s), the load-aware PSD module 284 measures PSD information regarding all of the multiple entities, in an embodiment.
- the egress arbitration circuitry 272 also includes one or more load-aware PSD modules 288.
- the load-aware PSD modules 288 are similar to the load-aware PSD modules 284, but measure PSD information after packets have been processed by the egress packet processor 268, which may change the size of packets by adding tunnel headers, removing tunnel headers, modifying headers, etc.
- the load-aware PSD module 288 uses load metrics such as i) a rate at which data is being transmitted via a port 212, ii) a fill level of an egress queue 256, iii) a length of an egress queue 256, iv) a time delay between when a packet is added to an egress queue 256 and when the packet is dequeued from the egress queue 256, v) an occupancy level of an egress buffer 252, etc.
- the load-aware PSD module 288 is configured to measure PSD information for packets that have been processed by the egress packet processor 268 and that correspond to an individual entity corresponding to the egress portion 204-xb, such as PSD information regarding packets transmitted by a port 212, packets stored in an egress queue 256, packets stored in an egress buffer 252, etc. For example, when the load metric corresponds to an individual entity, the load-aware PSD module 288 measures PSD information regarding the individual entity, in an embodiment.
- the load-aware PSD module 288 is configured, additionally or alternatively, to measure PSD information for packets that have been processed by the egress packet processor 268 and that correspond to a group of entities corresponding to the egress portion 204-xb, such as PSD information regarding packets transmitted by a set of multiple ports 212, packets stored in a set of multiple egress queues 256, packets stored in a set of multiple egress buffers 252, etc. For example, when the load-aware PSD module 288 determines whether any of multiple individual load metrics corresponding to multiple entities meet a corresponding threshold(s), the load-aware PSD module 288 measures PSD information regarding all of the multiple entities, in an embodiment.
- the PSD modules 280, 284, 288 are implemented using hardware circuitry and/or one or more processors executing machine-readable instructions stored in one or more memories coupled to the one or more processors.
- PSD information generated by the PSD modules 280, 284, 288 is used to control the network device 200, in some embodiments.
- the network device 200 redistributes, within the network device 200, processing of one or more packet types, one or more flows of packets, etc., contributing to the high percentage of short-length packets corresponding to the port 208, 212 or queue 228, 256.
- the network device 200 redirects the one or more packet types, one or more flows of packets, etc., contributing to the high percentage of short-length packets to a dedicated queue 228, 256, and processes (e.g., with the ingress packet processor 232, the egress packet processor 268, etc.) packets in the dedicated queue 228, 256 at a reduced rate to reduce power consumption associated with the processing of the packets in the dedicated queue 228, 256.
- the network device 200 adjusts buffer allocation algorithms, queue allocations, buffer admission policy, buffer storage algorithms, etc., based on the PSD information determined when a load metric indicates a high processing load, according to some embodiments.
- the network device 200 redirects the one or more packet types, one or more flows of packets, etc., contributing to the high percentage of short-length packets to a dedicated queue 228, 256, and reduces a clock rate of a processor that is processing packets in the dedicated queue 228, 256 to reduce power consumption associated with the processing of the packets in the dedicated queue 228, 256.
- the network device 200 redirects the one or more packet types, one or more flows of packets, etc., contributing to the high percentage of short-length packets to another egress port 212 with a lower load, the other egress port 212 corresponding to an alternative path through a network.
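- The sketch below illustrates one of the control actions described above: from per-flow size histograms collected during a high-load measurement window, select flows dominated by small packets as candidates for a dedicated, rate-limited queue. The flow records, the 128-byte cutoff, the selection threshold, and the queue identifier are assumptions for the example.
```python
SMALL_PACKET_CUTOFF = 128       # bytes; assumed boundary for "small" packets
DEDICATED_QUEUE_ID = 7          # assumed dedicated, rate-limited queue

def select_flows_for_dedicated_queue(per_flow_histograms, small_fraction_threshold=0.6):
    """Return flow ids whose measured size distribution is dominated by small packets."""
    selected = []
    for flow_id, histogram in per_flow_histograms.items():
        total = sum(histogram.values())
        small = sum(count for size_bin, count in histogram.items()
                    if size_bin < SMALL_PACKET_CUTOFF)
        if total and small / total >= small_fraction_threshold:
            selected.append(flow_id)
    return selected

# Example: histograms keyed by the lower edge of each packet-size bin, in bytes.
flows = {"flow_a": {64: 900, 256: 100}, "flow_b": {64: 50, 1024: 950}}
print(select_flows_for_dedicated_queue(flows))   # ['flow_a'] -> steer to DEDICATED_QUEUE_ID
```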
- In some embodiments, the network device 200 sends collected PSD information and optionally other telemetry information such as buffer lengths, queue lengths, latency measurements, etc., to an analyzer and/or controller that is external to the network device 200.
- the analyzer and/or controller determines initial operating parameters (e.g., a processing rate for the network device 200, a selection of network paths to be routed through the network device 200, etc.) to reduce power consumption by the network device 200.
- the analyzer and/or controller determines whether operating parameters of the network device and/or other network devices in the network should be adjusted based on collected PSD information and optionally other telemetry information, and responsively adjusts the operating parameters of the network device 200 and/or other network devices, in an embodiment.
- the analyzer and/or controller uses reinforcement learning to determine optimal operating parameters for the network device 200 and/or other network devices, for example, by using the collected PSD information and optionally other telemetry information as feedback, according to an embodiment.
- In some embodiments, the one or more CPUs 276 adjust operations being performed by the CPU(s) 276 based on the PSD information. For example, the CPU(s) 276 reduce a rate at which statistical information regarding the network device 200 is collected by the CPU(s) 276. Additionally or alternatively, in embodiments in which the CPU(s) 276 perform functions related to artificial intelligence and/or machine learning (AI/ML) network analysis and/or control, the CPU(s) 276 reduce a rate at which such functions are performed.
- the PSD information is provided to another device in a communication network to which the network device 200 belongs for network control and/or monitoring operations such as network optimization, congestion management, troubleshooting, etc.
- the CPU 276 generates one or more packets that include PSD information, and the CPU 276 controls the network device 200 to transmit the one or more packets to another network device in the communication network.
- the other network device uses the PSD information in the one or more packets to perform one or more functions related to network optimization, congestion management, troubleshooting, etc., in an embodiment.
- In response to the PSD information indicating a high percentage of small-sized packets during a period of high processing load, the network device 200 identifies one or more other network devices in the communication network that are transmitting high numbers of small-sized packets to the network device 200, and then transmits flow control packets to the one or more other network devices.
- in response to the PSD information indicating a high percentage of small-sized packets during a period of high processing load, the network device 200 identifies one or more packet flows that are contributing high numbers of small-sized packets, and then begins notifying one or more other network devices that packets in the one or more packet flows are causing congestion in the network device 200, such as by explicit congestion notification (ECN) marking packets in the one or more packet flows or by using another suitable congestion notification mechanism.
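- The following Python sketch is a simplified, hypothetical illustration of identifying packet flows that contribute many small-sized packets so that, for example, flow control or ECN marking could be directed toward them; the packet-record format, size cutoff, and threshold are assumptions made only for this example.

```python
from collections import Counter

SHORT_PACKET_BYTES = 128        # hypothetical cutoff for a "small-sized" packet
MIN_SMALL_PACKETS = 100         # hypothetical per-flow count that triggers a notification

def flows_to_notify(observed_packets):
    """Return flow ids contributing many small packets.

    observed_packets is an iterable of (flow_id, packet_size_in_bytes) tuples.
    The returned flows are candidates for ECN marking or flow control messages.
    """
    small_per_flow = Counter()
    for flow_id, size in observed_packets:
        if size <= SHORT_PACKET_BYTES:
            small_per_flow[flow_id] += 1
    return [flow for flow, count in small_per_flow.items() if count >= MIN_SMALL_PACKETS]

packets = [("flow_a", 64)] * 150 + [("flow_b", 1500)] * 200 + [("flow_b", 64)] * 20
print(flows_to_notify(packets))  # ['flow_a']
```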
- PSD information is statistical information regarding the distribution of packet sizes in a set of multiple packets. Counts of packets having packet sizes that fall within respective packet size ranges are an illustrative example of PSD information, and such counts are sometimes referred to as a histogram of packet sizes, or a packet size histogram. Other examples of PSD information include a combination of statistical measurements such as a combination of i) one or more statistics measuring a respective central tendency (e.g., mean, median, mode, etc.) and ii) one or more statistics measuring dispersion or variation (e.g., range, standard deviation, variance, mean absolute difference, median absolute deviation, average deviation, etc.).
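- As a non-authoritative illustration of the two forms of PSD information described above, the following Python sketch computes both a packet size histogram over a set of bins and a combination of central-tendency and dispersion statistics; the bin boundaries and function names are assumptions introduced for this example.

```python
import bisect
import statistics

# Hypothetical bin upper bounds in bytes (any increasing boundaries could be used).
BIN_UPPER_BOUNDS = [64, 127, 511, 2047, 4095, 9216]

def packet_size_histogram(packet_sizes):
    """Counts per size range -- one illustrative form of PSD information."""
    counts = [0] * len(BIN_UPPER_BOUNDS)
    for size in packet_sizes:
        idx = bisect.bisect_left(BIN_UPPER_BOUNDS, size)   # first bound >= size
        counts[min(idx, len(BIN_UPPER_BOUNDS) - 1)] += 1   # clamp oversized packets
    return counts

def packet_size_summary(packet_sizes):
    """Central tendency plus dispersion -- another illustrative form of PSD information."""
    return {
        "mean": statistics.mean(packet_sizes),
        "median": statistics.median(packet_sizes),
        "stdev": statistics.pstdev(packet_sizes),
        "range": max(packet_sizes) - min(packet_sizes),
    }

sizes = [64, 64, 128, 1500, 64, 9000, 512, 64]
print(packet_size_histogram(sizes))   # [4, 0, 1, 2, 0, 1]
print(packet_size_summary(sizes))
```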
- Fig. 3 is a simplified block diagram of an example PSD module 300, according to an embodiment.
- the PSD module 300 corresponds to one or more of the PSD modules 280, 284, 288 of Fig. 2A, in some embodiments. In some embodiments, one or more (or all) of the PSD modules 280, 284, 288 have a suitable structure that is different than that of the PSD module 300. Additionally, the PSD module 300 is used in another suitable network device different than the network device 200 of Figs. 2A-B, in some embodiments.
- the PSD module 300 includes a bank 304 of counters (sometimes referred to herein as the “counter bank 304”) that is used for maintaining counts of packets that fall within different packet size ranges. Counts of packets that fall within different packet size ranges are an example of packet size distribution (PSD) information.
- the counter bank 304 includes H sets of counters, where H is a suitable positive integer.
- Each set of counters is sometimes referred to herein as a “histogram set”.
- Each histogram set corresponds to an entity of the network device (e.g., a port, a queue, a buffer, etc.), or a group of entities, and each histogram set is used to maintain counts of packets, corresponding to the entity or group of entities, that fall within different packet size ranges.
- Each histogram set includes a plurality of counters 312, each counter 312 corresponding to a respective packet size range.
- the different packet size ranges being counted by counters 312 in a histogram set are sometimes referred to herein as “bins,” and the counters 312 in a histogram set are sometimes referred to herein as “bin counters”.
- a granularity of packet size ranges being counted in a histogram set is configurable. Thus, when a histogram set is configured with coarser granularity (i.e., larger size ranges), some counters 312 in the histogram set are not used, in some embodiments.
- At least some counters 312 can be selectively used for different histogram sets. For example, if the granularity of a first histogram set does not need a maximum number of counters 312, counters 312 that are not needed for the first histogram set can be used for another histogram set.
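- A minimal software model of such a counter bank, assuming H histogram sets of equal size and noting (but not implementing) the counter-sharing optimization, is sketched below in Python; the class and parameter names are hypothetical.

```python
class CounterBank:
    """Software model of a bank of bin counters partitioned into histogram sets.

    Each histogram set owns a contiguous slice of counters_per_set counters; in
    hardware, counters left unused by a coarse-granularity set could instead be
    assigned to another set, which this simplified model does not attempt.
    """

    def __init__(self, num_sets, counters_per_set):
        self.counters_per_set = counters_per_set
        self.counters = [0] * (num_sets * counters_per_set)

    def increment(self, histogram_index, bin_index):
        # Merged index = base offset of the histogram set + relative bin index.
        self.counters[histogram_index * self.counters_per_set + bin_index] += 1

    def snapshot(self, histogram_index):
        base = histogram_index * self.counters_per_set
        return list(self.counters[base:base + self.counters_per_set])

    def reset(self, histogram_index):
        base = histogram_index * self.counters_per_set
        self.counters[base:base + self.counters_per_set] = [0] * self.counters_per_set

bank = CounterBank(num_sets=4, counters_per_set=8)
bank.increment(histogram_index=1, bin_index=0)
print(bank.snapshot(1))  # [1, 0, 0, 0, 0, 0, 0, 0]
```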
- the PSD module 300 also includes a histogram index generator 320 that is configured to generate an indicator of a corresponding histogram set within the counter bank 304 (sometimes referred to herein as a “histogram index”) based on an indicator of an entity (or a group of entities) associated with a packet (e.g., a port that received or will transmit the packet, a queue in which the packet was stored, a buffer in which the packet was stored, etc.).
- indicators of entities include port identifiers (e.g., identifiers of ingress ports 208 and/or egress ports 212).
- indicators of entities include identifiers of groups of ports.
- indicators of entities additionally or alternatively include queue identifiers (e.g., identifiers of ingress queues 228 and/or egress queues 256) and/or identifiers of groups of queues.
- indicators of entities additionally or alternatively include buffer identifiers (e.g., identifiers of ingress buffers 224 and/or egress buffers 252) and/or identifiers of groups of buffers.
- the histogram index generator 320 is configured to generate the histogram index further based on one or more characteristics of the packet, such as packet type, a protocol type, a type of packet flow to which the packet belongs, etc.
- a histogram set can be used for generating PSD information for packets associated with a particular entity (or group of entities) and having one or more particular packet characteristics, in an embodiment.
- a first histogram set is used for generating PSD information for packets associated with a particular entity and having a first set of one or more particular characteristics
- a second histogram set is used for generating PSD information for packets associated with the particular entity and having a second set of one or more particular characteristics.
- a histogram set is used for generating PSD information for packets associated with a particular entity and having a set of one or more particular characteristics, but packets associated with the particular entity and not having the set of one or more particular characteristics are not used for generating the PSD information.
- the histogram index generator 320 includes (or is coupled to) a configuration memory 322 that stores configuration information that includes associations between histogram sets and entities. In such embodiments, the histogram index generator 320 uses the associations between histogram sets and entities to determine one or more histogram sets associated with an entity (or group of entities). In some embodiments, the configuration information in the configuration memory 322 includes associations between histogram sets and entity(ies)/packet characteristic tuples. In such embodiments, the histogram index generator 320 uses the associations between histogram sets and the entity(ies)/packet characteristics to determine a histogram set associated with an entity(ies)/packet characteristics tuple.
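- A simplified, hypothetical rendering of such a configuration memory as a lookup keyed by entity identifier and packet characteristic is sketched below; the entity names, packet types, and wildcard convention are assumptions made only for this example.

```python
# Hypothetical configuration "memory": maps an (entity id, packet characteristic)
# tuple to a histogram set index.  A characteristic of None acts as a wildcard.
HISTOGRAM_CONFIG = {
    ("egress_port_3", "tcp"): 0,
    ("egress_port_3", None): 1,
    ("queue_12", None): 2,
}

def histogram_index(entity_id, packet_characteristic):
    """Resolve a histogram set, preferring an exact entity/characteristic match."""
    exact = HISTOGRAM_CONFIG.get((entity_id, packet_characteristic))
    if exact is not None:
        return exact
    return HISTOGRAM_CONFIG.get((entity_id, None))  # None if no set is configured

print(histogram_index("egress_port_3", "tcp"))  # 0
print(histogram_index("egress_port_3", "udp"))  # 1
print(histogram_index("queue_12", "rdma"))      # 2
```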
- the PSD module 300 also includes a granularity table 324 that is configured to store respective indications of granularity of the histogram sets of the counter bank 304.
- the granularity table 324 stores a respective indication of granularity for each histogram set, in an embodiment.
- the indication of granularity indicates a size range of each counter 312 in the histogram set, in an embodiment.
- the indication of granularity additionally or alternatively indicates a number of counters 312 in the histogram set, in another embodiment.
- the granularity table 324 is configured to receive a histogram index from the histogram index generator 320.
- the granularity table 324 uses the histogram index to look up an indication of granularity that corresponds to the histogram index, and outputs the indication of granularity.
- the PSD module 300 also includes a bin counter index generator 328 that is configured to generate a relative index of a bin counter 312 within a histogram set based on i) a packet size of a packet that is to be counted, and ii) an indication of granularity received from the granularity table 324.
- the granularity table 324 is omitted and the bin counter index generator 328 generates the relative index without the indication of granularity received from the granularity table 324.
- a merged index generator 332 is configured to generate a merged index into the counter bank 304 using i) the histogram index generated by the histogram index generator 320 and ii) the bin counter index generated by the bin counter index generator 328.
- the merged index generated by the merged index generator 332 selects a counter 312 from amongst multiple histogram sets in the counter bank 304, in an embodiment.
- the merged index generated by the merged index generator 332 selects a counter 312 from amongst all of the counters 312 in the counter bank 304, in an embodiment.
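- The index arithmetic described above can be illustrated with the following non-authoritative Python sketch, which assumes uniform bin widths within a histogram set and a fixed number of counters per set; real implementations may encode granularity differently.

```python
def bin_counter_index(packet_size, bin_width, num_bins):
    """Relative bin index within a histogram set, assuming uniform bin widths.

    The granularity table entry is modeled as a (bin_width, num_bins) pair;
    hardware could equally encode non-uniform or power-of-two ranges.
    """
    return min(packet_size // bin_width, num_bins - 1)

def merged_index(histogram_index, bin_index, counters_per_set):
    """Flat index into the counter bank combining the set index and bin index."""
    return histogram_index * counters_per_set + bin_index

# Example: 256-byte bins, 8 bins per set, histogram set 2, 700-byte packet.
b = bin_counter_index(packet_size=700, bin_width=256, num_bins=8)        # -> 2
print(merged_index(histogram_index=2, bin_index=b, counters_per_set=8))  # -> 18
```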
- the PSD module 300 also includes an update signal generator 348 that is configured to initiate measurement of PSD information for packets processed by the network device in response to determining that a load metric meets a condition. For example, when a load metric corresponding to an entity (e.g., a port, a queue, a buffer, etc.) meets a condition, the update signal generator 348 initiates measurement of PSD information regarding packets associated with the entity (e.g., packets received via a port, packets transmitted by a port, packets stored in a queue, packets stored in a buffer, etc.), according to an embodiment. In some embodiments, when a load metric corresponding to a group of entities meets a condition, the update signal generator 348 initiates measurement of PSD information regarding packets associated with the group of entities.
- the update signal generator 348 determines when the counter bank 304 is to update the PSD information regarding the entity(ies). For example, the update signal generator 348 generates an update signal that indicates when the counter bank 304 is to update PSD information.
- the update signal generator 348 generates the update signal based on packet events corresponding to entity(ies).
- packet events corresponding to entity(ies) include a packet being received via a port, a packet being transmitted via a port, a packet being scheduled for transmission via a port, a packet being stored in a queue, a packet being dequeued from a queue, a packet being stored in a buffer, a packet being retrieved from a buffer, etc., according to various embodiments.
- the update signal generator 348 generates the update signal further based on one or more characteristics of a packet corresponding to a packet event, such as packet type, a protocol type, a type of packet flow to which the packet belongs, a classification of the packet, etc. For example, the update signal generator 348 generates the update signal only for packets having one or more particular packet characteristics so that PSD information is measured only for packets having the one or more particular packet characteristics, in an embodiment.
- the update signal generator 348 includes (or is coupled to) a configuration memory 352 that stores configuration information that includes associations between entities and packet characteristics, the associations indicating, for each entity (or group of entities), the packet characteristics of packets corresponding to the entity (or group of entities) for which PSD information is to be measured.
- the update signal generator 348 uses the associations between entities and packet characteristics to determine when to generate the update signal so that PSD information is measured only for packets having certain particular packet characteristics, in an embodiment.
- the update signal generator 348 is configured to initiate measurement of PSD information for packets corresponding to an entity (or group of entities) in response to determining that a load metric corresponding to the entity(ies) exceeds a first threshold. In an embodiment, the update signal generator 348 is configured to stop measurement of PSD information for packets corresponding to the entity(ies) in response to determining that the load metric corresponding to the entity(ies) falls below a second threshold. In an embodiment, the second threshold is the same as the first threshold. In another embodiment, the second threshold is below the first threshold to provide hysteresis.
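- A minimal sketch of this threshold-with-hysteresis behavior, assuming a single scalar load metric and illustrative threshold values, is shown below in Python.

```python
def update_measurement_state(measuring, load_metric, start_threshold, stop_threshold):
    """Return whether PSD measurement should be on after observing load_metric.

    Measurement starts when the metric exceeds start_threshold and stops when it
    falls below stop_threshold; stop_threshold < start_threshold gives hysteresis,
    while equal thresholds give a single trigger level.
    """
    if not measuring and load_metric > start_threshold:
        return True
    if measuring and load_metric < stop_threshold:
        return False
    return measuring

measuring = False
for load in [0.2, 0.55, 0.7, 0.45, 0.35, 0.6]:
    measuring = update_measurement_state(measuring, load,
                                         start_threshold=0.5, stop_threshold=0.4)
    print(load, measuring)   # stays on at 0.45 because of the hysteresis band
```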
- Fig. 4 is a graph showing an illustrative example of PSD information 400 measured at less than 50% peak load and at greater than 50% peak load, according to an embodiment.
- the PSD information 400 includes counts of packets falling into six packet size ranges (or “bins”): i) less than or equal to 64 bytes, ii) 65-127 bytes, iii) 128-511 bytes, iv) 512-2047 bytes, v) 2048-4095 bytes, and vi) 4096-9216 bytes.
- PSD information measured by a load-aware PSD module 280, 284, 288, 300 includes more packet size ranges and/or different packet size ranges as compared to the example PSD information 400 of Fig. 4.
- Fig. 5 is a simplified example state diagram 500 for circuitry that controls generation of PSD information in a network device, according to an embodiment.
- the update signal generator 348 of Fig. 3 implements the state diagram 500, according to an embodiment, and the state diagram 500 is described with reference to Fig. 3 for ease of explanation. In other embodiments, the update signal generator 348 implements another suitable set of state transitions different than the state diagram 500. Additionally, the state diagram 500 is implemented by another PSD measurement apparatus different than the PSD module 300 of Fig. 3, in some embodiments.
- PSD measurement is turned off for an entity or group of entities.
- the update signal generator 348 remains in the state 504 while a load metric corresponding to the entity/group of entities remains below a first threshold.
- the update signal generator 348 transitions to a state 508.
- the update signal generator 348 turns PSD measurement on for the entity/group of entities. Additionally, PSD measurement remains on for the entity/group of entities while the update signal generator 348 remains in the state 508.
- the update signal generator 348 remains in the state 508 while the load metric corresponding to the entity/group of entities remains above a second threshold. In response to the load metric falling below the second threshold (and/or equaling the second threshold, in some embodiments), the update signal generator 348 transitions to the state 504. Upon transitioning to the state 504, the update signal generator 348 turns PSD measurement off for the entity/group of entities.
- the second threshold is below the first threshold to provide hysteresis. In another embodiment, the second threshold is equal to the first threshold.
- PSD measurements corresponding to the entity/group of entities are reset upon transitioning to the state 508.
- counters 312 of a histogram set corresponding to the entity/group of entities are reset upon transitioning to the state 508.
- PSD measurements corresponding to the entity/group of entities are not reset upon transitioning to the state 508 so that running PSD measurements are made over multiple distinct instances of high load associated with the entity/group of entities.
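- The following Python sketch is a non-authoritative software analogue of the state diagram 500, with hypothetical state names, thresholds, and a flag selecting between the reset-on-entry and accumulate-across-episodes behaviors described above.

```python
class PsdMeasurementFsm:
    """Two-state controller mirroring the OFF (504) / ON (508) states of Fig. 5."""

    OFF, ON = "state_504", "state_508"

    def __init__(self, start_threshold, stop_threshold, reset_on_start=True):
        self.state = self.OFF
        self.start_threshold = start_threshold
        self.stop_threshold = stop_threshold
        self.reset_on_start = reset_on_start   # False models accumulation across episodes

    def step(self, load_metric, histogram_counters):
        if self.state == self.OFF and load_metric > self.start_threshold:
            self.state = self.ON
            if self.reset_on_start:
                histogram_counters[:] = [0] * len(histogram_counters)
        elif self.state == self.ON and load_metric < self.stop_threshold:
            self.state = self.OFF
        return self.state == self.ON           # True while PSD measurement is enabled

counters = [3, 1, 0, 0]                        # stale counts from an earlier episode
fsm = PsdMeasurementFsm(start_threshold=0.8, stop_threshold=0.6)
for load in [0.5, 0.9, 0.7, 0.5]:
    print(load, fsm.step(load, counters), counters)
```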
- Fig. 6 is another simplified example state diagram 600 for circuitry that controls generation of PSD information in a network device, according to another embodiment.
- the update signal generator 348 of Fig. 3 implements the state diagram 600, according to an embodiment, and the state diagram 600 is described with reference to Fig. 3 for ease of explanation. In other embodiments, the update signal generator 348 implements another suitable set of state transitions different than the state diagram 600. Additionally, the state diagram 600 is implemented by another PSD measurement apparatus different than the PSD module 300 of Fig. 3, in some embodiments.
- PSD measurement is turned off for an entity or group of entities.
- the update signal generator 348 remains in the state 604 while a load metric corresponding to the entity/group of entities remains below a threshold.
- the update signal generator 348 transitions to a state 608.
- the update signal generator 348 starts a timer of the update signal generator 348.
- the timer is configured to measure a suitable time period.
- the update signal generator 348 turns PSD measurement on for the entity/group of entities upon transitioning to the state 608.
- PSD measurement remains on for the entity/group of entities while the update signal generator 348 remains in the state 608.
- the update signal generator 348 remains in the state 608 while the timer has not expired. In response to the timer expiring, the update signal generator 348 transitions to the state 604. Upon transitioning to the state 604, the update signal generator 348 turns PSD measurement off for the entity/group of entities.
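- A corresponding non-authoritative sketch of the timer-based control of the state diagram 600, using illustrative threshold and window values, is shown below.

```python
import time

class TimedPsdMeasurement:
    """Controller mirroring Fig. 6: measurement starts when the load metric crosses
    a threshold and stops when a fixed-duration timer expires, regardless of the
    load level during the measurement window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window_seconds = window_seconds
        self.expires_at = None                 # None models the OFF state 604

    def step(self, load_metric, now=None):
        now = time.monotonic() if now is None else now
        if self.expires_at is None:
            if load_metric > self.threshold:   # transition 604 -> 608: start the timer
                self.expires_at = now + self.window_seconds
        elif now >= self.expires_at:           # timer expired: transition 608 -> 604
            self.expires_at = None
        return self.expires_at is not None     # True while PSD measurement is enabled

ctrl = TimedPsdMeasurement(threshold=0.8, window_seconds=5.0)
print(ctrl.step(0.9, now=0.0))   # True  -- high load opens the measurement window
print(ctrl.step(0.2, now=3.0))   # True  -- still inside the window despite low load
print(ctrl.step(0.2, now=6.0))   # False -- timer expired, measurement turned off
```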
- the terms “above” and “below” are relative terms that depend on the load metric being compared.
- the load metric being “above” a threshold corresponds to a processing load of a network device being relatively high
- the load metric being “below” a threshold corresponds to a processing load of a network device being relatively low. If a particular load metric is inversely proportional to processing load (such as a load metric indicating available processing capacity), then the particular load metric being below a threshold indicates relatively high processing load, whereas the particular load metric being above a threshold indicates relatively low processing load.
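- A small, hypothetical helper that makes this direction-dependence explicit is sketched below; the parameter names are assumptions introduced for this example.

```python
def is_high_load(load_metric, threshold, higher_means_more_load=True):
    """Interpret 'above'/'below' consistently for both kinds of load metrics.

    For a metric that falls as load rises (e.g., available processing capacity),
    the comparison is simply inverted.
    """
    return load_metric > threshold if higher_means_more_load else load_metric < threshold

print(is_high_load(0.9, 0.8))                                 # utilization-style metric
print(is_high_load(0.1, 0.2, higher_means_more_load=False))   # capacity-style metric
```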
- state transition diagram 500 and/or the state transition diagram 600 are implemented using hardware circuitry (e.g., a hardware state machine) and/or one or more processors executing machine readable instructions stored in one or more memories coupled to the one or more processors.
- Fig. 7 is a simplified flow diagram of an example method 700 for controlling a network device based on PSD measurements, according to an embodiment.
- the network device 200 of Figs. 2A-B implements the method 700, according to an embodiment, and the method 700 is described with reference to Figs. 2A-B for ease of explanation.
- the network device 200 implements another suitable method for controlling the network device 200 based on PSD measurements different than the method 700.
- the method 700 is implemented by another suitable network device different than the network device 200 of Figs. 2A-B, in some embodiments.
- the method 700 is implemented using the PSD module 300 of Fig. 3, and the method 700 is described with reference to Fig. 3 for ease of explanation.
- the PSD module 300 is used to implement another suitable method for controlling a network device based on PSD measurements.
- the method 700 is implemented using another suitable PSD measurement apparatus different than the PSD module 300 of Fig. 3, in some embodiments.
- a network device determines a load metric corresponding to a processing load of the network device.
- the packet processing module 204 determines (e.g., the ingress arbitration circuitry 220 determines, the traffic manager circuitry 248 determines, etc.) the load metric, the load metric corresponding to an entity (or group of entities) of the packet processing module 204, such as a port 208, 212, a queue 228, 256, a buffer 224, 252, etc.
- the network device determines whether the load metric determined at block 704 meets a threshold. In response to determining that the load metric does not meet the threshold, the flow repeats block 708. For example, block 704 involves repeatedly determining the load metric over time, and block 708 involves repeatedly comparing the load metric to the threshold over time.
- the ingress arbitration circuitry 220 and/or the traffic manager circuitry 248 determines whether the load metric meets the threshold.
- the PSD module 300 (e.g., the update signal generator 348) determines whether the load metric meets the threshold.
- in response to determining that the load metric meets the threshold, the flow proceeds to block 712.
- the network device begins generating PSD measurements.
- the load metric determined at block 704 corresponds to an entity or group of entities of the network device, and the PSD measurements that are begun at block 712 are for packets corresponding to the entity or group of entities.
- the ingress arbitration circuitry 220 and/or the traffic manager circuitry 248 begins generating PSD measurements at block 712.
- the PSD module 300 begins generating the PSD measurements at block 712.
- the network device ends the PSD measurements that were begun at block 712.
- the ingress arbitration circuitry 220 and/or the traffic manager circuitry 248 ends the PSD measurements at block 716.
- the PSD module 300 ends the PSD measurements at block 716.
- the threshold to which the load metric is compared at block 708 is a first threshold
- the method 700 further comprises comparing the load metric to a second threshold.
- the PSD measurements are ended at block 716 in response to determining that the load metric falls below the second threshold.
- the method 700 further comprises starting a timer in connection with beginning generation of PSD measurements at block 712.
- the PSD measurements are ended at block 716 in response to determining that the timer has expired.
- the network device uses the PSD measurements made in connection with blocks 712 and 716 to control the network device.
- using the PSD measurements at block 720 includes adjusting a buffer allocation algorithm implemented by the network device based on the PSD measurements. In another embodiment, using the PSD measurements at block 720 includes adjusting queue allocations of the network device based on the PSD measurements.
- using the PSD measurements at block 720 includes, when the load of a port 208, 212 or queue 228, 256 is high and the PSD measurements indicate a relatively high percentage of short-length packets corresponding to the port 208, 212 or queue 228, 256, the network device 200 redistributing, within the network device 200, processing of one or more packet types, one or more flows of packets, etc., contributing to the high percentage of short-length packets corresponding to the port 208, 212 or queue 228, 256.
- using the PSD measurements at block 720 includes the network device 200 redirecting one or more packet types, one or more flows of packets, etc., contributing to a high percentage of short-length packets to another egress port 212 with a lower load, the other egress port 212 corresponding to an alternative path through a network.
- using the PSD measurements at block 720 includes the one or more CPUs 276 adjusting operations being performed by the CPU(s) 276 based on the PSD measurements.
- the method 700 includes, in addition to the block 720 or instead of the block 720, the network device providing the PSD measurements made in connection with blocks 712 and 716 to another network device for use by the other network device in controlling and/or monitoring a communication network in which the network device operates.
- the method 700 includes, in addition to the block 720 or instead of the block 720, the network device transmitting flow control messages and/or congestion notification messages to another network device based on the PSD measurements made in connection with blocks 712 and 716.
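- As a non-authoritative, end-to-end software illustration of blocks 704 through 720, the following Python sketch samples a load metric, collects a packet size histogram only while the load is high, and invokes a caller-supplied control action when measurement ends; all callables, thresholds, and bins are assumptions made only for this example.

```python
import bisect
import random

BIN_BOUNDS = [64, 127, 511, 2047, 4095, 9216]   # hypothetical bin upper bounds (bytes)

def method_700_sketch(sample_load, sample_packet_size, apply_control,
                      threshold=0.8, stop_threshold=0.6, iterations=1000):
    """Monitor a load metric, measure PSD only while the load is high, then act."""
    histogram = [0] * len(BIN_BOUNDS)
    measuring = False
    for _ in range(iterations):
        load = sample_load()                        # block 704: determine the load metric
        if not measuring and load > threshold:      # blocks 708/712: begin measuring
            measuring = True
            histogram = [0] * len(BIN_BOUNDS)
        elif measuring and load < stop_threshold:   # block 716: end the measurements
            measuring = False
            apply_control(histogram)                # block 720: use the PSD measurements
        if measuring:
            size = sample_packet_size()
            idx = min(bisect.bisect_left(BIN_BOUNDS, size), len(BIN_BOUNDS) - 1)
            histogram[idx] += 1

method_700_sketch(
    sample_load=lambda: random.random(),
    sample_packet_size=lambda: random.choice([64, 128, 1500, 9000]),
    apply_control=lambda h: print("PSD at end of high-load episode:", h),
)
```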
- At least some of the various blocks, operations, and techniques described above are suitably implemented utilizing dedicated hardware, such as one or more of discrete components, an integrated circuit, an ASIC, a programmable logic device (PLD), a processor executing firmware instructions, a processor executing software instructions, or any combination thereof.
- the software or firmware instructions may be stored in any suitable computer readable memory such as in a random access memory (RAM), a read-only memory (ROM), a solid state memory, etc.
- the software or firmware instructions may include machine readable instructions that, when executed by one or more processors, cause the one or more processors to perform various acts described herein.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Environmental & Geological Engineering (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
Description
Claims
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202318141269A | 2023-04-28 | 2023-04-28 | |
| US18/141,269 | 2023-04-28 | | |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024227123A1 true WO2024227123A1 (en) | 2024-10-31 |
Family
ID=91585978
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2024/026724 Pending WO2024227123A1 (en) | 2023-04-28 | 2024-04-28 | Load-aware packet size distribution measurement in a network device |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2024227123A1 (en) |
- 2024-04-28 WO PCT/US2024/026724 patent/WO2024227123A1/en active Pending
Patent Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US7088678B1 (en) * | 2001-08-27 | 2006-08-08 | 3Com Corporation | System and method for traffic shaping based on generalized congestion and flow control |
Non-Patent Citations (1)
| Title |
|---|
| DUQUE-TORRES ALEJANDRA ET AL: "Heavy-Hitter Flow Identification in Data Centre Networks Using Packet Size Distribution and Template Matching", 2019 IEEE 44TH CONFERENCE ON LOCAL COMPUTER NETWORKS (LCN), 31 October 2019 (2019-10-31), pages 10 - 17, XP093195447, Retrieved from the Internet <URL:https://ecs.wgtn.ac.nz/foswiki/pub/Groups/WiNe/WirelessNetworksResearchGroup/LCN2019.pdf> [retrieved on 20240815], DOI: 10.1109/LCN44214.2019.8990807 * |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12360924B2 (en) | Method and system for facilitating lossy dropping and ECN marking | |
| CN116671081B (en) | Delay-based automatic queue management and tail drop | |
| US8248930B2 (en) | Method and apparatus for a network queuing engine and congestion management gateway | |
| US8467342B2 (en) | Flow and congestion control in switch architectures for multi-hop, memory efficient fabrics | |
| US20240056385A1 (en) | Switch device for facilitating switching in data-driven intelligent network | |
| CN113472697A (en) | Network information transmission system | |
| Wang et al. | Flow distribution-aware load balancing for the datacenter | |
| US12231342B1 (en) | Queue pacing in a network device | |
| WO2024227123A1 (en) | Load-aware packet size distribution measurement in a network device | |
| KR20250174099A (en) | Measuring load-aware packet size distribution on network devices | |
| US12216518B2 (en) | Power saving in a network device | |
| US20250286835A1 (en) | Combining queues in a network device to enable high throughput | |
| US7009973B2 (en) | Switch using a segmented ring | |
| CN120856648A (en) | Minimize latency entry arbitration |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24734172; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 1020257039547; Country of ref document: KR; Free format text: ST27 STATUS EVENT CODE: A-0-1-A10-A15-NAP-PA0105 (AS PROVIDED BY THE NATIONAL OFFICE) |
| | WWE | Wipo information: entry into national phase | Ref document number: 2024734172; Country of ref document: EP |
| | ENP | Entry into the national phase | Ref document number: 2024734172; Country of ref document: EP; Effective date: 20251128 |