HK1175321B - Agile data center network architecture
Description
Background
Conventional data center network architectures have design deficiencies that reduce their flexibility (i.e., their ability to assign any server of the data center network to any service). First, the configuration of conventional networks is typically tree-shaped in nature and includes relatively expensive equipment. This can lead to congestion and the development of computing hotspots, even when spare capacity is available elsewhere in the network. Second, traditional data center networks do little to prevent a traffic flood in one service from affecting the other services around it. When one service experiences a traffic flood, it is common for all services sharing the same network sub-tree to suffer collateral damage. Third, routing designs in conventional data center networks typically achieve scale by assigning topologically significant Internet Protocol (IP) addresses to servers and partitioning the servers among virtual local area networks. However, this can create a significant configuration burden when servers are reallocated among services, thereby further fragmenting the resources of the data center. Moreover, human involvement is often required in these reconfigurations, limiting the speed of the process. Finally, other considerations, such as the difficulty of configuring conventional data center networks and the cost of the equipment used in these networks, can also negatively impact the flexibility of these networks.
SUMMARY
The present patent application relates in particular to a flexible network architecture that can be utilized in a data center. One implementation provides a virtual layer-two (layer-2) network that connects machines, such as servers, over a layer-three (layer-3) infrastructure.
Another implementation includes a plurality of computing devices communicatively coupled via a plurality of switches. Individual computing devices may be associated with application addresses. One individual computing device may be configured to act as a source, and another individual computing device may be configured to act as a destination. The source computing device may be configured to send a packet to the application address of the destination computing device. The implementation may also include a flexible proxy configured to intercept the packet, identify a location address associated with the destination computing device, and select an individual switch through which to send the packet to the location address.
The implementations listed above are provided for purposes of illustration and are not intended to include and/or limit all of the claimed subject matter.
Drawings
The drawings illustrate implementations of the concepts conveyed in this application. Features of the illustrated implementations may be more readily understood by reference to the following description taken in conjunction with the accompanying drawings. Wherever possible, the same reference numbers are used throughout the drawings to refer to the same or like elements. Moreover, the left-most digit of each reference number conveys the drawing in which that reference number is first introduced and the associated discussion.
FIGS. 1-6 illustrate examples of flexible network architectures according to some implementations of the inventive concepts.
FIGS. 7-9 illustrate examples of flexible network data center layouts according to some implementations of the inventive concepts.
FIG. 10 is a flow diagram of a flexible network method that may be implemented in accordance with some implementations of the inventive concepts.
Detailed Description
Overview
The present patent application relates in particular to a flexible network architecture that can be utilized in a data center. Cloud services drive the creation of large data centers that may hold hundreds of thousands of servers. These data centers may simultaneously support a large and varying number of different services (web applications, email, map-reduce clusters, etc.). The implementation of a cloud services data center may rely on a scale-out design: reliability and performance are achieved through a large pool of resources (e.g., servers) that can be quickly reallocated between services as needed. The ability to assign any server of the data center network to any service may be considered the flexibility of the data center network. Because building a data center entails significant costs, network flexibility may be valuable in realizing its benefits: without network flexibility, data center server resources would be stranded and money would be wasted.
First example Flexible network architecture
For purposes of introduction, consider FIGS. 1-2, which illustrate an example of a flexible network architecture 100. The flexible network architecture 100 may include multiple server-side computing devices, such as servers 102(1), 102(2), 102(3), and 102(N).
The terms "server" and "machine" should be understood to refer to any device that can send or receive data. For example, these terms should be understood to refer to any of the following: a physical server, a virtual machine running on a server (e.g., using virtualization technology), a computing device running a single operating system, a computing device running more than one operating system, a computing device running different operating systems (e.g., microsoft windows, Linux, FreeBSD), a computing device other than a server (e.g., a laptop, an addressable power supply), or a portion of a computing device (e.g., a network attached disk, a network attached memory, a storage subsystem, a Storage Area Network (SAN), a graphics processing unit, a digital accelerometer, a quantum computing device).
The flexible network architecture 100 may promote scalability relative to the number of servers. One way scalability can be achieved is by creating Ethernet-like flat addresses for servers 102(1)-102(N) using application addresses. Ethernet layer-two semantics may be associated with implementing a network state that supports flat addresses, where any Internet Protocol (IP) address may be assigned to any server connected to any network port, as if the servers were on a Local Area Network (LAN).
In this case, Application Addresses (AAs) 104(1), 104(2), 104(3), and 104(N) may be assigned to servers 102(1), 102(2), 102(3), and 102(N), respectively. From the server perspective, any server may talk to any other server via the associated application addresses 104(1), 104(2), 104(3), 104(N). This may be considered layer-two functionality, as the application addresses may be arranged in any manner, including any arrangement that would be valid for a Local Area Network (LAN) containing servers 102(1), 102(2), 102(3), and 102(N). However, as will be explained below, in some implementations the underlying infrastructure of the flexible network architecture may be layer three, as indicated at 106. Thus, these implementations can create a virtual layer-two network 108 on or with the layer-three infrastructure 106. More than one virtual layer-two network 108 may be created on the same layer-three infrastructure 106, and each server may belong to one or more of these virtual layer-two networks 108.
FIG. 2 illustrates an external client 202 connected to the flexible network architecture 100 via a network 204. The flexible network architecture 100 may allow external clients to communicate using global or location addresses 206 assigned to one or more of the servers 102(1)-102(N), without the external clients needing to be aware of the application addresses 104(1)-104(N). These inventive concepts are explained in more detail below with reference to the discussion of FIGS. 3-5.
Second example Flexible network architecture
FIG. 3 illustrates an example flexible network architecture 300 in which the above-described concepts may be implemented. In this case, external client 302 may communicate with flexible system 304 via the Internet 306 and/or other networks. In this implementation, flexible system 304 includes: a set of routers indicated generally at 308 and specifically at 308(1) through 308(N); a plurality of intermediate switches indicated generally at 310 and specifically at 310(1), 310(2), and 310(N); a plurality of aggregation switches indicated generally at 312 and specifically at 312(1), 312(2), and 312(N); a plurality of top-of-rack (TOR) switches indicated generally at 314 and specifically at 314(1), 314(2), and 314(N); and a plurality of servers indicated generally at 316 and specifically at 316(1), 316(2), 316(3), 316(4), 316(5), and 316(N). Due to the space constraints of the drawing page, only six servers 316(1)-316(N) are shown, but the flexible system 304 can easily accommodate thousands, tens of thousands, hundreds of thousands, or more servers. Note that for simplicity, and due to space constraints of the drawing page, not all connections (i.e., communication paths) between components are shown in FIGS. 3-8.
Servers 316(1) and 316(2) are associated with TOR switch 314(1) as server rack 318(1). Similarly, servers 316(3) and 316(4) are associated with TOR switch 314(2) as server rack 318(2), and servers 316(5) and 316(N) are associated with TOR switch 314(N) as server rack 318(N). Only two servers are shown per rack due to space constraints of the drawing page; a server rack typically includes ten or more servers. Further, individual servers may be associated with flexible proxies. For example, server 316(1) is associated with flexible proxy 320(1). Similar relationships are shown between servers 316(2)-316(N) and flexible proxies 320(2)-320(N), respectively.
The functionality of flexible agents 320(1) -320(N) is described in more detail below. In short, the flexible proxy may facilitate communication between individual servers. In this particular example, the flexible agent may be viewed as a logical module stored as computer readable instructions on the server. Other implementations may involve configurations in which a flexible proxy 320 serving a set of servers is located on a switch (e.g., TOR switch 314 or intermediate switch 310). When located on a switch, the flexible proxy may process packets as they flow from the server 316 to the intermediate switch 310 over the network. In such a configuration, flexible agent 320 may be implemented using a combination of custom hardware on the packet forwarding path and software instructions executed in the forwarding path or in a control processor of the switch.
Flexible system 304 also includes three directory services modules 322(1)-322(N). The number of directory services modules shown is not critical to a flexible system, and other implementations may employ fewer or more directory services modules (and/or other components shown). The functionality of the directory services modules is discussed in more detail below. Briefly, a directory services module may contain, among other information, a mapping of application addresses to location addresses (either or both of a forward or reverse mapping) that may be used by flexible agents 320(1)-320(N) (and/or other components) to facilitate communication over flexible system 304. In this case, directory services modules 322(1)-322(N) are associated with particular servers 316(1), 316(3), and 316(5). In other configurations, the directory services modules may reside with other components, such as a data center control server, a switch, and/or on a dedicated computing device.
Flexible system 304 can be viewed as comprising two logical groupings. The first logical grouping is a link state network that carries location or global addresses, as indicated at 326. The second logical grouping is a pool of interchangeable servers that own application addresses, as indicated at 328. In short, the components of the link state network 326 need not exchange information to track which server in the server pool 328 is currently using which application address. Also, from the server's perspective, a server may communicate with any other server in the server pool 328 via the application address of that other server. This process is facilitated by a flexible proxy, directory service, and/or other component in a manner that is transparent to the server. In other words, the process may be transparent to the applications running on the server, although other components on the server may be aware of the process.
The routers 308, intermediate switches 310, aggregation switches 312, TOR switches 314, and servers 316(1)-316(N) may be communicatively coupled, such as using layer-three technology. From the perspective of each individual server, communication with the other servers appears as layer-two communication (i.e., virtual layer two). However, inter-rack communications, such as from the source server 316(1) of server rack 318(1) to the destination server 316(3) of server rack 318(2), may actually occur over the layer-three infrastructure. For example, flexible proxy 320(1) may intercept the communication (i.e., a packet addressed to the application address of server 316(3)) and facilitate its transmission.
Flexible agent 320(1) may access one or more of directory services modules 322(1)-322(N) to obtain a mapping of the application address to a location address associated with server 316(3). For example, the mapped location address may be that of TOR switch 314(2). The flexible proxy may use the location address to encapsulate the packet. The flexible proxy may then select an individual (or a group of) aggregation and/or intermediate switches through which to send, or bounce, the encapsulated packet. The features of this selection process are described in more detail below. Upon receiving the encapsulated packet at TOR switch 314(2), the TOR switch may decapsulate the packet and send it to server 316(3). In an alternative embodiment, the location address may be associated with server 316(3) itself or with a virtual machine running on server 316(3), and the packet may be decapsulated on the destination server itself. In these embodiments, the location address assigned to the server or virtual machine may be hidden from the applications operating on the server; the application addresses are the addresses used by other hosts on the LAN to communicate with the applications, preserving for those applications the illusion that they are connected by a LAN.
In an alternative embodiment, the packet may be decapsulated by other components after crossing the layer-three/layer-two boundary. Examples of components that may perform this decapsulation include, for example, a hypervisor and/or the root partition of a virtual machine monitor.
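For purposes of illustration only, the encapsulation flow just described can be sketched in a few lines of Python. This is a minimal sketch under assumed conditions, not the claimed implementation; the address values, the DIRECTORY table, and the helper names are hypothetical.

```python
import random

# Hypothetical AA -> LA table, as a flexible directory service might return it.
DIRECTORY = {"10.128.0.3": "192.0.2.2"}  # AA of server 316(3) -> LA of TOR 314(2)
INTERMEDIATE_LAS = ["192.0.2.10", "192.0.2.11"]  # LAs of intermediate switches 310

def agent_encapsulate(inner_packet, dst_aa):
    """What a flexible proxy such as 320(1) might do with an intercepted packet."""
    tor_la = DIRECTORY[dst_aa]                   # AA -> LA lookup via directory service
    bounce_la = random.choice(INTERMEDIATE_LAS)  # random intermediate switch (VLB bounce)
    # Outer headers: bounce off the intermediate switch, then reach the destination TOR.
    return {"outer_dst": bounce_la, "mid_dst": tor_la, "payload": inner_packet}

def tor_decapsulate(encapsulated):
    """The destination TOR strips the outer headers and delivers on the inner AA."""
    return encapsulated["payload"]

pkt = {"src_aa": "10.128.0.1", "dst_aa": "10.128.0.3", "data": b"hello"}
delivered = tor_decapsulate(agent_encapsulate(pkt, pkt["dst_aa"]))
assert delivered["dst_aa"] == "10.128.0.3"
```

In a real deployment the dictionary lookup would be a query to a directory services module, and the outer headers would be actual IP-in-IP encapsulation rather than Python objects.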
This configuration may allow a large number of servers to be added to the server pool 328 while, from each server's perspective, the other servers appear to be on the same subnet. Alternatively or additionally, components of link state network 326 need not know the server application addresses. Furthermore, the directory service can simply be updated whenever address information changes, such as when a server is added or removed, rather than having to update multiple different types of components.
In summary, layer-two semantics may be associated with implementing a network state that supports flat addresses, where any IP address may be assigned to any server connected to any network port, as if the servers were on a LAN. Moreover, components (i.e., switches) in the link state network 326 may be aware of other components within the link state network and need not be aware of components of the server pool 328. Furthermore, TOR switches may be aware of servers in their respective racks, and need not be aware of servers of other racks. Further, the flexible agent may intercept a server Application Address (AA) packet and identify a Location Address (LA) associated with the destination computing device of the AA. The flexible proxy may then select an individual switch (or group of switches) through which to send packets to the LA. In this case, the individual switch may be any one or more of the available switches.
This configuration also facilitates other server features related to the service. For example, data center management software, such as may be included in directory services modules 322(1)-322(N), may assign any server 316(1)-316(N) to any service and configure that server with any IP address desired by that service. The network configuration of each server may be equivalent to that when connected via a LAN, and may support features such as link-local broadcast. The goal of communication isolation between services may be associated with providing a simple and consistent Application Programming Interface (API) for defining services and communication groups. In this regard, a directory service may define a group of servers associated with a service (e.g., a customer). Full connectivity between servers in one group may be allowed, and policies such as Access Control Lists (ACLs) may be specified for governing which servers in different groups should be allowed to communicate.
The above configuration is also applicable to traffic management. For purposes of explanation, assume that a first customer has paid a relatively high rate for the services performed by the servers of the flexible system 304 and accordingly obtained a relatively high quality-of-service agreement. Further, assume that a second customer has paid a relatively lower rate and received a correspondingly lower quality-of-service agreement. In this case, a relatively high percentage or all of the intermediate switches 310(1)-310(N) may be allocated to handle traffic for the first customer, while a smaller number of switches may be allocated to the second customer. In other words, a first subset of switches may be assigned to the first customer and a second subset of switches may be assigned to the second customer. The first and second subsets may be mutually exclusive or may overlap. For example, in some implementations, individual switches may be dedicated to a particular customer or assigned to multiple customers. For instance, intermediate switch 310(1) may be assigned to both customers, while intermediate switches 310(2) and 310(N) may be assigned exclusively to the first customer.
In summary, and as will be explained in more detail below, the flexible network architecture 300 may be associated with one or more of the following goals: uniform high capacity between servers, performance isolation between services, Ethernet layer-two semantics, and/or communication isolation between services. The goal of uniform high capacity among servers may be associated with achieving a network state in which the rate of traffic flow in the network is largely unrestricted, except by the available capacity of the network interface cards of the sending and receiving servers. As such, from the developer's perspective, by achieving this goal, the network topology may no longer be a major concern when adding servers to a service. The goal of performance isolation between services may be associated with achieving a network state in which the traffic of one service is not affected by the traffic handled by any other service, as if each service were connected by a separate physical switch. The goal of Ethernet layer-two semantics can be associated with implementing a network state that supports flat addresses, where virtually any IP address can be assigned to any server connected to any network port, as if the servers were on a LAN. In this manner, the data center management software can assign any server to any service and configure the server with any IP address desired by the service.
The network configuration of each server may be equivalent to that when connected via a LAN, and may support features such as link-local broadcast. The goal of communication isolation between services may be associated with providing a simple and consistent API for defining services and communication groups. In this regard, a directory system may be provided (i.e., via, for example, directory services modules 322(1) -322(N)) that defines a server group. Full connectivity between servers in one group may be allowed, and policies may be specified for governing which servers in different groups should be allowed to communicate.
By utilizing the described flexible network architecture, a data center network may be provided that is associated with one or more of the following network features: (1) a flat address that allows service instances to be placed anywhere in the network, (2) load balancing (e.g., Valiant Load Balancing (VLB)) that uses randomization to spread traffic evenly across network paths, and (3) a new end-system-based address resolution service that enables layer two ethernet semantics while scaling to a large server pool.
To achieve the foregoing goals, in at least some embodiments, one or more of the following flexible network architecture design principles may be employed:
Utilizing topology with extended path diversity
By utilizing a "mesh" topology, multiple paths may be provided between individual groups of servers. For example, communication between the servers of the server rack 318(1) and the servers of the server rack 318(N) may pass from the TOR switch 314(1) through any of the aggregation switches 312(1) -312(2) to any of the intermediate switches 310(1) -310 (N). The communication may arrive at TOR switch 314(N) from the intermediate switch through any of aggregation switches 312(2) -312 (N).
This configuration may provide several benefits. For example, the presence of multiple paths may allow for the reduction and/or elimination of congestion of the network without requiring explicit traffic engineering or adjustment of parameters. Furthermore, the multiple paths allow for "scaling" of the network design. In other words, more capacity can be added by adding more low cost switches. In contrast, conventional hierarchical network designs concentrate traffic on one or very few links at the higher levels of the hierarchy. As a result, conventional networks may require the purchase of expensive "large" switches to handle high traffic densities.
Also, by utilizing a "mesh" topology, multiple paths may allow for graceful degradation in the event of a link or switch failure. For example, a flexible network implemented according to the flexible network architecture having "n" switches at a given layer may lose only 1/n of its capacity upon a switch failure, as compared to a traditional network that may lose 50% of its capacity. A flexible network implemented according to the flexible network architecture can potentially utilize the entire complete bipartite topology.
Randomization for processing volatility
Data centers may have a large amount of volatility in their workload, their traffic, and their failure modes. Accordingly, a large resource pool may be created, over which work can then be randomly spread; some best-case performance may be sacrificed to improve the worst case toward the average case. In at least some implementations, a topology associated with extended path diversity may be utilized (e.g., as demonstrated in FIG. 3). A workflow may be routed across the topology using load balancing techniques, such as Valiant Load Balancing (VLB) techniques. Briefly, VLB techniques may involve randomly selecting one or more paths for carrying a data transmission, where a path is made up of a series of links and/or switches. The path may then be reselected, where the reselection changes one or more of the switches or links that make up the original path. The reselection may be made periodically, such as after sending/receiving a specified number of bytes/packets, and/or in response to an indication of transmission problems associated with the selected path, switch, or link. For example, if packet delays or other communication impairments are detected, the selection process may be repeated. By applying the present principles, the goals of uniform capacity and performance isolation can be met.
More specifically, to address volatility and uncertainty in the data center traffic matrix, load balancing techniques (e.g., VLBs) may be utilized to randomly hash streams across network paths. One goal of this approach may be to provide bandwidth guarantees for any traffic change subject to network ingress-egress constraints as in the hose traffic model. Briefly, the hose model specifies: the data transfer rate over a given path cannot exceed the slowest or most constrained portion of the path.
Using load balancing techniques like VLB at flow granularity, meaning that most packets of a flow (aside from path reselections) follow the same path through the network, can be advantageous: it reduces the chance that packets of the flow are reordered or experience rapidly changing latencies at the destination, and/or that the operation of the path Maximum Transmission Unit (MTU) discovery protocol is disrupted by differences in MTUs within the flow. Some types of traffic (e.g., those not harmed by packet reordering) and some environments (e.g., those with very uniform delay along all paths) may prefer to use load balancing like VLB at packet granularity (meaning that a potentially different path is used for each packet in a sequence of packets). Any commonly accepted definition of a flow may be used, such as: an IP 5-tuple flow, an IP 2-tuple flow, or a set of packets between two subnets or address ranges.
In the context of providing a flexible data center network, ingress-egress constraints may correspond to server line card speeds. In conjunction with a high-bisection-bandwidth topology (e.g., a folded Clos topology), the load balancing technique can be utilized to create a non-interfering packet-switched network (the counterpart of a non-blocking circuit-switched network) and provide hotspot-free performance for traffic patterns that do not have a continuous load exceeding server ingress-egress port speed. In this regard, in some implementations, the end-to-end congestion control mechanism of the Transmission Control Protocol (TCP) may be used to implement the hose model and avoid over-running the server port speed. This principle may lead to the logical topology shown in FIG. 3, which may include three different switch layers: TOR 314, aggregation 312, and intermediate 310. Flows from one server to another may take random paths through random intermediate switches, across TORs and aggregation switches. Load balancing techniques such as VLB can be used in the context of an inter-switch fabric of a data center to smooth utilization while eliminating persistent traffic congestion.
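As an illustration of flow-granularity load balancing, the sketch below hashes an IP 5-tuple onto a path so that all packets of a flow follow the same path until a reselection occurs. The path names and the use of MD5 as the hash are illustrative assumptions only.

```python
import hashlib

def flow_path(five_tuple, paths, salt=0):
    """Hash an IP 5-tuple onto one of the available paths.

    All packets of a flow hash to the same path (avoiding reordering);
    changing `salt` models a path reselection, e.g., after congestion.
    """
    digest = int(hashlib.md5(repr((five_tuple, salt)).encode()).hexdigest(), 16)
    return paths[digest % len(paths)]

paths = ["via-310(1)", "via-310(2)", "via-310(N)"]
flow = ("10.128.0.1", "10.128.0.3", 6, 51000, 80)  # src, dst, proto, sport, dport
assert flow_path(flow, paths) == flow_path(flow, paths)  # stable per flow
```

Per-packet granularity, mentioned above as an option for reordering-tolerant traffic, would amount to varying the salt on every packet.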
Separating names from locations
Separating the name from the location may create degrees of freedom that can be used to implement new features. This principle can be exploited to allow flexibility in data center networks and to improve utilization by reducing the fragmentation that can result from bindings between addresses and locations. By applying this principle and the principle of embracing end systems described below, the layer-two semantics objective can be met. In this manner, developers may be allowed to assign IP addresses regardless of network topology and without the need to reconfigure their applications or network switches.
To increase network flexibility (support of any service on any server, dynamic growth and shrinkage of server pools, and workload migration), an IP addressing scheme may be used that separates names (called AAs) from locators (called LAs). A flexible directory service, such as may be represented by directory services modules 322(1)-322(N), may be defined to manage the mapping between AAs and LAs in a scalable and reliable manner. The flexible directory service may be invoked by shim layers running in the network stack on individual servers. In the implementation shown in FIG. 3, the shim layers appear as flexible agents 320(1)-320(N).
Embracing end systems
Software, including operating systems, on data center servers is typically extensively modified for use within data centers. For example, new or modified software may create a hypervisor for virtualization or a blob (binary large object) file system to store data across servers. The programmability of this software can be utilized rather than modifying the software on the switches. Furthermore, changes to the hardware of the switches or servers may be avoided or limited, and legacy applications may remain unchanged. By using software on each server to work within the limitations of currently available low-cost switch Application Specific Integrated Circuits (ASICs), designs can be created that can be built and deployed today. For example, by intercepting ARP requests on servers and converting them into lookup requests to a directory system, rather than attempting to control ARP via software or hardware changes on switches, scalability issues presented by broadcast Address Resolution Protocol (ARP) packets may be reduced and/or eliminated.
FIG. 4 illustrates an example flexible agent 320(1) in more detail. In this case, flexible agent 320(1) operates on a server machine 402 that includes a user mode 406 and a kernel mode 408. The server machine includes a user mode agent 410 in the user mode. The kernel mode includes a TCP component 412, an IP component 414, an encapsulator 416, a NIC 418, and a routing information cache 420. The server machine may include and/or communicate with directory service 322(1). The directory service can include a server role component 422, a server health component 424, and a network health component 426. Flexible proxy 320(1) may include the user mode agent 410, the encapsulator 416, and the routing information cache 420. Encapsulator 416 can intercept ARP requests and send them to user mode agent 410. The user mode agent may query directory service 322(1). It should be understood that other arrangements of the blocks are possible, such as including the user mode agent in a kernel mode component, or invoking directory lookup via mechanisms other than ARP, such as in a routing table lookup process or via mechanisms such as IP tables or IP chains.
In the flexible network architecture of fig. 3, the end system controls may provide a mechanism to quickly inject new functionality. In this manner, the flexible agent may provide fine-grained path control by controlling the randomization used in load balancing. Further, to achieve separation of name and location, the flexible proxy can replace the ARP function of the Ethernet with a query to the flexible directory service. The flexible directory service itself may be implemented on a server rather than a switch. The flexible directory service allows fine-grained control over server reachability, grouping, access control, resource allocation (e.g., capacity of intermediate switches), isolation (e.g., non-overlapping intermediate switches), and dynamic growth and shrinkage.
Using network technology
Utilizing one or more network technologies with robust implementations in network switches may simplify the design of flexible networks and increase the willingness of operators to deploy such networks. For example, in at least some implementations, a link state routing protocol can be implemented on network switches to hide certain failures from servers, and can also be used to help reduce the load on the flexible directory service. These protocols can be used to maintain the topology and routing of the flexible network, which can reduce coupling between the flexible directory service and the network control plane. By defining a routing design for anycast addresses on the switches, the flexible architecture can utilize Equal Cost Multipath (ECMP) to hide switch failures from servers. This may further reduce the load on the directory system. Other routing protocols that support the use of multiple paths are also suitable.
Implementation details regarding virtual layer-two networking examples
Laterally extended topology
Conventional networks typically concentrate traffic in a few switches at the highest level of the network. This both constrains the bisection bandwidth to the capacity of these devices and severely impacts the network when they fail. To avoid these problems, a flexible network topology driven by the principle of using randomization to handle traffic volatility can be utilized. In this regard, a method of laterally expanding network devices may be employed. This may lead to a relatively wide network of low-complexity switches dedicated to fast forwarding, as shown in FIG. 3. This is an example of a folded Clos network, in which the links between the intermediate switches 310(1)-310(N) and the aggregation switches 312(1)-312(N) may form a complete bipartite graph. As in a conventional topology, the TORs may be connected to two aggregation switches. However, the large number of paths between any two aggregation switches means that if there are n intermediate switches, the failure of any one of these switches reduces the bisection bandwidth by only 1/n, a desirable characteristic that can be referred to as graceful bandwidth degradation. Further, a network such as a Clos network may be designed such that oversubscription does not exist. For example, in FIG. 3, aggregation and intermediate switches with D interface ports each may be used. The switches may be connected such that the capacity between the switches at each layer is D x D/2 times the link capacity.
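A short worked example of the capacity arithmetic above may be helpful. The sketch below computes the D x D/2 inter-layer capacity stated in the text and the graceful 1/n degradation; the port count and link speed are assumed example values, not figures from this application.

```python
def folded_clos_interlayer_capacity(d_ports, link_gbps=10):
    """Inter-layer capacity of the folded Clos of FIG. 3, per the text above.

    With aggregation and intermediate switches of D ports each, the capacity
    between the layers is D * D/2 link capacities. The 10 Gb/s link speed is
    an assumed example value.
    """
    links = d_ports * d_ports // 2
    return links, links * link_gbps

links, gbps = folded_clos_interlayer_capacity(144)
print(f"{links} links, {gbps / 1000:.1f} Tb/s")  # 10368 links, 103.7 Tb/s

# Graceful degradation: losing 1 of n intermediate switches costs only 1/n.
n = 72  # e.g., D/2 intermediate switches for D = 144
print(f"capacity lost on one switch failure: {100 / n:.1f}%")  # ~1.4%
```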
Networks such as Clos networks may be particularly well suited for load balancing (e.g., VLB) because the network can provide bandwidth guarantees for potentially all possible traffic matrices subject to ingress-egress restrictions at server line cards, by bouncing traffic off the intermediate switches at the top level, or "spine," of the network. The routing may be simple and resilient (e.g., take a random path up to a random intermediate node and a random path down).
The flexible architecture may provide greater path control than is achieved with conventional network architectures. More specifically, the intermediate node may be partitioned, with traffic classes dedicated to different partitions to allocate higher total bandwidth to some traffic classes. The congestion indication may be sent back to the sender by an Explicit Congestion Notification (ECN) or similar mechanism, such as in Institute of Electrical and Electronics Engineers (IEEE)802.1Qau congestion control. As such, the sender of the accumulated ECN signal may respond by changing the fields in the source packet used to select an alternate path through the network (referred to above as a reselection path).
Flexible routing
To implement the principle of separating names from locators, a flexible network may use two IP address families. FIG. 3 illustrates such a separation. The network infrastructure may operate in accordance with LAs. Switches and their interfaces (310(1)-310(N), 312(1)-312(N), and 314(1)-314(N)) may be assigned LAs. The switches may run a link state IP routing protocol that carries the LAs.
Applications running on servers 316(1)-316(N) may not be aware of the LAs, but only of the AAs. This separation may be associated with several benefits. First, packets can be tunneled to the appropriate LA rather than sent directly to the AA (so switches need not maintain a routing entry for each host in order to pass packets on). This means that a flexible directory service that converts AAs to LAs can implement policies about which services should be allowed to communicate. Second, low-cost switches often have small routing tables (e.g., 12K entries) that can hold all LA routes but would be overwhelmed by the number of AAs. This concept may be particularly valuable because it may allow a network to be constructed that is larger than the number of routing entries a switch can hold. Third, this separation allows flexibility, as any AA can be assigned to any server regardless of topology. Fourth, the freedom to allocate LAs separately from AAs means that LAs can be allocated in such a way that they can be summarized in a topologically significant way, further limiting the amount of routing state the switches must carry while not interfering with the ability to allocate application addresses in any way required by the services within the data center or by the operator of the data center.
Alternate embodiments of the present invention may use other types of data as the LA and AA addresses. For example, the LA addresses may be IPv4 and the AA addresses may be IPv6, or vice versa; IPv6 addresses may be used for both AA and LA addresses; or IEEE 802.1 MAC addresses may be used as AA addresses and IP addresses (v4 or v6) as LA addresses, or vice versa; and so forth. Addresses may also be created by combining different types of addresses, such as a VLAN tag or VRF identifier with an IP address.
The following discussion explains how the topology, routing design, flexible proxies, and flexible directory service can be combined to virtualize the underlying network structure and create, for the servers 316(1)-316(N) of the flexible network, the illusion that they and the other servers 316(1)-316(N) of their set are connected by a single layer-two LAN, one that can span a relatively large data center.
Address resolution and packet forwarding
In at least some implementations, the solution described below is provided so that servers 316(1)-316(N) believe they share a single large VLAN with other servers in the same service, while eliminating the broadcast ARP scaling bottleneck that can plague large Ethernet networks. Initially, it should be noted that the following solution is backward compatible and transparent to existing data center applications.
Packet forwarding
The AAs are typically not announced into the routing protocols of the network. Accordingly, in order for a server to receive a packet, the source of the packet may first encapsulate the packet, setting the destination of the outer header to the LA of the destination host. Upon reaching the device holding that LA, the packet is decapsulated and delivered to the destination server. In one embodiment, the LA of a destination server is that assigned to the TOR under which the destination server is located. Once the packet reaches its destination TOR, the TOR switch may decapsulate the packet and deliver it based on the destination AA in the inner header, according to standard layer-two delivery rules. Alternatively, the LA may be associated with the physical destination server or a virtual machine running on the server.
Address resolution
The server may be configured to believe that the destination AA is in the same LAN as itself, so when an application first sends a packet to the AA, the kernel network stack on that host may generate a broadcast ARP request for the destination AA. The flexible proxy running in the networking stack of the source host can intercept the ARP request and convert it into a unicast query to the flexible directory service. When the flexible directory service answers the query, it may provide the LA to which the packet should be tunneled. It may also provide an intermediate switch or group of intermediate switches that can be used to bounce the packet.
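The ARP interception just described might be sketched as follows. This is a hedged illustration: `directory_lookup` stands in for the unicast query to the flexible directory service, and the shape of its reply (a tunnel LA plus one or more anycast addresses of intermediate switches) is an assumption for the example.

```python
def handle_arp_request(dst_aa, directory_lookup):
    """Convert an intercepted broadcast ARP request into a unicast directory query."""
    reply = directory_lookup(dst_aa)         # unicast query, not a broadcast
    if reply is None:
        return None                          # unknown destination: no ARP answer
    return {"tunnel_la": reply["tor_la"],    # LA to tunnel the packet to
            "bounce_las": reply["anycast"]}  # intermediate switch(es) for the bounce

# Stand-in for the directory service, with hypothetical addresses.
fake_directory = lambda aa: {"tor_la": "192.0.2.2", "anycast": ["10.0.0.5"]}
print(handle_arp_request("10.128.0.3", fake_directory))
```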
Inter-service access control for directory services
Servers cannot send packets to an AA if they cannot obtain the LA of the TOR to which their packets must be tunneled. Accordingly, flexible directory services 322(1)-322(N) may enforce communication policies. When processing a lookup request, the flexible directory service knows which server is making the request, the services to which both the source and destination belong, and the isolation policy between those services. If the policy is "deny," the flexible directory service may simply refuse to provide the LA. One advantage of the flexible network architecture is that, when inter-server communication is enabled, packets can flow directly from the sending server to the receiving server without detouring through an IP gateway, unlike the connection between two VLANs in a conventional architecture.
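A minimal sketch of this directory-side policy check follows; the service names, the policy table, and the deny-by-default choice are hypothetical examples, not requirements of the architecture.

```python
# Hypothetical service membership and isolation policy tables.
SERVICE_OF = {"10.128.0.1": "web", "10.128.0.3": "web", "10.128.0.9": "billing"}
POLICY = {("web", "web"): "allow", ("web", "billing"): "deny"}

def directory_lookup(requester_aa, dst_aa, mappings):
    """Return the destination's LA only if the isolation policy permits it."""
    src_svc, dst_svc = SERVICE_OF[requester_aa], SERVICE_OF[dst_aa]
    if POLICY.get((src_svc, dst_svc), "deny") == "deny":
        return None  # withholding the LA enforces the communication policy
    return mappings[dst_aa]

mappings = {"10.128.0.3": "192.0.2.2", "10.128.0.9": "192.0.2.7"}
assert directory_lookup("10.128.0.1", "10.128.0.3", mappings) == "192.0.2.2"
assert directory_lookup("10.128.0.1", "10.128.0.9", mappings) is None
```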
Interaction with the Internet
Often, approximately 20% of the traffic handled by a data center is to or from the Internet, so it is advantageous for a data center network to be able to handle these large volumes. Although it may initially seem strange that the flexible network architecture uses a layer-three infrastructure to implement a virtual layer-two network, one advantage of this architecture is that external traffic can flow directly across the high-speed silicon of the switches that make up the flexible data center network, without being forced to traverse gateway servers to have its headers rewritten, as required in some conventional and proposed network environments.
A server that needs to be directly reachable from the Internet (e.g., a front-end web server) can be assigned two addresses: an LA and an AA. The LA may be used for Internet communication. The AA may be used for intra-data-center communication with back-end servers. The LAs may be drawn from a pool that is advertised via Border Gateway Protocol (BGP) and is externally reachable. Traffic from the Internet can then reach the server directly. Packets from the server to an external destination may be routed toward the core routers while being spread across the available links and core routers by ECMP.
Processing broadcasts
The flexible network architecture may provide layer-two semantics to applications for backward compatibility. This may include support for broadcast and multicast. The flexible network architecture's approach is to eliminate the most common sources of broadcast entirely, such as ARP and Dynamic Host Configuration Protocol (DHCP). ARP can be handled by intercepting ARP packets in the flexible proxy 320 and providing a response after consulting information from the flexible directory service, as described above, while DHCP packets can be intercepted at the TOR using a conventional DHCP relay agent and unicast-forwarded to the DHCP server. To handle other broadcast packets, each group of hosts that should be able to receive broadcast packets sent by other hosts in the group may be assigned an IP broadcast address. This address can be assigned by the directory system, and the flexible agent can learn it by querying the directory system.
Packets sent to a broadcast address may be modified to instead be destined for the broadcast address of the service. The flexible proxy of the flexible network architecture may rate-limit broadcast traffic to prevent storms. The flexible proxy may maintain an estimate of the rate at which broadcast packets have been sent by the server during the most recent time intervals (e.g., the past 1 second and the past 60 seconds) and prevent the server from sending more than a configured number of broadcast packets during each time interval. Packets sent in excess of what is allowed may be either dropped or delayed until the next interval. Native IP multicast may also be supported.
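The broadcast rate limit described above can be sketched as a simple per-interval counter. The limit and interval values below are illustrative assumptions; the text contemplates several simultaneous intervals (e.g., 1 s and 60 s), whereas this sketch tracks a single interval for brevity.

```python
import time

class BroadcastLimiter:
    """Per-server broadcast rate limit: at most `limit` packets per `interval` seconds."""
    def __init__(self, limit=100, interval=1.0):
        self.limit, self.interval = limit, interval
        self.window_start, self.count = time.monotonic(), 0

    def allow(self):
        now = time.monotonic()
        if now - self.window_start >= self.interval:  # new interval: reset the count
            self.window_start, self.count = now, 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False  # excess packet: drop, or delay until the next interval

limiter = BroadcastLimiter(limit=2)
print([limiter.allow() for _ in range(3)])  # [True, True, False]
```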
One potential advantage of embodiments in which the switches operate as layer-three routers is that it is particularly easy to implement the delivery of packets addressed to a multicast group to all hosts or machines belonging to the multicast group. Any of the existing IP multicast routing protocols, such as PIM-BIDIR, may be configured onto the switches. This will cause them to compute a multicast distribution tree with an endpoint at each host or machine belonging to the multicast group. The flexible proxy on the host, machine, or server registers it as part of the appropriate multicast group, typically by sending an IGMP join message to its default gateway. The multicast routing protocol is then responsible for adding the host, machine, or server to the distribution tree for the multicast group. Switches operating at layer two may use various mechanisms, such as one VLAN per multicast group, or flooding packets through the network, where the flexible agent on each host, machine, or server filters out packets that its host, machine, or server should not receive.
Randomization with multipath routing
The flexible network architecture can employ the principle of using randomization to handle volatility using, at least in some embodiments, two related mechanisms: VLB and Equal Cost Multipath (ECMP). The goals of the two are similar: VLB randomly distributes traffic across intermediate nodes, and ECMP sends traffic across equal-cost paths, in both cases to reduce or avoid persistent congestion. As explained in more detail below, VLB and ECMP can be complementary in that each can be used to overcome the limitations of the other. Both mechanisms provide controls that a packet sender can use to influence the selection of a path across the network. The flexible proxy allows these controls to be utilized to avoid congestion.
FIG. 5 illustrates a subset of the flexible network architecture 300 introduced in FIG. 3, providing more detail of server-to-server communication. This example involves server 316(1) communicating with server 316(5). The sending server 316(1) and the destination server 316(5) operate in a server pool 328 that acts as a VLAN and has an application address space of 10.128/9. Intermediate switches 310(1)-310(N) reside on link state network 326.
Flexible network architecture 300 may allow the benefits of VLB to be realized by forcing packets to bounce off a randomly selected intermediate node. In this case, the sender's flexible proxy 320(1) may accomplish this by encapsulating each packet to one of the intermediate switches 310(1)-310(N); the intermediate switch in turn tunnels the packet to the destination's TOR (314(N) in this case). Thus the packet is first delivered to one of the intermediate switches (such as 310(2)), decapsulated by that switch, delivered to the LA of TOR 314(N), decapsulated again, and finally sent to destination server 316(5).
If flexible agent 320(1) knew the addresses of the active intermediate switches 310(1)-310(N), it could select among them at random when sending packets. However, this would require updating perhaps hundreds of thousands of flexible agents upon the failure of an intermediate switch. Instead, the same LA address may be assigned to multiple intermediate switches (LA address 10.0.0.5 in this case). The flexible directory service (shown in FIG. 3) may return this anycast address to flexible agent 320(1) as part of one or more lookup results. ECMP may be responsible for delivering packets encapsulated to the anycast address to one of the active intermediate switches 310(1)-310(N). If a switch fails, ECMP can react, eliminating the need to notify the flexible agents.
ECMP, however, may have scaling limitations. Today's conventional switches can support 16-way ECMP, while 256-way ECMP switches are also available or soon to be available. VLB encapsulation can compensate if there happen to be more paths available than ECMP can use. One solution is to define several anycast addresses, each individual anycast address being associated with as many intermediate switches 310(1)-310(N) as ECMP can accommodate. Senders may hash across the anycast addresses to distribute load, and when a switch fails, its anycast addresses can be reassigned by the directory system to other switches so that individual servers need not be notified. For purposes of explanation, this aspect may be considered a network control function provided by the directory system.
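The hashing of senders across several anycast addresses may be sketched as follows; the addresses and the choice of MD5 as a hash are illustrative assumptions.

```python
import hashlib

# Hypothetical anycast LAs, each ECMP-routed to a group of intermediate switches
# no larger than the switch's ECMP fan-out (e.g., 16-way).
ANYCAST_LAS = ["10.0.0.5", "10.0.0.6", "10.0.0.7"]

def pick_anycast(flow_tuple):
    """Hash a flow across the anycast addresses to spread load.

    When a switch fails, the directory system remaps the affected anycast
    address to other switches; senders need not be notified.
    """
    digest = int(hashlib.md5(repr(flow_tuple).encode()).hexdigest(), 16)
    return ANYCAST_LAS[digest % len(ANYCAST_LAS)]

flow = ("10.128.0.1", "10.128.0.5", 6, 49152, 443)
assert pick_anycast(flow) == pick_anycast(flow)  # stable per flow
```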
VLB-based oblivious routing can be implemented using pure OSPF/ECMP mechanisms on a folded Clos network topology. This configuration does not require decapsulation support at the intermediate switches. For example, if N is the number of uplinks on each TOR, the aggregation switches may be grouped into sets. In some cases, each of these sets may contain exactly N switches. Each TOR may have uplinks to all N switches in a set, or no uplink to any of the switches in a set. With such a connection of TORs, it can be shown that the bandwidth guarantees for any traffic subject to server ingress/egress constraints continue to hold, even if protocols like OSPF and/or ECMP are used to route between TORs.
Using OSPF or ECMP for routing among TORs may cause some packets (such as packets between two TORs in the same aggregation switch set) to take a path that does not pass through an intermediate switch. These paths may be referred to as "early turnaround paths" because they follow the shortest path between the source and destination and allow early turnaround of traffic between servers under the same TOR or under TORs connected to the same aggregation switch set. These traffic flows need not enter the core aggregation/intermediate network.
Potential benefits of using early turnaround paths may include freeing up capacity in the core for other classes of traffic (e.g., external traffic). When existing applications have been written to minimize cross-TOR traffic, for example, the freed capacity in the "average" case is substantial. Viewed another way, this may allow the core to be somewhat under-provisioned and still work well for server-to-server traffic. Using early turnaround paths may also allow a wider range of devices to be used as intermediate switches, thereby making these switches less costly.
Handling congestion
With both ECMP and VLB, the possibility exists that large flows will be hashed to the same links and intermediate switches, respectively, which may lead to congestion. If this happens, the sending flexible agent may change the path its flow takes through the flexible network by changing the value of the field that ECMP uses to select the next hop (i.e., the next switch through which the packet should pass). In this regard, the flexible proxy may detect and handle such situations with a simple mechanism, such as re-hashing the large flow periodically, when TCP detects a severe congestion event (e.g., a full window loss) or an explicit congestion notification, or after sending/receiving a threshold number of bytes/packets.
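One way to sketch this re-hashing behavior is to mix a per-flow "reselection" value into the field that ECMP hashes, and to bump that value on a congestion signal. The class below is a hypothetical illustration, not the patented mechanism.

```python
class PathSelector:
    """Re-hash a large flow away from congestion, as described above."""
    def __init__(self):
        self.reselections = {}  # flow -> current reselection value

    def ecmp_field(self, flow):
        """Value placed in the header field that ECMP hashes for next-hop selection."""
        return hash((flow, self.reselections.get(flow, 0)))

    def on_congestion(self, flow):
        """Call on, e.g., a full-window TCP loss or an explicit congestion notification."""
        self.reselections[flow] = self.reselections.get(flow, 0) + 1

selector = PathSelector()
flow = ("10.128.0.1", "10.128.0.3", 6, 51000, 80)
before = selector.ecmp_field(flow)
selector.on_congestion(flow)
assert selector.ecmp_field(flow) != before  # the flow now hashes to a new path
```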
Maintaining host information
Network systems implemented according to the flexible network architecture may use a scalable, reliable, and/or efficient storage or directory system designed for data center workloads. A network implemented according to the flexible network architecture may possess one or more of the following four characteristics: uniform high capacity, performance isolation, second tier semantics, and inter-service communication isolation. The network may also exhibit graceful degradation in which the network may use any capacity left after failure. In this way, the network can be reliable/resilient in the face of failures. In this regard, the directory system in such a network may provide two potentially critical functions: (1) lookup and update of AA to LA mappings, and (2) a reactive cache update mechanism that can support latency-sensitive operations such as live virtual machine migration.
Characterization requirements
The lookup workload of a directory system may be frequent and bursty. A server may communicate with up to hundreds of thousands of other servers in a short period of time, with each flow generating a lookup for an AA-to-LA mapping. For updates, the workload may be driven by failures and server startup events. Most failures are typically small in size, while large correlated failures are rare.
Performance requirements
The burstiness of the workload means that lookups may require high throughput and low response time to quickly establish a large number of connections. Because lookups increase the time required to communicate with the server for the first time, the response time should be kept as small as possible. For example, several tens of milliseconds is a reasonable value. However, for updates, the potentially critical requirement is reliability, while response time is less critical. Furthermore, because updates are typically scheduled in advance, high throughput can be achieved through batch updates.
Consistency considerations
In a traditional layer-two network, ARP can provide eventual consistency due to ARP timeouts. Further, a host can announce its arrival by issuing a gratuitous ARP. As an extreme example, consider live Virtual Machine (VM) migration in a network implemented according to the flexible network architecture. VM migration requires fast updates of stale (AA-to-LA) mappings. A potential goal of VM migration may be to maintain ongoing communications across location changes. These considerations mean that weak or eventual consistency of the AA-to-LA mapping is acceptable, as long as a reliable update mechanism can be provided.
Flexible directory system or service design
The performance parameters and workload patterns of lookups may be significantly different from those of updates. As such, consider the two-layer flexible directory services architecture 600 illustrated in FIG. 6. In this case, flexible directory services architecture 600 includes flexible agents 602(1)-602(N), directory services modules 604(1)-604(N), and Replicated State Machine (RSM) servers 606(1)-606(N). In this particular example, the individual directory services modules are implemented on dedicated computers 608(1)-608(N), respectively. In other implementations, the directory services modules may be present on computers that perform other system functions. In this implementation, the number of directory services modules is generally modest relative to the overall system size. For example, for 100K servers (i.e., servers 316(1)-316(N) of FIG. 3), one implementation may utilize approximately 50-100 directory services modules. This range is provided for explanatory purposes and is not critical.
Directory services modules 604(1)-604(N) may be viewed as read-optimized, replicated directory servers that cache AA-to-LA mappings. Directory services modules 604(1)-604(N) may communicate with flexible agents 602(1)-602(N) and with a small number (e.g., approximately 5-10 servers) of write-optimized Replicated State Machine (RSM) servers 606(1)-606(N) capable of providing strongly consistent, reliable storage of AA-to-LA mappings.
The directory services modules 604(1)-604(N) may ensure low latency, high throughput, and high availability for a high lookup rate. Meanwhile, RSM servers 606(1)-606(N) may, at least in some embodiments, use a consensus algorithm such as Paxos for a modest update rate to ensure strong consistency and durability.
Each individual directory services module 604(1)-604(N) may cache the AA-to-LA mappings stored in RSM servers 606(1)-606(N) and may use the cached state to reply to lookups from flexible agents 602(1)-602(N). Because strong consistency may not be a requirement, a directory services module may lazily synchronize its local mappings with the RSM servers periodically (e.g., every 30 seconds). To achieve both high availability and low latency, a flexible agent may send lookups to a number k (e.g., 2) of randomly selected directory services modules 604(1)-604(N). If multiple replies are received, the flexible agent may simply select and store the fastest reply in its cache.
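A sketch of the k-way lookup described above follows. The directory modules are modeled as local callables; in practice they would be network endpoints, and the reply would carry an AA-to-LA mapping. The reply shape and module names are hypothetical.

```python
import concurrent.futures
import random

def lookup(aa, directory_modules, k=2):
    """Query k randomly chosen directory services modules; keep the fastest reply."""
    chosen = random.sample(directory_modules, k)
    with concurrent.futures.ThreadPoolExecutor(max_workers=k) as pool:
        futures = [pool.submit(module, aa) for module in chosen]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()  # fastest reply wins; cache this result

# Five stand-in modules that all know the same hypothetical AA-to-LA mapping.
modules = [lambda aa, i=i: {"la": "192.0.2.2", "from": f"module-{i}"} for i in range(5)]
print(lookup("10.128.0.3", modules))
```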
The directory services modules 604(1)-604(N) may also process updates from the network provisioning system. For consistency and durability, updates may be sent to a single randomly selected directory services module and may be written to RSM servers 606(1)-606(N). Specifically, upon an update, the directory services module may first forward the update to the RSM. The RSM may reliably replicate the update to the individual RSM servers and then reply with an acknowledgement to the directory services module, which may then forward the acknowledgement back to the original client.
As a potential optimization for enhancing consistency, directory services modules 604(1)-604(N) may optionally disseminate acknowledged updates to a small number of other directory services modules. If the originating client does not receive an acknowledgement within a timeout (e.g., 2 seconds), the client may send the same update to another directory services module, trading response time for reliability and/or availability.
Other embodiments of the directory system are possible. For example, a Distributed Hash Table (DHT) may be constructed from the directory servers, with AA-to-LA mappings stored as entries in the DHT. Other existing directory systems, such as Active Directory or lightweight directory systems, may also be used; however, their performance and consistency may not match the previously described embodiments.
Ensuring eventual consistency
Updates may result in inconsistencies because AA-to-LA mappings may be cached at the directory services modules and in the flexible agents' caches. To resolve these inconsistencies without wasting server and network resources, a reactive cache-update mechanism may be utilized that provides both scalability and performance. The cache-update protocol relies on a key observation: a stale host mapping needs to be corrected only when it is used to deliver traffic. In particular, when a stale mapping is used, some packets may arrive at a stale LA, i.e., at a TOR or server that no longer hosts the destination server. That TOR or server may forward such non-deliverable packets to a directory service module, triggering the directory service module to correct the stale mapping in the origin server's cache, e.g., via unicast. In another embodiment, the directory service may multicast the update to all groups of servers that are allowed to communicate with the affected servers.
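A minimal sketch of this reactive correction path follows; the class and attribute names (DirectoryServiceModule, origin_server, update_cache) are assumptions for illustration, not names from the described implementation.

```python
class DirectoryServiceModule:
    """Illustrative sketch of reactive cache correction."""

    def __init__(self, authoritative_mappings):
        self.mappings = authoritative_mappings   # AA -> current LA

    def handle_undeliverable(self, packet):
        """Called when a TOR or server forwards a packet that arrived
        at a stale LA. Only mappings actually used to carry traffic
        are ever corrected, so idle stale entries cost nothing."""
        fresh_la = self.mappings[packet.destination_aa]
        # Unicast a correction to the origin server whose cache is stale.
        packet.origin_server.update_cache(packet.destination_aa, fresh_la)
```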
Further implementations
Optimality of load balancing
As described above, load balancing techniques such as VLB use randomization to cope with volatility: some performance on best-case traffic patterns is sacrificed by turning all traffic patterns (best-case and worst-case alike) into the average case. This performance loss may manifest itself as higher utilization on certain links than would occur under a more optimized traffic engineering system. However, an evaluation of actual data center workloads shows that the simplicity and universality of load balancing techniques such as VLB come at a relatively small capacity cost compared to much more complex traffic engineering schemes.
Layout arrangement
Figs. 7-9 illustrate three possible layout configurations of a data center network implemented according to the flexible network architecture. In Figs. 7-9, the servers associated with the TORs are not shown due to the space constraints of the drawing page.
Fig. 7 shows an open floor plan data center layout 700. Data center layout 700 includes TORs 702(1)-702(N), aggregation switches 704(1)-704(N), and intermediate switches 706(1)-706(N). In Fig. 7, TORs 702(1)-702(N) are shown surrounding a central "network cage" 708 to which they may be connected (e.g., using copper cables and/or fiber optic cables, etc.). Aggregation and intermediate switches 704(1)-704(N), 706(1)-706(N) may be placed in close proximity within network cage 708, allowing them to be interconnected using copper cables (which are less costly than optical fibers, although thicker and with shorter reach). By bundling a number (e.g., 4) of 10G links into a single cable using an appropriate standard such as, for example, the quad small form-factor pluggable (QSFP) standard, the number of cables within the network cage may be reduced (e.g., by a factor of four) and their cost by about one-half.
In the open floor plan data center layout 700, intermediate switches 706(1)-706(N) are centrally arranged in the network cage 708, while aggregation switches 704(1)-704(N) are interposed between intermediate switches 706(1)-706(N) and TOR switches 702(1)-702(N) (and their associated servers).
The open floor plan data center layout 700 can be scaled as needed. For example, additional server racks may be created by associating computing devices in the form of servers with TORs 702(1)-702(N). A server rack may then be connected to aggregation switches 704(1)-704(N) of network cage 708. Likewise, server racks and/or individual servers can be removed without disrupting the services provided by the open floor plan data center layout.
Fig. 8 illustrates a modular container-based layout 800. Layout 800 includes TORs 802(1)-802(N), aggregation switches 804(1)-804(N), and intermediate switches 806(1)-806(N). In this case, intermediate switches 806(1)-806(N) are included in a data center infrastructure 808 of the layout. The aggregation switches and TOR switches may be associated as pluggable containers connected to the data center infrastructure. For example, aggregation switches 804(1) and 804(2) are associated with TOR switches 802(1) and 802(2) in a pluggable container 810(1) that is connectable to the data center infrastructure 808. Similarly, aggregation switches 804(3) and 804(4) are associated with TOR switches 802(3) and 802(4) in pluggable container 810(2), and aggregation switches 804(5) and 804(N) are associated with TOR switches 802(5) and 802(N) in pluggable container 810(N).
As with Fig. 7, in Fig. 8 the servers that are associated with the TORs to form server racks are not shown due to the space constraints of the drawing page. Also due to space constraints, each pluggable container shows only two aggregation switches and two TOR switches. Of course, other implementations may employ more or fewer of either or both of these components. Moreover, other implementations may employ more or fewer pluggable containers than the three shown here. One feature of interest is that the layout 800 may be adapted to bring one cable bundle 812 from each pluggable container 810(1)-810(N) to the data center spine (i.e., data center infrastructure 808). In summary, the data center infrastructure 808 may allow the layout 800 to expand or contract by adding or removing individual pluggable containers 810(1)-810(N).
FIG. 9 illustrates an "infrastructure-less" and "containerized" data center layout 900. The layout includes TORs 902(1)-902(N), aggregation switches 904(1)-904(N), and intermediate switches 906(1)-906(N) arranged in a plurality of containers 908(1)-908(N). For example, TORs 902(1)-902(2), aggregation switches 904(1)-904(2), and intermediate switch 906(1) are arranged into container 908(1).
The containers 908(1)-908(N) may allow the "infrastructure-less" and "containerized" data center layout 900 to be realized. In this layout 900, a cable bundle 910(1) may run between the individual container pair 908(1) and 908(3). Another cable bundle 910(2) may run between the individual container pair 908(2) and 908(N). An individual cable bundle, such as 910(1), may carry links connecting aggregation switches 904(1), 904(2) in container 908(1) to intermediate switch 906(3) in container 908(3), and vice versa.
In general, each individual container 908(1)-908(N) may include multiple switches. These switches may include TORs 902(1)-902(N), aggregation switches 904(1)-904(N), and intermediate switches 906(1)-906(N) arranged as complementary pluggable containers. Complementary pairs of pluggable containers may be coupled by connecting the aggregation switches of a first pluggable container to the intermediate switches of a second pluggable container via a cable bundle, and vice versa. For example, container 908(1) may be connected to container 908(3) via cable bundle 910(1). Specifically, the bundle may connect aggregation switches 904(1) and 904(2) of container 908(1) to intermediate switch 906(3) of container 908(3). Similarly, bundle 910(1) may connect aggregation switches 904(5) and 904(6) of container 908(3) to intermediate switch 906(1) of container 908(1).
In at least some implementations, the flexible network architecture can include the following components: (1) a set of switches connected together into a topology; (2) a set of servers, each connected to one or more of the switches; (3) a directory system to which requests are made when a server wishes to send packets to another server, and which responds with information that the server (or an agent acting on its behalf) uses to address or encapsulate the packets it wishes to send so that they can traverse the switch topology; (4) a mechanism for controlling congestion in the network that reduces or prevents utilization on any link from growing so high that packets are dropped by the switches sending into that link; and (5) a module on each server that communicates with the directory service, encapsulates, addresses, or decapsulates packets as needed, and participates in congestion control as needed.
In at least one embodiment, there is a flexible agent on each server that provides functions such as: (1) communicating with the flexible directory service to retrieve encapsulation information for forwarding packets to a destination, to register the server with the system, etc.; (2) making random selections among sets of alternatives (e.g., among intermediate switches) and caching the selections as needed; (3) encapsulating/decapsulating packets; and (4) detecting and responding to congestion indications from the network. Alternatively, in at least some embodiments, these functions may be distributed among the servers and switches of the network. For example, a default route may be used to direct a packet to a group of switches (such as the intermediate switches), with the functions listed above implemented, for each packet, on the intermediate switch that the packet traverses.
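The per-server variant of these functions might look roughly as follows; this is a minimal sketch, and the interface names (directory.lookup, the packet attributes, the dictionary-based encapsulation) are assumptions for illustration.

```python
import random

class FlexibleAgent:
    """Illustrative per-server flexible agent: resolves AAs through the
    directory service, caches replies, picks a random intermediate
    switch per destination, and encapsulates packets."""

    def __init__(self, directory, intermediate_switches):
        self.directory = directory
        self.intermediates = intermediate_switches
        self.cache = {}   # AA -> (LA of destination TOR, chosen intermediate)

    def encapsulate(self, packet):
        aa = packet.destination_aa
        if aa not in self.cache:
            tor_la = self.directory.lookup(aa)
            # Random choice among intermediates spreads flows (VLB-style).
            self.cache[aa] = (tor_la, random.choice(self.intermediates))
        tor_la, intermediate = self.cache[aa]
        # Outer headers: bounce off the intermediate, then to the TOR's LA.
        return {"outer_dst": intermediate, "inner_dst": tor_la,
                "payload": packet}

    def on_congestion(self, aa):
        # Drop the cached choice so the next packet re-randomizes its path.
        self.cache.pop(aa, None)
```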
In at least some embodiments, implementing the flexible network architecture described herein can include creating a network among a group of switches in a data center such that each switch in the network can send packets to any other switch in the network. The switches need not use the same type of address to direct packets among themselves as the servers use to communicate with other servers. For example, MAC addresses, IPv4 addresses, and/or IPv6 addresses may all be suitable.
In at least one embodiment of the flexible network, one approach is to configure each switch in a group of switches in a data center to have an IP address (IPv4 or IPv6) and to run one or more standard layer-three routing protocols, typical examples being Open Shortest Path First (OSPF), Intermediate System-Intermediate System (IS-IS), or Border Gateway Protocol (BGP). The benefit of such an embodiment is that the coupling between the network and the directory system is reduced: the control plane created by the routing protocols maintains the network's ability to forward packets between switches, so the directory system does not need to react to, and notify servers of, most changes in topology.
Alternatively or additionally, the directory system may monitor the topology of the network (e.g., monitor the health of switches and links) and change the encapsulation information it provides to servers as the topology changes. The directory system may also notify servers to which it previously sent responses that those responses are no longer valid. As noted above, the first embodiment avoids this tighter coupling between the network and the directory system, because its routing protocols keep the network forwarding packets between switches without directory involvement. In summary, by monitoring one or more parameters related to network performance, packet delivery delays may be reduced or avoided. These parameters may be indicative of a network event, such as a loss of communication over a particular path.
In one embodiment, the switches of the network are configured with IPv4 addresses drawn from the subnet of LA addresses. The switches are configured to run the OSPF routing protocol. The addresses of these switches are distributed among the switches by the OSPF protocol. An extension to OSPF, such as unnumbered interfaces, can be used to reduce the amount of information distributed by the OSPF protocol. The server-facing ports of each top-of-rack (TOR) switch are configured on the switch as part of a Virtual Local Area Network (VLAN). The subnet(s) that make up the AA space are configured on the switch as assigned to the server-facing VLAN. The addresses of the VLAN are not distributed into OSPF, and the VLAN is typically not trunked. Packets destined for a server are encapsulated to the TOR to which the server is connected. The TOR decapsulates the packets as it receives them and forwards them onto the server-facing VLAN based on the destination address of the server. The server then receives these packets as on a normal LAN.
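The encapsulation and decapsulation steps of this embodiment can be sketched as follows; the dictionary-based packet representation and the vlan_forward callback are purely illustrative assumptions.

```python
def encapsulate_to_tor(inner_packet, tor_la):
    """Wrap a packet addressed to a destination AA inside an outer
    packet addressed to the LA of the TOR the destination sits behind."""
    return {"dst": tor_la, "inner": inner_packet}

def tor_decapsulate(outer_packet, vlan_forward):
    """At the TOR: strip the outer LA header and forward the inner
    packet onto the server-facing VLAN using the destination's AA."""
    inner = outer_packet["inner"]
    vlan_forward(inner["dst_aa"], inner)
```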
In another embodiment, instead of configuring AA subnets onto the server-facing VLAN of a TOR switch, an LA subnet unique to each TOR is assigned to the server-facing VLAN. The LA subnet is distributed via OSPF. A server connected to the TOR is configured with at least two addresses: an LA address drawn from the LA subnet assigned to the server-facing VLAN of which the server is a part, and an AA address. Packets destined for a server are encapsulated to the LA that has been assigned to that server. A module on the server decapsulates the packets as they are received and, based on the AA address contained in each packet, delivers them locally to the virtual machine or process on the server that is their destination.
In another embodiment, the TOR switches may operate as layer-two switches and the aggregation layer switches may operate as layer-three switches. This design may allow the use of potentially cheaper layer-two switches as TOR switches (of which there are many), while layer-three functions are implemented in a relatively small number of aggregation layer switches. In this design, the decapsulation function may be performed at the layer-two switch, the layer-three switch, the destination server, or the destination virtual machine.
In any of these embodiments, additional addresses may be configured onto the switches or distributed via a routing protocol such as OSPF. These addresses will typically be topologically significant (i.e., LAs). They will typically be used to direct packets to infrastructure services, i.e., servers, switches, or network devices that provide so-called additional services. Examples of these services include load balancers (which may be hardware-based, like BigIP from F5, or software-based), source network address translators (S-NATs), servers that are part of the directory system, servers that provide DHCP service, and gateways to other networks, such as the Internet or other data centers.
In one embodiment, each switch may be configured as a route reflector client using the BGP protocol. Additional addresses may be distributed to the switches by configuring them onto the route reflectors and letting BGP distribute them to the switches. This embodiment has the benefit that adding or removing additional addresses does not trigger OSPF recalculations that could overload the routing processors of the switches.
In another embodiment, the mechanism for controlling congestion in the network is implemented on the servers themselves. A suitable mechanism is a Transmission Control Protocol (TCP)-like mechanism in which the traffic a server sends to a destination is limited by the server to the rate the network appears able to carry. Improvements to the use of protocols such as TCP are described below. In an alternative embodiment, quality-of-service mechanisms on the switches may be used for congestion control. Examples of such mechanisms include Weighted Fair Queuing (WFQ) and its derivatives, Random Early Detection (RED), RSVP, the eXplicit Control Protocol (XCP), and the Rate Control Protocol (RCP).
In at least one embodiment, a module on the server observes the packets being received from the flexible network and alters the sending or the encapsulation of packets based on information it obtains or infers from the received packets. The flexible agent may reduce congestion in the network by: (1) altering the sending of packets to reduce their sending rate, or (2) altering the encapsulation of packets so that they take different paths through the network, which may be accomplished by re-making any or all of the random choices among possible alternatives that were made when the encapsulation and addressing of the packets were first selected.
Examples of observations that a flexible agent may make, and reactions to them, include: (1) if the flexible agent detects a full-window loss of TCP packets, it re-randomizes the path through the network that the packets will take. This is particularly advantageous because it moves the flow onto a different (desirably non-congested) path at a moment when all packets previously sent on the flow are believed to have exited the network, so that changing the path will not cause reordered packets to be received at the destination. (2) The flexible agent may periodically re-randomize the path taken by the packets. (3) The flexible agent may compute the effective rate achieved by a flow and re-randomize if the rate is below a desired threshold. (4) The flexible agent may monitor Explicit Congestion Notification (ECN) markings in received packets and reduce the sending rate or re-randomize the path of packets to that destination. (5) The switches may execute logic to detect links that have entered or are about to enter a congested state (e.g., as in IEEE QCN, 802.1Qau) and send notifications to upstream switches and/or servers. Flexible agents receiving these indications may reduce the rate of their packets or re-randomize the packets' paths.
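These reactions can be condensed into a small dispatch routine. The event names and the agent/flow interfaces below are assumptions for illustration, not part of the described system.

```python
import random

def on_flow_event(agent, flow, event):
    """Illustrative dispatch over the congestion observations above."""
    if event == "full_window_loss":
        # All in-flight packets are believed drained, so repicking the
        # path cannot cause reordering at the destination.
        flow.intermediate = random.choice(agent.intermediates)
    elif event in ("ecn_marks", "congestion_notification"):
        # Either back off or move the flow to a different random path.
        flow.rate *= 0.5
        flow.intermediate = random.choice(agent.intermediates)
    elif event == "low_throughput":
        # Effective rate below the desired threshold: re-randomize.
        flow.intermediate = random.choice(agent.intermediates)
```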
One advantage of the described embodiments is that they allow live migration of virtual machines (VMs): a VM can be relocated from one server to another while retaining use of the same IP address. The directory system is simply updated during the move to direct packets destined for the VM's IP address to the server to which the VM has been relocated. The physical change in location need not interfere with ongoing communications.
In at least one embodiment, through non-uniform computation of split ratios, a portion of the capacity of the network may be reserved for, or preferentially allocated to, a set of services operating on the network, such that a preferred service has its packets spread over a larger or smaller number of paths, or over a set of paths disjoint from the paths used by another set of services. Multiple classes of preference or QoS may be created using this same technique.
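A non-uniform split can be realized with a weighted random choice over the intermediate switches. The weights and switch names below are illustrative only; in practice they would be computed from the capacity to be reserved for each service class.

```python
import random

def pick_intermediate(intermediates, weights):
    """Choose an intermediate switch with non-uniform split ratios,
    so a preferred service concentrates (or spreads) its flows."""
    return random.choices(intermediates, weights=weights, k=1)[0]

# Example: a preferred service sends 70% of its flows through two
# lightly shared switches and 30% through the rest.
switches = ["int1", "int2", "int3", "int4"]
choice = pick_intermediate(switches, weights=[0.35, 0.35, 0.15, 0.15])
```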
Example of the method
FIG. 10 illustrates a flow diagram of a flexible networking technique or method 1000 in accordance with at least some implementations of the present concepts. The order in which method 1000 is described is not intended to be construed as a limitation, and any number of the described blocks can be combined in any order to implement the method, or an alternate method. Furthermore, the method may be implemented in any suitable hardware, software, firmware, or combination thereof, such that a computing device can implement the method. In one case, the method is stored on a computer-readable storage medium as a set of instructions such that execution by a processor of a computing device causes the computing device to perform the method. In another case, the method is stored on a computer-readable storage medium of an ASIC for execution by the ASIC.
At block 1002, the method obtains encapsulation information for forwarding a packet to a destination.
At block 1004, the method selects a path through available hardware, such as a switch.
At block 1006, the method encapsulates the packet for delivery over the path.
At block 1008, the method monitors for indications of congestion. For example, the method may monitor parameters related to network performance, such as updates from TCP related to packet transmission rates and/or the loads on network components. The method may reselect the path and/or take other action when congestion is detected.
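Putting the four blocks together, a sketch of method 1000 might look like the following; the agent interface names are assumptions for illustration.

```python
def flexible_send(packet, agent):
    """Illustrative end-to-end pass through method 1000 of FIG. 10."""
    info = agent.directory.lookup(packet.destination_aa)   # block 1002
    path = agent.select_path()                             # block 1004
    wire_packet = agent.encapsulate(packet, info, path)    # block 1006
    agent.network.send(wire_packet)
    if agent.congestion_detected(path):                    # block 1008
        agent.reselect_path(path)
```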
Claims (12)
1. A method of providing a network, comprising:
providing a virtual layer two network (108) connecting first and second machines by assigning application addresses (104) to the first and second machines and assigning location addresses (206) to components of a layer three infrastructure (106), the providing comprising:
determining, from the directory service, whether the destination server is in a group of servers associated with the service;
if the destination server is in the server group, encapsulating, at the first machine, a virtual second layer packet in a third layer packet, wherein a separate location address of a separate component of the third layer infrastructure is specified to the third layer packet, and transmitting the third layer packet to the separate component of the third layer infrastructure, wherein the separate component decapsulates the encapsulated virtual second layer packet and transmits the decapsulated virtual second layer packet to the second machine; and
denying provision of the individual location address if the destination server is not in the server group.
2. The method of claim 1, further comprising using an early turnaround path between individual machines.
3. The method of claim 1, wherein the machine comprises a server or a virtual machine.
4. The method of claim 1, wherein providing a virtual tier-two network comprises providing a plurality of virtual tier-two networks.
5. The method of claim 1, further comprising encapsulating packets between a first machine and a second machine of the machines with a location address of a third tier component along a separate path of the third tier infrastructure between the first machine and the second machine.
6. The method of claim 1, further comprising randomly selecting a separate path of the third layer infrastructure between the first and second machines.
7. The method of claim 6, further comprising selecting the individual path using valiant load balancing.
8. The method of claim 6, further comprising reselecting the individual path periodically or in response to a network event.
9. A server (316(1)) comprising:
at least one processor for executing computer-readable instructions; and
a flexible agent (320) executable by the at least one processor and configured to:
receiving a virtual second layer packet for delivery to another server (316(N));
determining, from the directory service, whether the other server is in a group of servers associated with the service;
encapsulating the virtual second layer packet in a third layer packet if the other server is in the server group, wherein a separate location address of a separate component of a third layer infrastructure is specified to the third layer packet, and transmitting the third layer packet to the separate component of the third layer infrastructure, wherein the separate component decapsulates the encapsulated virtual second layer packet and transmits the decapsulated virtual second layer packet to a second machine; and
denying provision of the individual location address if the other server is not in the group of servers.
10. The server of claim 9, wherein the passing is via an intermediate switch, and the flexible agent is configured to randomly select the intermediate switch from a plurality of intermediate switches.
11. The server of claim 10, wherein the flexible agent is configured to select a path for the passing and to reselect a new path comprising a new intermediate switch selected from a plurality of intermediate switches upon receiving an indication of a communication impairment.
12. The server of claim 10, wherein the server is configured to support multiple virtual machines and wherein the flexible agent is configured to select a path for communicating packets between two virtual machines.
Applications Claiming Priority (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US18205709P | 2009-05-28 | 2009-05-28 | |
| US61/182,057 | 2009-05-28 | ||
| US12/578,608 US9497039B2 (en) | 2009-05-28 | 2009-10-14 | Agile data center network architecture |
| US12/578,608 | 2009-10-14 | ||
| PCT/US2010/036758 WO2010138937A2 (en) | 2009-05-28 | 2010-05-28 | Agile data center network architecture |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1175321A1 HK1175321A1 (en) | 2013-06-28 |
| HK1175321B true HK1175321B (en) | 2017-05-05 |