HK1250539B - Method and system for balancing storage data traffic in converged networks

Info

Publication number: HK1250539B
Application number: HK18109929.7A
Authority: HK
Inventors: J‧G‧汉科; C‧温克尔
Original assignee: Twitter, Inc.
Priority date: 2015-08-06
Filing date: 2016-07-01
Publication date: 2022-08-12

Description

Method and system for balancing storage data traffic in a converged network

Technical Field

The present invention relates to methods and systems in which computing devices coupled to a network access storage devices coupled to the network via adapters, and to devices for implementing such methods and systems. In some embodiments, the invention relates to balancing (e.g., in an attempt to optimize) storage data traffic in a system in which computing devices (servers) coupled to a network operate to access storage devices that are also coupled to the network via adapters.

Background Art

In the past, data centers typically implemented two completely separate network infrastructures: a data communications network (typically Ethernet-based) and a separate "storage" network for accessing storage devices. A typical storage network implements the conventional Fibre Channel protocol. The expressions "data communications network" and "data network" are used synonymously herein to denote a network of a class distinct from the "storage network" class, in the sense that a storage network is configured and used primarily to carry "storage data" traffic (where "storage data" denotes data retrieved from, or to be stored on, at least one storage device), while a data network is configured and used primarily to carry other data traffic (i.e., data that is not storage data).

However, implementing multiple network types (e.g., separate data and storage networks) undesirably increases the capital and operating costs of running a data center.

Recently, many data centers have begun investigating (and some have begun using) a single network that carries both storage data traffic and other (non-storage data) traffic. Such a single network will be referred to herein as a "converged network." An example of a converged network is an Ethernet-based network on which all traffic is sent between servers coupled to the network and storage devices coupled to the network (via adapters). Unfortunately, the two types of network traffic to be sent over a converged network (storage data traffic and other data traffic) have different characteristics.

To carry traffic other than storage data traffic, a data network (e.g., one implementing Ethernet with the Internet Protocol) can be (and therefore typically is) implemented as an unmanaged or minimally managed network. This makes it simple to add computers and other hardware to the data network and to remove them from it. For example, the DHCP protocol can typically provide a new device (without human intervention) with all the information it needs to operate on the data network.

However, network loops can cause serious problems in data networks (namely, the continuous forwarding of packets that should be discarded). For this reason, data networks often implement a protocol (e.g., the Spanning Tree Protocol) to ensure that only one path is known between any two devices on the data network. Redundant data paths are rarely established explicitly on data networks. Furthermore, traffic on data networks is relatively unpredictable, and applications are often written to tolerate whatever bandwidth is available on the data network.

In contrast, storage networks are typically managed networks. A network administrator typically assigns manually which computers may communicate with which storage devices on the storage network (i.e., there is usually no self-configuration). Little progress has been made in making network connections (in storage networks implemented separately from data networks) adaptive to changing conditions. Furthermore, to provide the high availability and fault tolerance typically required of low-level data storage, fully redundant paths typically exist between the storage devices (coupled to the storage network) and the computers.

Because of the differences between storage networks (and their storage data traffic) and data networks (and their non-storage data traffic), combining storage data traffic and other traffic in a converged network can lead to unbalanced network utilization, which can degrade the overall performance of applications in the data center. Typical embodiments of the present invention address this unbalanced utilization of a converged network, for example, to allow applications in the data center to approach the maximum performance available.

The following definitions apply throughout this specification, including in the claims:

"Storage device" means a device (e.g., a disk drive) configured to store and retrieve data. Typically, a storage device is accessed using a logical block address (LBA) and a number of blocks. A logical block is a fixed-size chunk of the total storage capacity (e.g., 512 or 4096 bytes). A traditional rotating disk drive is an example of a storage device;

"Server" means a computing device configured to access and use storage devices across a network (a converged network) in order to store and retrieve data (e.g., files and/or applications);

"Adapter" means a device configured to connect a storage device, or a storage system comprising two or more storage devices (e.g., a JBOD), to a network (e.g., a converged network). In typical embodiments of the present invention, each storage device is normally accessible to a server via two or more adapters, in order to provide fault-tolerant access to the data stored on the storage device;

"Interface" means a component of a server or adapter that connects the device (the server or adapter) to a network (e.g., a converged network). Examples of an interface are a physical device (i.e., a network interface controller (NIC)) and a software-defined wrapper of multiple NICs (as in link aggregation). In typical embodiments of the present invention, an interface is a hardware or software element that has its own Internet Protocol (IP) address in the converged network;

"Agent" means a software or hardware component or subsystem of a server (or adapter) that is configured to run on the server (or adapter) during its operation in order to exchange (or prepare to exchange) storage data traffic on a network (e.g., a converged network). In some embodiments of the present invention, not all servers and adapters on the converged network have agents. However, coupling non-participating servers and/or adapters (servers and/or adapters without agents) to the network may limit the degree of balancing that can be achieved (according to embodiments of the present invention); and

"Data path" means a path along which data is sent between a storage device and a server via an adapter, using one interface on each of the adapter and the server (i.e., a path from the storage device through an adapter interface and a server interface to the server, or a path from the server through a server interface and an adapter interface to the storage device). In an IP network, a data path can typically be denoted by the combination of the IP address of the server interface and the IP address of the adapter interface and, optionally, the port number to be used at the adapter. In the case of link aggregation, however, the full path also depends on which actual interface, within the group of interfaces bound to one IP address, is used for the path.
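The data path definition above can be sketched as a small value type. This is only an illustration under stated assumptions: the class name `DataPath`, its field names, the example addresses, and the example port number are hypothetical and do not come from the specification.

```python
from dataclasses import dataclass
from ipaddress import ip_address
from typing import Optional, Union
from ipaddress import IPv4Address, IPv6Address


@dataclass(frozen=True)
class DataPath:
    """Hypothetical record of one server-to-adapter data path in an IP network."""
    server_ip: Union[IPv4Address, IPv6Address]    # IP address of the server interface
    adapter_ip: Union[IPv4Address, IPv6Address]   # IP address of the adapter interface
    adapter_port: Optional[int] = None            # optional port used at the adapter


# Two distinct paths from the same server to adapter interfaces (made-up addresses).
path_a = DataPath(ip_address("10.0.0.5"), ip_address("10.0.1.20"), 3260)
path_b = DataPath(ip_address("10.0.0.5"), ip_address("10.0.1.21"))
```

Because the record is frozen and hashable, paths can be used as dictionary keys, for example when keeping per-path measurements. With link aggregation, such a record would identify only the bound IP address, not the physical interface actually used.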

When a storage system comprising two or more storage devices (e.g., a JBOD) is coupled to an adapter, and both the adapter and a server are coupled to a converged network, we contemplate that the server (in order to access a storage device of the storage system) will typically specify (i.e., be configured to use) a particular storage device of the storage system (e.g., one disk drive of the JBOD) and a data path between the server and that storage device. According to typical embodiments of the present invention, the data path may be changed from time to time to balance storage data traffic on the network. According to some embodiments of the present invention, the data path (between the server and the storage system) may be changed from time to time to balance storage data traffic on the network (and, according to the present invention, the adapter's selection of the particular device of the storage system to be accessed by the server may change from time to time, but such a change will not necessarily be deterministic).

In general, when storage data traffic is combined with other data traffic on a converged network, the properties of the different traffic types can combine to produce inefficient use of the network's overall bandwidth, thereby limiting the performance of the data communication traffic and/or the storage traffic.

For example, a modern server computer typically includes two or more 1 Gbps or 10 Gbps network interfaces (referred to herein as "interfaces" in the context of a server connected to a converged network). Many such servers run software packages (e.g., the Hadoop open source software package) that allow large numbers of servers to work together on problems involving very large amounts of data. However, such software (e.g., Hadoop) typically requires each server to have a unique name and address. Consequently, data communication traffic between servers running the software (e.g., Hadoop) will typically use only one of the two (or more) network connections available on each server.

In contrast, storage data traffic is typically configured with redundant paths between servers and disk drives so that it survives the failure of any of those components. These redundant paths can be used to redirect storage data traffic (e.g., to spread it among network interfaces) so as to avoid network interfaces that are busy with data communication (non-storage) traffic. However, the standard mechanisms for such redirection (e.g., multipath I/O or "MPIO" methods) impose a severe performance penalty on storage data traffic in a converged network. Specifically, the usual storage load-spreading mechanisms either send storage commands across all available interfaces in round-robin fashion, or determine some measure of how much work is outstanding on each link (e.g., the number of outstanding commands, the total number of outstanding bytes, or some other measure) and send each command to the "least busy" interface. These mechanisms heavily penalize storage data traffic between servers and disk drives because, for maximum performance, the commands executed by a disk drive must address consecutive locations on the disk. If the commands do not access consecutive locations, a "seek" operation is required to move the drive's read/write head to the new location. Each such seek will typically reduce overall performance by about 1% or more. The conventional spreading mechanisms (round-robin or "least busy") increase the number of seeks needed to execute a sequence of disk access commands because they frequently cause consecutive commands in the sequence to take different paths from the server to the disk drive. Because the different paths have different processing times and latencies (owing to the other operations on each path), commands issued in one order will often be executed in a different order. Each reordering causes a seek and thereby reduces overall data-carrying capacity. It has been observed that when these conventional spreading mechanisms are applied to Hadoop storage operations, they reduce the overall performance of storage data traffic by about 75% (i.e., the amount of storage data that can be transferred is about 25% of what would be possible without the round-robin or least-busy mechanism).
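The effect of reordering on seeks can be illustrated with a toy count. This is a simplified sketch, not a disk model: it assumes every command transfers the same number of blocks and charges one seek whenever a command does not start where the previous one ended. The function name and the sample LBA sequences are hypothetical.

```python
def count_seeks(executed_lbas, blocks_per_command):
    """Count commands that do not start where the previous command ended."""
    seeks = 0
    expected_next = None
    for lba in executed_lbas:
        if expected_next is not None and lba != expected_next:
            seeks += 1  # the head must move to a non-consecutive location
        expected_next = lba + blocks_per_command
    return seeks


# Six sequential 8-block commands, executed in issue order.
in_order = [0, 8, 16, 24, 32, 40]
# The same commands after two paths with different latencies swapped neighbours.
reordered = [0, 16, 8, 24, 40, 32]

seeks_in_order = count_seeks(in_order, 8)
seeks_reordered = count_seeks(reordered, 8)
```

In this toy sequence, swapping neighbouring commands turns a seek-free sequential read into one where almost every command incurs a seek, which is the mechanism the paragraph above describes.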

Another conventional technique, known as "link aggregation," is sometimes applied to split traffic between a first device (typically a server) and a second device (typically another server) across the group of all interfaces available for coupling those devices to a network, where each of the two devices has multiple such interfaces. With link aggregation, to achieve a kind of load balancing, a new choice of one interface of the first device and one interface of the second device is made (e.g., randomly or pseudo-randomly) before each new flow of data values (i.e., each new sequence of data values that must not be transmitted out of order) is sent over the network from the chosen interface of one device to the chosen interface of the other. This allows data communication traffic (averaged over many flows) to use all the available interfaces and keeps the amounts of data sent on each interface roughly balanced (unless an interface fails).
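The per-flow selection described above can be sketched as follows. This is an illustrative model only, assuming a pseudo-random choice made once per flow and remembered so that a flow's data stays on one interface pair and cannot be reordered; the class name, method name, and interface labels are hypothetical, and real link aggregation implementations commonly use hashing rather than a stored table.

```python
import random


class LinkAggregator:
    """Toy model: choose one interface pair per flow and keep it for that flow."""

    def __init__(self, local_ifaces, remote_ifaces, seed=None):
        self.local_ifaces = list(local_ifaces)
        self.remote_ifaces = list(remote_ifaces)
        self.rng = random.Random(seed)
        self.flows = {}  # flow id -> (local interface, remote interface)

    def interfaces_for(self, flow_id):
        # A new selection is made only when a new flow starts.
        if flow_id not in self.flows:
            self.flows[flow_id] = (
                self.rng.choice(self.local_ifaces),
                self.rng.choice(self.remote_ifaces),
            )
        return self.flows[flow_id]


agg = LinkAggregator(["s-if0", "s-if1"], ["r-if0", "r-if1"], seed=42)
first_choice = agg.interfaces_for("flow-1")
```

Averaged over many flows, the random choices spread traffic across all interfaces, but only between the two endpoints that share the aggregated group.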

Conventionally, link aggregation is not recommended for carrying storage data over a network. However, even if some form of link aggregation were used (contrary to conventional recommended practice) in an attempt to balance storage data traffic over a converged network between multiple interfaces of a server and multiple interfaces of an adapter, it would not prevent significant imbalance of storage data traffic in the converged network. Significant imbalance would result from the design decisions needed to maintain fault tolerance for storage traffic. That is, the need for a fully redundant path from the server (via at least one adapter) to each storage device requires that each storage device (or storage subsystem comprising multiple storage devices) be attached to the network through two completely separate network-connected devices (i.e., two separate adapters), each coupled between the storage device (or storage subsystem) and the network. Otherwise, if there were only one adapter, a failure of that adapter would make the storage device (or subsystem) unavailable. Because each such adapter must be a separate device, link aggregation cannot balance the network load between two adapters that provide redundant data paths to the same storage device (or storage subsystem), and cannot prevent significant imbalance between the storage data traffic through one adapter and that through another adapter providing a redundant data path to the same storage device (or storage subsystem). Because the adapters are separate devices, one may be busier, and therefore slower, than the other adapter or adapters that can access the same storage device. In contrast, typical embodiments of the present invention can mitigate storage data traffic imbalance in a converged network (and prevent significant storage traffic imbalance) even when link aggregation is in use.

Summary of the Invention

Herein, the term "bandwidth" of a system (e.g., a network, a device coupled to a network, or a network interface of a device that can be coupled to a network) denotes either the "consumed bandwidth" or the "available bandwidth" of the system. The "consumed bandwidth" of a system denotes the data rate (bit rate) through the system (e.g., the rate at which data traffic is occurring through the system, or an average or other statistical characterization of the rate at which data traffic has occurred through the system over some time interval). The "full available bandwidth" of a system denotes the maximum possible data rate (bit rate) of the system (i.e., the maximum rate at which data traffic could occur through the system). The "available bandwidth" of a system denotes the full available bandwidth of the system minus its consumed bandwidth.
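The bandwidth definitions above reduce to simple arithmetic. The sketch below assumes consumed bandwidth is estimated as an average over one sampling interval from two byte-counter readings; the function names, the counter-based measurement, and the clamping at zero are illustrative assumptions rather than details from the specification.

```python
def consumed_bandwidth_bps(bytes_at_start, bytes_at_end, interval_seconds):
    """Average bit rate through an interface over one sampling interval."""
    return (bytes_at_end - bytes_at_start) * 8 / interval_seconds


def available_bandwidth_bps(full_available_bps, consumed_bps):
    """Full available bandwidth minus consumed bandwidth (clamped at zero by assumption)."""
    return max(0.0, full_available_bps - consumed_bps)


# A 10 Gbps interface that moved 125,000,000 bytes in one second.
consumed = consumed_bandwidth_bps(0, 125_000_000, 1.0)
available = available_bandwidth_bps(10e9, consumed)
```

Here the interface consumed 1 Gbps, leaving 9 Gbps of available bandwidth.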

In some embodiments, the present invention is a method for balancing storage data traffic (e.g., in an attempt to optimize it) in a system in which computing devices coupled to a converged network (referred to herein as "servers") access storage devices coupled to the network (via adapters). A set of agents implemented on the servers ("server agents") and a set of agents implemented on the adapters ("adapter agents") are configured to detect and respond to imbalances in storage and data traffic throughout the network, and to redirect storage data traffic so as to reduce the imbalance and thereby improve overall network performance (for both data communication and storage traffic). Other embodiments include systems configured to perform such a method, and devices configured to implement such a method or for use in such a system.

Typically, each agent (server agent or adapter agent) operates autonomously (except that an adapter agent may, under some circumstances, respond to requests and notifications from a server agent), and no central computer or manager directs the agents' operation. Typically, an adapter agent interacts directly with a server agent only when the adapter and the server (in which those agents are implemented) provide a storage data path for at least one storage device; server agents never communicate directly with other server agents; and adapter agents never communicate directly with other adapter agents. Nevertheless, typical embodiments of the present invention allow all agents to react to, and influence, the behavior of the other agents so as to balance overall network traffic and to avoid unstable behavior. In addition, if any network-coupled device fails, the surviving network-coupled devices continue to balance network traffic (and adjust to the consequences of the failure) without any interruption.

According to typical embodiments, storage data traffic on a converged network is balanced in a fully decentralized manner: the communications performed to accomplish the balancing occur only between the endpoints of each data path between an adapter and a server (not between servers, not between adapters, and not from an adapter to two or more servers). A failure of any participant (e.g., a server interface, server agent, adapter interface, or adapter agent) affects only the paths of which that participant is a member. In general, communication between a server agent and an adapter agent is strictly one-to-one (e.g., a server agent does not share such a communication with more than one adapter agent). In contrast, conventional methods for balancing storage data traffic among multiple storage devices and multiple servers have not been decentralized in this way.

According to typical embodiments, the server agents and adapter agents operate to collect information about the state of the network, and to cause a server (where appropriate) to redirect all traffic for a storage device from one data path (between the server and the storage device) to a different data path (between the server and the storage device) chosen to reduce network imbalance.

In typical embodiments of the inventive method, it is assumed that another entity (e.g., a management or allocation process) has informed each server (and its agent) of all the data paths that can be used between the server and each storage device (e.g., disk drive) that the server may access to transfer data to or from that storage device. Typically, it is further assumed that each server (and its agent) has been informed of a preferred data path between the server and each storage device accessible to the server (e.g., a path chosen on the basis of a static analysis of the network, or determined in a deterministic manner, such as the path to the adapter interface having the lowest IP address).
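The deterministic example mentioned above (the path to the adapter interface with the lowest IP address) can be sketched briefly. The sketch assumes each candidate path is given as a (server interface IP, adapter interface IP) pair of strings; the function name and the addresses are hypothetical.

```python
from ipaddress import ip_address


def preferred_path(paths):
    """Pick the path whose adapter interface has the numerically lowest IP address."""
    # Compare parsed addresses, not strings: "10.0.1.9" sorts after "10.0.1.20" as text.
    return min(paths, key=lambda path: ip_address(path[1]))


candidate_paths = [
    ("10.0.0.5", "10.0.1.20"),
    ("10.0.0.5", "10.0.1.9"),
]
chosen = preferred_path(candidate_paths)
```

Because every server applies the same rule to the same list, no coordination is needed to agree on the initial path, although all servers then start on the same adapter interface until balancing moves them.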

In one class of embodiments, the present invention is a system comprising: at least one server having at least one server interface, wherein the server is configured to be coupled to a converged network by the server interface, and the server is configured to include a server agent; at least one storage device; and at least one adapter configured to be coupled to the storage device and having at least one adapter interface (and optionally also at least one other adapter having at least one adapter interface and configured to couple the storage device to the network), wherein the adapter is configured to couple the storage device to the network via the adapter interface, and the adapter is configured to include an adapter agent.

The adapter agent is coupled and configured to:

determine whether each of the adapter interfaces is overloaded, and generate an adapter interface overload indication for each of the adapter interfaces, wherein the adapter interface overload indication for each adapter interface indicates whether that adapter interface is overloaded; and

report at least one of the adapter interface overload indications to the server agent in response to a request from the server agent (e.g., in response to the request from the server agent, cause the adapter to assert, to at least one of the adapter interfaces, data indicative of at least one of the adapter interface overload indications).

The server agent is coupled and configured to:

cause the server to assert a request to the adapter agent, and identify at least one adapter interface overload indication asserted (i.e., supplied) to the server by the adapter agent in response to the request; and

for a path that includes the server interface and over which the server accesses the storage device via the adapter, determine whether the path is overloaded in a manner that uses the adapter interface overload indication.
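The request and report exchange between a server agent and an adapter agent can be sketched as below. The text above does not say how an adapter agent decides that an interface is overloaded, so the 90% utilization threshold here is purely an assumption, as are the class, method, and interface names; the rule that a path is overloaded when either endpoint is overloaded is likewise only one plausible way to use the indication.

```python
OVERLOAD_FRACTION = 0.9  # hypothetical threshold; not taken from the specification


class AdapterAgent:
    """Toy adapter agent that tracks consumed bandwidth per adapter interface."""

    def __init__(self, full_bps_by_iface):
        self.full_bps = dict(full_bps_by_iface)
        self.consumed_bps = {name: 0.0 for name in self.full_bps}

    def record_consumed(self, iface, consumed_bps):
        self.consumed_bps[iface] = consumed_bps

    def handle_request(self):
        """Return one overload indication (True/False) per adapter interface."""
        return {
            name: self.consumed_bps[name] >= OVERLOAD_FRACTION * self.full_bps[name]
            for name in self.full_bps
        }


def path_overloaded(server_iface_overloaded, adapter_iface, report):
    """Assumed rule: a path is overloaded if either of its two interfaces is."""
    return server_iface_overloaded or report[adapter_iface]


adapter_agent = AdapterAgent({"if0": 10e9, "if1": 10e9})
adapter_agent.record_consumed("if0", 9.5e9)
adapter_agent.record_consumed("if1", 1e9)
report = adapter_agent.handle_request()  # what the server agent would receive
```

The exchange is strictly one server agent to one adapter agent; the adapter agent only answers requests and never contacts other adapters.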

In some embodiments, the server agent is coupled and configured to respond to a determination that the path is overloaded, including by:

determining whether to select a new path to the storage device for subsequent use, and

upon determining that the new path should be selected, causing the server to change the routing of storage data traffic between the server and the storage device to the new path. Preferably, the server agent is coupled and configured to wait, after causing the server to change the routing of storage data traffic to the new path, for a time interval long enough that the effect of the change can be reflected in the results of each adapter agent's ongoing monitoring of traffic on each of its adapter interfaces; and, after the wait, to begin evaluating (e.g., re-evaluating) paths to the storage device, including at least one path other than the new path. In a preferred embodiment, the time interval of the wait is determined by a random number selected as a normal variate of a selected interval (e.g., 10 seconds), subject to a predetermined minimum wait and maximum wait.
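The randomized wait described above can be sketched as a clamped normal draw. This reads "a normal variate of a selected interval (e.g., 10 seconds)" as a normal distribution centred on that interval, which is an interpretation; the standard deviation, the minimum, the maximum, and the function name are all hypothetical values chosen for illustration.

```python
import random


def settle_wait_seconds(rng, mean=10.0, stddev=2.0, minimum=5.0, maximum=20.0):
    """Draw a normally distributed wait and clamp it to [minimum, maximum]."""
    return min(maximum, max(minimum, rng.gauss(mean, stddev)))


rng = random.Random(0)  # seeded only so the sketch is repeatable
waits = [settle_wait_seconds(rng) for _ in range(1000)]
```

Randomizing the wait means that servers which changed paths at about the same time are unlikely to re-evaluate at the same instant, which would help avoid many servers reacting to the same stale measurements together.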

In some embodiments, the system includes a first adapter configured to couple the storage device to the network, and a second adapter configured to couple the storage device to the network (and optionally also at least one other adapter configured to couple the storage device to the network), the first adapter includes at least one first adapter interface and the second adapter includes at least one second adapter interface, the first adapter includes a first adapter agent and the second adapter includes a second adapter agent, and the server agent is coupled and configured to:

monitor data traffic (e.g., receive traffic and transmit traffic) occurring on each of the server interfaces to determine the consumed bandwidth of each server interface, and determine the available bandwidth of each server interface from its consumed bandwidth; and

identify at least one available bandwidth indication supplied to the server by the first adapter agent in response to a request asserted from the server to the first adapter, wherein each such available bandwidth indication indicates the available bandwidth of one of the first adapter interfaces, and identify at least one additional available bandwidth indication supplied to the server by the second adapter agent in response to a request asserted from the server to the second adapter, wherein each such additional available bandwidth indication indicates the available bandwidth of one of the second adapter interfaces; and

将包括所述服务器接口以及所述第二适配器的一个所述第二适配器接口的路径上的可用带宽确定为所述服务器接口上的所述可用带宽与所述一个所述第二适配器接口的所述可用带宽中的最小值。An available bandwidth on a path including the server interface and one of the second adapter interfaces of the second adapter is determined as a minimum value of the available bandwidth on the server interface and the available bandwidth of the one of the second adapter interfaces.
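The server-agent computation described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class and function names are hypothetical, and the rule "available bandwidth equals nominal capacity minus consumed bandwidth" is an assumption, since the text says only that available bandwidth is determined from consumed bandwidth.

```python
from dataclasses import dataclass


@dataclass
class ServerInterface:
    """Hypothetical record kept by a server agent for one server interface."""
    capacity_bps: float   # assumed nominal line rate of the interface
    consumed_bps: float   # consumed bandwidth measured by monitoring traffic

    @property
    def available_bps(self) -> float:
        # Assumption: available bandwidth is capacity minus consumption,
        # clamped at zero.
        return max(0.0, self.capacity_bps - self.consumed_bps)


def path_available_bps(server_if: ServerInterface,
                       adapter_if_available_bps: float) -> float:
    """Available bandwidth on a path is the minimum of the two endpoints."""
    return min(server_if.available_bps, adapter_if_available_bps)


# A 10 Gb/s server interface carrying 4 Gb/s, paired with an adapter
# interface that reported 2.5 Gb/s available.
server_if = ServerInterface(capacity_bps=10e9, consumed_bps=4e9)
path_bw = path_available_bps(server_if, adapter_if_available_bps=2.5e9)
```

In this example the adapter interface is the bottleneck, so the path is credited with the adapter's 2.5 Gb/s rather than the server interface's 6 Gb/s.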

Optionally, the adapter agent is coupled and configured to:

monitor data traffic (e.g., receive traffic and transmit traffic) occurring on each of the adapter interfaces, and generate a consumed bandwidth indication for each adapter interface, wherein the consumed bandwidth indication for each adapter interface is indicative of the consumed bandwidth of that adapter interface; and

generate an available bandwidth indication for each adapter interface, wherein the available bandwidth indication for each adapter interface is indicative of the available bandwidth of that adapter interface; and

report to the server agent, in response to a request from the server agent, at least one of the adapter interface overload indications, and at least one of the consumed bandwidth indications and/or at least one of the available bandwidth indications (e.g., in response to the request from the server agent, cause the adapter to assert, to at least one of the adapter interfaces, data indicative of at least one of the adapter interface overload indications and at least one of the consumed bandwidth indications and/or at least one of the available bandwidth indications).

Optionally, the adapter agent is also coupled and configured to:

estimate the adapter's capacity to process additional data (e.g., the adapter's computational load capacity); and/or

filter a raw overload indication value to generate a filtered overload value, wherein the raw overload indication value is indicative of a determined overload, the filtered overload value is indicative of whether the determined overload is persistent, and at least one of the adapter interface overload indications is indicative of the filtered overload value.
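As a rough illustration of the filtering step, the following Python sketch treats an overload as persistent only when the raw indication was set in most of the recent samples. The text does not specify the filter, so the sliding-window rule, the class name, and the `window` and `threshold` parameters are all assumptions.

```python
from collections import deque


class OverloadFilter:
    """Hypothetical persistence filter for raw overload indications."""

    def __init__(self, window: int = 5, threshold: int = 4) -> None:
        # Keep only the most recent `window` raw samples.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def update(self, raw_overloaded: bool) -> bool:
        """Record one raw sample and return the filtered overload value."""
        self.samples.append(bool(raw_overloaded))
        # Persistent only if enough recent samples reported an overload.
        return sum(self.samples) >= self.threshold


overload_filter = OverloadFilter(window=5, threshold=4)
raw_values = [True, False, True, True, True, True]
filtered = [overload_filter.update(v) for v in raw_values]
```

With these assumed parameters a single overloaded sample is ignored, and the filtered value turns true only once four of the last five samples report an overload.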

In some embodiments, the adapter agent is coupled and configured to generate an available bandwidth indication for each adapter interface, wherein the available bandwidth indication for each adapter interface is indicative of the available bandwidth of that adapter interface, including by aging each planned additional bandwidth usage value received from at least one server agent for one of the adapter interfaces, thereby generating an aged planned bandwidth usage value for that adapter interface, and maintaining, for each adapter interface, the sum of the aged planned bandwidth usage values for that adapter interface. In some such embodiments, the adapter agent is coupled and configured to generate the available bandwidth indication for each adapter interface from: the full available bandwidth of the adapter interface, at least one measurement of the consumed bandwidth of the adapter interface, an indication of the adapter's capacity to process additional data, and the sum of the aged planned bandwidth usage values for the adapter interface.
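One way to picture the aging of planned bandwidth usage values is the Python sketch below. The linear decay over `max_age_s`, the subtraction formula, and all names are assumptions made for illustration; the text states only which quantities feed the available bandwidth indication, not how they are combined.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedUse:
    bps: float           # planned additional bandwidth announced by a server agent
    received_at: float   # time (seconds) at which the value was received


@dataclass
class AdapterInterface:
    """Hypothetical per-interface state kept by an adapter agent."""
    full_bps: float               # full available bandwidth of the interface
    consumed_bps: float = 0.0     # latest consumed bandwidth measurement
    max_age_s: float = 30.0       # assumed lifetime of a planned value
    planned: list = field(default_factory=list)

    def add_planned(self, bps: float, now: float) -> None:
        self.planned.append(PlannedUse(bps, now))

    def aged_planned_sum(self, now: float) -> float:
        # Drop expired values, then weight the rest by remaining lifetime
        # (assumed linear decay to zero at max_age_s).
        self.planned = [p for p in self.planned
                        if now - p.received_at < self.max_age_s]
        return sum(p.bps * (1.0 - (now - p.received_at) / self.max_age_s)
                   for p in self.planned)

    def available_bps(self, now: float,
                      processing_headroom_bps: float = float("inf")) -> float:
        free = self.full_bps - self.consumed_bps - self.aged_planned_sum(now)
        # Cap by the adapter's estimated capacity to process additional data.
        return max(0.0, min(free, processing_headroom_bps))


iface = AdapterInterface(full_bps=10e9, consumed_bps=3e9)
iface.add_planned(2e9, now=0.0)
at_start = iface.available_bps(now=0.0)
half_aged = iface.available_bps(now=15.0)
expired = iface.available_bps(now=30.0)
```

Aging keeps a server's announced plan from reserving bandwidth indefinitely: once the planned traffic actually flows, it shows up in the consumed bandwidth measurement instead.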

In some embodiments, each server is programmed with software that implements that server's server agent, and each adapter is programmed with software that implements that adapter's adapter agent. In some embodiments, at least one server agent or at least one adapter agent is implemented in hardware (e.g., at least one of the servers includes a hardware subsystem that implements its server agent).

Other aspects of the invention are an adapter (programmed or otherwise configured to implement an embodiment of the inventive adapter agent), a disk drive (or other storage device) integrated with such an adapter, a JBOD (or other storage device system) integrated with such an adapter, a server (programmed or otherwise configured to implement an embodiment of the inventive server agent), a hardware implementation of an embodiment of the inventive server agent, and a software implementation of an embodiment of the inventive adapter agent.

Other aspects of the invention are methods performed in operation of any embodiment of the inventive system, adapter, storage device, JBOD, server, or other device.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an embodiment of the inventive system.

FIG. 2 is a block diagram of another embodiment of the inventive system.

DETAILED DESCRIPTION

In one class of embodiments, the invention is a system including at least one server coupled to a converged network by at least one server interface, and at least one storage device coupled to the network by at least two adapters.

An example of such a system will be described with reference to FIG. 1. In the system of FIG. 1, each of servers 1 and 3 (and optionally other servers) and adapters 5, 7, 9, and 11 (and optionally other adapters) is coupled to converged network 20. Storage subsystem 13 is coupled to network 20 by each of adapters 5 and 7. Storage subsystem 15 is coupled to network 20 by each of adapters 9 and 11. Each of storage subsystems 13 and 15 may be a disk drive or other storage device, or a storage subsystem (e.g., a JBOD) that includes multiple storage devices.

Server 1 includes interface 2 (which is configured to connect server 1 to network 20), and server 1 is configured to include application subsystem 4 (e.g., is programmed with software implementing the application subsystem). Server 1 is also configured to include server agent subsystem 6 (e.g., is programmed with software implementing the server agent subsystem). Server 3 includes interface 8 (which is configured to connect server 3 to network 20), and is configured to include application subsystem 10 (e.g., is programmed with software implementing the application subsystem). Server 3 is also configured to include server agent subsystem 12 (e.g., is programmed with software implementing the server agent subsystem).

In some implementations, each of interfaces 2 and 8 is implemented as a physical device (i.e., a network interface controller ("NIC")). In other implementations, each of interfaces 2 and 8 is implemented as a software-defined wrapper of multiple NICs. In typical embodiments of the invention, each of interfaces 2 and 8 is a hardware or software element that has its own Internet Protocol (IP) address.

Adapter 5 is configured to include adapter agent subsystem 14 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 7 is configured to include adapter agent subsystem 16 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 9 is configured to include adapter agent subsystem 18 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 11 is configured to include adapter agent subsystem 22 (e.g., is programmed with software implementing the adapter agent subsystem).

In an exemplary embodiment, network 20 is an Ethernet network, and elements 1, 3, 5, 7, 9, and 11 are configured to communicate over network 20 in accordance with the iSCSI (Internet Small Computer System Interface) networking protocol. The iSCSI protocol is a conventional Internet Protocol-based networking standard that allows data to be transmitted over LANs, WANs, or the Internet. In this exemplary embodiment, elements 1, 3, 5, 7, 9, and 11 (and agents 6, 12, 14, 16, 18, and 22) use the iSCSI network protocol in a simple manner (much simpler than in many conventional applications): although communication is permitted between server 1 (or 3) and any of adapters 5, 7, 9, or 11, only one connection path exists at a time between each server (1 or 3) and each adapter (5, 7, 9, or 11).

In the exemplary embodiment:

Adapter 5 includes an iSCSI interface for communicating with server 1 or 3 via network 20. Communication in accordance with the invention between adapter agent 14 and server agents 6 and 12 is implemented through this iSCSI interface. Adapter 5 is also configured to communicate with storage subsystem 13 in accordance with the well-known Serial Attached SCSI ("SAS") protocol, to implement storage data traffic between server 1 (or 3) and subsystem 13;

Adapter 7 includes an iSCSI interface for communicating with server 1 or 3 via network 20. Communication in accordance with the invention between adapter agent 16 and server agents 6 and 12 is implemented through this iSCSI interface. Adapter 7 is also configured to communicate with storage subsystem 13 in accordance with the SAS protocol, to implement storage data traffic between server 1 (or 3) and subsystem 13;

Adapter 9 includes an iSCSI interface for communicating with server 1 or 3 via network 20. Communication in accordance with the invention between adapter agent 18 and server agents 6 and 12 is implemented through this iSCSI interface. Adapter 9 is also configured to communicate with storage subsystem 15 in accordance with the SAS protocol, to implement storage data traffic between server 1 (or 3) and subsystem 15; and

Adapter 11 includes an iSCSI interface for communicating with server 1 or 3 via network 20. Communication in accordance with the invention between adapter agent 22 and server agents 6 and 12 is implemented through this iSCSI interface. Adapter 11 is also configured to communicate with storage subsystem 15 in accordance with the SAS protocol, to implement storage data traffic between server 1 (or 3) and subsystem 15.

Application subsystem 4 of server 1 is configured to initiate accesses to storage devices coupled to network 20 (e.g., storage devices in subsystem 13 or 15). Application subsystem 10 of server 3 is configured to initiate accesses to storage devices coupled to network 20 (e.g., storage devices in subsystem 13 or 15). In typical operation, an entity (e.g., a management or allocation process) has informed application subsystem 4 and agent 6 of all data paths that can be used between server 1 and each storage device that the server can access to transfer data to or from that storage device, and application subsystem 4 and agent 6 have been informed of the preferred data path between server 1 and each storage device accessible by server 1 (e.g., a path chosen based on dynamic analysis of the network, or determined deterministically, such as the path to the adapter interface having the lowest IP address). Similarly, in typical operation, an entity (e.g., a management or allocation process) has informed application subsystem 10 and agent 12 of all data paths that can be used between server 3 and each storage device that the server can access to transfer data to or from that storage device, and application subsystem 10 and agent 12 have been informed of the preferred data path between server 3 and each storage device accessible by server 3.
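The deterministic rule mentioned above (prefer the path to the adapter interface with the lowest IP address) could be sketched as follows in Python. The function name and the example addresses are hypothetical, and the sketch assumes all candidate addresses belong to the same IP version.

```python
import ipaddress


def preferred_adapter_interface(addresses):
    """Return the candidate adapter-interface address that is numerically lowest.

    Comparing parsed addresses avoids the pitfall of string comparison,
    which would rank "10.0.0.10" below "10.0.0.9".
    """
    return min(addresses, key=ipaddress.ip_address)


# Hypothetical addresses of the adapter interfaces that reach one storage device.
candidates = ["10.0.0.10", "10.0.0.9", "10.0.1.2"]
preferred = preferred_adapter_interface(candidates)
```

Because every server applies the same rule to the same set of addresses, a choice like this needs no coordination, although it takes no account of the current load on the chosen interface.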

In typical implementations, each of adapter agent subsystems 14, 16, 18, and 22 (also referred to herein as adapter agents, or agents) and each of server agent subsystems 6 and 12 (also referred to herein as server agents, or agents) is configured in accordance with the invention (e.g., in a manner to be described below) to detect and respond to imbalances in storage data traffic over converged network 20, and to redirect the storage data traffic to reduce the imbalances and thereby improve overall network performance (for both data communications and storage traffic). For example, in typical implementations, server agent subsystem 6 is configured in accordance with the invention (e.g., in a manner to be described below) to detect imbalances in storage data traffic over network 20 and, where appropriate, to respond by redirecting storage data traffic from one data path between server 1 and a particular storage device (in subsystem 13 or 15) to another data path between server 1 and the same storage device.

Another embodiment of the inventive system is shown in FIG. 2. In the system of FIG. 2, server 21 (and optionally other servers) and adapters 25, 27, 29, and 31 (and optionally other adapters) are coupled to converged network 20 (which may be identical to network 20 of FIG. 1). Storage subsystem 23 is coupled to network 20 by each of adapters 25 and 27. Storage subsystem 33 is coupled to network 20 by each of adapters 29 and 31. Each of storage subsystems 23 and 33 is a storage subsystem that includes multiple storage devices (e.g., each is a JBOD that includes multiple disk drives).

Server 21 includes interfaces 22 and 24, each of which is a network interface controller (NIC) that has its own Internet Protocol (IP) address and is configured to connect server 21 to network 20. Server 21 is configured to include application subsystem 26 (e.g., is programmed with software implementing the application subsystem), and is also configured to include server agent subsystem 28 (e.g., is programmed with software implementing the server agent subsystem).

Adapter 25 includes interfaces 30 and 32, each of which is a network interface controller (NIC) that has its own Internet Protocol (IP) address and is configured to connect adapter 25 to network 20, and adapter 25 is configured to include adapter agent subsystem 38 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 25 also includes ports 34 and 36, each coupled to storage subsystem 23, and is configured to couple a storage device (within subsystem 23) to network 20 via either of ports 34 or 36 and either of interfaces 30 or 32.

Adapter 27 includes interfaces 40 and 42, each of which is a network interface controller (NIC) that has its own Internet Protocol (IP) address and is configured to connect adapter 27 to network 20, and adapter 27 is configured to include adapter agent subsystem 48 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 27 also includes ports 44 and 46, each coupled to storage subsystem 23, and is configured to couple a storage device (within subsystem 23) to network 20 via either of ports 44 or 46 and either of interfaces 40 or 42.

Adapter 29 includes multiple interfaces (not shown), each of which is a network interface controller (NIC) that has its own Internet Protocol (IP) address and is configured to connect adapter 29 to network 20, and adapter 29 is configured to include adapter agent subsystem 50 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 29 also includes multiple ports (not specifically shown), each coupled to storage subsystem 33, and is configured to couple a storage device (within subsystem 33) to network 20 via any one of the ports and any one of the NICs of adapter 29.

Adapter 31 includes multiple interfaces (not shown), each of which is a network interface controller (NIC) that has its own Internet Protocol (IP) address and is configured to connect adapter 31 to network 20, and adapter 31 is configured to include adapter agent subsystem 52 (e.g., is programmed with software implementing the adapter agent subsystem). Adapter 31 also includes multiple ports (not specifically shown), each coupled to storage subsystem 33, and is configured to couple a storage device (within subsystem 33) to network 20 via any one of the ports and any one of the NICs of adapter 31.

In an exemplary embodiment, network 20 is an Ethernet network, and elements 21, 25, 27, 29, and 31 are configured to communicate over network 20 in accordance with the iSCSI (Internet Small Computer System Interface) networking protocol. In this exemplary embodiment, elements 21, 25, 27, 29, and 31 (and agents 28, 38, 48, 50, and 52) use the iSCSI network protocol in a simple manner (much simpler than in many conventional applications): although communication is permitted between server 21 and any of adapters 25, 27, 29, or 31, only one connection path exists at a time between the server and each adapter (25, 27, 29, or 31).

In the exemplary embodiment:

Each of interfaces 30 and 32 of adapter 25 is an iSCSI interface for communicating with server 21 via network 20. Communication in accordance with the invention between adapter agent 38 and server agent 28 is implemented through this iSCSI interface. Adapter 25 is also configured to communicate with storage subsystem 23 via either of ports 34 or 36 in accordance with the Serial Attached SCSI ("SAS") protocol, to implement storage data traffic between server 21 and subsystem 23;

Each of interfaces 40 and 42 of adapter 27 is an iSCSI interface for communicating with server 21 via network 20. Communication in accordance with the invention between adapter agent 48 and server agent 28 is implemented through this iSCSI interface. Adapter 27 is also configured to communicate with storage subsystem 23 via either of ports 44 or 46 in accordance with the Serial Attached SCSI ("SAS") protocol, to implement storage data traffic between server 21 and subsystem 23;

Adapter 29 includes an iSCSI interface for communicating with server 21 via network 20. Communication in accordance with the invention between adapter agent 50 and server agent 28 is implemented through this iSCSI interface. Adapter 29 is also configured to communicate with storage subsystem 33 in accordance with the SAS protocol, to implement storage data traffic between server 21 and subsystem 33; and

Adapter 31 includes an iSCSI interface for communicating with server 21 via network 20. Communication in accordance with the invention between adapter agent 52 and server agent 28 is implemented through this iSCSI interface. Adapter 31 is also configured to communicate with storage subsystem 33 in accordance with the SAS protocol, to implement storage data traffic between server 21 and subsystem 33.

Application subsystem 26 of server 21 is configured to initiate accesses to storage devices coupled to network 20 (e.g., storage devices in subsystem 23 or 33). In typical operation, an entity (e.g., a management or allocation process) has informed application subsystem 26 and agent 28 of all data paths that can be used between server 21 and each storage device that the server can access to transfer data to or from that storage device, and application subsystem 26 and agent 28 have been informed of the preferred data path between server 21 and each storage device accessible by server 21 (e.g., a path chosen based on dynamic analysis of the network, or determined deterministically, such as the path to the adapter interface having the lowest IP address).

In typical implementations, each of adapter agent subsystems 38, 48, 50, and 52 (also referred to herein as adapter agents, or agents) and server agent subsystem 28 (also referred to herein as a server agent, or agent) is configured in accordance with the invention (e.g., in a manner to be described below) to detect and respond to imbalances in storage data traffic over converged network 20, and to redirect the storage data traffic to reduce the imbalances and thereby improve overall network performance (for both data communications and storage traffic). For example, in typical implementations, server agent 28 is configured in accordance with the invention (e.g., in a manner to be described below) to detect imbalances in storage data traffic over network 20 and, where appropriate, to respond by redirecting storage data traffic from one data path between server 21 and a particular storage device (in subsystem 23 or 33) to another data path between server 21 and the same storage device.

There are at least four data paths (e.g., one path through each of interfaces 30, 32, 40, and 42) between each Ethernet port (NIC 22 or 24) of server 21 and each accessible storage device (typically, each a disk drive), and thus at least eight data paths between server 21 and each accessible storage device. The system of FIG. 2 therefore provides considerable redundancy for storage device access.
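The path count can be checked by enumerating the combinations for a storage device in subsystem 23 of FIG. 2. The Python sketch below uses the reference numerals from the text as plain identifiers; the tuple layout is an assumption, and the adapters' SAS ports (which would multiply the count further) are deliberately left out.

```python
# Server 21's NICs and the network interfaces of the two adapters
# (25 and 27) that couple storage subsystem 23 to network 20.
server_nics = (22, 24)
adapter_interfaces = {25: (30, 32), 27: (40, 42)}

# One candidate path per (server NIC, adapter, adapter interface) combination.
paths = [
    (nic, adapter, iface)
    for nic in server_nics
    for adapter, ifaces in adapter_interfaces.items()
    for iface in ifaces
]

paths_per_nic = {nic: sum(1 for p in paths if p[0] == nic) for nic in server_nics}
```

Each NIC of server 21 reaches the device through four adapter interfaces, giving eight server-to-device paths in total, which matches the "at least four" and "at least eight" figures above.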

In a typical data center (e.g., one implementing the system of FIG. 1 or the system of FIG. 2), a management server (not shown in FIG. 1 or FIG. 2) would be coupled to the network for use in configuring and reconfiguring the data center (e.g., including by informing application subsystem 26 and agent 28 of FIG. 2 of all data paths that can be used between server 21 and each storage device that the server can access via network 20 to transfer data to or from that storage device).

It is contemplated that some embodiments of the inventive server are programmed (e.g., an application subsystem of the server is programmed) to run a software package (e.g., the Hadoop open source software package) that allows a large number of servers to work together to solve problems (typically involving massive amounts of data). It is also contemplated that a number of such servers (and a number of adapters, each configured to implement an embodiment of the inventive adapter agent) may be coupled to a converged network in a data center (e.g., a Hadoop data center), which might be located in a single building. Each adapter would typically be coupled to a JBOD, so that individual disk drives of the JBOD are accessible by the servers over the network via the adapter. The disk drives considered "local" to each server would typically be located in one JBOD (or more than one JBOD), and that JBOD or those JBODs would typically be mounted in one rack (e.g., a server might make three copies of the data it processes, storing one copy on each of two disk drives in one rack and the third copy on a disk drive in another rack). In such embodiments, the servers would be coupled via the network to allow distributed processing of a set (e.g., a large set) of data in parallel (with some of the processing performed in response to commands asserted by each of the servers).

More generally, in typical implementations of the inventive system, server, or adapter, each adapter agent (e.g., agent 14, 16, 18, or 22 of FIG. 1, or agent 38, 48, 50, or 52 of FIG. 2) and each server agent (e.g., agent 6 or 12 of FIG. 1, or agent 28 of FIG. 2) is implemented as processing hardware configured with software (e.g., software whose source code is written in Python and/or C) to operate in accordance with an embodiment of the invention. For example, both the server agent and the application subsystem of a server (e.g., both agent 6 and subsystem 4 of server 1 of FIG. 1, or both agent 12 and subsystem 10 of server 3 of FIG. 1) may be implemented in processing hardware (e.g., a computer) configured with software. Typically, no application (e.g., an application implemented by subsystem 4 of server 1 or subsystem 10 of server 3 of FIG. 1, or by subsystem 26 of server 21 of FIG. 2) needs to be changed to obtain the advantages of typical embodiments of the invention. Typically, each server agent and adapter agent operates in a manner invisible to the applications, and any application that uses any of the server or adapter interfaces involved (including interfaces that perform only data communications operations) will benefit from the storage data load balancing performed in accordance with the invention.

Next, we describe the operation of each adapter agent and each server agent during operation of an embodiment of the inventive system, in accordance with a class of embodiments of the invention. In the description, "receive traffic" (or "receive data") denotes data asserted (i.e., supplied) from the network to an adapter (or a server), and "transmit traffic" (or "transmit data") denotes data asserted (i.e., supplied) to the network from an adapter (or a server). Typically, a single adapter (or a single server) has two interfaces to the network, and it may have more than two.

In some embodiments, each adapter agent (e.g., each of agents 14, 16, 18, and 22 of FIG. 1, or each of agents 38, 48, 50, and 52 of FIG. 2) is configured to perform all or some of the following operations:

1. The adapter agent monitors the receive traffic and the transmit traffic (e.g., in units of bits per second) occurring on each interface of the adapter, and generates at least one measurement of consumed bandwidth for each said interface. Typically, each monitoring sample is taken over a relatively short period of time (e.g., a few seconds), and the adapter agent determines statistical characterizations of the stream of receive data samples and the stream of transmit data samples, to provide short-term and long-term measurements of the consumed bandwidth of (i.e., the bandwidth being used on) each interface. Since modern NICs are full duplex (typically, a NIC of an adapter can send and receive simultaneously), independent statistics are typically kept for receive data and transmit data on each interface. In a preferred embodiment, the well-known method of determining an exponential moving average of a value (i.e., in this case, an exponential moving average of receive traffic on an interface over a moving time window of fixed duration, or an exponential moving average of transmit traffic on an interface over a moving time window of fixed duration) is used to determine the statistical characterization of the receive traffic on each interface and the statistical characterization of the transmit traffic on each interface, since such exponential moving averages are inexpensive to calculate. An example of a method for determining such an exponential (weighted) moving average is described in U.S. Patent 6,438,141 (issued on August 20, 2002) with reference to FIG. 8 thereof. In a preferred embodiment, each short-term moving average approximates an arithmetic moving average over an interval (window) of 20 seconds (or substantially equal to 20 seconds), and each long-term average approximates an arithmetic moving average over an interval (window) of 60 seconds (or substantially equal to 60 seconds). Other window durations and calculation methods can be used to implement other embodiments of the invention;

2. The adapter agent calculates (e.g., estimates) the capacity of the adapter to process additional data. In a preferred embodiment, this is the adapter's computational load capacity. Since processing additional data would involve more computational work, if the adapter is running at its computational capacity, it may not be able to handle additional storage data traffic even if its interfaces are not being fully utilized. In some embodiments, the adapter agent incorporates the remaining capacity of any other resource that may be consumed by the adapter in handling storage data traffic into its calculation (e.g., estimate) of the adapter's capacity to process additional data. Optionally, the adapter agent also determines a derating factor for the adapter, by which the adapter agent multiplies (in some embodiments) a raw estimate of additional available bandwidth per adapter interface to determine a limited estimate of additional available bandwidth per adapter interface (e.g., to limit the bandwidth that the adapter agent will report as being available for the interface, as described below);

3. If a server agent has indicated to the adapter agent that the server (in which the server agent is implemented) plans to use additional bandwidth (on a path that includes an interface of the adapter) in the near future, the adapter agent maintains a sum of the planned additional future bandwidth use(s) that each such server agent has indicated to the adapter agent for the path (or paths) that include the adapter interface. In a preferred embodiment, the adapter agent (of an adapter) accepts planned bandwidth use notifications from a server only if the server is accessing a storage device (e.g., a disk drive) on a path that includes the adapter. A server agent's indication of planned future bandwidth use is not a reservation or bandwidth allocation; rather, it provides notice that the actual consumed bandwidth statistics determined by the adapter agent are likely to change in the near future. The purpose of such indications by server agents, and of the sum maintained by the adapter agent, is to eliminate or limit the possibility that the data traffic for many storage devices would be directed to one interface of one adapter all at once. The adapter agent typically reduces over time (i.e., "ages") each planned additional bandwidth use notice, and maintains an updated (aged) sum of the aged planned additional use values for each interface of the adapter. As new traffic is actually routed through the interface, such new actual traffic is included in the per-interface traffic measurements made by the adapter agent. In a preferred embodiment, the adapter agent ages each planned additional bandwidth use notice by reducing the bandwidth value indicated by the notice exponentially (i.e., implementing an exponential decay of the indicated planned additional bandwidth use value) with a half-life of 20 seconds (or substantially equal to 20 seconds). Alternatively, other mechanisms and values (e.g., other exponential decay half-life values) can be used to achieve the desired aging of each indicated planned additional bandwidth use value;

4. The adapter agent determines (calculates) whether each interface of the adapter (in which the agent is implemented) is overloaded, and reports to a server agent (in response to a request from the server agent) an indication of such overload (if an overload is determined to exist). Such an overload indication may be used by the server to determine whether to try to stop using the interface if possible. The server would typically be configured to use the indication to determine whether a link has been nearly fully utilized for a while and is still fully utilized, and if so, to consider the link to be overloaded and to determine whether it would be better to route some storage data traffic elsewhere. The adapter agent may filter a raw overload indication value (indicative of a determined overload) to generate a filtered overload value that indicates whether the determined overload is persistent, and then report the filtered overload value (in response to a request from the server agent) rather than the raw overload value. In typical embodiments, the adapter agent is configured to use, as an overload bandwidth level, a selected bandwidth at which the interface is considered to be fully utilized. In a preferred embodiment, the overload bandwidth level is selected to be 92.5% of the full available bandwidth of the interface, and the interface is reported (by a filtered overload value) as being overloaded if the overload calculation yields true at least two consecutive times. In typical embodiments, the overload calculation is considered true if either of the following conditions is true:

Both the short-term and long-term measurements of consumed transmit bandwidth (e.g., the short-term and long-term transmit bandwidth averages) are above the overload bandwidth level, or both the short-term and long-term measurements of consumed receive bandwidth (e.g., the short-term and long-term receive bandwidth averages) are above the overload bandwidth level; or

The adapter's capacity to process data has been reached (or nearly reached);

5. The adapter agent calculates an estimate of the available bandwidth per adapter interface (i.e., the additional bandwidth that is available to accommodate the data traffic of a new storage device that might be redirected to the interface by a server). This calculation does not require any knowledge of the state or capabilities of the new storage device; instead, it is a determination by the adapter agent of an estimated amount of extra storage data traffic, to or from a storage device, that could be handled by the interface if such extra traffic were directed to the interface. This estimate of available bandwidth is typically calculated from the interface's full available bandwidth (e.g., raw capacity in units of bits per second), the interface's traffic statistics (i.e., at least one measurement of the adapter interface's consumed bandwidth), the adapter's capacity to process additional data (i.e., computational load), and the total indicated future bandwidth notifications for the interface. Since storage data traffic includes both read and write traffic at various times, the estimated additional available traffic calculation typically assumes that the additional traffic will be transmit or receive traffic, whichever is already the busiest. This prevents any additional traffic from overloading an already heavily loaded direction of data travel on the interface. In a preferred embodiment, the estimated available bandwidth is based on the average receive and transmit data for the interface plus an estimate of the normal variation in recent traffic (i.e., a standard deviation), to avoid slowing down the processing of existing work. In a preferred embodiment, the estimate of the average and expected variation of the traffic is calculated via a "fast up, slow down" exponential moving average (in which a relatively large weight is applied to the next average value if the most recently generated statistic is greater than the previously generated statistic, and a relatively small weight is applied to the next average value if the most recently generated statistic is less than or equal to the previously generated statistic), as described, for example, in above-cited U.S. Patent 6,438,141. With a simple calculation, such a "fast up, slow down" exponential moving average can approximate the recent average plus one standard deviation of a series. Also, the estimated total raw available bandwidth may be reduced by a safety factor so that brief surges in traffic through the interface can be absorbed without degrading performance. In one embodiment, the estimate of the available bandwidth of an adapter interface (denoted as the value "available" in the equation below) is calculated as follows (although it should be appreciated that additional terms and factors can be taken into account in the calculation to tune the behavior):

available = (safety_factor * (raw_bandwidth - worst_case)) * processing_capability_derating_factor,

where the value "worst_case" is equal to max(transmit_mean_and_variation, receive_mean_and_variation) + sum(aged_future_bandwidth_notices), where

"max(a, b)" denotes the value "a" or the value "b", whichever is greater,

transmit_mean_and_variation is a measurement of the consumed transmit bandwidth for the interface (e.g., an estimate of the average transmit data for the interface plus an estimate of the normal variation (standard deviation) in recent transmit traffic),

receive_mean_and_variation is a measurement of the consumed receive bandwidth for the interface (e.g., an estimate of the average receive data for the interface plus an estimate of the normal variation (standard deviation) in recent receive traffic),

"sum(aged_future_bandwidth_notices)" is the sum of the aged planned additional bandwidth use values for the adapter interface,

safety_factor is the above-noted safety factor,

raw_bandwidth denotes the full available bandwidth of the interface, and

processing_capability_derating_factor is the derating factor (of the type described above) for the adapter; and/or

6. The adapter agent responds to status requests from server agents (i.e., a status request from the server agent of a server on the same storage data path as the adapter). Typically, the status report returned to the server agent contains, for each adapter interface, the current overload status and the available bandwidth of the interface, as described above.
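
The monitoring, overload, and available-bandwidth operations described above for the adapter agent can be illustrated by the following minimal Python sketch. It is an illustration under stated assumptions, not the implementation of the invention: the class and function names, the 2-second sample period, the smoothing rule alpha = 2 / (N + 1), and the clamping of negative estimates to zero are all assumptions introduced here. The 20-second and 60-second windows, the 92.5% overload level, and the "true at least two consecutive times" filter are taken from the preferred embodiment described above. The mean-plus-variation inputs are supplied by the caller rather than computed with a "fast up, slow down" average.

```python
from typing import Iterable, Optional

OVERLOAD_FRACTION = 0.925  # overload level from the preferred embodiment


class Ema:
    """Exponential moving average approximating a mean over window_s seconds."""

    def __init__(self, window_s: float, sample_period_s: float):
        samples_in_window = window_s / sample_period_s
        # Assumption: alpha = 2 / (N + 1), a common way to match an N-sample mean.
        self.alpha = 2.0 / (samples_in_window + 1.0)
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        if self.value is None:
            self.value = sample  # seed with the first sample
        else:
            self.value += self.alpha * (sample - self.value)
        return self.value


class DirectionStats:
    """Short-term (20 s) and long-term (60 s) averages for one direction."""

    def __init__(self, sample_period_s: float = 2.0):  # period is an assumption
        self.short = Ema(20.0, sample_period_s)
        self.long = Ema(60.0, sample_period_s)

    def update(self, bits_per_s: float) -> None:
        self.short.update(bits_per_s)
        self.long.update(bits_per_s)

    def exceeds(self, level: float) -> bool:
        if self.short.value is None or self.long.value is None:
            return False  # no samples yet
        return self.short.value > level and self.long.value > level


def raw_overload(tx: DirectionStats, rx: DirectionStats,
                 raw_bandwidth: float, capacity_reached: bool) -> bool:
    """Raw overload: either direction saturated, or processing capacity reached."""
    level = OVERLOAD_FRACTION * raw_bandwidth
    return tx.exceeds(level) or rx.exceeds(level) or capacity_reached


class OverloadFilter:
    """Reports overload only after the raw result is true twice in a row."""

    def __init__(self, required: int = 2):
        self.required = required
        self.streak = 0

    def update(self, raw: bool) -> bool:
        self.streak = self.streak + 1 if raw else 0
        return self.streak >= self.required


def available_bandwidth(raw_bandwidth: float,
                        transmit_mean_and_variation: float,
                        receive_mean_and_variation: float,
                        aged_future_bandwidth_notices: Iterable[float],
                        safety_factor: float,
                        processing_capability_derating_factor: float) -> float:
    """Estimate of additional bandwidth an interface could accept (bits/s)."""
    worst_case = (max(transmit_mean_and_variation, receive_mean_and_variation)
                  + sum(aged_future_bandwidth_notices))
    estimate = (safety_factor * (raw_bandwidth - worst_case)
                * processing_capability_derating_factor)
    return max(0.0, estimate)  # assumption: never report negative headroom
```

For example, with a 10 Gbit/s interface, a busiest-direction measurement of 4 Gbit/s, one aged notice of 1 Gbit/s, a safety factor of 0.9, and a derating factor of 1.0, the sketch reports roughly 4.5 Gbit/s of available bandwidth.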

In some embodiments, each server agent (e.g., each of agents 6 and 12 of FIG. 1, or agent 28 of FIG. 2) is configured to perform all or some of the following operations:

1. As in typical embodiments of the adapter agent, the server agent monitors the receive traffic and the transmit traffic (e.g., in units of bits per second) occurring on each interface of the server, and generates at least one measurement of consumed bandwidth for each said interface. Typically, each monitoring sample is taken over a relatively short period of time (e.g., a few seconds), and the server agent determines statistical characterizations of the stream of receive data samples and the stream of transmit data samples, to provide short-term and long-term measurements of the consumed bandwidth of (i.e., the bandwidth being used on) each interface. Since modern NICs are full duplex (typically, a NIC of a server can send and receive simultaneously), independent statistics are typically kept for receive data and transmit data on each interface. In a preferred embodiment, the well-known method of determining an exponential moving average of a value (i.e., in this case, an exponential moving average of receive traffic on an interface over a moving time window of fixed duration, or an exponential moving average of transmit traffic on an interface over a moving time window of fixed duration) is used to determine the statistical characterization of the receive traffic on each interface and the statistical characterization of the transmit traffic on each interface (e.g., in the same manner as described above for typical embodiments of the inventive adapter agent);

2. For each storage device (e.g., disk drive) access path that has been assigned to the server (in the sense that the storage device is available to be accessed by the server over the path via the converged network), the server agent may cause the server to assert a request to the adapter that is the other endpoint of the path, and the server agent receives the adapter's bandwidth (consumed bandwidth and/or available bandwidth) and/or overload information (i.e., the overload and/or bandwidth report generated by the adapter agent of the adapter in response to the request). In many cases, the same adapter is used for several storage devices and paths, so the adapter data received in response to one request can often be used for multiple paths;

3. For each path over which the server (in which the server agent is implemented) accesses a storage device via an adapter, the server agent calculates whether the path is overloaded and needs to shed load, and how much available (unused) bandwidth the path has. In typical embodiments, the server agent determines the available bandwidth on the path to be the minimum of the available bandwidth on the server interface (which is coupled along the path) and the available bandwidth on the adapter interface (which is coupled along the path). In typical embodiments, the server agent determines that the path is overloaded if either the server interface or the adapter interface is overloaded (typically, including by using an interface overload indication in a report asserted to the server by the adapter agent in response to a request from the server);

4. If at least one overloaded path is in use (by the server to access any storage device(s)), the server agent typically implements a selection process to assess each overloaded path. In a preferred embodiment, if at least two overloaded paths are in use by the server, the server agent considers them in random order and selects only one per cycle:

If another path is available (between the server and the adapter coupled along the overloaded path) that is not overloaded and has sufficient available bandwidth for another storage device, the server agent selects that other path for subsequent use. If two or more such alternative paths are available, the server agent selects the path with the highest available bandwidth;

Otherwise, if the server (and its server agent) has been informed of a preferred data path between the server and the storage device coupled along the overloaded path, and if the current (overloaded) path is not the path originally assigned by the server for accessing the storage device, then the preferred data path is selected for subsequent use (regardless of whether the preferred data path is overloaded). Typically, if the assignment of the current (overloaded) path is not changed (i.e., if no other path is selected to replace the current path), the next overloaded path is considered in the same manner as the current path;

5. If a new path assignment is made (i.e., if the server agent selects another path to replace the current path), the server agent typically performs the following actions:

The server agent notifies the adapter agent associated with the newly selected path that the server interface (which is coupled along the newly selected path) plans to assert storage data traffic having a specific bandwidth (e.g., one disk's worth of bandwidth of future load) to a specific interface of the adapter. This immediately affects the statistics and reports generated by the adapter agent of the adapter, and typically (indirectly) prevents two servers from trying to use the same excess bandwidth on an adapter interface; and

The server agent causes the server to change the routing of storage data traffic between the server and the relevant storage device to the newly selected path; and/or

6. After causing the server to change the routing of storage data traffic between the server and a storage device to a newly selected path, the server agent waits for a time interval (e.g., a predetermined or randomly selected time interval) of sufficient duration that the effects of the server agent's recent actions can be reflected in the results (e.g., monitoring statistics) of each adapter agent's ongoing monitoring of the traffic on each of its adapter interfaces. After the wait, the server agent begins to evaluate (e.g., re-evaluate) paths to the storage device, including at least one path other than the new path. In a preferred embodiment, the time interval of the wait is determined by a random number selected as a normal variate about a selected interval (e.g., 10 seconds), subject to predetermined minimum and maximum waits.
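
The path evaluation, path selection, and randomized wait described in the server agent operations above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the invention: the `Path` record, the function names, and the standard deviation, minimum, and maximum used for the wait are hypothetical values chosen for illustration (only the 10-second nominal interval appears in the text). The fallback to a preferred data path is omitted for brevity.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Path:
    name: str
    server_if_available: float    # bits/s free on the server interface
    adapter_if_available: float   # bits/s free on the adapter interface (from its report)
    server_if_overloaded: bool
    adapter_if_overloaded: bool

    @property
    def available(self) -> float:
        # Path headroom is the minimum of the two interface headrooms.
        return min(self.server_if_available, self.adapter_if_available)

    @property
    def overloaded(self) -> bool:
        # A path is overloaded if either of its interfaces is overloaded.
        return self.server_if_overloaded or self.adapter_if_overloaded


def choose_replacement(alternatives: List[Path], needed_bw: float) -> Optional[Path]:
    """Pick the non-overloaded alternative with the most available bandwidth."""
    candidates = [p for p in alternatives
                  if not p.overloaded and p.available >= needed_bw]
    return max(candidates, key=lambda p: p.available, default=None)


def rebalance_once(in_use: List[Path],
                   alternatives_for: Dict[str, List[Path]],
                   needed_bw: float,
                   rng: random.Random) -> Optional[Tuple[Path, Path]]:
    """Consider overloaded paths in random order; reassign at most one per cycle."""
    overloaded = [p for p in in_use if p.overloaded]
    rng.shuffle(overloaded)
    for path in overloaded:
        new_path = choose_replacement(alternatives_for.get(path.name, []), needed_bw)
        if new_path is not None:
            return path, new_path
    return None


def wait_interval(rng: random.Random, mean_s: float = 10.0, stddev_s: float = 3.0,
                  min_s: float = 5.0, max_s: float = 20.0) -> float:
    """Normal variate about mean_s, clamped to [min_s, max_s] (bounds assumed)."""
    return min(max_s, max(min_s, rng.gauss(mean_s, stddev_s)))
```

Randomizing both the order in which overloaded paths are considered and the wait between evaluations makes it less likely that several server agents react to the same report in the same way at the same moment.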

An exemplary method performed by an embodiment of the inventive system is as follows. A server agent (in this example, agent 28 of server 21 of FIG. 2) sends a planned additional bandwidth use notification to an adapter agent (in this example, adapter agent 38 of adapter 25 of FIG. 2) in response to determining that server 21 should access a storage device (coupled to network 20 by adapter 25) via a path through a specific interface of adapter 25 (in this example, interface 30). In response, adapter agent 38 reduces over time (i.e., "ages") the planned additional bandwidth use value indicated by the notice, and maintains an updated (aged) sum of all aged planned additional bandwidth use values received for interface 30 (and uses the aged sum to generate current overload status and available bandwidth indications). As new traffic is actually routed through interface 30, such new actual traffic is included in the per-interface traffic measurements made by adapter agent 38 (and used to generate current overload status and available bandwidth indications for each adapter interface). At the same time, the other adapter agents of the other adapters coupled to the network independently perform their own per-adapter-interface traffic measurements (and generate their own per-interface current overload status and available bandwidth indications). Server agent 28 requests (from each adapter agent) a report indicating the current overload status and available bandwidth of each interface of the adapter in which that adapter agent is implemented and which is part of a path to a storage device used by the server, and in response each queried adapter agent independently sends the requested report to server agent 28. Server agent 28 uses the reports, together with the statistical characterization of traffic that agent 28 itself generates for its own server interfaces, to determine whether to allow server 21 to access the storage device via the current path (the path assumed by the most recently asserted planned additional bandwidth use notification) or to select another path (to replace the current path) for access by server 21 to the storage device. If server agent 28 selects a new path for access by server 21 to the storage device, server agent 28 notifies the adapter agent associated with the newly selected path that the server interface (which will be coupled to the newly selected path) plans to assert storage data traffic having a specific bandwidth to a specific interface of the adapter, and server agent 28 causes server 21 to change the routing of storage data traffic between server 21 and the relevant storage device to the newly selected path. Thus, the system operates in a decentralized manner (with independent adapter agents independently asserting independently generated reports to the server agent of a server) to select the best path for access by the server to a storage device.
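
The aging of planned additional bandwidth use notifications performed by the adapter agent in the exemplary method can be sketched as follows. This is a minimal sketch under stated assumptions: the `PlannedUsage` class and its method names are hypothetical, time is passed in explicitly as seconds, and old notices are never pruned. The 20-second half-life is the value given for the preferred embodiment.

```python
HALF_LIFE_S = 20.0  # half-life from the preferred embodiment


class PlannedUsage:
    """Aged sum of planned-bandwidth notices for one adapter interface."""

    def __init__(self, half_life_s: float = HALF_LIFE_S):
        self.half_life_s = half_life_s
        self._notices = []  # list of (bits_per_s, time_received_s)

    def notify(self, bits_per_s: float, now_s: float) -> None:
        """Record a server agent's notice of planned additional bandwidth use."""
        self._notices.append((bits_per_s, now_s))

    def aged_sum(self, now_s: float) -> float:
        """Sum of all notices, each decayed exponentially since it arrived."""
        return sum(bw * 0.5 ** ((now_s - t) / self.half_life_s)
                   for bw, t in self._notices)


usage = PlannedUsage()
usage.notify(1e9, now_s=0.0)      # a server plans about 1 Gbit/s of new load
print(usage.aged_sum(now_s=20.0))  # 500000000.0 after one half-life
print(usage.aged_sum(now_s=40.0))  # 250000000.0 after two half-lives
```

As the notice decays, the real traffic it announced shows up in the measured per-interface averages, so the planned load is not counted twice for long.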

In some embodiments of the invention, the server agent of a server coupled to a converged network (e.g., each of agents 6 and 12 of FIG. 1, or agent 28 of FIG. 2) is configured to detect, and to reroute storage traffic around, network bottlenecks other than those caused by adapter interface traffic or adapter capacity. An example of such a bottleneck is a network bottleneck caused by conventional data communication traffic between the server and other server(s) that may not participate in the rebalancing mechanism.

In a class of preferred embodiments, servers and adapters (e.g., elements 1, 3, 5, 7, 9, and 11 of FIG. 1, each implemented with multiple network interfaces) are coupled to a converged network (e.g., network 20 of FIG. 1) which is an Ethernet network, and the servers and adapters are configured to communicate over the network in accordance with the iSCSI (Internet Small Computer System Interface) networking protocol. In this class of embodiments, the server agents and adapter agents (e.g., agents 6, 12, 14, 16, 18, and 22 of FIG. 1) use the iSCSI network protocol in a simple manner (much simpler than in many conventional applications), in which communication is allowed between a server (e.g., server 1 or 3) and any of the adapters (e.g., adapter 5, 7, 9, or 11), but there is only one connection path at a time between each server and each adapter (to a storage device). In this class of embodiments, the server agent uses conventional multipath I/O ("MPIO") techniques (or a new, simplified version of conventional MPIO techniques) to accomplish storage data traffic balancing in accordance with the invention. The expression "MPIO-type subsystem" is used herein to denote either a processing subsystem (e.g., of a server) that implements conventional MPIO or a processing subsystem that implements a simplified version of conventional MPIO.

In the described class of embodiments, each server includes an MPIO-type subsystem (e.g., an MPIO driver in the kernel) which manages data input/output in accordance with iSCSI via a selected one of the server's interfaces. The server agent of the server interacts with the MPIO-type subsystem, including by setting a storage device access "policy" which allows the server to access a storage device (via the network and one of the adapters) only through the one of the server's interfaces that has been selected by the server agent. Such a policy resembles the conventional MPIO "failover only" policy, which does not perform load balancing and instead uses a single active path for network access (any other potentially usable path is merely a standby path, used only if the single active path fails). However, the storage device access policy is used by the inventive server agent in accordance with the invention to implement storage data traffic balancing in a new way. When the server agent of a server selects a new path (in accordance with any embodiment of the inventive method, typically including the step of receiving a requested report from an adapter agent) for access by the server to a storage device via a newly selected interface of the server, the server agent causes the server to change the routing of storage data traffic (to or from the storage device) to the newly selected path by causing the MPIO-type subsystem to specify a new storage device access "policy" which allows the server to access the storage device only via the newly selected one of the server's interfaces. The server agent also causes the new storage device access path to extend to the appropriate adapter interface selected by the server agent.
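
The single-active-path access policy described above can be pictured with the following minimal sketch. It is purely illustrative: `SingleActivePathPolicy` is a hypothetical stand-in for an MPIO-type subsystem and does not correspond to the API of any real MPIO driver, and the device and interface names are invented. The point it shows is that every command for a given storage device uses the same (server interface, adapter interface) pair until the server agent deliberately changes the policy.

```python
from typing import Dict, Tuple


class SingleActivePathPolicy:
    """Hypothetical stand-in for an MPIO-type subsystem's per-device policy."""

    def __init__(self) -> None:
        # storage device id -> (server interface, adapter interface)
        self._active: Dict[str, Tuple[str, str]] = {}

    def set_active_path(self, device: str, server_if: str, adapter_if: str) -> None:
        """Called by the server agent when it selects a (new) path for a device."""
        self._active[device] = (server_if, adapter_if)

    def route(self, device: str) -> Tuple[str, str]:
        """Path used for the next command; never varies from command to command."""
        return self._active[device]


policy = SingleActivePathPolicy()
policy.set_active_path("disk0", "server-if-a", "adapter-if-a")
# Consecutive commands to disk0 all take the same path, so they stay in order.
paths = [policy.route("disk0") for _ in range(3)]
# Later, the server agent rebalances by replacing the policy for disk0.
policy.set_active_path("disk0", "server-if-b", "adapter-if-b")
```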

Thus, in the described class of embodiments, the MPIO-type subsystem is used (by the inventive server agent) to balance storage data traffic on a converged network in accordance with the invention.

MPIO was originally developed for segregated storage networks, and conventional MPIO load balancing does not work well on converged networks. For example, suppose MPIO is used in a converged network (implemented as Ethernet) to balance storage data traffic between multiple Ethernet ports of a server coupled to the network and multiple Ethernet ports of an adapter coupled to the network, where the adapter also has multiple "back end" SAS ports coupled to a disk drive subsystem (a JBOD) to be accessed by the server. In this example, every conventional MPIO load balancing "policy" (sending storage commands across all available Ethernet interfaces in round-robin fashion, or determining some measure of how much work is outstanding on each link and sending each command to the "least busy" Ethernet interface) would typically increase the number of seeks needed to execute a sequence of disk access commands, because it would frequently cause consecutive commands in the sequence to take different paths from the server to the disk drive (often causing the commands to arrive at the disk drive out of order). It would therefore produce the excessive seeking problem described above, as a result of changing the storage data path through the network too rapidly, whether or not the changes are desirable. In contrast, typical embodiments of the invention (including those in which the inventive server agent uses the server's MPIO-type subsystem to balance storage data traffic on the converged network as described above) typically do not cause the excessive seeking problem, because they change the storage data path used to access any one disk drive only when necessary, and typically very infrequently (e.g., once, twice, or a few times per hour). An important advantage of typical embodiments is that they maintain in-order delivery of commands to the disks over the converged network while adjusting for cross traffic (to perform storage data traffic balancing).
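The ordering problem described above can be illustrated with a small, deterministic Python simulation. The path names and latency figures are assumptions made only for illustration; the sketch compares round-robin dispatch over two paths of unequal latency with dispatch over a single fixed path.

```python
def arrival_order(latency_by_path, choose_path, n_commands=6):
    """Return command indices in the order they reach the disk drive.

    Command i is issued at time i and arrives at i + latency of its path.
    """
    arrivals = []
    for i in range(n_commands):
        path = choose_path(i)
        arrivals.append((i + latency_by_path[path], i))
    return [cmd for _, cmd in sorted(arrivals)]


def out_of_order_count(order):
    """Count adjacent arrivals where a later command precedes an earlier one."""
    return sum(1 for a, b in zip(order, order[1:]) if a > b)


# Illustrative (assumed) one-way latencies in arbitrary time units.
latency = {"path0": 5, "path1": 1}

# Conventional round-robin: consecutive commands alternate between paths.
round_robin = arrival_order(latency, lambda i: f"path{i % 2}")
# Single active path: every command for the disk takes the same route.
single_path = arrival_order(latency, lambda i: "path0")

print("round robin:", round_robin, out_of_order_count(round_robin))
print("single path:", single_path, out_of_order_count(single_path))
```

With a single active path the commands arrive in issue order, whereas alternating between paths reorders them whenever the path latencies differ, which is what forces extra seeks at the disk drive.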

In another class of embodiments, a server that implements an embodiment of the inventive server agent also implements a user interface. During typical operation of such a server with a display device coupled to it, the user interface causes the display device to display indications of the operation or status of the server agent and/or of reports received or determinations made by the server agent. For example, the display may indicate the status of the server agent's monitoring of server interface traffic and/or bandwidth, reports received from adapter agents (e.g., regarding adapter interface status and available bandwidth), and determinations that a current storage device access path should or should not be changed.

Advantages and features of typical embodiments of the invention include the following:

1. In accordance with typical embodiments, storage data traffic on a converged network is balanced in a fully decentralized manner: the communication performed to implement the balancing occurs only between the endpoints of each data path between an adapter and a server (e.g., server 1 and adapter 5 of FIG. 1, or server 21 and adapter 25 of FIG. 2), not between servers, between adapters, or from an adapter to two or more servers. Failure of any participant (e.g., a server interface, server agent, adapter interface, or adapter agent) affects only the paths of which that participant is a member. In general, there is only one-to-one communication between any server agent and an adapter agent (e.g., a server agent does not share such communication with more than one adapter agent). In contrast, conventional methods for balancing storage data traffic among multiple storage devices and multiple servers have not been decentralized in this way.

2. The communication required to implement storage traffic balancing occurs only between the endpoints of each data path between an adapter and a server (e.g., server 1 and adapter 5 of FIG. 1, or server 21 and adapter 25 of FIG. 2). The number of connections between servers and adapters is therefore bounded by the number of storage devices (e.g., disks) associated with the paths between them. Consequently, even in a very large data center with thousands of servers and adapters, the computational load on each server and adapter, and the network load, required to implement typical embodiments of the invention are small.

3. Bandwidth is not pre-reserved or locked for storage data traffic. Failure of any participant (i.e., a server interface, server agent, adapter interface, or adapter agent) is therefore immediately reflected in the overall statistics, and the resources that the participant was using before the failure automatically become available to the remaining devices. If the failed device or devices later return, typical embodiments of the inventive method will cause other servers to redirect traffic away from the path or paths used by the recovered device(s) if that traffic causes an overload.

4. Even when servers send planned additional bandwidth usage notifications to an adapter, the adapter agent (implemented in the adapter) typically reduces over time (i.e., "ages") the planned additional bandwidth usage value indicated by each notification. Aging typically reduces the aged planned additional bandwidth usage value for each adapter interface (to zero) relatively quickly. A planned additional bandwidth usage notification that does not soon result in additional observed storage data traffic is therefore quickly ignored.

5. A data path selected by a server that causes a temporary overload is typically changed (i.e., replaced by a new data path to the same storage device) within a very short time.

6. The process of announcing each server's intention to begin using a new path (i.e., each server agent sending a planned additional bandwidth usage notification to the adapter agent of each adapter that would be affected by actual occurrence of the indicated planned additional bandwidth usage) prevents many servers from making the same decision at nearly the same time. That is, it substantially prevents the conflicts that could otherwise result from nearly simultaneous path decisions based solely on historical data. Without it, all servers could see statistics indicating a lightly loaded interface, and all could redirect paths to that interface, causing a severe overload condition.

7. The use of random cycles (e.g., in embodiments in which a server agent, after causing the server to change the routing of storage data traffic between the server and a storage device to a newly selected path, waits for a randomly determined time interval so that the effects of its most recent action can be reflected in the monitoring statistics before it begins to reevaluate paths to the storage device) prevents servers from operating in lock step, further avoiding simultaneous conflicting decisions.

8. If the network becomes fully utilized (i.e., all interfaces are overloaded), so that there is no opportunity to redirect storage traffic, then in typical embodiments all servers and adapters revert to predetermined "preferred" data paths between servers and adapters. This means that futile redirection attempts are not made. In addition, if the preferred data paths have been chosen so as to statically balance all data traffic, they should constitute the optimal configuration in a fully loaded network.

9. No application (e.g., an application implemented by subsystem 4 of server 1 or subsystem 10 of server 3 of FIG. 1, or by subsystem 26 of server 21 of FIG. 2) needs to be changed to obtain the benefits of typical embodiments of the invention. Typically, each server agent and adapter agent operates in a manner invisible to applications, and programs and devices that use any of the involved interfaces benefit from the storage data load balancing (including those performing only data communication operations).
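Items 4 and 7 above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the text says only that planned additional bandwidth usage values are reduced over time (to zero) relatively quickly, so the linear decay and the 10-second window are assumptions, as are the names `PlannedUsageTracker` and `random_wait`. The clamped normal draw follows the form of random wait recited in the claims (a normal variate subject to a minimum and a maximum wait); the numeric parameters are illustrative.

```python
import random


class PlannedUsageTracker:
    """Aged planned-additional-bandwidth values for one adapter interface."""

    def __init__(self, window_s: float = 10.0):
        self.window_s = window_s      # assumed time for a notice to age to zero
        self.notices = []             # list of (bits_per_s, time_received)

    def notify(self, bits_per_s: float, now: float) -> None:
        self.notices.append((bits_per_s, now))

    def aged_sum(self, now: float) -> float:
        # Linear decay (an assumption): full value on receipt, zero after window_s.
        total = 0.0
        for value, received in self.notices:
            remaining = max(0.0, 1.0 - (now - received) / self.window_s)
            total += value * remaining
        # Drop notices that have fully aged out.
        self.notices = [(v, t) for v, t in self.notices if now - t < self.window_s]
        return total


def random_wait(rng: random.Random, mean_s: float, stddev_s: float,
                min_s: float, max_s: float) -> float:
    """Normally distributed wait, clamped to [min_s, max_s]."""
    return min(max_s, max(min_s, rng.gauss(mean_s, stddev_s)))


tracker = PlannedUsageTracker(window_s=10.0)
tracker.notify(100e6, now=0.0)        # a server announces +100 Mb/s
print(tracker.aged_sum(now=5.0))      # notice is half-aged
print(tracker.aged_sum(now=10.0))     # notice has fully aged out

rng = random.Random(0)
print(random_wait(rng, mean_s=60.0, stddev_s=20.0, min_s=30.0, max_s=120.0))
```

Because each server draws its own wait, servers that rerouted at about the same time are unlikely to reevaluate their paths at the same moment.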

Other aspects of the invention are: an adapter programmed or otherwise configured to implement an embodiment of the inventive adapter agent (e.g., any of adapters 5, 7, 9, and 11 of FIG. 1, or any of adapters 25, 27, 29, and 31 of FIG. 2); a disk drive (or other storage device) integrated with such an adapter (e.g., an implementation of storage subsystem 15 as a disk drive, integrated with adapter 9 (and adapter 11) as a single device 100, as shown in FIG. 1); a JBOD (or other storage device system) integrated with such an adapter (e.g., an implementation of storage subsystem 33 as a JBOD, integrated with adapter 29 (and adapter 31) as a single device 101, as shown in FIG. 2); a server programmed or otherwise configured to implement an embodiment of the inventive server agent (e.g., either of servers 1 and 3 of FIG. 1, or server 21 of FIG. 2); a hardware implementation of an embodiment of the inventive server agent (e.g., agent 6 of FIG. 1, implemented in hardware); and a hardware implementation of an embodiment of the inventive adapter agent (e.g., agent 14 of FIG. 1, implemented in hardware).

Other aspects of the invention are methods performed in operation of any embodiment of the inventive system, adapter, storage device, JBOD, server, or other device. One such method includes the steps of:

asserting a request over a converged network from a server to an adapter, wherein the server is configured to include a server agent and the adapter is configured to include an adapter agent;

employing the server agent to identify at least one adapter interface overload indication asserted (i.e., supplied) by the adapter agent to a server interface of the server in response to the request, wherein the adapter interface overload indication indicates whether an adapter interface of the adapter is overloaded; and

for a path that includes the server interface and over which the server accesses at least one storage device via the adapter, employing the server agent to determine whether the path is overloaded, in a manner that uses the adapter interface overload indication.
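The three method steps above can be sketched in Python. This is a minimal in-process illustration: the report format, the class names, and the rule that a path is overloaded when either of its endpoints is overloaded are assumptions, since the text requires only that the determination use the adapter interface overload indication. The minimum rule for path bandwidth follows the claims, which define the available bandwidth on a path as the minimum of the server interface's and the adapter interface's available bandwidth.

```python
from dataclasses import dataclass


@dataclass
class InterfaceReport:
    """One adapter interface's entry in a hypothetical adapter agent report."""
    overloaded: bool
    available_bps: float


class AdapterAgent:
    """In-process stand-in; a real agent would answer over the network."""

    def __init__(self, reports: dict):
        self._reports = reports

    def handle_request(self) -> dict:
        # Respond to a server agent's request with per-interface indications.
        return dict(self._reports)


class ServerAgent:
    def __init__(self, server_if_overloaded: dict, server_if_available_bps: dict):
        self.server_if_overloaded = server_if_overloaded
        self.server_if_available_bps = server_if_available_bps

    def path_is_overloaded(self, server_if, adapter_if, adapter_agent) -> bool:
        report = adapter_agent.handle_request()      # step 1: assert a request
        indication = report[adapter_if].overloaded   # step 2: identify indication
        # Step 3 (assumed rule): overloaded if either endpoint is overloaded.
        return self.server_if_overloaded[server_if] or indication

    def path_available_bps(self, server_if, adapter_if, adapter_agent) -> float:
        # Path bandwidth is limited by the tighter of its two endpoints.
        report = adapter_agent.handle_request()
        return min(self.server_if_available_bps[server_if],
                   report[adapter_if].available_bps)


adapter = AdapterAgent({
    "if0": InterfaceReport(overloaded=True, available_bps=1e9),
    "if1": InterfaceReport(overloaded=False, available_bps=4e9),
})
server = ServerAgent({"eth0": False}, {"eth0": 6e9})
print(server.path_is_overloaded("eth0", "if0", adapter))
print(server.path_is_overloaded("eth0", "if1", adapter))
print(server.path_available_bps("eth0", "if1", adapter))
```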

It should be understood that while certain forms of the invention have been illustrated and described herein, the invention is not limited to the specific embodiments described and shown or the specific methods described. Claims that describe methods do not imply any specific order of steps unless explicitly stated in the claim language.

Claims

1. An adapter configured for use in a system comprising at least one server coupled to a converged network via at least one server interface, and at least one storage device, wherein the server includes a server agent, and the adapter comprises:

At least one port, the at least one port being configured to couple the storage device to the adapter;

At least one adapter interface, the at least one adapter interface being configured to couple the adapter to the network, and thereby to couple the storage device to the network via the adapter when the storage device is coupled to the at least one port; and

An adapter agent, wherein the adapter agent is coupled and configured to:

Determine whether each adapter interface is overloaded, and generate an adapter interface overload indication for each adapter interface, wherein the adapter interface overload indication for each adapter interface indicates whether the adapter interface is overloaded; and

In response to a request from the server agent, cause the adapter to assert, to at least one of the adapter interfaces, data indicative of at least one of the adapter interface overload indications.

2. The adapter of claim 1, wherein the adapter agent is further coupled and configured to:

Monitor the data traffic occurring on each of the adapter interfaces and generate a bandwidth consumption indication for each of the adapter interfaces, wherein the bandwidth consumption indication for each of the adapter interfaces indicates the bandwidth consumed by the adapter interface;

Generate an available bandwidth indication for each of the adapter interfaces, wherein the available bandwidth indication for each of the adapter interfaces indicates the available bandwidth of the adapter interface; and

In response to the request from the server agent, cause the adapter to assert, to at least one of the adapter interfaces, data indicative of:

At least one of the adapter interface overload indications, and at least one of the bandwidth consumption indications and/or at least one of the available bandwidth indications.

3. The adapter of claim 2, wherein the adapter agent is further coupled and configured to filter raw overload indication values to generate filtered overload values, wherein the raw overload indication values indicate identified overloads, and the filtered overload values indicate whether the identified overloads are persistent; and, in response to the request from the server agent, to cause the adapter to assert, to at least one of the adapter interfaces, data indicative of the filtered overload values.

4. The adapter of claim 1, wherein the adapter agent is further coupled and configured to estimate the adapter's ability to process additional data.

5. The adapter of claim 1, wherein the adapter agent is coupled and configured to generate an available bandwidth indication for each of the adapter interfaces, wherein the available bandwidth indication for each of the adapter interfaces indicates the available bandwidth of the adapter interface, including by:

Aging each planned additional bandwidth usage value received from at least one of the server agents for one of the adapter interfaces, thereby generating an aged planned bandwidth usage value for that adapter interface, and maintaining, for each of the adapter interfaces, a sum of the aged planned bandwidth usage values for that adapter interface.

6. The adapter of claim 5, wherein the adapter agent is coupled and configured to generate the available bandwidth indication for each of the adapter interfaces based on: the full available bandwidth of the adapter interface, at least one measurement of the bandwidth consumed by the adapter interface, an indication of the adapter's ability to process additional data, and the sum of the aged planned bandwidth usage values for the adapter interface.

7. A system for a converged network, comprising the adapter as claimed in claim 1, and further comprising:

At least one server, the at least one server having at least one server interface, wherein the server is configured to include a server agent and is coupled to a converged network via the server interface; and

At least one storage device configured to be coupled to the adapter, such that the adapter couples the storage device to the network via the adapter interface,

wherein the server agent is coupled and configured to:

Cause the server to assert a request to the adapter agent, and identify at least one adapter interface overload indication provided to the server by the adapter agent in response to the request; and

For a path that includes the server interface and through which the server accesses the storage device via the adapter, determine whether the path is overloaded in a manner that uses the adapter interface overload indication.

8. The system of claim 7, wherein the server agent is configured to respond to a determination that the path is overloaded, including by:

Determining whether to select a new path to the storage device for subsequent use, and

Upon determining that the new path should be selected, causing the server to change the routing of storage data traffic between the server and the storage device to the new path.

9. The system of claim 8, wherein the server agent is coupled and configured to: wait, after causing the server to change the routing of storage data traffic between the server and the storage device to the new path, for a time interval long enough that the effect of the change to the new path is reflected in the results of each adapter agent's ongoing monitoring of traffic on each adapter interface of its adapter; and, after the wait, begin evaluating paths to the storage device, including at least one path other than the new path.

10. The system of claim 9, wherein the time interval of the wait is determined by a random number selected as a normal variate over a selected interval, subject to a predetermined minimum wait and a predetermined maximum wait.

11. The system of claim 7, wherein the adapter is a first adapter, the first adapter including at least one first adapter interface and a first adapter agent, and wherein the system comprises:

A second adapter configured to couple the storage device to the network, wherein the second adapter includes at least one second adapter interface and a second adapter agent, and the server agent is coupled and configured to:

Monitor data traffic occurring on each of the server interfaces to determine the bandwidth consumed by each of the server interfaces, and determine the available bandwidth of each of the server interfaces from the bandwidth consumed;

Identify at least one available bandwidth indication provided to the server by the first adapter agent in response to a request asserted from the server to the first adapter, wherein each available bandwidth indication indicates the available bandwidth of one first adapter interface, and identify at least one additional available bandwidth indication provided to the server by the second adapter agent in response to a request asserted from the server to the second adapter, wherein each additional available bandwidth indication indicates the available bandwidth of one second adapter interface; and

Determine the available bandwidth on a path that includes the server interface and one of the second adapter interfaces as the minimum of the available bandwidth of the server interface and the available bandwidth of that second adapter interface.

12. A device for a converged network, configured for use in a system including at least one server coupled to the converged network via at least one server interface, wherein the server includes a server agent, wherein the device is a storage device with an integrated adapter, and includes:

A data storage subsystem; and

An adapter subsystem, coupled to the data storage subsystem, wherein the adapter subsystem implements the adapter, and the adapter subsystem includes:

At least one adapter interface configured to couple the adapter subsystem to the network, and thereby couple the data storage subsystem to the network via the adapter subsystem; and

An adapter agent, wherein the adapter agent is coupled and configured to:

Determine whether each of the adapter interfaces is overloaded, and generate an adapter interface overload indication for each of the adapter interfaces, wherein the adapter interface overload indication for each of the adapter interfaces indicates whether the adapter interface is overloaded; and

In response to a request from the server agent, cause the adapter subsystem to assert, to at least one of the adapter interfaces, data indicative of at least one of the adapter interface overload indications.

13. The device of claim 12, wherein the device is a storage device with an integrated adapter, the adapter subsystem implementing the adapter, and the data storage subsystem implementing the storage device.

14. The device of claim 13, wherein the storage device is a disk drive.

15. The device of claim 12, wherein the device is a JBOD with an integrated adapter, the adapter subsystem implementing the adapter, the data storage subsystem implementing the JBOD, and the JBOD includes a set of disk drives.

16. A server configured for use in a system including at least one storage device and at least one adapter coupled to said storage device, wherein the adapter has at least one adapter interface coupled to a converged network, the adapter coupling the storage device to the network via said adapter interface, and the adapter is configured to include an adapter agent, the server comprising:

A processing subsystem, configured to include a server agent; and

At least one server interface, the at least one server interface being configured to be coupled to the network, wherein the processing subsystem is coupled to the server interface and configured to access the network via the server interface when the server interface is coupled to the network, and wherein the server agent is coupled and configured to:

Cause the processing subsystem to assert a request to the adapter agent, and identify at least one adapter interface overload indication provided by the adapter agent to the server interface in response to the request; and

17. The server of claim 16, wherein the server agent is configured to respond to a determination that the path is overloaded, including by:

Determining whether to select a new path to the storage device for subsequent use by the server, and

Upon determining that the new path should be selected, causing the processing subsystem to change the routing of storage data traffic between the server and the storage device to the new path.

18. The server of claim 17, wherein the server agent is coupled and configured to: wait, after causing the processing subsystem to change the routing of storage data traffic between the server and the storage device to the new path, for a time interval long enough that the effect of the change to the new path is reflected in the results of each adapter agent's ongoing monitoring of traffic on each adapter interface of its adapter; and, after the wait, begin evaluating paths to the storage device, including at least one path other than the new path.

19. The server of claim 18, wherein the time interval of the wait is determined by a random number selected as a normal variate over a selected interval, subject to a predetermined minimum wait and a predetermined maximum wait.

20. The server of claim 16, wherein the system includes a first adapter configured to couple the storage device to the network and a second adapter configured to couple the storage device to the network, wherein the first adapter includes at least one first adapter interface and the second adapter includes at least one second adapter interface, the first adapter includes a first adapter agent and the second adapter includes a second adapter agent, wherein the server includes at least a first server interface and a second server interface, and wherein the server agent is coupled and configured to:

Monitor data traffic occurring on each of the server interfaces to determine the bandwidth consumed by each of the server interfaces, and determine the available bandwidth of each of the server interfaces from the bandwidth consumed by that server interface;

Identify at least one available bandwidth indication provided by the first adapter agent to the first server interface in response to a request asserted from the processing subsystem to the first adapter, wherein each available bandwidth indication indicates the available bandwidth of one first adapter interface, and identify at least one additional available bandwidth indication provided by the second adapter agent to the second server interface in response to a request asserted from the processing subsystem to the second adapter, wherein each additional available bandwidth indication indicates the available bandwidth of one second adapter interface; and

Determine the available bandwidth on a path that includes the second server interface and the second adapter interface as the minimum of the available bandwidth of the second server interface and the available bandwidth of the second adapter interface.

21. The server of claim 16, wherein the adapter agent is coupled and configured to generate an available bandwidth indication for each of the adapter interfaces of the adapter.

Furthermore, the server agent is coupled and configured to:

Identify at least one of the available bandwidth indications provided to the server interface by the adapter agent in response to the request; and

Access the path including the server interface and at least one of the adapter interfaces in a manner that uses the available bandwidth indication.

22. A method for a converged network, comprising the following steps:

Asserting a request from a server to an adapter over a converged network, wherein the server is configured to include a server agent and the adapter is configured to include an adapter agent;

Using the server agent to identify at least one adapter interface overload indication provided by the adapter agent to a server interface of the server in response to the request, wherein the adapter interface overload indication indicates whether an adapter interface of the adapter is overloaded; and

For a path that includes the server interface and through which the server accesses at least one storage device via the adapter, using the server agent to determine whether the path is overloaded in a manner that uses the adapter interface overload indication.

23. The method of claim 22, further comprising the step of: responding to the determination of path overload using the server agent, including by:

24. The method of claim 23, comprising the following steps:

(a) After rerouting storage data traffic between the server and the storage device to the new path, wait for a sufficient time interval such that the impact of the rerouting to the new path is reflected in the results of continuous monitoring of traffic on each adapter interface of each of the adapter agents coupled to the network; and

(b) After step (a), the server agent is used to evaluate the path to the storage device, including at least one path other than the new path.

25. The method of claim 22, further comprising the step of:

Using the server agent to monitor data traffic occurring on each server interface of the server to determine the bandwidth consumed by each server interface, and to determine the available bandwidth of each server interface from the bandwidth consumed by that server interface;

Using the server agent to identify at least one available bandwidth indication provided to the server by a first adapter agent of a first adapter coupled to the network in response to a request asserted by the server to the first adapter, wherein each available bandwidth indication indicates the available bandwidth of an adapter interface of the first adapter, and to identify at least one additional available bandwidth indication provided to the server by a second adapter agent of a second adapter coupled to the network in response to a request asserted by the server to the second adapter, wherein each additional available bandwidth indication indicates the available bandwidth of an adapter interface of the second adapter; and

Using the server agent to determine the available bandwidth on a path that includes one of the server interfaces and the adapter interface of the second adapter as the minimum of the available bandwidth of the server interface and the available bandwidth of the adapter interface of the second adapter.

26. The method of claim 22, further comprising the step of:

The adapter agent is used to monitor data traffic occurring on each adapter interface of the adapter and to generate a bandwidth consumption indication for each adapter interface, wherein the bandwidth consumption indication for each adapter interface indicates the bandwidth consumed by the adapter interface;

The adapter agent is used to generate an available bandwidth indication for each of the adapter interfaces, wherein the available bandwidth indication for each adapter interface indicates the available bandwidth of the adapter interface; and

The adapter agent is used to cause the adapter to report at least one adapter interface overload indication, at least one bandwidth consumption indication, and/or at least one available bandwidth indication to the server agent.

27. An adapter configured for use in a system comprising at least one server coupled to a converged network via at least one server interface, and at least one storage device, wherein the server includes a server agent, and the adapter includes:

An adapter agent, wherein the adapter agent is coupled and configured to:

Monitor data traffic occurring on each of the adapter interfaces of the adapter, and generate a bandwidth consumption indication for each adapter interface, wherein the bandwidth consumption indication for each adapter interface indicates the bandwidth consumed by the adapter interface;

Generate an available bandwidth indication for each of the adapter interfaces, wherein the available bandwidth indication for each adapter interface indicates the available bandwidth of the adapter interface; and

In response to a request from the server agent, cause the adapter to assert, to at least one of the adapter interfaces, data indicating at least one bandwidth indication, wherein the at least one bandwidth indication is at least one of the consumed bandwidth indications, or at least one of the available bandwidth indications, or at least one of the consumed bandwidth indications and at least one of the available bandwidth indications.
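As a rough illustration of the adapter agent behavior recited in claim 27, the sketch below counts traffic per adapter interface, derives a consumed bandwidth indication and an available bandwidth indication, and returns both in response to a request. The class name, the method names, the fixed measurement window, and the simple "capacity minus consumption" rule are all assumptions made for this example; the claims do not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class InterfaceStats:
    capacity_bps: float   # full bandwidth of the interface (assumed known)
    bytes_seen: int = 0   # bytes observed in the current measurement window


class AdapterAgentSketch:
    """Hypothetical adapter agent; names and policy are illustrative only."""

    def __init__(self, capacities_bps: dict):
        self.stats = {name: InterfaceStats(cap)
                      for name, cap in capacities_bps.items()}

    def observe(self, interface: str, nbytes: int) -> None:
        # Monitor data traffic occurring on one adapter interface.
        self.stats[interface].bytes_seen += nbytes

    def consumed_bps(self, interface: str, window_s: float) -> float:
        # Bandwidth consumption indication: bits observed per second.
        return self.stats[interface].bytes_seen * 8 / window_s

    def available_bps(self, interface: str, window_s: float) -> float:
        # Available bandwidth indication (assumed: capacity minus consumption).
        s = self.stats[interface]
        return max(0.0, s.capacity_bps - self.consumed_bps(interface, window_s))

    def handle_request(self, window_s: float) -> dict:
        # Response to a server agent request: both indications per interface.
        return {
            name: {
                "consumed_bps": self.consumed_bps(name, window_s),
                "available_bps": self.available_bps(name, window_s),
            }
            for name in self.stats
        }


agent = AdapterAgentSketch({"if0": 10e9})
agent.observe("if0", 125_000_000)            # 1 Gbit observed in the window
report = agent.handle_request(window_s=1.0)
```

A real agent would reset or slide its measurement window; that bookkeeping is omitted here to keep the example short.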

28. The adapter of claim 27, wherein the adapter agent is further coupled and configured to estimate the adapter's ability to process additional data.

29. The adapter of claim 27, wherein the adapter agent is coupled and configured to generate the available bandwidth indication for each of the adapter interfaces, including by:

Aging each planned additional bandwidth usage value received from at least one server agent for one of the adapter interfaces, thereby generating an aged planned bandwidth usage value for the adapter interface, and maintaining, for each of the adapter interfaces, a sum of the aged planned bandwidth usage values for the adapter interface.

30. The adapter of claim 29, wherein the adapter agent is coupled and configured to generate the available bandwidth indication for each of the adapter interfaces based on: the full available bandwidth of the adapter interface, at least one measurement of the bandwidth consumed on the adapter interface, an indication of the adapter's ability to process additional data, and the sum of the aged planned bandwidth usage values for the adapter interface.
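Claims 29 and 30 name the inputs to the available bandwidth indication but not how they are aged or combined. The following sketch shows one possible reading, under loudly stated assumptions: planned usage values decay exponentially with a hypothetical half-life, the adapter's ability to process additional data is expressed as a headroom in the same units, and the indication is the smaller of the link headroom and the processing headroom, less the sum of aged planned values. None of these choices comes from the claims.

```python
def aged_value(planned_bps: float, age_s: float, half_life_s: float) -> float:
    """Assumed aging rule: exponential decay with the given half-life."""
    return planned_bps * 0.5 ** (age_s / half_life_s)


def available_bandwidth(full_bps: float,
                        consumed_bps: float,
                        processing_headroom_bps: float,
                        planned: list,
                        now_s: float,
                        half_life_s: float) -> float:
    """Combine the four inputs named in claim 30 (combination rule assumed).

    `planned` is a list of (planned_bps, reported_at_s) pairs received from
    server agents for this adapter interface.
    """
    aged_sum = sum(aged_value(value, now_s - reported_at, half_life_s)
                   for value, reported_at in planned)
    # The interface cannot offer more than the link or the adapter can handle.
    raw = min(full_bps - consumed_bps, processing_headroom_bps)
    return max(0.0, raw - aged_sum)


# Two planned-usage reports of 100 units each, one 10 s old and one fresh.
result = available_bandwidth(
    full_bps=1000.0,
    consumed_bps=400.0,
    processing_headroom_bps=500.0,
    planned=[(100.0, 0.0), (100.0, 10.0)],
    now_s=10.0,
    half_life_s=10.0,
)
```

Aging the planned values lets a server's announced intent count against the interface until measured consumption catches up, after which the announcement fades instead of being counted twice.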

31. A system for a converged network, comprising the adapter as claimed in claim 27, and further comprising:

The server agent, wherein the server agent is coupled and configured to:

Cause the server to assert a request to the adapter agent, and identify at least one bandwidth indication provided to the server by the adapter agent in response to the request; and

Detect an imbalance of traffic on the network using the at least one bandwidth indication provided to the server by the adapter agent in response to the request.

32. The system of claim 31, wherein the server agent is coupled and configured to: in response to detecting the imbalance, cause the server to redirect storage data traffic on the network from a data path between the server and the storage device to a new data path between the server and the storage device.
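The redirection recited in claim 32 can be pictured with a small sketch in which a server agent compares the available bandwidth on its current path with that on the alternatives and switches only when the gain is large enough. The function name, the `min_gain_bps` threshold, and the use of a simple difference as the imbalance test are assumptions for illustration; the claims do not define how an imbalance is detected.

```python
def choose_path(current: str, available_bps: dict, min_gain_bps: float) -> str:
    """Return the path to use next (hypothetical imbalance rule).

    An imbalance is assumed to exist when some other path offers at least
    `min_gain_bps` more available bandwidth than the current path.
    """
    best = max(available_bps, key=available_bps.get)
    if best != current and available_bps[best] - available_bps[current] >= min_gain_bps:
        return best       # redirect storage data traffic to the new data path
    return current        # no significant imbalance: stay on the current path


# Hypothetical per-path available bandwidth, in Gbit/s.
new_path = choose_path("path_a", {"path_a": 1.0, "path_b": 5.0}, min_gain_bps=2.0)
same_path = choose_path("path_a", {"path_a": 4.0, "path_b": 5.0}, min_gain_bps=2.0)
```

Requiring a minimum gain is one way to avoid moving traffic back and forth between paths whose available bandwidth differs only marginally.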

33. The system of claim 32, wherein the server agent is coupled and configured to: wait for a time interval of sufficient duration after causing the server to redirect storage data traffic between the server and the storage device from the one data path to the new data path, such that the effect of the redirection to the new data path is reflected in the results of continuous monitoring of traffic on each adapter interface of the adapter agent by each adapter agent; and after the wait, begin evaluating paths to the storage device, including at least one path other than the new data path.

34. The system of claim 33, wherein the time interval of the wait is determined by a random number selected as a normal variate of a selected interval, subject to predetermined minimum and maximum waits.
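One way to read the wait of claim 34 is as a normally distributed draw centered on the selected interval and clamped to the predetermined minimum and maximum. The sketch below follows that reading. The function name, the `sigma_fraction` parameter (standard deviation as a fraction of the selected interval), and its default value are assumptions; the claim does not specify the spread of the distribution.

```python
import random


def settle_wait_s(selected_interval_s: float,
                  min_wait_s: float,
                  max_wait_s: float,
                  sigma_fraction: float = 0.25,
                  rng: random.Random = None) -> float:
    """Draw a normal variate around the selected interval, then clamp it."""
    rng = rng or random.Random()
    draw = rng.gauss(selected_interval_s, sigma_fraction * selected_interval_s)
    # Enforce the predetermined minimum and maximum waits.
    return min(max_wait_s, max(min_wait_s, draw))


# Example: waits centered on 5 s, never shorter than 2 s or longer than 8 s.
waits = [settle_wait_s(5.0, 2.0, 8.0, rng=random.Random(seed))
         for seed in range(100)]
```

Randomizing the wait makes it less likely that several servers re-evaluate their paths at the same instant and all move to the same lightly loaded interface.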

35. The system of claim 31, wherein the adapter is a first adapter, the first adapter including at least one first adapter interface and a first adapter agent, and wherein the system comprises:

36. A device for a converged network, configured for use in a system including at least one server coupled to the converged network via at least one server interface, wherein the server includes a server agent, wherein the device is a storage device with an integrated adapter, and includes:

A data storage subsystem; and

An adapter agent, wherein the adapter agent is coupled and configured to:

37. The device of claim 36, wherein the device is a storage device with an integrated adapter, the adapter subsystem implementing the adapter, and the data storage subsystem implementing the storage device.

38. The device of claim 37, wherein the storage device is a disk drive.

39. The device of claim 36, wherein the device is a JBOD with an integrated adapter, the adapter subsystem implementing the adapter, the data storage subsystem implementing the JBOD, and the JBOD includes a set of disk drives.

40. A server configured for use in a system including at least one storage device and at least one adapter coupled to said storage device, wherein the adapter has at least one adapter interface coupled to a converged network, the adapter coupling the storage device to the network via said adapter interface, and the adapter is configured to include an adapter agent, the server comprising:

A processing subsystem, configured to include a server agent; and

The processing subsystem asserts a request to the adapter agent and identifies at least one bandwidth indication provided to the server by the adapter agent in response to the request, wherein the at least one bandwidth indication is at least one consumed bandwidth indication, or at least one available bandwidth indication, or at least one consumed bandwidth indication and at least one available bandwidth indication; and

41. The server of claim 40, wherein the server agent is coupled and configured to: in response to detecting the imbalance, cause the server to redirect storage data traffic on the network from a data path between the server and the storage device to a new data path between the server and the storage device.

42. The server of claim 41, wherein the server agent is coupled and configured to: wait for a time interval of sufficient duration after causing the server to redirect storage data traffic between the server and the storage device from the one data path to the new data path, such that the effect of the redirection to the new data path is reflected in the results of continuous monitoring of traffic on each adapter interface of the adapter agent by each adapter agent; and after the wait, begin evaluating paths to the storage device, including at least one path other than the new data path.

43. The server of claim 42, wherein the time interval of the wait is determined by a random number selected as a normal variate of a selected interval, subject to predetermined minimum and maximum waits.

44. The server of claim 40, wherein the system includes a first adapter configured to couple the storage device to the network and a second adapter configured to couple the storage device to the network, wherein the first adapter includes at least one first adapter interface and the second adapter includes at least one second adapter interface, the first adapter includes a first adapter agent and the second adapter includes a second adapter agent, wherein the server includes at least a first server interface and a second server interface, and wherein the server agent is coupled and configured to:

45. The server of claim 40, wherein the adapter agent is coupled and configured to generate an available bandwidth indication for each of the adapter interfaces of the adapter.

Furthermore, the server agent is coupled and configured to:

46. A method for a converged network, comprising the following steps:

Using the server agent to identify at least one bandwidth indication provided to the server by the adapter agent in response to the request, wherein the at least one bandwidth indication is at least one consumed bandwidth indication, or at least one available bandwidth indication, or at least one consumed bandwidth indication and at least one available bandwidth indication; and

47. The method of claim 46, further comprising the step of: using the server agent to redirect storage data traffic on the network from a data path between the server and the storage device to a new data path between the server and the storage device in response to detecting the imbalance.

48. The method of claim 47, further comprising the steps of: employing the server agent to wait for a sufficiently long time interval after the server redirects storage data traffic between the server and the storage device from the one data path to the new data path, such that the impact of the redirection to the new data path is reflected in the results of continuous monitoring of traffic on each adapter interface of the adapter agent by each adapter agent; and after the wait, beginning to evaluate paths to the storage device, including at least one path other than the new data path.

49. The method of claim 46, further comprising the step of: