EP1751668A2 - Methods and systems for history analysis and predictive change management for access paths in networks - Google Patents
Methods and systems for history analysis and predictive change management for access paths in networks
Info
- Publication number
- EP1751668A2
- Authority
- EP
- European Patent Office
- Prior art keywords
- access path
- network
- event
- change
- access
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0894—Policy-based network configuration management
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L43/00—Arrangements for monitoring or testing data switching networks
- H04L43/02—Capturing of monitoring data
- H04L43/026—Capturing of monitoring data using flow identification
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
- H04L67/1097—Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/40—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection
Definitions
- the invention is directed to methods and systems for constructing and analyzing logical end-to-end access path histories and predictive change management of these access paths in data networks.
- the methods and systems can be advantageous for managing access paths in storage area networks (SAN).
- Data networks are employed to transmit messages and data from a network appliance which initiates a network event, such as a query or a data transfer, subsequently also referred to as an initiator, application, server or host, to another network appliance which can respond to the event, for example, a data storage device.
- defined access paths between the network appliances may have to conform to an access path policy.
- the defined access paths are physical and logical pathways, which can include the initiators, their particular components and ports, such as Host Bus Adapters (HBA), a switch fabric with switches, routers and the like, and end devices, such as physical storage devices, containing Logical Unit Numbers (LUN).
- the state of each of the components has to be properly logically configured to enable appropriate flow along the pathway.
- the physical pathways typically have to comply with a policy, also referred to as access path policy, which includes policy attributes, such as path redundancy, path connectivity characteristics, and the like.
- One example of a data network with defined logical access path is a storage area network which enables multiple applications on servers access to data stored in consolidated, shared storage infrastructures.
- Enterprises increasingly deploy large-scale, complex networks to gain economies-of-scale business benefits, and are performing and planning extensive business-critical migration processes to these new environments.
- Data networks are constantly undergoing changes, upgrades and expansion, which increases their complexity.
- the number of components and links which may be associated with the data transfer between a given initiator and one or more of its data network appliances may increase exponentially with the size of the network.
- the physical and logical set-up can include multiple actions (sometimes tens for a single logical change), which need to be set up at different locations and on different device types, with perfect mutual consistency.
- time-based sequences of consistent global snapshots of the network are constantly constructed and maintained for future reference.
- planned changes in devices and device configurations of the devices connected to the network fabric and to the network fabric itself are analyzed and mapped to a consistent global snapshot of the network.
- the predictive change management process includes both pre-validation consistency checks before actions are taken, and post-validation consistency checks after the actions are taken. The consistency checks consider the current state of the access paths in the network, the current state of the access path policy, and the set of new events, planned or executed, to determine conformance, or to establish a violation and its root cause.
- a consistent global snapshot of a network is a representation which correctly reflects the actual status of all of the end-to-end access paths in the network at a particular point in time that are consistent with or conform to a defined network access path policy.
- the network access path policy specifies which access paths in the network between an initiator, such as an application in a SAN, and a data appliance, such as a LUN in a SAN storage appliance, should not exist, which should exist, and what should be the end-to-end attributes of each access path.
- the status of each access path includes high level path attributes derived from the physical and logical characteristics of the components which provide the access relationship between a given initiator and a data appliance.
- a management server automatically collects information about events and from devices distributed across the network, before and/or after they are implemented, using a variety of non-intrusive mechanisms. It identifies violations of actual access paths relative to the required access paths as determined by the policy. It provides notifications about violations, with all their relevant context information, to an appropriate target recipient.
- the management server, using network-customized graph-based algorithms, analyzes the information and its impact on network access paths and compliance with the access path policy.
- the invention provides a process for constructing and analyzing an access path event history in a network, which includes detecting events of one or more components in the network, determining a logical order or a temporal order, or both, of the events; and generating an event sequence based on the logical or temporal order.
- the invention provides a process for constructing and analyzing an access path event history in a network, which includes detecting at least one event causing nonconformance with a network access path policy, determining a root cause for the nonconformance, reconfiguring network resources based on the root cause to bring the reconfigured network in conformance with the network access path policy, and validating the reconfigured network.
- the invention provides a process for updating logical access paths in a network, which includes detecting at least one component event in the network, checking conformance of a configuration change caused by the at least one component event with a network access path policy, and if the configuration change is in conformance, validating the logical access path.
- the invention provides a process for predictive change management of access paths in a network, which includes specifying one or more planned change tasks, pre-validating the one or more planned change tasks according to an access path policy of the network, implementing the one or more planned change tasks, tracking implementation of the one or more changes, and post-validating the implemented changes for conformance with the access path policy.
- the invention provides a process for managing an access path change in a network, which includes specifying a change in an access path policy in the network, associating with the specified change at least one component event, determining an effect of the at least one component event by evaluating conformance of the changed access path policy, and in the event of nonconformance, determining a root cause for the nonconformance.
- Detecting an event may involve obtaining physical and logical state parameters from the components in the network.
- An access path violation may be associated with a first event in the event sequence, with the first event representing a root cause for the access path violation.
- the root cause can be analyzed and a corrective action for the component causing the access path violation can be determined. For example, the root cause can be determined based on event timing and reconciling discrepancies in the event timing.
- Event timing can include receiving from network components timestamps associated with the events.
- Reconciling discrepancies in the event timing may include determining a relative time order of the timestamps based on semantics of a timestamp, a route for transmission of a timestamp, a multiplicity of messages with different timestamps for an identical event, and a causal relationship between events.
- reconciling discrepancies may include correcting, or proposing corrective action for, logical/physical path connectivity of the network components to bring the reconfigured network in conformance with the network access path policy.
- a logical access path associated with the event can be revised, if the event is caused by a logical access path that is not conforming to the access path policy.
- Events and associated root causes related to nonconformance can be successively recorded and optionally collapsed into an updated access path representation.
- the recorded events and/or the collapsed access path representation can be visualized, for example, on a display.
- One or more subsets of the events or of the network can be selected to provide summary statistics of the events or a network status.
- the events can also be filtered so as to record state changes within a certain time interval, state changes affecting a specific component, or state changes impacting the access paths, or a combination thereof.
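The filter criteria named above (time interval, specific component, impact on access paths) can be sketched as a simple predicate chain. This is only an illustration under assumptions: the `Event` record, its fields, and the `filter_events` function are hypothetical names and do not come from the patent text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    # Hypothetical event record; field names are assumptions for illustration.
    time: float                 # normalized event time
    component: str              # component on which the state change occurred
    affects_access_path: bool   # whether the state change impacts an access path

def filter_events(events, start=None, end=None, component=None,
                  path_impacting_only=False):
    """Keep events that satisfy every criterion that was supplied."""
    selected = []
    for ev in events:
        if start is not None and ev.time < start:
            continue
        if end is not None and ev.time > end:
            continue
        if component is not None and ev.component != component:
            continue
        if path_impacting_only and not ev.affects_access_path:
            continue
        selected.append(ev)
    return selected
```

Because every criterion is optional, the same function covers a single filter as well as "a combination thereof".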
- a planned change task may include adding, removing and/or changing a physical network component, a physical link between network components, a port configuration and/or a LUN mapping.
- the physical component can be a storage device, a switch or a server.
- the access path policy may be updated by adding an access path to the policy, deleting an access path from the policy, or changing an attribute of an access path.
- a planned change task may involve modifying an access path attribute.
- Pre-validating may include detecting a nonconformance of access paths with the access path policy and modifying the planned change task to conform the access paths to the access path policy.
- Post-validating may include detecting a nonconformance of the implemented change with the access path policy and notifying a user or determining a root cause of the nonconformance.
- the root cause may be determined from a logical order and/or a temporal order of the component event, and by generating an event sequence based on the logical and/or temporal order.
- the root cause can be associated with a first event in the event sequence and may be eliminated by proposing a correction of one or more access paths.
- Generating the event sequence may include determining a relative order of the component events based on semantics of a timestamp, a route for transmission of a timestamp, a multiplicity of messages with different timestamps for an identical event, and/or a causal relationship between events.
- a planned change task, a pre-validated change task, an implemented change, and/or a post-validated configuration change may be visualized graphically and/or textually.
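Pre-validation as described above can be pictured as applying the planned change tasks to a copy of the current access path state and comparing the result with the policy before anything is touched in the network. The sketch below assumes a deliberately reduced model in which an access path is just an `(initiator, LUN)` pair and a task is an `("add" | "remove", path)` tuple; the function name and data shapes are hypothetical.

```python
def pre_validate(current_paths, planned_tasks, required_paths):
    """Simulate planned tasks on a copy of the state and check policy conformance.

    current_paths:  iterable of (initiator, lun) pairs that exist now
    planned_tasks:  list of (action, path) with action "add" or "remove"
    required_paths: iterable of (initiator, lun) pairs the policy requires
    """
    simulated = set(current_paths)      # copy, so the live state is untouched
    for action, path in planned_tasks:
        if action == "add":
            simulated.add(path)
        elif action == "remove":
            simulated.discard(path)
        else:
            raise ValueError(f"unknown action: {action}")
    required = set(required_paths)
    missing = required - simulated      # paths the policy requires but the plan lacks
    unexpected = simulated - required   # paths the plan creates but the policy forbids
    return {
        "conforms": not missing and not unexpected,
        "missing": missing,
        "unexpected": unexpected,
    }
```

Under the same assumptions, post-validation could reuse the comparison step with the observed state after implementation in place of the simulated state.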
- FIG. 1 shows a schematic diagram of an exemplary network with physical links and a management server according to the invention
- FIG. 2 shows details of the management server of FIG. 1
- FIG. 3 shows two exemplary logical access paths between a server and a storage device that comply with a network access path policy
- FIG. 4 is a schematic high-level flow diagram for constructing and analyzing an access path event history
- FIG. 5 shows a planned change in the logical access paths of the network after adding another server
- FIG. 6 is a schematic high-level flow diagram for managing predictive changes to access paths in a network
- FIG. 7 shows an exemplary visualization of an access path change plan and pre-validation
- FIG. 8 shows an exemplary textual visualization of a change execution tracking
- FIG. 9 shows an exemplary graphical visualization of a change execution tracking
- FIG. 10 shows an exemplary textual and graphical visualization of a violations analysis
- FIG. 11 shows an exemplary textual and graphical visualization of a change recording and history management.
- the methods and systems described herein enable multiple efficient, effective, and risk-free changes to access path environments in which initiators on multiple network appliances, such as servers or applications, access (read and write) data stored on multiple other shared network appliances, such as storage devices.
- Exemplary embodiments of the methods and systems will be described with reference to a SAN having specialized SAN devices (such as different types of switches) which are interlinked, and may employ various transfer protocols (such as Fibre Channel and iSCSI).
- Each server is connected to the network through one or more specialized network cards (such as a Host Bus Adapter, or HBA).
- a SAN is only one particular example of a network with defined physical and logical access paths and an associated access path policy, and it will be understood that the invention can be used with other types of networks, such as local area networks, wide area networks, extranets, remote-site connections, inter-domain networks, intranets, and possibly the Internet, as long as access paths can be defined, set up, and monitored in such a network.
- Access Path - a physical and logical connection or link between components in a network, which enables an initiator on one node of the network to access a data set on another node, while disabling other initiators from accessing the same data set, unless provisions are made for controlled sharing of the same data between several initiators.
- Logical Access Path refers to a logical channel between a given application and a given LUN along which data can flow. The logical state
- Access Path Policy specifies all the access paths that should exist, and the required end-to-end properties for each of these access paths at any given point in time.
- the access path policy can specify redundancy and replication of components and network appliances, number of hops, latency, constraints on component types inter-connections along the path, etc.
- Access Path Event - a physical or logical change of a state of one of the components which are part of an access path.
- An access path event can be caused, for example, by a node failure, a faulty cable connection, or a change in a cable connection, or a zone configuration change.
- Access Path Violation - a discrepancy at some point in time between the access path policy and the access paths in the network environment, which can be caused by an access path event and can be reflected, for example, in the existence/absence of an access path or in a change in the properties of one or more of the existing access paths.
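The definition of an access path violation (a path that should exist but does not, a path that exists but should not, or a path whose properties fall short) can be illustrated with a small comparison routine. This is a sketch under assumptions: `PathSpec`, `min_redundancy`, and the violation labels are hypothetical, and redundancy stands in for the wider set of end-to-end attributes a policy may carry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PathSpec:
    # Hypothetical policy entry: one required access path and one attribute.
    initiator: str
    lun: str
    min_redundancy: int = 1

def find_violations(policy, actual):
    """Compare required paths with observed paths.

    policy: iterable of PathSpec
    actual: dict mapping (initiator, lun) -> number of independent logical
            paths currently observed between the two end points
    """
    required = {(p.initiator, p.lun): p for p in policy}
    violations = []
    for key, spec in required.items():
        count = actual.get(key, 0)
        if count == 0:
            violations.append(("missing_path", key))
        elif count < spec.min_redundancy:
            violations.append(("insufficient_redundancy", key))
    for key, count in actual.items():
        if key not in required and count > 0:
            violations.append(("unauthorized_path", key))
    return violations
```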
- FIG. 1 shows a topological view of an exemplary network 10, such as a storage area network (SAN), with several network appliances, for example peripherals such as servers 102, 104, 106, 108, 110, 112, switches 122, 124, 126, and application data storage devices 132, 134.
- the storage devices can be, for example, disk drives, such as RAID devices, tape drives, or other types of mass-storage devices.
- the physical connection paths between different network appliances are indicated by solid lines. Not all physical access paths are also logical access paths, because some physical access paths alone or in combination with other access paths may have characteristics that cause nonconformance with the access path policy.
- the storage devices 132, 134 can be further partitioned into data storage regions, such as unique LUNs 131, 133, 135.
- the network 10 of FIG. 1 includes a management server 12 which for the exemplary network 10 is configured to communicate with various network components, including the network appliances, i.e., the servers 102, 104, 106, ... , 112, the storage devices 132, 134 and respective LUNs 131, 133, 135, and switches 122, 124, 126, to monitor the network and network resources and assure conformance between the logical access paths and the access path policy.
- This communication can take place via the communication channels used by the network for data transfer or via separate communication channels.
- FIG. 2 shows in more detail an exemplary configuration of the management server 12.
- the management server 12 can include, inter alia, a Components Interaction Engine 202 which obtains information from the various network components in a manner described above.
- An Information Normalization Engine 204 converts the obtained information to a standard, device-independent representation, with the Information Reconciliation Engine 206 reconciling conflicts, removing redundant information and identifying incomplete information.
- Event Correlation Engine 208 establishes relationships between events and establishes a temporal and logical event sequence.
- the Validation Analysis Engine 210 compares the actual access paths with the access path policy and identifies access path violations.
- History Analysis Engine 212 selects events, for example, based on their causal relationships, filters events according to defined filter criteria, and performs statistical and trending analysis.
- the Root Cause Analysis Engine 214 analyzes the root cause(s) of detected violations and can optionally generate a root cause decision tree 216.
- An Event Repository 218 stores access path violations, events leading to these violations, root causes, etc., whereas a Policy History Repository 220 stores access path policies and changes in the access path policies.
- FIG. 3 depicts the exemplary network 10 with logical access paths (shown as bold lines) set up between a network appliance 106, such as a server or an application, and a LUN, such as LUN 135, on storage device 134.
- the intermediate components along the access path include, among others, intermediate nodes, such as switch 122 in one of the logical access paths, and switches 124, 126, in the other access path. It can be inferred from the network diagram of FIG. 3 that the access path policy requires redundant access paths.
- the illustrated network configuration serves as an example only, and the configuration and routing of the access paths will depend on the type of device and number of devices employed at the network nodes (e.g., switches 122, 124, 126).
- Each network device may be set to logically constrain traffic flows through that device to specific respective end-points only (using different methods depending on the type of device and on the vendor).
- each switch typically supports the definition of different types of zones, which are sets of device ports between which data may flow via that switch.
- Storage devices typically support LUN masking which imposes constraints on the servers that may access a LUN.
- a server's HBA (host bus adapter) typically supports types of LUN masking that constrain which LUNs can be accessed from that particular server.
- an end-to-end data flow between a server and a storage LUN requires that both physical constraints (at least one physical path must exist between the corresponding server and the corresponding storage LUN) and logical constraints (the zoning in each switch and the LUN masking at the HBA and storage device should be set so as not to disable data traffic between these end points) are satisfied.
- the logical setup on each of the two HBAs on server 106, the zone set up in each of the switches 122, 124, 126, as well as the LUN masking 135 at the storage device must be set to enable flows along each of these logical channels between these two end points 106 and 135.
- the zoning on switch 122 needs to be defined such that the port corresponding to server 106 and the other port corresponding to the storage device of LUN 135 are in the same zone.
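The combination of physical and logical constraints described above can be sketched as a graph search that only traverses switches whose zoning permits the two end points, after first checking LUN masking. The data model here is an assumption made for illustration: zones are modeled as sets of end-point names rather than ports, HBA-side masking is omitted, and all identifiers (`logical_path_exists`, `links`, `zones`, `lun_masks`, `lun_location`) are hypothetical.

```python
from collections import deque

def logical_path_exists(links, zones, lun_masks, lun_location, server, lun):
    """Return True if data can flow from `server` to `lun`.

    links:        dict node -> list of physically connected nodes (both directions listed)
    zones:        dict switch -> list of sets of end points zoned together on that switch
    lun_masks:    dict lun -> set of servers allowed to access that LUN
    lun_location: dict lun -> storage device that holds the LUN
    """
    storage = lun_location[lun]
    # Logical constraint at the storage device: LUN masking must admit the server.
    if server not in lun_masks.get(lun, set()):
        return False

    def permits(switch):
        # Logical constraint at a switch: some zone must contain both end points.
        return any(server in z and storage in z for z in zones.get(switch, []))

    # Physical constraint: breadth-first search over physical links,
    # passing only through switches whose zoning permits the flow.
    queue, seen = deque([server]), {server}
    while queue:
        node = queue.popleft()
        if node == storage:
            return True
        for nxt in links.get(node, ()):
            if nxt in seen:
                continue
            if nxt != storage and not permits(nxt):
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False
```

Nodes without a zone entry (for example other servers) are never traversed, so a path must run through correctly zoned switches only.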
- Logical access paths can include associated path attributes which can be considered as particular properties that characterize each end-to-end Logical Access Path, describing, for example, aspects of availability, performance and security of each given end-to-end logical channel.
- a particular value can be computed for each one of the defined attributes (that value represents a particular property value for that logical access path instantiation).
- the computation of a particular attribute value for a given logical access path can be based on information related to the sequence of linked components, as well as on information about the types and internal configuration states of any number of components contained in that logical access path.
- the path attributes represent the characteristics of the end-to-end data flow between an initiator and its data.
- the path attributes relate to, for example, levels of end-to-end availability, performance, and security, all of which characterize data flows along a logical access path.
- Monitoring the performance and compliance of a network with defined logical paths requires monitoring network appliances to detect access path events, analyzing the access path events to detect access path violations, and constructing and maintaining an access path history of access path violations through a sequence of end-to-end path snapshots of the access paths. These snapshots may have divergent and/or contradicting information and further information about the temporal and/or logical sequence of events may hence be required. Discrepancies requiring reconciliation may be due to:
- status information received from a variety of distributed sources is processed at a central server, such as the network management server 12.
- Each such status information received from a component is parsed and translated from the source-specific and protocol-specific context representation to a normalized representation designed to capture device- and protocol-independent access path status information. Normalization is desired because, depending on the component type and the actual protocol used to communicate with the component, the information received from each component has a different syntactic and semantic representation, as well as variations in the information contents. Normalization takes into account, among others:
- Status information can be received from each component source, either as an update response to a request or as a pre-defined, optionally periodically transmitted, component status update message.
- Each status update can contain a summary of the current state of the component, or information about a new component event that occurred at that component or at other parts of the network.
- the switch 124 in network 10 may have failed, disabling one of the redundant access paths between server 106 and LUN 135, so that the logical access path between these two network appliances is no longer in compliance with the access path policy of the network 10 which requires redundancy.
- the access path policy may also specify levels of storage redundancy and data redundancy (not shown).
- the physical connectivity information may include the identity of other components which are connected to this component's ports.
- the logical connectivity information may include various types of information flow constraints through the component, such as zoning, port binding, LUN masking, and the like.
- the normalization process involves mapping the component-specific status information into an access path context paradigm. That is, deducing from the status of a component (physical and logical constraints) which access paths (potential data flows, from certain sources to certain destinations) are enabled or disabled by this specific component.
- a “device down” event (for example, inferred when no status response is received within a certain time after a request) may indicate that no data can currently be transmitted through the device.
- a “link down” event from a component may indicate that no data flow can currently be transmitted through a particular port of that component.
- a “soft zone update” event from a component (e.g. switch) may imply that data can only flow between network appliances in a new zone configuration.
- a “LUN masking update” event from a component (e.g., a storage device) may imply that data can only flow between a specified source in the new LUN masking configuration and the corresponding destination storage LUN.
- the normalized representation of component status information can then be aggregated consistently, as will be described below, to determine the status of all the access paths in the network at any point in time.
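The mapping from component-specific events to an access path context can be illustrated with a small dispatcher that turns each raw event into a device-independent record of which flows it blocks or restricts. The raw message layout, the event type strings, and the output fields are all assumptions chosen for this sketch; real devices report such events in vendor- and protocol-specific forms.

```python
def normalize(raw):
    """Translate a raw component event into a normalized access path effect.

    raw: dict with at least "type" and "component"; other keys depend on the type.
    """
    kind = raw["type"]
    component = raw["component"]
    if kind == "device_down":
        # No data can currently be transmitted through the device.
        return {"component": component, "scope": "device", "effect": "block_all"}
    if kind == "link_down":
        # No data can currently be transmitted through one port of the component.
        return {"component": component, "scope": f"port:{raw['port']}",
                "effect": "block_all"}
    if kind == "soft_zone_update":
        # Data can only flow between members of the new zone configuration.
        return {"component": component, "scope": "device", "effect": "allow_only",
                "allowed": [sorted(zone) for zone in raw["zones"]]}
    if kind == "lun_masking_update":
        # Only the listed sources may reach the given LUN.
        return {"component": component, "scope": f"lun:{raw['lun']}",
                "effect": "allow_only", "allowed": sorted(raw["allowed_sources"])}
    raise ValueError(f"unsupported event type: {kind}")
```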
- the information received from the various distributed components about local component events can be processed by the management server 12 to determine an event sequence of these events.
- Determining the sequential ordering of the underlying distributed events may not be straightforward for various reasons. For example, some event information may not contain timestamps, or timestamps from different sources may have different semantics, or clocks of different distributed sources may not be fully synchronized. Alternatively or in addition, timestamps from different sources may represent time of message generation (or transmission) rather than time of a specific underlying event, or messages from different distributed sources may incur different levels of transmission delays, or event information from various sources may be duplicated or partially duplicated, or, due to transmission failure or misrouting, messages may become reordered or lost.
- the management server 12 analyzes the messages received from each component and determines the "correct" event sequence, i.e., the relative time order or timeline of the underlying events based on: The semantics of the timestamp in the message as determined by the type of the source device and the nature of the interaction protocol between the central server and the source device. For example, in a switch that transmits, in response to a polling request, the updated zone configuration state, any specific new zone change event must be associated with a time point between the current snapshot time and the previous snapshot time for that switch. - The characteristics of the route between the source device and the central server.
- a component event message from a component connected to the central server via a direct local-area channel is likely generated at a more predictable recent point in time than an event message transmitted via a number of network hops.
- - Elimination of multiple messages representing the same underlying event. For example, two different components may generate separate messages related to the same failure event.
- a port failure at switch 124 may cause messages from both switch 124 and switch 126.
- - Assessment of causal relationships between different events. For example, a zone configuration change in one switch in a fabric can trigger a propagation of corresponding update events at other switches in that fabric. In other words, the update events at the other switches are causally dependent on the configuration change in the first switch and therefore occur after the original event, i.e., the configuration change.
- disconnection of a cable from one port must have occurred before reconnection of the same cable to another port.
- the aforedescribed exemplary steps are only illustrative and by no means exhaustive.
- the temporal event order taken together with the logical event order make it possible to create a causal and consistent representation of an event sequence of multiple component events in the order in which the events actually occurred in the network.
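The ordering considerations above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `Event` fields, the use of a time window per event (for example, the interval between two polling snapshots), the duplicate key `(kind, subject)`, and the midpoint heuristic are all assumptions made for illustration. Duplicates are merged first, then events are emitted in estimated time order while never placing an effect before its cause.

```python
import heapq
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: str
    kind: str        # e.g. "zone_change", "port_failure" (hypothetical labels)
    subject: str     # component or port the underlying event concerns
    earliest: float  # lower bound on the time of the underlying event
    latest: float    # upper bound, e.g. the current snapshot time


def deduplicate(events):
    """Merge reports assumed to describe the same underlying event.

    Assumption: reports with the same (kind, subject) are duplicates;
    the merged time window is the intersection of the reported windows.
    """
    merged = {}
    for ev in events:
        key = (ev.kind, ev.subject)
        if key in merged:
            old = merged[key]
            merged[key] = Event(old.event_id, ev.kind, ev.subject,
                                max(old.earliest, ev.earliest),
                                min(old.latest, ev.latest))
        else:
            merged[key] = ev
    return list(merged.values())


def order_events(events, causes):
    """Order events by estimated time while respecting causal pairs.

    causes is an iterable of (cause_id, effect_id) pairs. Among events
    whose causes have all been emitted, the one with the smallest
    window midpoint is emitted next.
    """
    by_id = {ev.event_id: ev for ev in events}
    unmet = {ev.event_id: 0 for ev in events}
    effects = {ev.event_id: [] for ev in events}
    for cause, effect in causes:
        unmet[effect] += 1
        effects[cause].append(effect)

    def midpoint(i):
        return (by_id[i].earliest + by_id[i].latest) / 2

    ready = [(midpoint(i), i) for i, n in unmet.items() if n == 0]
    heapq.heapify(ready)
    ordered = []
    while ready:
        _, i = heapq.heappop(ready)
        ordered.append(by_id[i])
        for j in effects[i]:
            unmet[j] -= 1
            if unmet[j] == 0:
                heapq.heappush(ready, (midpoint(j), j))
    if len(ordered) != len(events):
        raise ValueError("causal constraints contain a cycle")
    return ordered


# Two switches report the same port failure with different time windows.
reports = [Event("m1", "port_failure", "switch124:port", 100, 110),
           Event("m2", "port_failure", "switch124:port", 105, 120)]
merged = deduplicate(reports)

# A propagated zone update carries an earlier (skewed) timestamp than its cause.
events = [Event("a", "zone_change", "switchA", 10, 20),
          Event("b", "zone_update", "switchB", 5, 12),
          Event("c", "cable_disconnect", "port1", 0, 4)]
timeline = [ev.event_id for ev in order_events(events, [("a", "b")])]
```

In the second example the propagated update `b` is placed after its cause `a` even though its reported window is earlier, which is the behaviour the causal-relationship step calls for.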
- the management server 12 analyzes the consistent event sequence representation and maps the event sequence to a higher level access path representation by performing the following operations:
- a change event received from more than one source may indicate a new connection between two switches which represents a high level state change.
- Another example of a high level state is a zoning change in a switch fabric which may involve several low-level state changes in several switches in the fabric.
- a device failure represents a high level state change which may result in one or more redundant messages from devices in addition to the time-out message from the failed device itself.
- - Correlating multiple low-level events to obtain additional complementary information about a high-level event. For example, correlating between a host WWN identifier and its IP identifier may lead to additional information regarding a state change associated with that host.
- - Correlating multiple low-level events to determine conflicting information relating to the same state change, and attempting to resolve the identified conflict based on consistency of information from different sources and known reliability of the components. For example, when a new switch or a new host is added to a switch fabric, some neighboring devices may detect and report this event, while others may be updated later and hence still provide outdated state information.
- - Correlating multiple low-level events with information about a previously planned change in the network fabric.
- one or more low-level component events, such as a disabled connection on one of the switch ports, may initially indicate a component failure.
- the failure may be diagnosed as being most likely due to a change in cabling as part of a planned migration task.
- the network-specific access path policy may require dual fabric redundancy, so that a new low-level component event may indicate a new cable connection between a server and a storage device as part of a redundant path being set up.
- a zoning change or a LUN-masking change may imply that one or more new end-to-end (application/hosts to data/LUNs) access paths are established, or that one or more access paths cease to exist.
- the above process defines a consistent event sequence of component state changes and their impact on the end-to-end access paths. This can be represented and visualized as a consistent access path event history in the network.
- the consistent access path event history is used as a basis for various analysis tasks and to provide control, diagnostics, management, and audit functions.
- the management server may record every time-stamped low level component event and store it in a dedicated repository. The server can determine from the event sequence of the low level component events corresponding high level access paths and compare the access path with the access path policy. The management server may also record the derived higher level access paths state changes in the repository. In addition, the management server may store the access path policy and changes to the access path policy of the network or of at least the part of the network managed by the management server. Access path violations may occur for a variety of reasons, including planning mistakes, component failures, and human errors. One challenge addressed by the process of the invention is to identify the root cause of an access path violation.
- a root cause is defined as an event in the network fabric that causes a violation of access path policy, without being itself caused by an earlier, correctly time-stamped event that violated the access path policy. Whenever an access path violation is detected, effective corrective actions can be performed once the root cause is established.
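As a minimal sketch of this definition, the Python below scans the state-change history of one access path and returns the state change that started the current run of violations. The `StateChange` record and its `violates_policy` flag are hypothetical names introduced for illustration, and choosing the start of the most recent unbroken run of violating states is one reading of "earliest state change from a state without a violation to a state with a violation". If the path was never compliant, the function returns `None`, which corresponds to the case where the path was never set up correctly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StateChange:
    timestamp: float
    description: str
    violates_policy: bool  # True if the state after this change violates policy


def find_root_cause(history):
    """Return the change that moved the path from compliant to violating.

    Returns None if there is no current violation, or if the path was
    never in a compliant state (i.e. it was never set up correctly).
    """
    ordered = sorted(history, key=lambda change: change.timestamp)
    if not ordered or not ordered[-1].violates_policy:
        return None
    candidate = None
    for change in reversed(ordered):
        if change.violates_policy:
            candidate = change      # keep walking back through the violating run
        else:
            return candidate        # a compliant state precedes the run
    return None                     # never compliant


history = [
    StateChange(1.0, "zoning configured", False),
    StateChange(2.0, "port disabled on switch", True),
    StateChange(3.0, "time-out reported by host", True),
]
root = find_root_cause(history)
never_ok = find_root_cause([StateChange(1.0, "cable connected", True)])
```

Here the later time-out report is treated as a consequence, and the port change at timestamp 2.0 is returned as the root cause.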
- Whenever the management server detects an access path violation, for example, one caused by a component event, the server determines whether the access path was ever set up correctly, i.e., did not have a preceding violation. If the access path had been set up correctly, then at least the subset of access paths associated with the access path event history of that path is examined, and the earliest state change in the event sequence from a state without a violation to a state with a violation is identified as the root cause. The state change is presented, for example displayed on display 14, with its context information, including the corresponding low-level component events. Determining the appropriate corrective action is described below. In most situations, i.e.
- the root cause of the violation may be due to one or more "missing state changes." Identifying these missing state changes is performed as part of the corrective action analysis. The process of establishing the appropriate corrective actions for a given violation is performed by a combination of several approaches. For certain types of the root cause events and violations, a corresponding counter-event mapping is predefined.
- a graph-analysis is performed to determine the optimal route (or change state sequences) which will lead from the current access path state (which includes the violation) to the target state (as specified by the corresponding policy).
- This analysis can also be supported by a knowledge-base representing pre-defined network procedures that can provide helpful hints and suggestions to users in cases where no single final determination about the best corrective action can be derived.
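The graph analysis described above can be sketched as a breadth-first search over configuration states. This is an illustrative assumption, not the method specified in the source: states are modeled as frozen sets of configuration facts, "optimal" is taken to mean the fewest actions, and the helper `add_fact` and all fact and action names are hypothetical.

```python
from collections import deque


def add_fact(fact, requires=()):
    """Build an action that adds `fact` if its prerequisites are present."""
    def apply(state):
        if fact in state or not set(requires) <= state:
            return None  # action not applicable in this state
        return state | {fact}
    return apply


def plan_corrective_actions(current, is_target, actions):
    """Breadth-first search for the shortest action sequence to a target state.

    actions maps an action name to a function state -> new state (or None).
    Returns a list of action names, or None if no sequence reaches the target.
    """
    queue = deque([(current, [])])
    seen = {current}
    while queue:
        state, path = queue.popleft()
        if is_target(state):
            return path
        for name, apply in actions.items():
            nxt = apply(state)
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [name]))
    return None


# Current state: the cable is in place, but zoning and LUN masking are missing.
current = frozenset({"cable:host-switch"})
required = {"cable:host-switch", "zone:host-lun", "mask:host-lun"}
actions = {
    "add_mask": add_fact("mask:host-lun", requires=["zone:host-lun"]),
    "add_zone": add_fact("zone:host-lun", requires=["cable:host-switch"]),
}
plan = plan_corrective_actions(current, lambda s: required <= s, actions)
```

When the search returns `None`, no single sequence could be derived, which is the situation in which the knowledge base of pre-defined procedures would be consulted.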
- the aforedescribed process forms a basis for constructing and analyzing the access path history in a network. As shown in FIG. 4, the process 40 starts with a network state where all network appliances are communicating via valid physical and logical access paths that conform to the defined access path policy for that network or section of network, step 402.
- the management server 12 collects information from the network appliances, either continuously or periodically, as described above, enabling the server to identify one or more component events, step 404.
- the server attempts to determine a temporal sequence of the component events based, for example, on timestamps of these events, step 406.
- a single component event, such as a component fault or a configuration change, can cause multiple events to be indicated to the server 12, which are a consequence of the first event without representing in themselves a component malfunction.
- the temporal sequence is analyzed and transformed into a consistent event sequence based on likely or necessary causal relationships between the individual events, step 408, and mapped onto a logical access path representation, step 410, which can be visualized on monitor 14.
- the access paths are then compared 412 with the defined access path policy 414 which can be stored 416 in a policy repository. If the logical access paths comply with the access path policy, no action is required, and the event is logged in an access path history file, step 430, the access path representation is updated, if necessary, step 432, and the server 12 continues to monitor the network.
- the root cause for the access path violation(s) is determined, step 418, based, for example, on the consistent event sequence determined in step 408. It is first determined in step 420, if the access path was set up correctly, because the root cause event can also be triggered, for example, by a component change which complies with a presumably correct access path that was, however, set up incorrectly. In this case, the missing or incorrect state change is identified, step 424, and changes in the logical/physical connectivity are proposed, step 426.
- If, on the other hand, the access path was set up correctly, as determined in step 420, then the earliest component event (root cause) identified in step 418 is assumed to have been the trigger event, step 422, and the component is repaired or the physical connection redefined and/or rerouted, step 426, and the resulting logical path is validated and mapped onto the access path representation. The changes are then logged, step 430, and the access path representation updated, step 432, as described before.
- the mechanisms for analyzing and correcting root cause violations are applicable to "actual violations" as well as "pending violations."
- Actual violations reflect events that have already occurred in the network environment.
- Pending violations reflect planned state change events which have not yet been performed in the network environment. As both these events are similarly represented in the state change history structure (with past timestamps and future timestamps, respectively), the analysis mechanisms for both these cases can be constructed in an analogous manner.
- Predictive Change Management is designed to improve the reliability and efficiency of access path change processes in IT infrastructures, such as managed networks having an access path policy, for example, storage area networks (SAN).
- the management server 12 interacts with the network appliances and network resources in the network fabric and implements a process with the following main aspects:
- Before contemplating a change in the current network configuration that could potentially affect the access paths, the management server receives state information from the various network appliances and fabric components, correlates the information, reconciles inconsistencies, and constructs a representation of the current state of the infrastructure environment and of all access paths that exist in the environment at that point in time, as well as their access path attributes.
- the representation of the existing access paths and path attributes is compared with the corresponding representation in the access path policy repository 214 of the management server 12 (see FIG. 2), violations are identified, and appropriate notifications are generated.
- the types of possible violations include, without being limited thereto, "access path does not exist", "access path should not exist", and "access path attribute value discrepancy".
- the server obtains the updated state information, and the violations are deleted from the list.
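A minimal Python sketch of the comparison between existing access paths and the policy follows. The data layout is an assumption: both the observed state and the policy are modeled as dictionaries keyed by a `(source, target)` pair with a dictionary of attribute values, the policy is assumed to list every permitted path, and attributes are compared for simple equality.

```python
def find_violations(existing, policy):
    """Compare observed access paths with the policy.

    Both arguments map (source, target) -> {attribute: value}.
    Returns a list of (path, violation_type) tuples.
    """
    violations = []
    for path, required in policy.items():
        if path not in existing:
            violations.append((path, "access path does not exist"))
            continue
        for attribute, value in required.items():
            if existing[path].get(attribute) != value:
                violations.append(
                    (path, "access path attribute value discrepancy"))
                break
    for path in existing:
        if path not in policy:
            violations.append((path, "access path should not exist"))
    return violations


policy = {
    ("host1", "lun1"): {"redundancy": 2},
    ("host2", "lun2"): {"redundancy": 2},
}
existing = {
    ("host1", "lun1"): {"redundancy": 1},   # attribute mismatch
    ("host3", "lun1"): {"redundancy": 2},   # not permitted by the policy
}
violations = find_violations(existing, policy)
```

Re-running the same comparison after updated state information arrives yields a list from which resolved violations have disappeared.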
- Access path policy may change by, for example, adding one or more new access paths between two network appliances (with particular attributes), deleting one or more access paths, or changing access path attributes.
- the network configuration can also change due to one or more actions related to component changes, such as: connecting or removing a link between two device ports; connecting devices to or disconnecting devices from links; and/or changing a logical access state configuration on a device to enable data flows between devices and/or ports.
- the action "logical access configuration change at device R" is mapped to a detailed zoning configuration or LUN masking.
- a new server 103 is slated to be added to the network.
- the network policy stipulates that the server 103 is connected to storage LUN 135 with dual fabric redundancy and no more than one hop to maintain a low latency.
- One access path meeting this access path attribute can be established between an HBA of server 103 and LUN 135 via switch 126.
- a second possible access path via switches 124 and 122 has two hops and therefore violates the access path attributes. No second access path exists that only includes a single hop. Accordingly, a new access path 142 is established between switch 124 and LUN 135.
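The server 103 example can be checked with a short Python sketch. The topology below contains only the links named in the text (the figure itself is not reproduced here, so any other links are omitted), a hop is assumed to mean one switch traversed, and dual fabric redundancy is approximated as two compliant paths that share no switch.

```python
from itertools import combinations


def simple_paths(links, src, dst, path=None):
    """Yield every loop-free path from src to dst over undirected links."""
    path = (path or []) + [src]
    if src == dst:
        yield path
        return
    for a, b in links:
        nxt = b if a == src else a if b == src else None
        if nxt is not None and nxt not in path:
            yield from simple_paths(links, nxt, dst, path)


def hops(path):
    """Number of intermediate switches on the path."""
    return len(path) - 2


def meets_policy(links, src, dst, max_hops=1):
    """True if two switch-disjoint paths within the hop limit exist."""
    ok = [p for p in simple_paths(links, src, dst) if hops(p) <= max_hops]
    return any(set(p[1:-1]).isdisjoint(q[1:-1])
               for p, q in combinations(ok, 2))


links = {
    ("server103", "switch126"), ("switch126", "LUN135"),
    ("server103", "switch124"), ("switch124", "switch122"),
    ("switch122", "LUN135"),
}
before = meets_policy(links, "server103", "LUN135")

# Access path 142: a direct link between switch 124 and LUN 135.
after = meets_policy(links | {("switch124", "LUN135")}, "server103", "LUN135")
```

Before the new link, only the path via switch 126 stays within one hop, so the redundancy requirement fails; with the link between switch 124 and LUN 135 a second one-hop path exists.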
- Adding or changing one or more access paths may involve addition or reconfiguration of a number of components, which may in turn affect other access paths that previously conformed to the access path policy.
- FIG. 6 shows an exemplary process 50 for predictive change management according to the invention.
- the sequential order of the depicted steps may be changed and additional process steps may be performed as long as the illustrative process enables predictive change management of access paths in a network.
- the network is presumed to have a defined access path policy, step 502, so that a valid state of existing access paths can be established, step 504.
- a proposed network change plan is specified, which may add and/or change physical network components, links, port settings, LUN masking, and the like, step 506.
- the details of each proposed change in the plan are pre-validated after specification and before their implementation, step 508.
- Pre-validation is performed by simulating the effect of constituent proposed actions, i.e., the addition of server 103 and the two redundant links via switches 124 and 126.
- the effect of these actions on the representation of the infrastructure is determined, and any deviations in the resulting state representation from the specified required policy rules are identified.
- the effect of each action on the environment is simulated and a list of access paths is derived.
- a specific logical configuration update of a single component can open new access paths, close existing access paths, as well as change some attributes of existing access paths.
- the addition of path 142 between switch 124 and LUN 135 also opens connections between servers 102, 106, 108 and LUN 135 having a lesser number of hops. Any identified deviations from policy are presented, analyzed, and can be corrected, simulated, and pre-validated again in an iterative process, step 514.
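A minimal Python sketch of such a simulation is given below. It applies a change plan to a copy of the link set and reports which access paths would be opened, closed, or have an attribute changed; the only attribute modeled is the minimum hop count. The topology (server 102 attached to switch 124) and the plan format `(action, link)` are assumptions made for illustration, since the figure is not reproduced here.

```python
from collections import deque


def min_hops(links, hosts, luns):
    """Map each reachable (host, lun) pair to its minimum switch count."""
    neighbours = {}
    for a, b in links:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    result = {}
    for host in hosts:
        dist = {host: 0}
        queue = deque([host])
        while queue:
            node = queue.popleft()
            if node in luns or (node in hosts and node != host):
                continue  # end devices do not forward traffic
            for nxt in neighbours.get(node, ()):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for lun in luns:
            if lun in dist:
                result[(host, lun)] = dist[lun] - 1
    return result


def simulate(links, plan, hosts, luns):
    """Apply planned actions to a copy of the state and diff access paths."""
    before = min_hops(links, hosts, luns)
    trial = set(links)  # the live state is never modified
    for action, link in plan:
        if action == "add":
            trial.add(link)
        else:
            trial.discard(link)
    after = min_hops(trial, hosts, luns)
    opened = {p: after[p] for p in after.keys() - before.keys()}
    closed = sorted(before.keys() - after.keys())
    changed = {p: (before[p], after[p])
               for p in before.keys() & after.keys() if before[p] != after[p]}
    return opened, closed, changed


links = {("server102", "switch124"), ("switch124", "switch122"),
         ("switch122", "LUN135")}
plan = [("add", ("switch124", "LUN135"))]
opened, closed, changed = simulate(links, plan, {"server102"}, {"LUN135"})
```

The `changed` result shows the side effect described above: the planned link shortens an existing path of another server, and that new state can then be compared against the policy before anything is implemented.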
- the result of a successful pre-validation phase is a detailed execution plan for
- a failure in an access path can be correlated with the performance of a component in the access path, which would allow the generation of a forward-looking temporal sequence or timeline of future component events in the access path. For example, by collecting and analyzing time-stamped information from the various components in a simulated implementation, a root cause for an access path failure can be determined before the changes are implemented, as discussed below. Accordingly, necessary repairs and an access path reconfiguration can be easily and predictably pinpointed and cost-effectively performed.
- the proposed change plan may be implemented in the infrastructure environment based on a pre-established action plan, step 516.
- the change implementation can be performed in a variety of ways, including physical changes in the environment (re-cabling, connecting components) and logical reconfiguration using component-specific management interfaces or other provisioning solutions. Different parts of the change plan can be implemented in parallel by diverse IT personnel.
- the actual implementation of the change plan is continuously tracked and analyzed by the server based on update messages received from the components in the network and mapped to the change execution plan and the access path policy, step 520.
- the server records the individual state change actions (what was performed, where, when, by whom), and the evolving network state until the planned changes are completed.
- Validation of each implemented change includes establishing its consistency with respect to the pre-validated change plan as well as its consistency with respect to the specified policy, step 522. Any deviation from the change plan or from the specified policy triggers appropriate notifications, step 524.
- Each such notification can include context information suggesting a root cause, step 526, and/or specifying proposed corrective actions, step 528.
- the process 50 then returns to step 516, so that each corrective action can be iteratively processed through the predictive change phases cycle, or parts of it, until successful completion.
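The tracking and validation steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: planned actions are identified by plain numbers, observed actions arrive as a list derived from component update messages, and the policy check is a caller-supplied function evaluated per action rather than on the full network state.

```python
def track_execution(planned, observed, violates_policy):
    """Classify observed actions against the pre-validated change plan.

    Returns (completed, remaining, notifications), where notifications
    is a list of (action, reason) tuples.
    """
    completed, notifications = [], []
    for action in observed:
        if action not in planned:
            notifications.append((action, "deviation from change plan"))
        elif violates_policy(action):
            notifications.append((action, "deviation from policy"))
        else:
            completed.append(action)
    remaining = [a for a in planned if a not in completed]
    return completed, remaining, notifications


planned = [1, 2, 3, 4, 5, 6]   # hypothetical plan with six low-level actions
observed = [1, 2, 5, 9]        # action 9 was never part of the plan
completed, remaining, notifications = track_execution(
    planned, observed, violates_policy=lambda action: False)
```

Each notification would then be fed back into root cause analysis and corrective action planning before execution resumes.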
- an access path history file such as the Event History Repository 218 (see FIG. 2), which can be maintained as long as necessary for future reference, for change cycle statistics, and as a guideline for managing future access path changes.
- FIG. 7 represents an example of a visualization of an access path change plan to be pre-validated. Outlined are the planned high level tasks, the detailed low-level individual change actions to be pre-validated and performed, and the consequent future changes implied by these low-level actions. The pre-validation is performed based on the current state of the access paths in the network, the access path policy, and the set of planned low-level actions. Any future violation implied by these is detected and highlighted, and notifications are generated.
- FIG. 8 represents an example of a visualization of change execution tracking. Outlined are those low-level actions that the system detects as being completed, by interacting with the components in the network environment. In the example, the system detected that actions 1, 2, and 5 of the plan were the only ones completed at that point in time, as denoted by the ticks inserted in the boxes next to them.
- FIG. 9 represents an example of a graphical visualization of change execution tracking.
- the current state of the network and access path is depicted, with the impact of the latest detected change action highlighted.
- the state of the network is depicted after the execution of action 5 in the plan, in which a cable was connected between component UP2001 and component USCG_D1_SW4.
- FIG. 10 represents an example of a textual and graphical visualization of violations and their analysis. Depicted are a list of violations detected in the environment at a particular point in time, a graphical representation of the affected access paths for each violation, and the details of the changes that are associated with these violations. For example, the first violation shown represents a redundancy violation in an access path between host 3 and storage 2, created by a change that occurred on 9/23/03 at 5:32pm.
- FIG. 11 represents an example of a textual and graphical visualization of change history. It contains all the change events that occurred in the system at any point in the past up to now, ordered correctly on a timeline. For each change event, context information is provided, and the state of the access paths in the network at the point in time in which the change event occurred, is depicted graphically.
- the low level event sequences may be collapsed into a new global state representation after violations of the access path policy have been rectified and/or changes to the access path policy have been implemented.
- Each such global state is a representation of the network at a specific point in time and can be viewed graphically.
- Each collapsed global state representation is computed by starting from the last collapsed state and applying sequentially each new state change. Zoom-in and zoom-out capabilities, aided by graphic visualization, enable viewing of details of low-level events, corresponding higher level state changes, affected access paths, and/or the corresponding network state representation.
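Computing a collapsed state from the previous one is a left fold over the new state changes. The Python sketch below assumes, for illustration only, that a global state is a frozen set of configuration facts and that each state change is an `("add", fact)` or `("remove", fact)` pair; the fact names are hypothetical.

```python
from functools import reduce


def apply_change(state, change):
    """Return the state that results from applying one state change."""
    operation, fact = change
    if operation == "add":
        return state | {fact}
    return state - {fact}


def collapse(last_collapsed, new_changes):
    """Apply each new change, in order, starting from the last collapsed state."""
    return reduce(apply_change, new_changes, last_collapsed)


last_collapsed = frozenset({"link:server102-switch124"})
new_changes = [
    ("add", "link:switch124-LUN135"),
    ("remove", "link:server102-switch124"),
    ("add", "link:server103-switch124"),
]
current = collapse(last_collapsed, new_changes)
```

Because each collapsed state is immutable, earlier snapshots remain available for the zoom-out views while the low-level change list supports zoom-in.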
- a planned change task, a pre-validated change task, an implemented change and/or a post-validated configuration change can be displayed graphically or in a text window, with tasks that have been performed or that still have to be performed, marked on the graphs or in the corresponding text fields.
- a variety of filtering and query capabilities enable selection and presentation of subsets of the history and subsets of the network, according to any selection criteria.
- Indexing structures enable selection, for example, of all the state changes within a certain time interval, affecting a specific component, or impacting certain access paths, and others.
- comprehensive summary statistics on either network states or on the change processes themselves can be prepared.
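The selection criteria and summary statistics mentioned above can be sketched in Python. The `Change` record and its fields are hypothetical, and the sketch uses a linear scan instead of the indexing structures the text refers to, which is enough to show the query semantics.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    timestamp: float
    component: str
    paths: frozenset  # access paths affected by this change


def select(history, start=None, end=None, component=None, path=None):
    """Return the changes matching every criterion that is not None."""
    return [c for c in history
            if (start is None or c.timestamp >= start)
            and (end is None or c.timestamp <= end)
            and (component is None or c.component == component)
            and (path is None or path in c.paths)]


def changes_per_component(history):
    """Simple summary statistic: number of recorded changes per component."""
    return Counter(c.component for c in history)


history = [
    Change(10.0, "switch124", frozenset({("server103", "LUN135")})),
    Change(20.0, "switch126", frozenset({("server103", "LUN135")})),
    Change(30.0, "switch124", frozenset({("server102", "LUN135")})),
]
in_window = select(history, start=15.0, end=35.0)
on_switch124 = select(history, component="switch124")
on_path = select(history, path=("server102", "LUN135"))
stats = changes_per_component(history)
```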
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Applications Claiming Priority (5)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US56506404P | 2004-04-23 | 2004-04-23 | |
| US56483704P | 2004-04-23 | 2004-04-23 | |
| US11/112,942 US7961594B2 (en) | 2002-10-23 | 2005-04-22 | Methods and systems for history analysis for access paths in networks |
| US11/112,624 US7546333B2 (en) | 2002-10-23 | 2005-04-22 | Methods and systems for predictive change management for access paths in networks |
| PCT/US2005/013999 WO2005106694A2 (en) | 2004-04-23 | 2005-04-25 | Methods and systems for history analysis and predictive change management for access paths in networks |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| EP1751668A2 (en) | 2007-02-14 |
| EP1751668A4 (en) | 2016-06-15 |
Family
ID=35242312
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| EP05742291.7A Withdrawn EP1751668A4 (en) | 2004-04-23 | 2005-04-25 | Methods and systems for history analysis and predictive change management for access paths in networks |
Country Status (2)
| Country | Link |
|---|---|
| EP (1) | EP1751668A4 (en) |
| WO (1) | WO2005106694A2 (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8560671B1 (en) | 2003-10-23 | 2013-10-15 | Netapp, Inc. | Systems and methods for path-based management of virtual servers in storage network environments |
| CN112398815A (en) * | 2020-10-28 | 2021-02-23 | 武汉思普崚技术有限公司 | Access control baseline detection method and device based on simulation path analysis |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US6108700A (en) * | 1997-08-01 | 2000-08-22 | International Business Machines Corporation | Application end-to-end response time measurement and decomposition |
| US6327598B1 (en) * | 1997-11-24 | 2001-12-04 | International Business Machines Corporation | Removing a filled-out form from a non-interactive web browser cache to an interactive web browser cache |
| US6353902B1 (en) * | 1999-06-08 | 2002-03-05 | Nortel Networks Limited | Network fault prediction and proactive maintenance system |
| US6636981B1 (en) * | 2000-01-06 | 2003-10-21 | International Business Machines Corporation | Method and system for end-to-end problem determination and fault isolation for storage area networks |
-
2005
- 2005-04-25 WO PCT/US2005/013999 patent/WO2005106694A2/en not_active Ceased
- 2005-04-25 EP EP05742291.7A patent/EP1751668A4/en not_active Withdrawn
Non-Patent Citations (1)
| Title |
|---|
| See references of WO2005106694A2 * |
Also Published As
| Publication number | Publication date |
|---|---|
| EP1751668A4 (en) | 2016-06-15 |
| WO2005106694A3 (en) | 2006-05-11 |
| WO2005106694A2 (en) | 2005-11-10 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US7961594B2 (en) | Methods and systems for history analysis for access paths in networks | |
| US7546333B2 (en) | Methods and systems for predictive change management for access paths in networks | |
| US7702667B2 (en) | Methods and systems for validating accessibility and currency of replicated data | |
| US7246163B2 (en) | System and method for configuring a network device | |
| US7249170B2 (en) | System and method for configuration, management and monitoring of network resources | |
| US7774444B1 (en) | SAN simulator | |
| US7685261B1 (en) | Extensible architecture for the centralized discovery and management of heterogeneous SAN components | |
| US7328260B1 (en) | Mapping discovered devices to SAN-manageable objects using configurable rules | |
| US7523184B2 (en) | System and method for synchronizing the configuration of distributed network management applications | |
| US7685269B1 (en) | Service-level monitoring for storage applications | |
| US8347143B2 (en) | Facilitating event management and analysis within a communications environment | |
| US7451175B2 (en) | System and method for managing computer networks | |
| US20070244997A1 (en) | System and method for configuring a network device | |
| EP1356630A2 (en) | Method for generating a network management database record | |
| JP2005276177A (en) | Method, system and program for network configuration checking and repair | |
| US7587483B1 (en) | System and method for managing computer networks | |
| WO2002025870A1 (en) | Method, system, and computer program product for managing storage resources | |
| CN101681362B (en) | Storage optimization method | |
| EP1751668A2 (en) | Methods and systems for history analysis and predictive change management for access paths in networks | |
| Mendiratta et al. | How reliable is my software-defined network? Models and failure impacts | |
| Sherwood et al. | Netcastle: network infrastructure testing at scale |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
| 17P | Request for examination filed |
Effective date: 20061123 |
|
| AK | Designated contracting states |
Kind code of ref document: A2 Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LI LT LU MC NL PL PT RO SE SI SK TR |
|
| DAX | Request for extension of the european patent (deleted) | ||
| A4 | Supplementary search report drawn up and despatched |
Effective date: 20160517 |
|
| RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 15/173 20060101AFI20160510BHEP Ipc: H04L 12/24 20060101ALI20160510BHEP Ipc: H04L 12/26 20060101ALI20160510BHEP Ipc: H04L 29/14 20060101ALI20160510BHEP Ipc: H04L 29/08 20060101ALI20160510BHEP |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: EXAMINATION IS IN PROGRESS |
|
| 17Q | First examination report despatched |
Effective date: 20170210 |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
| 18D | Application deemed to be withdrawn |
Effective date: 20170621 |