
= Network failure detection and recovery in a two-node Windows 2000 Server cluster =

Article ID: 242600

Article Last Modified on 3/1/2007


APPLIES TO


 * Microsoft Windows 2000 Advanced Server
 * Microsoft Windows 2000 Datacenter Server




This article was previously published under Q242600



SUMMARY
The Windows 2000 Cluster service runs a sophisticated algorithm to detect the availability of network interfaces. Also, the Plug and Play functionality of Windows 2000 detects disconnected network cables and connectivity problems between the network adapter and the device it is connected to, such as a hub or a switch. This article describes the network failure detection and recovery process on a two-node Windows 2000 Server Cluster.



MORE INFORMATION
The Cluster service detects the health of the network interfaces on your server cluster by sending a heartbeat from one node in the cluster to another node, and by monitoring node operational status information. Heartbeats are single User Datagram Protocol (UDP) packets exchanged between server cluster Node Managers every 1.2 seconds to confirm that each network interface is still up.
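
As a rough illustration of the timing, the check below flags a peer once no heartbeat has arrived for more than two 1.2-second periods, which is the tolerance the article applies before any further testing. The function and constant names are hypothetical, and the actual Cluster service logic is not documented here; this is only a sketch of the arithmetic.

```python
HEARTBEAT_PERIOD_S = 1.2  # heartbeat interval stated in the article
MISSED_PERIODS = 2        # tolerance stated in the article


def heartbeat_missed(last_received_s: float, now_s: float) -> bool:
    """Return True when no heartbeat has arrived within two heartbeat periods.

    Both arguments are timestamps in seconds on the same clock
    (an assumption made for this sketch).
    """
    return (now_s - last_received_s) > HEARTBEAT_PERIOD_S * MISSED_PERIODS


if __name__ == "__main__":
    print(heartbeat_missed(last_received_s=0.0, now_s=2.0))  # still inside the window
    print(heartbeat_missed(last_received_s=0.0, now_s=3.0))  # window exceeded
```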

If a heartbeat packet is not received within two heartbeat periods, and the local area network (LAN) to which the server cluster is connected is configured for client-to-cluster communication, the Cluster service tests the ability of each node to communicate with external hosts. External hosts, by this definition, are the IP addresses that are obtained by using the method in the following example. A frequently used external host is the local router (default gateway).

Example

 * The cluster has two nodes, Node1 and Node2.
 * HEARTBEAT CONNECTION is configured as a private network for heartbeat communication.
 * PUBLIC CONNECTION is configured as a mixed network for client access.
 * NIC1 is attached to Node1. NIC2 is attached to Node2. NIC1 and NIC2 are members of PUBLIC CONNECTION.


 * 1) Obtain all IP addresses that are bound to NIC1 to form IPLIST1.
 * 2) Obtain all IP addresses that are bound to NIC2 to form IPLIST2.
 * 3) Combine IPLIST1 and IPLIST2 to form IPLIST.
 * 4) Check the IP Route Table of Node1 to obtain the IP addresses (PINGLIST11) that are listed as Gateways and masked with the network mask of Interface NIC1 to match the subnet of NIC1 (the default gateway of NIC1 is included in this list). Check the current TCP Connection Table that is established with NIC1 to obtain the TCP Remote addresses (PINGLIST12). Combine PINGLIST11 and PINGLIST12 to form PINGLIST1.
 * 5) Check the IP Route Table of Node2 to obtain the IP addresses (PINGLIST21) that are listed as Gateways and masked with the network mask of Interface NIC2 to match the subnet of NIC2 (the default gateway of NIC2 is included in this list). Check the current TCP Connection Table that is established with NIC2 to obtain the TCP Remote addresses (PINGLIST22). Combine PINGLIST21 and PINGLIST22 to form PINGLIST2.
 * 6) Combine PINGLIST1 and PINGLIST2 to form PINGLIST.
 * 7) Combine IPLIST and PINGLIST to form UNIONLIST. Remove the duplicate items, remove the IP addresses that are bound to local NICs, and remove the IP addresses that are not in the LAN of PUBLIC CONNECTION. UNIONLIST lists all the IP addresses that can be "external hosts."
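
The combining and filtering in steps 3, 6, and 7 can be sketched with set operations. This is a minimal illustration, not the Cluster service's implementation: the function name and parameters are hypothetical, the four input lists are assumed to have been collected already (steps 1, 2, 4, and 5 are not shown), and the sample addresses are invented for the example.

```python
from ipaddress import ip_address, ip_network


def build_union_list(iplist1, iplist2, pinglist1, pinglist2,
                     local_addrs, public_subnet):
    """Build the candidate external-host list as seen from one node."""
    subnet = ip_network(public_subnet)
    local = {ip_address(a) for a in local_addrs}

    # Step 3: IPLIST = IPLIST1 + IPLIST2
    iplist = {ip_address(a) for a in iplist1} | {ip_address(a) for a in iplist2}
    # Step 6: PINGLIST = PINGLIST1 + PINGLIST2
    pinglist = {ip_address(a) for a in pinglist1} | {ip_address(a) for a in pinglist2}
    # Step 7: the set union removes duplicates; then drop addresses bound to
    # local NICs and addresses outside the PUBLIC CONNECTION subnet.
    union = iplist | pinglist
    return [str(a) for a in sorted(union) if a not in local and a in subnet]


if __name__ == "__main__":
    # Hypothetical data, evaluated from Node1's point of view.
    print(build_union_list(
        iplist1=["10.0.0.1"],                       # bound to NIC1
        iplist2=["10.0.0.2"],                       # bound to NIC2
        pinglist1=["10.0.0.254", "10.0.0.50"],      # gateway and a TCP peer
        pinglist2=["10.0.0.254", "192.168.5.9"],    # gateway and an off-subnet peer
        local_addrs=["10.0.0.1"],
        public_subnet="10.0.0.0/24",
    ))
```

With this sample data, the off-subnet peer and Node1's own address are dropped, and the shared gateway appears only once.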

The Cluster service tests LAN connectivity by using Internet Control Message Protocol (ICMP) echo requests to determine the scope of the network interface failure. For example, if the nodes on your server cluster are unable to communicate with each other, but one of the nodes is able to communicate with an external host, that node's network interface remains up, and that node, if designated a possible owner, takes ownership of the cluster resources that depend on client LAN connectivity. Because ICMP echo requests consume LAN resources, they are used only as a secondary method of determining a failure.

Server cluster network interfaces that are configured only for private communication between nodes behave differently when a LAN failure is detected. Because of this, the private LAN should be isolated so that the cluster nodes are the only computers connected to the segment and only one LAN resides on the segment. Other private LANs for the same cluster must be isolated on a different segment. To create the isolated segment, you may use a hub or, in the case of a two-node server cluster, a crossover cable.

Based on these requirements, there are no external hosts for use in determining the extent of the failure. If there is no alternate LAN for private cluster communication, the Cluster service must use the quorum device to arbitrate which node should remain up and running. Otherwise, an alternate available LAN is used for private cluster communications. Note that this process does not take into account the status of LANs designated for client use only.
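
The choice described above can be sketched as a small lookup. This is an assumption-laden illustration: the function name, the tuple layout, and the role labels "private", "mixed", and "client" are hypothetical, chosen to mirror the roles the article mentions.

```python
from typing import Iterable, Optional, Tuple

# (network name, role, is_up); layout and role labels are assumptions
Network = Tuple[str, str, bool]


def alternate_private_network(networks: Iterable[Network],
                              failed_name: str) -> Optional[str]:
    """Return another network usable for private cluster communication.

    Client-only networks are ignored, as the article notes. A result of
    None means the caller would fall back to quorum arbitration.
    """
    for name, role, is_up in networks:
        if name != failed_name and is_up and role in ("private", "mixed"):
            return name
    return None


if __name__ == "__main__":
    nets = [("HEARTBEAT CONNECTION", "private", False),
            ("PUBLIC CONNECTION", "mixed", True)]
    print(alternate_private_network(nets, "HEARTBEAT CONNECTION"))
```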

Network interface states

Unavailable
The owning node is down.

Failed
Reports that other interfaces on the LAN can communicate with each other or with external hosts, while the local interface cannot. The possible causes for this state are:
 * Network adapter failure.
 * Network adapter driver failure.
 * Local cable failure.
 * Port failure on the device that the network adapter is connected to.

Unreachable
Cannot communicate with at least one other interface whose state is not Failed or Unavailable.

Up
Can communicate with all other interfaces on the LAN whose states are not Failed or Unavailable. This is the normal operational state.

Network states

Unavailable
All interfaces defined on this cluster network are Unavailable.

Down
All network interfaces defined on this cluster network have lost communication with each other and with all known external hosts. All connected network interfaces on up nodes are in either the Failed or the Unreachable state. Therefore, all Transmission Control Protocol/Internet Protocol (TCP/IP) address resources that are defined on the same subnet, and all resources that depend on these resources, do not work and are unavailable on the LAN.

Partitioned
One or more network interfaces are in the Unreachable state, but at least two interfaces can still communicate with each other or with an external host.

NOTE: This only applies to server clusters that have two or more nodes.

Up
All network interfaces defined on this cluster network that are not Failed and are not Unavailable can communicate. This is the normal operational state.

In the following examples, there is only one LAN in the server cluster which is configured for client to public communication, and this LAN is lost.

NOTE: Disabling media sense on each node in the cluster affects its behavior, and this behavior is noted in the examples listed below. For more information about disabling media sense, click the following article number to view the article in the Microsoft Knowledge Base:

239924 How to disable Media Sense for TCP/IP in Windows

Scenario

 * Node A and node B lose communication.
 * Node B can communicate with an external host.
 * Node A cannot communicate with any external hosts.

Results

 * The node A network interface state is Unreachable, then Failed, and then this network interface disappears from Cluster Administrator.
 * The node B network interface state is Unreachable, and then Up.
 * The Network state is Up.
 * Any resource groups with TCP/IP address resources that depend on the failed network interface fail over to node B.

Scenario

 * Node A and node B lose communication.
 * Node A and node B cannot communicate with any external hosts.

Results

 * The state of both node A and node B network interfaces is Unreachable, and they disappear from Cluster Administrator.
 * The Network state is Down, and the network disappears from Cluster Administrator. When the LAN connection is restored, this LAN inherits the default network role, which is to be used for both client and private communication. If a different role is needed, you must change it manually.
 * No resource groups fail over. TCP/IP address resources dependent on that network fail, and all resources that are dependent on that TCP/IP address are taken offline.

Results with Media Sense Disabled

 * Both network interfaces are Unreachable until network connectivity can be re-established.
 * Network state remains Down until the LAN connection is restored. This retains the network role configuration.
 * The resources remain online.
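
The network states defined earlier can be approximated from the interface states alone, and the sketch below checks that approximation against the two scenarios above. It is a simplification under stated assumptions: the function name is hypothetical, and the Partitioned test (an Unreachable interface alongside at least one that is still communicating) stands in for connectivity information that the Cluster service has but that is not modeled here.

```python
from typing import Iterable


def network_state(interface_states: Iterable[str]) -> str:
    """Derive a network state from interface states (simplified sketch)."""
    # Interfaces on down nodes (Unavailable) are ignored for the other checks.
    live = [s for s in interface_states if s != "Unavailable"]
    if not live:
        return "Unavailable"      # every interface is Unavailable
    if all(s in ("Failed", "Unreachable") for s in live):
        return "Down"             # nothing on an up node can communicate
    if "Unreachable" in live:
        return "Partitioned"      # some interfaces cut off, others still working
    return "Up"


if __name__ == "__main__":
    # First scenario, final states: node A Failed, node B Up.
    print(network_state(["Failed", "Up"]))
    # Second scenario: both interfaces Unreachable.
    print(network_state(["Unreachable", "Unreachable"]))
```

The first call yields Up and the second yields Down, matching the Network state reported in each scenario's results.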

NOTE: In the process of doing a "rolling" upgrade from a Microsoft Windows NT Server 4.0, Enterprise Edition Cluster Server to a Windows 2000 Server Cluster, there will be a point when you will have a Windows 2000 node and a Windows NT 4.0 node. In this case, the Windows 2000 node uses the Windows NT 4.0 interface state algorithm. When all nodes are running Windows 2000, they will use the Windows 2000 interface state algorithm. For more information about the Windows NT 4.0 interface state algorithm, click the following article number to view the article in the Microsoft Knowledge Base:

176320 Impact of network adapter failure in a cluster

Additional query words: mscs

Keywords: kbinfo kbnetwork KB242600


© Microsoft Corporation. All rights reserved.