
= How to enable User Mode Hang Detection on a server cluster in Windows Server 2003 and in Windows 2000 Server SP4 =

Article ID: 815267

Article Last Modified on 2/28/2007

APPLIES TO


 * Microsoft Windows Server 2003, Datacenter Edition (32-bit x86)
 * Microsoft Windows Server 2003, Enterprise Edition (32-bit x86)
 * Microsoft Windows 2000 Service Pack 4

SUMMARY
This article describes how to configure and use User Mode Hang Detection on Windows 2000 Server Service Pack 4 (SP4) and Windows Server 2003 server clusters. For more information about how to obtain the latest service pack for Microsoft Windows 2000, see the following article in the Microsoft Knowledge Base:

260910 How to obtain the latest Windows 2000 service pack



MORE INFORMATION
Sometimes, a cluster node may stop responding ("hang"). Certain conditions, such as thread deadlocks or memory leaks, may deprive user-mode processes of resources that they must have to function correctly. These conditions may also prevent user-mode processes from running. This may cause the programs or services on the cluster node to stop servicing client requests. Because cluster node health monitoring is performed at the kernel level, and because kernel components may continue to function in these cases, a cluster node whose user-mode processes have stopped responding may still appear to be a fully functioning cluster node. The unresponsive cluster node becomes unavailable to the end user, but it does not fail over because the other cluster nodes cannot detect a failure in user mode.

The following symptoms typically indicate that the cluster node has stopped responding:
 * You can confirm IP connectivity to the server that is hanging by pinging it.
 * You cannot successfully establish a connection to the server by using the net use command.
 * You cannot successfully connect to the server by using a Terminal Services client.
 * You can move the mouse pointer when you log on locally to the server.
 * You cannot start programs or utilities when you are logged on locally to the server.

Note: Although other issues may cause some of the previous symptoms, this combination of issues generally indicates that the server has stopped responding.
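
The symptom pattern above can be written down as a simple check. The following Python sketch is illustrative only: the function name and its parameters are hypothetical, and the inputs are observations that you collect yourself (for example, by running ping and net use against the node).

```python
def looks_like_user_mode_hang(ping_ok, net_use_ok, terminal_services_ok,
                              mouse_moves, can_start_programs):
    """Return True when the observations match the symptom list above.

    The node still answers ping and the mouse pointer still moves, but
    net use, Terminal Services, and local program starts all fail.
    """
    still_working = ping_ok and mouse_moves
    all_failing = not (net_use_ok or terminal_services_ok or can_start_programs)
    return still_working and all_failing


# Ping works and the pointer moves, but nothing else responds.
print(looks_like_user_mode_hang(True, False, False, True, False))
```

As the note above says, other issues can produce some of these symptoms, so a True result is a hint and not a diagnosis.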

"Hang" detection in Cluster service
The Windows Cluster service incorporates a limited detection mechanism that may detect unresponsiveness in user-mode components. The kernel-mode ClusNet driver monitors the health of the user-mode ClusSvc.exe program through periodic communication between the two. This periodic communication is the heartbeat. The Cluster service in Windows Server 2003 and Windows 2000 SP4 has two new properties that control the behavior of the heartbeat:
 * ClusSvcHeartbeatTimeout

This property controls how long the ClusNet driver waits between ClusSvc heartbeats before it determines that ClusSvc has stopped responding. By default, the value for this property is 60 seconds.
 * HangRecoveryAction

This property controls the action to take if the user-mode processes have stopped responding. By default, the Cluster service stops. This causes cluster resources to fail over to other cluster nodes.

How to turn on Cluster service "hang" detection
The Cluster service processes the changes to these cluster properties only during the initialization of the Cluster service. Therefore, you must stop and then restart the Cluster service on each node to make sure that the new policies take effect. To minimize resource downtime, restart the Cluster service on the cluster nodes one node at a time.
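
The one-node-at-a-time restart can be planned as an ordered list of steps. The following Python sketch only builds the plan and does not run anything. The node names are hypothetical, and the use of net stop and net start with the ClusSvc service name is an assumption about how you restart the service on each node; the article does not prescribe a specific command.

```python
def rolling_restart_plan(nodes):
    """Return (node, command) steps that restart one node at a time."""
    steps = []
    for node in nodes:
        # Finish both steps on one node before moving to the next,
        # so that another node is always available to host resources.
        steps.append((node, "net stop clussvc"))
        steps.append((node, "net start clussvc"))
    return steps


for node, command in rolling_restart_plan(["NODE1", "NODE2"]):
    print(f"{node}: {command}")
```

In practice, you would also wait for each node to rejoin the cluster before you continue with the next node.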

ClusSvcHeartbeatTimeout
To configure how much time elapses before ClusNet determines that ClusSvc is unresponsive, set the value of the ClusSvcHeartbeatTimeout property. The heartbeat interval is calculated according to the following formula:

Heartbeat interval = ClusSvcHeartbeatTimeout in seconds/4

For example, if you set the ClusSvcHeartbeatTimeout property to 60 seconds, the heartbeat is sent every 15 seconds (60 seconds divided by 4).
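
The same arithmetic as a minimal Python sketch. The function name is hypothetical; only the divide-by-4 rule comes from this article.

```python
def heartbeat_interval_seconds(clussvc_heartbeat_timeout):
    """Heartbeat interval = ClusSvcHeartbeatTimeout (in seconds) / 4."""
    if clussvc_heartbeat_timeout <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return clussvc_heartbeat_timeout / 4


print(heartbeat_interval_seconds(60))   # 15.0
print(heartbeat_interval_seconds(120))  # 30.0
```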

The ClusNet driver maintains a countdown timer that initiates the action that is specified by the HangRecoveryAction property when the timer reaches 0 (zero). Whenever the ClusNet driver receives a ClusSvc heartbeat, the countdown timer is reset to the value of the ClusSvcHeartbeatTimeout property. Additionally, when the Cluster service stops for any reason, the ClusNet driver automatically turns off the countdown timer.
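
The countdown behavior described above can be modeled with a small simulation. This Python sketch is a toy model under stated assumptions, not the driver's implementation: the class name, method names, and the use of simulated time are all hypothetical.

```python
class HangCountdown:
    """Toy model of the countdown timer (not the ClusNet driver's code)."""

    def __init__(self, timeout_seconds=60):
        self.timeout_seconds = timeout_seconds
        self.remaining = timeout_seconds
        self.enabled = True
        self.action_fired = False

    def heartbeat(self):
        # A heartbeat from ClusSvc resets the countdown to the full timeout.
        self.remaining = self.timeout_seconds

    def service_stopped(self):
        # When the Cluster service stops, the countdown is turned off.
        self.enabled = False

    def tick(self, seconds=1):
        # Advance simulated time; the recovery action fires when 0 is reached.
        if not self.enabled or self.action_fired:
            return
        self.remaining = max(0, self.remaining - seconds)
        if self.remaining == 0:
            self.action_fired = True


timer = HangCountdown(60)
timer.tick(45)
timer.heartbeat()          # ClusSvc is still alive, countdown resets
timer.tick(60)             # no heartbeat for the full timeout
print(timer.action_fired)  # True
```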

To set the value of the ClusSvcHeartbeatTimeout property, run the following command from a command prompt:

cluster.exe /cluster:ClusterName /prop clussvcheartbeattimeout=Seconds

where ClusterName is the name of the cluster and Seconds is the number of seconds that you want to use in the calculation of the heartbeat.
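
If you script this change, a small helper can validate the inputs before it builds the command line. The following Python sketch is an illustration: the helper name and the cluster name MYCLUSTER are hypothetical, and only the cluster.exe switches shown in this article are used.

```python
def heartbeat_timeout_command(cluster_name, timeout_seconds):
    """Build the argument list for the cluster.exe command shown above."""
    if not cluster_name:
        raise ValueError("cluster_name is required")
    seconds = int(timeout_seconds)
    if seconds <= 0:
        raise ValueError("timeout_seconds must be a positive integer")
    return [
        "cluster.exe",
        f"/cluster:{cluster_name}",
        "/prop",
        f"clussvcheartbeattimeout={seconds}",
    ]


print(" ".join(heartbeat_timeout_command("MYCLUSTER", 120)))
```

The sketch only prints the command. Remember that the new value does not take effect until the Cluster service is restarted on each node.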

HangRecoveryAction
When the ClusNet driver countdown timer reaches 0 (zero), the HangRecoveryAction property is initiated. You can set the HangRecoveryAction property to one of the following numeric values:
 * 0 (zero): Disables the heartbeat and monitoring mechanism.
 * 1: Logs an event in the system log of the Event Viewer.
 * 2: Terminates the Cluster Service. This is the default setting.
 * 3: Causes a Stop error (Bugcheck) on the cluster node.

To set the value of the HangRecoveryAction property, run the following command at a command prompt:

cluster.exe /cluster:ClusterName /prop hangrecoveryaction=Action

where ClusterName is the name of the cluster and Action is the number that corresponds to the action that you want to occur if the ClusNet driver countdown timer reaches 0 (zero).
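
To avoid passing an unsupported number, the four documented values can be captured in an enumeration. This Python sketch is illustrative: the enumeration member names, the helper name, and the cluster name MYCLUSTER are hypothetical; the numeric values and their meanings come from the list above.

```python
from enum import IntEnum


class HangRecoveryAction(IntEnum):
    DISABLED = 0           # Disables the heartbeat and monitoring mechanism.
    LOG_EVENT = 1          # Logs an event in the system log.
    TERMINATE_CLUSSVC = 2  # Terminates the Cluster service (default).
    BUGCHECK = 3           # Causes a Stop error on the cluster node.


def hang_recovery_command(cluster_name, action):
    """Build the cluster.exe command line for a documented action value."""
    action = HangRecoveryAction(action)  # ValueError for values outside 0-3
    return (f"cluster.exe /cluster:{cluster_name} "
            f"/prop hangrecoveryaction={int(action)}")


print(hang_recovery_command("MYCLUSTER", HangRecoveryAction.BUGCHECK))
```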

Note: In some extreme cases, system services may also stop responding, and actions 1 and 2 in the earlier list may not succeed. In such cases, action 3 (bugcheck) is the only effective recovery measure.

If the action is set to cause a bugcheck on the cluster node, Windows halts with a Stop error that has a Bugcheck code of 0x9E. The Stop error causes a failover to another cluster node. Additionally, if the node where the Stop error occurs is configured to capture a memory dump file, you may be able to use the information in the memory dump file to diagnose why the cluster node became unresponsive. The following is an example of a stack trace from a kernel dump that the ClusNet driver initiated:

 ChildEBP  RetAddr
 f9c33ea8  f6e2e11f  nt!KeBugCheckEx+0x19
 f9c33ecc  f6e2e836  clusnet!CnpCheckClussvcHang+0xef
 f9c33ef0  805070d7  clusnet!CnpHeartBeatDpc+0x47e
 f9c33fa4  8050735d  nt!KiTimerExpiration+0x371
 f9c33ff4  80543ccf  nt!KiRetireDpcList+0x63

The Bugcheck error code is similar to the following error code:

BugCheck 9E, {812d5b08, 3c, 0, 0}
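
When you review several dumps, it can help to pull the code and parameters out of this line programmatically. The following Python sketch parses a line in the format shown above; the function name is hypothetical. Reading the second parameter as the timeout in seconds is an assumption based on Microsoft's general description of bug check 0x9E and is not stated in this article, although 0x3c (60) does match the default ClusSvcHeartbeatTimeout value.

```python
import re


def parse_bugcheck(line):
    """Parse a debugger line such as 'BugCheck 9E, {812d5b08, 3c, 0, 0}'."""
    match = re.fullmatch(
        r"\s*BugCheck\s+([0-9A-Fa-f]+),\s*\{([^}]*)\}\s*", line)
    if match is None:
        raise ValueError(f"not a BugCheck line: {line!r}")
    code = int(match.group(1), 16)
    # Parameters are printed in hexadecimal without a 0x prefix.
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return code, params


code, params = parse_bugcheck("BugCheck 9E, {812d5b08, 3c, 0, 0}")
print(hex(code), params[1])  # 0x9e 60
```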

Important: You must manually configure the server to generate a memory dump file in response to a Bugcheck.

Windows 2000 service pack information
To resolve this problem, obtain the latest service pack for Windows 2000. For more information, see the following article in the Microsoft Knowledge Base:

260910 How to obtain the latest Windows 2000 service pack

Windows 2000 hotfix information
A supported hotfix is now available from Microsoft, but it is only intended to correct the problem that is described in this article. Only apply it to systems that are experiencing this specific problem. This hotfix may receive additional testing. Therefore, if you are not severely affected by this problem, we recommend that you wait for the next Windows 2000 service pack that contains this hotfix.

To resolve this problem immediately, contact Microsoft Product Support Services to obtain the hotfix. For a complete list of Microsoft Product Support Services telephone numbers and information about support costs, visit the following Microsoft Web site:

http://support.microsoft.com/contactus/?ws=support

Note In special cases, charges that are ordinarily incurred for support calls may be canceled if a Microsoft Support Professional determines that a specific update will resolve your problem. The usual support costs will apply to additional support questions and issues that do not qualify for the specific update in question.

The English version of this hotfix has the file attributes (or later file attributes) that are listed in the following table. The dates and times for these files are listed in Coordinated Universal Time (UTC). When you view the file information, it is converted to local time. To find the difference between UTC and local time, use the Time Zone tab in the Date and Time tool in Control Panel.

 Date         Time   Version             Size  File name
 --------------------------------------------------------
 12-Mar-2003  14:22  5.0.2195.6683     55,568  Clusapi.dll
 12-Mar-2003  14:02  5.0.2195.6683     67,760  Clusnet.sys
 12-Mar-2003  14:02  5.0.2195.6683    682,768  Clussvc.exe
 12-Mar-2003  14:22  5.0.2195.6660     99,600  Netman.dll
 12-Mar-2003  14:22  5.0.2195.6604    477,456  Netshell.dll
 12-Mar-2003  14:02  5.0.2195.6683     54,544  Resrcmon.exe
 07-Mar-2003  18:41  5.0.2195.6680  3,988,992  Sp3res.dll

Additional query words: MSCS hang detection failover fail-over move group

Keywords: kbhotfixserver kbqfe kbwin2ksp4fix kbinfo KB815267

© Microsoft Corporation. All rights reserved.