Microsoft KB Archive/815125

From BetaArchive Wiki
Knowledge Base


FIX: Possible Cluster Failover Issues with Pre-Service Pack 4 Version of SQL Server 7.0

Article ID: 815125

Article Last Modified on 1/4/2006



APPLIES TO

  • Microsoft SQL Server 7.0 Standard Edition
  • Microsoft SQL Server 7.0 Service Pack 1
  • Microsoft SQL Server 7.0 Service Pack 2
  • Microsoft SQL Server 7.0 Service Pack 3



BUG #: 49573 (sqlbug_70)

BUG #: 101175 (sqlbug_70)

SYMPTOMS

Clustered computers that are running a SQL Server 7.0 pre-Service Pack 4 (SP4) version might experience unexpected failovers. When the failover occurs, SQL Server might write these error messages to the SQL Server error log:

2003-01-21 16:13:57.99 logon    Login failed for user 'webuser'.
2003-01-21 16:13:59.09 spid1    Closing file F:\MSSQL7\data\MyDb_Data.MDF.
...
2003-01-21 16:13:59.60 spid1    Closing file F:\MSSQL7\DATA\modellog.ldf.
2003-01-21 16:13:59.60 kernel   SQL Server is terminating due to 'stop' request from Service Control Manager.
2003-01-21 16:14:00.35 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.35 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.35 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'.
2003-01-21 16:14:00.37 logon    Login failed for user 'webuser'. 


A DBCC stackdump or userdump taken for SQL Server before the failover occurs might show the following stack:

ntdll!ZwWaitForSingleObject+0xb
KERNEL32!WaitForSingleObjectEx+0x71
ums!UmsThreadScheduler::Switch+0x59
ums!UmsScheduler::Suspend+0xb2
ums!UmsEvent::Wait+0x89
opends60!release_srvproc+0x7c
opends60!exit_srv_thread+0x73
opends60!process_commands+0x2aa
ums!ProcessWorkRequests+0xed
ums!ThreadStartRoutine+0x13a
MSVCRT!_threadstart+0x57
KERNEL32!BaseThreadStart+0x52

CAUSE

Before SQL Server 7.0 SP4, login requests that are not valid might delay the release of internal structures (srv_proc) with SQL Server for a period of 60 seconds. During this delay, SQL Server does not free the worker thread that is associated with the invalid login request. This reduces the number of worker threads available to the worker pool. If a significant number of requests are not coming into SQL Server, all the worker threads may end up in the 60 second delay. Therefore, this may lead to no worker threads being available to process valid database requests. A shortage of worker threads also affects SQL Server cluster resources. When the cluster resource does not receive a response in 60 seconds, it may initiate a failover of the resources.

The following can contribute to the cluster failover problem:

  1. A long running run away query (without proper indexes) not yielding that leads to possible login timeouts (may be 30 seconds) from Web servers. Because of the 60 second delay, SQL Server waits 60 seconds before it gives up the worker thread that was servicing the request that timed-out. When the Web servers experience timeout issues, they keep trying to make new requests to keep their pool of connections constant. This exhausts all the worker threads. Because all the worker threads end up in the release_srvproc function, the cluster resource times out and initiates the failover of the SQL Server resources.
  2. When a shutdown of the SQL Server is underway, any non-sa logins that are trying to run a query are rightfully denied access to the server, and they are set up to fail with a "Login failed" error message. They are set up to fail with an error message because the server does not honor a non-sa login while a shutdown is in progress. Therefore, if the cluster initiates a shutdown request and the Web servers keep submitting non-sa login requests at the same time, they all end up with "Login failed" messages in the error log, as shown earlier in this article. With clusters, as soon as a shutdown is issued by the cluster service (because it was not able to get a worker thread to run a simple query to validate if the server is responding), all new non-sa connection requests coming from Web servers end up holding the worker threads for 60 seconds each. This leads to a longer period for SQL Server to shutdown.
  3. If you are using Microsoft Data Access Components (MDAC) version 2.6 from a client application to connect to a SQL Server 7.0 database, read the following articles in the Microsoft Knowledge Base for more information:

    300519 FIX: Problems Occur When Connecting to SQL Server from ADO Application

    301202 HOW TO: Check for MDAC Version

    231943 INFO: Microsoft Data Access Components (MDAC) Release History

  4. Actual login requests that are not valid that are being submitted by the Web servers (a malicious user). The invalid requests delay the worker threads and lead to the whole chain of events. Going through the Web server logs and SQL Profiler (monitoring only for "Login failed" error messages), or use of the Audit Login failures will help you to see if any invalid login attempts come into SQL Server. Only pursue this after you rule out the previous options as a possible cause for the behavior.


RESOLUTION

To resolve this problem, obtain the latest service pack for Microsoft SQL Server 7.0. For additional information, click the following article number to view the article in the Microsoft Knowledge Base:

301511 INF: How to Obtain the Latest SQL Server 7.0 Service Pack



After you apply SQL Server 7.0 SP4, you must also apply hotfix 745 to fix a known memory leak problem on clusters:

253738 FIX: SQL Server Components that Access the Registry in a Cluster Environment May Cause a Memory Leak


Note: Hotfix 7.00.0745 is not included with SQL Server 7.0 SP4. That is why you must apply it after you install SQL Server 7.0 SP4.

STATUS

Microsoft has confirmed that this is a problem in the Microsoft products that are listed at the beginning of this article.
This problem was first corrected in Microsoft SQL Server 7.0 Service Pack 4.


Additional query words: MSCS

Keywords: kbfix KB815125