Microsoft KB Archive/897284

= SQL Server 2000 SP4 diagnostics help detect stalled and stuck I/O operations =

Article ID: 897284

Article Last Modified on 11/2/2007

-

APPLIES TO


 * Microsoft SQL Server 2000 Enterprise Edition
 * Microsoft SQL Server 2000 Personal Edition
 * Microsoft SQL Server 2000 Standard Edition
 * Microsoft SQL Server 2000 Workgroup Edition
 * Microsoft SQL Server 2000 Developer Edition
 * Microsoft SQL Server 2000 Enterprise Edition 64-bit
 * Microsoft SQL Server 2000 Service Pack 4

-



SUMMARY
A database management system (DBMS), such as SQL Server, relies on the timeliness of file input and output (I/O) operations. Any one of the following items may create stuck or stalled I/O problems. These problems can quickly have an adverse effect on SQL Server responsiveness and performance:
 * Faulty hardware
 * Hardware that is not configured correctly
 * Firmware settings
 * Filter drivers
 * Compression
 * Bugs
 * Other conditions in the I/O path

These I/O problems may cause the following behavior to occur:
 * Blocking
 * Latch contention and time-outs
 * Slow response time
 * Stretching of resource boundaries

Microsoft SQL Server 2000 Service Pack 4 (SP4) includes logic that helps detect stalled and stuck conditions for database I/O reads and writes and log file I/O reads and writes. When an I/O operation has been pending for 15 seconds or longer, SQL Server performs the following steps:
 * Detects that the operation has been pending.
 * Writes an informational message to the SQL Server error log. The text of the message is similar to the following:

2004-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) of IO requests taking longer than 15 seconds to complete on file [E:\SEDATA\stressdb5.ndf] in database [stressdb] (7). The OS file handle is 0x00000000000074D4. The offset of the latest long IO is: 0x00000000022000
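
When many error logs have to be scanned, a message of this form can be picked apart with a regular expression. The following Python sketch is modeled only on the sample message above. The pattern, the function name `parse_long_io_message`, and the field names are assumptions made for illustration, and the exact wording in other builds may differ.

```python
import re

# Pattern modeled on the sample message in this article. The wording in
# other builds is an assumption and may differ.
LONG_IO_PATTERN = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of IO requests taking longer "
    r"than (?P<seconds>\d+) seconds to complete on file \[(?P<file>[^\]]+)\] "
    r"in database \[(?P<database>[^\]]+)\] \((?P<dbid>\d+)\)"
)


def parse_long_io_message(line):
    """Return a dict of fields from a long-I/O log line, or None if no match."""
    match = LONG_IO_PATTERN.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("count", "seconds", "dbid"):
        fields[key] = int(fields[key])
    return fields


# The sample message quoted in this article.
sample = (
    "2004-11-11 00:21:25.26 spid1 SQL Server has encountered 192 occurrence(s) "
    "of IO requests taking longer than 15 seconds to complete on file "
    "[E:\\SEDATA\\stressdb5.ndf] in database [stressdb] (7). "
    "The OS file handle is 0x00000000000074D4. "
    "The offset of the latest long IO is: 0x00000000022000"
)
parsed = parse_long_io_message(sample)
```

Grouping the parsed `count` values by `file` shows quickly whether the long I/O requests are concentrated on one file or spread across the I/O path.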

Informational message explanation
This informational message indicates that the current load may be experiencing one of the following conditions:
 * The workload is exceeding the I/O path capabilities.
 * The workload is exceeding the current system capabilities.
 * The I/O path has malfunctioning software; perhaps a firmware or a driver issue.
 * The I/O path has malfunctioning hardware components.

For more information about SQL Server 2000 I/O patterns, visit the following Microsoft Web site:

http://www.microsoft.com/technet/prodtechnol/sql/2000/maintain/sqlIObasics.mspx



Stuck I/O
Stuck I/O is defined as an I/O request that does not complete. Frequently, stuck I/O is a stuck IRP. To resolve a stuck I/O condition, you typically must restart the computer or perform a similar action. A stuck I/O condition typically indicates one of the following:
 * Faulty hardware.
 * A bug in an I/O path component.

Stalled I/O
Stalled I/O is defined as an I/O request that does complete. However, the I/O operation takes excessive time to complete. Stalled I/O behavior typically occurs because of one of the following reasons:
 * The hardware configuration.
 * The firmware settings.
 * A filter driver issue that requires assistance from the hardware or the software vendor to trace and to resolve.
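
The distinction between the two definitions can be sketched in Python. This is an illustration only: the function `classify_io` and its labels are hypothetical, and the 15-second detection threshold from the "Summary" section is used as a stand-in for "excessive time". A request that is still pending cannot yet be called stuck or stalled, because it may still complete.

```python
STALL_THRESHOLD_SECONDS = 15  # detection threshold stated in this article


def classify_io(issued_at, completed_at, now):
    """Label one I/O request using the article's definitions.

    Times are seconds on any monotonic clock. completed_at is None while
    the request is still pending.
    """
    if completed_at is not None:
        elapsed = completed_at - issued_at
        return "stalled" if elapsed >= STALL_THRESHOLD_SECONDS else "normal"
    if now - issued_at >= STALL_THRESHOLD_SECONDS:
        # Still pending: it may finish late (stalled) or never finish (stuck).
        return "pending-long"
    return "pending"
```

For example, a request issued at time 0 and completed at time 45 is labeled "stalled", and a request issued at time 0 that is still pending at time 60 is labeled "pending-long".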

SQL Server stalled I/O and stuck I/O recording and reporting
Microsoft SQL Server Support handles many cases each year that involve stuck or stalled I/O problems. As outlined later, these I/O problems appear in different ways. I/O problems are some of the most difficult problems to diagnose and to debug. I/O problems require significant time and resources for debugging from Microsoft and from the customer. The reporting features that have been added to SQL Server 2000 SP4 significantly reduce the time that is required to identify an I/O problem.

The reporting and the recording of I/O requests are designed on a per file basis. The detection and the reporting of stalled and stuck I/O requests are two separate actions.

Recording
There are two locations in SQL Server where a record action takes place. The first is where the I/O operation actually completes: if an I/O request takes more than 15 seconds to complete, a record operation occurs. The second is when the lazy writer runs. The lazy writer checks all the pending data file and log file I/O requests, and if the 15-second threshold has been exceeded, a record operation occurs.

Reporting
Reporting occurs at intervals of 5 minutes or more. Reporting occurs when the next I/O request is made on the file. If a record action has occurred and 5 minutes or more have passed since the last report occurred, the informational message that is mentioned in the "Summary" section is written to the SQL Server error log.

The 15-second threshold is not adjustable. However, you can disable stalled and stuck I/O detection by using trace flag 830. We do not recommend that you do this.

To disable detection when SQL Server starts, use the -T830 startup parameter. Using the -T830 startup parameter will disable detection every time SQL Server is started. To disable detection for an instance of SQL Server that is currently running, use the following statement:

dbcc traceon(830, -1)

This setting is only effective for the life of the SQL Server process.

Note An I/O request that becomes stalled or stuck is only reported one time. If the message reports that 10 I/O requests are stalled, those 10 requests will not be reported again. If the next message reports that 15 I/O requests are stalled, that means that 15 new I/O requests have become stalled.
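
The recording and reporting rules described in this section can be modeled as a small per-file tracker. The following Python sketch is not SQL Server's implementation. The class `FileIoMonitor`, its method names, and the file name are hypothetical, and it assumes that the first report is not delayed by the 5-minute interval, which the article does not state.

```python
RECORD_THRESHOLD_S = 15      # from the article; not adjustable
REPORT_INTERVAL_S = 5 * 60   # from the article


class FileIoMonitor:
    """Hypothetical per-file tracker; not SQL Server's actual implementation."""

    def __init__(self, file_name):
        self.file_name = file_name
        self.unreported = 0         # long I/Os recorded since the last report
        self.last_report_at = None  # time of the last report, if any
        self.messages = []

    def record(self, elapsed_s):
        """Record action: count an I/O that reached the threshold."""
        if elapsed_s >= RECORD_THRESHOLD_S:
            self.unreported += 1

    def on_next_io(self, now_s):
        """Report action: runs when the next I/O request is made on the file."""
        due = (self.last_report_at is None
               or now_s - self.last_report_at >= REPORT_INTERVAL_S)
        if self.unreported and due:
            self.messages.append(
                f"{self.unreported} occurrence(s) of IO requests taking longer "
                f"than {RECORD_THRESHOLD_S} seconds on file [{self.file_name}]"
            )
            self.unreported = 0     # each long I/O is reported only one time
            self.last_report_at = now_s


monitor = FileIoMonitor("example.ndf")
for elapsed in (2, 20, 31):
    monitor.record(elapsed)
monitor.on_next_io(now_s=100)  # first report: two long I/Os
monitor.record(18)
monitor.on_next_io(now_s=200)  # suppressed: under 5 minutes since last report
monitor.on_next_io(now_s=400)  # second report: one new long I/O
```

Resetting the counter after each report is what makes every message describe only new long I/O requests, as the note above explains.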



Tracking the I/O request packet (IRP)
SQL Server uses the standard Microsoft Windows API calls to read and to write data. For example, SQL Server uses the following functions:
 * WriteFile
 * ReadFile
 * WriteFileGather
 * ReadFileScatter

The read or the write request is handled by Windows as an I/O request packet (IRP). To determine the state of the IRP, use both of the following:
 * Microsoft Platform Support assistance.
 * The kernel debugger.

For more information about IRP and IRP tracing, visit the following Microsoft Web site and search for the keyword "IRP":

http://support.microsoft.com

Note Kernel debugging can be an invasive process because kernel debugging can require you to stop the system to complete the debugging actions. We recommend that you see if there are any updates available for the following items:
 * The BIOS
 * The firmware
 * Any other I/O path components

Contact your hardware vendors before you perform additional debugging actions. The debug session will likely involve a third-party driver, firmware, or a filter driver component.

System performance and query plan actions
Overall system performance can play a key role in I/O processing. You should take the general health of the system into consideration when you are investigating reports of stalled or stuck I/O operations. Excessive loads can cause the overall system to be slow. This includes I/O processing. The behavior of the system at the time the problem occurs can be a key factor in determining the root cause of the problem. For example, if CPU usage becomes high or if CPU usage remains high when the problem occurs, this behavior may indicate that a process on the system is using so much CPU that other processes are being adversely affected.

Performance counters
To monitor I/O performance, examine the following performance counters for specific I/O path information:
 * Average Disk Sec/Transfer
 * Average Disk Queue Length
 * Current Disk Queue Length

For example, the Average Disk Sec/Transfer time on a computer that is running SQL Server is typically less than 15 milliseconds. If the Average Disk Sec/Transfer value climbs, this indicates that the I/O subsystem is not optimally keeping up with the I/O demand.

Be careful with the performance counters because SQL Server takes full advantage of asynchronous I/O capabilities that push the disk queue lengths heavily. Therefore, longer disk queue lengths alone do not indicate a problem.
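
As a rough illustration of this guidance, the following Python sketch summarizes a series of Average Disk Sec/Transfer samples against the 15-millisecond figure. The function `summarize_latency` and the sample values are hypothetical. In practice, the values would come from a Performance Monitor log.

```python
from statistics import mean

TYPICAL_MAX_SEC_PER_TRANSFER = 0.015  # 15 ms, the typical ceiling cited above


def summarize_latency(samples_sec):
    """Summarize Average Disk Sec/Transfer samples (values in seconds)."""
    over = [s for s in samples_sec if s > TYPICAL_MAX_SEC_PER_TRANSFER]
    return {
        "mean_ms": round(mean(samples_sec) * 1000, 1),
        "max_ms": round(max(samples_sec) * 1000, 1),
        "fraction_over": len(over) / len(samples_sec),
    }


# Hypothetical samples, for example exported from a Performance Monitor log.
samples = [0.004, 0.006, 0.005, 0.048, 0.120, 0.007]
summary = summarize_latency(samples)
```

The sketch deliberately looks only at latency. Queue length is left out because, as noted above, long disk queues alone do not indicate a problem on a computer that is running SQL Server.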



Indexes and parallelism
Frequently, bursts of I/O occur because an index is missing. This behavior can push the I/O path heavily. A pass with the Index Tuning Wizard (ITW) may help resolve I/O pressure on the system. If a query can use an index instead of a table scan, or can avoid a sort or hash operation, the system can gain the following advantages:
 * A reduction of physical I/O to complete the action that directly equates to performance benefits for the query.
 * Fewer pages in the data cache have to be turned over. Therefore, those pages that are in the data cache remain relevant to active queries.
 * Sorts and hashes are often used because an index is missing or because statistics are out of date. Adding one or more indexes may reduce tempdb use and contention.
 * A reduction of resources, a reduction of parallel operations, or both. Because SQL Server does not guarantee parallel query execution, and because the load on the system is considered, it is best to optimize all queries for serial execution. To optimize a query, open Query Analyzer and set the sp_configure value of the max degree of parallelism option to 1. If all the queries are tuned to run promptly as serial operations, parallel execution then simply gives a better result. However, parallel execution is frequently selected only because the amount of data is large. For a missing index, a large sort may have to occur. Multiple workers that perform the sort operation produce a quicker response, but this action can dramatically increase the pressure on the system. Large read requests from many workers can cause an I/O burst together with increased CPU usage. In many cases, a query can be tuned to run faster and to use fewer resources if an index is added or if another tuning action occurs.

Practical examples from Microsoft SQL Server Support
The following examples have been handled by Microsoft SQL Server Support and Platforms Escalation Support. These examples are intended to provide a frame of reference and help set your expectations about stalled and stuck I/O situations and about how a system may be affected or may respond. There is no specific hardware or set of drivers that poses any specific risk or increased risk over another. All systems are the same in this respect.

Example 1: A log write that is stuck for 45 seconds
A SQL Server log file write attempt periodically became stuck for approximately 45 seconds. The log write was not finishing in a timely manner. This behavior created a blocking condition that led to 30-second client time-outs.

The application submitted a commit to SQL Server and the commit became stuck as a pending log write. This behavior caused the query to continue holding locks and to block incoming requests from other clients. Then, other clients started to time out. This compounded the problem because the application was not designed to roll back open transactions when a query time-out occurred. This created hundreds of open transactions that were holding locks. Therefore, a severe blocking situation occurred.

For more information about transaction handling and blocking, click the following article number to view the article in the Microsoft Knowledge Base:

224453 Understanding and resolving SQL Server 7.0 or 2000 blocking problems

The application serviced a Web site using connection pooling. As more connections became blocked, the Web site created more connections. These connections became blocked and the cycle continued.

After approximately 45 seconds, the log write completed. However, by this time, hundreds of connections were backed up. The blocking problems resulted in several minutes of recovery time for SQL Server and for the application. When combined with the application problems, the stalled I/O condition had a very negative effect on the system.

Resolution
The problem was tracked to a stuck I/O request in a Host Bus Adapter (HBA) driver. The computer had multiple HBA cards with failover support. When one HBA was behind or was not communicating with the Storage Area Network (SAN), the "retry before failover" time-out value was configured to 45 seconds. When the time-out was exceeded, the I/O request was routed to the second HBA. The second HBA handled the request and quickly finished. To help prevent such stall conditions, the hardware manufacturer recommended a "retry before failover" setting of 5 seconds.
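
The effect of the two settings can be checked with simple arithmetic. The following Python sketch compares the worst-case log write latency for each "retry before failover" value with the 30-second client time-out from this example and with the 15-second detection threshold. The function name and the `failover_service_s` parameter are hypothetical. The article says only that the second HBA "quickly finished", so the default of zero is an assumption.

```python
CLIENT_TIMEOUT_S = 30       # client time-out from Example 1
DETECTION_THRESHOLD_S = 15  # SQL Server long-I/O reporting threshold


def worst_case_log_write_s(retry_before_failover_s, failover_service_s=0.0):
    """Worst-case latency when the first HBA does not respond.

    failover_service_s is a hypothetical time for the second HBA to
    complete the request after failover.
    """
    return retry_before_failover_s + failover_service_s


# For each setting: (exceeds client time-out, reaches detection threshold)
results = {
    setting: (
        worst_case_log_write_s(setting) > CLIENT_TIMEOUT_S,
        worst_case_log_write_s(setting) >= DETECTION_THRESHOLD_S,
    )
    for setting in (45, 5)
}
```

With the 45-second setting, a single failover outlasts the client time-out, which is what started the blocking chain. With the 5-second setting, the same failover stays under both limits.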

Example 2: Filter driver intervention
Many antivirus software programs and backup products use I/O filter drivers. These I/O filter drivers become part of the I/O request stack, and they have access to the IRP request. Microsoft Product Support Services has seen various issues from bugs that create stuck I/O conditions or stalled I/O conditions in a filter driver implementation.

One such condition was a filter driver for backup processing that allowed a backup of the files that were open when the backup occurred. The system administrator had included the SQL Server data file directory in the file backup selections. When the backup occurred, the backup tried to gather the correct image of the file at the time the backup started. Doing this delayed I/O requests. The I/O requests were allowed to only complete one at a time as they were handled by the software.

When the backup started, SQL Server performance dropped dramatically because the I/Os of SQL Server were forced to complete one at a time. To compound the issue, the "one at a time" logic was such that the I/O operation could not be performed asynchronously. Therefore, when SQL Server expected to post an I/O request and to continue, the worker was stuck in the read or the write call until the I/O request finished. Processing tasks such as a SQL Server read ahead were effectively disabled by the actions of the filter driver. Additionally, another bug in the filter driver left the "one at a time" actions in process, even when the backup was finished. The only way to restore SQL Server performance was to close and then reopen the database, or to restart SQL Server so that the file handle was released and reacquired without the filter driver interaction.

Resolution
To resolve this problem, the SQL Server data files were removed from the file backup process. The software manufacturer also corrected the problem that left the file in "one at a time" mode.

Example 3: Hidden errors
Many higher end systems have multichannel I/O paths to handle load balancing or similar activities. Microsoft Product Support has found problems with the load balancing software where an I/O request fails but the software does not handle the error condition correctly. The software can attempt infinite retries. The I/O operation becomes stuck and SQL Server cannot finish the specified action. Much like the log write condition that was described earlier, many poor system behaviors can occur after such a condition wedges the system.

Resolution
To resolve this problem, restarting SQL Server is often required. However, sometimes you must restart the operating system to restore processing. We also recommend that you obtain a software update from the I/O vendor.

Example 4: Remote storage, mirroring, and RAID drives
Many systems use mirroring or take similar steps to prevent data loss. Some of the systems that use mirroring are software based and some are hardware based. The situation typically uncovered by Product Support Services for these systems is increased latency.

An increase in the overall I/O time occurs when the I/O must complete to the mirror before the I/O is considered complete. For remote mirror installations, network retries can become involved. When drive failures occur and the RAID system is rebuilding, the I/O pattern can also be interrupted.

Resolution
Strict configuration settings are required to reduce latency to mirrors or to RAID rebuild operations.

Example 5: Compression
Microsoft does not support Microsoft SQL Server 7.0 or Microsoft SQL Server 2000 data files and log files on compressed drives. NTFS compression is not safe for SQL Server because NTFS compression breaks the Write-Ahead Logging (WAL) protocol. NTFS compression also requires increased processing for each I/O operation. Compression creates "one at a time" behavior that causes severe performance issues.

Resolution
To resolve this problem, uncompress the data and the log files.

Additional data points
The wait type information that is provided in the sysprocesses system table can be helpful in diagnosing I/O bottlenecks. Buffer I/O latch wait types and writelog waits are key indicators that I/O path performance should be investigated.

For more information about wait types, click the following article number to view the article in the Microsoft Knowledge Base:

822101 Description of the waittype and lastwaittype columns in the master.dbo.sysprocesses table in SQL Server 2000 and SQL Server 2005
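
The following Python sketch shows one way to filter a sysprocesses snapshot for sessions whose last wait points at the I/O path. It assumes that the rows have already been fetched into dictionaries that mirror the spid, lastwaittype, and waittime columns. The function name `io_bound_sessions` and the snapshot values are hypothetical. The `PAGEIOLATCH_` prefix and the `WRITELOG` name are used here as the buffer I/O latch and log write wait types.

```python
IO_WAIT_PREFIXES = ("PAGEIOLATCH_",)  # buffer I/O latch waits
IO_WAIT_NAMES = {"WRITELOG"}          # log write waits


def io_bound_sessions(rows):
    """Return rows whose lastwaittype points at the I/O path.

    rows: iterable of dicts with 'spid', 'lastwaittype', and 'waittime'
    keys, mirroring columns of master.dbo.sysprocesses.
    """
    hits = []
    for row in rows:
        # lastwaittype is a fixed-width column, so trim any padding.
        wait = row["lastwaittype"].strip().upper()
        if wait in IO_WAIT_NAMES or wait.startswith(IO_WAIT_PREFIXES):
            hits.append(row)
    # Longest waits first.
    return sorted(hits, key=lambda r: r["waittime"], reverse=True)


# Hypothetical snapshot; real rows would come from querying sysprocesses.
snapshot = [
    {"spid": 51, "lastwaittype": "PAGEIOLATCH_SH", "waittime": 16200},
    {"spid": 52, "lastwaittype": "LCK_M_S", "waittime": 30500},
    {"spid": 53, "lastwaittype": "WRITELOG", "waittime": 45100},
]
io_waiters = io_bound_sessions(snapshot)
```

Sessions that wait on locks, such as spid 52 in the snapshot, are excluded here, although in a case like Example 1 they may be blocked behind a session that is itself waiting on I/O.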

