Microsoft KB Archive/224453

From BetaArchive Wiki

Article ID: 224453

Article Last Modified on 3/22/2004



APPLIES TO

  • Microsoft SQL Server 7.0 Standard Edition
  • Microsoft SQL Server 2000 Standard Edition



This article was previously published under Q224453

SUMMARY

This article is an update for SQL Server 7.0 of the following article, which applies to SQL Server 6.x, in the Microsoft Knowledge Base:

162361 INF: Understanding and Resolving SQL Server Blocking Problems


Much of the information contained in the above article has already been updated and included in SQL Server 7.0 Books Online in the "Understanding and Avoiding Blocking" topic. Carefully review this information before proceeding with this article; it will not be repeated here. This article focuses on how to monitor SQL Server to capture pertinent system information and how to analyze that information to successfully resolve blocking issues.

This article uses the same standard terminology as defined in the above information. In this discussion, the term "connection" refers to a single logged-on session of the database. Each connection appears as a system process ID (SPID). Each of these SPIDs is often referred to as a process, although it is not a separate process context in the usual sense. Rather, each SPID consists of the server resources and data structures necessary to service the requests of a single connection from a given client. A single client application may have one or more connections. From the perspective of SQL Server, there is no difference between multiple connections from a single client application on a single client computer and multiple connections from multiple client applications or multiple client computers. One connection can block another connection, regardless of whether they emanate from the same application or separate applications on two different client computers.

MORE INFORMATION

Blocking is an unavoidable characteristic of any relational database management system (RDBMS) with lock-based concurrency. On SQL Server, blocking occurs when one SPID holds a lock on a specific resource and a second SPID attempts to acquire a conflicting lock type on the same resource. Typically, the time frame for which the first SPID locks the resource is very small. When it releases the lock, the second connection is free to acquire its own lock on the resource and continue processing. This is normal behavior and may happen many times throughout the course of a day with no noticeable effect on system performance.

The duration and transaction context of a query determine how long its locks are held and, thereby, their impact on other queries. If the query is not executed within a transaction (and no lock hints are used), the locks for SELECT statements will only be held on a resource at the time it is actually being read, not for the duration of the query. For INSERT, UPDATE, and DELETE statements, the locks are held for the duration of the query, both for data consistency and to allow the query to be rolled back if necessary.

For queries executed within a transaction, the duration for which the locks are held are determined by the type of query, the transaction isolation level, and whether or not lock hints are used in the query. For a description of locking, lock hints, and transaction isolation levels, see the following topics in SQL Server 7.0 Books Online:

  • "Understanding Locking in SQL Server"
  • "Locking Architecture"
  • "Lock Compatibility"
  • "Locking Hints"
  • "Changing Default Locking Behavior in Oracle and SQL Server"

When locking and blocking increase to the point where there is a detrimental effect on system performance, it is usually due to one of the following reasons:

  • A SPID holds locks on a set of resources for an extended period of time before releasing them. This type of blocking resolves itself over time, but can cause performance degradation.
  • A SPID holds locks on a set of resources and never releases them. This type of blocking does not resolve itself and prevents access to the affected resources indefinitely.

In the first scenario above, the blocking problem resolves itself over time as the SPID releases the locks. However, the situation can be very fluid as different SPIDs cause blocking on different resources over time, creating a moving target. For this reason, these situations can be difficult to troubleshoot using SQL Server Enterprise Manager or individual SQL queries. The second situation results in a consistent state that can be easier to diagnose.

Gathering Blocking Information

To counteract the difficulty of troubleshooting blocking problems, a database administrator can use SQL scripts that constantly monitor the state of locking and blocking on SQL Server. These scripts can provide snapshots of specific instances over time, leading to an overall picture of the problem. For a description of how to monitor blocking with SQL scripts, see the following articles in the Microsoft Knowledge Base:

251004 INF: How to Monitor SQL Server 7.0 Blocking


271509 INF: How to Monitor SQL Server 2000 Blocking


The scripts in this article will perform the tasks below. Where possible, the method for obtaining this information from Enterprise Manager or a specific SQL query is given.

  1. Identify the SPID at the head of the blocking chain.
    In addition to using the scripts in the above article, you can also identify the head of the blocking chain by using SQL Enterprise Manager as follows:

    1. Expand the server group; then expand the server.
    2. Expand Management; then expand Current Activity.
    3. Expand Locks / Process ID. The SPIDs, along with their blocking information, are displayed in the details pane. SPIDs that are blocking others will appear as "(Blocking)."
    Note however, that it is sometimes necessary to use queries instead of Enterprise Manager, because some types of tempdb blocking problems may prevent you from running queries that use temporary table operations. Using direct queries gives you the control necessary to avoid this problem.
  2. Find the query that the blocking SPID is running.
    The script method uses the following query to determine the command issued by a particular SPID:

    DBCC INPUTBUFFER (<spid>)
                            

    Alternately, you can use SQL Enterprise Manager as follows:

    1. Expand the server group; then expand the server.
    2. Expand Management; then expand Current Activity.
    3. Click Process Info. The SPIDs are displayed in the details pane.
    4. Double-click the blocking SPID to see the last Transact-SQL command batch the SPID executed.
  3. Find the type of locks the blocking SPID is holding.
    You can determine this information by executing the sp_lock system stored procedure. Alternatively, you can use Enterprise Manager as follows:

    1. Expand the server group; then expand the server.
    2. Expand Management; then expand Current Activity.
    3. Expand Locks / Process ID. The SPIDs, along with the information on the locks they are holding, are displayed in the details pane.
  4. Find the transaction nesting level and process status of the blocking SPID.
    The transaction nesting level of a SPID is available in the @@TRANCOUNT global variable. However, it can be determined from outside the SPID by querying the sysprocesses table as follows:

    SELECT open_tran FROM SYSPROCESSES WHERE SPID=<blocking SPID number>
    go
                            

    The value returned is the @@TRANCOUNT value for the SPID. This shows the transaction nesting level for the blocking SPID, which in turn can explain why it is holding locks. For example, if the value is greater than zero, the SPID is in the midst of a transaction (in which case it is expected that it retains certain locks it has acquired, depending on the transaction isolation level).

    You can also check to see if any long-term open transaction exists in the database by using DBCC OPENTRAN database_name.

Gathering SQL Server Profiler Trace Information

In addition to the above information, it is often necessary to capture a Profiler trace of the activities on the server to thoroughly investigate a blocking problem on SQL Server. If a SPID executes multiple statements within a transaction, only the last statement will appear in the DBCC INPUTBUFFER output. However, one of the earlier commands may be the reason locks are still being held. A Profiler trace will enable you to see all of the commands executed by a SPID within the current transaction. The following steps help you to set up SQL Server Profiler to capture a trace.

  1. Open SQL Server Profiler.
  2. On the Tools menu, click Options.
  3. Ensure that the All Event Classes and All Data Columns options are selected.
  4. Click OK.
  5. On the File menu, point to New then click Trace.
  6. On the General tab, specify a trace name and a file to capture the data to.
  7. On the Events tab, add the following event types to your trace:

    Heading Event to add Description
    Error and Warning Exception This event indicates that an exception has occurred. Exceptions with severity less than 25 indicate that an error was returned to the client from SQL Server. Exceptions of severity 25 are internal SQL Server exceptions and should be filtered out as described below.
    Misc. Attention This event indicates that an attention signal has been raised. Normal causes of an attention signal are a client cancel or query timeout.
    Sessions Connect This event indicates that a new connection has been made.
    Sessions Disconnect This event indicates that a client has disconnected.
    Sessions Existing Connection This event indicates that a connection existed at the time when the SQL Profiler trace was started.
    TSQL RPC:Starting This event indicates that execution of a remote procedure call (RPC) has started.
    TSQL SQL:BatchStarting This event indicates that execution of a Transact-SQL batch has started.
    Stored Procedures SP: StmtStarting This event indicates when a statement within a stored procedure is starting execution. The stored procedure name is indicated at the start of the Text for the event.


    Additionally, you may include the following events for further information. If you are running in a high-volume production environment, you may decide to use only the above events, as they are sufficient to troubleshoot blocking problems. Including the additional events below may make it easier to quickly determine the source of a problem, but will also add to the load on the system and increase the trace output size.

    Heading Event to add Description
    Misc. Execution Plan This event shows the plan tree of the Transact-SQL statement that was executed.
    Transactions DTCTransaction This event tracks Microsoft Distributed Transaction Coordinator (MS DTC) transactions between two or more databases or servers.
    Transactions SQLTransaction This event tracks SQL BEGIN, SAVE, COMMIT, and ROLLBACK TRANSACTION statements.
    TSQL RPC:Completed This event indicates that execution of a remote procedure call (RPC) has completed.
    TSQL SQL:BatchCompleted This event indicates that execution of a Transact-SQL batch has completed.
    Stored Procedures SP: StmtCompleted This event indicates that a statement within a stored procedure has completed execution.
  8. On the Data Columns tab, ensure that the following columns are included: Start Time, End Time, Connection ID, SPID, Event Class, Text, Integer Data, Binary Data, Application Name, NT User Name, and SQL User Name. If you included the additional events from the second table above, also include the following data columns as well: Duration, CPU, Reads, and Writes.
  9. On the Filters tab, exclude SQL Server internal exceptions. In the Trace Event Criteria box, select Severity and type 24 in the Maximum box. Then click OK.

    For more information on monitoring errors sent to clients from SQL Server, see the following article in the Microsoft Knowledge Base:

    199037 INF: Trapping Error Messages Sent to Clients from a SQL Server

For information about using the Profiler, please see SQL Server Books Online.

Identifying and Resolving Common Blocking Scenarios

By examining the above information, you can determine the cause of most blocking problems. The rest of this article is a discussion of how to use this information to identify and resolve some common blocking scenarios. This discussion assumes you have used the blocking scripts in article Q251004 (referenced earlier) to capture information on the blocking SPIDs and have made a Profiler trace with the events described above.

Viewing the Blocking Script Output

  • Examine the sysprocesses output to determine the heads of the blocking chains.
    If you did not specify fast mode for the blocking scripts, there will be a section titled "SPIDs at the head of blocking chains" that lists the SPIDs blocking others in the script output:

       SPIDs at the head of blocking chains
       spid   
       ------ 
       9
       10
                            

    If you specified the fast option, you can still determine the blocking heads by looking at the sysprocesses output. The following is an abbreviated sysprocesses output:

       spid   status                         blocked
        9     sleeping                       0
       10     sleeping                       0
       11     sleeping                       13
       12     sleeping                       10
       13     sleeping                       9
       14     sleeping                       12
                            

    In this case, you can see that SPIDs 9 and 10 both have 0 in the blocked column, meaning that they are not being blocked, yet they both appear in the blocked column for other SPIDs. This indicates that SPIDs 9 and 10 are each at the head of separate blocking chains.

  • Examine the sysprocesses output for information on the SPIDs at the head of the blocking chain.
    It is important to evaluate the following sysprocesses fields:

    • Status
      This column gives a quick look at the status of a particular SPID. Typically, a sleeping status indicates that the SPID has completed execution and is waiting for the application to submit another query or batch. A runnable status indicates the SPID is currently processing a query. The following table gives brief explanations for the various status values.

      Status Meaning
      Background SPID is performing a background task.
      Sleeping SPID is not currently executing. This usually indicates that the SPID is awaiting a command from the application.
      Runnable SPID is currently executing.
      Dormant Same as Sleeping, except Dormant also indicates that the SPID has been reset after completing an RPC event. The reset cleans up resources used during the RPC event. This is a normal state and the SPID is available and waiting to execute further commands.
      Rollback The SPID is in rollback of a transaction.
      Defwakeup Indicates that a SPID is waiting on a resource that is in the process of being freed. The waitresource field should indicate the resource in question.
      Spinloop Process is waiting while attempting to acquire a spinlock used for concurrency control on SMP systems.
    • Open_tran
      This field tells you the transaction nesting level of the SPID. If this value is greater than 0, the SPID is within an open transaction and may be holding locks acquired by any statement within the transaction.
    • Lastwaittype, waittype, and waittime
      The lastwaittype field tells you the last or current waittype of the SPID. This field is new in SQL Server 7.0 and is a string representation of the waittype field (which is a reserved internal binary column). If the waittype is 0x0000, then the SPID is not currently waiting on anything and the lastwaittype value indicates the last waittype the SPID had. If waittype is non-zero, the lastwaittype value indicates the current waittype of the SPID.

      For a brief description of the different lastwaittype and waittype values, please see the following article in the Microsoft Knowledge base:

      244455 INF: Definition of Sysprocesses Waittype and Lastwaittype Fields

      The waittime value can be useful to determine if the SPID is making progress. When a query against the sysprocesses table returns a value in the waittime column that is less than the waittime value from a previous query of sysprocesses, this indicates that the prior lock was acquired and released and is now waiting on a new lock (assuming non-zero waittime). This can be verified by comparing the waitresource between sysprocesses output.
    • Waitresource
      This field indicates the resource that a SPID is waiting on. The following table lists common waitresource formats and their meaning:

      Resource Format Example
      Table DatabaseID:ObjectID TAB: 5:261575970

      In this case, database ID 5 is the pubs sample database and object ID 261575970 is the titles table.

      Page DatabaseID:FileID:PageID PAG: 5:1:104

      In this case, database ID 5 is pubs, file ID 1 is the primary data file, and page 104 is a page belonging to the titles table.

      Key DatabaseID:ObjectID:IndexID (Hash value for index key) KEY: 5:261575970:1 (5d0164fb1eac)

      In this case, database ID 5 is pubs, object ID 261575970 is the titles table, index ID 1 is the clustered index, and the hash value indicates the index key value for the particular row.

    • Other columns
      The remaining sysprocesses columns can provide insight into the root of a problem as well. Their usefulness varies depending on the circumstances of the problem. For example, you can determine if the problem happens only from certain clients (hostname), on certain network libraries (net_library), when the last batch submitted by a SPID was (last_batch), and so on. For a brief description of all of the sysprocesses columns, please see the "sysprocesses (T-SQL)" topic in SQL Server 7.0 Books Online.

      NOTE: The SUID column is not included in the blocking script output because it is a derived column that was only included for backwards compatibility. It is not used internally by SQL Server, and you can cause a performance degradation by querying it (because it is derived), so it was not included.
  • Examine the DBCC INPUTBUFFER output.

    For any SPID at the head of a blocking chain or with a non-zero waittype, the blocking script will execute DBCC INPUTBUFFER to determine the current query for that SPID:

       DBCC INPUTBUFFER FOR SPID 9
       EventType      Parameters EventInfo                                    
       -------------- ---------- -------------------------------------------- 
       Language Event 0          update titles set title = title
                            

    In many cases, this is the query that is causing the locks that are blocking other users to be held. However, if the SPID is within a transaction, the locks may have been acquired by a previously executed query, not the current one. Therefore, you should also view the Profiler output for the SPID, not just the inputbuffer.

    NOTE: Because the blocking script consists of multiple steps, it is possible that a SPID may appear in the first section as the head of a blocking chain, but by the time the DBCC INPUTBUFFER query is executed, it is no longer blocking and the INPUTBUFFER is not captured. This indicates that the blocking is resolving itself for that SPID and it may or may not be a problem. At this point, you can either use the fast version of the blocking script to try to ensure you capture the inputbuffer before it clears (although there is still no guarantee), or view the Profiler data from that time frame to determine what queries the SPID was executing.


Row DatabaseID:FileID:PageID:Slot(row) RID: 5:1:104:3

In this case, database ID 5 is pubs, file ID 1 is the primary data file, page 104 is a page belonging to the titles table, and slot 3 indicates the row's position on the page.

Compile DatabaseID:ObjectID TAB: 5:834102012 COMPILE


In this case, database ID 5 is pubs, but the object ID 834102012 is a stored procedure. This indicates that the SPID is waiting to compile a plan for the stored procedure.

Viewing the Profiler Data

Viewing Profiler data efficiently is extremely valuable in resolving blocking issues. The most important thing to realize is that you do not have to look at everything you captured; be selective. Profiler provides capabilities to help you effectively view the captured data. In the Properties dialog box (on the File menu, click Properties), Profiler allows you to limit the data displayed by removing data columns or events, grouping (sorting) by data columns and applying filters. You can search the whole trace or only a specific column for specific values (on the Edit menu, click Find). You can also save the Profiler data to a SQL Server table (on the File menu, point to Save As and then click Table) and run SQL queries against it.

Be careful that you perform filtering only on a previously saved trace file. If you perform these steps on an active trace, you risk losing data that has been captured since the trace was started. Save an active trace to a file or table first (on the File menu, click Save As) and then reopen it (on the File menu, click Open) before proceeding. When working on a saved trace file, the filtering does not permanently remove the data being filtered out, it just does not display all the data. You can add and remove events and data columns as needed to help focus your searches.

What to look for:

  • What commands has the SPID at the head of a blocking chain executed within the current transaction?

Filter the trace data for a particular SPID that is at the head of a blocking chain (on the File menu, click Properties; then on the Filters tab specify the SPID value). You can then examine the commands it has executed prior to the time it was blocking other SPIDs. If you include the Transaction events, they can easily identify when a transaction was started. Otherwise, you can search the Text column for BEGIN, SAVE, COMMIT, or ROLLBACK TRANSACTION operations. Use the open_tran value from the sysprocesses table to ensure that you catch all of the transaction events. Knowing the commands executed and the transaction context will allow you to determine why a SPID is holding locks.

Remember, you can remove events and data columns. Instead of looking at both starting and completed events, choose one. If the blocking SPIDs are not stored procedures, remove the SP:Starting or SP:Completed events; the SQLBatch and RPC events will show the procedure call. Only view the SP events when you need to see that level of detail.

  • What is the duration of the queries for SPIDs at the head of blocking chains?

If you include the completed events above, the Duration column will show the query execution time. This can help you identify long-running queries that are causing blocking. To determine why the query is performing slowly, view the CPU, Read, and Writes columns, as well as the Execution Plan event.

Categorizing Common Blocking Scenarios

The table below maps common symptoms to their probable causes. The number indicated in the Scenario column corresponds to the number in the "Common Blocking Scenarios and Resolutions" section of this article below. The Waittype, Open_Tran, and Status columns refer to sysprocesses information. The Resolves? column indicates whether or not the blocking will resolve on its own.

Scenario Waittype Open_Tran Status Resolves? Other Symptoms
1 Non-zero >= 0 runnable Yes, when query finishes. Physical_IO, CPU and/or Memusage columns will increase over time. Duration for the query will be high when completed.
2 0x0000 >0 sleeping No, but SPID can be killed. An attention signal may be seen in the Profiler trace for this SPID, indicating a query timeout or cancel has occurred.
3 0x0000 >= 0 runnable No. Will not resolve until client fetches all rows or closes connection. SPID can be killed, but it may take up to 30 seconds. If open_tran = 0, and the SPID holds locks while the transaction isolation level is default (READ COMMMITTED), this is a likely cause.
4 Varies >= 0 runnable No. Will not resolve until client cancels queries or closes connections. SPIDs can be killed, but may take up to 30 seconds. The hostname column in sysprocesses for the SPID at the head of a blocking chain will be the same as one of the SPID it is blocking.
5 0x0000 >0 rollback Yes. An attention signal may be seen in the Profiler trace for this SPID, indicating a query timeout or cancel has occurred, or simply a rollback statement has been issued.
6 0x0000 >0 sleeping Eventually. When Windows NT determines the session is no longer active, the SQL Server connection will be broken. The last_batch value in sysprocesses is much earlier than the current time.

Common Blocking Scenarios and Resolutions

The scenarios listed below will have the characteristics listed in the table above. This section provides additional details when applicable, as well as paths to resolution.

  1. Blocking Caused by a Normally Running Query with a Long Execution Time

    Resolution:
    The solution to this type of blocking problem is to look for ways to optimize the query. Actually, this class of blocking problem may just be a performance problem, and require you to pursue it as such. For information on troubleshooting a specific slow-running query, see the following article in the Microsoft Knowledge Base:

    243589 INF: Troubleshooting Slow-Running Queries on SQL Server 7.0

    For overall application performance troubleshooting, see the following article in the Microsoft Knowledge Base:

    224587 HOW TO: Troubleshoot Application Performance with SQL Server

    If you have a long-running query that is blocking other users and cannot be optimized, consider moving it from an OLTP environment to a decision support system.
  2. Blocking Caused by a Sleeping SPID That Has Lost Track of the Transaction Nesting Level

    This type of blocking can often be identified by a SPID that is sleeping or awaiting a command, yet whose transaction nesting level (@@TRANCOUNT, open_tran from sysprocesses) is greater than zero. This can occur if the application experiences a query timeout, or issues a cancel without also issuing the required number of ROLLBACK and/or COMMIT statements. When a SPID receives a query timeout or cancel, it will the terminate the current query and batch, but does not automatically roll back or commit the transaction. The application is responsible for this, as SQL Server cannot assume that an entire transaction must be rolled back simply due to a single query being canceled. The query timeout or cancel will appear as an ATTENTION signal event for the SPID in the Profiler trace.

    To demonstrate this, issue the following simple query from Query Analyzer:

    BEGIN TRAN 
    SELECT * FROM SYSOBJECTS S1, SYSOBJECTS S2
    
    -- Issue this after canceling query
    SELECT @@TRANCOUNT
    ROLLBACK TRAN
                            

    While the query is executing, click the red Cancel button. After the query is canceled, SELECT @@TRANCOUNT indicates that the transaction nesting level is one. Had this been a DELETE or an UPDATE query, or had HOLDLOCK been used on the SELECT, all the locks acquired would still be held. Even with the query above, if another query had acquired and held locks earlier in the transaction, they would still be held when the above SELECT was canceled.

    Resolutions:

    • Applications must properly manage transaction nesting levels, or they may cause a blocking problem following the cancellation of the query in this manner. This can be accomplished in one of several ways:
      1. In the error handler of the client application, submit an IF @@TRANCOUNT > 0 ROLLBACK TRAN following any error, even if the client application does not believe a transaction is open. This is required, because a stored procedure called during the batch could have started a transaction without the client application's knowledge. Note that certain conditions, such as canceling the query, prevent the procedure from executing past the current statement, so even if the procedure has logic to check IF @@ERROR <> 0 and abort the transaction, this rollback code will not be executed in such cases.
      2. Use SET XACT_ABORT ON for the connection, or in any stored procedures which begin transactions and are not cleaning up following an error. In the event of a run-time error, this setting will abort any open transactions and return control to the client. Note that T-SQL statements following the statement which caused the error will not be executed.
      3. If connection pooling is being used in an application that opens the connection and runs a small number of queries before releasing the connection back to the pool, such as a Web-based application, temporarily disabling connection pooling may help alleviate the problem until the client application is modified to handle the errors appropriately. By disabling connection pooling, releasing the connection will cause a physical logout of the SQL Server connection, resulting in the server rolling back any open transactions.
      4. If connection pooling is enabled and the destination server is SQL Server 2000, upgrading the client computer to MDAC 2.6 or later may be beneficial. This version of the MDAC components adds code to the ODBC driver and OLE DB provider so that the connection would be "reset" before it is reused. This call to sp_reset_connection aborts any server-initiated transactions (DTC transactions initiated by the client app are not affected), resets the default database, SET options, and so forth. Note that the connection is not reset until it is reused from the connection pool, so it is possible that a user could open a transaction and then release the connection to the connection pool, but it might not be reused for several seconds, during which time the transaction would remain open. If the connection is not reused, the transaction will be aborted when the connection times out and is removed from the connection pool. Thus, it is optimal for the client application to abort transactions in their error handler or use SET XACT_ABORT ON to avoid this potential delay.
    • Actually, this class of blocking problem may also be a performance problem, and require you to pursue it as such. If the query execution time can be diminished, the query timeout or cancel would not occur. It is important that the application be able to handle the timeout or cancel scenarios should they arise, but you may also benefit from examining the performance of the query.

      For information on troubleshooting a specific slow-running query, see the following article in the Microsoft Knowledge Base:

      243589 INF: Troubleshooting Slow-Running Queries on SQL Server 7.0

      For overall application performance troubleshooting, see the following article in the Microsoft Knowledge Base:

      224587 HOW TO: Troubleshoot Application Performance with SQL Server

      If you have a long-running query that is blocking other users and cannot be optimized, consider moving it from an OLTP environment to a decision support system.
  3. Blocking Caused by a SPID Whose Corresponding Client Application Did Not Fetch All Result Rows to Completion

    After sending a query to the server, all applications must immediately fetch all result rows to completion. If an application does not fetch all result rows, locks can be left on the tables, blocking other users. If you are using an application that transparently submits SQL statements to the server, the application must fetch all result rows. If it does not (and if it cannot be configured to do so), you may be unable to resolve the blocking problem. To avoid the problem, you can restrict poorly-behaved applications to a reporting or a decision-support database.

    Resolution:

    The application must be re-written to fetch all rows of the result to completion.
  4. Blocking Caused by a Distributed Client/Server Deadlock

    Unlike a conventional deadlock, a distributed deadlock is not detectable using the RDBMS lock manager. This is due to the fact that only one of the resources involved in the deadlock is a SQL Server lock. The other side of the deadlock is at the client application level, over which SQL Server has no control. The following are two examples of how this can happen, and possible ways the application can avoid it.

    1. Client/Server Distributed Deadlock with a Single Client Thread
      If the client has multiple open connections, and a single thread of execution, the following distributed deadlock may occur. For brevity, the term "dbproc" used here refers to the client connection structure.

       SPID1------blocked on lock------->SPID2
        /\                         (waiting to write results         
        |                           back to client)
        |                                 |
        |                                 |                      Server side
        | ================================|==================================
        |     <-- single thread -->       |                      Client side
        |                                 \/ 
       dbproc1   <-------------------   dbproc2
       (waiting to fetch             (effectively blocked on dbproc1, awaiting
        next row)                     single thread of execution to run)
                                      

      In the case shown above, a single client application thread has two open connections. It asynchronously submits a SQL operation on dbproc1. This means it does not wait on the call to return before proceeding. The application then submits another SQL operation on dbproc2, and awaits the results to start processing the returned data. When data starts coming back (whichever dbproc first responds -- assume this is dbproc1), it processes to completion all the data returned on that dbproc. It fetches results from dbproc1 until SPID1 gets blocked on a lock held by SPID2 (because the two queries are running asynchronously on the server). At this point, dbproc1 will wait indefinitely for more data. SPID2 is not blocked on a lock, but tries to send data to its client, dbproc2. However, dbproc2 is effectively blocked on dbproc1 at the application layer as the single thread of execution for the application is in use by dbproc1. This results in a deadlock that SQL Server cannot detect or resolve because only one of the resources involved is a SQL Server resource.

    2. Client/Server Distributed Deadlock with a Thread per Connection

      Even if a separate thread exists for each connection on the client, a variation of this distributed deadlock may still occur as shown by the following.

      SPID1------blocked on lock-------->SPID2
        /\                         (waiting on net write)        Server side
        |                                 |
        |                                 |
        | INSERT                          |SELECT
        | ================================|==================================
        |     <-- thread per dbproc -->   |                      Client side
        |                                 \/ 
       dbproc1   <-----data row-------   dbproc2
       (waiting on                     (blocked on dbproc1, waiting for it
        insert)                         to read the row from its buffer)
                                      

      This case is similar to Example A, except dbproc2 and SPID2 are running a SELECT statement with the intention of performing row-at-a-time processing and handing each row through a buffer to dbproc1 for an INSERT, UPDATE, or DELETE statement on the same table. Eventually, SPID1 (performing the INSERT, UPDATE, or DELETE) becomes blocked on a lock held by SPID2 (performing the SELECT). SPID2 writes a result row to the client dbproc2. Dbproc2 then tries to pass the row in a buffer to dbproc1, but finds dbproc1 is busy (it is blocked waiting on SPID1 to finish the current INSERT, which is blocked on SPID2). At this point, dbproc2 is blocked at the application layer by dbproc1 whose SPID (SPID1) is blocked at the database level by SPID2. Again, this results in a deadlock that SQL Server cannot detect or resolve because only one of the resources involved is a SQL Server resource.

    Both examples A and B are fundamental issues that application developers must be aware of. They must code applications to handle these cases appropriately.

    Resolutions:

    Two reliable solutions are to use either a query timeout or bound connections.

    • Query Timeout
      When a query timeout has been provided, if the distributed deadlock occurs, it will be broken when then timeout happens. See the DB-Library or ODBC documentation for more information on using a query timeout.
    • Bound Connections
      This feature allows a client having multiple connections to bind them into a single transaction space, so the connections do not block each other. For more information, see the "Using Bound Connections" topic in SQL Server 7.0 Books Online.
  5. Blocking Caused by a SPID That Is in a "Golden," or Rollback, State

    A data modification query that is KILLed, or canceled outside of a user-defined transaction, will be rolled back. This can also occur as a side effect of the client computer restarting and its network session disconnecting. Likewise, a query selected as the deadlock victim will be rolled back. A data modification query often cannot be rolled back any faster than the changes were initially applied. For example, if a DELETE, INSERT, or UPDATE statement had been running for an hour, it could take at least an hour to roll back. This is expected behavior, because the changes made must be completely rolled back, or transactional and physical integrity in the database would be compromised. Because this must happen, SQL Server marks the SPID in a "golden" or rollback state (which means it cannot be KILLed or selected as a deadlock victim). This can often be identified by observing the output of sp_who, which may indicate the ROLLBACK command. The Status column of sysprocesses will indicate a ROLLBACK status, which will also appear in sp_who output or the SQL Enterprise Manager Current Activity screens.
    Resolution:

    You must wait for the SPID to finish rolling back the changes that were made.

    If the server is shut down in the midst of this operation, the database will be in recovery mode upon restarting, and it will be inaccessible until all open transactions are processed. Startup recovery takes essentially the same amount of time per transaction as run-time recovery, and the database is inaccessible during this period. Thus, forcing the server down to fix a SPID in a rollback state will often be counterproductive.

    To avoid this situation, do not perform large batch INSERT, UPDATE, or DELETE operations during busy hours on OLTP systems. If possible, perform such operations during periods of low activity.
  6. Blocking Caused by an Orphaned Connection

    If the client application traps or the client workstation is restarted, the network session to the server may not be immediately canceled under some conditions. From the server's perspective, the client still appears to be present, and any locks acquired may still be retained. For more information, see the "Orphaned Connections" topic in SQL Server 7.0 Books Online.

    Resolution:

    If the client application has disconnected without appropriately cleaning up its resources, you can terminate the SPID by using the KILL command. The KILL command takes the SPID value as input. For example, to kill SPID 9, simply issue the following command:

    KILL 9
                            


    NOTE: The KILL command may take up to 30 seconds to complete, due to the interval between checks for the KILL command.

Application Involvement in Blocking Problems

There may be a tendency to focus on server-side tuning and platform issues when facing a blocking problem. However, this does not usually lead to a resolution, and can absorb time and energy better directed at examining the client application and the queries it submits. No matter what level of visibility the application exposes regarding the database calls being made, a blocking problem nonetheless frequently requires both the inspection of the exact SQL statements submitted by the application and the application's exact behavior regarding query cancellation, connection management, fetching all result rows, and so on. If the development tool does not allow explicit control over connection management, query cancellation, query timeout, result fetching, and so on, blocking problems may not be resolvable. This potential should be closely examined before selecting an application development tool for SQL Server, especially for business-critical OLTP environments.

It is vital that great care be exercised during the design and construction phase of the database and application. In particular, the resource consumption, isolation level, and transaction path length should be evaluated for each query. Each query and transaction should be as lightweight as possible. Good connection management discipline must be exercised. If this is not done, it is possible that the application may appear to have acceptable performance at low numbers of users, but the performance may degrade significantly as the number of users scales upward.

With proper application and query design, Microsoft SQL Server is capable of supporting many thousands of simultaneous users on a single server, with little blocking. Please see the "Application Design" and "Understanding and Avoiding Blocking" topics in SQL Server 7.0 Books Online for more information. The successful sites that reach these numbers of users typically use the techniques described in these topics.

REFERENCES

For more information, refer to the following book:

Microsoft Corporation Microsoft SQL Server 7.0 System Administration Training Kit Microsoft Press, 2001


Microsoft Corporation MCSE Training Kit: Microsoft SQL Server 2000 System Administration Microsoft Press, 2001


For more information, refer to the following Microsoft Training & Certification course:

Keywords: kbinfo KB224453