Microsoft KB Archive/230785

= SQL Server 7.0, SQL Server 2000, and SQL Server 2005 logging and data storage algorithms extend data reliability =

Article ID: 230785

Article Last Modified on 2/9/2006

-

APPLIES TO


 * Microsoft SQL Server 2005 Standard Edition
 * Microsoft SQL Server 2005 Enterprise Edition
 * Microsoft SQL Server 2005 Developer Edition
 * Microsoft SQL Server 2005 Workgroup Edition
 * Microsoft SQL Server 2005 Express Edition
 * Microsoft SQL Server 2000 Standard Edition
 * Microsoft SQL Server 7.0 Standard Edition

-



This article was previously published under Q230785



SUMMARY
SQL Server 7.0, SQL Server 2000, and SQL Server 2005 restructure and redesign the logging and data algorithms from earlier Microsoft SQL Server releases to improve data reliability and integrity.

To learn more about the underlying concepts of the SQL Server 7.0 and SQL Server 2000 engines, see "ARIES (Algorithm for Recovery and Isolation Exploiting Semantics): A Transaction Recovery method Supporting Fine-Granularity Locking and Partial Rollbacks Using Write-Ahead Logging", of ACM Transactions on Database Systems. This document was written by Chunder Mohan.

This document addresses the SQL Server 7.0, SQL Server 2000, and SQL Server 2005 techniques to extend data reliability and integrity as related to failures.

It is recommended that you read the following articles in the Microsoft Knowledge Base for further clarification on caching and alternate failure mode discussions:

86903 Description of caching disk controllers in SQL Server

46091 Using hard disk controller caching with SQL Server

234656 Using disk drive caching with SQL Server



MORE INFORMATION
Before beginning the in-depth discussion, some of the terms as used throughout this article are defined in the following section.

SQL Server 2005, SQL Server 2000, SQL Server 7.0, earlier versions of SQL Server, and many mainstream database products on the market today use the Write-Ahead Logging (WAL) protocol.

Write-Ahead Logging (WAL) protocol
The term protocol is an excellent way to describe WAL. It is a specific and defined set of implementation steps necessary to ensure data is stored and exchanged properly and can be recovered to a known state in the event of a failure. Just as a network contains a defined protocol to exchange data in a consistent and protected manner, so too does the WAL describe the protocol to protect data.

The ARIES document defines the WAL as:

The WAL protocol asserts that the log records representing changes to some data must already be in stable storage before the changed data is allowed to replace the previous version of the data in nonvolatile storage. That is, the system is not allowed to write an updated page to the nonvolatile storage version of the page until at least the undo portions of the log records which describe the updates to the page have been written to stable storage.

For more information about Write-Ahead Logging, see the SQL Server Books Online documentation.

SQL Server and the WAL
SQL Server 2005, SQL Server 2000, SQL Server 7.0, and earlier SQL Server releases use the WAL protocol. To ensure proper committal of a transaction, all log records associated with the transaction must be secured in stable storage.

To clarify this, consider the following specific example (for this example assume that there is no index and the page affected is page 150). BEGIN TRANSACTION INSERT INTO tblTest VALUES (1) COMMIT TRANSACTION Now break the activity down into simplistic logging steps:



Do not be confused with locking and logging. Although important, locking and logging are separate issues when dealing with the WAL. In the example above, SQL Server 7.0, SQL Server 2000, and SQL Server 2005 generally hold the latch on page 150 for the time necessary to perform the physical insert changes on the page, not the entire time of the transaction. The appropriate lock type is established to protect the row, range, page, or table as necessary. Refer to the SQL Server Books Online locking sections for more details on lock types.

Looking at the example in more detail, you might ask what happens when the LazyWriter or CheckPoint processes execute. SQL Server 7.0, SQL Server 2000, and SQL Server 2005 issue all appropriate flushes to stable storage for transactional log records associated with the dirty and pinned page. This ensures the WAL protocol a data page can never be written to stable storage until the associated transactional log records have been flushed.

SQL Server and stable storage
SQL Server 7.0, SQL Server 2000, and SQL Server 2005 enhance log and data page operations by including the knowledge of disk sector sizes (commonly 512 bytes).

To maintain the ACID properties of a transaction, the SQL Server must account for failure points. During a failure many disk drive specifications only guarantee a limited amount of sector write operations. Most specifications guarantee completion of a single sector write when a failure occurs.

SQL Server 7.0, SQL Server 2000, and SQL Server 2005 use 8-KB data pages and the log (if flushed) on multiples of the sector size. (Most disk drives use 512 bytes as the default sector size.) In the case of a failure, SQL Server can account for write operations larger than a sector by employing log parity and torn write techniques.



Torn page detection
The following section comes from the SQL Server 7.0 Books Online describing torn page detection:

This option allows SQL Server to detect incomplete I/O operations caused by power failures or other system outages. When true, it causes a bit to be flipped for each 512-byte sector in an 8-kilobyte (KB) database page whenever the page is written to disk. If a bit is in the wrong state when the page is later read by SQL Server, then the page was written incorrectly; a torn page is detected. Torn pages are usually detected during recovery because any page that was written incorrectly is likely to be read by recovery.

Although SQL Server database pages are 8 KB, disks perform I/O operations using a 512-byte sector. Therefore, 16 sectors are written per database page. A torn page can occur if the system fails (for example, due to power failure) between the time the operating system writes the first 512-byte sector to disk and the completion of the 8 KB I/O operation. If the first sector of a database page is successfully written before the failure, the database page on disk will appear as updated, although it may not have succeeded.

Using battery-backed disk controller caches can ensure that data is successfully written to disk or not written at all. In this case, do not set torn page detection to true, for it is not needed.

Note Torn page detection is not enabled by default in SQL Server 7.0. See sp_dboption for how to enable the detection on your system.

Log parity
Log parity checking is very similar to torn page detection. Each 512-byte sector contains parity bits. These parity bits are always written with the log record and evaluated when the log record is retrieved. By forcing log writes on a 512-byte boundary, SQL Server 7.0, SQL Server 2000, and SQL Server 2005 can ensure that committal operations are completely written to the physical disk sectors.

The changes provide increased data consistency, even over prior versions of SQL Server.

Versions of SQL Server earlier than 7.0
Versions of SQL Server earlier than 7.0 did not provide log parity or torn bit detection facilities. In fact, those versions can write the same log page multiple times until the log records fill the 2-KB log page. This can expose transactions that have successfully committed. If the log page is being rewritten during a failure, a sector with the committed transaction may not get rewritten properly.

Performance impacts
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server.

FILE_FLAG_WRITE_THROUGH

Instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.

The FILE_FLAG_WRITE_THROUGH option ensures that when a write operation returns successful completion the data is correctly stored in stable storage. This aligns with the WAL protocol ensuring the data.

Many disk drives (SCSI and IDE) contain onboard caches of 512 KB, 1 MB, or larger. However, the drive caches usually rely on a capacitor and not a battery-backed solution. These caching mechanisms cannot guarantee writes across a power cycle or similar failure point. They only guarantee the completion of the sector write operations. This is specifically why the torn write and log parity detection were built into SQL Server 7.0, SQL Server 2000, and SQL Server 2005. As the drives continue to grow in size, the caches become larger, and they can expose larger amounts of data during a failure.

Many hardware vendors provide battery-backed disk controller solutions. These controller caches can maintain the data in the cache for several days and even allow the caching hardware to be placed in a second computer. When power is properly restored, the unwritten data is completely flushed before the further data access is allowed. Many of them allow a percentage of read versus write cache to be established for optimal performance. Some contain large memory storage areas. In fact, for a very specific segment of the market, some hardware vendors provide high-end battery-backed disk caching controller systems with 6 GB of cache. These can significantly improve database performance.

Advanced caching implementations will handle the FILE_FLAG_WRITE_THROUGH request by not disabling the controller cache because they can provide true rewrite capabilities in the event of a system reset, power failure, or other failure point.

I/O transfers without the use of a cache can be significantly longer due to the mechanical time needed to move the drive heads, spin rates, and other limiting factors.

Sector ordering
A common technique used to increase I/O performance is sector ordering. To avoid mechanical head movement the read/write requests are sorted, allowing a more consistent motion of the head to retrieve or store data.

The cache can hold multiple log and data write requests at the same time. The WAL protocol and the SQL Server implementation of the WAL protocol require flushing of the log writes to stable storage before the page write can be issued. However, use of the cache might return success from a log write request without the data being written to the actual drive (that is, written to stable storage). This may lead to SQL Server issuing the data page write request.

With the write cache involvement, the data is still considered to be in volatile storage. However, from the Win32 API WriteFile call, exactly how SQL Server sees the activity, a successful return code was obtained. SQL Server or any process using the WriteFile API call can only deduce the data has correctly obtained stable storage.

For discussion purposes, assume that all sectors of the data page are sorted to write before the sectors of the matching log records. This immediately violates the WAL protocol. The cache is writing a data page before the log records. Unless the cache is fully battery-backed, a failure can cause catastrophic results.

When evaluating the optimal performance factors for a database server, there are many factors to consider. The foremost of these considerations is "Does my system allow valid FILE_FLAG_WRITE_THROUGH capabilities?"

Note Any cache you are using must fully support a battery-backed solution. All other caching mechanisms are suspect to data corruption and data loss. SQL Server makes every effort to ensure the WAL by enabling FILE_FLAG_WRITE_THROUGH.

Testing has shown that many disk drive configurations may contain write caching without proper battery backup. SCSI, IDE, and EIDE drives take full advantage of write caches.

In many configurations, the only way to properly disable the write caching of an IDE or EIDE drive is with a specific manufacturer utility or by using jumpers located on the drive itself. To ensure that the write cache is disabled for the drive itself, contact the drive manufacturer.

SCSI drives also have write caches but these caches can commonly be disabled by the operating system. If there is any question, contact the drive manufacturer for appropriate utilities.



Write Cache Stacking
Write Cache Stacking is similar to Sector Ordering. The following definition was taken directly from a leading IDE drive manufacturer Web site:

Normally, this mode is active. Write cache mode accepts the host write data into the buffer until the buffer is full or the host transfer is complete.

A disk write task begins to store the host data to disk. Host write commands continue to be accepted and data transferred to the buffer until either the write command stack is full or the data buffer is full. The drive may reorder write commands to optimize drive throughput.



Automatic Write Reallocation (AWR)
Other common technique used to protect data is to detect bad sectors during data manipulation. The following explanation comes from the same leading IDE drive manufacturer Web site:

This feature is part of the write cache and reduces the risk of data loss during deferred write operations. If a disk error occurs during the disk write process, the disk task stops and the suspect sector is reallocated to a pool of alternate sectors located at the end of the drive. Following the reallocation, the disk write task continues until it is complete.

This can be a very powerful feature if battery backup is provided for the cache, allowing proper modification upon restart. It is preferable to detect the disk errors, but the data security of the WAL protocol would again require this to be done real time and not in a deferred manner. Within the WAL parameters, the AWR technique cannot account for a situation where a log write fails due to a sector error but the drive is full. The database engine must immediately know about the failure so the transaction can be properly aborted, the administrator can be alerted, and proper steps taken to secure the data and correct the media failure situation.

Data safety
There are several precautions a database administrator should take to ensure the safety of the data.
 * It is always a good idea to ensure that your backup strategy is sufficient to recover from a catastrophic failure. Offsite storage and other precautions are appropriate.
 * Test the database restore operation in a secondary or test database on a frequent basis.
 * Ensure that any caching devices can handle all failure situations (power outage, bad sectors, bad drives, system outage, lockups, power spike, and so forth).
 * Ensure that your caching device:
 * Has integrated battery backup.
 * Can reissue writes on power up.
 * Can be fully disabled if necessary.
 * Handles bad sector re-mapping realtime.
 * Enable torn page detection; it has little performance impact.
 * Configure RAID drives allowing for a hot swap of a bad disk drive, if possible.
 * Use newer caching controllers that allow addition of more disk space without restarting the OS. This can be an ideal solution.

Testing drives
To fully secure your data, you should ensure that all data caching is properly handled. In many situations, this means you must disable the write caching of the disk drive.

Note Ensure that an alternate caching mechanism can properly handle multiple types of failure.

Microsoft has performed testing on several SCSI and IDE drives using the SQLIOStress utility. This utility simulates heavy asynchronous read/write activity to a simulated data device and log device. Test performance statistics show the average write operations per second between 50 and 70 for a drive with disabled write caching and an RPM range between 5,200 and 7,200.

For more information and details about the SQLIOStress utility, see the following article in the Microsoft Knowledge Base:

231619 How to use the SQLIOStress utility to stress a disk subsystem such as SQL Server

Many PC manufacturers (for example, Compaq, Dell, Gateway, HP, and so on) order the drives with the write cache disabled. However, testing shows that this may not always be the case so always test completely.



Data devices
In all but non-logged situations, SQL Server will require only the log records to be flushed. When doing non-logged operations, the data pages must also be flushed to stable storage; there are no individual log records to regenerate the actions in the case of a failure.

The data pages can remain in cache until the LazyWriter or CheckPoint process flushes them to stable storage. Using the WAL protocol to ensure that the log records are properly stored ensures that recovery can recover a data page to a known state.

This does not mean that it is advisable to place data files on a cached drive. When the SQL Server flushes the data page(s) to stable storage, the log records can be truncated from the transaction log. If the data pages are stored on volatile cache, it is possible to truncate log records that would be used to recover a page in the event of a failure. Ensure that both your data and log devices accommodate stable storage properly.



Increasing performance
The initial question that arises is: "I have an IDE drive that was caching but when I disabled it, my performance became significantly less than expected -- why?"

Many of the IDE drives tested by Microsoft run at an RPM rate of 5,200, and the SCSI drives at an RPM of 7,200. When you disable the write caching of the IDE drive the mechanical performance can become a factor.

There is a very clear area to address the performance difference: "Address the transaction rate."

There are many online transaction processing (OLTP) systems that require a high transaction rate. For these systems, consider using a caching controller that can properly support a write cache and provide the performance boost while ensuring data integrity.

To significantly encounter performance changes with SQL Server on a caching drive, the transaction rate was increased using small transactions.

Testing shows that high write activity of buffers smaller than 512 KB or above 2 MB can cause slow performance.



Consider the following example: CREATE TABLE tblTest ( iID int IDENTITY(1,1), strData char(10)) GO

SET NOCOUNT ON GO

INSERT INTO tblTest VALUES ('Test') WHILE @@IDENTITY < 10000 INSERT INTO tblTest VALUES ('Test') The following are sample test results for SQL Server:

SCSI(7200 RPM) 84 seconds

SCSI(7200 RPM) 15 seconds (Caching controller)

IDE(5200 RPM) 14 seconds (Drive cache enabled)

IDE(5200 RPM) 160 seconds

Wrapping the entire series of INSERT operations in a single transaction runs in approximately 4 seconds in all configurations.

The reason is the number of log flushes required. Without the transaction, each INSERT is a transaction in and of itself, and each time the log records for the transaction must be flushed. Each flush is 512 bytes in size, which requires significant mechanical drive intervention.

When a single transaction is used, the log records for the transaction can be bundled and a single, larger write can be used to flush the gathered log records. The mechanical intervention is significantly reduced.

Warning It is not recommended that you increase your transaction scope. Long-running transactions can lead to excessive and unwanted blocking as well as increased overhead. Use the SQL Server 7.0, SQL Server 2000, and SQL Server 2005 performance counters SQL Server:Databases to view the transaction log-based counters. Specifically, Log Bytes Flushed/sec can indicate many small transactions leading to high mechanical disk activity.

Look at the statements associated with the log flush and determine if the number of log flushes can be reduced. In the above example, a single transaction was implemented. However, in many scenarios this can lead to undesired locking behavior. Look at the design of the transaction. Perhaps something like the following to perform batches to reduce the frequent and small log flush activity: BEGIN TRAN GO

INSERT INTO tblTest VALUES ('Test') WHILE @@IDENTITY < 50 BEGIN INSERT INTO tblTest VALUES ('Test')

if(0 = cast(@@IDENTITY as int) % 10) BEGIN PRINT 'Commit tran batch' COMMIT TRAN BEGIN TRAN END END GO

COMMIT TRAN GO SQL Server 6.x may not see the same performance impact from frequent and small transaction log writes. SQL Server 6.x rewrites the same 2-KB log page as transactions are committed. This can reduce the size of the log significantly compared to the 512-byte sector boundary flushes in SQL Server 7.0, SQL Server 2000, and SQL Server 2005. Reducing the size of the log directly relates to the amount of mechanical drive activity. However, as explained above, the SQL Server 6.x algorithm may expose committed transactions.

Keywords: kbhowto kbinfo KB230785

-

[mailto:TECHNET@MICROSOFT.COM Send feedback to Microsoft]

© Microsoft Corporation. All rights reserved.