Microsoft KB Archive/102777

OS/2 LAN Manager Sleeping Server Stops Responding to Requests

PSS ID Number: Q102777 Article last modified on 09-20-1993

2.00 2.10 2.10a 2.20 OS/2

= SUMMARY =

LAN Manager servers may fail or “sleep” under extreme operating stress, refusing to initiate new sessions while allowing active sessions to continue for a while. Three problems cause stress-related symptoms:

- SCSI bid time-out failures
- Kernel failures
- NETAPI.DLL or SCSI bid time-out problems

This article discusses these three problems and several other topics in the following order:

- Causes of Server Stress
- Server-Stress-Related Problems
- Non-Server-Stress-Related Problems
- Problem 1: SCSI Bid Time-out Failures
- Problem 2: Kernel Failures
- Problem 3: NETAPI.DLL or SCSI Bid Time-out Problems
- Configuration (of servers prone to these problems)
- Tuning Recommendations
- Utilities and Diagnostics

Each problem section is organized in the standard style: Symptoms, Cause, Resolution/Workaround.

You can obtain Microsoft OS/2 LAN Manager updates from Microsoft Corporate Network Support. Some fixes will be included in an update in LAN Manager 2.2a.

= MORE INFORMATION =

CAUSES OF SERVER STRESS
=======================

Several system conditions can create server stress: a large number of workstations with active sessions accessing the file server, or problems with foreground processes such as ring 3 services (SQL Server, the Netlogon service, and backup operations).

SERVER-STRESS-RELATED PROBLEMS
==============================

Stress failures can disrupt some server processes while other OS/2 processes continue unaffected. Even when critical ring 0 or kernel-level failures affect the server, trap error messages are not always displayed on the server monitor or through an OS/2 kernel debugger, and some OS/2 processes can continue to operate for a while (for instance, the HPFS386 cache and operations serviced by it) although the Program Manager cannot be accessed and keyboard interrupts are shut down.

NON-SERVER-STRESS-RELATED PROBLEMS
==================================

Complete server hangs are often caused by something other than sleeping servers:

- System CPU hardware failures that hang the server and require a cold boot.
- A system device or ring 0 device driver that shuts down interrupts and does not reenable them, locking the server keyboard without an error message.
- Network adapters that halt all server processing.
- A tight I/O-bound application loop such as a process that continuously writes to the screen, starving other processes on the server. If a process remains at high priority in a PM screen group, it blocks other processes.

It is a good idea to avoid running local processes involving continuous screen or other device I/O on an OS/2 LAN Manager file server. Run only specifically designed server applications, and avoid running applications such as tape backup software during peak hours.

PROBLEM 1: SCSI BID TIME-OUT FAILURES
=====================================

Sleeping or comatose server; disk access blocked.

= SYMPTOMS =

There are many possible symptoms. When a SCSI bid time-out occurs for disk drive requests, drive access is blocked even though network I/O does not stop, and users can connect to the server by way of NET USE or NET ADMIN. If primary disk drive access is blocked and SWAPPER.DAT is located there, local OS/2 foreground processes may become inaccessible, the system may not respond to keyboard input, and Task Manager may fail. No operation can write to the blocked disk drive. Only operations serviced by HPFS386 cache continue to operate. A PSTAT listing will reveal all processes blocked other than those serviced by cache.

= CAUSE =

This is often caused by a lack of time-out handling code in the OS/2 bid, which in turn causes disk requests to time out due to server stress and slow responses from I/O devices. Requests (called RCBs) are passed through the file system to the SCSI bid through IOS$, which provides I/O access to the bid from file systems. The requests include time-out values. For SCSI bids, this value is passed as a SCSI request block (SRB) parameter for time-out, and time-outs cannot be handled properly unless the bid monitors the SRB time-out value for all I/O queues. If a request expires, the SCSI bus and I/O queue time-outs should be reset and the operations allowed to retry. If operations are not allowed to retry, threads involved with the I/O process hang.

Certain hardware items and quantities can cause or contribute to these problems:

- Slow SCSI devices
- Multiple devices
- Multiple large hard-disk drives
- Multiple tape drives
- CD-ROM drives

When server performance falls, access to CD-ROM drives fails, and attempting to access a logical drive letter associated with a CD-ROM from the server console hangs that OS/2 screen group. Also, if multiple CD-ROM devices containing large amounts of data are attached to the server, this failure can result in a sleeping server hang at startup:

    Net start server
    The server is starting……………………………….
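The time-out handling described above (monitor each queued request's time-out, and on expiry reset and retry rather than leave the thread blocked) can be sketched as a small simulation. This is only an illustrative model, not driver code: the `Request` class, the `check_timeouts` function, and the `max_retries` limit are all assumptions introduced here and do not come from any OS/2 bid.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Request:
    """Hypothetical stand-in for a queued disk request carrying a time-out."""
    name: str
    timeout: float      # seconds allowed before the request expires
    issued_at: float    # time the request was handed to the device
    retries: int = 0
    done: bool = False


def check_timeouts(queue: List[Request], now: float, max_retries: int = 3) -> List[str]:
    """Scan one I/O queue and retry expired requests instead of letting them hang.

    Returns a log of the actions taken, for illustration only.
    max_retries is an assumed limit; the article does not specify one.
    """
    log = []
    for req in queue:
        if req.done or now - req.issued_at < req.timeout:
            continue  # finished, or still inside its time-out window
        if req.retries >= max_retries:
            req.done = True  # give up so the caller is not blocked forever
            log.append(f"{req.name}: failed after {req.retries} retries")
            continue
        # Expired: model the bus reset, restart the timer, and retry.
        req.retries += 1
        req.issued_at = now
        log.append(f"{req.name}: timed out, reset and retry {req.retries}")
    return log
```

A monitor of this kind would be run against every I/O queue; a driver that never performs the expiry check leaves the expired request, and the thread waiting on it, blocked indefinitely.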

= RESOLUTION =

Update SCSI bids to address the lack of time-out code. Following are instructions for current SCSI bids divided into four classes.

A. SCSI controller bids for which updates are available
B. SCSI controllers without time-out handling code that can be replaced with monolithic drivers
C. SCSI controllers without time-out handling code or currently available monolithic drivers
D. SCSI bids with time-out code for OS/2 1.301 LAN Manager 2.2

The manufacturers are working on fixes for deficient controller bids.

A. SCSI controller bids for which updates are available:

- COMPAQ Cpq710 bid
- UltraStor Ultra24 bid (installed as BOOTBID.BID, not preinstalled)
- Adaptec 174x bid

Updates are available from Microsoft PSS, on CompuServe in the MS Networks forum (see BIDS.ZIP), and on the PSS Internet server in CS-ED (see GOWINNT.MICROSOFT.COM). Use FTP to get the files.

B. SCSI controllers without time-out handling code that can be replaced with monolithic drivers:

- IBM PS/2 ABIOS.BID (From OS/2 1.3 csd5050 or later)
- COMPAQ CPQARRAY BID (From OS/2 1.21)

To work around the problem, replace these with monolithic drivers. Monolithic drivers do not support LADDR-specific features such as FT or CdRomIfs, but proper time-out code is available for hard disk drives.

NOTE: Monolithic drivers have not been certified or exhaustively tested with OS/2 1.301 csd5015 LAN Manager 2.2.

C. SCSI controllers without time-out handling code or currently available monolithic drivers:

- Adaptec 154X and 164X
- Future Domain WD7000EX and FD16-700 bids
- Dell001 bid

To work around the problem, install an adapter and driver that support time-outs.

D. SCSI bids with time-out code for OS/2 1.301 LAN Manager 2.2:

- ESDI-506 bid used for IDE, ESDI, and ST-506 compatible controllers
- DPT201X bid
- NCRC700, NCRC710 and NCRC90

PROBLEM 2: KERNEL FAILURES
==========================

Server slows down (OS2KRNL).

= SYMPTOMS =

The server slows or halts. CPU utilization increases and workstations receive poor server response.

= CAUSE =

The kernel memory-management routines for CSD5050 and subsequent revisions have been updated from 286-specific code to 386-specific code. Memory compaction on a 386 does not take advantage of the 386 processor double word capability, resulting in poor performance, especially with memory-intensive operations.

= RESOLUTION =

Update the OS/2 kernel and redirector. Current versions are:

    OS2KRNL         OS2 1.301    CSD01.001
    NETWKSTA.SYS    LM22         CSD00.013

PROBLEM 3: NETAPI.DLL OR SCSI BID TIME-OUT PROBLEMS
===================================================

Server rejects new sessions (NETAPI.DLL, SCSI bid, or resource problems).

= SYMPTOMS =

If the server is experiencing SCSI bid time-outs and disk access is blocked, the server service degrades gradually. As long as an HPFS386 server is operating in RAM, it maintains existing sessions and continues to report that no listens are available; when the server begins workstation file-copy sessions, however, NetBIOS session-alive traffic remains active but workstations cannot connect to the server. Before long, the only continuing traffic is LLC NetBEUI low-level transport operations.

Workstations attempting new connections receive error 51. Error 53 is sometimes returned, but this actually is error 51 erroneously reported by LAN Manager. Likewise, error net3779, sometimes returned to users attempting to log on to a sleeping primary domain controller (PDC), is incorrect and should be error 51.

PSTAT may show that due to an Announcer thread failure Netlogon was stuck in a critical section. No new listens are posted by the ring 3 server, Netservr Scavenger thread. Net session at the server reports existing sessions, but the ring 3 server revokes new requests.

Possible error messages:

- Error 51: The remote computer is not available.
- Error 53: The network path was not found.
- Error 240: The network connection is disconnected.
- Net3779: Your logon attempt has failed due to an incorrect username or password.

Servers with a NETAPI.DLL failure respond to console keyboard commands, and their workstation servers allow network access to other servers by means of NET USE for users logged on at the server. Since a NETAPI.DLL failure leads to a Netlogon service failure, users cannot log on if the server is a domain controller.

Attempting to shut down the server may return:

    Net Stop Server
    Net2190: The service ended abnormally

    Net Stop Workstation
    Net2189: The service cannot be controlled in its present state

OS/2 shutdown may cause the server to hang.
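For triage, the misreported codes described above can be folded back to error 51 with a small lookup. The following is a minimal sketch; the `interpret` helper and the `server_sleeping` flag are hypothetical names introduced for illustration and are not part of LAN Manager.

```python
# Error codes and message texts as listed in this article.
MESSAGES = {
    51: "The remote computer is not available.",
    53: "The network path was not found.",
    240: "The network connection is disconnected.",
    3779: "Your logon attempt has failed due to an incorrect username or password.",
}

# Codes the article says are reported in place of error 51 on a sleeping server.
MISREPORTED_AS_51 = {53, 3779}


def interpret(code: int, server_sleeping: bool) -> int:
    """Return the code to act on (hypothetical helper, not a LAN Manager API).

    The remapping applies only when a sleeping server is suspected; in any
    other situation errors 53 and 3779 keep their usual meaning.
    """
    if server_sleeping and code in MISREPORTED_AS_51:
        return 51
    return code
```

The `server_sleeping` flag keeps the remapping from hiding a real bad-password or bad-path error on a healthy server.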

= CAUSE =

The symptoms indicate that the ring 3 server Scavenger thread failed while checking disk drives for free space for alert purposes. The thread hangs if a SCSI bid time-out failure occurs while the check is in progress. If the ring 3 server fails, active sessions continue operating but new connections are refused. Primary domain controller ring 3 server threads can also hang if a backup domain controller causes a semaphore deadlock by calling NetAccountSync.

On an OS/2 LAN Manager HPFS386 server, the ring 3 server (Netservr) provides all file services for FAT partitions, but only new server connection requests for HPFS386 partitions. The ring 0 HPFS386 server is optimized for performance, and it, not the ring 3 server, handles file service. As a result, if the ring 3 server fails, HPFS386 partitions continue to service requests, new connections are refused, and the server appears to sleep.
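The division of work between the ring 3 and ring 0 servers explains why a failed server looks asleep rather than dead. A minimal model of that split, assuming only the behavior stated above (the function and the request names are hypothetical labels for illustration):

```python
def is_serviced(request: str, partition: str, ring3_alive: bool) -> bool:
    """Model which requests still succeed, per the split described above.

    request:   "new_connection" or "file_io" (hypothetical labels)
    partition: "FAT" or "HPFS386"
    """
    if request == "new_connection":
        # New connection requests always go through the ring 3 server.
        return ring3_alive
    if request == "file_io":
        if partition == "HPFS386":
            # File service is handled by the ring 0 HPFS386 server.
            return True
        if partition == "FAT":
            # The ring 3 server provides all file services for FAT.
            return ring3_alive
    raise ValueError(f"unknown request/partition: {request!r}, {partition!r}")
```

With `ring3_alive=False`, existing HPFS386 file I/O still succeeds while every new connection is refused, which matches the sleeping-server symptom.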

= RESOLUTION =

Update NETAPI.DLL and NETLOGON.EXE CSD00.036. Following are procedures for installing the fixes on a LAN Manager 2.2 OS/2 1.301 server:

Procedure 1: Installing OS2KRNL
1. From the OS/2 File Manager, do the following:
   a. Select these options:
      - View Include All File Flags (Hidden) Set View Select OS2KRNL File
      - Change Flags
   b. Deselect these options:
      - System
      - Hidden
      - Read Only
2. Issue a NET STOP command on the workstation.
3. Shut down the server from the OS/2 desktop.
4. Use an HPFS386 recovery disk to boot the server.
5. Use OS/2 Disk 1 to perform the following command:
      chkdsk c: /f:386
6. Issue the following commands:
      rename c:2KRNL c:2KRNL.old
      copy a:2KRNL c:2KRNL (USE CAPITAL LETTERS ONLY)
      rename c:.sys *.old
      copy a:.sys c:
7. Restart the server.

Procedure 2: Installing monolithic drivers on a LAN Manager 2.2 OS/2 1.301 server

1. In the root directory, issue the following commands:
      md laddr
      copy .sys laddr (EXCEPT CONFIG.SYS)
      copy .bid laddr
      copy .tsd laddr
      copy .vsd laddr
2. Copy the following files from OS/2 1.21 or 1.3 installation Disk 1 (ISA computers) or Disk 2 (PS/2 Micro Channel computers) to the root directory:
      BASEDD0X.SYS
      DISK0X.SYS
   where X = 1 for ISA computers and 2 for PS/2 Micro Channel.
3. REM out the lines in the CONFIG.SYS file from DEVICE=DENON.VSD to IFS=CDROM.IFS.
4. Reboot.

Procedure 3: Installing updated NETAPI.DLL and NETLOGON.EXE
1. Issue the following command:
      copy c:.sys c:.sav
2. Make this TEMPORARY change to the CONFIG.SYS file:
      E Config.sys
      Libpath=c:;… (remove c:)
      Libpath=…
3. Shut down the server.
4. Restart the server.
5. Issue the following commands:
      rename c:.dll c:.old
      copy a:.dll c:
      rename c:.exe c:.old
      copy a:.exe c:
      copy c:.sav c:.sys

CONFIGURATION
=============

Here is how servers exhibiting “sleeping” problems were configured:

- 486 (> 33 MHz) PC server
- SCSI controller or IDE controller (16-bit ISA or 32-bit EISA or MCA)
- LAN Manager 2.1, 2.1a, 2.2
- Microsoft OS/2 1.301
- HPFS386 partitions
- Primary domain controller operation Netlogon service
- Ifs …… /cache:4096 or larger cache size
- OS/2 ring 3 applications such as Netlogon or SQL Server

TUNING RECOMMENDATIONS
======================

Check the server error log for this error:

    Net3101: The system ran out of a resource controlled by the *** option

where *** is the numbigbuf or numreqbuf parameter.

    Lanman.ini
    [Server]
    Numbigbuf = x (1-80)
    Numreqbuf = x (1-300)

If you find this error, edit LANMAN.INI and increase the corresponding parameters to correct the problem and prevent future server failures. LAN Manager allocates request and big buffers statically at server startup. Under high-stress operating conditions, these resources can be depleted, causing the ring 3 server threads (including the Scavenger) to fail.

UTILITIES AND DIAGNOSTICS
=========================

PSTAT: PSTAT reports made before or after the failure will verify that one or more Netlogon threads became stuck in a critical section, or Netservr threads, including the Scavenger thread, have been terminated.

Process and Thread Information on a sample PSTAT screen:

    Process     Thread
    Name        ID       Priority    Block ID    State
    NETLOGON    04       06FF        00000000    CritSec

Sniffer protocol analyzer traces will reveal that the server has no listen commands outstanding. As the workstation repeatedly fails to connect, it receives this packet and returns error 51 to the user.

Sample detail of a Sniffer screen:

    - Frame 1 -
    SUMMARY  Delta T  Destination  Source  Summary
    M 1               Workstation  Server  NETB Name 2TFRUIT Recognized
    NETB: ----- NETBIOS Name Recognized -----
    NETB:
    NETB: Header length = 44, Data length = 0
    NETB: Delimiter = EFFF (NETBIOS)
    NETB: Command = 0E
    NETB: No LISTEN command outstanding for this name.
    NETB: Caller’s name type = 00 (Unique name)
    NETB: Transmit correlator = 000D
    NETB: Response correlator = 0000
    NETB: Receiver’s name = Workstation<00>
    NETB: Sender’s name = Server
    NETB:

If the server service is swapped to disk, however, sessions are dropped and only LLC and NETB traffic remains active. The NETB traffic may eventually end as well.

Sample from a Sniffer summary report:

     98   0.0369  SERVER       WORKSTATION  SMB C Open .cmd
     99   0.0429  WORKSTATION  SERVER       NETB D=68 S=05 Data ACK
    100   0.0011  SERVER       WORKSTATION  LLC R D=F0 S=F0 RR NR=117
    101  15.5651  SERVER       WORKSTATION  NETB Session alive
    102   0.2152  WORKSTATION  SERVER       LLC R D=F0 S=F0 RR NR=36
    103   2.0314  WORKSTATION  SERVER       NETB Session alive
    104   0.0008  SERVER       WORKSTATION  LLC R D=F0 S=F0 RR NR=118
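Returning to the Tuning Recommendations section, the LANMAN.INI check can be sketched as a short script that reads the [Server] section and compares numbigbuf and numreqbuf with the ranges quoted there (1-80 and 1-300). This is a minimal sketch under stated assumptions: it uses Python's standard `configparser`, the function name and status strings are invented for illustration, and it assumes plain `key = value` lines without trailing comments.

```python
import configparser

# Ranges quoted in this article: numbigbuf 1-80, numreqbuf 1-300.
LIMITS = {"numbigbuf": (1, 80), "numreqbuf": (1, 300)}


def check_server_buffers(ini_text: str) -> dict:
    """Return {parameter: (value, status)} for the [Server] buffer settings.

    status is "ok", "at maximum", "out of range", or "missing".
    Hypothetical helper; not part of LAN Manager.
    """
    parser = configparser.ConfigParser(
        strict=False, interpolation=None, allow_no_value=True
    )
    parser.read_string(ini_text)
    # configparser section names are case-sensitive, so match manually.
    section = next((s for s in parser.sections() if s.lower() == "server"), None)
    report = {}
    for key, (low, high) in LIMITS.items():
        raw = parser.get(section, key, fallback=None) if section else None
        if raw is None:
            report[key] = (None, "missing")
            continue
        value = int(raw)
        if value < low or value > high:
            status = "out of range"
        elif value == high:
            status = "at maximum"  # cannot be increased any further
        else:
            status = "ok"
        report[key] = (value, status)
    return report
```

A parameter reported as "at maximum" cannot be raised further, so a Net3101 error naming it points to reducing load instead.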

Additional reference words: Sleeping, 51, scsi, bid, hang, 2.00 2.10 2.10a 2.20 Copyright Microsoft Corporation 1993.