Nebraska SQL from @DBA_ANDY: July 2013

Wednesday, July 10, 2013

Hung SQL Server Restores from Tivoli Storage Manager/Tivoli Data Protection

I ran into this with a client last night when trying to restore their SQL Server data warehouse database from TSM.

We went through several iterations of the restore failing without any meaningful error other than a cryptic “TCP/IP Connection Refused” even though we could ping and telnet from the database server to the AIX server hosting TSM. The TSM technician at the client could see multiple tapes loading and then just sitting, with only a little data transferred to the database server at the beginning of the process but no large scale data transfer. From my end on SQL Server I could see several connections from the TSM client (Tivoli Data Protection/TDP) that appeared to be waiting as well for BACKUPIO with little CPU or I/O traffic, but the wait time was cycling up and down rather than continuously increasing, as if the thread was doing a little work rather than completely blocked. Perfmon showed no I/O on the disk to which we were going to restore.

The first several tries at the restore ran for about 20 minutes and then failed with the TCP/IP error. When this occurred, a database was created on the SQL Server in “restoring” mode immediately before the failure and empty files were created on the disk at this same time. For each subsequent attempt I deleted the restoring database and the empty files.

I found an article that talks about the length of time required to zero-initialize large database files and the potential need to increase the timeout on the TSM server (COMMTIMEOUT) to allow the SQL Server time to zero-initialize the files. We tried this, but realized after 25 minutes of no change that with no traffic on the disk (in PerfMon) that it couldn’t be just waiting for SQL to zero-init the files.

The TSM technician found a IBM article about Tivoli Data Protection that talked about hung restores being related to striped backups. In this client’s case, they stripe their TSM/TDP backups with four stripes to increase performance (fine) but they do it to media pools (not fine). As a result in this case, there were four backup stripes written to only three tapes. By default, the TSM/TDP client restores with the same number of stripes as the backup, but when it tried to queue up all four stripes at once it couldn’t with two of them on one tape.

The answer was to modify the restore request to run as a single striped restore, which succeeded. It still took about two hours for the restore of the 214GB database, but when we kicked off the single striped restore we immediately saw a difference on both the TSM server and the SQL Server. It took the TSM server a couple minutes to load the first tape and then progress was visible in the TSM/TDP client window, data was being transferred from the TSM server, and the connections in SQL Server showed I/O.

This is potentially an issue with any enterprise backup system that backs up using multiple stripes to tapes in media pools unless that system is internally intelligent enough to prevent the “two stripes on one tape” issue described.