I ran into this with a client last night when trying to restore
their SQL Server data warehouse database from TSM.
We went through several iterations of the restore failing
without any meaningful error other than a cryptic “TCP/IP Connection Refused”
even though we could ping and telnet from the database server to the AIX server
hosting TSM. The TSM technician at the client could see multiple tapes loading
and then just sitting, with only a little data transferred to the database
server at the beginning of the process but no large scale data transfer.
From my end on SQL Server I could see several connections from the TSM client
(Tivoli Data Protection/TDP) that appeared to be waiting as well for BACKUPIO
with little CPU or I/O traffic, but the wait time was cycling up and down
rather than continuously increasing, as if the thread was doing a little work
rather than completely blocked. Perfmon showed no I/O on the disk to
which we were going to restore.
The first several tries at the restore ran for about 20
minutes and then failed with the TCP/IP error. When this occurred, a
database was created on the SQL Server in “restoring” mode immediately before
the failure and empty files were created on the disk at this same time.
For each subsequent attempt I deleted the restoring database and the empty
files.
I found an article that talks about
the length of time required to zero-initialize large database files and the
potential need to increase the timeout on the TSM server (COMMTIMEOUT) to allow
the SQL Server time to zero-initialize the files. We tried this, but
realized after 25 minutes of no change that with no traffic on the disk (in
PerfMon) that it couldn’t be just waiting for SQL to zero-init the files.
The TSM technician found a IBM article about Tivoli Data
Protection that talked about hung restores
being related to striped backups. In this client’s case, they stripe their
TSM/TDP backups with four stripes to increase performance (fine) but they do it
to media pools (not fine). As a result in this case, there were four
backup stripes written to only three tapes. By default, the TSM/TDP
client restores with the same number of stripes as the backup, but when it
tried to queue up all four stripes at once it couldn’t with two of them on one
tape.
The answer was to modify the restore request to run as a
single striped restore, which succeeded. It still took about two hours
for the restore of the 214GB database, but when we kicked off the single
striped restore we immediately saw a difference on both the TSM server and the
SQL Server. It took the TSM server a couple minutes to load the first
tape and then progress was visible in the TSM/TDP client window, data was being
transferred from the TSM server, and the connections in SQL Server showed I/O.
This is potentially an issue with any enterprise backup
system that backs up using multiple stripes to tapes in media pools unless that
system is internally intelligent enough to prevent the “two stripes on one
tape” issue described.
Great article. I am fully impressed.
ReplyDeleteLG Networks, Inc. is a Microsoft Certified and a leading Microsoft Exchange Server Consulting firm.
Our Emergency Exchange Support consultants can help you attain the maximum benefit and stability from
Microsoft Exchange Server with administrative, technical and onsite or remote troubleshooting support.