Before my time at Ntirety I had done quite a bit of work with failover clustering and mirroring (trending into some Availability Groups) but I had never really done that much replication or log shipping. Even at my previous managed services remote DBA job I simply hadn't run across it...
I have said in this blog more than once that education (training and real-world) is one of the most important parts of being a DBA, and that we should always be learning; my time at Ntirety has definitely educated me about replication and log shipping!
This Log Shipping story (maybe its a cautionary tale - so many of these stories are) begins with a failed server migration. We were trying to replace a server with a new VM and due to some issues completely unrelated to log shipping, we ended up reverting to the original server (maybe once we figure that out I will write about it too).
The problem was that I had already dropped the log shipping from OldServer01.Database99 to LSSecondary.Database99 and created new log shipping from NewServer01.Database99 to LSSecondary.Database99. Part of the backout plan was dropping *that* log shipping and recreating the original relationship between OldServer01 and LSSecondary.
I had the creation process scripted out, so I ran the scripts, made sure they succeeded, checked that LSSecondary.Database99 reflected the current data (we had it in Standyby/Read-Only) and then walked away from log shipping to continue the rest of the revert process.
Monday morning rolled around and one of the business users reported that Database99 on LSSecondary was in Restoring rather than Standby, and wanted me to check it out. Sure enough, Database99 was in Restoring even though the LSRestore job wasn't running and there was no active RESTORE process running.
OK....
I check the properties of the log shipping secondary in the Management Studio GUI and they looked as expected:
SELECT secondary_database
, restore_mode
, disconnect_users
, last_restored_file
FROM msdb.dbo.log_shipping_secondary_databases
Since the Database shows restore_mode 1 (Standby) and disconnect_users 1 (true) there is no reason by the configuration that the database should be in RESTORING.
The next stop was the SQL Server Error Log. This probably should have been my first stop, but since I had just set this log shipping up I thought of the configuration first. The Error Log was a horror show:
--
Error: 3456, Severity: 16, State: 1.
Could not redo log record (31185729:63924:22),
for transaction ID (0:0), on page (1:511238), database 'Database99'
(database ID 5). Page: LSN = (31185729:63857:14), type = 16. Log: OpCode = 7,
context 24, PrevPageLSN: (31185729:63804:511). Restore from a backup of the
database, or repair the database
--
…followed
by 100+ lines of Error Stack Dump <cue scary music>:
--
Using 'dbghelp.dll' version '4.0.5'
**Dump thread - spid = 0, EC = 0x00000002E61A20F0
***Stack Dump being sent to F:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\LOG\SQLDump0001.txt
* *******************************************************************************
*
* BEGIN STACK DUMP:
* 10/22/17 03:00:34 spid 55
*
* HandleAndNoteToErrorlog: Exception raised, major=34, minor=56, severity=16
*
* Input Buffer 506 bytes -
* RESTORE LOG [Database99] FROM DISK = N'\\LSSecondary\
* Database99\Database99_20171022053000.trn' WITH FILE = 1, STANDBY =
* N'F:\Program Files\Microsoft SQL Server\MSSQL10_50.MSSQLSERVER\MSSQL\DA
* TA\Database99_20171022090018.tuf'
*
*
* MODULE BASE END SIZE
* sqlservr 0000000000A40000 00000000045EDFFF 03bae000
* ntdll 0000000077510000 00000000776B8FFF 001a9000
* kernel32 00000000772F0000 000000007740EFFF 0011f000
* KERNELBASE 000007FEFD560000 000007FEFD5CBFFF 0006c000
* ADVAPI32 000007FEFF730000 000007FEFF80AFFF 000db000
* msvcrt 000007FEFED70000 000007FEFEE0EFFF 0009f000
* sechost 000007FEFE5B0000 000007FEFE5CEFFF 0001f000
* RPCRT4 000007FEFE630000 000007FEFE75CFFF 0012d000
* MSVCR80 0000000074F50000 0000000075018FFF 000c9000
* MSVCP80 0000000074BD0000 0000000074CD8FFF 00109000
* sqlos 00000000745D0000 00000000745D6FFF 00007000
* Secur32 000007FEFCFE0000 000007FEFCFEAFFF 0000b000
* SSPICLI 000007FEFD260000 000007FEFD284FFF 00025000
* pdh 000007FEF6F00000 000007FEF6F4DFFF 0004e000
* SHLWAPI 000007FEFF680000 000007FEFF6F0FFF 00071000
* GDI32 000007FEFEC20000 000007FEFEC86FFF 00067000
* USER32 0000000077410000 0000000077509FFF 000fa000
* LPK 000007FEFF810000 000007FEFF81DFFF 0000e000
* USP10 000007FEFF190000 000007FEFF258FFF 000c9000
* USERENV 000007FEFC740000 000007FEFC75DFFF 0001e000
* profapi 000007FEFD3D0000 000007FEFD3DEFFF 0000f000
* WINMM 000007FEF6EC0000 000007FEF6EFAFFF 0003b000
* IPHLPAPI 000007FEFB670000 000007FEFB696FFF 00027000
* NSI 000007FEFE620000 000007FEFE627FFF 00008000
* WINNSI 000007FEFB660000 000007FEFB66AFFF 0000b000
* opends60 00000000745E0000 00000000745E7FFF 00008000
* NETAPI32 000007FEFAF00000 000007FEFAF15FFF 00016000
* netutils 000007FEFC970000 000007FEFC97BFFF 0000c000
* srvcli 000007FEFCF20000 000007FEFCF42FFF 00023000
* wkscli 000007FEFAEE0000 000007FEFAEF4FFF 00015000
* LOGONCLI 000007FEFCC80000 000007FEFCCAFFFF 00030000
* SAMCLI 000007FEFA310000 000007FEFA323FFF 00014000
* BatchParser 00000000743D0000 00000000743FCFFF 0002d000
* IMM32 000007FEFF700000 000007FEFF72DFFF 0002e000
* MSCTF 000007FEFEE10000 000007FEFEF18FFF 00109000
* psapi 00000000776D0000 00000000776D6FFF 00007000
* instapi10 0000000074AA0000 0000000074AACFFF 0000d000
* cscapi 000007FEF6EB0000 000007FEF6EBEFFF 0000f000
* sqlevn70 0000000071050000 0000000071250FFF 00201000
* ntmarta 000007FEFC530000 000007FEFC55CFFF 0002d000
* WLDAP32 000007FEFF130000 000007FEFF181FFF 00052000
* CRYPTSP 000007FEFCBE0000 000007FEFCBF7FFF 00018000
* rsaenh 000007FEFC9E0000 000007FEFCA26FFF 00047000
* CRYPTBASE 000007FEFD300000 000007FEFD30EFFF 0000f000
* BROWCLI 000007FEF6DE0000 000007FEF6DF1FFF 00012000
* AUTHZ 000007FEFCE70000 000007FEFCE9EFFF 0002f000
* MSCOREE 000007FEF9F90000 000007FEF9FFEFFF 0006f000
* mscoreei 000007FEF9EE0000 000007FEF9F77FFF 00098000
* ole32 000007FEFEF20000 000007FEFF122FFF 00203000
* credssp 000007FEFC900000 000007FEFC909FFF 0000a000
* msv1_0 000007FEFCB80000 000007FEFCBD1FFF 00052000
* cryptdll 000007FEFCF50000 000007FEFCF63FFF 00014000
* kerberos 000007FEFCCB0000 000007FEFCD67FFF 000b8000
* MSASN1 000007FEFD4B0000 000007FEFD4BEFFF 0000f000
* schannel 000007FEFC980000 000007FEFC9D6FFF 00057000
* CRYPT32 000007FEFD630000 000007FEFD79CFFF 0016d000
* security 0000000074210000 0000000074212FFF 00003000
* WS2_32 000007FEFE5D0000 000007FEFE61CFFF 0004d000
* SHELL32 000007FEFD820000 000007FEFE5A8FFF 00d89000
* OLEAUT32 000007FEFEC90000 000007FEFED66FFF 000d7000
* ftimport 0000000060000000 0000000060024FFF 00025000
* MSFTE 0000000049980000 0000000049D2DFFF 003ae000
* VERSION 000007FEFC520000 000007FEFC52BFFF 0000c000
* dbghelp 000000006F0C0000 000000006F21DFFF 0015e000
* WINTRUST 000007FEFD7A0000 000007FEFD7DAFFF 0003b000
* ncrypt 000007FEFCE20000 000007FEFCE6FFFF 00050000
* bcrypt 000007FEFCDF0000 000007FEFCE11FFF 00022000
* mswsock 000007FEFCC20000 000007FEFCC74FFF 00055000
* wship6 000007FEFCDE0000 000007FEFCDE6FFF 00007000
* wshtcpip 000007FEFC620000 000007FEFC626FFF 00007000
* ntdsapi 000007FEFB0A0000 000007FEFB0C6FFF 00027000
* DNSAPI 000007FEFCA30000 000007FEFCA8AFFF 0005b000
* rasadhlp 000007FEFAA70000 000007FEFAA77FFF 00008000
* fwpuclnt 000007FEFB560000 000007FEFB5B2FFF 00053000
* bcryptprimitives 000007FEFC7A0000 000007FEFC7EBFFF 0004c000
* CLBCatQ 000007FEFF560000 000007FEFF5F8FFF 00099000
* sqlncli10 0000000072690000 0000000072947FFF 002b8000
* COMCTL32 000007FEF6F50000 000007FEF6FEFFFF 000a0000
* COMDLG32 000007FEFF4C0000 000007FEFF556FFF 00097000
* SQLNCLIR10 0000000071260000 0000000071296FFF 00037000
* netbios 000007FEF4040000 000007FEF4049FFF 0000a000
* xpsqlbot 0000000074270000 0000000074277FFF 00008000
* xpstar 0000000065380000 0000000065407FFF 00088000
* SQLSCM 0000000074590000 000000007459DFFF 0000e000
* ODBC32 000007FEF2BA0000 000007FEF2C50FFF 000b1000
* ATL80 0000000074F00000 0000000074F1FFFF 00020000
* odbcint 00000000733D0000 0000000073407FFF 00038000
* clusapi 000007FEF6760000 000007FEF67AFFFF 00050000
* resutils 000007FEF6740000 000007FEF6758FFF 00019000
* xpstar 0000000074240000 0000000074264FFF 00025000
* xplog70 0000000074280000 000000007428FFFF 00010000
* xplog70 0000000074220000 0000000074221FFF 00002000
* dsrole 000007FEFBCD0000 000007FEFBCDBFFF 0000c000
* RpcRtRemote 000007FEFD3B0000 000007FEFD3C3FFF 00014000
* ssdebugps 000007FEEAE20000 000007FEEAE31FFF 00012000
* apphelp 000007FEFD2A0000 000007FEFD2F6FFF 00057000
* msxmlsql 00000000632C0000 0000000063410FFF 00151000
* dbghelp 000000006F8D0000 000000006FA2DFFF 0015e000
*
* P1Home: 0044004C00510053:
* P2Home: 00000000004A87B4: 0122002501210023 0124002900230027 0126002D0025002B 012800310127002F 012A003501290033 002C0039002B0037
* P3Home: 00000000004C8310: 000000002A010FF0 000000002A010FF0 0000000017407610 000000001AA86DD0 000000001736E3F0 000000002A027FD0
* P4Home: 0000000000000000:
* P5Home: 0052004500AE00AC:
* P6Home: 000000001742C658: 0050005C003A0046 00720067006F0072 00460020006D0061 00730065006C0069 00630069004D005C 006F0073006F0072
* ContextFlags: 000000000010000F: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
* MxCsr: 0000000000001FA0:
* SegCs: 0000000000000033:
* SegDs: 000000000000002B:
* SegEs: 000000000000002B:
* SegFs: 0000000000000053:
* SegGs: 000000000000002B:
* SegSs: 000000000000002B:
* EFlags: 0000000000000202:
* Rax: 0000000000991AD3: 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000 0000000000000000
* Rcx: 000000002C568260: 0044004C00510053 00000000004A87B4 00000000004C8310 0000000000000000 0052004500AE00AC 000000001742C658
* Rdx: 0000000000000000:
* Rbx: 0000000000000000:
* Rsp: 000000002C568870: 0000000000000000 00000002E61A20F0 000000002C568AA0 000007FFFFEDC000 00000000000042AC 0000000000000000
* Rbp: 0000000000000000:
* Rsi: 00000002E61A20F0: 00000002E61A20F0 000000063039CEA0 0000000000000000 0000000000000004 00000002E61A2110 00000002E61A2110
* Rdi: 000000002C568AA0: 000000001742C650 FFFFFFFFFFFFFFFE 000007FFFFEDB4A8 0000000000000000 00000006699981A0 0000000000A426C3
* R8: 0000000000000000:
* R9: 0000000000000000:
* R10: 0000000000000000:
* R11: 000000002C569080: 0044004C00510053 00000000004A87B4 00000000004C8310 0000000000000000 0052004500AE00AC 000000001742C658
* R12: 0000000000000001:
* R13: 000000063039CEA0: 000000063039C080 000000063039C3B0 000000063039CAD0 00000002E61A2080 000000063039CF10 000000008089CE80
* R14: 000000002C569F60: 0064006E00610048 006E00410065006C 0074006F004E0064 0045006F00540065 0072006F00720072 003A0067006F006C
* R15: 000000000000003F:
* Rip: 000007FEFD56B3DD: C3000000C8C48148 003B830874F68548 0001BD1689660376 02D6840FFD3B0000 F3840F02FF830000 870F07FF83000002
* *******************************************************************************
* -------------------------------------------------------------------------------
* Short Stack Dump
000007FEFD56B3DD Module(KERNELBASE+000000000000B3DD)
0000000002AE981D Module(sqlservr+00000000020A981D)
0000000002AEDD3A Module(sqlservr+00000000020ADD3A)
0000000002B210C3 Module(sqlservr+00000000020E10C3)
0000000002B20F5D Module(sqlservr+00000000020E0F5D)
0000000000E3F5D8 Module(sqlservr+00000000003FF5D8)
0000000000E3F2A6 Module(sqlservr+00000000003FF2A6)
0000000002A47DBB Module(sqlservr+0000000002007DBB)
0000000002B24941 Module(sqlservr+00000000020E4941)
0000000002D14BFB Module(sqlservr+00000000022D4BFB)
0000000002CC0A44 Module(sqlservr+0000000002280A44)
0000000002CC0719 Module(sqlservr+0000000002280719)
0000000002CBFDC5 Module(sqlservr+000000000227FDC5)
0000000002D4C1B4 Module(sqlservr+000000000230C1B4)
0000000002D4B845 Module(sqlservr+000000000230B845)
0000000002CA321E Module(sqlservr+000000000226321E)
0000000002D4DD33 Module(sqlservr+000000000230DD33)
0000000002D52D0F Module(sqlservr+0000000002312D0F)
0000000002CC8CE6 Module(sqlservr+0000000002288CE6)
0000000001F288A1 Module(sqlservr+00000000014E88A1)
0000000000A99D59 Module(sqlservr+0000000000059D59)
0000000000A9A9B8 Module(sqlservr+000000000005A9B8)
0000000000A9B30C Module(sqlservr+000000000005B30C)
0000000000A9C1A6 Module(sqlservr+000000000005C1A6)
0000000000AE5342 Module(sqlservr+00000000000A5342)
0000000000A4BBD8 Module(sqlservr+000000000000BBD8)
0000000000A4B8BA Module(sqlservr+000000000000B8BA)
0000000000A4B6FF Module(sqlservr+000000000000B6FF)
0000000000F68FB6 Module(sqlservr+0000000000528FB6)
0000000000F69175 Module(sqlservr+0000000000529175)
0000000000F69839 Module(sqlservr+0000000000529839)
0000000000F69502 Module(sqlservr+0000000000529502)
0000000074F537D7 Module(MSVCR80+00000000000037D7)
0000000074F53894 Module(MSVCR80+0000000000003894)
00000000773059CD Module(kernel32+00000000000159CD)
000000007753B981 Module(ntdll+000000000002B981)
Stack Signature for the dump is 0x000000012832F01D
External dump process return code 0x20000001. External dump process returned no errors.
--
... followed by:
--
Error: 9003, Severity: 20, State: 6.
The log scan number (31185729:63630:107) passed to log scan
in database 'Database99' is not valid. This error may indicate data
corruption or that the log file (.ldf) does not match the data file (.mdf). If
this error occurred during replication, re-create the publication. Otherwise,
restore from backup if the problem results in a failure during startup.
--
|
https://thecatholicgeeks.files.wordpress.com/2015/05/beafraid.jpg |
Pretty scary looking errors…I
checked the database on OldServer01, especially since the CheckDB job hadn't run in several days (back to before the migration attempt), and thankfully
it was clear:
--
Date and time: 2017-10-23 10:49:35
Server: OldServer01
Version: 10.50.4305.0
Edition: Enterprise Edition (64-bit)
Procedure:
[master].[dbo].[DatabaseIntegrityCheck]
Parameters: @Databases = 'Database99',
@CheckCommands = 'CHECKDB', @PhysicalOnly = 'N', @NoIndex = 'N',
@ExtendedLogicalChecks = 'N', @TabLock = 'N'
, @FileGroups = NULL, @Objects = NULL,
@LockTimeout = NULL, @LogToTable = 'N', @Execute = 'Y'
Date and time: 2017-10-23 10:49:35
Database: [Database99]
Status: ONLINE
Standby: No
Updateability: READ_WRITE
User access: MULTI_USER
Is accessible: Yes
Recovery model: FULL
Date and time: 2017-10-23 10:49:35
Command: DBCC CHECKDB ([Database99]) WITH
NO_INFOMSGS, ALL_ERRORMSGS, DATA_PURITY
Outcome:
Succeeded
Duration: 00:35:44
Date and time: 2017-10-23 11:25:19
--
After some poking around and several rebuilds of log shipping (deleting the secondary, taking a new FULL on OldServer01, restoring it to LSSecondary, and then recreating the process) I figured out the issue was related to the first part of the first error:
“Could not redo log record (31185747:40079:1), for transaction ID (0:1129634848), on page (1:10522412), database 'Database99'”
The redo reader process was looking in the log shipping folder \\LSSecondary\Database99 and trying to restore the oldest TRN log backup files, which were related to the log shipping configuration from before the migration/fail-back over the weekend. Those older TRN files had log sequence numbers (LSNs) that were from before the FULL backup used to initialize the re-created log shipping, which was making the overall process error out and blowing up the process.
A colleague suggested that the log shipping restore should be skipping TRN files that are "too old" via LSN (which matched my understanding of how it "should" work) and that maybe this reflected one of the old TRN files having an invalid file header. After some consideration we realized that probably was *not* the case since the error message doesn't reflect not being able to read the backup file header, and if it couldn't read the header how would it have a log record number (31185747:40079:1) to include in the "could not redo" portion of the error?
I deleted the old backups from \\LSSecondary\Database99 and tried to run the process again, and received the same errors. When I looked in \\LSSecondary\Database99 I saw all of the old TRN files were back!
I had failed to consider (sigh) that the LSCopy job would bring all of the "old" files from the primary folder at \\OldServer01\Database99 back to \\LSSecondary\Database99, putting me back in the same situation.
Take 2 - I deleted all of the log shipping backups from *both* \\OldServer01\Database99 and \\LSSecondary\Database99 and then re-created the log shipping one last time…
|
https://memegenerator.net/img/instances/500x/27017972/success.jpg |
This time the LS_Restore job functioned as expected, Database99 was in Standby/Read-Only on LSSecondary, and instead of my 100+ line stack dump I received some very respectable information messages in the SQL Error Log on LSSecondary:
--
Setting database option SINGLE_USER to ON for database Database99.
Starting up database 'Database99'.
The database 'Database99' is marked RESTORING and is in a state that does not allow recovery to be run.
Recovery is writing a checkpoint in database 'Database99' (5). This is an informational message only. No user action is required.
Recovery completed for database Database99 (database ID 5) in 2 second(s) (analysis 1080 ms, redo 0 ms, undo 466 ms.) This is an informational message only. No user action is required.
Starting up database 'Database99'.
CHECKDB for database 'Database99' finished without errors on 2017-10-23 10:49:44.400 (local time). This is an informational message only; no user action is required.
Log was restored. Database: Database99, creation date(time): 2017/10/22(02:20:41), first LSN: 31185760:34887:1, last LSN: 31185760:44042:1, number of dump devices: 1, device information: (FILE=1, TYPE=DISK: {'\\LSSecondary\Database99\Database99_20171023204500.trn'}). This is an informational message. No user action is required.
Setting database option MULTI_USER to ON for database Database99.
--
The SINGLE_USER and MULTI_USER messages are artifacts of telling the log shipping to disconnect users when restoring backups.
--
Hope this helps!