Nebraska SQL from @DBA_ANDY: It's Just Another 9002...another error like all the rest...

11:30pm on a Saturday night, and the pager went off…

Error: 5901, Severity: 16, State: 1.

One or more recovery units belonging to database 'i3_dialer' failed to generate a checkpoint. This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.

That phrase – “due to database corruption” – always makes my stomach turn just a little…

https://img.memesuper.com/9b7a67d76f5de79439a92cc9ad703dda_im-coming-elizabeth-its-the-im-coming-elizabeth-meme_500-384.jpeg

Thankfully, this turned out to *not* be corruption. <Whew!>

There was a related message:

Could not write a checkpoint record in database DB_1 because the log is out of space. Contact the database administrator to truncate the log or allocate more space to the database log files.

…and then the real problem…

Error: 9002, Severity: 17, State: 9.

The transaction log for database DB_1 is full due to 'AVAILABILITY_REPLICA'.

Upon cursory examination, DB_1 was suspended (not moving data) in Sales_AG, with the AG being primary on node ServerA. The other databases in the AG were all transferring data without apparent issue.

I went to check the drive hosting the LDF log file and almost broke out laughing at midnight – it was a mount point that was 367MB.

That’s right, megabytes.

http://static.ddmcdn.com/gif/tallest-fishing-tales0.jpg

The DB_1 database was just over 1GB, with a 350MB LDF file.

Being an Availability Group, there wasn’t shared storage, so I went to the B node to look and see if something was sitting on the drive on that side – some extra file that was taking up space and preventing the log file from auto-growing, but I didn’t find one.

The regular Database Index Maintenance instigated this situation. The largest object (table) in DB_1 is only 69MB, but the transaction LOG backup job only runs every fifteen minutes. In a database this small it is quite possible that the index rebuild will cycle through all of the tables in the database between LOG backups, which would drive the LDF fill out towards the used space of the MDF data file, in this case 1GB.

In this specific situation, the LDF log file grew by 10% repeatedly until it hit 350MB. At that point, there wasn’t room for another 10%/35MB so it couldn’t grow any more.

On a regular (non-availability group) database this would manifest as the 9002 errors for log full messages with a log_reuse_wait_desc of LOG_BACKUP – the most likely fix would be to take LOG backups, shrink the LDF log file if needed, and be done. That was not the case here.

(Note that the log_reuse_wait_desc here wasn’t LOG_BACKUP, it was AVAILABILITY_REPLICA. This shows that something was breaking the situation from the AG side before the lack of LOG_BACKUP could even be raised as a problem, but I didn’t notice that at the time.)

I added a second LDF log file (not a great answer but often a useful triage to get things moving) and that allowed DB_1 to function, but it did not allow the availability group data movement to resume. To try to repair the situation, I removed DB_1 from the Sales_AG availability group. (Note that this does not prevent applications from accessing the database, but rather it would not allow DB_1 to fail over to the B node if a failover occurred – but in the current state where data movement was suspended and non-resumable, a failover wouldn’t work anyway.)

I ran the LOG backup job manually multiple times and then tried to re-add the DB_1 database to the availability group – that action gave me a telling message that there wasn’t room on the B node to restore the LDF log file:

System.Data.SqlClient.SqlError: There is insufficient free space on disk volume 'X:\LOG' to create the database. The database requires 368050176 additional free bytes, while only 356515840 bytes are available.

http://s2.quickmeme.com/img/a7/a765c5be1bb9380025b17660b4782d2b5dc9fa3a9d50ede174a680733de1d383.jpg

What I found surprised me even more than finding a 367MB mount point in the first place on ServerA – the mount point on ServerB was only 343 MB!

This case with the availability group made the situation especially nasty – by growing to 350MB the LDF log file was now past the size of the 343MB mount point on the B node. This automatically suspended data movement for the DB_1 database (not the whole availability group – just the single database) while it was waiting for sufficient space to grow the LDF on the secondary node.

I have dealt with lots of clustering/mirroring/AG’s over the years, and I have never tripped over this particular issue before. With traditional failover clustering, you have shared storage, so this isn’t an issue; with a “no shared storage” design, this is a real risk for both mirroring and AG’s, and in this case it was exaggerated by the tiny size of the mount.

I was able to shrink the LDF log file of DB_1 to 310MB, which fit inside the 343MB mount point on the B node. At that point I was able to re-add DB_1 to Sales_AG, and after it was successfully back in the availability group the database showed in the monitor as successfully moving data from A>>B.

I removed the secondary LDF log file that I had added as it is not recommended to have multiple log files as a regular operating state. #TheMoreYouKnow

As a final step I shrunk the primary LDF log file a little further (shrinking is not great but is sometimes necessary) and then modified the FILEGROWTH for the file to grow in 10MB increments (rather than %). I then manually capped the MAXSIZE of the file at 335MB so that if the file did fill again, it wouldn’t break the availability group in this same way. The database would still throw 9002’s, but they would be for LOG_BACKUP, not AVAILABILITY_REPLICA.

I warned the client that they really needed to expand the mount point a little (storage is cheap, right?) to cover the needs of the index maintenance and to match the sizes. I also noted that if the mount point were expanded, the manual cap would need to be lifted before the DB_1 LDF log file will take advantage of any extra space.

The cautionary tale here is about the mismatched storage – as unlikely as it may seems to have 343MB and 365MB mount points, the exact same thing could happen on a system with a 343GB and 367GB mounts with a file trying to grow past the smaller of the two sizes.

Configurations on AG partners don’t have to be exactly the same, but they should be darned close…and for storage – just make them the same, and save yourself this particular headache.

http://img01.deviantart.net/668d/i/2004/260/e/c/headache_by_cat_lovers.jpg

Hope this helps!

Friday, May 5, 2017

It's Just Another 9002...another error like all the rest...

No comments:

Post a Comment