11:30pm on a
Saturday night, and the pager went off…
Error: 5901, Severity: 16, State: 1.
One or more recovery units belonging to database 'DB_1' failed to generate a checkpoint. This is typically caused by lack of system resources such as disk or memory, or in some cases due to database corruption. Examine previous entries in the error log for more detailed information on this failure.
That phrase
– “due to database corruption” – always makes my stomach turn just a little…
https://img.memesuper.com/9b7a67d76f5de79439a92cc9ad703dda_im-coming-elizabeth-its-the-im-coming-elizabeth-meme_500-384.jpeg
Thankfully,
this turned out to *not* be corruption. <Whew!>
There was a
related message:
Could
not write a checkpoint record in database DB_1 because the log is out of space. Contact the
database administrator to truncate the log or allocate more space to the
database log files.
…and then
the real problem…
Error:
9002, Severity: 17, State: 9.
The
transaction log for database DB_1 is full due to 'AVAILABILITY_REPLICA'.
--
Upon cursory
examination, DB_1 was suspended (not moving data) in Sales_AG, with the AG
being primary on node ServerA. The other
databases in the AG were all transferring data without apparent issue.
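If you want to see that state for yourself, a quick check like this (just a sketch – the database name is the one from this story) shows which databases on the local replica have data movement suspended and why:
SELECT d.name,
       drs.synchronization_state_desc,
       drs.is_suspended,
       drs.suspend_reason_desc
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.databases AS d
    ON d.database_id = drs.database_id
WHERE drs.is_local = 1;    -- local replica only; DB_1 showed up here as suspended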
I went to
check the drive hosting the LDF log file and almost broke out laughing at midnight
– it was a mount point that was 367MB.
That’s
right, megabytes.
The DB_1
database was just over 1GB, with a 350MB LDF file.
Since this was an Availability Group, there wasn’t shared storage, so I went to the B node to see whether something was sitting on the drive on that side – some extra file taking up space and preventing the log file from auto-growing – but I didn’t find one.
--
The regular Database Index Maintenance instigated this situation. The largest object (table) in DB_1 is only 69MB, but the transaction LOG backup job only runs every fifteen minutes. In a database this small it is quite possible that the index rebuild will cycle through all of the tables in the database between LOG backups, which can push the LDF log file out towards the used size of the MDF data file – in this case 1GB.
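A quick way to watch this happen is to check log usage while the maintenance runs – something like this (a sketch, run in the database in question) shows how full the log is and how much has accumulated since the last LOG backup:
SELECT total_log_size_in_bytes / 1048576.0                  AS log_size_mb,
       used_log_space_in_percent,
       log_space_in_bytes_since_last_backup / 1048576.0     AS log_since_last_backup_mb
FROM sys.dm_db_log_space_usage;    -- reports on the current database context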
In this
specific situation, the LDF log file grew by 10% repeatedly until it hit
350MB. At that point, there wasn’t room
for another 10%/35MB so it couldn’t grow any more.
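The growth settings (and the percent growth that got us here) are easy to confirm with a sketch like this against the database in question:
SELECT name,
       size * 8 / 1024    AS size_mb,    -- size is reported in 8KB pages
       growth,                           -- a percentage when is_percent_growth = 1, otherwise 8KB pages
       is_percent_growth,
       max_size
FROM sys.database_files
WHERE type_desc = 'LOG';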
On a regular (non-availability group) database this would manifest as 9002 log-full errors with a log_reuse_wait_desc of LOG_BACKUP – the most likely fix would be to take LOG backups, shrink the LDF log file if needed, and be done. That was not the case here.
(Note that the log_reuse_wait_desc here wasn’t LOG_BACKUP, it was AVAILABILITY_REPLICA. This shows that something on the AG side was breaking the situation before the lack of a LOG backup could even be raised as a problem, but I didn’t notice that at the time.)
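That column sits right in sys.databases – a quick check like this would have pointed at the AG rather than at missing LOG backups:
SELECT name, log_reuse_wait_desc
FROM sys.databases
WHERE name = 'DB_1';    -- showed AVAILABILITY_REPLICA here, not LOG_BACKUP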
--
I added a
second LDF log file (not a great answer but often a useful triage to get things
moving) and that allowed DB_1 to function, but it did not allow the
availability group data movement to resume. To try to repair the situation, I removed DB_1 from the Sales_AG availability group. (Note that this does not prevent applications from accessing the database; it just means DB_1 could no longer fail over to the B node if a failover occurred – but in the current state, where data movement was suspended and non-resumable, a failover wouldn’t have worked anyway.)
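For reference, that triage looks roughly like this in T-SQL (a sketch – the second file’s logical name and the D:\TempLog path are made up for illustration; the REMOVE DATABASE statement runs on the primary):
ALTER DATABASE DB_1
    ADD LOG FILE (NAME = DB_1_log2,                         -- hypothetical logical name
                  FILENAME = 'D:\TempLog\DB_1_log2.ldf',    -- hypothetical path on a drive with free space
                  SIZE = 100MB);

ALTER AVAILABILITY GROUP Sales_AG REMOVE DATABASE DB_1;     -- run on the primary (ServerA)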
I ran the LOG
backup job manually multiple times and then tried to re-add the DB_1 database
to the availability group – that action gave me a telling message that there
wasn’t room on the B node to restore the LDF log file:
System.Data.SqlClient.SqlError:
There is insufficient free space on disk volume 'X:\LOG' to create the
database. The database requires 368050176 additional free bytes, while only 356515840
bytes are available.
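The manual LOG backups themselves are nothing fancy – something along these lines, with the backup destination being whatever your environment uses (the path below is made up):
BACKUP LOG DB_1
    TO DISK = 'Y:\Backups\DB_1_log.trn';    -- hypothetical path; repeat as needed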
The availability group is what made this situation especially nasty – by growing to 350MB, the LDF log file was now larger than the 343MB mount point on the B node. That automatically suspended data movement for the DB_1 database (not the whole availability group – just the single database) while it waited for sufficient space to grow the LDF on the secondary node.
I have dealt with lots of clustering/mirroring/AGs over the years, and I have never tripped over this particular issue before. With traditional failover clustering you have shared storage, so this isn’t an issue; with a “no shared storage” design, this is a real risk for both mirroring and AGs, and in this case it was exacerbated by the tiny size of the mount.
I was able
to shrink the LDF log file of DB_1 to 310MB, which fit inside the 343MB mount
point on the B node. At that point I was able to re-add DB_1 to Sales_AG,
and after it was successfully back in the availability group the database
showed in the monitor as successfully moving data from A>>B.
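The shrink-and-re-add looks roughly like this (a sketch – I used the SSMS wizard for the re-add, the logical log file name DB_1_log is assumed, the ADD DATABASE runs on the primary, and the SET HADR join runs on the secondary once the database is restored there WITH NORECOVERY):
DBCC SHRINKFILE (DB_1_log, 310);                            -- target size in MB

ALTER AVAILABILITY GROUP Sales_AG ADD DATABASE DB_1;        -- on the primary

-- on the B node, after the database is restored WITH NORECOVERY:
ALTER DATABASE DB_1 SET HADR AVAILABILITY GROUP = Sales_AG;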
I removed the secondary LDF log file that I had added, since multiple log files are not recommended as a regular operating state. #TheMoreYouKnow
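Removing the extra file is simple once it contains no active log (taking a LOG backup first helps) – again a sketch using the assumed logical name:
ALTER DATABASE DB_1 REMOVE FILE DB_1_log2;    -- fails if the file still holds active log records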
As a final step I shrank the primary LDF log file a little further (shrinking is not great, but it is sometimes necessary) and then modified the FILEGROWTH for the file to grow in 10MB increments (rather than a percentage). I then manually capped the MAXSIZE of the file at 335MB so that if the file did fill again, it wouldn’t break the availability group in this same way. The database would still throw 9002s, but they would be for LOG_BACKUP, not AVAILABILITY_REPLICA.
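Those file changes look roughly like this (the logical name DB_1_log is assumed):
ALTER DATABASE DB_1
    MODIFY FILE (NAME = DB_1_log, FILEGROWTH = 10MB, MAXSIZE = 335MB);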
I warned the client that they really needed to expand the mount point a little (storage is cheap, right?) to cover the needs of the index maintenance and to match the sizes on the two nodes. I also noted that if the mount point were expanded, the manual cap would need to be lifted before the DB_1 LDF log file could take advantage of any extra space.
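Lifting that cap later is one more MODIFY FILE (same assumed logical name):
ALTER DATABASE DB_1
    MODIFY FILE (NAME = DB_1_log, MAXSIZE = UNLIMITED);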
--
The cautionary tale here is about the mismatched storage – as unlikely as it may seem to have 343MB and 367MB mount points, the exact same thing could happen on a system with 343GB and 367GB mounts and a file trying to grow past the smaller of the two sizes.
Configurations
on AG partners don’t have to be exactly the same, but they should be darned
close…and for storage – just make them the same, and save yourself this
particular headache.
http://img01.deviantart.net/668d/i/2004/260/e/c/headache_by_cat_lovers.jpg
Hope this
helps!