Wednesday, April 2, 2014

LSN Chain Break and VM Backup !!!

Today I want to Discuss a Very Serious threat to the Production Databases which can eventually cause serious data loss .

While I was working on an issue in production, I had to restore the database to a point in time as a part of recovery process and I started my activity with the available backups starting with FULL backup. Once that was completed, I started with my T-Log backups. After restoring a few, I encountered a LSN mismatch error for the restore. Since we use TDP backups, I was restoring based on the order of backups in the interface. So I was thinking how come we could miss one in between.

I started searching for the missing backup file. After investigation, I understood that there was intermittent LOG backups happening with  a Device named 'NUL'. I verified the same in MSDB as well as SQL Error log that we had a backup happened in between the TDP schedule.



We confirmed from (Wintel,VMWare) that VM disk snapshots are happening for the server(VM). The tricky part is that the timestamps of the VM Snapshots matched the timestamps we saw in the MSDB and the SQL Server error logs for the 'NUL' Backups.


Later on after discussion with VM team and several searching in Google, we came to know that there is an option in VM backup tool for performing Transaction Log clearing. In our case it was VEEAM backup tool. Other tools like Hyper V backup, V ranger etc are other backup tools.

During VM snapshot, there is an option to clear the LOG as shown below. By default the option to Truncate logs will be enabled.[First Option]




The way that these tools clear the transaction log during a VM snapshot backup is by issuing a subsequent command to the SQL Server,

BACKUP LOG <database_name> TO DISK =’NUL’

which dumps the transaction log records to nowhere, essentially throwing them away completely and thus breaking the log backup chain. In the sense, when you issue the above command, no backup files are getting generated and it simply truncates log, which inturn makes our day horrible. :(

Solution 1
=========
If your server is in VM environment and have SQL native backups enabled, then you have to make sure that this option in VM is disabled. If you want to do a quick check from SQL point of view please use the below query to verify from MSDB.

SELECT b.physical_device_name, a.backup_start_date, a.first_lsn, a.user_name, a.* FROM msdb..backupset a
INNER JOIN msdb..backupmediafamily b ON a.media_set_id = b.media_set_id
WHERE a.type = 'L'
ORDER BY a.backup_finish_date DESC


Solution 2
=========

Consider where you have a situation where you cannot change this option or not allowed to change. In that case verify the VM snapshot schedule from VM team and schedule/initiate a Full backup after the VM backup so that LSN chain is rebuild.


I hope this was a new information to few people as it was to me. Have a quick check. Better Late than Never  :)


Thanks
Jinu Varghese