Thursday, August 27, 2015

SQL Server 2008 R2 SP3 Patch Failure

I was working on a SQL Server SP3 patch upgrade on Windows Server 2008 SP2, and on one of the servers it failed with the following error.


 
 
"Attributes do not match. Present attributes (Directory, Compressed) , included attributes (0), excluded attributes (Compressed, Encrypted)"
Exit code (Decimal): -2068578304
Exit facility code:       1204
 
 
Solution
Go to C:\Program files\SQLServer\ and change the properties of this folder by unchecking the Compress (and Archive) attribute.

Once you uncheck this, re-run the patch installation and it should complete successfully. It worked for me without any issues, and I hope it does the same for you.
 
Njoy and Have a Gr8 Day!!


 

Thursday, August 20, 2015

DBCC SHRINKFILE

Since SQL Server 2005 there is additional shrink logic to compact TEXT/IMAGE data, referred to as LOBs (Large Objects). The DBCC shrink code uses the same underlying methods that ALTER INDEX ... REORGANIZE WITH (LOB_COMPACTION = ON) uses to compact the LOB space in the database files.

There are various stages of DBCC shrink activity, which are indicated in the command column of the sys.dm_exec_requests DMV along with progress indications:

SELECT percent_complete, estimated_completion_time, cpu_time, total_elapsed_time, *
FROM sys.dm_exec_requests

Now let’s see the phases of the SHRINK operation.

Step 1 - DbccSpaceReclaim : Cleans up deferred allocations and purges empty extents, preparing for data moves.
Step 2 - DbccFilesCompact : Moves pages beyond the target to before the target and truncates the file as required.
Step 3 - DbccLOBCompact   : Compacts the LOB data.


A target extent is calculated based on the information supplied to the shrink command. If you have a file that is 1GB in size and you specify a target of 750MB, that 750MB mark becomes the point shrink works back towards.
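For example, a minimal sketch of that scenario (the database and logical file names here are just placeholders):

USE MyDatabase;
GO
-- Target of 750MB for a 1GB data file; DBCC SHRINKFILE takes the target size in MB
DBCC SHRINKFILE (N'MyDatabase_Data', 750);
GO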

MDF File
Shrink Steps
1. Cleans up space that is available in the entire file, such as deferred allocations. This helps expose space in the lower portion of the file (BOF - Beginning Of File).
2. Sets up the necessary information to tell the allocation routines that no pages should be allocated beyond 750MB (the target size).
3. Scans all pages from 750MB to 1GB that are marked allocated in the GAM. Data on allocated pages is moved to the lower portion of the file. This can involve index, LOB chain and other updates to move the row. The more rows that are found beyond the target extent location, the more work shrink has to do. Do not compare one file to another, because the data layout is critical to understanding the work that shrink has to perform. The dbcc extentinfo command can be used for this purpose, but be aware of the large volume of output it produces.
4. Once all pages are moved below the target point of truncation, LOBs will be compacted unless TRUNCATEONLY was specified in the SHRINK command.


WARNING: The following should only be used under the guidance of SQL Server support and may not be supported in the future.

SQL Server 2005 has a trace flag, -T2548 (DBCC TRACEON(2548, -1)), that allows shrink and other LOB_COMPACTION actions to be skipped. Enabling the trace flag returns shrink behavior to something similar to SQL Server 2000. Using ALTER INDEX ... REORGANIZE WITH (LOB_COMPACTION = ON) separately is another way of limiting the work that shrink has to perform.
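For reference, a quick sketch of both options (the index and table names are hypothetical, and remember the warning above about the trace flag):

-- Skip LOB compaction during shrink (global trace flag)
DBCC TRACEON (2548, -1);

-- Or compact LOB space separately, outside of the shrink
ALTER INDEX IX_MyIndex ON dbo.MyTable REORGANIZE WITH (LOB_COMPACTION = ON);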

Progress Indication
The progress indication in sys.dm_exec_requests is based on a grouping of work for DBCC. When a DBCC command starts, it performs an estimate of the work it will be doing. Going back to the 750MB example, part of the estimate will be that the 250MB at the end of the file has to be processed. As each page is interrogated and moved, the progress indication is advanced. However, this is not the only calculation that makes up the progress and estimated time indicators; for the same shrink, an estimate for LOB_COMPACTION is also made and included. Combining the progress indicators with the command value should help you follow the DBCC progress.
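A simple way to watch this while a shrink is running (a sketch; the command values correspond to the phases listed in the table above):

SELECT session_id, command, percent_complete, estimated_completion_time, cpu_time, total_elapsed_time
FROM sys.dm_exec_requests
WHERE command LIKE 'Dbcc%';  -- DbccSpaceReclaim, DbccFilesCompact, DbccLOBCompact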

Shrink Transaction Behavior
Shrink works in transaction batches of ~32 pages. Once 32 pages are moved below the truncation point, the transaction is committed and a new transaction starts. Previously, shrink kept a transaction open for a long time and caused the log files to grow, because you can’t truncate the log past an active transaction. This new process of working in batches of 32 pages also makes shrink restartable. If you cancel/kill a shrink operation, only the current active transaction is rolled back. So you could schedule a shrink operation during a maintenance window and limit its run time to that window, restart it during the next maintenance window, and it will pick up from where it left off. The only disadvantage is that allocations could occur beyond the target once the shrink is terminated. This may require shrink to do additional work for newly allocated space, but this should be reasonably limited.

The shrink process uses named transactions. For instance, during the DbccSpaceReclaim phase you will see transaction names such as ‘DeferredAllocUnitDrop::ReclaimSp‘, and by using sys.dm_tran_active_transactions you can view the activity.
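For example, a sketch of viewing those named transactions while the shrink is in progress:

SELECT transaction_id, name, transaction_begin_time, transaction_state
FROM sys.dm_tran_active_transactions
ORDER BY transaction_begin_time DESC;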



Move LOB Fragment

Like all other data, when a LOB fragment is found beyond the shrink target it is also moved below the target. If the LOB chain is in an older SQL Server format, the entire chain is upgraded to the current SQL Server LOB format by invoking 'UpgradeBlob'. The fastest way to upgrade the LOB chains is a series of BULK export and BULK insert operations. Otherwise, expect the shrink activity to require more resources and time to complete if it requires LOB compaction. There is an excellent post from Paul Randal on this: "Why LOB data makes shrink run slooooowly".



Single Threaded
Multiple shrink operations against the same file/database are not allowed; you will encounter the error below. The shrink operation is currently single threaded and, unlike DBCC CHECKDB, it does not run in parallel.

Error: File ID 1 of database ID 10 cannot be shrunk as it is either being shrunk by another process or is empty

Effects of Shrinking
1. I/O contention
2. Transaction log file space
3. Fragmentation
4. Lock contention, and many other things... :)

Paul has covered this in an excellent way on his blog.



Moral Of The Story
The bottom line is that there is a lot of overhead involved in shrinking data files, so it is not recommended as a regular practice unless you have no other option and genuinely need the space. The information in this post is collected from blogs written by CSS engineers.






Cheers... Have a Gr8 day!!!





  


Wednesday, July 29, 2015

REQUEST_FOR_DEADLOCK_SEARCH


I don’t want to make readers feel bored by reading repetitive things, so I am not explaining what a deadlock is or how it occurs. Instead, I felt it would be good to mention what the SQL Server engine does internally about deadlocks, and how.

Let me take you through a DMV called sys.dm_os_wait_stats. If you scan through the wait types it reports, you will see something like... no, exactly like... REQUEST_FOR_DEADLOCK_SEARCH. This is the guy we will be centering on today.
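If you want to take a quick look at it yourself, a small sketch:

SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'REQUEST_FOR_DEADLOCK_SEARCH';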

 
What is this Doing????

This is basically a sleepless thread that wakes up every five seconds to see if we have any deadlocks. Wow... what a nice guy!!! If it finds any, it terminates one of the sessions. That’s true, it KILLS one of the two (or more) sessions to let the other session have all the resources it needs. So we don’t have to worry, right?

Well, SQL Server does have to worry a bit, even if it’s not going to prison. So how does it decide which transaction to kill?

It decides this by looking at the LOG USED value. Now what is that?

Say you have high transaction log usage (more work done by the transaction); then you will have a high LOG USED value for that session (SPID). Consider another session with a low LOG USED value. It’s obvious that the second one will be easier to roll back, because if SQL Server kills a transaction, any work it has done must be rolled back to bring the database to a consistent state. So it picks the SPID with the lower value and kills it.

So what happens when the values are the same?

It uses a sort of LIFO (Last In, First Out) method and kills whoever came in second.

How it Works????

It wakes up every 5 seconds to check for deadlocks. If it finds any, it follows the above process to decide how to resolve it. However, the second time it wakes up right away, to make sure it is not a nested deadlock. If there is one, it kills that and goes back to sleep. The next time it wakes up at around 4.90 seconds (I believe it’s actually 10ms). It will keep going down to as small as 100ms; that is, it will wake up as many as 10 times per second to deal with deadlock issues.

 

So I hope this was not boring (for some) but instead shed some light on this internal thread.

Have a Gr8 Day!! Njoy

Tuesday, July 14, 2015

Operating system error 5(Access is denied.). [SQLSTATE 42000] (Error 3201) RESTORE DATABASE is terminating abnormally. [SQLSTATE 42000] (Error 3013)


I found this error while I was watching alerts in my new environment. I know this is a common and somewhat silly error for senior DBAs, but I intend to share it because I think it may at least help newbies...
Issue

The Restore job, part of a Maintenance Plan, was failing with the above error. It was trying to read the backup file from a shared network location. The job was running under the SQL Server service account, and the error was:

Executed as user: xxxx Cannot open backup device '\\xxx\d$\project\filetransfer\xxx.bak'. Operating system error 5(Access is denied.). [SQLSTATE 42000] (Error 3201)  RESTORE DATABASE is terminating abnormally. [SQLSTATE 42000] (Error 3013).  The step failed.

Solution
Even after giving the service account FULL permissions on the folder, and even on the root (the D drive), it was failing. But when we changed the way we referenced the backup location, it worked.

RESTORE DATABASE XXX
FROM DISK = '\\XXX\project\filetransfer\xxx.bak'

Since the folder 'project' was itself shared, we omitted D$ and used the share name directly, which in fact was the issue here: the D$ administrative share is only accessible to administrators on the remote server, which the service account was not, whereas the 'project' share granted it access.

Moral Of the Story:- Backups and Restores with Network locations are always Scary!!! :)



Hope this helps someone...Cheers and Njoy your day!!!



 

Thursday, June 12, 2014

Syspolicy_Purge_History Job failure !!!

Today I would like to share a job failure issue which I faced in my environment. The job name is syspolicy_purge_history.



What is this Job ?
================

SQL Server 2008 introduced a new feature called Policy Based Management.
After an installation of SQL Server 2008 or an upgrade to 2008, this job is created by default and runs daily at 2:00 AM.


What it Does ?
===============

When a policy is defined and runs, it stores the results in MSDB in the tables below.
 
syspolicy_policy_execution_history_details_internal

syspolicy_policy_execution_history_internal

syspolicy_policy_category_subscriptions_internal
 
Over a period of time this data piles up and fills MSDB. The job syspolicy_purge_history prunes data older than the number of days defined in the HistoryRetentionInDays property of Policy Management.
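If you want to check the current retention setting and how much history has piled up, a sketch (assuming the syspolicy_configuration view in msdb exposes the property mentioned above):

SELECT name, current_value
FROM msdb.dbo.syspolicy_configuration
WHERE name = 'HistoryRetentionInDays';

SELECT COUNT(*) AS history_rows
FROM msdb.dbo.syspolicy_policy_execution_history_internal;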
 
 
 
If for any reason the job does not exist, you can run the statement below to create it.
 

           EXEC msdb.dbo.sp_syspolicy_create_purge_job

Now we will come to our issue... When I checked the job history, the third step was failing and the reason was as below.
 
 
"Executed as user:  The process could not be created for step 3 of job 0x6C76502E1BAA0640B4EA991B90359043 (reason: The process cannot access the file because it is being used by another process).  The step failed."

I started troubleshooting from the PowerShell side, for example by changing the execution policy settings, but that didn't help. I even tried stopping the antivirus, but the job still failed with the same error.

Later I came across a file in the root of the C:\ drive called 'Program'. It was a little suspicious, and it is mentioned in some articles for the confusion it creates in the OS and the problems it causes when starting several applications.

I tried renaming it and was unsuccessful; it kept throwing an error that it was being used by another process. I started investigating and ran Process Explorer to find the details of the handle.




Once Process Explorer was open, I clicked Find Handle, which gave me a window to search for the file 'Program'. Once I found it in that search window, I selected it in the main window and right-clicked for the options below.



I chose Close Handle to remove the connection to the file 'Program'. Once that was done, I navigated to the root of C:\ and tried renaming the file, and this time I succeeded.

Then I re-ran the job, and it completed successfully. HURRAY !!!

It was an excellent learning experience for me. I hope it was new info to at least a few DBAs.


Please post your comments or concerns...


Have a Great day...

Cheers!!!


Thanks

Jinu

Wednesday, May 7, 2014

Page Corruption !!!

Yeah, corruption. A term we don't like to hear in a production environment. Unfortunately, I not only heard it but faced it. I would rather consider myself lucky to have worked on a corruption issue.


It started with an email from the customer stating that one of their reports had failed with an error.

"jdbodbc.C5583 [Microsoft][SQL Server Native Client 10.0][SQL Server]Could not continue scan with NOLOCK due to data movement. - SQLSTATE: 37000"

It normally occurs in 2 scenarios.

1) While running a query with the NOLOCK hint, or when the transaction isolation level is READ UNCOMMITTED

2) Database corruption

I quickly verified that the transaction isolation level was READ COMMITTED. Then I decided to run the query (the query being used by the report), both with and without the NOLOCK hint, and I got the following errors.
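(A sketch of how the session's isolation level can be verified; the session id here is just an example:)

SELECT session_id, transaction_isolation_level  -- 1 = Read Uncommitted, 2 = Read Committed
FROM sys.dm_exec_sessions
WHERE session_id = 55;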

WITH NOLOCK Hint
==============

Error 601: Could not continue scan with NOLOCK due to data movement







WITHOUT  NOLOCK Hint
===============

Msg 824, Level 24, State 2







It was a bit hard to digest that I was dealing with a corruption issue, and I understood that the coming days were going to be busy for me. But I was ready for that. We arranged a call with the customer, explained the seriousness of the issue, and stressed the urgency of running DBCC CHECKDB as early as possible, without which it would be difficult to understand the depth of the corruption and the number of objects (tables/pages) affected.

Then came the question from the customer regarding the time frame required for running CHECKDB.

Any Ideas ???

There are a lot of factors it depends on. I would like to summarize a few in the simplest way I can.
  •  Size of the database
  •  I/O load on the server and its throughput capability
  •  Concurrent activity on the CPUs
  •  Concurrent updates to the database (ideally run in single user mode)
  •  Number of CPUs and the tempdb configuration
  •  Complexity of the database schema
  •  Number and type of corruptions
 Hence I would simply say the best way to estimate the time required for CHECKDB is to run it. But we can estimate based on some reference points, so I would like to provide my environment specifications, which may help you arrive at an estimate.

My environment configurations are as follows.

Size of Database    :- 1 TB
CPUs                :- 2
Corruption Details  :- UNKNOWN
DB Schema           :- 1 Clustered Index



To estimate the TEMPDB space requirements, I ran the DBCC command as mentioned below.

DBCC CHECKDB('DBname') WITH ESTIMATEONLY

But I would say this is a very wrong estimate, so readers, please beware of it.


It said a maximum of only 2GB was required for the operation, but in reality it was closer to 20-22 GB for my 1TB database. Later I found from Paul Randal's blog that it's a bug.


We first ran CHECKDB on QA to estimate the time required, and based on that we ran it on production after completing pre-checks such as free space on the drive where tempdb resides. It ran for 10 hours and returned 12 errors, 2 of them against system objects.

To summarize, the ETA for DBCC CHECKDB on a 1 TB database with 2 CPUs was approximately 10-11 hours.


Results of CHECKDB
================

"Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692566) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692567) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.


DBCC results for 'table_name'.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264522) is missing a reference from previous page (1:78692565). Possible chain linkage problem.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264524) is missing a reference from previous page (1:78692567). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692565) could not be processed.  See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data), page (1:78692565). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29493257 and -1.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692565) was not seen in the scan although its parent (1:78137174) and previous (1:78264521) refer to it. Check any previous errors.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692566) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692566) was not seen in the scan although its parent (1:78137174) and previous (1:78264522) refer to it. Check any previous errors.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692567) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692567) was not seen in the scan although its parent (1:78137174) and previous (1:78264523) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:82809613) is missing a reference from previous page (1:78692566). Possible chain linkage problem.
There are 50427099 rows in 2393300 pages for object "Table_name".
CHECKDB found 0 allocation errors and 10 consistency errors in table 'Table_name' (object ID 2098926649).


CHECKDB found 0 allocation errors and 12 consistency errors in database 'DB_name'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (DB_Name).


So we were in the middle of a system object corruption (which cannot be repaired) and started preparing recovery plans and procedures.

 Plan 1
========

The basic plan was restoring from backups. But it was a very long process, as we had to start with a good clean full backup followed by applying the T-logs. We started the restore activity and ran into the issue I explained in one of my previous posts (VM backup and LSN chain break). So Plan 1 literally failed.

Plan 2
=========

Create an SSIS package with the "Transfer SQL Server Objects" task and transfer everything except the corrupted table. For the corrupted table, manually traverse it with TOP N queries in ascending and descending order on the clustered index key and thus retrieve the maximum number of rows (a sketch of this follows below). We tested the first part of Plan 2 on QA. It took around 66 hours, which was a NO GO from the client side, as it would require that much downtime in production.
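A minimal sketch of that manual traversal idea, assuming a hypothetical table and clustered key; the TOP values would be tuned until the scans run up against the corrupted range:

-- Rows from the beginning of the clustered index, stopping before the corrupt region
SELECT TOP (40000000) *
INTO dbo.MyTable_Salvage_Asc
FROM dbo.MyTable
ORDER BY ClusteredKeyCol ASC;

-- Rows from the end of the clustered index, walking back towards the corrupt region
SELECT TOP (10000000) *
INTO dbo.MyTable_Salvage_Desc
FROM dbo.MyTable
ORDER BY ClusteredKeyCol DESC;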

Also, since the system objects had corruption as well, Microsoft suggested the same approach when we raised a case for this issue. But either way, some loss of data was certain.

Since some data loss was certain anyway, I started thinking about doing a repair using CHECKDB. That thought led me to Plan 3.

Plan 3
========

Take the latest backup of the production database and restore it with the CONTINUE_AFTER_ERROR option, as there is a chance the restore may fail because of the corruption. Once it is restored successfully, take a row count of all tables. Put the DB into single user mode. Run the command

DBCC CHECKDB('database_name',REPAIR_ALLOW_DATA_LOSS)
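Putting the Plan 3 steps together, a sketch (the database name and backup path are placeholders):

-- Restore the latest backup even if damaged pages are encountered
RESTORE DATABASE Database_Name
FROM DISK = N'\\backupserver\share\Database_Name.bak'
WITH CONTINUE_AFTER_ERROR, REPLACE;

-- Block other connections before the repair
ALTER DATABASE Database_Name SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

DBCC CHECKDB('Database_Name', REPAIR_ALLOW_DATA_LOSS);

-- Back to normal access once the repair and row count checks are done
ALTER DATABASE Database_Name SET MULTI_USER;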

Once the repair is completed, verify the DBCC output, take a row count, and compare. Please find a few highlighted results of our CHECKDB repair...

Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692566) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
        The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692567) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
        The error has been repaired.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB fixed 0 allocation errors and 2 consistency errors not associated with any single object.

DBCC results for 'Table_Name'.
==========================

Repair: The Clustered index successfully rebuilt for the object "table_name" in database "Database_Name".
Repair: The page (1:78692565) has been deallocated from object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data).
Repair: The page (1:78692566) has been deallocated from object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data).
Repair: The page (1:78692567) has been deallocated from object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data).
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1 will be rebuilt.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264522) is missing a reference from previous page (1:78692565). Possible chain linkage problem.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264524) is missing a reference from previous page (1:78692567). Possible chain linkage problem.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692565) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data), page (1:78692565). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29362185 and -1.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692565) was not seen in the scan although its parent (1:78137174) and previous (1:78264521) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692566) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692566) was not seen in the scan although its parent (1:78137174) and previous (1:78264522) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692567) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692567) was not seen in the scan although its parent (1:78137174) and previous (1:78264523) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:82809613) is missing a reference from previous page (1:78692566). Possible chain linkage problem.
        The error has been repaired.

There are 50932994 rows in 2417291 pages for object "table_name".
CHECKDB found 0 allocation errors and 10 consistency errors in table 'table_name' (object ID 2098926649).
CHECKDB fixed 0 allocation errors and 10 consistency errors in table 'table_name' (object ID 2098926649).

Final Comment
=============
CHECKDB found 0 allocation errors and 12 consistency errors in database 'DB Name'.
CHECKDB fixed 0 allocation errors and 12 consistency errors in database 'DB Name'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


After the row comparison, only 22 rows were missing, and the customer was quite happy and OK with that result.
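For the row counts used in that comparison, a sketch based on sys.partitions (approximate, but good enough for a before/after check):

SELECT o.name AS table_name, SUM(p.rows) AS row_count
FROM sys.partitions AS p
JOIN sys.objects AS o ON o.object_id = p.object_id
WHERE p.index_id IN (0, 1)  -- heap or clustered index only
  AND o.type = 'U'          -- user tables
GROUP BY o.name
ORDER BY o.name;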

Note
=====
We tested this on QA with a backup of Production before implementing it on Production.


Thus the issue was resolved with a simple plan. I spent 40 days on this issue... I hope this post helps at least some DBAs avoid spending that much time, since I have already laid out the 3 plans here.

Enjoy and have a gr8 Day ahead !!!


Cheers


Jinu

Wednesday, April 2, 2014

LSN Chain Break and VM Backup !!!

Today I want to discuss a very serious threat to production databases which can eventually cause serious data loss.

While I was working on a production issue, I had to restore the database to a point in time as part of the recovery process, and I started with the available backups, beginning with the FULL backup. Once that completed, I moved on to the T-log backups. After restoring a few, I encountered an LSN mismatch error. Since we use TDP backups, I was restoring in the order the backups appeared in the interface, so I was wondering how we could have missed one in between.

I started searching for the missing backup file. After investigation, I understood that intermittent LOG backups were happening to a device named 'NUL'. I verified in MSDB as well as the SQL error log that backups had occurred in between the TDP schedule.



We confirmed with the Wintel/VMware teams that VM disk snapshots were happening for the server (VM). The tricky part is that the timestamps of the VM snapshots matched the timestamps we saw in MSDB and in the SQL Server error logs for the 'NUL' backups.


Later, after discussions with the VM team and a lot of searching on Google, we came to know that there is an option in the VM backup tool for truncating transaction logs. In our case it was the Veeam backup tool; other tools such as Hyper-V backup, vRanger, etc. have similar options.

During a VM snapshot there is an option to clear the log, as shown below. By default the option to truncate logs is enabled [first option].




The way that these tools clear the transaction log during a VM snapshot backup is by issuing a subsequent command to the SQL Server,

BACKUP LOG <database_name> TO DISK = 'NUL'

which dumps the transaction log records to nowhere, essentially throwing them away completely and thus breaking the log backup chain. In other words, when you issue the above command no backup file is generated; the log is simply truncated, which in turn makes our day horrible. :(

Solution 1
=========
If your server is in a VM environment and you have SQL native backups enabled, then you have to make sure this option in the VM backup tool is disabled. If you want to do a quick check from the SQL Server side, use the query below against MSDB.

SELECT b.physical_device_name, a.backup_start_date, a.first_lsn, a.user_name, a.* FROM msdb..backupset a
INNER JOIN msdb..backupmediafamily b ON a.media_set_id = b.media_set_id
WHERE a.type = 'L'
ORDER BY a.backup_finish_date DESC


Solution 2
=========

Consider a situation where you cannot change this option, or are not allowed to. In that case, verify the VM snapshot schedule with the VM team and schedule/initiate a full backup after the VM backup so that the LSN chain is rebuilt.
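A sketch of re-establishing the chain after the VM snapshot window (the database name and path are placeholders):

-- A fresh full backup gives a new base to restore from, so the discarded log portion is no longer needed
BACKUP DATABASE MyDatabase
TO DISK = N'\\backupserver\share\MyDatabase_full.bak'
WITH INIT, CHECKSUM;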


I hope this was new information to a few people, as it was to me. Have a quick check. Better late than never :)


Thanks
Jinu Varghese