Thursday, June 12, 2014

Syspolicy_Purge_History Job failure !!!

Today I would like to share a job failure issue which I faced in my environment. The job name is syspolicy_purge_history.



What is this Job ?
================

SQL Server 2008 introduced a new feature called Policy-Based Management.
After installing SQL Server 2008 or upgrading to it, this job is created by default and is scheduled to run daily at 2:00 AM.


What it Does ?
===============

 






When a policy is defined and evaluated, the results are stored in MSDB in the tables below.
 
syspolicy_policy_execution_history_details_internal

syspolicy_policy_execution_history_internal

syspolicy_policy_category_subscriptions_internal
 
Over a period of time this data piles up and MSDB keeps growing. The job syspolicy_purge_history prunes the data older than the number of days defined in the HistoryRetentionInDays property of Policy Management.
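If you want to check the current retention setting from T-SQL, here is a minimal sketch; it assumes the msdb.dbo.syspolicy_configuration view that exposes the Policy-Based Management settings.

-- Current HistoryRetentionInDays value (Policy-Based Management configuration)
SELECT name, current_value
FROM msdb.dbo.syspolicy_configuration
WHERE name = N'HistoryRetentionInDays';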
 
 
 
If for any reason the job was not created, you can go ahead and run the statement below to create it.
 

           EXEC msdb.dbo.sp_syspolicy_create_purge_job
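You can also do a quick check whether the job already exists before creating it; a small sketch against the standard SQL Agent tables:

-- Check whether the purge job is present
SELECT name, enabled, date_created
FROM msdb.dbo.sysjobs
WHERE name = N'syspolicy_purge_history';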

Now we will come to our issue. When I checked the job history, the third step was failing with the reason below.
 
 
"Executed as user:  The process could not be created for step 3 of job 0x6C76502E1BAA0640B4EA991B90359043 (reason: The process cannot access the file because it is being used by another process).  The step failed."

I started troubleshooting on the PowerShell side, for example changing the execution policy, but that didn't help. I even tried stopping the antivirus, but the job still failed with the same error.

Later I came across a file in the root of the C:\ drive called 'Program'. It looked a little suspicious, and a few articles mention the confusion it causes in the OS and the problems it creates when starting several applications.

I tried renaming it but was unsuccessful; it threw an error that it was being used by another process. I started investigating and ran Process Explorer to find the details of the handle.




Once Process Explorer was open, I clicked Find Handle, which brought up a window to search for the file 'Program'. Once I found it in that search window, I selected it in the main window and right-clicked for the options below.



I chose Close Handle to remove the connection to the file 'Program'. Once that was done, I navigated to the root of C:\ and tried renaming the file again, and this time it succeeded.

Then I reran the job, and it completed successfully. HURRAY !!!

It was an excellent learning experience for me. I hope it was new information for at least a few DBAs.


Please post your comments or concerns...


Have a Great day...

Cheers!!!


Thanks

Jinu

Wednesday, May 7, 2014

Page Corruption !!!

Yes, corruption. A term we never want to hear in a production environment. Unfortunately I not only heard it, I faced it. I would rather say I was lucky enough to get to work on a corruption issue.


It started with an email from the customer stating that one of their reports had failed with an error.

"jdbodbc.C5583 [Microsoft][SQL Server Native Client 10.0][SQL Server]Could not continue scan with NOLOCK due to data movement. - SQLSTATE: 37000"

This error normally occurs in two scenarios:

1) Running a query with the NOLOCK hint, or with the transaction isolation level set to READ UNCOMMITTED

2) Database corruption

I quickly verified that the transaction isolation level was READ COMMITTED. Then I decided to run the query used by the report both with and without the NOLOCK hint, and I got the following errors.

WITH NOLOCK Hint
==============

Error 601: Could not continue scan with NOLOCK due to data movement







WITHOUT  NOLOCK Hint
===============

Msg 824, Level 24, State 2







It was a bit hard to digest that I was dealing with a corruption issue, and I understood that the coming days were going to be busy for me. But I was ready for that. We arranged a call with the customer, explained the seriousness of the issue, and stressed the urgency of running DBCC CHECKDB as early as possible; without it, it would be difficult to understand the depth of the damage and the number of objects (tables/pages) affected.

Then came the question from the customer: how long would CHECKDB take to run?

Any ideas ???

There are a lot of factors it depends on. I would like to summarize a few as simply as I can.
  •  Size of the database
  •  I/O load on the server and its throughput capability
  •  Concurrent activity on the CPUs
  •  Concurrent updates to the database (ideally run in single-user mode)
  •  Number of CPUs and the tempdb configuration
  •  Complexity of the database schema
  •  Number and type of corruptions
Hence I would simply say the best way to estimate the time required for CHECKDB is to run it. But we can make a rough estimate based on reference points, so I will share my environment specifications, which may help you come up with one.

My environment configurations are as follows.

Size of Database    :- 1 TB
CPUs                :- 2
Corruption Details  :- UNKNOWN
DB Schema           :- 1 Clustered Index



To estimate the TEMPDB space requirements, I ran the DBCC command below.

DBCC CHECKDB('DBname') WITH ESTIMATEONLY

But I would say that this is a very wrong estimate, so readers, please beware.


It reported a maximum of only about 2 GB required for the operation, but the reality was that it was close to 20-22 GB for my 1 TB database. Later I found from Paul Randal's blog that this is a bug.
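If you want to watch how much tempdb CHECKDB is really using while it runs, here is a rough sketch using the sys.dm_db_file_space_usage DMV; the internal objects it reports cover the worktables and sorts that CHECKDB relies on.

USE tempdb;
-- Space reserved by internal objects (worktables/sorts), converted from 8 KB pages to MB
SELECT SUM(internal_object_reserved_page_count) * 8 / 1024.0 AS internal_objects_mb
FROM sys.dm_db_file_space_usage;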


We first ran CHECKDB on QA to estimate the time required, and based on that we ran it on production after planning all the pre-checks, such as free space on the drive where tempdb resides. It ran for 10 hours and returned 12 errors, 2 of which were system object errors.

To summarise, the ETA for DBCC CHECKDB on a 1 TB database with 2 CPUs was approximately 10-11 hours in this environment.


Results of CHECKDB
================

"Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692566) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692567) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.


DBCC results for 'table_name'.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264522) is missing a reference from previous page (1:78692565). Possible chain linkage problem.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264524) is missing a reference from previous page (1:78692567). Possible chain linkage problem.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692565) could not be processed.  See other errors for details.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data), page (1:78692565). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29493257 and -1.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692565) was not seen in the scan although its parent (1:78137174) and previous (1:78264521) refer to it. Check any previous errors.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692566) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692566) was not seen in the scan although its parent (1:78137174) and previous (1:78264522) refer to it. Check any previous errors.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692567) could not be processed.  See other errors for details.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692567) was not seen in the scan although its parent (1:78137174) and previous (1:78264523) refer to it. Check any previous errors.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:82809613) is missing a reference from previous page (1:78692566). Possible chain linkage problem.
There are 50427099 rows in 2393300 pages for object "Table_name".
CHECKDB found 0 allocation errors and 10 consistency errors in table 'Table_name' (object ID 2098926649).


CHECKDB found 0 allocation errors and 12 consistency errors in database 'DB_name'.
repair_allow_data_loss is the minimum repair level for the errors found by DBCC CHECKDB (DB_Name).


So we were dealing with system object corruption (which cannot be repaired) and started preparing recovery plans and procedures.

 Plan 1
========

The basic plan set in place was to restore from backups. But it was a very long process, as we had to start with a good clean FULL backup followed by the application of T-logs. We started the restore activity and ran into the issue that I explained in one of my previous posts (VM backup and LSN chain break). So Plan 1 literally failed.
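For reference, the intended restore sequence looked roughly like the sketch below; the database name, paths and file names are placeholders, not our actual files.

-- Restore the last known good full backup without recovering yet
RESTORE DATABASE [DB_name]
FROM DISK = N'X:\Backups\DB_name_full.bak'
WITH NORECOVERY, REPLACE, STATS = 10;

-- Apply each transaction log backup in LSN order
RESTORE LOG [DB_name]
FROM DISK = N'X:\Backups\DB_name_log_01.trn'
WITH NORECOVERY;

-- ...continue with the remaining log backups...

-- Bring the database online
RESTORE DATABASE [DB_name] WITH RECOVERY;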

Plan 2
=========

Create an SSIS package with the Transfer SQL Server Objects task and transfer everything except the corrupted table. For the corrupted table, manually traverse it with TOP N in ascending and descending order by the clustered index key, and thus retrieve the maximum number of rows. We planned this activity on QA and ran the first part of Plan 2. It took around 66 hours, and it was a no-go from the client side as it would require that much downtime in production.
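A rough sketch of the salvage-by-clustered-key idea; the database, table and column names are hypothetical, the salvage table is assumed to already exist with the same structure, and in practice you keep adjusting TOP N until the scan hits the corrupt pages.

-- Salvage rows from the front of the table, stopping before the corrupt range
INSERT INTO salvage_db.dbo.table_name
SELECT TOP (1000000) *
FROM prod_db.dbo.table_name
ORDER BY clustered_key ASC;

-- Salvage rows from the tail of the table, working back toward the corrupt range
INSERT INTO salvage_db.dbo.table_name
SELECT TOP (1000000) *
FROM prod_db.dbo.table_name
ORDER BY clustered_key DESC;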

Since the system objects were also corrupted, Microsoft suggested the same approach when we raised a case for this issue. But either way, some loss of data was certain.

Since data loss was certain anyway, I started thinking about doing a repair using CHECKDB. This thought helped me create Plan 3.

Plan 3
========

Take the latest backup of the production database and restore it with the CONTINUE_AFTER_ERROR option, as there is a chance the restore may fail because of the corruption. Once it is restored successfully, take a row count of all tables, put the database into single-user mode, and run the command

DBCC CHECKDB('database_name',REPAIR_ALLOW_DATA_LOSS)
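For completeness, the surrounding steps looked roughly like this sketch; the database name and backup path are placeholders.

-- Restore the production backup even if corrupt pages are encountered
RESTORE DATABASE [database_name]
FROM DISK = N'X:\Backups\database_name.bak'
WITH CONTINUE_AFTER_ERROR, REPLACE, STATS = 10;

-- REPAIR_ALLOW_DATA_LOSS requires single-user mode
ALTER DATABASE [database_name] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

-- ...run the DBCC CHECKDB repair shown above...

ALTER DATABASE [database_name] SET MULTI_USER;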

Once the repair is completed, verify the DBCC log, take a row count again and compare. Please find a few highlighted results of our CHECKDB repair...

Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692566) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
        The error has been repaired.
Msg 8909, Level 16, State 1, Line 1
Table error: Object ID 0, index ID -1, partition ID 0, alloc unit ID 0 (type Unknown), page ID (1:78692567) contains an incorrect page ID in its page header. The PageId in the page header = (0:0).
        The error has been repaired.
CHECKDB found 0 allocation errors and 2 consistency errors not associated with any single object.
CHECKDB fixed 0 allocation errors and 2 consistency errors not associated with any single object.

DBCC results for 'Table_Name'.
==========================

Repair: The Clustered index successfully rebuilt for the object "table_name" in database "Database_Name".
Repair: The page (1:78692565) has been deallocated from object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data).
Repair: The page (1:78692566) has been deallocated from object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data).
Repair: The page (1:78692567) has been deallocated from object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data).
Msg 8945, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1 will be rebuilt.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264522) is missing a reference from previous page (1:78692565). Possible chain linkage problem.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78264524) is missing a reference from previous page (1:78692567). Possible chain linkage problem.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692565) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8939, Level 16, State 98, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data), page (1:78692565). Test (IS_OFF (BUF_IOERR, pBUF->bstat)) failed. Values are 29362185 and -1.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692565) was not seen in the scan although its parent (1:78137174) and previous (1:78264521) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692566) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692566) was not seen in the scan although its parent (1:78137174) and previous (1:78264522) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8928, Level 16, State 1, Line 1
Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data): Page (1:78692567) could not be processed.  See other errors for details.
        The error has been repaired.
Msg 8976, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:78692567) was not seen in the scan although its parent (1:78137174) and previous (1:78264523) refer to it. Check any previous errors.
        The error has been repaired.
Msg 8978, Level 16, State 1, Line 1
Table error: Object ID 2098926649, index ID 1, partition ID 419030233579520, alloc unit ID 419030233579520 (type In-row data). Page (1:82809613) is missing a reference from previous page (1:78692566). Possible chain linkage problem.
        The error has been repaired.

There are 50932994 rows in 2417291 pages for object "table_name".
CHECKDB found 0 allocation errors and 10 consistency errors in table 'table_name' (object ID 2098926649).
CHECKDB fixed 0 allocation errors and 10 consistency errors in table 'table_name' (object ID 2098926649).

Final Comment
=============
CHECKDB found 0 allocation errors and 12 consistency errors in database 'DB Name'.
CHECKDB fixed 0 allocation errors and 12 consistency errors in database 'DB Name'.
DBCC execution completed. If DBCC printed error messages, contact your system administrator.


After the row comparison, only 22 rows were missing, and the customer was very happy and was OK with that result.
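For the row comparison itself, an approximate per-table count can be pulled from the partition metadata; this is a sketch of the kind of query you could use, not necessarily what we ran.

-- Approximate row counts per table from metadata (fast, no table scans)
SELECT s.name AS schema_name, t.name AS table_name, SUM(p.rows) AS row_count
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.partitions AS p ON p.object_id = t.object_id
                        AND p.index_id IN (0, 1)  -- heap or clustered index only
GROUP BY s.name, t.name
ORDER BY row_count DESC;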

Note
=====
We tested this on QA with a backup of Production before implementing it on Production.


Thus the issue was resolved with a simple plan. I spent 40 days on this issue... I hope this post helps at least some DBAs avoid spending that kind of time, since I have already laid out the 3 plans here.

Enjoy and have a gr8 Day ahead !!!


Cheers


Jinu

Wednesday, April 2, 2014

LSN Chain Break and VM Backup !!!

Today I want to discuss a very serious threat to production databases, one that can eventually cause serious data loss.

While I was working on a production issue, I had to restore the database to a point in time as part of the recovery process, and I started with the available backups, beginning with the FULL backup. Once that completed, I moved on to the T-log backups. After restoring a few, I hit an LSN mismatch error on the next restore. Since we use TDP backups, I was restoring in the order the backups appeared in the interface, so I was wondering how we could have missed one in between.

I started searching for the missing backup file. After investigating, I found that intermittent LOG backups were being taken to a device named 'NUL'. I verified in MSDB as well as in the SQL Server error log that backups had happened in between the TDP schedule.



We confirmed with the Wintel/VMware teams that VM disk snapshots were being taken for the server (a VM). The telling part was that the timestamps of the VM snapshots matched the timestamps we saw in MSDB and in the SQL Server error log for the 'NUL' backups.


Later, after discussions with the VM team and some searching on Google, we came to know that VM backup tools have an option for clearing the transaction log. In our case it was the Veeam backup tool; other tools such as Hyper-V backup, vRanger, etc. offer similar options.

During the VM snapshot there is an option to clear the log, as shown below. By default the option to truncate logs is enabled [first option].




The way these tools clear the transaction log during a VM snapshot backup is by issuing a subsequent command to the SQL Server,

BACKUP LOG <database_name> TO DISK = 'NUL'

which dumps the transaction log records to nowhere, essentially throwing them away completely and thus breaking the log backup chain. In other words, when this command is issued, no backup file is generated; the log is simply truncated, which in turn makes our day horrible. :(

Solution 1
=========
If your server is in a VM environment and SQL native backups are enabled, you have to make sure that this option is disabled in the VM backup tool. If you want to do a quick check from the SQL side, use the query below against MSDB.

SELECT b.physical_device_name, a.backup_start_date, a.first_lsn, a.user_name, a.* FROM msdb..backupset a
INNER JOIN msdb..backupmediafamily b ON a.media_set_id = b.media_set_id
WHERE a.type = 'L'
ORDER BY a.backup_finish_date DESC
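To narrow the check straight down to the log backups that went to the NUL device (i.e. the ones that were thrown away), a small variation of the same query:

-- Log backups written to the NUL device, most recent first
SELECT a.database_name, a.backup_start_date, a.first_lsn, a.last_lsn, b.physical_device_name
FROM msdb..backupset a
INNER JOIN msdb..backupmediafamily b ON a.media_set_id = b.media_set_id
WHERE a.type = 'L'
  AND b.physical_device_name = 'NUL'
ORDER BY a.backup_finish_date DESC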


Solution 2
=========

Consider a situation where you cannot change this option, or are not allowed to. In that case, verify the VM snapshot schedule with the VM team and schedule or initiate a FULL backup after the VM backup so that the LSN chain is re-established.
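A minimal sketch of that full backup; the database name and path are placeholders.

-- Full backup taken after the VM snapshot so you have a fresh restore starting point
BACKUP DATABASE [database_name]
TO DISK = N'X:\Backups\database_name_full.bak'
WITH INIT, STATS = 10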


I hope this is new information to at least a few people, as it was to me. Have a quick check. Better late than never :)


Thanks
Jinu Varghese

Monday, March 31, 2014

Select Query Eating the Disk Space !!!

Confusing ??? I had the same confusion till I understood exactly what was happening behind the scenes.

While I was working on a production server, as per a customer request, I was running a simple SELECT * from a table with some 3-4 million rows. The most interesting part is that after some time I got a call from the Windows (Wintel) team mentioning that free space on the C:\ drive had come down drastically, well below the standard threshold.
When I checked, it was true: free space on C:\ was going down. I started searching for a file on the C:\ drive that was growing, but was unable to find any. I was really confused, because I was thinking my SELECT query could only cause a problem if my TEMPDB files resided on C:\, but they were on another drive.
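To confirm where the tempdb files actually live, a quick check against the standard catalog view:

-- Location and size of the tempdb data and log files
SELECT name, physical_name, size * 8 / 1024 AS size_mb
FROM tempdb.sys.database_files;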


I started working on finding the culprit. I replicated the scenario in my own environment: I ran a SELECT statement against a big table and used Process Monitor, filtered on the SSMS process, to find the file name and location being written to.


So this was the culprit eating the drive space: the temporary file that SSMS writes under the user's TEMP folder on C:\ to buffer the results.

Solutions
============

1) Whenever you want to run SELECT statements that retrieve HUGE result sets, make sure you direct your output to a file that resides on a drive with enough free space. You can press Ctrl+Shift+F after selecting the query to activate this option, or you can simply click the highlighted option below before running the query.


                        





2) Change the default path of the TEMP/TMP environment variables to a different drive.





From a DBA point of view, it is usually best to go with the first method, as the second option involves additional risks and approvals in a production environment.


Hope this helps someone avoid unnecessary disk space issues...



Thanks
Jinu Varghese