Category Archives: Uncategorized

How I Learned to Stop Worrying and Add The Nonclustered Index

In my post about transactional replication we configured the dbo.Badges table for replication. Recall that this is a table that lives in the StackOverflow database.   Let’s pretend that over time the production table we replicated has grown to billions of rows, and our reporting relies on hitting the dbo.Badges subscriber table in the SO_small database.  In our scenario, as is often the case IRL, our production table is clustered on the “Id” field, and we are not allowed to add any additional indexes in production to support our query. But our query results are returning slower than molasses running up hill in winter because we are querying on the Date column, and there is no index present for this.

What we ARE allowed to do is add indexes to our replicated tables in the data warehouse.  What are the impacts of adding a nonclustered index on the “Date” column to one of our subscriber tables? The plan is to add a nonclustered index on SO_small.dbo.Badges as a test. We are interested in:

  • What is the impact to transaction log growth?
  • Do we have to reinitialize our subscription?

My initial concern was how the transaction log on production (StackOverflow) might grow based on having to store transactions while replication is paused. Further research showed this concern was unfounded. As it turns out, the Log Reader Agent continues to monitor the transaction log of our StackOverflow database, and loads any relevant transactions into the distribution database. Recall we called the distribution database “distDB” in our initial setup from the last post.  In the distribution database, the commands are queued until the Distribution Agent can copy them to the subscribers.  In our case, the transactions can be up to 72 hours old before they would expire and cause the need to re-snapshot our data.

A demo will be instructive. We will use the StackOverflow and SO_small databases from the last blog post on Transactional Replication.  Let’s check our database file sizes before we do this:


USE SO_small;
SELECT [file_id],
   [type],
   substring([physical_name],1,1) as DriveLetter,
   [name] LogicalFileName,
   [physical_name] PhysicalFileName,
   CAST([size] as DECIMAL(38,0))/128. as SizeMB, 
   CAST(FILEPROPERTY([name],'SpaceUsed') AS DECIMAL(38,0))/128. AS SpaceUsedMB, 
   (CAST([size] as DECIMAL(38,0))/128) - (CAST(FILEPROPERTY([name],'SpaceUsed') AS DECIMAL(38,0))/128.) FreeSpaceMB,
   (CAST([max_size] as DECIMAL(38,0))/128) as MaxSizeMB,
   [is_percent_growth] as [Is%Grow],
   [growth],
   GETDATE()
FROM [sys].[database_files];

Filespace Before Adding Our Nonclustered Index

StackOverflow Database:

image

SO_small database:

image

To make this all a bit more realistic, I’ve started running a load against the StackOverflow database. The tool I use for this is SQLQueryStress. You can visit https://github.com/ErikEJ/SqlQueryStress for documentation and updates. The .sql script I have modified for generating queries/inserts/updates/deletes is available at BrentOzar.com.

Let’s go ahead and pause replication while we run the index build. Bring up the Replication Monitor, click the “W2K12-SQL” node, and go the the Subscription Watch window.  Right click the subscription:

image

Choose “Stop Synchronizing” and the status will change to “Not running; performance critical”.

Now let’s create our index. We are choosing to do it OFFLINE, in hopes that the index build will run faster. In my system at work, this made a huge difference in testing:


USE SO_small
GO
  CREATE NONCLUSTERED INDEX idx_NC_Badges_Date
  ON dbo.Badges ([Date])



WITH (PAD_INDEX = ON, 
      STATISTICS_NORECOMPUTE = OFF, 
      SORT_IN_TEMPDB = ON, 
      DROP_EXISTING = OFF, 
      ONLINE = OFF, 
      ALLOW_ROW_LOCKS = ON, 
      ALLOW_PAGE_LOCKS = ON, 
      FILLFACTOR = 90, 
      MAXDOP = 4)
  GO


Now let’s take a look at replication, particularly our Undistributed commands shown in the Subscription Watch window.

image

You can see that a number of commands stacked up in the distributor database while we had replication synchronization paused for the index build.  No worries though! We simply bring up the Replication Monitor, click the “W2K12-SQL” node, right-click and select “Start Synchronizing”.

image

This will enable the Distributor to synchronize the commands waiting in the distDB to the SO_small database.  Basically the Log Reader Agent continued to monitor the transaction log of our StackOverflow database even while replication was paused. When we start synchronizing the Distribution Agent can copy those stored commands to the subscriber. When the synchronization catches up, we’ll see the record counts in range between the dbo.Badges tables match once again. We have to add the appropriate filter to our query, as we created our dbo.Badges article with a filter (WHERE Date >= ‘20160101’)


USE SO_small;
GO



select count(*)
from dbo.Badges
WHERE Date >= '20160101';



USE StackOverflow;
GO



select count(*)
from dbo.Badges
WHERE Date >= '20160101';

image

A quick look at the difference in IO/time stats and plans for this query between StackOverflow and SO_small show the positive impact of our new index.  Notice the lower reads and overall execution time between StackOverflow and the SO_small queries. The query plans show that the StackOverflow query had to do a scan on the cluster (more expensive!) versus a much more selective seek on our new nonclustered index for the SO_small query:

StackOverflow

image

image

SO_small

image

image

As it turns out, we win! Our nonclustered index speeds up our query. We did NOT blow up transaction logs. And we did not have to re-snapshot our publication table and wait for that. Great job! I hope this helps you out. See you next time!

The Path to Enlightenment – SQLSkills style!

Too often in my career I have heard “SQL Server doesn’t scale as well as DBMS x” or “We don’t need a DBA. You just install SQL Server and it runs itself”. Both of these are misconceptions that are reinforced by a lack of people who truly understand the deep magic that is SQL Server internals. My goal is to be a better evangelist for SQL Server, shedding light on the true power of our favorite DBMS. I want to better grok both the “Hows” and the “Whys” of SQL Server performance optimization, apply this knowledge in my daily work, and share this knowledge via my blog and presentations to SQL Server users groups in California (and maybe beyond!)

To that end, I am entering the contest to win a seat in the “IEPTO1: Performance Tuning and Optimization – Part 1” class offered by Paul Randal and Kimberly Tripp at SQLskills.com. The folks at SQLSkills.com are giving away two seats… one each for their Level 1 and Level 2 performance classes. I see these classes as the gold standard in SQL Server training, and plan to attend someday by hook or by crook. The class would be a huge kickstart for query and performance tuning efforts in my current job.

In my work in Business Intelligence I am particularly interested in improving analytic queries against large (100’s of millions to billions of row) tables. I currently deal/struggle with queries against some pretty giant tables on a daily basis. I have found some strategies for choosing good indexes, tweaking storage/partitioning, and getting better stats in place. However, the depth and breadth of Paul and Kim’s class would unlock a treasure trove of knowledge and best practices around indexing and tuning strategies. I am excited by the prospect of learning to avoid performance killing problems in the first place, by implementing proper design and implementation of basic database structures. Training with two of the top minds in our SQL Server community would give me a huge boost in confidence to push forward on the speaking front, and provide acres of soil in which to grow ideas for future blog posts.

If you can attend these classes, you will be a giant leap ahead in the path that is SQL Server enlightenment. This is a great opportunity and anyone who is a SQL Server user/enthusiast/fan should head on over to SQLSkills.com and read how to win!

Virtual logfile (VLF) hell…

At one point in my career I inherited a database that had suffered from neglect (to be kind).  The logfiles had be set to autogrow 1MB at a time, and this had happened ALOT! Interestingly this rather small database was taking a surprisingly long time to restore, Knowing that a high number (>1000) of VLFs can cause insert/update and restore performance penalties, I decided to check the VLF counts across the instance, since I had not done so to that point:


DECLARE @SQLtext varchar(500),
 @dbname varchar(500),
 @count int

SET NOCOUNT ON

DECLARE crsr CURSOR FAST_FORWARD READ_ONLY
FOR
SELECT name
FROM sys.databases

CREATE TABLE ##tblinfo
(
 dbname varchar(100),
 row_count int)

OPEN crsr

FETCH NEXT FROM crsr INTO @dbname

WHILE (@@fetch_status <> -1)
BEGIN

CREATE TABLE #log_info
(
 RecoveryUnitId tinyint,
 fileid tinyint,
 file_size bigint,
 start_offset bigint,
 FSeqNo int,
[status] tinyint,
 parity tinyint,
 create_lsn numeric(25,0)
)

SET @SQLtext = 'DBCC loginfo (' + '''' + @dbname + ''') '

INSERT INTO #log_info
EXEC (@SQLText)

SET @count = @@rowcount

DROP TABLE #log_info

INSERT INTO ##tblinfo
VALUES(@dbname, @count)

FETCH NEXT FROM crsr INTO @dbname

END

CLOSE crsr
DEALLOCATE crsr

SELECT dbname,
 row_count
FROM ##tblinfo
ORDER BY dbname

DROP TABLE ##tblinfo

After running this, I was able to identify the problem database and resolve the issue by shrinking the current TLOG file and setting the starting size and autogrow to more appropriate sizes.

 

Just how big ARE those tables?

If you need to know the size of all your tables, here’s a simple query to get you where you need to go:

SELECT t.NAME AS TableName,
s.Name AS SchemaName,
p.rows AS RowCounts,
SUM(a.total_pages) * 8 AS TotalSpaceKB,
SUM(a.used_pages) * 8 AS UsedSpaceKB,
(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM sys.tables t
INNER JOIN sys.indexes i ON (t.OBJECT_ID = i.object_id)
INNER JOIN sys.partitions p ON (i.object_id = p.OBJECT_ID AND i.index_id = p.index_id)
INNER JOIN sys.allocation_units a ON (p.partition_id = a.container_id)
LEFT OUTER JOIN sys.schemas s ON (t.schema_id = s.schema_id)
WHERE t.NAME NOT LIKE 'dt%'
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY t.Name, s.Name, p.Rows
ORDER BY UsedSpaceKB desc