Category Archives: DBA

Setup a Simple Transactional Replication lab

At work we lost our superstar DBA some months back, and though he is sorely missed, this has created opportunities for me to grow in my SQL Server knowledge. One of the key components of our Business Intelligence environment (my world!) is our replication between PROD and the Data Warehouse. Because this replication is critical to our reporting, not to mention my livelihood, I have been digging into how transactional replication works. So the first step on this journey is to get a playground set up on my personal laptop and let the learning begin. I’m glad you are here to follow along.

We are going to replicate the dbo.Badges table from the StackOverflow database to a small database called SO_small, using transactional replication in SQL Server 2016.


Setup the Distributor and Publisher

For the lab, we will set up all replication components on the same instance. In particular we’ll run a local distributor, which, as I found out, is different from most production environments. However, the constraints of my home lab dictate this setup.

All our configuration can be done with the wizards. First, right-click the “Replication” folder and choose “Configure Distribution”:


Now you’ll get the “distributor” screen of the wizard. As mentioned above, choose to have all replication components on the same instance for our walk through:


We’ll choose to have our SQL Server Agent start automatically.


We now need to create the snapshot folder for the initial seeding of the table:

Create folder on DB server

We need a folder to hold the snapshot that will be created.  I called mine “sql_replic_share”. Super creative!


For purposes of the demo, just grant folder permissions to the “Authenticated Users” group to get going. NOT RECOMMENDED IN PRODUCTION!


Set up folder sharing on sql_replic_share. Again, for the demo, grant “Everyone” Read permission. NOT RECOMMENDED FOR PRODUCTION!


Now populate “Snapshot Folder” with the correct path to the folder that will hold our initial snapshot and continue:


Configure the Distribution DB:

A distribution database will be created as part of this process and will be visible in the System databases folder. Be sure to choose the location that has the storage you are wanting to use:


As noted, this will create a System Database called “distDB”


Configure the publishers. The local instance is auto-filled since we are setting up the Distributor locally:


Let SQL Server do the distribution configuration:


Verify configuration. If it all looks good, hit Finish:


The wizard should finish successfully:

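If you prefer scripts to wizards, the distributor setup can also be done with the replication stored procedures. This is only a rough sketch under this lab’s assumptions (local distributor, a distribution database named distDB, the sql_replic_share snapshot share); the password, share path, and defaults are placeholders to adjust for your own instance:

-- Rough T-SQL equivalent of the Configure Distribution wizard (adjust names/paths)
USE master;
GO
DECLARE @me sysname = @@SERVERNAME;

-- Make this instance its own Distributor
EXEC sp_adddistributor @distributor = @me, @password = N'StrongPasswordHere';

-- Create the distribution database (ours was named distDB)
EXEC sp_adddistributiondb @database = N'distDB', @security_mode = 1;

-- Register this instance as a Publisher using distDB,
-- with the snapshot working directory pointed at our share (path assumed)
EXEC sp_adddistpublisher
      @publisher = @me
    , @distribution_db = N'distDB'
    , @working_directory = N'\\MYLAPTOP\sql_replic_share'
    , @security_mode = 1;
GO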

 

Setup Publication

We want to replicate the dbo.Badges table FROM StackOverflow to the SO_small database on the same SQL Server instance.

Right click the Local Publications folder under the Replication folder, and choose New Publication:


We want to publish dbo.Badges from the StackOverflow database, so that is our Publication database for this dialog:


We are setting up Transactional Replication, so we will use that type for our Publication:


Our goal is to publish StackOverflow.dbo.Badges, so that is the article we will publish. We are taking all the columns, though we do not have to:


For kicks, we’ll add a filter to our article so that we only get badges from 1/1/2016 onward. In our case this will speed our queries and conserve space:


Our filter is now populated on the dialog:


Snapshot:

We are now at the step where the initial snapshot of our data will be created. We’ll create ours right away. This is the initial “seeding” of our replicated table in our subscriber database:


For our demo, we are going to run the Snapshot Agent under the SQL Server Agent service account, though usually you want to have a separate account to run each agent:


We’ll create our publication straight away, though you could also have SQL Server create a script for later use:


After hitting next, you’ll be presented with a chance to name your Publication. I chose something descriptive, including the fact that my publication contains a filter:


Hit finish, and if all goes well you’ll be presented with a dialog that reflects success:

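For reference, the wizard is essentially calling the replication stored procedures for us. Here is a rough T-SQL sketch of the same publication, assuming the 1/1/2016 filter is applied to the Badges [Date] column (adjust the filter clause to whatever column you actually filtered on); the wizard’s generated script will differ in the details:

USE StackOverflow;
GO
-- Enable the database for publishing
EXEC sp_replicationdboption @dbname = N'StackOverflow', @optname = N'publish', @value = N'true';
GO
-- Create the transactional publication
EXEC sp_addpublication
      @publication = N'BadgesFilteredPublication'
    , @description = N'Transactional publication of dbo.Badges, filtered to 2016+'
    , @repl_freq = N'continuous'
    , @status = N'active'
    , @independent_agent = N'true';
GO
-- Add a Snapshot Agent for this publication (runs under the SQL Agent account in this lab)
EXEC sp_addpublication_snapshot @publication = N'BadgesFilteredPublication';
GO
-- Add dbo.Badges as an article, with the row filter
EXEC sp_addarticle
      @publication = N'BadgesFilteredPublication'
    , @article = N'Badges'
    , @source_owner = N'dbo'
    , @source_object = N'Badges'
    , @type = N'logbased'
    , @filter_clause = N'[Date] >= ''2016-01-01''';
GO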

Setup the Subscription

Now we’ll set up our subscription. In most real-world environments the subscription would live on a different instance from the Publisher. But this is a demo… so let’s keep it easy.

Expand “Replication”, then right-click our new publication under the Local Publications folder and choose “New Subscriptions…”:


Since we have but one publication, it is the only one shown in the window. Hit Next to be taken to the Distribution Agent Location dialog:


We will be creating a “push subscription”, which is the default. Hit Next.


On the “Subscribers” page, we choose our server and the database that will be our subscription database (SO_small).

Now we come to the Distribution Agent Security window. The Distribution Agent is responsible for moving data from the distribution database to the subscriber database, so the account specified must be able to connect to both:


We’ll click the ellipsis (…) next to the Connection to Subscriber box, and then specify the account that will run the Distribution Agent process. Since this is non-production, we’ll go ahead with “Run under the SQL Server Agent service account”:


That takes us back to the security screen:


Next we choose the schedule on which the agent will run. We’ll choose “Run continuously”:


Now let’s initialize our subscription with a snapshot. Keep in mind this can take many hours/days depending on the size of the object being replicated. Our table is small, so it takes no time at all.  But a table at work took over a day, because it was rather wide and contained several billion rows. So beware of this.


We’ll let the wizard do its work, though you could also generate a script if so desired:

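If you want the scripted equivalent of the push subscription, it comes down to a couple of procedure calls on the publisher. A sketch under this lab’s assumptions (everything on the local instance, SO_small already created, Windows authentication):

USE StackOverflow;
GO
DECLARE @me sysname = @@SERVERNAME;

-- Register the subscription on the publisher
EXEC sp_addsubscription
      @publication = N'BadgesFilteredPublication'
    , @subscriber = @me
    , @destination_db = N'SO_small'
    , @subscription_type = N'Push'
    , @sync_type = N'automatic';

-- Create the Distribution Agent job for the push subscription,
-- connecting to the subscriber with Windows authentication
EXEC sp_addpushsubscription_agent
      @publication = N'BadgesFilteredPublication'
    , @subscriber = @me
    , @subscriber_db = N'SO_small'
    , @subscriber_security_mode = 1;
GO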

And this is what success looks like:


Or so it seems… however

Gotchas!

I ran into a problem after initially setting up my lab; you might find this useful. After waiting a while and looking at our SO_small database, no dbo.Badges table had shown up in our subscription database (SO_small).


WTH? Fortunately we have a tool for monitoring replication that is accessible through SSMS. Not surprisingly, it is called Replication Monitor (who would have guessed?). Go to the Replication folder in SSMS, right-click, and choose “Launch Replication Monitor”:


We are presented with an application designed to help us figure out what is happening in replication at a given moment:


As we see, there is no shortage of red X’s, an almost universal sign of trouble. Let’s click on the “Agents” tab with the big red X.

Upon clicking the first error in the list on that tab, we see our problem has to do with access to the share we created to hold our snapshot:

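Replication Monitor is the friendliest way in, but the distribution database also records agent errors if you prefer T-SQL. A quick look, assuming the distDB name we chose earlier:

USE distDB;
GO
-- Most recent replication agent errors recorded by the Distributor
SELECT TOP (20) [time], source_name, error_text
FROM dbo.MSrepl_errors
ORDER BY [time] DESC;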

The first thing I’ll try is granting “Full Control” to the Everyone group on the share (again, not for production!).

After making the permissions change, right-click the line that shows the error and lists Snapshot Agent in the Job column, and choose “Start Agent”:


When all is well, our replication monitor view will be gleefully free of the red x’s:


And sure enough, our table is now replicated in the SO_small database!

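As a quick sanity check, we can compare row counts between the publisher and the subscriber. This assumes the publication filter is on the Badges [Date] column, matching the 1/1/2016 filter we set up above:

SELECT
    (SELECT COUNT(*) FROM StackOverflow.dbo.Badges
     WHERE [Date] >= '2016-01-01') AS publisher_rows,
    (SELECT COUNT(*) FROM SO_small.dbo.Badges) AS subscriber_rows;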

Add another article to an existing publication

The beauty of this is that it does not require another full snapshot, as long as you don’t change any other article’s properties in these steps. A mini-snapshot will be taken just for the object you are adding to replication in the current publication.

Right click the local publication we created above (BadgesFilteredPublication) and choose Properties:


On the Publication properties page, go to the Articles tab, uncheck “Show only checked articles in the list” and check the box for “Posts (dbo)” under the Objects to publish part of the dialog. Uncheck the “Body” column, as we don’t wish to replicate this NVARCHAR(MAX) column in our demo. Also, note we are leaving “Badges (dbo)” checked, as it is the existing article that is part of this publication. Hit “OK”.


At this point the snapshot agent will run, and generate the needed snapshot into the snapshot shared folder:


When the snapshot is applied, your replicated table will be available in the subscriber database:

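The same change can be scripted rather than clicked. Here is a rough sketch of the T-SQL equivalent, assuming the publication name we used above and dropping the Body column from the new Posts article:

USE StackOverflow;
GO
-- Add dbo.Posts as a new article in the existing publication
EXEC sp_addarticle
      @publication = N'BadgesFilteredPublication'
    , @article = N'Posts'
    , @source_owner = N'dbo'
    , @source_object = N'Posts'
    , @type = N'logbased';
GO
-- Exclude the NVARCHAR(MAX) Body column from the article
EXEC sp_articlecolumn
      @publication = N'BadgesFilteredPublication'
    , @article = N'Posts'
    , @column = N'Body'
    , @operation = N'drop';
GO
-- Refresh subscriptions and generate the (mini) snapshot for just the new article
EXEC sp_refreshsubscriptions @publication = N'BadgesFilteredPublication';
GO
EXEC sp_startpublication_snapshot @publication = N'BadgesFilteredPublication';
GO

One caveat: the mini-snapshot behavior generally requires the publication’s immediate_sync option to be off; if it is on, the Snapshot Agent will want to regenerate a snapshot for every article.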

At this point you have all the components of replication in place in your environment, and can see data flowing between the StackOverflow.dbo.Badges table and your SO_small.dbo.Badges table.  Next post we’ll put this to use! Hope to see you then.

 

Can Table Partitioning Improve Query Performance?

At work we have tables that contain billions of rows and are hundreds of GB in size. The business intelligence queries we run against them in the reporting database are often focused on dates rather than on AccountIds / ProductIds as they are in PROD. For these large tables, it is sometimes useful to split the records into multiple storage locations. Especially since most of these reporting tables live on relatively slow spinning disk (spinny drives in LUNs on SANs), there are benefits to be gained from getting the most frequently used / most recent data onto SSD. Enter table partitioning!

The idea behind partitioning is that the rows in our tables live in storage units called partitions. By default every table has one partition. Since these partitions can be stored on separate filegroups backed by separate data files (.ndf), we gain I/O benefits by placing the partitions holding our most frequently used rows on SSD. In my testing the benefits exist for both querying and loading these large tables. Previously an Enterprise-only feature, table partitioning is available in SQL Server 2016 SP1 Standard Edition, so hopefully you will be able to take advantage of this too!


To protect the names of the innocent, I decided to come up with a contrived example using the StackOverflow database (see useful links at the bottom for where to download) while I learned and documented the process. For the example in this post, we want to query a rather large table (6M+ rows) from our local StackOverflow database to get summary stats about the reputations of users who joined in 2015. Because the kind of analyses we frequently perform involve grouping our Users in cohorts by the dates they created their accounts, I decided to partition and cluster on CreationDate. The plan is to make these types of queries more performant and to help with data loading on the back-end ETL processing.

Example Query – BEFORE Table Partitioning

Here is our query of interest. We are seeking some descriptive stats about the cohort of people who joined in 2015.

/* -- Clean the buffers */
DBCC DROPCLEANBUFFERS 
GO 
SET STATISTICS IO, TIME ON;
GO
SELECT Top 1 AVG(Reputation) OVER () as AVG_Reputation
, MIN(Reputation) OVER () as MIN_Reputation
, MAX(Reputation) OVER () as MAX_Reputation
FROM StackOverflow.dbo.Users
WHERE CreationDate >= '2015-01-01' and CreationDate < '2016-01-01'

IO Stats: (parsed by http://statisticsparser.com/) – Before Table Partitioning

The output of the time and IO stats is not beautiful by default.  Fortunately Richie Rump at http://statisticsparser.com/ created a tool to make the presentation beautiful (or at least much more pleasant to look at!)


Looks like a good long time to get back our results.

Partitioning the Table

Let’s verify the current partitioning of our Users table:

USE StackOverflow;
GO
SELECT o.name objectname,i.name indexname, partition_id, partition_number, [rows]
FROM sys.partitions p
INNER JOIN sys.objects o ON o.object_id=p.object_id
INNER JOIN sys.indexes i ON i.object_id=p.object_id and p.index_id=i.index_id
WHERE o.name = 'Users'


As expected, our Users table currently has but 1 partition.

Create a new filegroup and secondary data file (.ndf) that lives on flash storage:

This new filegroup is where our new partitions will live.

Create new filegroup

USE [master]
GO
ALTER DATABASE [StackOverflow] ADD FILEGROUP [SECUNDUS]
GO

Create new datafile

USE [master]
GO
ALTER DATABASE [StackOverflow] 
ADD FILE ( NAME = N'StackOverflow_Secundus', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\StackOverflow_Secundus.ndf' , 
SIZE = 1024MB , FILEGROWTH = 500MB ) 
TO FILEGROUP [SECUNDUS]
GO

Step 1: Partition key choice

With our filegroups and new datafile created, let’s do this thing. First, which column to partition ON?  We need a single column (called the “partition key”) on which to partition our table. Let’s create a partition for soft archiving the “Users” table records older than 2015, using CreationDate as our partition key. For our example, this is a useful choice since the majority of our queries are interested in the data from the last couple years. Partitioning by CreationDate will allow for “partition elimination”, since our scenario queries against the Users table are usually filtered on CreationDate. Partition elimination will save time and I/O costs.
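
Before settling on boundary values, it is worth eyeballing how the rows are spread across years; a simple count by year against our Users table does the trick:

USE StackOverflow;
GO
-- How many users joined in each year? Helps choose sensible partition boundaries
SELECT YEAR(CreationDate) AS CreationYear, COUNT(*) AS UserCount
FROM dbo.Users
GROUP BY YEAR(CreationDate)
ORDER BY CreationYear;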

NOTE: Our database has no non-clustered indexes for this demo. The Users table is clustered on Id. In real life we would do some indexing on CreationDate for performance reasons, but for simplicity I have not done this so we can focus strictly on the impact of table partitioning.

Step 2: Create Partition Function (the “HOW”):

We now specify how we want to use the partition key to implement the partitioning. We do this by defining the boundaries of our partitions. We’ll have only 4 partitions in our example:

USE StackOverflow;
GO
CREATE PARTITION FUNCTION pf_UsersCreationDate (datetime) 
AS RANGE RIGHT FOR VALUES ('20150101', '20160101', '20170101'); 
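
A quick way to sanity-check the function is to ask which partition a given date would land in, using the $PARTITION function:

-- With RANGE RIGHT, each boundary value belongs to the partition on its right
SELECT
    $PARTITION.pf_UsersCreationDate('2014-06-15') AS pre_2015,      -- expect 1
    $PARTITION.pf_UsersCreationDate('2015-06-15') AS in_2015,       -- expect 2
    $PARTITION.pf_UsersCreationDate('2016-06-15') AS in_2016,       -- expect 3
    $PARTITION.pf_UsersCreationDate('2017-06-15') AS in_2017_plus;  -- expect 4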


Step 3: Create Partition Scheme (the “WHERE”):

USE StackOverflow;
GO
CREATE PARTITION SCHEME ps_UsersCreationDate 
AS PARTITION pf_UsersCreationDate TO ([PRIMARY],[SECUNDUS],[SECUNDUS],[SECUNDUS]) 
GO

So what did we actually do? We told SQL Server HOW to divide up our data and WHERE to store it: rows before 2015-01-01 stay on [PRIMARY], and the three partitions from 2015 onward go to the new [SECUNDUS] filegroup. Let’s see where we’re at:

SELECT ps.name,pf.name,boundary_id,value
FROM sys.partition_schemes ps
INNER JOIN sys.partition_functions pf ON pf.function_id=ps.function_id
INNER JOIN sys.partition_range_values prf ON pf.function_id=prf.function_id


But is OUR table, the existing Users table, now partitioned?

SELECT o.name objectname,i.name indexname, partition_id, partition_number, [rows]
FROM sys.partitions p
INNER JOIN sys.objects o ON o.object_id=p.object_id
INNER JOIN sys.indexes i ON i.object_id=p.object_id and p.index_id=i.index_id
WHERE o.name = 'Users'


Not so much; it still shows only 1 partition. Since it is an existing table, we have to either recreate the Users table and reload the data, or drop and rebuild the clustered index. The Users table is currently clustered on the “Id” column.

Let’s check out our current indexing on dbo.Users before going forward. Brent Ozar’s sp_BlitzIndex to the rescue!


EXEC sp_BlitzIndex
     @DatabaseName = 'StackOverflow',
     @SchemaName = 'dbo',
     @TableName = 'Users'

This shows we currently have a clustered primary key on the Id field.


No time like the present… let’s get our table partitioned!


-- Drop the existing primary key constraint.
USE StackOverflow;
GO
ALTER TABLE dbo.Users DROP CONSTRAINT PK_Users_Id;
GO


-- Add new non-clustered primary key constraint on Id
ALTER TABLE dbo.Users ADD CONSTRAINT PK_USERS_ID PRIMARY KEY NONCLUSTERED (Id)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO


-- Create new clustered index using our partition key as the cluster key
CREATE CLUSTERED INDEX idx_Users_CreationDate ON dbo.Users (CreationDate)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
ON ps_UsersCreationDate(CreationDate)
GO

And now we can see that our table is using the partitions we created:

SELECT o.name objectname,i.name indexname, partition_id, partition_number, [rows]
FROM sys.partitions p
INNER JOIN sys.objects o ON o.object_id=p.object_id
INNER JOIN sys.indexes i ON i.object_id=p.object_id and p.index_id=i.index_id
WHERE o.name = 'Users'


Let’s take another look at our indexes on dbo.Users using the same sp_BlitzIndex statement as above:


Our clustered index is now on CreationDate, with a UNIQUIFIER column added behind the scenes because our clustered index is not unique.  Notice our Primary Key on “Id” also has an additional “hidden” column of CreationDate plus the UNIQUIFIER.

Query AFTER table partitioning

Let’s run our original query against the Users table again now that partitioning is in place:

-- Clear the buffers
DBCC DROPCLEANBUFFERS 
GO 
-- turn on our statistics
SET STATISTICS IO, TIME ON;
GO
-- rerun our query of interest
SELECT Top 1 AVG(Reputation) OVER () as AVG_Reputation
, MIN(Reputation) OVER () as MIN_Reputation
, MAX(Reputation) OVER () as MAX_Reputation
FROM StackOverflow.dbo.Users
WHERE CreationDate >= '2015-01-01' and CreationDate < '2016-01-01'

Results:


IO Stats: (parsed by http://statisticsparser.com/) – After Table Partitioning


Comparing the before-and-after stats, our elapsed time was drastically reduced after partitioning. We effectively moved our data of interest onto SSD storage, which helps I/O efficiency.


The number of all read types was reduced. But was this performance increase due more to the table partitioning or to a choice of a clustering key more in line with our query patterns? Let’s take a look.

Clustering on CreationDate without partitioning

IO Stats Before changing clustering

As a reminder, here is the result of our IO and time stats when we ran the query against our Users table before partitioning and changing any indexes:


After clustering the Users table on CreationDate
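
For this comparison, the idea is simply to cluster the Users table on CreationDate without the partition scheme. A minimal sketch of one way to do that, rebuilding the clustered index back onto [PRIMARY]:

USE StackOverflow;
GO
-- Rebuild the clustered index on CreationDate, but on PRIMARY instead of the partition scheme
CREATE CLUSTERED INDEX idx_Users_CreationDate ON dbo.Users (CreationDate)
WITH (DROP_EXISTING = ON)
ON [PRIMARY];
GO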


In short, if you have a large table and a mix of slow and fast storage, table partitioning can be one tool for improving your query performance by taking advantage of that speedy SSD for your most commonly queried data. As noted in this demo, partitioning had less effect on query performance than just choosing a clustering key most appropriate for our particular query patterns. Depending on your environment you may not always be able to choose the clustering key you would most like to have.

Fortunately query speed isn’t the only benefit of table partitioning. Stay tuned for my next blog post about using table partitioning to help with ETL performance.

Useful links for further learning

I found the following resources to be invaluable as I set about diving into table partitioning:

Partitioning Basics:

https://www.cathrinewilhelmsen.net/2015/04/12/table-partitioning-in-sql-server/

Partitioning existing tables:

https://www.mssqltips.com/sqlservertip/2888/how-to-partition-an-existing-sql-server-table/

StackOverflow DB Download:

https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/

Statistics Parser by Richie Rump:

http://statisticsparser.com/

How much space will table compression save me?

At work we have a reasonably large table with an NVARCHAR(MAX) column that stores HTML and other user-generated content. This table is on a SQL Server 2014 instance, so it cannot take advantage of the newer compression features in SQL Server 2016.

Looking at the estimated benefits via query:

DECLARE @MyTableVar TABLE
(
  object_name NVARCHAR(50),
  schema_name NVARCHAR(50),
  index_id INT,
  partition_number INT,
  size_w_CURRENT_compression_KB INT,
  size_w_REQUESTED_compression_KB INT,
  sample_size_w_current_compression_KB INT,
  sample_size_w_requested_compression_KB INT
);

-- sp_estimate_data_compression_savings reports sizes in KB
INSERT INTO @MyTableVar
EXEC sp_estimate_data_compression_savings
    @schema_name = 'dbo'
  , @object_name = 'description_archive'
  , @index_id = NULL
  , @partition_number = NULL
  , @data_compression = 'ROW';

SELECT object_name
  , schema_name
  , index_id
  , partition_number
  , (size_w_CURRENT_compression_KB / 1024) AS size_w_CURRENT_compression_MB
  , (size_w_REQUESTED_compression_KB / 1024) AS size_w_REQUESTED_compression_MB
FROM @MyTableVar;


Well … bummer! Looks like no savings with ROW compression, and running the same query with PAGE compression did not look much better. Why? It seems like an NVARCHAR(MAX) column full of HTML would be a great candidate for compression.

The answer lies in the allocation units in use for this table.  Let’s see how the allocation works for our archive table:

SELECT OBJECT_NAME(sp.object_id) AS [ObjectName]
, si.name AS IndexName
, sps.in_row_data_page_count as In_Row
, sps.row_overflow_used_page_count AS Row_Over_Flow
,sps.lob_reserved_page_count AS LOB_Data
FROM sys.dm_db_partition_stats sps
JOIN sys.partitions sp ON (sps.partition_id=sp.partition_id)
JOIN sys.indexes si ON (sp.index_id=si.index_id AND sp.object_id = si.object_id)
WHERE OBJECTPROPERTY(sp.object_id,'IsUserTable') =1
AND OBJECT_NAME(sp.object_id) = 'description_archive'
order by sps.in_row_data_page_count desc

The results show that the majority of this table’s data is stored in LOB pages instead of in-row, because the “description” column is composed mostly of values larger than 8,000 bytes.

If the values in our NVARCHAR(MAX) column were 8,000 bytes or smaller, they would be stored in-row (provided we didn’t enable the “large value types out of row” table option), and we would then benefit from ROW and PAGE compression.
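
To see how much of the column would even be eligible, we can count how many values fit within the 8,000-byte in-row limit; a quick check against our archive table (column name assumed from above):

SELECT
    SUM(CASE WHEN DATALENGTH([description]) <= 8000 THEN 1 ELSE 0 END) AS fits_in_row,
    SUM(CASE WHEN DATALENGTH([description]) >  8000 THEN 1 ELSE 0 END) AS pushed_to_lob
FROM dbo.description_archive;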

Create an index on a table in Transactional Replication

Not long ago I took over a database that uses transactional replication, and not long after, I needed to add an index. So here is the syntax, along with a few notes I picked up while doing this:

CREATE NONCLUSTERED INDEX ListingBigBay_NC6f_NULL
ON [Bigtiva].[listing_BigBay] ([adid],[ListingDate])
INCLUDE ([AccountID])
WHERE (adid IS NULL)
WITH (PAD_INDEX = ON,
STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = ON,
DROP_EXISTING = OFF,
ONLINE = ON,
ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON,
FILLFACTOR = 90,
MAXDOP = 4)

GO

The important points are to use ONLINE = ON, which keeps the table available to queries and to replication while the index is built (note that online index operations are an Enterprise Edition feature), and, since our TEMPDB is on SSD, to go ahead and use SORT_IN_TEMPDB = ON.
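
One more gotcha worth remembering: CREATE INDEX statements are not among the schema changes transactional replication forwards to subscribers, so if queries on the subscriber need the same index, create it there yourself. A sketch, assuming the article keeps the same schema and table names on the subscriber side:

-- Run in the subscriber database (database name is hypothetical; adjust to yours)
USE SubscriberDB;
GO
CREATE NONCLUSTERED INDEX ListingBigBay_NC6f_NULL
ON [Bigtiva].[listing_BigBay] ([adid],[ListingDate])
INCLUDE ([AccountID])
WHERE (adid IS NULL)
WITH (ONLINE = ON, SORT_IN_TEMPDB = ON, FILLFACTOR = 90);
GO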