Tag Archives: performance

Can Table Partitioning Improve Query Performance?

At work we have tables containing billions of rows, and are hundreds of GB in size. The business intelligence queries we run against them in the reporting database are often focused around dates and not accountids / productids as they are in PROD.  For these large tables, it is sometimes useful to split the records into multiple storage locations. Especially when most of these reporting tables live on relatively slow spinning disk (spinny drives in LUNs on SANs), there are benefits to be gained from getting the most frequently used / most recent data onto SSD. Enter table partitioning!

The idea behind partitioning is that the rows in our tables live in storage units called partitions. By default every table has one partition. Since these partitions can be stored on separate Filegroups living on separate Files (.ndf), we gain I/O benefits when targeting the partitions with our most frequently used rows to SSD. In my testing the benefits exist for both querying and loading these large tables. Previously an enterprise only feature, table partitioning is now available in SQL Server 2016 SP1 standard edition, so hopefully you will be able to take advantage of this too!

image

To protect the names of the innocent, I decided to come up with a contrived example using the StackOverflow database (see useful links at the bottom for where to download) while I learned and documented the process. For the example in this post, we want to query a rather large table (6M+ rows) from our local StackOverflow database to get summary stats about the reputations of users who joined in 2015. Because the kind of analyses we frequently perform involve grouping our Users in cohorts by the dates they created their accounts, I decided to partition and cluster on CreationDate. The plan is to make these types of queries more performant and to help with data loading on the back-end ETL processing.

Example Query – BEFORE Table Partitioning

Here is our query of interest. We are seeking some descriptive stats about the cohort of people who joined in 2015.

/* -- Clean the buffers */
DBCC DROPCLEANBUFFERS 
GO 
SET STATISTICS IO, TIME ON;
GO
SELECT Top 1 AVG(Reputation) OVER () as AVG_Reputation
, MIN(Reputation) OVER () as MIN_Reputation
, MAX(Reputation) OVER () as MAX_Reputation
FROM StackOverflow.dbo.Users
WHERE CreationDate >= '2015-01-01' and CreationDate < '2016-01-01'

IO Stats: (parsed by http://statisticsparser.com/) – Before Table Partitioning

The output of the time and IO stats is not beautiful by default.  Fortunately Richie Rump at http://statisticsparser.com/ created a tool to make the presentation beautiful (or at least much more pleasant to look at!)

clip_image001[1]

Looks like a good long time to get back our results.

Partitioning the Table

Let’s verify the current partitioning our our Users table:

USE StackOverflow;
GO
SELECT o.name objectname,i.name indexname, partition_id, partition_number, [rows]
FROM sys.partitions p
INNER JOIN sys.objects o ON o.object_id=p.object_id
INNER JOIN sys.indexes i ON i.object_id=p.object_id and p.index_id=i.index_id
WHERE o.name = 'Users'

image

As expected, our Users table currently has but 1 partition.

Create a new filegroup and secondary data file (.ndf) that lives on flash storage:

This new filegroup is where our new partitions will live.

Create new filegroup

USE [master]
GO
ALTER DATABASE [StackOverflow] ADD FILEGROUP [SECUNDUS]
GO

Create new datafile

USE [master]
GO
ALTER DATABASE [StackOverflow] 
ADD FILE ( NAME = N'StackOverflow_Secundus', 
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA\StackOverflow_Secondus.ndf' , 
SIZE = 1024MB , FILEGROWTH = 500MB ) 
TO FILEGROUP [SECUNDUS]
GO

Step 1: Partition key choice

With our filegroups and new datafile created, let’s do this thing. First, which column to partition ON?  We need a single column (called the “partition key”) on which to partition our table. Let’s create a partition for soft archiving the “Users” table records older than 2015, using CreationDate as our partition key. For our example, this is a useful choice since the majority of our queries are interested in the data from the last couple years. Partitioning by CreationDate will allow for “partition elimination”, since our scenario queries against the Users table are usually filtered on CreationDate. Partition elimination will save time and I/O costs.

NOTE: Our database has no non-clustered indexes for this demo. The Users table is clustered on AccountId. In real life we would do some indexing on CreationDate for performance reasons, but for simplicity I have not done this so we focus strictly on the impact of table partitioning.

Step 2: Create Partition Function (the “HOW”):

We now specify how we want to use the partition key to implement the partition. We do this by defining the boundary of our partitions. We’ll have only 4 partitions in our example:

USE StackOverflow;
GO
CREATE PARTITION FUNCTION pf_UsersCreationDate (datetime) 
AS RANGE RIGHT FOR VALUES ('20150101', '20160101', '20170101'); 

image

Step 3: Create Partition Scheme (the “WHERE”):

USE StackOverflow;
GO
CREATE PARTITION SCHEME ps_UsersCreationDate 
AS PARTITION pf_UsersCreationDate TO ([PRIMARY],[SECUNDUS],[SECUNDUS],[SECUNDUS]) 
GO

So what did we actually do? We told SQL Server HOW to divide up our data and WHERE to store it. We are setting all of our partitions Let’s see where we’re at:

SELECT ps.name,pf.name,boundary_id,value
FROM sys.partition_schemes ps
INNER JOIN sys.partition_functions pf ON pf.function_id=ps.function_id
INNER JOIN sys.partition_range_values prf ON pf.function_id=prf.function_id

clip_image001

But is OUR table, the existing Users table, now partitioned?

SELECT o.name objectname,i.name indexname, partition_id, partition_number, [rows]
FROM sys.partitions p
INNER JOIN sys.objects o ON o.object_id=p.object_id
INNER JOIN sys.indexes i ON i.object_id=p.object_id and p.index_id=i.index_id
WHERE o.name = 'Users'

clip_image002

Not so much. Still shows only 1 partition. Since it is an existing table, we have to either recreate the Users table and reload the data, or we can drop and rebuild the clustered index. The Users table is currently clustered on “Id” column.

Let’s check out our current indexing on dbo.Users before going forward. Brent Ozar’s sp_BlitzIndex to the rescue!


EXEC sp_BlitzIndex
     @DatabaseName = 'StackOverflow',
     @SchemaName = 'dbo',
     @TableName = 'Users'

This shows we currently have a clustered primary key on the Id field.

image

No time like the present… let’s get our table in partitioned!


-- Drop the existing primary key constraint.
USE StackOverflow;
GO
ALTER TABLE dbo.Users DROP CONSTRAINT PK_Users_Id;
GO


-- Add new non-clustered primary key constraint on Id
ALTER TABLE dbo.Users ADD CONSTRAINT PK_USERS_ID PRIMARY KEY NONCLUSTERED (Id)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO


-- Create new clustered index using our partition key as the cluster key
CREATE CLUSTERED INDEX idx_Users_CreationDate ON dbo.Users (CreationDate)
WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, 
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) 
ON ps_UsersCreationDate(CreationDate)
GO

And now we can see that our table is using the partitions we created:

SELECT o.name objectname,i.name indexname, partition_id, partition_number, [rows]
FROM sys.partitions p
INNER JOIN sys.objects o ON o.object_id=p.object_id
INNER JOIN sys.indexes i ON i.object_id=p.object_id and p.index_id=i.index_id
WHERE o.name = 'Users'

clip_image003

Let’s take another look at our indexes on dbo.Users using the same sp_BlitzIndex statement as above:

image

Our clustered index is now on CreationDate, with a UNIQUIFIER column added behind the scenes because our clustered index is not unique.  Notice our Primary Key on “Id” also has an additional “hidden” column of CreationDate plus the UNIQUIFIER.

Query AFTER table partitioning

Let’s run our original query against the Users table again now that partitioning is in place:

-- Clear the buffers
DBCC DROPCLEANBUFFERS 
GO 
-- turn on our statistics
SET STATISTICS IO, TIME ON;
GO
-- rerun our query of interest
SELECT Top 1 AVG(Reputation) OVER () as AVG_Reputation
, MIN(Reputation) OVER () as MIN_Reputation
, MAX(Reputation) OVER () as MAX_Reputation
FROM StackOverflow.dbo.Users
WHERE CreationDate >= '2015-01-01' and CreationDate < '2016-01-01'

Results:

clip_image001[1]

IO Stats: (parsed by http://statisticsparser.com/) – After Table Partitioning

clip_image001[3]

As one can see in the table below, our elapsed time was drastically reduced after partitioning. We effectively moving our data of interest onto the SSD storage, which helps IO efficiency:

image

The number of all read types was reduced. But was this performance increase due more to the table partitioning or to a choice of a clustering key more in line with our query patterns? Let’s take a look.

Clustering on CreationDate without partitioning

IO Stats Before changing clustering

As a reminder, here is the result of our IO and time stats when we ran the query against our Users table before partitioning and changing any indexes:

clip_image001[6]

After clustering the Users table on CreationDate

clip_image001[8]

In short, if you have a large table and a mix of slow and fast storage, table partitioning can be one tool for improving your query performance by taking advantage of that speedy SSD for your most commonly queried data. As noted in this demo, partitioning had less effect on query performance than just choosing a clustering key most appropriate for our particular query patterns. Depending on your environment you may not always be able to choose the clustering key you would most like to have.

Fortunately query speed isn’t the only benefit of table partitioning. Stay tuned for my next blog post about using table partitioning to help with ETL performance.

Useful links for further learning

I found the following resources to be invaluable as I set about diving into table partitioning:

Partitioning Basics:

https://www.cathrinewilhelmsen.net/2015/04/12/table-partitioning-in-sql-server/

Partitioning existing tables:

https://www.mssqltips.com/sqlservertip/2888/how-to-partition-an-existing-sql-server-table/

StackOverflow DB Download:

https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/

Statistics Parser by Richie Rump:

http://statisticsparser.com/