Columnstore Index Goodness

In case I needed a reminder of how cool columnstore indexes are for analytic queries, I made this example to jog my memory. I used the StackOverflow 180 GB sized database you can download from Brent Ozar’s site. I have it running in VMWare Workstation pro, but these days you can run your SQL Server learning lab any number of ways.

As a test, I created a “small” version of the dbo.Posts table, removing any of the string fields, since they are not relevant to the analytics I am doing in my made up world of magical data goodness. I then copied all the data from dbo.Posts to dbo.Posts_small table. I created the separate slimmed down version of Posts to simulate having a “reporting” version of the Production Posts table, as would happen IRL. Of course, as of SQL Server 2017, you could recreate the Posts table as a clustered columnstore and have the benefit of better analytics query performance on your OLTP table. But that idea makes me cringe, and want to hide under my covers. Reporting on PROD brings out the grouchy DBA in me!

Create columnstore version of Posts table

-- Create a thinner version of the Posts table for reporting, as a clustered columnstore
USE [StackOverflow]
GO

CREATE TABLE [dbo].[Posts_columnstore_small](
	[Id]			        [int]   NULL,
	[AcceptedAnswerId]		[int] NULL,
	[AnswerCount]			[int] NULL,
	[ClosedDate]			[datetime] NULL,
	[CommentCount]			[int] NULL,
	[CommunityOwnedDate]	[datetime] NULL,
	[CreationDate]			[datetime] NOT NULL,
	[FavoriteCount]			[int] NULL,
	[LastActivityDate]		[datetime] NOT NULL,
	[LastEditDate]			[datetime] NULL,
	[LastEditorUserId]		[int] NULL,
	[OwnerUserId]			[int] NULL,
	[ParentId]			    [int] NULL,
	[PostTypeId]			[int] NOT NULL,
	[Score]				    [int] NOT NULL,
	[ViewCount]			    [int] NOT NULL,
	INDEX ix_CS_cluster_postssmall CLUSTERED COLUMNSTORE)
 ON  [PRIMARY]
GO

-- Copy data from the relevant columns into my "small" columnstore table.

insert into [dbo].[Posts_columnstore_small]
select [Id]				
	   ,[AcceptedAnswerId]	
	   ,[AnswerCount]		
	   ,[ClosedDate]		
	   ,[CommentCount]		
	   ,[CommunityOwnedDate]
	   ,[CreationDate]		
	   ,[FavoriteCount]		
	   ,[LastActivityDate]	
	   ,[LastEditDate]		
	   ,[LastEditorUserId]	
	   ,[OwnerUserId]		
	   ,[ParentId]			
	   ,[PostTypeId]		
	   ,[Score]				
	   ,[ViewCount]		
from dbo.Posts

Now let’s create a slimmed down version of the dbo.Posts table, but this time keep it a rowstore table. I just want to be sure that the speed benefits we see in a bit are due to the columnstore index and not just to the absence of the text columns, so we will create two clones of dbo.Posts: one columnstore and one rowstore.

Create rowstore version of Posts table

--------------------------------------------------------------------------
-- Create rowstore version of the dbo.Posts table without the text columns
---------------------------------------------------------------------------
USE [StackOverflow]
GO


CREATE TABLE [dbo].[Posts_rowstore_small](
	[Id] [int] IDENTITY(1,1) NOT NULL,
	[AcceptedAnswerId] [int] NOT NULL,
	[AnswerCount] [int] NULL,
	[ClosedDate] [datetime] NULL,
	[CommentCount] [int] NULL,
	[CommunityOwnedDate] [datetime] NULL,
	[CreationDate] [datetime] NOT NULL,
	[FavoriteCount] [int] NULL,
	[LastActivityDate] [datetime] NOT NULL,
	[LastEditDate] [datetime] NULL,
	[LastEditorUserId] [int] NULL,
	[OwnerUserId] [int] NULL,
	[ParentId] [int] NULL,
	[PostTypeId] [int] NOT NULL,
	[Score] [int] NOT NULL,
	[ViewCount] [int] NOT NULL,
	CONSTRAINT [PK_PostsRSsmall__Id] PRIMARY KEY CLUSTERED 
(
	[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY] 
GO

-- now insert data from the selected columns from existing dbo.Posts table.
SET IDENTITY_INSERT [dbo].[Posts_rowstore_small] ON;  
GO 

insert into [dbo].[Posts_rowstore_small] (
[Id]				
	   ,[AcceptedAnswerId]	
	   ,[AnswerCount]		
	   ,[ClosedDate]		
	   ,[CommentCount]		
	   ,[CommunityOwnedDate]
	   ,[CreationDate]		
	   ,[FavoriteCount]		
	   ,[LastActivityDate]	
	   ,[LastEditDate]		
	   ,[LastEditorUserId]	
	   ,[OwnerUserId]		
	   ,[ParentId]			
	   ,[PostTypeId]		
	   ,[Score]				
	   ,[ViewCount]	
)
select [Id]				
	   ,[AcceptedAnswerId]	
	   ,[AnswerCount]		
	   ,[ClosedDate]		
	   ,[CommentCount]		
	   ,[CommunityOwnedDate]
	   ,[CreationDate]		
	   ,[FavoriteCount]		
	   ,[LastActivityDate]	
	   ,[LastEditDate]		
	   ,[LastEditorUserId]	
	   ,[OwnerUserId]		
	   ,[ParentId]			
	   ,[PostTypeId]		
	   ,[Score]				
	   ,[ViewCount]		
from dbo.Posts

-- Verify row counts

select count(*) from [dbo].[Posts_columnstore_small]
40700647

select count(*) from [dbo].[Posts_rowstore_small]
40700647

Now it’s time to run our queries. I made a super contrived analytic-type query, just doing some aggregations across the tables by OwnerUserId and the year that the question was created. Put the queries in two different windows, and be sure to set time and io statistics on. The runtimes aren’t that different, at least on my machine, but the IO is SIGNIFICANTLY less on the query run against the columnstore version of the Posts table:


SSMS Tab 1: query against the rowstore version of Posts table

set statistics io, time on

select OwnerUserId, datepart(YYYY, CreationDate) as QuestionDateYear
	, count(*) as QuestionCount
	, count(*) over () as totalQuestionCount
from [dbo].[Posts_rowstore_small] p
Group by p.OwnerUserId, datepart(YYYY, CreationDate)

SSMS Tab 2: query against the columnstore version of Posts table

set statistics io, time on

select [OwnerUserId], datepart(YYYY, CreationDate) as QuestionDateYear
	, count(*) as QuestionCount
	, count(*) over () as totalQuestionCount
from [dbo].[Posts_columnstore_small] p
GROUP BY p.OwnerUserId, datepart(YYYY, CreationDate)

Now if we look at the Messages tab of our results in SSMS, we can see the query stats (time and IO as we specified). To make our lives easier, copy the contents of the “messages” tab in SSMS for each query individually, and paste each over to the wonder Statistics Parser tool created by  Richie Rump (blog | twitter) .

The results for each query are shown below. First the rowstore version. Notice the huge number of Logical Reads!

Stats Results – rowstore query

Lot’s and lot’s of Logical Reads!!!!!!!

Stats Results – columnstore query

Far fewer Logical Reads

The query runtime is about half as long for the columnstore version as it was for the rowstore. And the I/Os are much lower with the columnstore variety. Next time we’ll talk about why!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.