Dedupe it or lose it!

The scenario is there should only be one AccountID/CreateDate combo in our table1. But some process ran and has duplicated some rows and we want to get rid of them. Dedupe time!

Here is an example of deduping:

-- Create our testing table

USE testdb

GO

CREATE TABLE table1 (ID INT IDENTITY(1,1) NOT NULL, AccountID INT NOT NULL, CreateDate DATETIME);

INSERT INTO table1 VALUES 
(12345, '2019-01-01 00:00'),
(12345, '2019-01-01 00:00'),
(12345, '2019-01-01 01:21:00'),
(12345,'2019-01-01 01:21:00'),
(12345,'2019-01-01 01:21:20'),
(12345,'2019-01-01 01:21:20'),
(12345,'2019-01-01 01:21:20'),
(23456, '2019-01-01 01:21:00'),
(23456,'2019-01-01 01:21:10'),
(23456,'2019-01-01 01:21:20'),
(23456,'2019-01-01 01:29:20'),
(23456,'2019-01-01 01:21:20'),
(29956,'2019-01-01 01:21:10'),
(29956,'2019-01-01 01:21:10'),
(29956,'2019-01-01 01:29:20'),
(29956,'2019-01-01 01:21:20'),
(29956,'2019-01-02 01:21:20'),
(29956,'2019-01-02 01:21:20'),
(29956,'2019-01-03 01:29:20');

Now let’s see what lives in our table1:

SELECT *
FROM table1;

ID          AccountID   CreateDate
----------- ----------- -----------------------
1           12345       2019-01-01 00:00:00.000
2           12345       2019-01-01 00:00:00.000
3           12345       2019-01-01 01:21:00.000
4           12345       2019-01-01 01:21:00.000
5           12345       2019-01-01 01:21:20.000
6           12345       2019-01-01 01:21:20.000
7           12345       2019-01-01 01:21:20.000
8           23456       2019-01-01 01:21:00.000
9           23456       2019-01-01 01:21:10.000
10          23456       2019-01-01 01:21:20.000
11          23456       2019-01-01 01:29:20.000
12          23456       2019-01-01 01:21:20.000
13          29956       2019-01-01 01:21:10.000
14          29956       2019-01-01 01:21:10.000
15          29956       2019-01-01 01:29:20.000
16          29956       2019-01-01 01:21:20.000
17          29956       2019-01-02 01:21:20.000
18          29956       2019-01-02 01:21:20.000
19          29956       2019-01-03 01:29:20.000

(19 rows affected)

Now let’s look at the Query to group the rows by AccountID and CreateDate. Remember our scenario is there should only be one AccountID/CreateDate combo. But some process duplicated some rows and we want to get rid of them. Notice we are not deleting, but only looking at the dupes:

WITH dedupe AS
(
	select accountid,
		CreateDate,
		ROW_NUMBER() OVER (partition by accountid, createdate order by createdate) as RN
	FROM table1
)

SELECT *
FROM dedupe

accountid   CreateDate              RN
----------- ----------------------- --------------------
12345       2019-01-01 00:00:00.000 1
12345       2019-01-01 00:00:00.000 2
12345       2019-01-01 01:21:00.000 1
12345       2019-01-01 01:21:00.000 2
12345       2019-01-01 01:21:20.000 1
12345       2019-01-01 01:21:20.000 2
12345       2019-01-01 01:21:20.000 3
23456       2019-01-01 01:21:00.000 1
23456       2019-01-01 01:21:10.000 1
23456       2019-01-01 01:21:20.000 1
23456       2019-01-01 01:21:20.000 2
23456       2019-01-01 01:29:20.000 1
29956       2019-01-01 01:21:10.000 1
29956       2019-01-01 01:21:10.000 2
29956       2019-01-01 01:21:20.000 1
29956       2019-01-01 01:29:20.000 1
29956       2019-01-02 01:21:20.000 1
29956       2019-01-02 01:21:20.000 2
29956       2019-01-03 01:29:20.000 1

(19 rows affected)

From the above we know any of the rows with an RN > 1 are duplicates that we can get rid of. So NOW we will run the delete. Note that the delete against our CTE (“dedupe”) will affect the underlying table (“table1”):

WITH dedupe AS
(
	SELECT accountid,
		CreateDate,
		ROW_NUMBER() OVER (PARTITION BY accountid, createdate ORDER BY createdate) as RN
	FROM table1
)

DELETE
FROM dedupe
WHERE RN > 1


Now let’s see what we are left with.

SELECT *
FROM table1;

ID          AccountID   CreateDate
----------- ----------- -----------------------
1           12345       2019-01-01 00:00:00.000
3           12345       2019-01-01 01:21:00.000
5           12345       2019-01-01 01:21:20.000
8           23456       2019-01-01 01:21:00.000
9           23456       2019-01-01 01:21:10.000
10          23456       2019-01-01 01:21:20.000
11          23456       2019-01-01 01:29:20.000
13          29956       2019-01-01 01:21:10.000
15          29956       2019-01-01 01:29:20.000
16          29956       2019-01-01 01:21:20.000
17          29956       2019-01-02 01:21:20.000
19          29956       2019-01-03 01:29:20.000

(12 rows affected)

Sure enough! We’ve eliminated our duplicate rows, and have unique AccountID/CreateDate combinations.

Thanks for reading!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.