Written by 13:32 Database development, Stored Procedures, Tables

Stored Procedure to Delete Duplicate Records in SQL Table

CodingSight - Delete Duplicate Records in SQL Table

Sometimes during our run as DBAs, we come up across at least one table that is loaded with duplicate records. Even if the table has a Primary Key (an auto-incremental one in most cases), the rest of the fields might have duplicate values.

However, SQL Server allows for many ways to get rid of those duplicate records (e.g. using CTEs, SQL Rank function, subqueries with Group By, etc.).

I remember once, during an interview, I was asked how to delete duplicate records in a table while leaving just 1 of each. At that time, I wasn’t able to answer, but I was very curious. After researching a bit, I found plenty of options to resolve this problem.

Now, years later, I’m here to present you a Stored Procedure that aims to answer the question “how to delete duplicate records in SQL table?”. Any DBA can simply use it to do some housekeeping without worrying too much.

Create Stored Procedure: Initial Considerations

The account you use must have enough privileges to create a Stored Procedure in the intended database.

The account executing this Stored Procedure must have enough privileges to perform the SELECT and DELETE operations against the target database table.

This Stored Procedure is intended for the database tables that don’t have a Primary Key (nor a UNIQUE constraint) defined. However, if your table has a Primary Key, the Stored Procedure will not take those fields into account. It will perform the lookup and deletion based on the rest of the fields (so use it very carefully in this case).

How to Use Stored Procedure in SQL

Copy & paste the SP T-SQL Code available within this article. The SP expects 3 parameters:

@schemaName – the name of the database table schema if applies. If not – use dbo.

@tableName – the name of the database table where the duplicate values are stored.

@displayOnly – if set to 1, the actual duplicate records will not be deleted, but only displayed instead (if any). By default, this value is set to 0 meaning that the actual deletion will happen if duplicates exist.

SQL Server Stored Procedure Execution Tests

To demonstrate the Stored Procedure, I have created two different tables – one without a Primary Key, and one with a Primary Key. I have inserted some dummy records into these tables. Let’s check what results I get before/after executing the Stored Procedure.

SQL Table with Primary Key

CREATE TABLE [dbo].[test](
	[column1] [varchar](16) NOT NULL,
	[column2] [varchar](16) NOT NULL,
	[column3] [varchar](16) NOT NULL,
CONSTRAINT [PK_Test] PRIMARY KEY CLUSTERED 
(
	[column1] ASC,
	[column2] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

SQL Stored Procedure Example Records

INSERT INTO test VALUES('A','A',1),('A','B',1),('A','C',1),('B','A',2),('B','B',3),('B','C',4)
Table with Primary Key Example

Execute Stored Procedure with Display Only

EXEC DBA_DeleteDuplicates @schemaName = 'dbo',@tableName = 'test',@displayOnly = 1
Stored Procedure Execution with a Display Only

Since column1 and column2 form the Primary Key, the duplicates are evaluated against the non-Primary Key columns, in this case, column3. The result is correct.

Execute Stored Procedure without Display Only

EXEC DBA_DeleteDuplicates @schemaName = 'dbo',@tableName = 'test',@displayOnly = 0
Stored Procedure Execution without Display Only

The duplicate records are gone.

However, you must be careful with this approach because the first occurrence of the record is the one that will cut. So, if for any reason you need a very specific record to be deleted, then you must tackle your particular case separately.

SQL Table with no Primary Key

CREATE TABLE [dbo].[duplicates](
	[column1] [varchar](16) NOT NULL,
	[column2] [varchar](16) NOT NULL,
	[column3] [varchar](16) NOT NULL
) ON [PRIMARY]
GO

SQL Stored Procedure Example Records

INSERT INTO duplicates VALUES
('John','Smith','Y'),
('John','Smith','Y'),
('John','Smith','N'),
('Peter','Parker','N'),
('Bruce','Wayne','Y'),
('Steve','Rogers','Y'),
('Steve','Rogers','Y'),
('Tony','Stark','N')
Table with no primary key example

Execute Stored Procedure with Display Only

EXEC DBA_DeleteDuplicates @schemaName = 'dbo',@tableName = 'duplicates',@displayOnly = 1
Stored Procedure Execution with a Display Only

The output is correct, those are the duplicate records in the table.

Execute Stored Procedure without Display Only

EXEC DBA_DeleteDuplicates @schemaName = 'dbo',@tableName = 'duplicates',@displayOnly = 0
Stored Procedure Execution without Display Only

The Stored Procedure has worked as expected and the duplicates are successfully cleaned.

Special Cases for this Stored Procedure in SQL

If the schema or table you are specifying doesn’t exist within your database, the Stored Procedure will notify you, and the script will end its execution.

Table doesn't exist within this database

If you leave the schema name blank, the script will notify you and end its execution.

You must specify the schema of the table

If you leave the table name blank, the script will notify you and end its execution.

You must specify the name of the table

If you execute the Stored Procedure against a table that doesn’t have any duplicates and activate the @displayOnly bit, you will get an empty result set.

Table doesn't have any duplicates

SQL Server Stored Procedure: Complete Code

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
-- =============================================
-- Author     :	Alejandro Cobar
-- Create date: 2021-06-01
-- Description:	SP to delete duplicate rows in a table
-- =============================================
CREATE PROCEDURE DBA_DeleteDuplicates 
	@schemaName  VARCHAR(128),
	@tableName   VARCHAR(128),
	@displayOnly BIT = 0
AS
BEGIN
	SET NOCOUNT ON;
	
	IF LEN(@schemaName) = 0
	BEGIN
		PRINT 'You must specify the schema of the table!'
		RETURN
	END

	IF LEN(@tableName) = 0
	BEGIN
		PRINT 'You must specify the name of the table!'
		RETURN
	END

	IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = @schemaName AND  TABLE_NAME = @tableName)
	BEGIN
		DECLARE @pkColumnName  VARCHAR(128);
		DECLARE @columnName    VARCHAR(128);
		DECLARE @sqlCommand    VARCHAR(MAX);
		DECLARE @columnsList   VARCHAR(MAX);
		DECLARE @pkColumnsList VARCHAR(MAX);
		DECLARE @pkColumns     TABLE(pkColumn VARCHAR(128));
		DECLARE @limit         INT;
		
		INSERT INTO @pkColumns
		SELECT K.COLUMN_NAME
		FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS AS C
		JOIN INFORMATION_SCHEMA.KEY_COLUMN_USAGE AS K ON C.TABLE_NAME = K.TABLE_NAME AND C.CONSTRAINT_SCHEMA = K.CONSTRAINT_SCHEMA
		WHERE C.CONSTRAINT_TYPE = 'PRIMARY KEY'
		  AND C.CONSTRAINT_SCHEMA = @schemaName AND C.TABLE_NAME = @tableName

		IF((SELECT COUNT(*) FROM @pkColumns) > 0)
		BEGIN
			DECLARE pk_cursor CURSOR FOR 
			SELECT * FROM @pkColumns
	
			OPEN pk_cursor  
			FETCH NEXT FROM pk_cursor INTO @pkColumnName 
		
			WHILE @@FETCH_STATUS = 0  
			BEGIN  
				SET @pkColumnsList = CONCAT(@pkColumnsList,'',@pkColumnName,',')
				FETCH NEXT FROM pk_cursor INTO @pkColumnName 
			END 

			CLOSE pk_cursor  
			DEALLOCATE pk_cursor 

			SET @pkColumnsList = SUBSTRING(@pkColumnsList,1,LEN(@pkColumnsList)-1)
		END  
		
		DECLARE columns_cursor CURSOR FOR 
		SELECT COLUMN_NAME
		FROM INFORMATION_SCHEMA.COLUMNS
		WHERE TABLE_SCHEMA = @schemaName AND TABLE_NAME = @tableName AND COLUMN_NAME NOT IN (SELECT pkColumn FROM @pkColumns)
		ORDER BY ORDINAL_POSITION;

		OPEN columns_cursor  
		FETCH NEXT FROM columns_cursor INTO @columnName 
		
		WHILE @@FETCH_STATUS = 0  
		BEGIN  
			SET @columnsList = CONCAT(@columnsList,'',@columnName,',')
			FETCH NEXT FROM columns_cursor INTO @columnName 
		END 

		CLOSE columns_cursor  
		DEALLOCATE columns_cursor 
		
		SET @columnsList = SUBSTRING(@columnsList,1,LEN(@columnsList)-1)

		IF((SELECT COUNT(*) FROM @pkColumns) > 0)
		BEGIN		

		IF(CHARINDEX(',',@columnsList) = 0)
		SET @limit = LEN(@columnsList)+1
		ELSE
		SET @limit = CHARINDEX(',',@columnsList)

		
		SET @sqlCommand = CONCAT('WITH CTE (',@columnsList,',DuplicateCount',') 
									AS (SELECT ',@columnsList,',',
									             'ROW_NUMBER() OVER(PARTITION BY ',@columnsList,' ',
												 'ORDER BY ',SUBSTRING(@columnsList,1,@limit-1),') AS DuplicateCount
									    FROM [',@schemaName,'].[',@tableName,'])
										
								  ')
		IF @displayOnly = 0
		SET @sqlCommand = CONCAT(@sqlCommand,'DELETE FROM CTE WHERE DuplicateCount > 1;')
		IF @displayOnly = 1
		SET @sqlCommand = CONCAT(@sqlCommand,'SELECT ',@columnsList,',MAX(DuplicateCount) AS DuplicateCount FROM CTE WHERE DuplicateCount > 1 GROUP BY ',@columnsList)

		END
		ELSE
		BEGIN		
		SET @sqlCommand = CONCAT('WITH CTE (',@columnsList,',DuplicateCount',') 
									AS (SELECT ',@columnsList,',',
									             'ROW_NUMBER() OVER(PARTITION BY ',@columnsList,' ',
												 'ORDER BY ',SUBSTRING(@columnsList,1,CHARINDEX(',',@columnsList)-1),') AS DuplicateCount
									    FROM [',@schemaName,'].[',@tableName,'])
										
								 ')

		IF @displayOnly = 0
		SET @sqlCommand = CONCAT(@sqlCommand,'DELETE FROM CTE WHERE DuplicateCount > 1;')
		IF @displayOnly = 1
		SET @sqlCommand = CONCAT(@sqlCommand,'SELECT * FROM CTE WHERE DuplicateCount > 1;')

		END
		
		EXEC (@sqlCommand)
	END
	ELSE
		BEGIN
			PRINT 'Table doesn't exist within this database!'
			RETURN
		END
END
GO

Conclusion

If you don’t know how to delete duplicate records in SQL table, than tools like this will be helpful for you. Any DBA can check if there are database tables that don’t have Primary Keys (nor Unique constraints) for them, that might accumulate a pile of unnecessary records over time (potentially wasting storage). Just plug and play the Stored Procedure, and you’re good to go.

You can go a bit further and build an alerting mechanism to notify you if there are duplicates for a specific table (after implementing a bit of automation using this tool of course), which comes quite handy.

As with anything related to DBA tasks, make sure to always test everything in a sandbox environment before pulling the trigger in production. And when you do, make sure to have a backup of the table you focus on.

Tags: , Last modified: November 07, 2022
Close