Is SQL DISTINCT Good for Removing Duplicates in Results?

Is SQL DISTINCT good (or bad) when you need to remove duplicates in results?

Some say it’s good and add DISTINCT when duplicates appear. Some say it’s bad and suggest using GROUP BY without an aggregate function. Others say DISTINCT and GROUP BY are the same when you need to remove duplicates.

This post will dive into the details to get correct answers. So, eventually, you will use the best keyword based on the need. Let’s begin.

CodingSight - Is SQL DISTINCT Good for Removing Duplicates in Results

How SQL DISTINCT Works to Remove Duplicates

Getting answers is not so hard to find. SQL Server provided us with execution plans to see how a query will be processed to give us the needed results.

The following section focuses on the execution plan when using DISTINCT. You need to press Ctrl-M in SQL Server Management Studio before executing the queries below. Or, click Include Actual Execution Plan from the toolbar.

Query Plans in SQL DISTINCT

Let’s begin by comparing 2 queries. The first will not use DISTINCT, and the second query will.

USE AdventureWorks
GO

-- Without DISTINCT. Duplicates included
SELECT Lastname FROM Person.Person;
-- With DISTINCT. Duplicates removed
SELECT DISTINCT Lastname FROM Person.Person;

Here’s the execution plan:

Execution plan of 2 queries. One without DISTINCT and the other with it
Figure 1. Execution plan of 2 queries. One without DISTINCT and the other with it

What did Figure 1 show us?

  • Without the DISTINCT keyword, the query is simple.
  • An extra step appears after adding DISTINCT.
  • The query cost of using DISTINCT is higher than without it.
  • Both have Index Scan operators. This is understandable because there’s no specific WHERE clause in our queries.
  • The extra step, the Stream Aggregate operator, is used to remove the duplicates.

The number of logical reads is the same (107) if you check the STATISTICS IO. Yet, the number of records is vastly different. 19,972 rows are returned by the first query. Meanwhile, 1,206 rows are returned by the second query.

Hence, you cannot add DISTINCT anytime you want. But if you need unique values, this is a necessary overhead.

There are operators used to output unique values. Let’s examine some of them.

STREAM AGGREGATE

This is the operator you saw in Figure 1. It accepts a single input and outputs an aggregated result. In Figure 1, the input comes from the Index Scan operator. However, Stream Aggregate needs a sorted input.

As you can see in Figure 1, it uses the IX_Person_LastName_FirstName_MiddleName, a non-unique index on names. Since the index already sorts the records by name, the Stream Aggregate accepts the input. Without the index, the query optimizer may opt to use an extra Sort operator in the plan. And that will be more expensive. Or, it can use a Hash Match.

HASH MATCH (AGGREGATE)

Another operator used by DISTINCT is Hash Match. This operator is used for joins and aggregations.

When using DISTINCT, Hash Match aggregates the results to produce unique values. Here’s one example.

USE AdventureWorks
GO

-- Get unique first names
SELECT DISTINCT Firstname FROM Person.Person;

And here’s the execution plan:

Hash Match (Aggregate) in execution plan when using DISTINCT
Figure 2. Hash Match (Aggregate) in execution plan when using DISTINCT

But why not Stream Aggregate?

Notice that the same name index is used. That index sorts with Lastname first. So, a Firstname only query will become unsorted.

Hash Match (Aggregate) is the next logical choice to remove the duplicates.

HASH MATCH (FLOW DISTINCT)

The Hash Match (Aggregate) is a blocking operator. Thus, it won’t produce the output it has processed the entire input stream. If we restrict the number of rows (like using TOP with DISTINCT), it will produce a unique output as soon as those rows are available. That’s what Hash Match (Flow Distinct) is all about.

USE AdventureWorks
GO

SELECT DISTINCT TOP 100
 sod.ProductID
,soh.OrderDate
,soh.ShipDate
FROM Sales.SalesOrderDetail sod
INNER JOIN Sales.SalesOrderHeader soh ON sod.SalesOrderID = soh.SalesOrderID;

The query uses TOP 100 along with DISTINCT. Here’s the execution plan:

Hash Match (Flow Distinct) is used when there’s TOP + DISTINCT in SELECT
Figure 3. Hash Match (Flow Distinct) is used when there’s TOP + DISTINCT in SELECT

WHEN THERE’S NO OPERATOR TO REMOVE DUPLICATES

Yup. This can happen. Consider the example below.

USE AdventureWorks
GO

SELECT DISTINCT
 BusinessEntityID
FROM Person.Person;

Then, check the execution plan:

No operator is used to output distinct values
Figure 4. No operator is used to output distinct values

The BusinessEntityID column is the primary key. Since that column is already unique, there is no use in applying DISTINCT. Try removing DISTINCT from the SELECT statement – the execution plan is the same as in Figure 4.

The same is true when using DISTINCT on columns with a unique index.

SQL DISTINCT Works on ALL Columns in the SELECT List

So far, we have only used 1 column in our examples. However, DISTINCT works on ALL columns you specify in the SELECT list.

Here’s an example. This query will make sure that the values of all 3 columns will be unique.

USE AdventureWorks
GO

SELECT DISTINCT 
 Lastname 
,FirstName
,MiddleName
FROM Person.Person;

Notice the first few rows in the result set in Figure 5.

Duplicates in the first 2 columns using DISTINCT in a 3-column SELECT
Figure 5. Duplicates in the first 2 columns using DISTINCT in a 3-column SELECT

The first few rows are all unique. The DISTINCT keyword made sure that the Middlename column is also considered. Notice the 2 names boxed in red. Considering the Lastname and Firstname only will make them duplicates. But adding Middlename to the mix changed everything.

What if you want to get unique first and last names but include the middle name in the result?

You have 2 options:

  • Add a WHERE clause to remove NULL middle names. This will remove all names with a NULL middle name.
  • Or, add a GROUP BY clause on Lastname and Firstname columns. Then, use the MIN aggregate function on the Middlename column. This will get 1 middle name with the same last and first names.

SQL DISTINCT vs. GROUP BY

When using GROUP BY without an aggregate function, it acts like DISTINCT. How do we know? One way to find out is to use an example.

USE AdventureWorks
GO

-- using DISTINCT
SELECT DISTINCT
 soh.TerritoryID
,st.Name
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.SalesTerritory st ON soh.TerritoryID = st.TerritoryID;

-- using GROUP BY
SELECT
 soh.TerritoryID
,st.Name
FROM Sales.SalesOrderHeader soh
INNER JOIN Sales.SalesTerritory st ON soh.TerritoryID = st.TerritoryID
GROUP BY
 soh.TerritoryID
,st.Name;

Run them and check out the execution plan. Is it like the screenshot below?

Execution plan comparison of a query using DISTINCT vs. query with GROUP BY
Figure 6. Execution plan comparison of a query using DISTINCT vs. query with GROUP BY

How do they compare?

  • They have the same plan operators and sequence.
  • The operator cost of each and the query costs are the same.

If you check the QueryPlanHash properties of the 2 SELECT operators, they are the same. Therefore, the query optimizer used the same process to return the same results.

In the end, we can’t say that using GROUP BY is better than DISTINCT in returning unique values. You can prove this by using the above examples to replace DISTINCT with GROUP BY.

It is now a matter of preference which will you use. I prefer DISTINCT. It explicitly tells the intent in the query – to produce unique results. And for me, GROUP BY is for grouping results using an aggregate function. That intent is also clear and consistent with the keyword itself. I don’t know if someone else will maintain my queries one day. So, the code should be clear.

But that’s not the end of the story.

When SQL DISTINCT is Not the Same as GROUP BY

I just expressed my opinion, and then this?

It’s true. They won’t be the same all the time. Consider this example.

-- using DISTINCT
SELECT DISTINCT
 soh.TerritoryID
,(SELECT name FROM Sales.SalesTerritory st WHERE st.TerritoryID = soh.TerritoryID) AS TerritoryName
FROM Sales.SalesOrderHeader soh;

-- using GROUP BY
SELECT
 soh.TerritoryID
,(SELECT name FROM Sales.SalesTerritory st WHERE st.TerritoryID = soh.TerritoryID) AS TerritoryName
FROM Sales.SalesOrderHeader soh
GROUP BY
soh.TerritoryID;

Though the result set is unsorted, the rows are the same as in the previous example. The only difference is the use of a subquery:

Another execution plan comparison of a query using DISTINCT vs. query with GROUP BY
Figure 7. Another execution plan comparison of a query using DISTINCT vs. query with GROUP BY

The differences are obvious: operators, query cost, overall plan. This time, GROUP BY wins with only 28% query cost. But here’s the thing.

The aim is to show you that they can be different. That’s all. This is by no means a recommendation. Using a join has a better execution plan (see Figure 6 again).

The Bottomline

Here’s what we have learned so far:

  • DISTINCT adds a plan operator to remove duplicates.
  • DISTINCT and GROUP BY without an aggregate function result in the same plan. In short, they are the same most of the time.
  • Sometimes, DISTINCT and GROUP BY can have different plans when a subquery is involved in the SELECT list.

So, is SQL DISTINCT good or bad in removing duplicates in results?

The results say that it’s good. It’s not better or worse than GROUP BY because the plans are the same. But it’s a good habit to check the execution plan. Think of optimization from the start. That way, if you stumble with any differences in DISTINCT and GROUP BY, you will spot them.

Besides, the modern tools make this task much simpler. For instance, a popular product dbForge SQL Complete from Devart has a specific feature that calculates values in the aggregate functions in the ready result set of the SSMS result grid. The DISTINCT values are also present there.

Like the post? Then, please spread the word by sharing it on your favorite social media platforms.

Related Articles for More Information

Edwin Sanchez

Edwin Sanchez

Software developer and project manager with a total of 20+ years of software development. His most recent technology preferences include C#, SQL Server BI Stack, Power BI, and Sharepoint. Edwin combines his technical knowledge with his most recent content writing skills to help new breed of technology enthusiasts.

Leave a Reply

Your email address will not be published. Required fields are marked *