Grouping Data using the OVER and PARTITION BY Functions

Total: 62 Average: 4.4

The OVER and PARTITION BY functions are both functions used to portion some particular results set according to specified criteria.

This article explains the usage of these two functions in conjunction to retrieve partitioned data in very specific ways.

Preparing Some Sample Data

To execute our sample queries, let’s first create a database named “studentdb”.

Run the following command in your query window:

CREATE DATABASE schooldb;

Next, we create the “student” table within the “studentdb” database. This table will have five columns: id, name, age, gender, and total_score.

As always, make sure to back up the existing data before experimenting with a new code. See this article on backing up SQL Server databases if you’re not sure.

Execute the following query to create the student table:

USE schooldb
CREATE TABLE student
(
    id INT PRIMARY KEY IDENTITY,
    name VARCHAR(50) NOT NULL,
    gender VARCHAR(50) NOT NULL,
    age INT NOT NULL,
    total_score INT NOT NULL, 
 )

Finally, we need to insert some dummy data into the database:

USE schooldb
INSERT INTO student 

VALUES ('Jolly', 'Female', 20, 500), 
('Jon', 'Male', 22, 545), 
('Sara', 'Female', 25, 600), 
('Laura', 'Female', 18, 400), 
('Alan', 'Male', 20, 500), 
('Kate', 'Female', 22, 500), 
('Joseph', 'Male', 18, 643), 
('Mice', 'Male', 23, 543), 
('Wise', 'Male', 21, 499), 
('Elis', 'Female', 27, 400);

Now, we are ready to work on a problem. Let’s review how we can use Over and Partition By to solve it.

Problem

We have 10 records in the student table, and we want to display the name, id, and gender of all students. Besides, we want to display the total number of students of each gender, the average students’ age for each gender, and the sum of values in the total_score column for each gender.

The result set that we’re looking for is as below:

Students table

As you see, the first three columns (marked black) contain individual values for each record, while the last three columns (marked red) contain the aggregated values, grouped by the gender column. For example, in the Average_Age column, the first five rows display the average age and the total score of all records for “Female.”

Our result set contains aggregated results joined with non-aggregated columns.

To retrieve these aggregated results grouped by a particular column, we can use the standard GROUP BY clause:

USE schooldb
SELECT gender, count(gender) AS Total_Students, AVG(age) as Average_Age, SUM(total_score) as Total_Score
FROM student
GROUP BY gender

Let’s retrieve Total_Students, Average_Age, and Total_Score of the students grouped by gender.

You will see the following results:

Students totals

We extend this and add ‘id’ and ‘name’ (non-aggregated columns in the SELECT statement). Now, we check if we can get our desired result.

USE schooldb
SELECT id, name, gender, count(gender) AS total_students, AVG(age) as Average_Age, SUM(total_score) as Total_Score
FROM student
GROUP BY gender

However, when we run the above query, we get the following error:

Error produced for the query

This error claims that the “id” column of the student table is invalid within the SELECT statement since we are using the GROUP BY clause in the query.

Therefore, we have to apply an aggregate function on the “id” column, or we have to use it in the GROUP BY clause. In short, this scheme doesn’t solve our problem.

Solution Using JOIN Statement

One solution to this would be making use of the JOIN statement to join the columns with aggregated results to columns with non-aggregated results.

To do so, you need a sub-query that retrieves gender, Total_Students, Average_Age, and the Total_Score of students grouped by gender. Then we can join these results with the results obtained from sub-query with the outer SELECT statement.

We apply it to the “gender” column of the sub-query containing the aggregated result, and the “gender” column of the student table. The outer SELECT statement would include non-aggregated columns i.e. ‘id’ and ‘name’, as below.

USE schooldb
SELECT id, name, Aggregation.gender, Aggregation.Total_students, Aggregation.Average_Age, Aggregation.Total_Score
FROM student
INNER JOIN
(SELECT gender, count(gender) AS Total_students, AVG(age) AS Average_Age, SUM(total_score) AS Total_Score
FROM student
GROUP BY gender) AS Aggregation
on Aggregation.gender = student.gender

The above query will give you the desired result but it is not the optimal solution either. We had to use a JOIN statement and a sub-query, thus increasing the complexity of the script. It is not an elegant, efficient solution.

A better approach is to use the OVER and PARTITION BY clauses in conjunction.

Solution Using OVER and PARTITION BY

To use the OVER and PARTITION BY clauses, you simply specify the column you want to partition your aggregated results by. Let’s examine the practical case of using OVER and PARTITION BY to understand this method:

USE schooldb
SELECT id, name, gender,
COUNT(gender) OVER (PARTITION BY gender) AS Total_students,
AVG(age) OVER (PARTITION BY gender) AS Average_Age,
SUM(total_score) OVER (PARTITION BY gender) AS Total_Score
FROM student

It is a much more efficient result. In the first line of the script, the “id,” “name,” and “gender” columns are retrieved. These columns do not contain any aggregated results.

Next, for the columns containing aggregated results, we simply specify the aggregated function, followed by the OVER clause. Then, within the parenthesis, we specify the PARTITION BY clause. The third element that follows is the name of the column we want our results to be partitioned:

Partition results table

Conclusion

The necessity to portion some data for particular criteria arises quite often, and several options are applicable to do this operation. The most suitable way is to use the T-SQL OVER and PARTITION BY clauses in conjunction. This method ensures to avoid errors and makes the script clear and less complex.

Among all the methods possible, T-SQL PARTITION BY and OVER clauses in conjunction is the most efficient and practical solution.

References

Ben Richardson

Ben Richardson

Ben Richardson runs Acuity Training. Acuity is an IT training business offering classroom courses in London and Guildford. It is a leading provider of SQL training the UK and offers a full range of SQL training from introductory training to advanced administration courses.