The OVER and PARTITION BY functions are both functions used to portion some particular results set according to specified criteria.
This article explains the usage of these two functions in conjunction to retrieve partitioned data in very specific ways.
Preparing Some Sample Data
To execute our sample queries, let’s first create a database named “studentdb”.
Run the following command in your query window:
CREATE DATABASE schooldb;
Next, we create the “student” table within the “studentdb” database. This table will have five columns: id, name, age, gender, and total_score.
As always, make sure to back up the existing data before experimenting with a new code. See this article on backing up SQL Server databases if you’re not sure.
Execute the following query to create the student table:
USE schooldb CREATE TABLE student ( id INT PRIMARY KEY IDENTITY, name VARCHAR(50) NOT NULL, gender VARCHAR(50) NOT NULL, age INT NOT NULL, total_score INT NOT NULL, )
Finally, we need to insert some dummy data into the database:
USE schooldb INSERT INTO student VALUES ('Jolly', 'Female', 20, 500), ('Jon', 'Male', 22, 545), ('Sara', 'Female', 25, 600), ('Laura', 'Female', 18, 400), ('Alan', 'Male', 20, 500), ('Kate', 'Female', 22, 500), ('Joseph', 'Male', 18, 643), ('Mice', 'Male', 23, 543), ('Wise', 'Male', 21, 499), ('Elis', 'Female', 27, 400);
Now, we are ready to work on a problem. Let’s review how we can use Over and Partition By to solve it.
We have 10 records in the student table, and we want to display the name, id, and gender of all students. Besides, we want to display the total number of students of each gender, the average students’ age for each gender, and the sum of values in the total_score column for each gender.
The result set that we’re looking for is as below:
As you see, the first three columns (marked black) contain individual values for each record, while the last three columns (marked red) contain the aggregated values, grouped by the gender column. For example, in the Average_Age column, the first five rows display the average age and the total score of all records for “Female.”
Our result set contains aggregated results joined with non-aggregated columns.
To retrieve these aggregated results grouped by a particular column, we can use the standard GROUP BY clause:
USE schooldb SELECT gender, count(gender) AS Total_Students, AVG(age) as Average_Age, SUM(total_score) as Total_Score FROM student GROUP BY gender
Let’s retrieve Total_Students, Average_Age, and Total_Score of the students grouped by gender.
You will see the following results:
We extend this and add ‘id’ and ‘name’ (non-aggregated columns in the SELECT statement). Now, we check if we can get our desired result.
USE schooldb SELECT id, name, gender, count(gender) AS total_students, AVG(age) as Average_Age, SUM(total_score) as Total_Score FROM student GROUP BY gender
However, when we run the above query, we get the following error:
This error claims that the “id” column of the student table is invalid within the SELECT statement since we are using the GROUP BY clause in the query.
Therefore, we have to apply an aggregate function on the “id” column, or we have to use it in the GROUP BY clause. In short, this scheme doesn’t solve our problem.
Solution Using JOIN Statement
One solution to this would be making use of the JOIN statement to join the columns with aggregated results to columns with non-aggregated results.
To do so, you need a sub-query that retrieves gender, Total_Students, Average_Age, and the Total_Score of students grouped by gender. Then we can join these results with the results obtained from sub-query with the outer SELECT statement.
We apply it to the “gender” column of the sub-query containing the aggregated result, and the “gender” column of the student table. The outer SELECT statement would include non-aggregated columns i.e. ‘id’ and ‘name’, as below.
USE schooldb SELECT id, name, Aggregation.gender, Aggregation.Total_students, Aggregation.Average_Age, Aggregation.Total_Score FROM student INNER JOIN (SELECT gender, count(gender) AS Total_students, AVG(age) AS Average_Age, SUM(total_score) AS Total_Score FROM student GROUP BY gender) AS Aggregation on Aggregation.gender = student.gender
The above query will give you the desired result but it is not the optimal solution either. We had to use a JOIN statement and a sub-query, thus increasing the complexity of the script. It is not an elegant, efficient solution.
A better approach is to use the OVER and PARTITION BY clauses in conjunction.
Solution Using OVER and PARTITION BY
To use the OVER and PARTITION BY clauses, you simply specify the column you want to partition your aggregated results by. Let’s examine the practical case of using OVER and PARTITION BY to understand this method:
USE schooldb SELECT id, name, gender, COUNT(gender) OVER (PARTITION BY gender) AS Total_students, AVG(age) OVER (PARTITION BY gender) AS Average_Age, SUM(total_score) OVER (PARTITION BY gender) AS Total_Score FROM student
It is a much more efficient result. In the first line of the script, the “id,” “name,” and “gender” columns are retrieved. These columns do not contain any aggregated results.
Next, for the columns containing aggregated results, we simply specify the aggregated function, followed by the OVER clause. Then, within the parenthesis, we specify the PARTITION BY clause. The third element that follows is the name of the column we want our results to be partitioned:
The necessity to portion some data for particular criteria arises quite often, and several options are applicable to do this operation. The most suitable way is to use the T-SQL OVER and PARTITION BY clauses in conjunction. This method ensures to avoid errors and makes the script clear and less complex.
Among all the methods possible, T-SQL PARTITION BY and OVER clauses in conjunction is the most efficient and practical solution.
- Microsoft – Understanding the OVER clause
- Midnight DBA – Introduction To OVER and PARTITION BY
- StackOverflow – Difference Between PARTITION BY and GROUP BY
- Pivoting, Unpivoting, and Splitting Columns in Power BI Query Editor - July 7, 2020
- Formatting Data in Power BI Desktop Visualizations - July 3, 2020
- Implementing Hierarchies in Power BI Desktop Visualizations - July 2, 2020