An Overview of Analytic Functions in PostgreSQL

Analytic functions are special kinds of pre-built functions that come with PostgreSQL by default. They allow you to execute a variety of analytical workloads on your datasets and prepare results.

In the world of cloud computing, where we need to provide insights to customers in a swift and meaningful way, understanding these analytical functions solves a lot of challenges.

This article will talk about the most common analytical functions used in PostgreSQL.

CodingSight - An overview of Analytic Functions in PostgreSQL

Analytic or Window Functions in PostgreSQL

The analytical functions have been added to the database engine since PostgreSQL 9.1.

The basic purpose of an analytic or a window function is to perform calculations across several records related to the current row. It takes into account the context of the current row and calculates other rows based on that.

The analytic or window functions can be considered somewhat similar to aggregate functions. However, the major difference between these is that an aggregate function group the number of output records, whereas an analytic function does not group the output records to a single row.

Some of the most popular analytic functions in PostgreSQL are as follows.

  • RANK – assign a rank to the row based on a column.
  • CUME_DIST – find the cumulative distribution of an integer column.
  • FIRST_VALUE – find the first value that appears in a group of records within the dataset.
  • LAST_VALUE – find the last value that appears within a group of records.
  • LAG – display values after the occurrence of the current row.
  • LEAD – display values that appear before the occurrence of the current row.

Now, let us try the implementation of these practices. I have prepared a script in SQL for creating a sample database table in your instance of PostgreSQL. You can find this SQL script in my gist.

Thus, first, you need to create a sample database in your PostgreSQL instance, and then run the script on the database. It will create a table with some dummy data in it.

Using the RANK Function

The RANK function serves to assign a rank or assign a row number to every row within a partition of a result set. This rank is set to 1 for the first row in each partition. As the partition changes, the RANK automatically detects the change and assigns the value 1 to that row.

SELECT 
    laptop_name,
    laptop_brand,
    RANK() OVER(PARTITION BY laptop_brand ORDER BY price) AS rank_laptop_brand
FROM laptops;
Using RANK in PostgreSQL
Figure 1 – Using RANK in PostgreSQL

We have prepared the query to rank the records by prices. The partition is created on the laptop_brand column. This means the rank will reset to 1 and start again for every change in this column.

Understanding the CUME_DIST Function

A Data Analyst who has to analyze the cumulative distribution of a numerical column may often need to apply statistical calculations in a relational table.

In statistics, a cumulative distribution function serves to calculate the probability when a random variable may take a specific value less than or equal to the value itself. In SQL, it can be calculated by counting the total number of records present in the dataset and then dividing it by the rank of each row.

Let us understand this with the help of an example:

SELECT 
    laptop_name, 
    laptop_brand, 
    relesed_year, 
    device_type, 
    price,
    CUME_DIST() OVER(ORDER BY price) AS price_cume_dist
FROM laptops
Calculating Cumulative Distribution in PostgreSQL
Figure 2 – Calculating Cumulative Distribution in PostgreSQL

As you can see, the column price_cume_dist gives us the cumulative distribution of the price starting from the lowest to the highest. The last record has the distribution value of 1 which means all prices in the dataset are equal to or lower than that value.

Likewise, the cumulative distribution can also be calculated based on the partition of a group in the dataset instead of the entire dataset. For example, we can calculate the cumulative distribution of the same numerical column based on another column within the dataset that would create the partition. Let us try an example:

SELECT 
    laptop_name, 
    laptop_brand, 
    relesed_year, 
    device_type, 
    price,
    CUME_DIST() OVER(PARTITION BY device_type ORDER BY price) AS price_cume_dist_by_device_type
FROM laptops
Calculating the Cumulative Distribution of a column based on a partition
Figure 3 – Calculating the Cumulative Distribution of a column based on a partition

Thus, the price_cume_dist_by_device_type column is calculated based on the partition created by the device_type column. For every new device type, the distribution is reset to 0.

Understanding the FIRST_VALUE and the LAST_VALUE Functions

As it goes by the name, the FIRST_VALUE function serves to find the first occurrence of the value from a column and repeat it for all the other rows. Similarly, the LAST_VALUE is used to find the last occurrence of a column value and then repeat it for all the rows in the dataset.

In practice, these functions are used to eliminate complex SQL joins while obtaining the first or the last value in the dataset. For example, in a web forum database, you might want to know who has initiated the thread and who was the last person to comment on the thread. These can be done by leveraging the FIRST_VALUE and the LAST_VALUE functions.

Let us see a quick demonstration:

SELECT 
    laptop_name,
    laptop_brand,
    relesed_year,
    device_type,
    price,
    FIRST_VALUE(laptop_name) OVER(ORDER BY price) AS first_value_laptop_name
FROM laptops
Using the FIRST_VALUE function
Figure 4 – Using the FIRST_VALUE function

The first value of the laptop_name column is repeated in the last column of the dataset for all columns. You can also add a partition to this query to print the first values in each of the partitions.

SELECT 
    laptop_name,
    laptop_brand,
    relesed_year,
    device_type,
    price,
FIRST_VALUE(laptop_name) OVER(PARTITION BY laptop_brand ORDER BY Price) AS first_value_by_laptop_brand
FROM laptops
Using the FIRST_VALUE function in a partition
Figure 5 – Using the FIRST_VALUE function in a partition

Here, the partition is created based on the laptop_brand column. For each of the brands, the first values change accordingly.

For the implementation of the last value function, you can use the following query:

SELECT 
    laptop_name,
    laptop_brand,
    relesed_year,
    device_type,
    price,
LAST_VALUE(laptop_name) OVER(ORDER BY price) AS last_value_laptop_name
FROM laptops
Using the LAST_VALUE Function in PostgreSQL
Figure 6 – Using the LAST_VALUE Function in PostgreSQL

Unlike previous functions, each of the rows has a different value for the LAST_VALUE column. This is because of the way the function is designed. To print the last values for each row, you can use the following SQL command:

SELECT 
    laptop_name,
    laptop_brand,
    relesed_year,
    device_type,
    price,
LAST_VALUE(laptop_name) OVER(ORDER BY price
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS last_value_laptop_name
FROM laptops
Printing the last value of the dataset using the LAST_VALUE function
Figure 7 – Printing the last value of the dataset using the LAST_VALUE function

Here, the last value of the laptop_name column is repeated for each of the rows.

Understanding the Usage of LEAD and LAG Functions

LEAD and LAG are two important analytical functions in SQL. They are used to find values from the same column but with an offset to compare sales values from different periods. For example, comparing the sales growth from last quarter to this quarter. Have a look at the example:

SELECT 
    laptop_name,
    LEAD(laptop_name,1) OVER(ORDER BY price) AS first_offset,
    LEAD(laptop_name,2) OVER(ORDER BY price) AS second_offset,
    laptop_brand,
    price
FROM laptops
Using the LEAD function in PostgreSQL
Figure 8 – Using the LEAD function in PostgreSQL

The LEAD function prints the leading row offset values in the second and the third column. Similarly, the LAG function will print the lagging row offsets.

SELECT 
    laptop_name,
    LAG(laptop_name,1) OVER(ORDER BY price) AS first_offset,
    LAG(laptop_name,2) OVER(ORDER BY price) AS second_offset,
    laptop_brand,
    price
FROM laptops
Using the LAG function in PostgreSQL
Figure 9 – Using the LAG function in PostgreSQL

Conclusion

PostgreSQL is an open-source and widely used database management system in today’s world. Most organizations prefer building data warehouses or reporting databases on the PostgreSQL database engine. That’s why analytic functions are necessary for efficient work. We hope that the current article was helpful for you.

To learn more about analytic (window) functions, refer to the official documentation from PostgreSQL.

Aveek Das
Latest posts by Aveek Das (see all)

Aveek Das

Aveek is an experienced Data and Analytics Engineer, currently working in Dublin, Ireland. His main areas of technical interest include SQL Server, SSIS/ETL, SSAS, Python, Big Data tools like Apache Spark, Kafka, and cloud technologies such as AWS/Amazon and Azure. He is a prolific author, with over 100 articles published on various technical blogs, including his own blog, and a frequent contributor to different technical forums. In his leisure time, he enjoys amateur photography mostly street imagery and still life. Some glimpses of his work can be found on Instagram. You can also find him on LinkedIn.

Leave a Reply

Your email address will not be published. Required fields are marked *