Virtual Columns and Functional Indexes

Total: 1 Average: 5

Much too often, we see poorly written complex SQL queries running against the database tables. Such queries may take a very short or a very long time to execute, but they consume a huge amount of CPU and other resources. Nevertheless, in many cases, complex queries provide valuable information to the application/person. Therefore, it brings useful assets in all varieties of applications.

Complexity of queries

Let’s look closer at problematic queries. Many of them are complex. That may be due to several reasons:

  1. The datatype chosen for the data;
  2. The organization and storage of the data in the database;
  3. Transformation and joining of the data in a query to retrieve the desired result set.

You need to think these three key factors out properly and implemented them correctly to make queries perform optimally.

However, it may become a nearly impossible task for both Database Developers and DBAs. For example, it can be exceptionally difficult to add new functionality to existing Legacy systems. A particularly complicated case is when you need to extract and transform the data from a legacy system so that you could compare it to the data produced by the new system or functionality. You have to achieve it without affecting the legacy application functionality.

Such queries may involve complex joins, like the following:

  1. A combination of substring and/or concatenation of several data columns;
  2. Built-in scalar functions;
  3. Customized UDFs;
  4. Any combination of WHERE clause comparisons and search conditions.

Queries, as described earlier, usually have complex access paths. What is worse, they may have many table scans and/or full index scans with such combinations of JOINs or searches occurring.

Data transformation and manipulations in queries

We need to point out that all data stored persistently in a database table need transformation and/or manipulation at some point when we query that data from the table. The transformation can range from a simple transformation to a very complex one. Depending on how complex it may be, the transformation can consume a lot of CPU and resources.

In most cases, transformations done in JOINs happen after the data is read and offloaded to the tempdb database (SQL Server) or workfile database / temp-tablespaces as in other database systems.

Since the data in the workfile is not indexable, the time required to execute combined transformations and JOINs increases exponentially. The data retrieved becomes larger. Thus, resulting queries develop into a performance bottleneck through additional data growth.

So, how can a Database Developer or a DBA resolve those performance bottlenecks quickly and also provide themselves with more time to re-engineer and rewrite the queries for optimal performance?

There are two ways to solve such persistent problems effectively. One of them is by using virtual columns and/or functional indexes.

Functional indexes and queries

Normally, you create indexes on columns that either indicate a unique set of columns/values in a row (unique indexes or primary keys) or represent a set of columns/values that are or may be used in WHERE clause search conditions of a query.

If you do not have such indexes in place, and you have developed complex queries as described earlier, you will notice the following:

  1. Reduction in performance levels when using the explain query and seeing table scans or full index scans
  2. Very high CPU and resource usage caused by the queries;
  3. Long execution times.

Contemporary databases normally address these issues by allowing you to create a functional or function-based index, as named in SQLServer, Oracle, and MySQL (v 8.x). Or, it can be Index on expression/expression-based indexes, as in other databases (PostgreSQL and Db2).

Suppose you have a Purchase_Date column of the data type TIMESTAMP or DATETIME in your Order table, and that column has been indexed. We begin to query the Order table with a WHERE clause:

SELECT ...
FROM Order
WHERE DATE(Purchase_Date) = '03.12.2020'

This transaction will cause scanning of the whole index. However, if the column has not been indexed, you get a table scan.

After scanning the whole index, that index moves into tempdb / workfile (whole table if you get a table scan) before matching the value 03.12.2020.

As a large Order table uses lots of CPU and resources, you should create a functional index having the DATE expression (Purchase_Date) as one of the index columns and shown below:

CREATE ix_DatePurchased on sales.Order(Date(Purchase_Date) desc, ... )

In doing so, you make the matching predicate DATE (Purchase_Date) = ‘03.12.2020’ indexable. Thus, instead of moving the index or table to the tempdb / workfile before the matching of the value, we make the index only partially accessed and/or scanned. It results in lower CPU and resource usage.

Have a look at another example. There is a Customer table with the columns first_name, last_name. Those columns are indexed as such:

CREATE INDEX ix_custname on Customer(first_name asc, last_name asc),

Besides, you have a view that concatenates these columns into the customer_name column:

CREATE view v_CustomerInfo( customer_name, .... ) as
select first_name ||' '|| last_name as customer_name,.....
from Customer
where ...

You have a query from an eCommerce system that searches for the full customer name:

select c.*
from v_CustomerInfo c
where c.customer_name = 'John Smith'
....

Again, this query will produce a full index scan. In the worst-case scenario, it will be a full table scan moving all the data from the index or table to the workfile before the concatenation of the first_name and last_name columns and matching the ‘John Smith’ value.

Another case is creating a functional index as shown below:

CREATE ix_fullcustname on sales.Customer( first_name ||' '|| last_name desc, ... )

This way, you can make the concatenation in the view query into an indexable predicate. Instead of a full index scan or table scan, you have a partial index scan. Such a query execution results in lower CPU and resource usage, excluding the work in the workfile and thus ensuring faster execution time.

Virtual (generated) columns and queries

Generated columns (virtual columns or computed columns) are columns that hold the data generated on the fly. The data cannot be explicitly set to a specific value. It refers to the data in other columns queried, inserted, or updated in a DML query.

The values generation of such columns is automated based on an expression. These expressions might generate:

  1. A sequence of integer values;
  2. The value based on the values of other columns in the table;
  3. It might generate values by calling built-in functions or user-defined functions (UDFs). 

It is equally important to note that in some databases (SQLServer, Oracle, PostgreSQL, MySQL, and MariaDB) these columns can be configured to either persistently store the data with the INSERT and UPDATE statements execution, or execute the underlying column expression on the fly if we query the table and the column saving the storage space.

However, when the expression is complicated, as with complex logic in the UDF function, the savings of execution time, resources, and CPU query costs may not be as much as expected.

Thus, we can configure the column so it will persistently store the outcome of the expression in an INSERT or UPDATE statement. Then, we create a regular index on that column. This way, we’ll save the CPU, resource usage, and the query execution time. Again, it might be some slight increase in the INSERT and UPDATE performance, depending on the expression complexity.

Let’s look at an example. We declare the table and create an index as follows:

CREATE TABLE Customer as (
  customerID Int GENERATED ALWAYS AS IDENTITY,
  first_name VARCHAR(50) NOT NULL,
  last_name VARCHAR(50) NOT NULL,
  customer_name as (first_name ||' '|| last_name) PERSISTED
  ...
  );
CREATE ix_fullcustname on sales.Customer( customer_name desc, ... )

This way, we move the concatenation logic from the view in the previous example down into the table and store the data persistently. We retrieve the data using a matching scan on a regular index. It is the best possible outcome here.

By adding a generated column to a table and creating a regular index on that column, we can move the transformation logic down to the table level. Here, we persistently store the transformed data in insert or update statements which otherwise would be transformed in queries. The JOIN and the INDEX scans will be much simpler and faster.  

Functional indexes, generated columns, and JSON

Global web and mobile applications use light data structures such as JSON to move the data from the web/mobile device to the database and vice versa. The small footprint of JSON data structures makes the data transfer over the network quick and easy. It is easy to compress JSON to a very small size compared to other structures, i.e. XML. It can outperform structures in runtime parsing.

Due to the increased usage of JSON data structures, relational databases have the JSON storage format as either BLOB data type or CLOB data type. Both these types make the data in such columns unindexable as is.

For this reason, the database vendors introduced JSON functions to query and modify JSON objects, as you can easily integrate these functions into the SQL query or other DML commands. However, these queries depend on JSON objects’ complexity. They are very CPU and resource-consuming, as BLOB and CLOB objects need to be offloaded into memory, or, worse, into the workfile before querying and/or manipulation.

Assume that we have a Customer table with the Customer detail data stored as a JSON object in a column called CustomerDetail. We set up querying the table as below:

SELECT CustomerID,
  JSON_VALUE(CustomerDetail, '$.customer.Name') AS Name,
  JSON_VALUE(CustomerDetail, '$.customer.Surname') AS Surname,
  JSON_VALUE(CustomerDetail, '$.customer.address.PostCode') AS PostCode,
  JSON_VALUE(CustomerDetail, '$.customer.address."Address Line 1"') + ' '
  + JSON_VALUE(CustomerDetail, '$.customer.address."Address Line 2"') AS Address,
  JSON_QUERY(CustomerDetail, '$.customer.address.Country') AS Country
FROM Customer
WHERE ISJSON(CustomerDetail) > 0
  AND JSON_VALUE(CustomerDetail, '$.customer.address.Country') = 'Iceland'
  AND JSON_VALUE(CustomerDetail, '$.customer.address.PostCode') IN (101,102,110,210,220)
  AND Status = 'Active'
ORDER BY JSON_VALUE(CustomerDetail, '$.customer.address.PostCode')

In this example, we are querying the data for customers living in some parts of the Capital Region in Iceland.  All Active data should be retrieved into the workfile before applying the search predicate. Still, retrieval will result in too large CPU and resource usage.

Accordingly, there is an effective procedure to make JSON queries run faster. It involves utilizing the functionality through generated columns, as previously described.

We achieve the performance boost by adding generated columns. A generated column would search through the JSON document for specific data represented in the column using the JSON functions and store the value into the column.

We can index and query these generated columns using regular SQL where clause search conditions. Hence, searching for particular data in JSON objects becomes very fast.

We add two generated columns – Country and PostalCode:

ALTER TABLE Customer
ADD Country as JSON_VALUE(CustomerDetail,'$.customer.address.Country');
ALTER TABLE Customer
ADD PostCode as JSON_VALUE(CustomerDetail,'$.customer.address.PostCode');

CREATE INDEX ix_CountryPostCode on Country(Country asc,PostCode asc);

Also, we create a composite index on the specific columns. Now, we can change the query to the example displayed below:

SELECT CustomerID,
  JSON_VALUE(CustomerDetail, '$.customer.customer.Name') AS Name,
  JSON_VALUE(CustomerDetail, '$.customer.customer.Surname') AS Surname,
  JSON_VALUE(CustomerDetail, '$.customer.address.PostCode') AS PostCode,
  JSON_VALUE(CustomerDetail, '$.customer.address."Address Line 1"') + ' '
  + JSON_VALUE(CustomerDetail, '$.customer.address."Address Line 2"') AS Address,
  JSON_QUERY(CustomerDetail, '$.customer.address.Country') AS Country
FROM Customer
WHERE ISJSON(CustomerDetail) > 0
  AND Country = 'Iceland'
  AND PostCode IN (101,102,110,210,220)
  AND Status = 'Active'
ORDER BY JSON_VALUE(CustomerDetail, '$.customer.address.PostCode')

This limits the data retrieval to Active Customers only in some part of the Iceland Capital Region. This way is faster and more efficient than the previous query.

Conclusion

All in all, by applying virtual columns or functional indexes to tables that cause difficulties (CPU, and resource-heavy queries), we can eliminate problems fairly quickly.

Virtual columns and functional indexes can help with querying complex JSON objects stored in regular relational tables. However, we need to assess the problems carefully beforehand and make the necessary changes accordingly.

In some cases, if the query and/or JSON data structures are very complex, a part of the CPU and resource usage may shift from the queries to the INSERT / UPDATE processes. It gives us fewer overall CPU and resource savings than expected. If you experience similar issues, more thorough tables and queries redesign may be inevitable.

Tomas Helgi Johannsson

Tomas has over 20 years of experience as an Application Architect/Developer, Database Administrator, and Project Manager in a wide variety of business applications with a primary focus on performance. His main interest lies in client/server programming and distributed relational database design using DB2, Oracle, or MySQL/MariaDB.