We often see poorly written complex SQL queries running against a table or tables in databases. Those queries make the time of execution very long and cause consuming huge CPU and other resources. Still, complex queries provide valuable information to the application/person running them in many cases. Therefore, they are useful assets in all varieties of applications.
Complex queries are hard to debug
If we look closely at problematic queries, many of them are complex, especially those specific ones used in reports.
Complex queries often consist of five or more large tables and are joined together by many sub-queries. Each sub-query has a WHERE clause that carries out simple to complex calculations and/or data transformations while joining the columns relevant tables together.
Such queries may become challenging to debug without consuming a lot of resources. The reason is, it is difficult to determine whether each sub-query and/or joined sub-queries produce correct results.
A typical scenario is: they call you late at night to solve a problem on a busy database server with a complex query involved, and you need to fix it quickly. As a Developer or DBA, you may have very limited time and system resources available at a late hour. Thus, the first thing you need is a plan on how to debug the problematic query.
Sometimes, the debugging procedure goes well. Sometimes, it takes lots of time and effort before you reach the goal and solve the problem.
Writing Queries in CTE structure
But what if there was a way to write complex queries so that one could debug them quickly, piece by piece?
There is such a way. It is called Common Table Expression or CTE.
Common Table Expression is a standard feature in most modern databases like SQLServer, MySQL (as of version 8.0), MariaDB (version 10.2.1), Db2, and Oracle. It has a simple structure that encapsulates one or many sub-queries into a temporary named result set. You can use this result set further in other named CTEs or sub-queries.
A Common Table Expression is, to a certain extent, a VIEW that only exists and is referenced by the query at the time of execution.
Transforming a complex query to a CTE style query calls for some structured thinking. The same goes for OOP with encapsulation when rewriting a complex query into a CTE structure.
You need to think about:
- Each set of data that you are pulling from each table.
- How they are joined together to encapsulate the closest sub-queries into one temporary named result set.
Repeat it for each sub-query and set of data remaining until you reach the final result of the query. Note that each temporary named result set is also a sub-query.
The final part of the query should be a very “simple” select, returning the final result to the application. Once you have reached this final part, you can exchange it with a query that selects the data from an individually named temporary result set.
This way, the debugging of each temporary result set becomes an easy job.
To understand how we can build our queries from simple to complex, let’s look at the CTE structure. The simplest form is as follows:
WITH CTE_1 as ( select .... from some_table where ... ) select ... from CTE_1 where ...
Here CTE_1 is a unique name you give to the temporary named result set. There can be as many result sets as needed. By that, the form extends to, as shown below:
WITH CTE_1 as ( select .... from some_table where ... ), CTE_2 as ( select .... from some_other_table where ... ) select ... from CTE_1 c1,CTE_2 c2 where c1.col1 = c2.col1 ....
At first, each CTE part is created separately. Then it progresses, as CTE’s are linked together to build up the final result set of the query.
Now, let us examine another case, querying a fictional Sales database. We want to know what products, including quantity and total sales, were sold in each category the previous month, and which of them got more total sales than the month before that.
We construct our query into several CTE parts, where each part references the previous one. First, we construct a result set to list out the detailed data we need from our tables to form the rest of the query:
WITH detailed_data as ( select o.order_date, c.category_name,p.product_name,oi.quantity, oi.listprice, oi.discount from Orders o, Order_Item oi, Products p, Category c where o.order_id = oi.order_id and oi.product_id = p.product_id and p.category_id = c.category_id ) select dt.* from detailed_data dt. order by dt.order_date desc, dt.category_name, dt.product_name
The next step is to summarize the quantity and total sales data by each category and product names:
WITH detailed_data as ( select o.order_date, c.category_name,p.product_name,oi.quantity, oi.listprice, oi.discount from Orders o, Order_Item oi, Products p, Category c where o.order_id = oi.order_id and oi.product_id = p.product_id and p.category_id = c.category_id ), product_sales as ( select year(dt.order_date) year, month(dt.order_date) month, dt.category_name,dt.product_name,sum(dt.quantity) total_quantity, sum(dt.listprice * (1 - dt.discount)) total_product_sales from detailed_data dt group by year(dt.order_date) year, month(dt.order_date) month, dt.category_name,dt.product_name ) select ps.* from product_sales ps order by ps.year desc, ps.month desc, ps.category_name,ps.product_name
The final step is to create two temporary result sets representing the last month’s and the preceding month’s data. After that, filter out the data to be returned as the final result set:
WITH detailed_data as ( select o.order_date, c.category_name,p.product_name,oi.quantity, oi.listprice, oi.discount from Orders o, Order_Item oi, Products p, Category c where o.order_id = oi.order_id and oi.product_id = p.product_id and p.category_id = c.category_id ), product_sales as ( select year(dt.order_date) year, month(dt.order_date) month, dt.category_name,dt.product_name,sum(dt.quantity) total_quantity, sum(dt.listprice * (1 - dt.discount)) total_product_sales from detailed_data dt group by year(dt.order_date) year, month(dt.order_date) month, dt.category_name,dt.product_name ), last_month_data ( select ps.* from product_sales ps. where ps.year = year(CURRENT_DATE) -1 and ps.month = month(CURRENT_DATE) -1 ), prev_month_data ( select ps.* from product_sales ps. where ps.year = year(CURRENT_DATE) -2 and ps.month = month(CURRENT_DATE) -2 ) select lmd.* from last_month_data lmd, prev_month_data pmd where lmd.category_name = pmd.category_name and lmd.product_name = pmd.product_name and ( lmd.total_quantity > pmd.total_quantity or lmd.total_product_sales > pmd.total_product_sales ) order by lmd.year desc, lmd.month desc, lmd.category_name,lmd.product_name, lmd.total_product_sales desc, lmd.total_quantity desc
Note that in SQLServer you set getdate() instead of CURRENT_DATE.
This way, we can exchange the last part with a select that queries individual CTE parts to see the result of a selected part. As a result, we can quickly debug the problem.
Also, by executing an explain on each CTE part (and the whole query), we estimate how well each part and/or the whole query will perform on the tables and data.
Correspondingly, you can optimize each part by rewriting and/or adding proper indexes to the tables involved. Then you explain the whole query to see the final query plan and proceed with optimization if needed.
Recursive queries using CTE structure
Another useful feature of CTE is creating recursive queries.
Recursive SQL queries let you achieve things you would not imagine possible with this type of SQL and its rate. You can solve many business problems and even rewrite some complex SQL/application logic down to a simple recursive SQL-call to the database.
There are slight variations in creating recursive queries between database systems. However, the goal is the same.
A few examples of the usefulness of recursive CTE:
- You can use it to find gaps in data.
- You can create organization charts.
- You can create pre-computed data to use further in another CTE part
- Finally, you can create test data.
The word recursive says it all. You have a query that repeatedly calls itself with some starting point, and, EXTREMELY IMPORTANT, an ending point (a fail-safe exit as I call it).
If you do not have a fail-safe exit, or your recursive formula goes beyond it, you are in deep trouble. The query will go into an infinite loop resulting in very high CPU and very high LOG utilization. It will lead to memory and/or storage exhaustion.
If your query goes haywire, you must think very fast to disable it. If you cannot do so, alert your DBA immediately, so they prevent the database system from choking, killing the runaway thread.
See the example:
with RECURSIVE mydates (level,nextdate) as ( select 1 level, FROM_UNIXTIME(RAND()*2147483647) nextdate from DUAL union all select level+1, FROM_UNIXTIME(RAND()*2147483647) nextdate from mydates where level < 1000 ) SELECT nextdate from mydates );
This example is a MySQL/MariaDB recursive CTE syntax. With it, we produce one thousand random dates. The level is our counter and fail-safe exit to exit the recursive query safely.
As demonstrated, line 2 is our starting point, while lines 4-5 are the recursive call with the ending point in the WHERE clause (line 6). Lines 8 and 9 are the calls in executing the recursive query and retrieving the data.
DECLARE @today as date; DECLARE @1stjanprevyear as date; select @today = DATEADD(DAY, 0, DATEDIFF(DAY, 0, getdate())), @1stjanprevyear = DATEFROMPARTS(YEAR(GETDATE())-1, 1, 1) ; WITH DatesCTE as ( SELECT @1stjanprevyear as CalendarDate UNION ALL SELECT dateadd(day , 1, CalendarDate) AS CalendarDate FROM DatesCTE WHERE dateadd (day, 1, CalendarDate) < @today ), MaxMinDates as ( SELECT Max(CalendarDate) MaxDate,Min(CalendarDate) MinDate FROM DatesCTE ) SELECT i.* FROM InvoiceTable i, MaxMinDates t where i.INVOICE_DATE between t.MinDate and t.MaxDate OPTION (MAXRECURSION 1000);
This example is an SQLServer syntax. Here, we let the DatesCTE part produce all dates between today and January 1st of the previous year. We use it to return all Invoices belonging to those dates.
The starting point is the @1stjanprevyear variable and the fail-safe exit @today. A maximum of 730 days is possible. Thus, the maximum recursion option is set to 1000 to make sure it stops.
We could even skip the MaxMinDates part and write the final part, as shown below. It can be a faster approach, as we have a matching WHERE clause.
.... SELECT i.* FROM InvoiceTable i, DatesCTE t where i.INVOICE_DATE = t.CalendarDate OPTION (MAXRECURSION 1000);
Altogether, we have briefly discussed and shown how to transform a complex query into a CTE structured query. When a query is broken into different CTE parts, you can use them in other parts and call independently in the final SQL query for debugging purposes.
Another key point is that using CTE makes it simpler to debug a complex query when it is broken into manageable parts, to return the correct and expected result set. It is important to realize that running an explain on each query part and the whole query is crucial to ensure that the query and the DBMS run as optimal as possible.
I have also illustrated writing a powerful recursive CTE query/part in generating data on the fly to use further in a query.
Notably, when writing a recursive query, be VERY careful NOT to forget the fail-safe exit. Make sure to double-check the calculations used in the fail-safe exit to produce a stop signal and/or use the maxrecursion option that SQLServer provides.
Similarly, other DBMS may either use cte_max_recursion_depth (MySQL 8.0) or max_recursive_iterations (MariaDB 10.3) as additional fail-safe exits.