Filtering Data in T-SQL

Total: 0 Average: 0

Introduction

In most scenarios when working with data, users do not require all the data in each instance of a query. T-SQL (and SQL of course) provide means of retrieving only part of a data set in a table or a similar structure. The WHERE, HAVING, TOP and OFFSET-FETCH clauses are used to accomplish this in different ways.

Predicate Logic

Predicate Logic is a foundational concept that is fundamental to Relational Database Management Systems. By definition, a predicate is an expression that evaluates to TRUE, FALSE or UNKNOWN. The result of such expression is often used on filtering clauses like WHERE, HAVING, and JOINS to determine the final result set extracted from a table or a set of tables.
Predicates are typically composed of a column or derivative of a column on one side of an operator which is compared with a value or expression on the other side. When executing an SQL statement which relies on a predicate to filter data, the database engine must evaluate the predicate and then return the result set defined by that predicate.

Filtering Rows

The very common WHERE clause is used to filter rows in T-SQL. WHERE is evaluated at the second stage of the Logical Query Processing sequence in SQL Server. The graphic in Fig. 1 shows the simplified order of clauses in Logical Query Processing. As soon as the table FROM which we must extract data is identified, we must then filter the data so that subsequent clauses are operating on the desired subset of data. This is more efficient than if done otherwise.

Fig. 1 Logical Query Processing

The simplest form of the WHERE clause is well known to many users of SQL and T-SQL. It is the first expression in Listing 1. Other forms are also shown in the Listing.

Aside from the equality sign which is the most commonly used operator in the WHERE clauses, many other operators are used in constructing expressions on the right-hand side of predicates. The most common operators are processed in the following order:

  1. Parenthesis ()
  2. Multiplication Division and Modulo (* , / , % – )
  3. Positive, Negative, Addition, Concatenation, Subtraction (+, -)
  4. Comparison Operators (=, >, <, >=, <=, <>,!= , !>, !<)
  5. NOT
  6. AND
  7. BETWEEN, IN, LIKE, OR
  8. Assignment (=)

This implies that in order to guarantee some level of predictability when using complex expressions in your predicate, you should make use of parenthesis to explicit determine which parts of the expression are executed first. There are also important things one must know about the columns used in the WHERE clause:

  1. The way the Query Optimizer harnesses indexes is often dependent on whether the column(s) referred to in the WHERE clause are indexed
  2. When you use the function as part of a WHERE clause, it is likely that this behavior will be impacted and you will see a drop in performance

Fig. 2 Index Column: Clustered Index Seek in Execution Plan

Fig. 3 Non-Index Column: Clustered Index Scan in Execution Plan

Filtering Groups of Rows

T-SQL provides a technique to organize a data set into groups of rows to perform aggregate operations on each group. Once these groups are achieved, it is then possible to filter the groups of rows using the HAVING Clause. The HAVING clause is essentially a way to filter groups in the same way the WHERE clause is used to filter rows.

The HAVING clause accepts predicates similar to the way the WHERE clause does except when the elements of the predicate are not columns themselves but aggregate expressions.

Listing 2 shows samples of the COUNT and SUM aggregate functions being used in the HAVING clause. It’s worth mentioning that the HAVING clause is used to filter groups of rows in a similar way as the WHERE clause is used to filter individual rows.

Sampling Data Sets

In practice, regular users of SQL Server try to get a quick view of the data in a table or something as simple as viewing the structure by attempting to retrieve the entire data set using SELECT * FROM TABx. When one hits a large table – this is typically not a good idea. Using the TOP clause one can quickly sample the data in a table. When used by itself, the result set returned by the TOP clause is non-deterministic but when it’s used with the ORDER BY clause, you get the top N rows or the top N percent of rows based on the column used in the ORDER BY clause.

 

Fig. 4 TOP 5 Rows without ORDER BY


Fig. 5 TOP 5 Rows with ORDER BY


Fig. 6 TOP 5 Rows without ties


Fig. 7 TOP 5 Rows with ties

The WITH TIES clause includes rows with the same rank (based on a column in the ORDER BY clause). Figs 4, 5, 6 and 7 show the use of the TOP clause in the various ways already described. Also, note as shown in Listing 3 that TOP can be used in UPDATE statements.

Paging Data Sets

In addition to the TOP clause, we also have a use case where a user navigating a front-end application needs to sift through millions of rows of data. We find that attempting to load the entire data set at a go would be inefficient. The TOP clause allows us to view the first N rows of data or the first N PERCENT of data but we may need to see the NEXT N and so own. The OFFSET-FETCH command allows us to do this.


Fig. 8 First Page of an OFFSET-FETCH Query


Fig. 9 Second Page of an OFFSET-FETCH Query

The largest cost associated with a query involving the TOP or OFFSET-FETCH clauses is the sorting. It is also important to realize that internally, SQL Server is using the same underlying mechanism to realize the OFFSET-FETCH results as it does for a TOP query. TOP is unique to Microsoft products while OFFSET-FETCH is ANSI-compliant.

Fig. 10 Execution Plan for a TOP Query


Fig. 11 Execution Plan for an OFFSET-FETCH Query

Conclusion

In this article, we have seen the various ways T-SQL allows us to extract subsets of data based on the conditions we define. In the case of the WHERE and HAVING clauses, a predicate is explicitly required. The TOP and FETCH simply return subsets of the data set based on some order.

References

  1. TOP (Transact-SQL)
  2. SELECT – ORDER BY Clause
  3. Ben-Gan, I. (2016) T-SQL Fundamentals. Microsoft Press

Kenneth Igiri

Kenneth Igiri is a Database Administrator with eProcess International S.A., Ecobank Group's Shared Services Centre. Kenneth has over eight years' experience with SQL Server and Oracle databases as well as related technologies. His interests include database performance, HADR, and recently, Cloud.
Also, Kenneth teaches at Children's Church and writes fiction. You can connect with Kenneth via his blog or social network account.
Kenneth Igiri

Latest posts by Kenneth Igiri (see all)