Written by 19:38 Database development, Statements

How to Parse Strings Like a Pro Using SQL SUBSTRING() Function?

Do you enjoy parsing strings? If so, one of the indispensable string functions to use is SQL SUBSTRING. It is one of those skills a developer should have for any language.

So, how do you do it?

Important Points in String Parsing

Assume you’re new to parsing. What important points do you need to remember?

  1. Know what information is embedded within the string.
  2. Get the exact positions of each piece of information in a string. You may have to count all characters within the string.
  3. Know the size or length of each information piece in a string.
  4. Use the right string function that can extract each piece of information in the string easily.

Knowing all these factors will prepare you for using SQL SUBSTRING() and passing arguments to it.

SQL SUBSTRING Syntax

CodingSight - How to Parse Strings Like a Pro Using SQL SUBSTRING()

The syntax of SQL SUBSTRING is as follows:

SUBSTRING(string expression, start, length)

  • string expression – a literal string or an SQL expression that returns a string.
  • start – a number where extraction will start. It is also 1-based – the first character in the string expression argument must start with 1, not 0. In SQL Server, it is always a positive number. In MySQL or Oracle, however, it can be positive or negative. If negative, the scanning starts from the end of the string.
  • length – the length of characters to extract. SQL Server requires it. In MySQL or Oracle, it is optional.

4 SQL SUBSTRING Examples

1. Using SQL SUBSTRING to Extract from a Literal String

Let’s start with a simple example using a literal string. We use the name of a famous Korean girl group, BlackPink, and Figure 1 illustrates how SUBSTRING will work:

Extracting Black from BlackPink. To get the substring, you start with the character 1 and scan for 5 characters (see shaded blocks).
Figure 1. Extracting Black from BlackPink. To get the substring, you start with the character 1 and scan for 5 characters (see shaded blocks).

The code below shows how we’ll extract it:

-- extract 'black' from BlackPink (English)
SELECT SUBSTRING('BlackPink',1,5) AS result

Now, let’s also inspect the result set in Figure 2:

Result of extracting Black from BlackPink
Figure 2. Result of extracting Black from BlackPink

Isn’t it easy?

To extract Black from BlackPink, you start from position 1 and end in position 5. Since BlackPink is Korean, let’s find it out if SUBSTRING works on Unicode Korean characters.

(DISCLAIMER: I can’t speak, read, or write Korean, so I got the Korean translation from Wikipedia. I also used Google Translate to see which characters correspond to Black and Pink. Please forgive me if it’s wrong. Still, I hope that the point I am trying to clarify comeі across)

Let’s get have the string in Korean (see Figure 3). The Korean characters used translates to BlackPink:

Extracting Black from BlackPink in Korean. Black includes two Korean characters – see shaded blocks
Figure 3. Extracting Black from BlackPink in Korean. Black includes two Korean characters – see shaded blocks

Now, see the code below. We will extract two characters corresponding to Black.

-- extract 'black' from BlackPink (Korean)
SELECT SUBSTRING(N'블랙핑크',1,2) AS result

Did you notice the Korean string preceded by N? It uses Unicode characters, and the SQL Server assumes NVARCHAR and should be preceded by N. That’s the only difference in the English version. But will it run just fine? See Figure 4:

The result of extracting Black from BlackPink in Korean.
Figure 4. The result of extracting Black from BlackPink in Korean.

It ran without errors.

2. Using SQL SUBSTRING in MySQL With a Negative Start Argument

Having a negative start argument won’t work in SQL Server. But we can have an example of this using MySQL. This time, let’s extract Pink from BlackPink. Here’s the code:

-- Extract 'Pink' from BlackPink using MySQL Substring (English)
select substring('BlackPink',-4,4) as result;

Now, let’s have the result in Figure 5:

Extracting Pink from BlackPink with a negative start argument of -4 using MySQL.
Figure 5. Extracting Pink from BlackPink with a negative start argument of -4 using MySQL.

Since we passed -4 to the start parameter, the extraction started from the end of the string, going 4 characters backward. To achieve the same result in SQL Server, use the RIGHT() function.

Unicode characters also work with MySQL SUBSTRING, as you can see in Figure 6:

Extracting Pink from BlackPink in Korean.
Figure 6. Extracting Pink from BlackPink in Korean.

It worked just fine. But did you notice that we didn’t need to precede the string with N? Also, note that there are several ways to get a substring in MySQL. You’ve already seen SUBSTRING. The equivalent functions in MySQL are SUBSTR() and MID().

3. Parsing Substrings with Variable Start and Length Arguments

Sadly, not all string extractions use fixed start and length arguments. In such a case, you need CHARINDEX to get the position of a string you are targeting. Let’s have an example:

DECLARE @lineString NVARCHAR(30) = N'김제니 01/16/1996@jennierubyjane'
DECLARE @name NVARCHAR(5)
DECLARE @bday DATE
DECLARE @instagram VARCHAR(20)

SET @name = SUBSTRING(@lineString,1,CHARINDEX('@',@lineString)-11)
SET @bday = SUBSTRING(@lineString,CHARINDEX('@',@lineString)-10,10)
SET @instagram = SUBSTRING(@lineString,CHARINDEX('@',@lineString),30)

SELECT @name AS [Name], @bday AS [BirthDate], @instagram AS [InstagramAccount]

In the code above, you need to extract a name in Korean, the birthdate, and the Instagram account. By the way, if you are professional user of Instagram you should use Instagram post maker by Adobe.

We start with defining three variables to hold those pieces of information. After that, we can parse the string and assign the results to each variable.

You may think that having fixed starts and lengths is simpler. Besides, we can pinpoint it by counting the characters manually. But what if you have lots of these on a table?

Here’s our analysis:

  • The only fixed item in the string is the @ character in the Instagram account. We can get its position in the string using CHARINDEX. Then, we use that position to get the start and lengths of the rest.
  • The birthdate is in a fixed format using MM/dd/yyyy with 10 characters.
  • To extract the name, we start at 1. Since the birthdate has 10 characters plus the @ character, you can get to the ending character of the name in the string. From the position of the @ character, we go 11 characters back. The SUBSTRING(@lineString,1,CHARINDEX(‘@’,@lineString)-11) is the way to go.
  • To get the birthdate, we apply the same logic. Get the position of the @ character and move 10 characters backward to get the birthdate start value. 10 is a fixed length. SUBSTRING(@lineString,CHARINDEX(‘@’,@lineString)-10,10) is how to get the birthdate.
  • Finally, getting an Instagram account is straightforward. Start from the position of the @ character using CHARINDEX. Note: 30 is the Instagram username limit.

Check out the results in Figure 7:

Result of extracting name, birthdate, and Instagram account in a string.
Figure 7. Result of extracting name, birthdate, and Instagram account in a string.

4. Using SQL SUBSTRING in a SELECT Statement

You can also use SUBSTRING in the SELECT statement, but first, we need to have working data. Here’s the code:

SELECT
 CAST(P.LastName AS CHAR(50)) 
 + CAST(P.FirstName AS CHAR(50)) 
 + CAST(ISNULL(P.MiddleName,'') AS CHAR(50)) 
 + CAST(ea.EmailAddress AS CHAR(50)) 
 + CAST(a.City AS CHAR(30)) 
 + CAST(a.PostalCode AS CHAR(15)) AS line
INTO PersonContacts
FROM Person.Person p
INNER JOIN Person.EmailAddress ea 
  ON P.BusinessEntityID = ea.BusinessEntityID
INNER JOIN Person.BusinessEntityAddress bea 
  ON P.BusinessEntityID = bea.BusinessEntityID
INNER JOIN Person.Address a 
  ON bea.AddressID = a.AddressID

The above code makes a long string containing the name, email address, city, and postal code. We also want to store it in the PersonContacts table.

Now, let’s have the code to reverse engineer using SUBSTRING:

SELECT
 TRIM(SUBSTRING(line,1,50)) AS [LastName]
,TRIM(SUBSTRING(line,51,50)) AS [FirstName]
,TRIM(SUBSTRING(line,101,50)) AS [MiddleName]
,TRIM(SUBSTRING(line,151,50)) AS [EmailAddress]
,TRIM(SUBSTRING(line,201,30)) AS [City]
,TRIM(SUBSTRING(line,231,15)) AS [PostalCode]
FROM PersonContacts pc
ORDER BY LastName, FirstName

Since we used fixed-size columns, there’s no need to use CHARINDEX.

Using SQL SUBSTRING in a WHERE Clause – A Performance Trap?

It’s true. No one can stop you from using SUBSTRING in a WHERE clause. It is a valid syntax. But what if it causes performance problems?

That’s why we prove it with an example and then discuss how to fix this issue. But first, let’s prepare our data:

USE AdventureWorks
GO

SELECT * INTO SalesOrders FROM Sales.SalesOrderHeader soh

I can’t mess up the SalesOrderHeader table, so I dumped it to another table. Then, I made the SalesOrderID in the new SalesOrders table a primary key.

Now, we are ready for the query. I am using dbForge Studio for SQL Server with Query Profiling Mode ON to analyze the queries.

SELECT
 so.SalesOrderID
,so.OrderDate
,so.CustomerID
,so.AccountNumber
FROM SalesOrders so
WHERE SUBSTRING(so.AccountNumber,4,4) = '4030'

As you see, the above query runs fine. Now, look at the Query Profile Plan Diagram in Figure 8:

Plan diagram of the query with a SUBSTRING in the WHERE clause.
Figure 8. Plan diagram of the query with a SUBSTRING in the WHERE clause.

The plan diagram looks simple, but let’s inspect the properties of the Clustered Index Scan node. Particularly, we need the Runtime Information:

Logical reads are 785 from a query with a SUBSTRING in the WHERE clause. Rows read are also far from Actual rows returned
Figure 9. Logical reads are 785 from a query with a SUBSTRING in the WHERE clause. Rows read are also far from Actual rows returned

Illustration 9 shows 785 * 8KB pages read by the database engine. Notice also that the Actual Rows Read is 31,411. It is the total number of rows in the table. However, the query returned only 27,605 Actual Rows.

The entire table was read using the clustered index as a reference.

Why?

The thing is, the SQL Server needs to know if 4030 is a substring of an Account number. It must read and evaluate each record. Discard the rows that are not equal and return the rows that we need. It gets the job done but not fast enough.

What can we do to make it run faster?

Avoid SUBSTRING in the WHERE Clause and Achieve the Same Result Quicker

What we want now is to get the same result without using SUBSTRING in the WHERE clause. Follow the below steps:

  • Alter the table by adding a computed column with a SUBSTRING(AccountNumber, 4,4) formula. Let’s name it AccountCategory for a lack of a better term.
  • Create a non-clustered index for the new AccountCategory column. Include the OrderDate, AccountNumber, and CustomerID columns.

That’s it.

We change the WHERE clause of the query to adapt the new AccountCategory column:

SET STATISTICS IO ON

SELECT
 so.SalesOrderID
,so.OrderDate
,so.CustomerID
,so.AccountNumber
FROM SalesOrders so
WHERE so.AccountCategory = '4030'

SET STATISTICS IO OFF

There is no SUBSTRING in the WHERE clause. Now, let’s check the Plan Diagram:

The Index Scan has been replaced by Index Seek. SQL Server also used the new index on the computed column
Figure 10. The Index Scan has been replaced by Index Seek. SQL Server also used the new index on the computed column

The Index Scan has been replaced by Index Seek. Notice also that the SQL Server used the new index on the computed column. Are there also changes in logical reads and actual rows read? See Figure 11:

Logical reads have improved using the indexed computed column. And Rows read and returned are now equal
Figure 11. Logical reads have improved using the indexed computed column. And Rows read and returned are now equal

Reducing from 785 to 222 logical reads is a great improvement, more than three times less than the original logical reads. It also minimized Actual Rows Read to only those rows we need.

Thus, using SUBSTRING in the WHERE clause is not good for performance, and it goes for any other scalar-valued function used in the WHERE clause.

Conclusion

  • Developers can’t avoid parsing strings. A need for it will arise one way or another.
  • In parsing strings, it’s essential to know the information within the string, the positions of each piece of information, and their sizes or lengths.
  • One of the parsing functions is SQL SUBSTRING. It only needs the string to parse, the position to start extraction, and the length of the string to extract.
  • SUBSTRING can have different behaviors between SQL flavors like SQL Server, MySQL, and Oracle.
  • You can use SUBSTRING with literal strings and strings in table columns.
  • We also used SUBSTRING with Unicode characters.
  • Using SUBSTRING or any scalar-valued function in the WHERE clause can reduce query performance. Fix this with an indexed computed column.

If you find this post helpful, share it on your preferred social media platforms, or share your comment below?

Tags: , , , Last modified: June 12, 2023
Close