Do you enjoy parsing strings? If so, one of the indispensable string functions to use is SQL SUBSTRING. It is one of those skills a developer should have for any language.
So, how do you do it?
Important Points in String Parsing
Assume you’re new to parsing. What important points do you need to remember?
- Know what information is embedded within the string.
- Get the exact positions of each piece of information in a string. You may have to count all characters within the string.
- Know the size or length of each information piece in a string.
- Use the right string function that can extract each piece of information in the string easily.
Knowing all these factors will prepare you for using SQL SUBSTRING() and passing arguments to it.
SQL SUBSTRING Syntax
The syntax of SQL SUBSTRING is as follows:
SUBSTRING(string expression, start, length)
- string expression – a literal string or an SQL expression that returns a string.
- start – a number where extraction will start. It is also 1-based – the first character in the string expression argument must start with 1, not 0. In SQL Server, it is always a positive number. In MySQL or Oracle, however, it can be positive or negative. If negative, the scanning starts from the end of the string.
- length – the length of characters to extract. SQL Server requires it. In MySQL or Oracle, it is optional.
4 SQL SUBSTRING Examples
1. Using SQL SUBSTRING to Extract from a Literal String
Let’s start with a simple example using a literal string. We use the name of a famous Korean girl group, BlackPink, and Figure 1 illustrates how SUBSTRING will work:
The code below shows how we’ll extract it:
-- extract 'black' from BlackPink (English) SELECT SUBSTRING('BlackPink',1,5) AS result
Now, let’s also inspect the result set in Figure 2:
Isn’t it easy?
To extract Black from BlackPink, you start from position 1 and end in position 5. Since BlackPink is Korean, let’s find it out if SUBSTRING works on Unicode Korean characters.
(DISCLAIMER: I can’t speak, read, or write Korean, so I got the Korean translation from Wikipedia. I also used Google Translate to see which characters correspond to Black and Pink. Please forgive me if it’s wrong. Still, I hope that the point I am trying to clarify comeі across)
Let’s get have the string in Korean (see Figure 3). The Korean characters used translates to BlackPink:
Now, see the code below. We will extract two characters corresponding to Black.
-- extract 'black' from BlackPink (Korean) SELECT SUBSTRING(N'블랙핑크',1,2) AS result
Did you notice the Korean string preceded by N? It uses Unicode characters, and the SQL Server assumes NVARCHAR and should be preceded by N. That’s the only difference in the English version. But will it run just fine? See Figure 4:
It ran without errors.
2. Using SQL SUBSTRING in MySQL With a Negative Start Argument
Having a negative start argument won’t work in SQL Server. But we can have an example of this using MySQL. This time, let’s extract Pink from BlackPink. Here’s the code:
-- Extract 'Pink' from BlackPink using MySQL Substring (English) select substring('BlackPink',-4,4) as result;
Now, let’s have the result in Figure 5:
Since we passed -4 to the start parameter, the extraction started from the end of the string, going 4 characters backward. To achieve the same result in SQL Server, use the RIGHT() function.
Unicode characters also work with MySQL SUBSTRING, as you can see in Figure 6:
It worked just fine. But did you notice that we didn’t need to precede the string with N? Also, note that there are several ways to get a substring in MySQL. You’ve already seen SUBSTRING. The equivalent functions in MySQL are SUBSTR() and MID().
3. Parsing Substrings with Variable Start and Length Arguments
Sadly, not all string extractions use fixed start and length arguments. In such a case, you need CHARINDEX to get the position of a string you are targeting. Let’s have an example:
DECLARE @lineString NVARCHAR(30) = N'김제니 01/16/1996@jennierubyjane' DECLARE @name NVARCHAR(5) DECLARE @bday DATE DECLARE @instagram VARCHAR(20) SET @name = SUBSTRING(@lineString,1,CHARINDEX('@',@lineString)-11) SET @bday = SUBSTRING(@lineString,CHARINDEX('@',@lineString)-10,10) SET @instagram = SUBSTRING(@lineString,CHARINDEX('@',@lineString),30) SELECT @name AS [Name], @bday AS [BirthDate], @instagram AS [InstagramAccount]
In the code above, you need to extract a name in Korean, the birthdate, and the Instagram account.
We start with defining three variables to hold those pieces of information. After that, we can parse the string and assign the results to each variable.
You may think that having fixed starts and lengths is simpler. Besides, we can pinpoint it by counting the characters manually. But what if you have lots of these on a table?
Here’s our analysis:
- The only fixed item in the string is the @ character in the Instagram account. We can get its position in the string using CHARINDEX. Then, we use that position to get the start and lengths of the rest.
- The birthdate is in a fixed format using MM/dd/yyyy with 10 characters.
- To extract the name, we start at 1. Since the birthdate has 10 characters plus the @ character, you can get to the ending character of the name in the string. From the position of the @ character, we go 11 characters back. The SUBSTRING(@lineString,1,CHARINDEX(‘@’,@lineString)-11) is the way to go.
- To get the birthdate, we apply the same logic. Get the position of the @ character and move 10 characters backward to get the birthdate start value. 10 is a fixed length. SUBSTRING(@lineString,CHARINDEX(‘@’,@lineString)-10,10) is how to get the birthdate.
- Finally, getting an Instagram account is straightforward. Start from the position of the @ character using CHARINDEX. Note: 30 is the Instagram username limit.
Check out the results in Figure 7:
4. Using SQL SUBSTRING in a SELECT Statement
You can also use SUBSTRING in the SELECT statement, but first, we need to have working data. Here’s the code:
SELECT CAST(P.LastName AS CHAR(50)) + CAST(P.FirstName AS CHAR(50)) + CAST(ISNULL(P.MiddleName,'') AS CHAR(50)) + CAST(ea.EmailAddress AS CHAR(50)) + CAST(a.City AS CHAR(30)) + CAST(a.PostalCode AS CHAR(15)) AS line INTO PersonContacts FROM Person.Person p INNER JOIN Person.EmailAddress ea ON P.BusinessEntityID = ea.BusinessEntityID INNER JOIN Person.BusinessEntityAddress bea ON P.BusinessEntityID = bea.BusinessEntityID INNER JOIN Person.Address a ON bea.AddressID = a.AddressID
The above code makes a long string containing the name, email address, city, and postal code. We also want to store it in the PersonContacts table.
Now, let’s have the code to reverse engineer using SUBSTRING:
SELECT TRIM(SUBSTRING(line,1,50)) AS [LastName] ,TRIM(SUBSTRING(line,51,50)) AS [FirstName] ,TRIM(SUBSTRING(line,101,50)) AS [MiddleName] ,TRIM(SUBSTRING(line,151,50)) AS [EmailAddress] ,TRIM(SUBSTRING(line,201,30)) AS [City] ,TRIM(SUBSTRING(line,231,15)) AS [PostalCode] FROM PersonContacts pc ORDER BY LastName, FirstName
Since we used fixed-size columns, there’s no need to use CHARINDEX.
Using SQL SUBSTRING in a WHERE Clause – A Performance Trap?
It’s true. No one can stop you from using SUBSTRING in a WHERE clause. It is a valid syntax. But what if it causes performance problems?
That’s why we prove it with an example and then discuss how to fix this issue. But first, let’s prepare our data:
USE AdventureWorks GO SELECT * INTO SalesOrders FROM Sales.SalesOrderHeader soh
I can’t mess up the SalesOrderHeader table, so I dumped it to another table. Then, I made the SalesOrderID in the new SalesOrders table a primary key.
Now, we are ready for the query. I am using dbForge Studio for SQL Server with Query Profiling Mode ON to analyze the queries.
SELECT so.SalesOrderID ,so.OrderDate ,so.CustomerID ,so.AccountNumber FROM SalesOrders so WHERE SUBSTRING(so.AccountNumber,4,4) = '4030'
As you see, the above query runs fine. Now, look at the Query Profile Plan Diagram in Figure 8:
The plan diagram looks simple, but let’s inspect the properties of the Clustered Index Scan node. Particularly, we need the Runtime Information:
Illustration 9 shows 785 * 8KB pages read by the database engine. Notice also that the Actual Rows Read is 31,411. It is the total number of rows in the table. However, the query returned only 27,605 Actual Rows.
The entire table was read using the clustered index as a reference.
The thing is, the SQL Server needs to know if 4030 is a substring of an Account number. It must read and evaluate each record. Discard the rows that are not equal and return the rows that we need. It gets the job done but not fast enough.
What can we do to make it run faster?
Avoid SUBSTRING in the WHERE Clause and Achieve the Same Result Quicker
What we want now is to get the same result without using SUBSTRING in the WHERE clause. Follow the below steps:
- Alter the table by adding a computed column with a SUBSTRING(AccountNumber, 4,4) formula. Let’s name it AccountCategory for a lack of a better term.
- Create a non-clustered index for the new AccountCategory column. Include the OrderDate, AccountNumber, and CustomerID columns.
We change the WHERE clause of the query to adapt the new AccountCategory column:
SET STATISTICS IO ON SELECT so.SalesOrderID ,so.OrderDate ,so.CustomerID ,so.AccountNumber FROM SalesOrders so WHERE so.AccountCategory = '4030' SET STATISTICS IO OFF
There is no SUBSTRING in the WHERE clause. Now, let’s check the Plan Diagram:
The Index Scan has been replaced by Index Seek. Notice also that the SQL Server used the new index on the computed column. Are there also changes in logical reads and actual rows read? See Figure 11:
Reducing from 785 to 222 logical reads is a great improvement, more than three times less than the original logical reads. It also minimized Actual Rows Read to only those rows we need.
Thus, using SUBSTRING in the WHERE clause is not good for performance, and it goes for any other scalar-valued function used in the WHERE clause.
- Developers can’t avoid parsing strings. A need for it will arise one way or another.
- In parsing strings, it’s essential to know the information within the string, the positions of each piece of information, and their sizes or lengths.
- One of the parsing functions is SQL SUBSTRING. It only needs the string to parse, the position to start extraction, and the length of the string to extract.
- SUBSTRING can have different behaviors between SQL flavors like SQL Server, MySQL, and Oracle.
- You can use SUBSTRING with literal strings and strings in table columns.
- We also used SUBSTRING with Unicode characters.
- Using SUBSTRING or any scalar-valued function in the WHERE clause can reduce query performance. Fix this with an indexed computed column.
If you find this post helpful, share it on your preferred social media platforms 👍, or share your comment below 👇
- Do You Make These Mistakes When Using SQL CURSOR? - January 13, 2021
- How to Parse Strings Like a Pro Using SQL SUBSTRING() - January 4, 2021
- Everything You Need to Know About SQL CTE in One Spot - December 21, 2020