Processing an array of values inside a procedure or function is a common requirement in most business cases. Since SQL Server does not support an array variable type, developers use a list of values (mostly in CSV) as input.
Table Valued Parameters (TVP) Instead of Arrays
The SQL Server 2008 introduced a feature called table-valued parameters (TVP). It enabled users to combine values in a table and process them in table format. Stored procedures or functions could use this variable in join operations. It let them improve performance and avoid cursor-like operations item-by-item.
Although the table-based approach has been available for a long time, lists of values as parameters are widely used for the following two reasons:
- Table-based variables are not supported by JDBC drivers. Java applications and services have to use comma-separated lists or structured formats like XML to transfer the list of values to the database server.
- Legacy code is still in production and needs to be maintained, and migration projects are too costly for execution.
To solve this problem, specialists apply several approaches. Some of them include writing a .NET CLR function or using XML. However, as the execution of .NET CLR functions might not be available in all environments, and XML is usually not the fastest solution, I will focus on the other two common approaches. They are the numbers table and common table expression (CTE) methods.
Split String into Array in SQL Server
The numbers table approach means that you must manually create a table containing enough rows so that the longest string you split will never exceed this number.
In this example, I am using 100.000 rows with a clustered index and compression on the generated column. This allows accelerating the data retrieval.
IF EXISTS ( SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'dbo.NumbersTest') ) BEGIN DROP TABLE NumbersTest; END; CREATE TABLE NumbersTest (Number INT NOT NULL); DECLARE @RunDate datetime = GETDATE() DECLARE @i INT = 1; WHILE @i <= 100000 BEGIN INSERT INTO dbo.NumbersTest(Number) VALUES (@i); SELECT @i = @i + 1; END; CREATE UNIQUE CLUSTERED INDEX n ON dbo.NumbersTest(Number) WITH (DATA_COMPRESSION = PAGE); GO
Notice that index compression can only be used on Enterprise versions of SQL Server. Otherwise, you do not apply this option during the index creation.
With a newly created NumbersTest table, we can write a custom function performing an array splitting functionality:
CREATE FUNCTION dbo.Split_Numbers ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS TABLE WITH SCHEMABINDING AS RETURN ( SELECT Item = SUBSTRING(@List, Number, CHARINDEX(@Delimiter, @List + @Delimiter, Number) - Number) FROM dbo.NumbersTest WHERE Number <= CONVERT(INT, LEN(@List)) AND SUBSTRING(@Delimiter + @List, Number, LEN(@Delimiter)) = @Delimiter);
On the other hand, if we use the CTE method, it will not require a numbers table. Instead, we will use a recursive CTE to extract each part of the string from the “remainder” of the previous part.
CREATE FUNCTION dbo.Split_CTE ( @List NVARCHAR(MAX), @Delimiter NVARCHAR(255) ) RETURNS @Items TABLE (Item NVARCHAR(4000)) WITH SCHEMABINDING AS BEGIN DECLARE @ll INT = LEN(@List) + 1, @ld INT = LEN(@Delimiter); WITH a AS ( SELECT [start] = 1, [end] = COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, 1), 0), @ll), [value] = SUBSTRING(@List, 1, COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, 1), 0), @ll) - 1) UNION ALL SELECT [start] = CONVERT(INT, [end]) + @ld, [end] = COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, [end] + @ld), 0), @ll), [value] = SUBSTRING(@List, [end] + @ld, COALESCE(NULLIF(CHARINDEX(@Delimiter, @List, [end] + @ld), 0), @ll)-[end]-@ld) FROM a WHERE [end] < @ll ) INSERT @Items SELECT [value] FROM a WHERE LEN([value]) > 0 OPTION (MAXRECURSION 0); RETURN; END
Now we need to test these two approaches. Let us start with a simple test case to check if they work properly:
DECLARE @Values NVARCHAR(MAX) =N'Value 1,Value 2,Value 3,Value 4,Value 5'; SELECT Item AS NumSplit FROM dbo.Split_Numbers (@Values, N','); SELECT Item AS CTESplit FROM dbo.Split_CTE (@Values, N',');
The Execution plan is below. As we see, functions use different methods to achieve the same result.
Since this is a very small dataset, performance cannot be compared. Let us do a test on a bigger set of values.
First, we need to create a table with test values. I will use the TestData table and populate it with different records, marking them by different types, depending on the category they are assigned to:
CREATE TABLE dbo.TestData ( string_typ INT, string_val NVARCHAR(MAX) ); CREATE CLUSTERED INDEX st ON dbo.TestData(string_len); CREATE TABLE #Temp(st NVARCHAR(MAX)); INSERT #Temp SELECT N'a,va,val,value,value1,valu,va,value12,,valu,value123,value1234'; GO INSERT dbo.TestData SELECT 1, st FROM #Temp; GO 10000 INSERT dbo.TestData SELECT 2, REPLICATE(st,10) FROM #Temp; GO 1000 INSERT dbo.TestData SELECT 3, REPLICATE(st,100) FROM #Temp; GO 100 INSERT dbo.TestData SELECT 4, REPLICATE(st,1000) FROM #Temp; GO 10 INSERT dbo.TestData SELECT 5, REPLICATE(st,10000) FROM #Temp; GO DROP TABLE #Temp; GO -- Cleanup up the trailing comma UPDATE dbo.TestData SET string_val = SUBSTRING(string_val, 1, LEN(string_val)-1) + 'x';
With the test data available and split functions ready, we can try to test the case on bigger datasets to see how different functions perform (for each type of data, in seconds).
SELECT func.Item FROM dbo.TestData AS tst CROSS APPLY dbo.Split_CTE(tst.string_val, ',') AS func WHERE tst.string_typ = 1; -- Values of string_typ are from 1-5 SELECT func.Item FROM dbo.TestData AS tst CROSS APPLY dbo.Split_Numbers(tst.string_val, ',') AS func WHERE tst.string_typ = 1; -- Values of string_typ are from 1-5
|STRING_TYP||Split_CTE function (s)||Split_Numbers function ( s)|
As the results show, with the strings getting larger, the advantage of the CTE method increases. It should be used over the table numbers approach.
Also, take into account that results depend on the hardware available on your server machine.
String Split Functions in MS SQL
The SQL Server 2016 provided a new built-in function STRING_SPLIT. This function changes the delimited string into a single-column table with two inputs, the string to be split into values and the separator character. It returns one column named value.
Have a look at the example below:
SELECT * FROM string_split('Value 1,Value 2,Value 3,Value 4,Value 5',',' );
The execution plan for this query is bellow:
The Estimated Number of Rows Per Execution value is always 50. It does not depend on the number of string elements.
If we have the user-defined table-valued functions, the Estimated Number of Rows is 100.
As a table-valued function, it can also be used in FROM and WHERE expressions, and wherever a table expression is supported.
For example, we will use the AdventureWorks2019 database to demonstrate the usage of string_split in JOIN statements:
USE AdventureWorks2019; DECLARE @Persons NVARCHAR(4000) = 'Miller,Margheim,Galvin,Duffy,Khanna'; SELECT PersonType, FirstName, MiddleName, LastName FROM PErson.Person WHERE LastName IN ( SELECT value FROM string_split(@Persons,',') );
If you are working on a newer version of SQL Server, using the list of values with STRING_SPLIT function is optimal. It is optimized and easy to use, without the potential third-to-find bugs of some custom solutions.
However, there are some limitations:
- Single-character separator is the only type accepted. If you need more characters, you have to use a custom function.
- Single-output column – the output is always a single-column table without positioning the string element within the delimited string. It allows sorting only by the element name.
- String data type – you use this function to delimit a string of numbers (although all values in the output column are numbers, their data type is a string). When you join them to numeric columns in other tables, the data type conversion is required. If you forget to do explicit casts, you might get unexpected results.
If these limitations are acceptable to you, feel free to apply the function.
The modern methods of processing arrays allow doing the required tasks appropriately. In my work, I would always suggest a built-in function rather than a user-defined one if they are similar from a performance point of view. It is always pre-deployed and available in all databases.
Besides, the software tools available for working with SQL servers help significantly. They provide the necessary values faster and can automate lots of tasks that often consume that time you might use better. For instance, dbForge SQL Complete offers a handy feature of getting the aggregates’ values for the selected data in the SSMS Results Grid (MIN, MAX, AVG, CONT, etc.).