In this article, we will look at the performance of table variables. In SQL Server, you can declare variables that behave as complete tables. Other databases may offer similar capabilities; however, I have used such variables only in MS SQL Server.
Thus, you can write the following:
declare @t table (Value int)
Here, we declare the @t variable as a table with a single Value column of type int. It is possible to create more complex tables; however, in our example, one column is sufficient to explore the optimization.
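For illustration, here is a sketch of a richer table variable; the column names are hypothetical, not from the example above:

```sql
-- Hypothetical sketch: a table variable with several columns.
-- An inline PRIMARY KEY is allowed and creates the only kind of
-- index a table variable can have.
declare @users table (
    UserID   int primary key,
    Name     nvarchar(100),
    IsActive bit
)

insert into @users (UserID, Name, IsActive)
values (1, N'Alice', 1), (2, N'Bob', 0)

select UserID, Name from @users where IsActive = 1
```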
Now, we can use this variable in our queries: we can insert data into it and read data back from it:
insert into @t select UserID from [User]

select * from @t
I have noticed that table variables are often used to fetch data for a large set of keys. For example, a query in the code returns the users of the site; you collect the IDs of all those users, put them into a table variable, and then look up the addresses for these users. Someone may ask why we do not run a single query against the database and get everything at once. Here is a simple example.
Assume that the users come from a web service, while their addresses are stored in your database. In this case, there is no way around it: we got a bunch of user IDs from the service, and to avoid issuing a separate database query per user, someone decides it is easier to pass all the IDs into the query as a table-variable parameter, so the query looks neat:
select * from @t as users join Address a on a.UserID = users.UserID
All this works correctly. In the C# code, you can quickly merge the two result sets into one object using LINQ. However, the performance of the query may suffer.
The problem is that table variables were not designed for processing large volumes of data. Because a table variable carries no statistics, the optimizer typically assumes it holds very few rows and, if I am not mistaken, always chooses the nested loops (LOOP) join strategy. As a result, for each ID from @t, a separate lookup in the Address table occurs. If there are 1,000 records in @t, the server will scan Address 1,000 times.
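As a sketch of this problematic pattern (assuming an Address table with a UserID column, as in the article), the following fills @t with 1,000 IDs and runs the join that the optimizer is likely to execute as a nested loop:

```sql
-- Sketch: populate a table variable with 1,000 IDs.
-- Lacking statistics on @t, the optimizer tends to estimate very few
-- rows and pick a nested loops join, probing Address once per row.
declare @t table (UserID int)

declare @i int = 1
while @i <= 1000
begin
    insert into @t (UserID) values (@i)
    set @i = @i + 1
end

select *
from @t as users
join Address a on a.UserID = users.UserID
```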
During execution, due to this insane number of scans, the server effectively grinds to a halt trying to find the data.
It is much more effective to scan the whole Address table once and find all the users in a single pass. This join strategy is called a MERGE join. However, SQL Server chooses it only when it expects a lot of sorted data on both sides. In our case, the optimizer does not know how much data will end up in the variable, nor whether it is sorted, because such a variable carries no statistics and, typically, no indexes.
If there is little data in the table variable and you do not insert thousands of rows into it, everything is fine. However, if you like such variables and tend to pour huge amounts of data into them, read on.
Curiously, even replacing the table variable with inline SQL greatly speeds up the query:
select * from (
    select 10377 as UserID
    union all
    select 73736
    union all
    select 7474748
    ….
) as users
join Address a on a.UserID = users.UserID
There may be a thousand such SELECT statements and the query text will be huge, but for a large batch of IDs it executes thousands of times faster, because SQL Server can choose an effective execution plan.
This query does not look great, and its execution plan cannot be cached: changing even one ID changes the whole query text, and parameters cannot be used.
I suspect Microsoft did not expect users to use table variables in this way, but there is a nice workaround.
There are several ways to solve this problem. However, in my opinion, the most effective in terms of performance is to add OPTION (RECOMPILE) to the end of the query:
select * from @t as users join Address a on a.UserID = users.UserID OPTION (RECOMPILE)
This option is added once at the very end of the query, after everything else, even ORDER BY. It makes SQL Server recompile the query on each execution.
If we measure the query time after that, it will most likely drop. With large data sets, the improvement can be dramatic: from tens of minutes down to seconds. The server now compiles the query before each run and, instead of reusing a cached execution plan, generates a new one based on the actual amount of data in the variable, and this usually helps a lot.
The drawback is that the execution plan is not cached, so the server has to compile the query and search for an effective plan on every execution. However, I have never seen this process take more than 100 ms.
So is it a bad idea to use table variables? No, it is not. Just remember that they were not designed for large data. When there is a lot of data, it is sometimes better to create a temporary table, insert the data into it, and even build an index on the fly. I had to do this with reports, though only once; back then, it cut the time for generating one report from 3 hours to 20 minutes.
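As a sketch of that temporary-table approach (the temp table name and index name here are hypothetical; Address and UserID are from the earlier example):

```sql
-- Hypothetical sketch: stage the IDs in a temporary table instead of a
-- table variable. Temp tables have statistics, so the optimizer can
-- estimate row counts and pick a better join strategy.
create table #userIDs (UserID int not null)

insert into #userIDs (UserID)
values (10377), (73736), (7474748)  -- in practice, a bulk insert of all IDs

-- Build an index on the fly, as mentioned above
create index IX_userIDs_UserID on #userIDs (UserID)

select *
from #userIDs u
join Address a on a.UserID = u.UserID

drop table #userIDs
```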
I prefer using one big query instead of splitting it into several queries and storing intermediate results in variables. Let SQL Server tune the performance of a big query and it will not let you down. Resort to table variables only in those rare cases where you really see their benefit.