Completing SQL. Part 2: Optimizing string processing and file opening


What is this article about?

This is the second piece from the series of articles about the life of database IDE developers. Its structure is similar to that of the previous article. Just like in the first one, I’ll talk about the issues we faced and the solutions we arrived at – both effective and not so much. To understand this article, you don’t have to read the first part in its entirety, but the first few paragraphs would be useful to help you grasp the context.

For lazy readers, here are the main points of the first article:

  • We create a product line of IDEs for the following DBMS: MySQL, SQL Server, Oracle, and PostgreSQL. 
  • Those IDEs are desktop apps based on the .NET stack (with all that it implies)

Many features are focused on SQL code analysis. We use heavily modified ANTLR for this.

I will be adding links to the next parts as they are published:

In this part, I’ll focus on issues related to string processing. When working with strings, one of the most critical stages is lexical analysis, during which the script is divided into separate words.

For those interested only in specific parts of this article, and for navigation in general, here are the contents:

Cool reinventing the wheel solutions

OutOfMemory in Visual Studio and ANTLR parser post-processing

The previous article was almost entirely concerned with the issues of SQL code parsing. To resolve such issues, we use ANTLR – and we had to work a lot with it. ANTLR is a tool that takes a grammar specifying a language as input and outputs a parser in one of the available languages. At some point, the generated parser grew to such an extent that we were afraid to debug it. You may ask “why?”. Because, should you press F11 (step into) while debugging and land in the parser file, Visual Studio would simply crash. After attaching another Visual Studio instance to it, we learned that it failed due to an OutOfMemory exception while analyzing the parser file. This file contained more than 200,000 lines of code. Unfortunately, debugging the parser is an essential part of the work process; we just couldn’t skip it. Fortunately, C# supports partial classes. With their help, we analyzed the generated parser using regular expressions and split it into a few files. As a result, we managed to almost halve the size of the main file, and Visual Studio started working with it perfectly.
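For illustration, here is a minimal sketch of how partial classes let one generated class be spread across several source files; the class and method names are hypothetical, and the real split was done automatically by a regex-based post-processing step.

```csharp
// MySqlParser.Part1.cs (hypothetical file name)
public partial class MySqlParser
{
    // The first portion of the generated rule methods lives here.
    public void select_statement() { /* generated code */ }
}

// MySqlParser.Part2.cs (hypothetical file name)
public partial class MySqlParser
{
    // The rest of the rule methods live in another file,
    // so no single source file overwhelms Visual Studio.
    public void insert_statement() { /* generated code */ }
}
```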

Lexical analysis without substring before Span API

The API for processing substrings and subarrays without extra memory allocations was introduced in .NET only recently. Unfortunately, we needed to optimize lexical analysis for our products long before that. The main task of lexical analysis is classification – defining the boundaries of words and checking them against a dictionary. Our first solution was very straightforward: it cycled through a string looking for sequences of characters, iterated until a delimiter was reached, then used Substring to cut off the unnecessary parts, converted the result to lowercase, and searched for it in the dictionary. If the word was found, the lexer returned its index. If not, the word was considered an object identifier. This is a simplified description of the algorithm, meant to quickly get the point across without delving too deeply into the details.
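A minimal sketch of that first approach (the keyword set and method names here are illustrative, not our actual code):

```csharp
using System;
using System.Collections.Generic;

static class NaiveLexer
{
    // Illustrative keyword dictionary: keyword text -> token id.
    static readonly Dictionary<string, int> Keywords = new Dictionary<string, int>
    {
        ["select"] = 1, ["from"] = 2, ["where"] = 3
    };

    // Returns the token id for the word spanning text[start..end),
    // or -1 if the word is treated as an object identifier.
    public static int Classify(string text, int start, int end)
    {
        // Substring + ToLower were the hot spots found by the profiler.
        string word = text.Substring(start, end - start).ToLowerInvariant();
        return Keywords.TryGetValue(word, out int tokenId) ? tokenId : -1;
    }
}
```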

The profiling results showed that the Substring and ToLower operations took the most time, while searching through the dictionary was incredibly fast. This bottleneck had to be optimized somehow. We managed to improve the ToLower step quite a bit thanks to the fact that the keywords contain only Latin characters. The result was noticeable, but still not enough. Then we had an idea: convert the dictionary into a 26-ary tree, in which the index in the array of child nodes corresponds to the letter’s position in the English alphabet. When processing a string, we navigate this tree. If, at the end of the word, the current node holds a non-zero token number, the language has a corresponding keyword; if not, the word is probably an object identifier. This trick helped us increase the speed of lexical analysis by approximately 30%.
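A rough sketch of that 26-ary tree (essentially a trie over the Latin alphabet); the type and member names are assumptions, not our production code:

```csharp
// A node of the 26-ary keyword tree: the child index is the letter's position in the English alphabet.
sealed class KeywordNode
{
    public readonly KeywordNode[] Children = new KeywordNode[26];
    public int TokenId; // 0 means "no keyword ends here"
}

static class KeywordTree
{
    public static void Add(KeywordNode root, string keyword, int tokenId)
    {
        var node = root;
        foreach (char c in keyword)
        {
            int i = char.ToLowerInvariant(c) - 'a';
            node = node.Children[i] ?? (node.Children[i] = new KeywordNode());
        }
        node.TokenId = tokenId;
    }

    // Walk the tree over text[start..end) without allocating a substring or a lowercased copy.
    public static int Lookup(KeywordNode root, string text, int start, int end)
    {
        var node = root;
        for (int pos = start; pos < end; pos++)
        {
            char c = text[pos];
            if (c >= 'A' && c <= 'Z') c = (char)(c + 32); // cheap ASCII lowercasing
            if (c < 'a' || c > 'z') return -1;            // non-Latin character -> identifier
            node = node.Children[c - 'a'];
            if (node == null) return -1;                  // not a keyword -> object identifier
        }
        return node.TokenId != 0 ? node.TokenId : -1;
    }
}
```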

It would be interesting to return to lexer optimization armed with all of our new knowledge and new tools. I’m pretty sure we could get a little bit more there. For instance, around half a year after we hit on this solution, the BenchmarkDotNet library started gaining popularity. It organically integrates into any optimization workflow: after a bottleneck is found with the help of a profiling tool, a benchmark is created and the method gets rewritten many times over until the necessary result is reached. What’s surprising is that the results are often not at all obvious.
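For example, a minimal BenchmarkDotNet setup for such a workflow might look like this; the two compared methods are placeholders standing in for the “before” and “after” versions of a hot spot:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class KeywordLookupBenchmark
{
    private const string Sample = "SELECT";

    [Benchmark(Baseline = true)]
    public bool SubstringAndToLower() =>
        Sample.Substring(0, Sample.Length).ToLowerInvariant() == "select";

    [Benchmark]
    public bool CharByCharCompare()
    {
        const string keyword = "select";
        if (Sample.Length != keyword.Length) return false;
        for (int i = 0; i < keyword.Length; i++)
            if (char.ToLowerInvariant(Sample[i]) != keyword[i]) return false;
        return true;
    }
}

public static class Program
{
    public static void Main() => BenchmarkRunner.Run<KeywordLookupBenchmark>();
}
```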

Background lexing during file opening

I already mentioned that people who work with SQL often have to deal with large files. After all, a C# parser that’s 200 thousand lines long is really out of the ordinary, while a large database dump (half a million rows, for instance) wouldn’t surprise anyone. When integrating into VS and SSMS, we don’t meddle with the native code highlighting functionality, but in our standalone tools we have to handle this task ourselves. Fortunately, we work in a 64-bit address space, and our IDEs can open very large files as long as the user has enough RAM. But it’s unlikely that someone is willing to wait too long for such a file to open, right? In one of the versions, we greatly increased the speed of file opening with the help of a few nifty tricks.

Syntax highlighting is based on lexical analysis, and this operation usually takes much more time than reading the text from disk. So what’s the catch? In one thread, the text is read from the file, while the lexical analysis is performed in a different thread. The lexer reads the text row by row, and if it requests a row that doesn’t exist yet, it stops and waits. BlockingCollection<T> from the BCL works on a similar basis, and the algorithm is a typical application of the concurrent Producer-Consumer pattern. The editor, working in the main thread, requests data about the first highlighted line, and if it’s unavailable, it also stops and waits. In our editor, we used the Producer-Consumer pattern and BlockingCollection twice:

  1. Reading from the file acts as the Producer, while the lexer is the Consumer.
  2. The lexer, in turn, acts as the Producer, and the text editor is the Consumer (a sketch of the first stage follows the list). 
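Here is a simplified sketch of the first stage, assuming BlockingCollection<T> is used directly (in reality the reader, lexer, and editor interact through our own editor infrastructure):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class BackgroundLexingSketch
{
    // Stage 1: the file reader produces lines, the lexer consumes them.
    // (The second stage - lexer as Producer, editor as Consumer - is built the same way.)
    public static void OpenFile(string path)
    {
        var lines = new BlockingCollection<string>(boundedCapacity: 1024);

        var reader = Task.Run(() =>
        {
            foreach (string line in File.ReadLines(path))
                lines.Add(line);          // blocks if the lexer falls too far behind
            lines.CompleteAdding();
        });

        var lexer = Task.Run(() =>
        {
            // GetConsumingEnumerable blocks when the requested line has not been read yet,
            // exactly the "stop and wait" behaviour described above.
            foreach (string line in lines.GetConsumingEnumerable())
                LexLine(line);
        });

        Task.WaitAll(reader, lexer);
    }

    static void LexLine(string line) { /* tokenize the line and publish tokens to the editor */ }
}
```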

This set of tricks allows us to significantly shorten the time spent on opening large files. The first page of the document will be shown very quickly, but should the user try to move to the end of the file within the first few seconds, the document will freeze for some time until the background reader and lexer reach the end. But if you’re working with the beginning of the document or moving towards its end slowly, you won’t even notice those freezes.

Less obvious victories

Ambiguous optimization: partial lexical analysis

Syntax analysis is usually divided into two levels:

  • the input character stream is processed to produce lexemes (tokens) based on the language rules – this is called lexical analysis;
  • the parser consumes the token stream, checks it against the rules of the formal grammar, and often builds a syntax tree.

String processing is a costly operation. We had an idea for optimizing it: instead of performing full lexical analysis of the text every time, reanalyze only the part that was changed. But how do we deal with multiline constructs such as block comments or multiline string literals? We handled this with the following trick: for every line, we store a line-end state: “no multiline tokens” = 0, “the beginning of a block comment” = 1, “the beginning of a multiline string literal” = 2. Lexical analysis starts from the changed section and stops as soon as the line-end state equals the stored one.
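A minimal sketch of this line-end-state trick (the enum values and method names are assumptions, and the real implementation of course emits tokens rather than just tracking state):

```csharp
// Per-line state at the end of a line: which multiline construct, if any, is still open.
enum LineEndState : byte
{
    None = 0,               // no multiline token crosses the line boundary
    InBlockComment = 1,     // a block comment was opened and not yet closed
    InMultilineString = 2   // a multiline string literal was opened and not yet closed
}

static class IncrementalLexing
{
    // Re-lex starting from the first changed line; stop as soon as the newly computed
    // line-end state matches the state stored before the edit.
    public static void Relex(string[] lines, LineEndState[] storedStates, int firstChangedLine)
    {
        var state = firstChangedLine > 0 ? storedStates[firstChangedLine - 1] : LineEndState.None;

        for (int i = firstChangedLine; i < lines.Length; i++)
        {
            state = LexLine(lines[i], state);  // tokenize one line, carrying the open construct
            if (state == storedStates[i])
                return;                        // the rest of the file is unaffected by the edit
            storedStates[i] = state;
        }
    }

    static LineEndState LexLine(string line, LineEndState incoming)
    {
        // The real implementation produces tokens; here we only track the end-of-line state.
        return incoming;
    }
}
```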

Sounds great, doesn’t it? So what’s so unobvious about this solution? Shouldn’t it belong in the cool tricks section? At first, we tried packaging the results of the lexical analysis into lightweight structures: 2 bytes for the offset within the string, 2 bytes for the token type, 2 bytes for the length. ANTLR Token instances were created from these structures on the fly for the parser. After measuring, we found that the performance improvement was less than 5%. The profiling results became even more obscure, and the bottleneck was nowhere to be found; the only clue was that the new operator was surprisingly slow. Saddened, we went home for the weekend. As often happens when you’ve been working on a task for a week without actually finishing it, it stays at the back of your mind. On Monday, we came up with the idea of putting the intermediaries aside and storing ANTLR Token objects straight away. Going forward with it, we understood that this would require more RAM than the lightweight structures. We knew that a number takes less memory than a string, but we didn’t take pains to calculate everything accurately at that time.
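Roughly, the lightweight structure looked something like this (the name and field layout here are my reconstruction from the description above):

```csharp
// The lightweight intermediate representation we tried first: 6 bytes of payload per token.
// ANTLR Token instances were then materialized from these on the fly for the parser.
readonly struct PackedToken
{
    public readonly ushort Offset;  // shift of the token within the string
    public readonly ushort Type;    // token type id
    public readonly ushort Length;  // token length in characters

    public PackedToken(ushort offset, ushort type, ushort length)
    {
        Offset = offset;
        Type = type;
        Length = length;
    }
}
```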

There was one more issue with this solution: it’s extremely inconvenient to maintain line numbers in such structures, while the line number is a required attribute of an ANTLR token – when a line is inserted or deleted, the numbers of all following lines have to be updated accordingly. We solved this by setting the line number immediately before handing the token over to the parser. The tests we performed later showed that performance improved by 15–25%. The actual improvement was even greater: some features that rely only on lexical analysis, such as Quick Info, started working instantaneously. Quick Info is a feature available in every product from our line: a pop-up hint with object information that appears when you hover the mouse cursor over an object identifier in the script. The advantage is even greater when the token stream needs to be processed multiple times for different features.

The amount of RAM required for all of this turned out to be much greater than we expected. An ANTLR token consists of: a beginning point – 8 bytes, an end point – 8 bytes, a reference to the word’s text – 4 or 8 bytes (not counting the string itself), a reference to the document’s text – 4 or 8 bytes, and a token type – 4 bytes. Importantly, most of the time tokens didn’t store their text and only computed it when required; the text was cached only in some cases. On top of that, a token is a class, not a structure, which means even more overhead. The total size of the objects came out to be a few times larger than the original text. We understood that only later: no one really expects a class with a few int fields to consume so much memory. Theoretically, we could gain in memory usage only on a script with very long object names and literals; with around 30 characters per word, this approach even becomes efficient, which is definitely not a common situation.
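As a rough back-of-envelope estimate (assuming a 64-bit process, where references take 8 bytes and every object carries roughly 16 bytes of header): 8 + 8 + 8 + 8 + 4 ≈ 36 bytes of fields, plus the header and alignment padding, lands somewhere around 56 bytes per token. A typical five-to-seven-character SQL word occupies only 10–14 bytes as UTF-16 text, so the token objects alone can easily be several times larger than the script they describe – which is also why the break-even point sits near 30-character words.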

So what can we conclude? On one hand, the performance improvement was initially lower than we had hoped for, and the bottleneck seemed obvious. We focused on performance and got excessive RAM consumption in a place where we didn’t expect it. It’s not that these results completely blindsided us, but we didn’t anticipate them, because we had tried to use lightweight structures instead of classes from the very beginning. By replacing them with heavier objects, we knowingly accepted additional memory expenses to get better performance. Our problem was that we didn’t meticulously calculate how much memory such a class would actually require – one rarely contemplates a few ints taking up so much memory. Fortunately, this taught us an important lesson, so now every performance optimization ends with profiling memory consumption, and vice versa.

At first, I wanted to classify this case as one about a mistake but, while working on this text and having conversations with its participants, I changed my mind. This is a story with a moral, and its result can be ultimately considered a victory because some features started working nearly instantaneously and others just a bit quicker. After all, it would be impossible to perform the background lexical analysis trick if there wasn’t an object where one of the threads could store tokens.

One more thing: before implementing a new solution, it’s sometimes better to stop and think again about the assumptions it’s built on, especially about the most obvious ones, and try to remember why they seem to be obvious.

Conclusions

Here are the intermediary conclusions:

  • The issues with development tools can be unpleasant and detrimental to the whole process. Such issues should be located and dealt with as soon as possible.
     
  • The new versions of .NET give us cool new possibilities. It may be worth returning to tasks we solved in the past and approaching them again with the new tools.
     
  • Any optimization consists of two parts: locating the bottleneck (the profiler is a nice fit for this task) and optimizing it (benchmarks will help with this).
     
  • It’s better not to start optimizing unless thorough profiling has been performed, even if you think you know exactly where the bottleneck is. It’s important to understand the greatest possible optimization potential.
     
  • When you have millions of objects to work with, each and every byte is crucial.
     
  • It’s a good idea to profile RAM after performance tuning, and vice versa.

Andrey Podkolzin

Andrey has more than 5 years of .NET development experience at Devart. He likes optimizing and parallelizing code. In the past, he had a passion for game development; nowadays, he is focused on desktop product development.