I’ve been working for a company that develops IDEs for database interaction for more than five years. Before starting this article, I had no idea how many curious stories lay behind it.
My team develops and supports the IDE language features, and code auto-completion is the major one. Many exciting things happened along the way: some things we got right on the first attempt, while others failed even after several shots.
SQL and dialects parsing
SQL is an attempt to look like a natural language, and the attempt is quite successful, I should say. Depending on the dialect, there are several thousand keywords. To distinguish one statement from another, you often need to look one or two words (tokens) ahead. This approach is called lookahead.
Parsers are classified by how far they can look ahead: LL(1), LL(2), or LL(*), which means the parser can look as far ahead as needed to pick the right branch.
Sometimes the optional ending of one clause matches the start of another optional clause, and such situations make parsing much harder. T-SQL doesn’t make things easier: some SQL statements have optional endings that can conflict with the beginning of the following statements.
Don’t you believe it? There is a formal way of describing languages: a grammar, from which you can generate a parser using one tool or another. The most notable tools and grammar-description languages are YACC and ANTLR.
YACC-generated parsers are used in MySQL, MariaDB, and PostgreSQL engines. We could try and take them right from the source code and develop code completion and other functions based on the SQL analysis using these parsers. Furthermore, this product would receive free development updates, and the parser would behave the same way the source engine does.
So why are we still using ANTLR? It has solid support for C#/.NET, a decent toolkit, and its syntax is much easier to read and write. In fact, ANTLR syntax turned out to be so handy that Microsoft now uses it in its official C# documentation.
But let’s get back to SQL complexity when it comes to parsing. I would like to compare the grammar sizes of publicly available languages. In dbForge, we use our own grammars; they are more complete than the public ones, but unfortunately they are overloaded with C# code inserts that support various features.
The grammar sizes for different languages are as follows:
- JS – 475 parser lines + 273 lexer lines = 748 lines
- Java – 615 parser lines + 211 lexer lines = 826 lines
- C# – 1159 parser lines + 433 lexer lines = 1592 lines
- C++ – 1933 lines
- MySQL – 2515 parser lines + 1189 lexer lines = 3704 lines
- T-SQL – 4035 parser lines + 896 lexer lines = 4931 lines
- PL/SQL – 6719 parser lines + 2366 lexer lines = 9085 lines
Some lexer grammars end with lists of the Unicode characters allowed in the language. Those lists say nothing about language complexity, so the line counts above stop before them.
Judging the complexity of parsing a language by the number of lines in its grammar is debatable. Still, I believe these numbers are worth showing because they reveal a huge discrepancy.
That’s not all. Since we’re developing an IDE, we have to deal with incomplete or invalid scripts. We came up with many tricks, but customers keep sending us working scenarios built around unfinished scripts, and we have to handle them.
Predicate wars
During parsing, the current word sometimes doesn’t tell you which of two alternatives to choose. The mechanism that resolves this kind of ambiguity in ANTLR is lookahead: the generated parser method is a chain of nested if’s, each looking one more token ahead. Here is an example of a grammar that produces this kind of uncertainty:
rule1:
    'a' (rule2 | rule3)
    ;
rule2:
    'b' 'c' 'd'
    ;
rule3:
    'b' 'c' 'e'
    ;
In the middle of rule1, once the token ‘a’ has been consumed, the parser looks ahead to choose between rule2 and rule3; the alternatives differ only at their third token. The same tokens are then checked once more during the actual match. This grammar can be rewritten to exclude the lookahead, but such optimizations harm the structure, while the performance boost is rather small.
There are more complex ways to resolve this kind of uncertainty, for example, the syntactic predicate (SynPred) mechanism in ANTLR3. It helps when the optional ending of a clause crosses the beginning of the next optional clause.
In ANTLR3 terms, a predicate is a generated method that performs a virtual pass over the text according to one of the alternatives. If that pass succeeds, the predicate returns true, and the real pass follows. Such a virtual pass is called a backtracking-mode entry.
The real trouble starts when a predicate fires inside another predicate: then the same stretch of text may be traversed hundreds or thousands of times.
Let’s review a simplified example. There are three points of uncertainty: (A, B, C).
- The parser enters A, remembers its position in the text, starts a level-1 virtual entry.
- The parser enters B, remembers its position in the text, starts a level-2 virtual entry.
- The parser enters C, remembers its position in the text, starts a level-3 virtual entry.
- The parser completes a level-3 virtual entry, returns to level-2, and passes C once again.
- The parser completes a level-2 virtual entry, returns to level-1, and passes B and C once again.
- The parser completes a virtual entry, returns, and performs a real entry through A, B, and C.
As a result, all checks within C will be done 4 times, within B – 3 times, within A – 2 times.
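To make the arithmetic above tangible, here is a toy model of this walkthrough in C#. It is not ANTLR-generated code, and it assumes the predicate for each point is evaluated only once (roughly what ANTLR’s memoization option achieves); without that assumption, the counts grow even faster. Entering a point runs its predicate as a virtual pass over the rest of the input and then performs the real pass, and the counters come out as exactly 2, 3, and 4.

using System;

// Toy model of nested predicates (hypothetical code, not ANTLR output).
class PredicateToyModel
{
    static readonly int[] Checks = new int[3];          // checks done inside A, B, C
    static readonly bool[] PredicateDone = new bool[3]; // "memoized" predicate results

    static void Main()
    {
        Enter(0);
        Console.WriteLine($"A: {Checks[0]}, B: {Checks[1]}, C: {Checks[2]}"); // A: 2, B: 3, C: 4
    }

    // Entering an uncertainty point: run the predicate (a virtual pass) once, then the real pass.
    static void Enter(int point)
    {
        if (!PredicateDone[point])
        {
            Pass(point);                 // virtual (backtracking-mode) entry
            PredicateDone[point] = true;
        }
        Pass(point);                     // real entry
    }

    // Passing a point's content; the next uncertainty point lies further in the text.
    static void Pass(int point)
    {
        Checks[point]++;
        if (point + 1 < Checks.Length)
            Enter(point + 1);
    }
}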
But what if the fitting alternative is only the second or the third in the list? Then one of the predicate stages fails, the position in the text rolls back, and the next predicate starts running.
When analyzing app freezes, we often stumble upon traces where a SynPred was executed several thousand times. SynPreds are especially problematic in recursive rules, and sadly, SQL is recursive by its nature: the ability to use subqueries almost everywhere has its price. However, it is often possible to rework a rule to make a predicate go away.
SynPreds hurt performance, so at some point their number was put under rigid control. The problem is that when you write grammar code, a SynPred may appear without being obvious to you. What’s more, changing one rule can cause a SynPred to appear in another rule, which makes manual control over them practically impossible.
To keep the number of predicates in check, we created a simple regular-expression tool run by a special MSBuild task. If the number of predicates didn’t match the number specified in a reference file, the task immediately failed the build and reported an error.
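A minimal sketch of such a check as an MSBuild task is shown below. The property names, the baseline file format, and the exact regular expression are assumptions for illustration, not our actual task; it simply counts the distinct synpredN names in the generated parser and compares the count with an approved number.

using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using Microsoft.Build.Framework;
using Microsoft.Build.Utilities;

public class CheckSynPredCount : Task
{
    [Required] public string GeneratedParserPath { get; set; }
    [Required] public string BaselinePath { get; set; }   // file holding the approved count

    public override bool Execute()
    {
        var source = File.ReadAllText(GeneratedParserPath);

        // ANTLR3-generated predicates typically have names like synpred12_<GrammarName>.
        var actual = Regex.Matches(source, @"synpred\d+_")
                          .Cast<Match>()
                          .Select(m => m.Value)
                          .Distinct()
                          .Count();
        var approved = int.Parse(File.ReadAllText(BaselinePath).Trim());

        if (actual == approved)
            return true;

        Log.LogError($"Predicate count changed: {actual} instead of the approved {approved}. " +
                     "Rewrite the rule or update the baseline after a review.");
        return false;
    }
}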
When seeing this error, a developer has to rewrite the rule, sometimes several times, to remove the redundant predicates. If a predicate can’t be avoided, the developer adds it to a special file that draws extra attention during review.
On rare occasions, we even wrote our own predicates in C# just to avoid the ANTLR-generated ones. Luckily, that option exists as well.
Grammar inheritance
When any of the DBMSes we support changes, we have to reflect the change in our tools, and support for the new grammatical constructions is always the starting point.
We create a separate grammar for each SQL dialect. It entails some code duplication, but in the long run it is easier than trying to find what the dialects have in common.
We went for writing our own ANTLR grammar preprocessor that implements grammar inheritance.
It also became obvious that we needed a mechanism for polymorphism – the ability not only to redefine a rule in the descendant but also to call the base one, and to control the position at which the base rule is called.
Tooling is a definite plus of ANTLR over other language-recognition tools – the Visual Studio integration and ANTLRWorks – and we didn’t want to lose this advantage while implementing inheritance. The solution was to specify the base grammar inside the inherited grammar in the ANTLR comment format: for the ANTLR tools it is just a comment, but we can extract all the required information from it.
We wrote an MSBuild task that was embedded into the build system as a pre-build action. Its job was to act as a preprocessor for the ANTLR grammars: it generated the resulting grammar from its base and inherited peers, and ANTLR itself then processed the result.
ANTLR postprocessing
In many programming languages, keywords can’t be used as object names. In SQL, there can be from 800 to 3000 keywords depending on the dialect, and most of them are tied to a specific context inside databases, so forbidding them as object names would frustrate users. That’s why SQL has reserved and unreserved keywords.
You can’t name your object with a reserved word (SELECT, FROM, etc.) without quoting it, but you can do so with an unreserved word (CONVERSATION, AVAILABILITY, etc.). This behavior makes parser development harder.
During lexical analysis, the context is unknown, yet the parser already requires different token types for identifiers and keywords. That’s why we added another postprocessing step to the ANTLR parser: it replaced all the explicit identifier checks with calls to a special method.
This method performs a more detailed check. If the input token is an identifier and an identifier is expected at this point, then everything is fine. But if the input token is an unreserved keyword, we have to double-check it: the extra check searches the branches reachable from the current context in which this unreserved keyword could act as a keyword. If there are no such branches, it can be used as an identifier.
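Here is a simplified sketch of that check; all the names and the set-based representation of the context are illustrative assumptions, not the actual dbForge code.

using System.Collections.Generic;

static class IdentifierCheck
{
    // Decides whether the current token may be treated as an identifier.
    public static bool CanBeIdentifier(
        int tokenType,
        int identifierType,
        ISet<int> unreservedKeywords,    // token types that are unreserved in this dialect
        ISet<int> keywordsExpectedHere)  // keyword branches reachable from the current context
    {
        if (tokenType == identifierType)
            return true;                             // a plain identifier is always fine

        if (!unreservedKeywords.Contains(tokenType))
            return false;                            // a reserved word can't be an object name here

        // An unreserved keyword may serve as an identifier only if no branch
        // in the current context expects it as a keyword.
        return !keywordsExpectedHere.Contains(tokenType);
    }
}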
Technically, this problem could be solved by means of ANTLR itself, but that solution isn’t optimal. The ANTLR way is to create a rule that lists all unreserved keywords together with the identifier lexeme and to use this rule everywhere instead of the identifier lexeme. Such a solution requires the developer to remember to add every new keyword both where it is used and to the special rule, and it is also slower.
Errors in syntax analysis without trees
The syntax tree is usually the result of the parser’s work. It is a data structure that reflects the program text through the formal grammar. If you want to implement a code editor with auto-completion for a language, you will most likely end up with the following algorithm:
- Parse the text in the editor. Then you get a syntax tree.
- Find the node under the caret and match it against the grammar.
- Find out which keywords and object types are available at that point.
In this case, the grammar is easy to imagine as a graph or a state machine.
Unfortunately, only the third version of ANTLR was available when dbForge IDE development started. It wasn’t as nimble, and although you could tell ANTLR how to build a tree, the usage wasn’t smooth.
Moreover, many articles on this topic suggested using the ‘actions’ mechanism for running code while the parser passes through a rule. This mechanism is very handy, but it led to architectural problems and made supporting new functionality more complex.
The thing is, a single grammar file started accumulating ‘actions’ for a large number of features that should rather have been distributed across different assemblies. We managed to distribute the action handlers across assemblies by making a sneaky variation of the subscriber-notifier pattern.
According to our measurements, ANTLR3 works 6 times faster than ANTLR4. Also, the syntax tree for big scripts could take too much RAM, which was bad news: we had to operate within the 32-bit address space of Visual Studio and SQL Server Management Studio.
ANTLR parser post-processing
When working with strings, one of the most critical stages is lexical analysis, where we divide the script into separate words.
ANTLR takes as input a grammar that specifies the language and outputs a parser in one of the available languages. At some point, the generated parser grew so large that we were afraid to debug it. If you pressed F11 (step into) while debugging and went into the parser file, Visual Studio would simply crash.
It turned out that it crashed with an OutOfMemoryException while analyzing the parser file, which contained more than 200,000 lines of code.
But debugging the parser is an essential part of the work process, and you can’t skip it. With the help of C# partial classes, we used regular expressions to split the generated parser into several files, and Visual Studio handled them perfectly well.
Lexical analysis without substring before Span API
The main task of lexical analysis is classification: defining the boundaries of the words and checking them against a dictionary. If the word is found, the lexer returns its index; if not, the word is considered an object identifier. This is a simplified description of the algorithm.
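Before the Span API appeared, the trick was to compare a word against the dictionary without allocating a substring for it. Below is a minimal sketch of the idea; the token ids and the linear scan are illustrative only, and a real lexer would bucket the keywords by length or first character.

using System;
using System.Collections.Generic;

static class KeywordClassifier
{
    const int IdentifierToken = 100;     // token ids here are made up for the example

    static readonly Dictionary<string, int> Keywords =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            ["SELECT"] = 1, ["FROM"] = 2, ["WHERE"] = 3
        };

    // Classifies the word source[start .. start + length) without calling Substring.
    public static int Classify(string source, int start, int length)
    {
        foreach (var pair in Keywords)
        {
            if (pair.Key.Length != length)
                continue;
            if (string.Compare(source, start, pair.Key, 0, length,
                               StringComparison.OrdinalIgnoreCase) == 0)
                return pair.Value;       // found in the dictionary: it is a keyword
        }
        return IdentifierToken;          // otherwise it is an object identifier
    }
}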
Background lexing during file opening
Syntax highlighting is based on lexical analysis, which usually takes much more time than reading the text from disk. What’s the trick? One thread reads the text from the file, while the lexical analysis is performed in a different thread.
The lexer reads the text line by line. If it requests a line that hasn’t been read yet, it stops and waits.
BlockingCollection<T> from the BCL works on a similar basis, and the algorithm is a typical application of the concurrent Producer-Consumer pattern. The editor working in the main thread requests data about the first highlighted line, and if it is unavailable, it stops and waits. In our editor, we used the producer-consumer pattern and BlockingCollection twice (a minimal sketch follows the list below):
- Reading from the file is the Producer, and the lexer is the Consumer.
- The lexer, in turn, is the Producer, and the text editor is the Consumer.
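A minimal sketch of these two stages could look like this; the file name is arbitrary and the tokenizer is a stub, while in the real editor the lexer is the ANTLR-based one.

using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading.Tasks;

static class BackgroundLexingDemo
{
    static void Main()
    {
        var rawLines = new BlockingCollection<string>(boundedCapacity: 1024);
        var lexedLines = new BlockingCollection<string[]>(boundedCapacity: 1024);

        // Stage 1: the file reader is a producer, the lexer is its consumer.
        var reader = Task.Run(() =>
        {
            foreach (var line in File.ReadLines("huge_script.sql"))
                rawLines.Add(line);
            rawLines.CompleteAdding();
        });

        // Stage 2: the lexer is a producer, the editor is its consumer.
        var lexer = Task.Run(() =>
        {
            foreach (var line in rawLines.GetConsumingEnumerable())
                lexedLines.Add(Tokenize(line));
            lexedLines.CompleteAdding();
        });

        // The "editor": takes lexed lines as they become available,
        // blocking only when it outruns the background work.
        foreach (var tokens in lexedLines.GetConsumingEnumerable())
            Render(tokens);

        Task.WaitAll(reader, lexer);
    }

    // Toy lexer: splits on whitespace; the real one emits proper tokens.
    static string[] Tokenize(string line) =>
        line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);

    static void Render(string[] tokens) { /* apply highlighting */ }
}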
This set of tricks allows us to significantly shorten the time spent on opening large files. The first page of the document shows up very quickly; however, the document may freeze if the user tries to jump to the end of the file within the first few seconds, because the background reader and lexer need to reach the end of the document first. If the user moves from the beginning of the document towards the end slowly, there are no noticeable freezes.
Ambiguous optimization: partial lexical analysis
The syntactical analysis is usually divided into two levels:
- the input character stream is processed into lexemes (tokens) based on the language rules – this is called lexical analysis;
- the parser consumes the token stream, checks it against the formal grammar rules, and often builds a syntax tree.
String processing is a costly operation. To optimize it, we decided not to perform full lexical analysis of the text every time, but to reanalyze only the part that was changed. But how do we deal with multiline constructs such as block comments or multiline string literals? We store a line-end state for every line: “no multiline tokens” = 0, “the beginning of a block comment” = 1, “the beginning of a multiline string literal” = 2. The lexical analysis starts from the changed section and stops at the line whose line-end state is equal to the stored one.
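A minimal sketch of this idea follows, assuming the document is given as a list of lines and using a toy per-line lexer that only tracks block comments; the real implementation stores more state, emits tokens, and handles inserted and deleted lines.

using System.Collections.Generic;

enum LineEndState { None = 0, InBlockComment = 1, InMultilineString = 2 }

sealed class IncrementalLexer
{
    private readonly List<LineEndState> _endStates = new List<LineEndState>();

    // Relex from the first edited line; stop as soon as a line past the edited
    // region ends in the same state it ended in before the edit.
    public void Relex(IReadOnlyList<string> lines, int firstChanged, int lastChanged)
    {
        var state = firstChanged > 0 ? _endStates[firstChanged - 1] : LineEndState.None;

        for (int i = firstChanged; i < lines.Count; i++)
        {
            state = LexLine(lines[i], state);

            var hadState = i < _endStates.Count;
            var unchanged = hadState && _endStates[i] == state;

            if (hadState) _endStates[i] = state; else _endStates.Add(state);

            if (i >= lastChanged && unchanged)
                break;                      // the rest of the document is unaffected
        }
    }

    // Toy line lexer: only tracks /* ... */ comments; the real one emits tokens too.
    private static LineEndState LexLine(string line, LineEndState state)
    {
        for (int i = 0; i < line.Length; i++)
        {
            if (state == LineEndState.InBlockComment)
            {
                if (line[i] == '*' && i + 1 < line.Length && line[i + 1] == '/') { state = LineEndState.None; i++; }
            }
            else if (state == LineEndState.None)
            {
                if (line[i] == '/' && i + 1 < line.Length && line[i + 1] == '*') { state = LineEndState.InBlockComment; i++; }
            }
        }
        return state;
    }
}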
There was one issue with this solution: it is extremely inconvenient to track line numbers in such structures, while the line number is a required attribute of an ANTLR token, because when a line is inserted or deleted, the numbers of all the following lines must be updated accordingly. We solved it by setting the line number immediately before handing the token over to the parser. The tests we performed later showed that performance improved by 15-25%; the actual improvement was even greater.
The amount of RAM required for all of this turned out to be much higher than we expected. An ANTLR token consists of: a start point – 8 bytes, an end point – 8 bytes, a link to the word’s text – 4 or 8 bytes (not counting the string itself), a link to the document’s text – 4 or 8 bytes, and a token type – 4 bytes.
So what can we conclude? We focused on performance and got excessive RAM consumption in a place we didn’t expect. We hadn’t anticipated it because we tried to use lightweight structures instead of classes; when we replaced them with heavy objects, we knowingly accepted additional memory expenses in exchange for better performance. Fortunately, this taught us an important lesson: now each performance optimization ends with profiling memory consumption, and vice versa.
This is a story with a moral: some features started working nearly instantaneously, and others just a bit quicker. After all, the background lexical analysis trick would have been impossible without an object in which one of the threads could store tokens.
All further problems unfold in the context of desktop development on the .NET stack.
The 32-bit problem
Some users opt for standalone versions of our products; others work inside Visual Studio and SQL Server Management Studio, for which many extensions are developed. One of these extensions is SQL Complete; to clarify, it provides more capabilities and features than the standard code completion of SSMS and Visual Studio for SQL.
SQL parsing is a very costly process, both in terms of CPU and RAM. To suggest objects in user scripts without unnecessary calls to the server, we store an object cache in RAM. Often it doesn’t take up much space, but some of our users have databases that contain up to a quarter of a million objects.
Working with SQL is quite different from working with other languages. In C#, there are practically no files with even a thousand lines of code. Meanwhile, in SQL a developer can work with a database dump consisting of several million lines of code, and there is nothing unusual about it.
DLL-Hell inside VS
The .NET Framework has a handy mechanism for developing plugins: application domains. Everything runs in isolation, and a domain can be unloaded. In fact, supporting extensions is perhaps the main reason application domains were introduced.
There is also the MAF framework, which Microsoft designed to solve the problem of creating add-ins. It isolates add-ins to such an extent that it can move them to a separate process and take over all communication. Frankly speaking, this solution is too cumbersome and has not gained much popularity.
Unfortunately, Microsoft Visual Studio and SQL Server Management Studio, which is built upon it, implement their extension systems differently. That simplifies the plugins’ access to the host application, but it forces them to coexist within one process and one application domain.
Just like any other application in the 21st century, ours has a lot of dependencies. The majority of them are well-known, time-proven, and popular libraries in the .NET world.
Pumping messages inside a lock
It is not widely known that the .NET Framework pumps the Windows message queue during every blocking wait on a WaitHandle. As a result, inside every lock, any handler of any event in the application can be called if the lock has had time to switch to kernel mode, that is, if it was not acquired during the spin-wait phase.
This can result in re-entrancy in very unexpected places. A few times it led to problems like “Collection was modified during enumeration” and various ArgumentOutOfRangeExceptions.
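A minimal illustration of the defensive pattern we ended up leaning on is below (not the actual product code): raise events over a snapshot of the subscriber list, so that a re-entrant handler that subscribes or unsubscribes modifies the live list rather than the collection being enumerated.

using System;
using System.Collections.Generic;

sealed class Notifier
{
    private readonly object _sync = new object();
    private readonly List<Action> _handlers = new List<Action>();

    public void Subscribe(Action handler)
    {
        lock (_sync) _handlers.Add(handler);
    }

    public void Notify()
    {
        Action[] snapshot;
        lock (_sync) snapshot = _handlers.ToArray();   // copy while holding the lock

        // Invoke outside the lock: if a handler re-enters and changes the
        // subscriber list, it changes the list, not the array we iterate over.
        foreach (var handler in snapshot)
            handler();
    }
}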
Adding an assembly to a solution using SQL
When a project grows, the task of adding assemblies, simple at first, develops into a dozen complicated steps. Once, we had to add about a dozen different assemblies to the solution as part of a big refactoring. Nearly 80 solutions, including product and test ones, were created on top of around 300 .NET projects.
Based on the product solutions, we wrote Inno Setup files that listed the assemblies packaged into the installer the user downloads. The algorithm for adding a project was as follows:
- Create a new project.
- Add a certificate to it. Set up the tag of the build.
- Add a version file.
- Reconfigure the project’s output paths.
- Rename the folder to match the internal specification.
- Add the project to the solution once again.
- Add references to a couple of assemblies that all projects need.
- Add the project to all necessary solutions: test and product ones.
- For all of the product solutions, add the assemblies to the installer.
These 9 steps had to be repeated about 10 times. Steps 8 and 9 are not that trivial, and it is easy to forget to add the project somewhere.
Faced with such a big and routine task, any normal programmer would want to automate it. That is exactly what we did. But how do we indicate which solutions and installers the newly created project should be added to? There are many scenarios, and what is more, it is difficult to predict some of them.
We came up with a crazy idea: solutions are related to projects as many-to-many, projects to installers in the same way, and SQL can solve exactly the kind of tasks that we had.
We created a .NET Core console app that scans all .sln files in the source folder, retrieves the list of projects from them with the help of the dotnet CLI, and puts it into a SQLite database. The program has a few modes:
- New – creates a project and all necessary folders, adds a certificate, sets up a tag, and adds a version file and the minimum essential assemblies.
- Add-Project – adds the project to all solutions that satisfy the SQL query given as one of the parameters; to add the project to a solution, the program uses the dotnet CLI under the hood.
- Add-ISS – adds the project to all installers that satisfy the SQL query.
Although the idea of specifying the list of solutions through an SQL query may seem cumbersome, it completely covered all existing cases and, most likely, any possible cases in the future.
Let me demonstrate the scenario: create a project “A” and add it to all solutions where project “B” is used:
dbforgeasm add-project Folder1\Folder2\A "SELECT s.Id FROM Projects p JOIN Solutions s ON p.SolutionId = s.Id WHERE p.Name = 'B'"
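For completeness, here is a rough sketch of the scanning step; the table schema, the names, and the use of Microsoft.Data.Sqlite are assumptions made for illustration rather than the actual tool.

using System;
using System.Diagnostics;
using System.IO;
using Microsoft.Data.Sqlite;

static class SolutionScanner
{
    public static void Scan(string sourceRoot, string dbPath)
    {
        using var connection = new SqliteConnection($"Data Source={dbPath}");
        connection.Open();

        var ddl = connection.CreateCommand();
        ddl.CommandText = "CREATE TABLE IF NOT EXISTS Solutions (Id INTEGER PRIMARY KEY, Path TEXT)";
        ddl.ExecuteNonQuery();
        ddl.CommandText = "CREATE TABLE IF NOT EXISTS Projects (Id INTEGER PRIMARY KEY, SolutionId INTEGER, Name TEXT)";
        ddl.ExecuteNonQuery();

        foreach (var sln in Directory.EnumerateFiles(sourceRoot, "*.sln", SearchOption.AllDirectories))
        {
            var insertSolution = connection.CreateCommand();
            insertSolution.CommandText = "INSERT INTO Solutions (Path) VALUES ($path)";
            insertSolution.Parameters.AddWithValue("$path", sln);
            insertSolution.ExecuteNonQuery();

            var getId = connection.CreateCommand();
            getId.CommandText = "SELECT last_insert_rowid()";
            var solutionId = (long)getId.ExecuteScalar();

            // "dotnet sln <file> list" prints the project paths of a solution.
            var psi = new ProcessStartInfo("dotnet", $"sln \"{sln}\" list")
            {
                RedirectStandardOutput = true,
                UseShellExecute = false
            };
            using var process = Process.Start(psi);
            string line;
            while ((line = process.StandardOutput.ReadLine()) != null)
            {
                if (!line.Trim().EndsWith("proj", StringComparison.OrdinalIgnoreCase))
                    continue;                                    // skip the header lines

                var insertProject = connection.CreateCommand();
                insertProject.CommandText = "INSERT INTO Projects (SolutionId, Name) VALUES ($sid, $name)";
                insertProject.Parameters.AddWithValue("$sid", solutionId);
                insertProject.Parameters.AddWithValue("$name", Path.GetFileNameWithoutExtension(line.Trim()));
                insertProject.ExecuteNonQuery();
            }
            process.WaitForExit();
        }
    }
}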
A problem with LiteDB
A couple of years ago, we got a task to develop a background mechanism for saving user documents. It had two main use cases: the ability to instantly close the IDE and leave and, upon returning, start from where you left off; and the ability to restore after emergencies like blackouts or program crashes.
To implement this task, it was necessary to save the contents of the files somewhere on the side, and to do it often and quickly. Apart from the contents, it was necessary to save some metadata, which made direct storage in the file system inconvenient.
At that point, we found the LiteDB library – a quick, lightweight embedded database written entirely in C#. Its speed and overall simplicity won us over.
During development, the whole team was satisfied with LiteDB. The main problems, however, started after the release.
The official documentation guaranteed that the database properly handled concurrent access from multiple threads as well as from several processes. However, aggressive synthetic tests showed that it did not work correctly in a multithreaded environment.
To quickly fix the problem, we synchronized the processes with a self-written interprocess ReadWriteLock. Now, almost three years later, LiteDB works much better.
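One common way to build such a lock from named kernel objects is sketched below; this is an illustration of the approach, not our actual class. Readers each take one slot of a named semaphore, a writer takes all of them, and a named mutex keeps two writers from deadlocking while they collect slots.

using System;
using System.Threading;

public sealed class InterprocessReadWriteLock : IDisposable
{
    private const int MaxReaders = 16;
    private readonly Semaphore _slots;
    private readonly Mutex _writerGate;

    public InterprocessReadWriteLock(string name)
    {
        // Named objects are shared between processes on the same machine.
        _slots = new Semaphore(MaxReaders, MaxReaders, name + ".slots");
        _writerGate = new Mutex(false, name + ".writer");
    }

    public IDisposable AcquireReadLock()
    {
        _slots.WaitOne();                                  // one slot per reader
        return new Releaser(() => _slots.Release());
    }

    public IDisposable AcquireWriteLock()
    {
        _writerGate.WaitOne();                             // only one writer collects slots at a time
        for (var i = 0; i < MaxReaders; i++)
            _slots.WaitOne();                              // wait until all readers leave
        return new Releaser(() =>
        {
            _slots.Release(MaxReaders);
            _writerGate.ReleaseMutex();
        });
    }

    public void Dispose()
    {
        _slots.Dispose();
        _writerGate.Dispose();
    }

    private sealed class Releaser : IDisposable
    {
        private Action _release;
        public Releaser(Action release) => _release = release;
        public void Dispose() { _release?.Invoke(); _release = null; }
    }
}

A reader then wraps its work in using (dbLock.AcquireReadLock()) { ... } and a writer in using (dbLock.AcquireWriteLock()) { ... }, from any of the cooperating processes.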
StreamStringList
This problem is the opposite of the partial lexical analysis case. When we work with a text, it is more convenient to treat it as a list of strings. Strings can be requested in random order, but a certain memory-access density is still present. At some point, we had to run several tasks that process very big files without loading them fully into memory. The idea was as follows:
- Read the file line by line and remember the offset of each line in the file.
- Upon request, return the line: seek to the stored offset and read the data.
The main task was completed. This structure takes up little space compared to the file size. At the testing stage, we thoroughly checked the memory footprint for big and very big files: large files took a long time to process, while small ones were processed immediately.
There was no reference point for checking the execution time. RAM is called Random Access Memory for a reason – that is its competitive advantage over SSDs and especially over HDDs, and these drives handle random access badly. It turned out that this approach slowed the work down by almost 40 times compared to fully loading the file into memory. Moreover, depending on the context, we read the file 2.5-10 times in full.
The solution was simple, and the improvement was enough that the operation takes only a little longer than when the file is fully loaded into memory, while RAM consumption remains insignificant.
We found inspiration in the way data is loaded from RAM into the processor cache: when you access an array element, the processor copies dozens of neighboring elements into its cache, because the necessary elements are often nearby. Many data structures exploit this optimization to gain top performance, and it is because of this peculiarity that random access to array elements is much slower than sequential access.
We implemented a similar mechanism: we read a set of a thousand strings and remember their offsets. When the 1001st string is accessed, we drop the first 500 strings and load the next 500. If we need any of the first 500 lines later, we go to it separately, because we already have its offset.
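A simplified sketch of the resulting structure, with hypothetical names and without the exact 500/500 eviction details, could look like this:

using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

sealed class StreamStringList : IDisposable
{
    private const int WindowSize = 1000;

    private readonly FileStream _stream;
    private readonly List<long> _offsets = new List<long>();   // byte offset of every line start
    private string[] _window = new string[WindowSize];
    private int _windowStart = int.MinValue;                   // index of the first cached line

    public StreamStringList(string path)
    {
        _stream = File.OpenRead(path);
        IndexLines();                                          // one sequential pass over the file
    }

    public int Count => _offsets.Count;

    public string this[int index]
    {
        get
        {
            if (index < _windowStart || index >= _windowStart + WindowSize)
                LoadWindow(Math.Max(0, index - WindowSize / 2));   // reload the window around the miss
            return _window[index - _windowStart];
        }
    }

    private void IndexLines()
    {
        _offsets.Add(0);
        long position = 0;
        int b;
        while ((b = _stream.ReadByte()) != -1)                 // a real implementation reads in buffers
        {
            position++;
            if (b == '\n') _offsets.Add(position);             // the next line starts right after '\n'
        }
    }

    private void LoadWindow(int start)
    {
        _windowStart = start;
        _stream.Seek(_offsets[start], SeekOrigin.Begin);
        using var reader = new StreamReader(_stream, Encoding.UTF8, false, 4096, leaveOpen: true);
        for (int i = 0; i < WindowSize; i++)
            _window[i] = start + i < _offsets.Count ? reader.ReadLine() : null;
    }

    public void Dispose() => _stream.Dispose();
}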
A programmer does not always carefully formulate and check non-functional requirements. In any case, we remembered for future work that persistent storage should be read sequentially.
Analyzing the exceptions
On the web, you can collect user activity data easily. It is not the case with desktop applications: there is no tool capable of providing the set of metrics and visualizations that Google Analytics offers. Why? Here are my assumptions:
- Throughout the major part of the history of desktop application development, they had no stable and permanent access to the Web.
- There are many development tools for desktop applications. Hence, it is impossible to build a multi-purpose user data collection tool for all UI frameworks and technologies.
A key aspect of collecting data is tracking exceptions. For instance, we collect data about crashes. Previously, our users had to write to the customer support email themselves, adding the stack trace of the error copied from a special app window, and few users followed all these steps. The collected data is completely anonymized, which deprives us of the opportunity to learn reproduction steps or any other information from the user.
On the other hand, the error data lives in a Postgres database, and that paves the way for instantly checking dozens of hypotheses: you can get the answers simply by running SQL queries against the database. It is often unclear from a single stack or exception type how the exception occurred, which is why all this information is critical for studying the problem.
In addition to that, you have the opportunity to analyze all collected data and find the most problematic modules and classes. Relying on the results of the analysis, you can plan refactoring or additional tests to cover these parts of the program.
Stack decoding service
.NET builds contain IL code, which can easily be converted back into C# code, accurate to the operator, using several special programs. One of the ways to protect program code is obfuscation: namespaces, methods, variables, and classes can be renamed, and code can be replaced with an equivalent that is genuinely incomprehensible.
The need to obfuscate source code arises when you distribute your product in a way that gives the user the builds of your application, as is the case with desktop applications. All builds, including intermediate builds for testers, are carefully obfuscated.
Our Quality Assurance team uses stack decoding tools from the obfuscator’s developer. To start decoding, they need to run the application, find the deobfuscation maps published by CI for the specific build, and paste the exception stack into the input field.
Different versions and editions were obfuscated differently, which made it difficult for a developer to study the problem or could even put them on the wrong track. It was obvious that this process had to be automated.
The deobfuscation map format turned out to be pretty straightforward, so we easily parsed it and wrote a stack decoding program. Shortly before that, a web UI had been developed to render exceptions by product version and group them by stack; it was a .NET Core website with a SQLite database.
SQLite is a neat tool for small solutions, so we tried to put the deobfuscation maps there too. Every build generated approximately 500 thousand obfuscated-to-original name pairs, and SQLite could not handle such an aggressive insert rate.
While the data for one build was being inserted into the database, two more builds were added to the queue. Not long before that, I had listened to a talk on ClickHouse and was eager to try it out. It proved to be excellent: the insert rate sped up by more than 200 times.
That said, stack decoding (reading from the database) slowed down by nearly 50 times, but since each stack took less than 1 ms, it was not worth spending time studying this problem.
ML.NET for classification of exceptions
On the subject of the automatic processing of exceptions, we made a few more enhancements.
We already had the web UI for a convenient review of exceptions grouped by stacks, and we had Grafana for high-level analysis of product stability at the level of versions and product lines. But a programmer’s eye, constantly craving optimization, caught yet another aspect of this process.
Historically, dbForge line development was divided among 4 teams. Each team had its own functionality to work on, though the borderline was not always obvious. Our technical support team, relying on their experience, read the stack and assigned it to this or that team. They managed it quite well, yet, in some cases, mistakes occurred. The information on errors from analytics came to Jira on its own, but the support team still needed to classify tasks by team.
In the meantime, Microsoft introduced a new library – ML.NET – and we still had this classification task, so a library of that kind was exactly what we needed. We extracted the stacks of all resolved exceptions from Redmine, the project management system we used earlier, and from Jira, which we use at present.
We obtained a data set of some 5 thousand pairs of exception stack trace and team. We put 100 exceptions aside and used the rest to train a model. The accuracy was about 75%. Since we have 4 teams, random or round-robin assignment would only reach 25%, so the model noticeably saved the support team’s time.
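Below is a sketch of how such a classifier can be put together with ML.NET; the file name, the column layout, and the choice of trainer are assumptions for illustration, not necessarily what we used.

using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class StackRecord
{
    [LoadColumn(0)] public string StackTrace { get; set; }
    [LoadColumn(1)] public string Team { get; set; }
}

public class TeamPrediction
{
    [ColumnName("PredictedLabel")] public string Team { get; set; }
}

static class ExceptionClassifier
{
    static void Main()
    {
        var ml = new MLContext(seed: 0);

        // A tab-separated file with "stack trace <TAB> team" pairs exported from the tracker.
        var data = ml.Data.LoadFromTextFile<StackRecord>("stacks.tsv", separatorChar: '\t', hasHeader: true);
        var split = ml.Data.TrainTestSplit(data, testFraction: 0.02);   // roughly 100 exceptions held out

        var pipeline = ml.Transforms.Conversion.MapValueToKey("Label", nameof(StackRecord.Team))
            .Append(ml.Transforms.Text.FeaturizeText("Features", nameof(StackRecord.StackTrace)))
            .Append(ml.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            .Append(ml.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        var model = pipeline.Fit(split.TrainSet);

        var metrics = ml.MulticlassClassification.Evaluate(model.Transform(split.TestSet));
        Console.WriteLine($"Accuracy: {metrics.MicroAccuracy:P1}");

        var engine = ml.Model.CreatePredictionEngine<StackRecord, TeamPrediction>(model);
        var guess = engine.Predict(new StackRecord { StackTrace = "at MyApp.Parser.MatchKeyword(...)" });
        Console.WriteLine($"Route to: {guess.Team}");
    }
}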
I believe that if we cleaned up the incoming data more thoroughly and studied the ML.NET library and the theoretical foundations of machine learning in more depth, we could improve these results. At the same time, I was impressed by the simplicity of the library: with no special knowledge of AI and ML, we gained real benefits in less than an hour.
Conclusion
Hopefully, some readers happen to be users of the products described in this article, and some of these stories shed light on why this or that feature was implemented the way it was.
And now, let me conclude:
- We should make decisions based on data, not assumptions. It is about behavior analytics and the insights we can obtain from it.
- We ought to constantly invest in tools. There is nothing wrong with developing something for that purpose: over the next few months it will save a lot of time and rid us of routine, and routine on top of time expenditure can be very demotivating.
- When we develop internal tools, we get a great chance to try out new technologies that can later be applied in production solutions.
- There are infinitely many tools for data analysis. Still, we managed to extract all the crucial information using SQL: it is the most powerful tool for formulating a question to data and receiving an answer in a structured form.