Completing SQL. Part 1: The difficulties with parsing or tales of polishing ANTLR

Total: 22 Average: 3.9

I’ve been working for a company that develops IDEs for database interaction for more than five years. Before starting to write this article, I had no idea how many fancy tales would lie ahead. Thus after making 30 pages of it, I decided to group up the tales thematically and divide the article into several parts.

Introduction

In the course of publishing the parts, I will be adding the corresponding links:

Throughout my work experience, many interesting things have happened. We have detected several bugs in .NET, optimized the performance of some feature; optimized features, some of them by several times, and some – only by a tiny bit. Some things were done great from the first attempt,  and some failed even after several shots. My team develops and supports the IDE language features, and code auto-completion is the major one. That’s why the series of articles is titled in such a way. Every single one of them will convey several stories, some about success, some about the failure.

This part is dedicated to SQL parsing issues, fighting coping with them, and the errors that occurred throughout the process.

A few words about myself

I entered this job as a junior specialist with little to no experience. Like others, I was fascinated by the gaming industry and it was the chief reason why I started coding. I was involved in the development of several games and some of them have been successfully released. There are also a dozen games developed or abandoned at different stages of readiness.

Furthermore, I have completed several freelance projects before getting this job. The biggest one was the application for social networks being a football portal with tournament brackets, players’ data, and goals/score notifications via SMS. It was done more than 10 years ago and back then, not as many people used smartphones, and those who did, either had no Internet or used the godawful EDGE that barely managed to open text-version sites. So, the idea seemed cool to me.

It just happened that I always liked creating different tools for myself and developers, and sometimes it was pretty close to what I had to do at work. For example, when I was learning Win32API, one of the projects I made was a program for highlighting XML code in Rich Edit Control. There was also an option to save the highlighted text into BMP, HTML, or formerly fabulous BB-code for forums.

Another project close to my work was the code analysis for С/С++ that built its action diagram. This action diagram was in rigid compliance to one professor’s requirements and it was done clumsily with no knowledge of syntax analysis and worked only hanks to my tough personality. A few years later, I came across this program while cleaning my PC and decided to save and post it to GitHub for the sake of nostalgia.

These experiments plus freelance gave me some experience and an opportunity to get a job. In some time, after a couple of dozen painful reviews, there was some gain and I started to turn a profit for the company. Looking back, it’s funny to realize that thanks to these coincidences I started to engage in what I had always gravitated towards.

Difficulties

This block is intended to give some context of what we are doing.

Desktop development

Perhaps it’s not a difficulty at all since it’s a well-established field with very few grey areas:  libraries are stable, best practices are known.  Nevertheless, this aspect of our project is here since I, like other programmers, have a thing for the new stuff and all the novelties migrated into the web. It was one of those rainy days when I was sitting at the windowsill with a cup of hot cacao, covered in throw blanket and thought about Redis, react, high load, and distributed systems that were developed without my participation.

Another problem was the inability to explain the project complexity to fellow devs when drinking beer with them. When you work on the applications that operate under vastly different principles, it becomes even harder.

SQL and dialects parsing

Even before this job, I used to write small parsers for different languages. I also ran some .NET courses and suggested my students write their simple JSON parser as an additional assignment within the discussion of rows. But SQL and its variables are nothing like XML or JSON that are clear to both, parsers and people. Moreover, SQL syntax is more complex than its C/C++ counterpart with its myriad of functions accumulated throughout its existence.

SQL has a different structure with an attempt to look like a natural language, and the attempt is quite successful, I should say. Depending on the dialect there are several thousands of keywords in the language. To distinguish one statement from another, you often need to look for one or two words (tokens) ahead. This approach is called ‘lookahead’.

There is a parser classification depending on how far they can look ahead: LA(1), LA(2), or LA(*), which means that a parser can look as far ahead as needed to define a right fork. Sometimes, an optional clause ending matches the start of another optional clause and these situations make parsing much harder to run. T-SQL also doesn’t make things easier: a semicolon isn’t necessary. Also, some SQL statements may have, but not necessarily, endings that can conflict with the beginning of the previous statements.

Don’t you still believe me? There is a way of describing formal languages via grammar. Grammar is a code written on a certain language that describes another language. You can generate a parser out of the grammar using this or that tool. The most notable tools and languages that describe grammar are YACC and ANTLR. YACC-generated parsers are used in MySQL, MariaDB, and PostgreSQL engines. We could try and take them right from the source code and develop ‘code completion’ and other functions based on the SQL analysis using these parsers. Furthermore, this product would receive free development updates and the parser would behave the same way the source engine does.

So why are we still using ANTLR? It firmly supports C#/.NET, has a decent toolkit, its syntax is much easier to read and write. ANTLR syntax came to be so handy that Microsoft now uses it in its official C# documentation.

But let’s go back to SQL complexity compared to other languages when it comes to parsing. For that measure, I would like to compare the grammar sizes of the publicly available languages. In dbForge, we use our pieces of grammar: they’re more complete than the others but, unfortunately, they are overloaded with the inserts of the C# code for supporting different functions (see details in the ‘Syntax analysis without trees’ paragraph of the ‘Errors’ section).

The grammar sizes for different languages  are as follows :
JS – 475 parser rows + 273 lexers = 748 rows
Java – 615 parser rows + 211 lexers = 826 rows
C# – 1159 parser rows + 433 lexers = 1592 rows
С++ – 1933 rows

MySQL – 2515 parser rows + 1189 lexers = 3704 rows
T-SQL – 4035 parser rows + 896 lexers = 4931 rows
PL SQL – 6719 parser rows + 2366 lexers = 9085 rows

The endings of some lexers feature the lists of the Unicode characters available in the language. Those lists are useless regarding the evaluation of language complexity. Thus the number of rows I took always ended before these lists.

Evaluating the complexity of language parsing based on the number of rows in the language grammar is debatable. The completeness of the opened grammars compared to the real language syntax is a matter of argument as well. Still, I believe it’s important to show the numbers that showcase a huge discrepancy.

That’s not all. We need more than just parsing several SQL files. Since we’re developing an IDE, we should deal with incomplete or invalid scripts.  Even if you write your scripts without mistakes, the chance is that you write them incoherently. So the script hardly remains valid throughout the entire development process. First, I write SELECT FROM and see the list of available tables; after I choose one and move the carriage to SELECT, press SPACE, and wait for the list of columns from this very table. It’s a simple example but it shows that a parser, that ensures code completion work in IDE, can’t fall with an exception after meeting an invalid script.

We had to come up with many tricks to enable valid tip work during many similar scenarios but the customers still send different working scenarios with unfinished scripts, that’s why we need to come up with the newer tricks.

Predicate wars

During the code parsing, sometimes the following word doesn’t tell you which of two alternatives to choose. It’s often enough just to look at a certain number of words ahead and you can choose a definite alternative. The mechanism that solves this type of inaccuracies is “lookahead” in ANTLR. In this case, the parser method is the inserted chain of ‘if’s’ and each of them looks one step ahead. Down below is the example of the grammar that generates the uncertainty of this kind:

rule1:
  'a' rule2 | rule3
;

rule2:
  'b' 'c' 'd'
;

rule3:
  'b' 'c' 'e'
;

In the middle of rule1, when the token ‘a’ is already passed, the parser will look two steps forward for choosing the rule to follow. This check will be performed once again but this grammar can be re-written to exclude this ‘lookahead’. The downside is that such optimizations harm the structure while the performance boost is rather small.

There are more complex ways to solve this kind of uncertainty. For example, the Syntax predicate (SynPred) mechanism in ANTLR3. It helps when the optional ending of a clause crosses the beginning of the next optional clause. In terms of ANTLR3, a predicate is a generated method that performs a virtual text entry according to one of the alternatives, and if successful, it returns the ‘true’ value and such predicate completion is called successful. When it’s a virtual entry, it is called a ‘backtracking’ mode entry. If a predicate worked successfully, then the real entry happens. It’s only a problem when a predicate starts inside of another predicate so that one distance might get crossed hundreds or thousands of times.

Let’s review a simplified example. There are three points of uncertainty: (A, B, C).

  1. The parser enters A, remembers its position in the text, starts a level-1 virtual entry.
  2. The parser enters B, remembers its position in the text, starts a level-2 virtual entry.
  3. The parser enters C, remembers its position in the text, starts a level-3 virtual entry.
  4. The parser completes a level-3 virtual entry, returns to level-2 and passes C once again.
  5. The parser completes a level-2 virtual entry, returns to level-1 and passes B and C once again.
  6. The parser completes successfully a virtual entry, returns, and performs a real entry through A, B, and C.

As a result, all the checks within C will be done 4 times, within B – 3 times, within A – 2 times. But what if a fitting alternative is in the second or the third in the list? Then one of the predicate stages will fail, its position in the text will roll back, and another predicate will start running.

When analyzing the reasons for the app freezing, we often stumbled upon the trace of SynPred being executed several thousand times. SynPreds are especially problematic in recursive rules, and sadly, SQL is recursive by its nature: the ability to use subqueries almost everywhere has its price. However, it’s sometimes possible to manipulate the rule to make a predicate go away by using different tricks and gimmicks.

SynPred hurts performance. At some point, their number was put under a rigid control, but the problem is that when you write grammar code, SynPred can appear to be unobvious to you. More so, changing one rule can cause SynPred to appear in another rule that makes the control over them practically impossible. We created a simple ‘regular expression’ tool for controlling the number of predicates run by a special MSBuild Task. If the number of predicates didn’t match the number that was specified in a file, the task immediately failed the build and warned about an error.  When seeing the error, a developer should have re-written the code of the rule several times in an attempt to remove the redundant predicates (probably, with the assistance of the fellow developers).  If the predicate can’t be avoided, the developer adds one to a special file that drags extra attention for the review.

On rare occasions, we even wrote our predicates using C# just to avoid the ANTLR-generated ones.  Luckily, this method also exists.

Steep solutions

Grammar inheritance

As soon as any changes to DBMSes we support are announced, we start a process of implementing support for the changes in our tools.  Support for the grammatical syntax constructions is always a starting point. We create a special grammar for each SQL dialect that enables some code repetition but, in the end, it’s easier than trying to find what they have in common. Just a couple of years ago, MySQL and MariaDB were very similar, and writing separate pieces of grammar didn’t make sense. That’s why we supported those few constructions that were present MariaDB but were absent MySQL in the MySQL grammar. This approach had one con for the MySQL users: we could suggest the invalid constructions.  But in recent years, MySQL and MariaDB started to become more and more different in terms of the syntax and other features, so supporting two different languages within the same grammar became irrelevant.

Full grammar copying and its future support could be the solution. And after a long discussion, we decided to try a bold solution. I was happy we didn’t decide to go for copying – my entire being was against that. So we went for writing our own ANTLR grammar preprocessor that does the grammar inheritance. Just recently, my eye caught ANTLR3 grammar in an ANTLR4 official repository. I guess you need to read this sentence several times to understand the depth of this insanity.

It also became obvious that we would like to have a mechanism for polymorphism.  I.e. we needed an ability to not only redefine the rule in the descendant but also calling the basic one. We would also like to control the position when calling the base rule: it’s beginning, middle, or end. Thus we decided that all the rules can be redefined and there’s no need to specify anything else. To redefine the rule, it’s enough to specify a rule with the same name in the descendant grammar. Once it’s done, the rule from a parent grammar will be available under a different name.

“Tools are a definite plus when we compare ANTLR to other language recognition tools, Visual Studio, and ANTLRWorks. And you don’t wanna lose this advantage while implementing the inheritance. So how do we specify a base grammar? We could make some sort of rules convention but that’s rather unobvious. Another option was specifying extra info in a separate file but even now I feel disgusted for even daring to think about that. The solution was specifying a basic grammar in an inherited grammar in an ANTLR commentary format. For ANTLR tools it’s just a comment, but we can extract all required information from it.

When the requirements are set and done, it’s time for their implementation. We wrote a MsBuild Task that was embedded into the whole build system as pre-build-action. The task was to do the job of a preprocessor for ANTLR grammar by generating the resulting grammar from its base and inherited peers. The resulting grammar was processed by ANTLR itself.

So if there was a rule with the same name in both the inherited and parent grammars, the base rule was renamed – after the underline, the parent grammar name gets added. So by using this name, we could refer to it in the inherited grammar.

The preprocessor work mechanism didn’t take much time but with an addition of parser generation, it took extra 10-20 seconds to rebuild the project. And at some point, we got sick of this fact and started thinking about optimizing it. The decision was to inject the CS parser file with a hash sum of all the grammars that it depends on (as a comment). Before doing anything, the preprocessor compared these hashes with files hashes located on the disk; and the parser was considered valid if these hashes matched each other.

During the initial development stage, we had to come across many invalid parsers and grammars built on the outdated preprocessor versions. As a result, the title commentary received a hash sum of the build with its preprocessor.

One more ANTLR postprocessing

In many programming languages, Keywords can’t be used as the subject names. There can be from 800 to 3000 keywords in SQL depending on the dialect. Most of them are tied to the context inside databases so forbidding them as object names would be met with anger and frustration from the users. That’s why SQL has reserved and unreserved keywords. You can’t name your object as the reserved word (SELECT, FROM, etc.) without quoting it, but you can do this to an unreserved word (CONVERSATION, AVAILABILITY, etc.). This interaction makes the parser development harder.

During the lexical analysis, the context is unknown but a parser already requires different numbers for the identifier and the keyword. That’s why we added another postprocessing to the ANTLR parser: it replaces all the obvious identifier checks with calling a special method. This method has a more detailed check:  if the entry calls an identifier and the identifier is expected to be met onwards, then it’s all good. But if an unreserved word is an entry, then it should be double-checked. This extra check reviews the branch search in the current context where this unreserved keyword can be used as a keyword, and if there are no such branches, it can be used as an identifier.

Technically, this problem could be solved by the means of ANTLR but this decision isn’t optimal. The ANTLR way is to create a rule where all unreserved keywords and a lexeme identifier are listed. Further on, a special rule will be used instead of a lexeme identifier. Not only this solution makes a developer not forget to add the keyword where it’s used and in special rule, but also optimizes the time spent.

ERRORS

Syntax analysis without trees

The syntax tree is usually a result of parser work. The syntax tree is usually a data structure that reflects the program text through formal grammar. If you want to implement a code editor with the language auto-completion, you will most likely get the following algorithm:

  1. Text parsing in the editor. Then you get a syntax tree.
  2. Finding a node under the carriage and match it against the grammar. Find out which keywords and object types will be available at the Point.
    In this case, the grammar is easy to imagine as a Graph or a State Machine.

Unfortunately, only the third version of ANTLR was available when the dbForge IDE had started its development. The fourth version was done from scratch and is vastly different from the third, so once the parser finishes its journey through the code, a parsing tree gets automatically generated. The third version, however, wasn’t as nimble and although you could tell ANTLR how to build a tree, the usage wasn’t smooth what so ever.

Moreover, many articles about this topic suggested using the ‘actions’ mechanism for running code when the parser was passing through the rule. This mechanism is very handy and allows receiving quick results. But it has led to big architectural problems with product development and made supporting new functionality more complex. The thing is, a single grammar file started accumulating ‘actions’ due to the big number of functionalities that should’ve rather been distributed to different builds. So we managed to distribute actions handlers to different builds and make a sneaky subscriber-notifier pattern variation for that measure. It helped but the calls with and data transferring were still cluttering the grammar, complicating new functionality support, and applying serious limitations to the architecture. All in all, that was a cool decision… just with some downsides.

But it’s not as obvious as it seems. The thing is, ANTLR3 works 6 times faster than ANTLR4 according to our measurements. Also, the syntax tree for big scripts could take too much RAM, which wasn’t good news, so we needed to operate within 32-bit address space of Visual Studio and SQL Management Studio, an ever-so-crucial factor, you gotta agree.

Conclusion

So, the intermediate results are as follows:

  • ANTLR is a powerful tool for parser building.
  • Its advantages are the tools, convenient syntax, and a big number of supported languages
  • ANTLR4 was made from scratch and offers a workflow that is different from ANTLR3. There is always a way to get more value from third-party libraries than they’re supposed to give
  • SQL is a specific language and building its parsers is a tall order.
  • Code parsing for IDE-building has its features – prepare to work on unfinished or invalid scripts.
Andrey Podkolzin

Andrey Podkolzin

Andrey has more than 5 years of .NET development experience in Devart. He likes to optimize and parallel code. In the past, he had a passion for game development, however, nowadays he focused on the desktop products development.