Completing SQL. Part 4. Working with exceptions and the influence of data on the development process. Using ML.NET.


What is this article about?

This is the final article in the series on the life of IDE developers for databases. In this part, I will describe how the introduction of product analytics affected some aspects of development.

To understand this article, it is not necessary to read all three previous parts in full; still, it would help to read the opening paragraphs of the first part, since they set the context. Nonetheless, here are the key points, in case you have no opportunity to get familiar with the first article:

  • We are building a line of IDEs for MySQL, SQL Server, Oracle, and PostgreSQL
  • This is a .NET stack desktop application with all that it implies
  • SQL parsing is a difficult task in terms of performance and memory. Therefore, we constantly have to use different tricks for optimization


Analyzing the exceptions

Collecting user activity data is quick and easy to set up on the web. With desktop applications, however, it is a different story: there is no tool that offers a set of metrics and visualization capabilities comparable to Google Analytics. Why is that? It would be extremely interesting for me to read a retrospective article on why this happened. For now, I will put forward a few assumptions of my own:

  1. For most of the history of desktop application development, applications had no stable, permanent access to the Web.
  2. There is a great number of development tools for desktop applications. Hence, it is impossible to build a general-purpose user data collection tool that covers all UI frameworks and technologies.

As a result, each company tackled product analytics for its products in its own way, and so did we. For a couple of years, with the consent of our users, we have been collecting a relatively small data set about which features of the dbForge product line they use most often.

A key aspect of data collection was tracking exceptions. For instance, we collect data about the crashes our users experience. Previously, users had to email customer support themselves and attach the stack trace of the error, copied from a special app window. As a matter of fact, few users followed all these steps. The collected data is completely anonymized, which, unfortunately, deprives us of the opportunity to learn the reproduction steps or any other information from the user.

On the other hand, the error data now sits in a Postgres database, and this paves the way for instantly checking dozens of hypotheses. What do the users who hit this error have in common? The same OS build? Perhaps the same running processes? Which features had they all used before the error occurred? You can get the answers to all of these questions simply by running SQL queries against the database. It is often unclear from the stack or the exception type alone how the exception occurred, which is why all this information is critical for studying the problem.
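For illustration, here is a minimal sketch of such an ad-hoc check in C# with Npgsql. The table and column names (exceptions, stack_hash, os_build, user_id) are hypothetical, not our real analytics schema.

```csharp
using System;
using Npgsql;

// Hypothesis check: do the users who hit a particular exception share the same OS build?
// Table and column names are illustrative only.
const string sql = @"
    SELECT os_build, COUNT(DISTINCT user_id) AS affected_users
    FROM exceptions
    WHERE stack_hash = @stackHash
    GROUP BY os_build
    ORDER BY affected_users DESC;";

await using var connection = new NpgsqlConnection("Host=analytics;Database=telemetry;Username=reader;Password=secret");
await connection.OpenAsync();

await using var command = new NpgsqlCommand(sql, connection);
command.Parameters.AddWithValue("stackHash", "hash-of-the-grouped-stack");

await using var reader = await command.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine($"{reader.GetString(0)}: {reader.GetInt64(1)} users");
```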

In addition to that, you can analyze all the collected data and find the most problematic modules and classes. Based on the results of this analysis, you can plan refactoring or additional tests to cover these parts of the program.
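A sketch of what such an aggregation might look like; again, the schema (exceptions, top_app_frame, reported_at, user_id) is an assumption made for illustration:

```csharp
// Rank application modules by the number of distinct users who hit exceptions
// whose topmost application frame belongs to them. The schema is hypothetical.
const string topProblematicModules = @"
    SELECT split_part(top_app_frame, '.', 2) AS module,   -- 'Company.Module.Class.Method' -> 'Module'
           COUNT(*)                          AS exception_count,
           COUNT(DISTINCT user_id)           AS affected_users
    FROM exceptions
    WHERE reported_at > now() - interval '30 days'
    GROUP BY module
    ORDER BY affected_users DESC
    LIMIT 20;";
```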

Stack decoding service

.NET builds contain IL code, which can easily be converted back into C# code, accurate down to the statement, using a number of special programs. One of the ways to protect program code is obfuscation. There is a range of solutions for this on the market: namespaces, methods, variables, and classes can be renamed, and code can be replaced with an equivalent that is genuinely incomprehensible. If you use .NET as a server backend, obfuscation is often unnecessary: an attacker is unlikely to gain access to the server file system, and even if he does, he will most probably steal data rather than extract the source code.

The need to obfuscate source code arises when you distribute your product in such a way that users get the builds of your application, and desktop applications are exactly that case. In view of this, all builds, including intermediate builds for testers, are carefully obfuscated. The flip side is obfuscated exception stacks, both at the testing stage and from users. Our Quality Assurance team was trained to use the stack decoding tools from the obfuscator's vendor. To decode a stack, they had to run the tool, find the deobfuscation maps published by CI for the specific build, and paste the exception stack into the input field. This work was tedious, as every stack took several minutes. We could still keep up while exceptions were generated only by testers or sent to our mail by conscientious users; yet, when exceptions started arriving through the analytics service, we could not cope with them anymore.

Different versions and editions were obfuscated in different ways, which made it difficult for a developer to study the problem or could even put him on the wrong track. It was obvious that this process had to be automated. The deobfuscation map format turned out to be pretty straightforward: we easily parsed it and wrote a stack decoding program. Shortly before that, a web UI had been developed to display exceptions by product version and to group them by stack. It was a .NET Core website with an SQLite database. SQLite is a neat tool for small solutions, so we tried to store the deobfuscation maps there too. Every build generated approximately 500 thousand encoded/decoded name pairs, and SQLite could not handle such an aggressive insert rate.
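The decoder itself boils down to dictionary lookups over the tokens of a stack trace. A rough sketch of the idea, assuming the map has already been loaded into a dictionary (the real map format depends on the obfuscator and is not shown here):

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Replaces every identifier-like token that has a known original name.
// The map ("A.b" -> "SqlParser.ParseSelect") is a simplified stand-in for the real format.
static string DecodeStack(string obfuscatedStack, IReadOnlyDictionary<string, string> map) =>
    Regex.Replace(
        obfuscatedStack,
        @"[\w.]+",
        match => map.TryGetValue(match.Value, out var original) ? original : match.Value);

var map = new Dictionary<string, string> { ["A.b"] = "SqlParser.ParseSelect" };
Console.WriteLine(DecodeStack("   at A.b(String sql)", map));
// prints: "   at SqlParser.ParseSelect(String sql)"
```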

While data for one build was being inserted into the database, two more builds were already waiting in the queue. I am pretty sure we could have tuned SQLite for such a load: apart from packing inserts into transactions and trying different consistency modes, we could have tried to partition the database somehow. But not long before that problem appeared, I had listened to a talk on ClickHouse and was eager to try it out. It proved to be excellent: the insert rate sped up by more than 200 times. That said, stack decoding (reading from the database) slowed down by nearly 50 times, but since each stack still took less than 1 ms, it was not worth spending time studying this problem.
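For reference, bulk-loading the pairs into ClickHouse can be as simple as posting a whole batch to its HTTP interface (port 8123). A rough sketch with a hypothetical table deobfuscation_map(build, obfuscated, original); value escaping and authentication are omitted:

```csharp
using System.Collections.Generic;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Inserts one build's worth of (obfuscated, original) name pairs in a single batch.
// ClickHouse accepts the row data right in the POST body after the INSERT statement.
static async Task InsertMapAsync(HttpClient http, string build,
                                 IEnumerable<(string Obfuscated, string Original)> pairs)
{
    var body = new StringBuilder("INSERT INTO deobfuscation_map (build, obfuscated, original) FORMAT TabSeparated\n");
    foreach (var (obf, orig) in pairs)
        body.Append(build).Append('\t').Append(obf).Append('\t').Append(orig).Append('\n');

    using var content = new StringContent(body.ToString(), Encoding.UTF8);
    using var response = await http.PostAsync("http://clickhouse-host:8123/", content);
    response.EnsureSuccessStatusCode();
}
```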

ML.NET for classification of exceptions

On the subject of the automatic processing of exceptions, we made a few more enhancements. 

We already had the web UI for conveniently reviewing exceptions grouped by stack, and we had Grafana for high-level analysis of product stability at the level of versions and product lines. But a programmer's eye, constantly craving optimization, caught another aspect of this process: our support team was responsible for reviewing the stream of exceptions and creating tasks from them in Jira. Consequently, we wrote a simple service that did this automatically. Later, we taught it to set and update task priorities depending on the number of users who had faced the problem.
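The service itself is little more than an HTTP call to Jira's REST API. A rough sketch of creating an issue; the project key, field values, and the priority threshold are illustrative:

```csharp
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// Creates a bug for a new exception group; priority is derived from the number of affected users.
// The HttpClient is assumed to have Jira's base address and credentials configured.
static async Task CreateIssueAsync(HttpClient jira, string summary, string decodedStack, int affectedUsers)
{
    var issue = new
    {
        fields = new
        {
            project = new { key = "DBF" },                             // hypothetical project key
            summary = $"[analytics] {summary}",
            description = decodedStack,
            issuetype = new { name = "Bug" },
            priority = new { name = affectedUsers > 100 ? "High" : "Medium" }
        }
    };

    using var response = await jira.PostAsJsonAsync("rest/api/2/issue", issue);
    response.EnsureSuccessStatusCode();
}
```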

Historically, dbForge development has been divided among four teams. Each team has its own functionality to work on, though the boundary is not always clear and obvious. Our technical support team, relying on their experience, would read a stack and assign it to one team or another. They managed quite well, yet mistakes still occurred in some cases. The information about errors from analytics reached Jira on its own, but the support team still had to classify the tasks by team.

In the meantime, Microsoft released a new library, ML.NET. And we still had this classification task, whose rules we could not formulate ourselves. A library of that kind was exactly what we needed. We extracted the stacks of all resolved exceptions from Redmine, the project management system we used earlier, and from Jira, which we use at present. We obtained a data set of some 5 thousand pairs of exception stack trace and team. We put 100 exceptions aside and trained a model on the rest. The accuracy was about 75%; since there are 4 teams, random or round-robin assignment would only attain 25%. It was a bit worse than what our support team managed to achieve, but it saved a substantial amount of their time.
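For reference, this is roughly what such a pipeline looks like in ML.NET; the column names, the input file, and the trainer choice are assumptions for illustration, not our exact setup:

```csharp
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class ExceptionRecord
{
    [LoadColumn(0)] public string StackTrace { get; set; }
    [LoadColumn(1)] public string Team { get; set; }          // the team that resolved the exception
}

public class TeamPrediction
{
    [ColumnName("PredictedLabel")] public string Team { get; set; }
}

public static class Program
{
    public static void Main()
    {
        var ml = new MLContext(seed: 0);

        // One exception per line: the stack trace (newlines escaped) and the owning team.
        var data = ml.Data.LoadFromTextFile<ExceptionRecord>("resolved_exceptions.tsv", separatorChar: '\t');
        var split = ml.Data.TrainTestSplit(data, testFraction: 0.02);   // roughly 100 of ~5000 held out

        var pipeline = ml.Transforms.Conversion.MapValueToKey("Label", nameof(ExceptionRecord.Team))
            .Append(ml.Transforms.Text.FeaturizeText("Features", nameof(ExceptionRecord.StackTrace)))
            .Append(ml.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            .Append(ml.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        var model = pipeline.Fit(split.TrainSet);

        var metrics = ml.MulticlassClassification.Evaluate(model.Transform(split.TestSet));
        Console.WriteLine($"MicroAccuracy: {metrics.MicroAccuracy:P0}");

        var engine = ml.Model.CreatePredictionEngine<ExceptionRecord, TeamPrediction>(model);
        var team = engine.Predict(new ExceptionRecord { StackTrace = "at Devart.DbForge.Sql.Parser.Parse(...)" }).Team;
        Console.WriteLine(team);
    }
}
```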

To my way of thinking, if we thoroughly cleaned up the incoming data, studied the ML.NET library in more depth, and built up the theoretical foundation in machine learning in general, we could improve these results. At the same time, I was impressed with the simplicity of this library: with no special knowledge of AI and ML, we gained a real benefit in less than an hour.

Conclusion

I believe that everyone who has read to the end of this final article has found something interesting and/or useful.

Hopefully, some of the readers are users of the products described in this article, and some of these lines have shed light on why this or that feature was implemented the way it was.

I am writing these lines before even the first article in the series has been published, so I do not yet know how it will be received; still, I am pretty sure I will find many helpful and noteworthy comments.

And now, let me draw the conclusions:

  • We should make decisions based on data, not assumptions. I wrote about this in the previous article using optimization as an example; there, the data came from memory and performance profilers. This article is about behavior analytics and the insights we can obtain from it.
  • Collecting and further analyzing the dataset of exceptions allows us to determine a list of problematic modules.
  • We ought to constantly invest in tools. There is nothing wrong with having to develop something ourselves for this. Over the next few months, it will save us a lot of time and rid us of routine. Routine on top of time expenditure can be very demotivating.
  • When we develop some internal tools, we get a super chance to try out new technologies, which can be applied in production solutions later on. 
  • There are countless tools for data analysis; still, we managed to extract the really important information using plain SQL. It is the most powerful tool for putting a question to data and receiving the answer in a structured form.

I want to give thanks to everyone who read and helped me create this article!

Andrey Podkolzin

Andrey has more than 5 years of .NET development experience at Devart. He likes optimizing and parallelizing code. In the past, he had a passion for game development; nowadays, he focuses on desktop product development.