StringBuilder: the Past and the Future

In the previous article, I elaborated on peculiarities of string concatenation. In this article, I would like to consider the StringBuilder class in detail.

As we all know, strings in .Net are immutable (without use of unsafe). Therefore, it is not a good idea to perform concatenation frequently. It means that the following code has quite serious problems with memory load:

So, what is wrong with this code?

Each iteration creates a string, which is bigger than the previous string by 1, with the subsequent copying of characters from the old string.

Therefore, the total number of characters in use equals:

Total number of characters in use

This formula is a sum of arithmetic progression:

Arithmetical Progression

Thus, such concatenation scenario requires memory to be proportional to O(n2), where n is a length of a string.

To resolve problems in such code, we use the StringBuilder class, knowing that operations with this class do not lead to memory losses. In fact, StringBuilder is a mutable string.

This code is not ideal and spends a certain amount of memory in vain, but it does it subtly compared to the previous code.

Implementation of the StringBuilder class has been drastically changed in .NET 4.0 in comparison with the previous versions. Therefore, I think it will be interesting to observe the changes.

StringBuilder in .NET 2.0

In .NET 2.0, the StringBuilder class has the following fields:

currentThread — a thread ID, in which an object instance is created;
m_MaxCapacity — maximum capacity of the given instance;
m_StringValue — a string containing characters.

In fact, the StringBuilder class works with the String data type inside. Since strings are immutable on the mscorlib.dll side, StringBuilder can easily modify the string located in m_StringValue.

The initial length is 16 characters, and in case of shortage of space for adding new characters, StringBuilder replaces the inside string with the twice bigger string and copies all characters from the previous string + new characters to a new string.

Duplication of the string length leads to the (O(n)) linear complexity of memory, opposed to squared complexity that is specific to normal string.

The important addition to performance improvement is setting a required volume during creation of an instance of the StringBuilder class. Therefore, we do not need another size duplication and data copying in case of memory bottleneck. However, it is possible only if we know the size of resulting string beforehand.

Let’s consider implementation of the frequently used StringBuilder methods in .NET 2.0. For better code readability, I did not include check conditions of input parameters.

The Append() Method

This method simply checks free space in current instance for adding a new string. If there is enough space, a simple on-site copying into the vacant part of the string takes place. If there is not enough space, the size duplication and copying of the old and new strings take place.

The Insert() Method

Similarly to the previous method, this method checks free space in the current instance for insertion of a new string. Depending on this check, the method either doubles the string size or inserts on site into the source string without the size change.

The Remove() Method

This method deletes the needless characters shifting the remaining part of the string to the left. When deleting the last character, there is no need in shifting anything, in fact. That’s why, deletion from the end is much faster than from any part of the string.

The ToString() Method

This method returns either copy of the string, or the string it operates. As a rule, the first call of this method returns link to the source string. Therefore, it is executed very fast, but every next call leads to copying of the string. The StringBuilder class in .Net 2.0 focuses on the quickness of this method.

In general, the StringBuilder class in .Net 2.0 is implemented in rather simple fashion. It uses the modifiable string, and when there is not enough space, it creates a new string, the length of which is twice longer than the previous one. The scenario of the length duplication leads to the linear complexity of memory, which is much better than the squared complexity. However, the method is not effective at large lengths of strings. Besides, because of the large size, the string can often be located in bunch of large objects (LOH), that is not good as well.

StringBuilder in .NET 4.0

As I wrote before, implementation of the StringBuilder has been changed in .Net 4.0. Now, Char[] is used instead of String for storing characters, and class itself is a linked list of StringBuilders, similar to RopeString.

The reason of this change is obvious: there is no need in memory reallocation in case of memory bottleneck. It also means that the ToString() method works much slower, since the final string must be generated first, and the Append() method works much faster, since it does not require copying. However, it fits in the typical usage scenario for StringBuilder: multiple calls of Append(), then the call of ToString().

The StringBuilder class in .Net 4.0 has the following fields:

m_ChunkChars – an array containing characters of the current element of the linked list (a bit of a string);
m_ChunkPrevious – a link to previous element (StringBuilder) in the list;
m_ChunkLength – a real length of the current list element (quantity of characters used);
m_ChunkOffset – total number of string characters used (logical length);
m_MaxCapacity – a maximum capacity of the current instance of StringBuilder.

In .NET Framework 4 and .NET Framework 4.5, when you instantiate the SrtingBuilder object by calling the StringBuilder(Int32, Int32) constructor, both the length and capacity of the StringBuilder instance can grow beyond the value of its MaxCapacity property. This can occur particularly when you call the Append and AppendFormat methdos to append small strings.

The maximum length of the MaxChunkSize list element is 8000. As you understand, it is made for a purpose. Here is a comment from the class developer:

We want to keep chunk arrays out of large object heap (< 85K bytes ~ 40K chars) to be sure. Making the maximum chunk size big means less allocation code called, but also more waste in unused characters and slower inserts / replaces.

Let’s consider implementation of the most frequently used StringBuilder in .NET 4.0.

The Append() Method

If the current list element has enough characters for insertion of a new string, the copying takes place. If there is not enough space, only the part that fits is copied. As for the part that does not fit, a new list element is created (an instance of StringBuilder), which array length equals either the length of the entire source string, or the length of the remaining string – it depends which is bigger. However, as I wrote before, the maximum array length is 8000.

In general, a formula for calculation of a new element length looks in the following way:

Where minBlockCharCount is a string length remained after copying of its part that fits the current instance.

Therefore, in result of the following code:

The array lengths of list elements will equal: 8000, 4092, 2048, 1024, 512, 256, 128, 64, 32, 16, 16.

With such array lengths, the operation of calling a certain character in source string is executed quite fast. Practically, it takes O(1), since the quantity of list elements is not high.

The Insert() Method

If the current list element has enough space for insertion, the existing characters shift for providing space for new text. Otherwise, the method creates a new list element and copies elements from the previous list there.

What happens in result of such code?

The result will drastically differ from the code that uses Append().

We will get a very big list of StringBuilders, and each list element will be of 16 characters long. In result, operation of accessing a certain character by index will be executed slower than expected. If will be proportional to the list length, that is O(n).

The Remove() Method

Implementation of this method has been essentially complicated. However, let’s not forget that the previous implementation copied large amount of characters shifting them to the left. Here, we need to make shift to the left within one element in the list.

The ToString() Method

This method goes thought the linked list and sequentially copies characters of each of list elements into the resulting string.

Comparing Productivity

The most interesting part is productivity comparison between two versions of the class.

Test 1.What amount of memory required for storing a string of defined length?

Test 1. Memory required for storing a string

As you can see, when a string is short, the old implementation is much better. It is because each list element requires the information about the length, capacity, and indent.

But when the string length exceeds 16384, the old implementation begins to loose (because of the string size duplication, it contains lots of unused characters).

Test 2. The Append() Method

Test 2. The Append() Method

This method is 1.8 times faster. There is no need to double the length of the string and copy characters into it in case of low memory.

Test 3. The Insert() Method

We will insert into a string with the length of 1000 characters.

1. Insertion into the start of the string

Test 3. The Insert() method. Beginning of a string

2. Insertion into the middle of a string

Test 3. The Insert() method. Middle of a string

3. Insertion into the end of a string

Test 3. The Insert() method. End of the string

As you can see, the new implementation lose.

Test 4. The Remove() Method

We will remove 10 characters from the string until it ends.

1. Deletion from the start of a string.

Test 4. The Remove() method. Beginning of a string

2. Deletion from the middle of a string.

Test 4. The Remove() method. Middle of a string

3. Deletion from the end of a string.

Test 4. The Remove() method. End of a string

The new implementation is ahead of the old one when it comes to deletion from any place, since we do not need to indent characters of the remaining string to the left (to be more exact, we need to, but not that frequent as before).

Test 5. The ToString() Method

The previous implementation returned a simple link to the string it operated (at the first call), while the new implementation has to gather the resulting string by parts avoiding each element of the linked list.

1. The string is generated with the Append() method.

Test 5. ToString() - Append()

2. The string is generated with the Insert() method.

Test 5. ToString() - Insert()

The new implementation works a way slower if a string is generated with the Insert() method, since the list will consist of many elements (SrtingBuilders) 16 characters long.

Test 6. Access by a specific index

Because StringBulder is now a linked list, accessing a string by a certain index becomes expensive. Especially, when string is generated with help of the Insert method.

1. String is generated with the Append() method.

Test 6. Append() method

2. String is generated with the Insert() method.

Test 6. Insert() method

Test 7. Common scenario: multiple call of Append(), and then call of ToString()

As a rule, we work with a given class, according to a certain scenario: multiple call of the Append() method followed by the call of ToString(). Implementation of this call has changed specifically with an eye to this scenario.

Test 7. Multiple Append()

Conclusion

The StringBuilder class in .NET 2.0 has been optimized to speed up the work of the ToString() method, whereas in .NET 4.0, it has been optimized to speed up the Append() method. The new implementation of the Append() method woks almost two times faster, while the Insert() and ToString() method work slower. But since we work with the class according to a certain scenario: multiple call of the Append() method followed by a single call of ToString(), the productivity boost takes place.

Taking into account the new implementation of the class wherein only multiple call of the Append() method leads to the productivity boost, now we could call the class StringAppender *)

Timur Guev

Timur Guev

Timur is an experienced C# developer. For last three years, Timur has been developing the KSS (Kaspesky Subscription Service) highload system with the C#, SQL Server,and Azure technologies. At loose hours, Timur is teaching mathematics.
Timur Guev

Latest posts by Timur Guev (see all)

Timur Guev

Timur is an experienced C# developer. For last three years, Timur has been developing the KSS (Kaspesky Subscription Service) highload system with the C#, SQL Server, and Azure technologies. At loose hours, Timur is teaching mathematics.