Introduction
The string is one of the fundamental data types, along with the numeric (int, long, double) and logical (Boolean) types. It is hard to imagine a useful program that does not use it.
On the .NET platform, the string type is represented by the immutable String class. It is also tightly integrated into the CLR and directly supported by the C# compiler.
This article is devoted to concatenation, an operation performed on strings as often as addition is performed on numbers. You may think: “What is there to say? We all know about the string operator “+”.” As it turns out, though, it has its own quirks.
Language specification for string operator “+”
The C# language specification provides three overloads of the string operator “+”:
string operator + (string x, string y);
string operator + (string x, object y);
string operator + (object x, string y);
If one of the operands of a string concatenation is null, an empty string is substituted. Otherwise, any operand that is not a string is converted to its string representation by calling the virtual ToString method. If ToString returns null, an empty string is substituted. Note that, according to the specification, this operation never returns null.
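The null-handling rules above can be checked with a small sketch. The NullToString class and the Describe helper are hypothetical names introduced only for this illustration; the behavior shown follows from the specification text quoted above.

```csharp
using System;

// Hypothetical type whose ToString() returns null; it exists only for this demo.
class NullToString
{
    public override string ToString() => null;
}

static class NullOperandDemo
{
    // Resolves to: string operator + (string x, object y)
    public static string Describe(string left, object right)
    {
        return left + right;
    }

    static void Main()
    {
        string missing = null;

        // A null operand contributes an empty string.
        Console.WriteLine("[" + Describe("abc", null) + "]");                 // [abc]

        // A null string plus an object whose ToString() returns null gives "".
        Console.WriteLine("[" + Describe(missing, new NullToString()) + "]"); // []

        // Even with two null operands the result is "" and never null.
        Console.WriteLine(Describe(missing, null) == string.Empty);           // True
    }
}
```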
The description of the operator is clear enough. However, if we look at the implementation of the String class, we find explicit definitions of only two operators: “==” and “!=”. A reasonable question arises: what happens behind the scenes of string concatenation? How does the compiler handle the string operator “+”?
The answer turns out to be fairly simple. Let’s take a closer look at the static String.Concat method. It concatenates one or more String instances, or the String representations of one or more Object instances. The method has the following overloads:
public static String Concat (String str0, String str1)
public static String Concat (String str0, String str1, String str2)
public static String Concat (String str0, String str1, String str2, String str3)
public static String Concat (params String[] values)
public static String Concat (IEnumerable <String> values)
public static String Concat (Object arg0)
public static String Concat (Object arg0, Object arg1)
public static String Concat (Object arg0, Object arg1, Object arg2)
public static String Concat (Object arg0, Object arg1, Object arg2, Object arg3, __arglist)
public static String Concat <T> (IEnumerable <T> values)
Details
Suppose we have the expression s = a + b, where a and b are strings. The compiler converts it into a call to the static Concat method, i.e.
s = string.Concat (a, b)
The string concatenation operation, like any other addition operation in the C# language, is left-associative.
Everything is clear with two strings, but what if there are more? Given the left-associativity of the operation, the expression s = a + b + c could be replaced by:
s = string.Concat(string.Concat (a, b), c)
However, because there is an overload that takes three arguments, it is converted into:
s = string.Concat (a, b, c)
The situation is similar for the concatenation of four strings. To concatenate five or more strings, the compiler uses the string.Concat (params string[]) overload, so the overhead of allocating memory for the array must be taken into account.
It should also be noted that string concatenation is fully associative: the way we group the strings does not matter. So the expression s = a + (b + c), despite the explicit grouping, is processed as follows:
s = (a + b) + c = string.Concat (a, b, c)
instead of the expected:
s = string.Concat (a, string.Concat (b, c))
To sum up: the string concatenation operation is always evaluated from left to right and is compiled into a call to the static String.Concat method.
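As a quick check of the summary above, the sketch below compares the operator form, the explicitly grouped form, and a direct string.Concat call. The class and method names are hypothetical. The code can only show that the results agree; to see which Concat overload the compiler actually emits, the compiled assembly has to be inspected with a decompiler or IL viewer.

```csharp
using System;

static class ConcatEquivalenceDemo
{
    // Left-associative form: a + b + c
    public static string WithOperator(string a, string b, string c) => a + b + c;

    // Explicit grouping: a + (b + c)
    public static string Grouped(string a, string b, string c) => a + (b + c);

    // Direct call to the three-argument overload
    public static string WithConcat(string a, string b, string c) => string.Concat(a, b, c);

    static void Main()
    {
        Console.WriteLine(WithOperator("x", "y", "z")); // xyz
        Console.WriteLine(Grouped("x", "y", "z"));      // xyz
        Console.WriteLine(WithConcat("x", "y", "z"));   // xyz
    }
}
```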
Compiler optimizations for string literals
The C# compiler has optimizations for string literals. For example, the expression s = "a" + "b" + c, which given the left-associativity of the “+” operator is equivalent to s = ("a" + "b") + c, is converted to
s = string.Concat ("ab", c)
The expression s = c + "a" + "b", despite the left-associativity of concatenation (s = (c + "a") + "b"), is converted to
s = string.Concat (c, "ab")
In general, the position of the literals does not matter: the compiler concatenates everything it can and only then selects an appropriate overload of the Concat method. The expression s = a + "b" + "c" + d is converted to
s = string.Concat (a, "bc", d)
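One observable consequence of this literal folding is that a concatenation of literals is itself a compile-time constant, so it can initialize a const field. The sketch below relies on that; the names LiteralFoldingDemo, Folded and Mixed are hypothetical, and the merging of adjacent literals inside Mixed is the behavior described above rather than something the code can verify on its own.

```csharp
using System;

static class LiteralFoldingDemo
{
    // This declaration compiles only because "a" + "b" is a constant
    // expression that the compiler evaluates to "ab" at compile time.
    public const string Folded = "a" + "b";

    // Per the article, the adjacent literals "b" and "c" are merged
    // before an overload of string.Concat is selected.
    public static string Mixed(string a, string d) => a + "b" + "c" + d;

    static void Main()
    {
        Console.WriteLine(Folded);          // ab
        Console.WriteLine(Mixed("a", "d")); // abcd
    }
}
```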
Optimizations for empty and null strings should also be mentioned. The compiler knows that adding an empty string does not affect the result of concatenation, so the expression s = a + "" + b is converted to
s = string.Concat (a, b)
instead of the expected
s = string.Concat (a, "", b)
Similarly, with a const string whose value is null, the code
const string nullStr = null;
s = a + nullStr + b;
is converted to
s = string.Concat (a, b)
The expression s = a + nullStr is converted to s = a ?? "" if a is a string, and to a call of string.Concat (a) if a is not a string. For example, s = 17 + nullStr is converted to s = string.Concat (17).
There is an interesting effect that comes from combining the literal optimizations with the left-associativity of the string operator “+”.
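The effect of a null constant on the result can be observed directly. In this sketch the class and method names are hypothetical; only the resulting values are checked, not the exact code the compiler emits.

```csharp
using System;

static class NullConstantDemo
{
    const string NullStr = null;

    // Per the article, this is compiled as: a ?? ""
    public static string AppendNull(string a) => a + NullStr;

    // A non-string operand is still converted to its string form.
    public static string AppendNullTo(int n) => n + NullStr;

    // The null constant in the middle does not change the result.
    public static string Between(string a, string b) => a + NullStr + b;

    static void Main()
    {
        Console.WriteLine(AppendNull(null) == string.Empty); // True
        Console.WriteLine(AppendNull("abc"));                // abc
        Console.WriteLine(AppendNullTo(17));                 // 17
        Console.WriteLine(Between("a", "b"));                // ab
    }
}
```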
Let’s consider the expression:
var s1 = 17 + 17 + "abc";
Taking left-associativity into account, it is equivalent to
var s1 = (17 + 17) + "abc"; // calls string.Concat (34, "abc")
As a result, the numbers are added at compile time, and the result is 34abc.
On the other hand, the expression
var s2 = "abc" + 17 + 17;
is equivalent to
var s2 = ( "abc" + 17) + 17; // calling the string.Concat method ("abc", 17, 17)
The result is abc1717.
So the same concatenation operator leads to different results depending on the order of the operands.
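The two cases can be run side by side. The third method, with explicit parentheses, is an addition not covered in the text above: it shows one way to get numeric addition even when the string comes first. All names in the sketch are hypothetical.

```csharp
using System;

static class OperandOrderDemo
{
    // (17 + 17) is integer addition; the sum is then concatenated.
    public static string NumbersFirst() => 17 + 17 + "abc";

    // ("abc" + 17) is already a string, so the second 17 is appended as text.
    public static string StringFirst() => "abc" + 17 + 17;

    // Parentheses force the integer addition to happen first.
    public static string StringFirstGrouped() => "abc" + (17 + 17);

    static void Main()
    {
        Console.WriteLine(NumbersFirst());       // 34abc
        Console.WriteLine(StringFirst());        // abc1717
        Console.WriteLine(StringFirstGrouped()); // abc34
    }
}
```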
String.Concat vs StringBuilder.Append
A few words about this comparison are in order. Consider the following code:
string name = "Timur";
string surname = "Guev";
string patronymic = "Ahsarbecovich";
string fio = surname + name + patronymic;
It can be replaced with the code using StringBuilder:
var sb = new StringBuilder ();
sb.Append (surname);
sb.Append (name);
sb.Append (patronymic);
string fio = sb.ToString ();
However, in this case we will hardly benefit from using StringBuilder. Besides being less readable, the code has also become less efficient: the Concat implementation calculates the length of the resulting string and allocates memory only once, whereas StringBuilder knows nothing about the length of the resulting string in advance.
Implementation of the Concat method for 3 strings:
public static string Concat (string str0, string str1, string str2)
{
    if (str0 == null && str1 == null && str2 == null)
        return string.Empty;
    if (str0 == null)
        str0 = string.Empty;
    if (str1 == null)
        str1 = string.Empty;
    if (str2 == null)
        str2 = string.Empty;
    // Allocate memory for strings
    string dest = string.FastAllocateString (str0.Length + str1.Length + str2.Length);
    string.FillStringChecked (dest, 0, str0);
    string.FillStringChecked (dest, str0.Length, str1);
    string.FillStringChecked (dest, str0.Length + str1.Length, str2);
    return dest;
}
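FastAllocateString and FillStringChecked are internal to the framework, so the listing above cannot be compiled in user code. The sketch below imitates the same "compute the total length first, then copy each part to its offset" idea using only public APIs. It is an approximation, not the framework's implementation: it allocates a temporary char buffer in addition to the final string, and the names ConcatSketch and Concat3 are hypothetical.

```csharp
using System;

static class ConcatSketch
{
    public static string Concat3(string str0, string str1, string str2)
    {
        // Treat null operands as empty strings, as the real method does.
        str0 = str0 ?? string.Empty;
        str1 = str1 ?? string.Empty;
        str2 = str2 ?? string.Empty;

        // The total length is known before anything is copied.
        int total = str0.Length + str1.Length + str2.Length;
        if (total == 0)
            return string.Empty;

        // One buffer of exactly the right size; each part goes to its offset.
        var buffer = new char[total];
        str0.CopyTo(0, buffer, 0, str0.Length);
        str1.CopyTo(0, buffer, str0.Length, str1.Length);
        str2.CopyTo(0, buffer, str0.Length + str1.Length, str2.Length);

        // Unlike FastAllocateString, this copies the buffer into a new string.
        return new string(buffer);
    }

    static void Main()
    {
        Console.WriteLine(Concat3("Guev", "Timur", "Ahsarbecovich")); // GuevTimurAhsarbecovich
        Console.WriteLine(Concat3(null, "x", null));                  // x
    }
}
```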
Operator “+” in Java
A few words about the string operator “+” in Java. Although I do not program in Java, I am interested in how it works there. The Java compiler optimizes the “+” operator so that it uses the StringBuilder class and calls its append method.
The previous code is converted to
String fio = new StringBuilder(String.valueOf(surname)).append(name).append(patronymic).toString();
It is worth noting that the C# team intentionally decided against this kind of optimization; Eric Lippert has a post on this topic. The point is that it is not really an optimization but a rewriting of the code. Besides, the creators of the C# language believe that developers should be familiar with how the String class behaves and switch to StringBuilder when necessary.
By the way, Eric Lippert was the one who worked on optimization of the C# compiler associated with the concatenation of strings.
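A typical case where switching to StringBuilder is worthwhile is a loop, where the number of pieces is not known at compile time and the compiler cannot fold everything into a single Concat call. The sketch below contrasts the two approaches; the class and method names are hypothetical, and no timing figures are claimed.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class LoopConcatDemo
{
    // Each iteration calls string.Concat (result, part) and produces
    // a new string that copies everything accumulated so far.
    public static string JoinWithOperator(IEnumerable<string> parts)
    {
        string result = string.Empty;
        foreach (var part in parts)
            result += part;
        return result;
    }

    // StringBuilder appends into an internal buffer and creates
    // the final string once, in ToString().
    public static string JoinWithBuilder(IEnumerable<string> parts)
    {
        var sb = new StringBuilder();
        foreach (var part in parts)
            sb.Append(part);
        return sb.ToString();
    }

    static void Main()
    {
        var parts = new[] { "Guev", "Timur", "Ahsarbecovich" };
        Console.WriteLine(JoinWithOperator(parts)); // GuevTimurAhsarbecovich
        Console.WriteLine(JoinWithBuilder(parts));  // GuevTimurAhsarbecovich
    }
}
```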
Conclusion
At first glance, it may seem strange that the String class does not define the “+” operator, until we consider how much the compiler can optimize when it sees a larger code fragment. For example, if the “+” operator were defined in the String class, the expression s = a + b + c + d would create two intermediate strings, whereas a single call of string.Concat (a, b, c, d) performs the concatenation more efficiently.
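The cost of a user-defined “+” can be made visible with a hypothetical wrapper type. Text below is not part of .NET; it is an assumption made purely to count how many objects a chained a + b + c + d produces when the operator is an ordinary binary method.

```csharp
using System;

// Hypothetical string wrapper with a user-defined binary operator +.
sealed class Text
{
    public static int Created; // counts every Text instance ever constructed

    public string Value { get; }

    public Text(string value)
    {
        Value = value;
        Created++;
    }

    // An ordinary binary operator: it only ever sees two operands at a time.
    public static Text operator +(Text x, Text y) => new Text(x.Value + y.Value);
}

static class OperatorCostDemo
{
    public static int CountResultsOfFourWaySum()
    {
        var a = new Text("a");
        var b = new Text("b");
        var c = new Text("c");
        var d = new Text("d");

        int before = Text.Created;
        Text s = a + b + c + d; // ((a + b) + c) + d
        return Text.Created - before;
    }

    static void Main()
    {
        // Two intermediate objects ("ab", "abc") plus the final "abcd".
        Console.WriteLine(CountResultsOfFourWaySum()); // 3
    }
}
```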
Tags: .net, c# Last modified: September 23, 2021
You mention that C# intentionally didn’t contain the same StringBuilder “optimization” as Java, but looking at the code, it seems as though the StringBuilder version would be slower. Depending on the size of the strings, it could cause up to 3 allocations as the strings are appended, instead of the single allocation that the C# Concat method would use (as you mentioned earlier).
Is there any way that the StringBuilder version would be quicker?
Such optimization in C# is useless because the Concat method provides overloads that (once) allocate a sufficient amount of memory to concatenate all strings.
Java provides only one concat method:
public String concat(String str) {
    int otherLen = str.length();
    if (otherLen == 0) {
        return this;
    }
    char buf[] = new char[count + otherLen];
    getChars(0, count, buf, 0);
    str.getChars(0, otherLen, buf, count);
    return new String(0, count + otherLen, buf);
}
This method can concatenate only two strings, so the following code is inefficient: memory is allocated four times.
String s = "123".concat("456").concat("789").concat("101112").concat("1314151617");
The following code:
String a = "123" + "456" + "789" + "101112" + "1314151617";
Will be converted to
a = new StringBuilder()
    .append("123")
    .append("456")
    .append("789")
    .append("101112")
    .append("1314151617")
    .toString();
Thanks for the explanation. It makes sense, but I think you could have worded the article better.
When stating that the C# team explicitly kept from using the optimization, it sounded like they were giving up a useful optimization, as they wanted developers to understand and do it themselves, when in reality they just had a better optimization.
Thanks for the great article.