Oracle Regular Expressions. Dangerous Range

An Oracle developer who often uses regular expressions in code sooner or later can face a phenomenon that is indeed mystical. Long-term searches for the root of the problem can lead to weight loss, appetite and provoke various kinds of psychosomatic disorders – all this can be prevented with the help of the regexp_replace function. It can have up to 6 arguments:

REGEXP_REPLACE (

  1. source_string,
  2. template,
  3. substituting_string,
  4. the start position of the match search with a template (default 1),
  5. a position of occurrence of the template in a source string (by default 0 equals all occurrences),
  6. modifier (so far it is a dark horse)

)

Returns the modified source_string in which all occurrences of the template are replaced by the value passed in the substituting_string parameter. Often a short version of the function is used, where the first 3 arguments are specified, which is enough to solve many problems. I will do the same. Suppose we need to mask all the string characters with asterisks in the ‘MASK: lower case’ string. To specify the range of lowercase characters, the ‘[a-z]‘ pattern should suit.

Expectation

Reality

If this event has not been reproduced in your database, then you are lucky so far. But more often you start digging in code, convert strings from one set of characters to another and eventually, a despair comes.

Defining a problem

The question arises – what is so special about letter ‘A’ that it has not been replaced because the rest of the uppercase characters were not supposed to be replaced as well. Maybe there are other correct letters except for this one. It is necessary to look at the entire alphabet of uppercase characters.

However

If the 6th argument of the function is not explicitly specified, for example, ‘i’ is case-insensitivity or ‘c’ is case-sensitivity when comparing a source string to a template, the regular expression uses the NLS_SORT parameter of the session/database by default. For example:

This parameter specifies the sorting method in ORDER BY. If we talk about sorting simple individual characters, then a certain binary number (NLSSORT-code) corresponds to each of them and the sorting actually takes place by the value of these numbers.

To illustrate this, let’s take the first and last few characters of the alphabet, both lowercase, and uppercase, and put them in a conditionally unordered table set and call it ABC. Then, let’s sort this set by the SYMBOL field and display its NLSSORT-code in the HEX format next to each symbol.

In the query, ORDER BY is specified for the SYMBOL field, but in fact, in the database, the sorting went by the values from the field NLS_CODE_HEX.

Now, go back to the range from the template and look at the table – what is vertical between the symbol ‘a’ (code 14000200) and ‘z’ (code 87000200)? Everything except the capital letter ‘A’. That’s all that has been replaced with an asterisk. And the code 14000100 of the letter ‘A’ is not included in the replacement range from 14000200 to 87000200.

Cure

Explicitly specify the case-sensitivity modifier

Some sources say that modifier ‘c’ is set by default, but we have just seen that this is not quite true. And if someone did not see it, then the NLS_SORT parameter of its session/database is most likely set to BINARY and the sorting is performed in correspondence with real codes of characters. Indeed, if you change the session parameter, the problem will be solved.
Tests were carried out in Oracle 12c.

Feel free to leave your comments and take care.

Sergey Averin

Sergey Averin

Sergey is a developer of Delphi applications for large databases and server logic for such applications. He mainly focuses on such databases as Oracle and Firebird. Has a deep knowledge of SQL and PL\SQL. Sometimes Sergey writes programming related articles.
Sergey Averin

Latest posts by Sergey Averin (see all)

Sergey Averin

Sergey is a developer of Delphi applications for large databases and server logic for such applications. He mainly focuses on such databases as Oracle and Firebird. Has a deep knowledge of SQL and PL\SQL. Sometimes Sergey writes programming related articles.