Synthetic Data Generation. Part 2: Data Changing

Total: 1 Average: 5

Character Data Change

Here, we take an example for the English and Russian alphabets, but you can do it for any other alphabet. The only condition is that its characters must be present in the NCHAR types.

We need to create a function that accepts the line, replaces every character with a pseudorandom character, and then puts the result together and returns it.

However, we need first to understand which characters we need. For that, we can execute the following script: 

DECLARE @tbl TABLE ([ValueInt] INT, [ValueNChar] NCHAR(1), [ValueChar] CHAR(1));

DECLARE @ind int=0;
DECLARE @count INT=65535;

WHILE(@count>=0)
BEGIN
	INSERT INTO @tbl ([ValueInt], [ValueNChar], [ValueChar])
	SELECT @ind, NCHAR(@ind), CHAR(@ind)
	
	SET @ind+=1;
	SET @count-=1;
END

SELECT *
INTO [test].[TblCharactersCode]
FROM @tbl;

We are making the [test].[TblCharacterCode] table that includes the following fields:

  • ValueInt – the numerical value of the character;
  • ValueNChar – the NCHAR-type character;
  • ValueChar – the CHAR-type character.

Let’s review the contents of this table. We need the following request:

SELECT [ValueInt]
      ,[ValueNChar]
      ,[ValueChar]
  FROM [test].[TblCharactersCode];

The numbers are in the range of 48 to 57:

Numbers
Picture 1. Numbers

The Latin characters in the uppercase are in the range of 65 to 90:

The Latin characters in the uppercase are in the range of 65 to 90
Picture 2. Latin characters in the uppercase

Latin characters in the lower care are in the range of 97 to 122:

Latin characters in the lower case
Picture 3. Latin characters in the lower case

Russian characters in the uppercase are in the range of 1040 to 1071:

Russian characters in the uppercase
Picture 4. Russian characters in the uppercase

Russian characters in the lower case are in the range of 1072 to 1103:

Russian characters in the lower case
Picture 5. Russian characters in the lower case

Also, you might need additional characters like those in the range of 33 to 47:

Additional characters like those in the range of 33 to 47
Picture 6. Additional characters – 1

And, characters in the range of 58 to 64:

Additional characters in the range of 58 to 64
Picture 7. Additional characters – 2

We select the necessary characters and put them into the [test].[SelectCharactersCode] table in the following way:

SELECT
	[ValueInt]
   ,[ValueNChar]
   ,[ValueChar]
   ,CASE
		WHEN ([ValueInt] BETWEEN 48 AND 57) THEN 1
		ELSE 0
	END AS [IsNumeral]
   ,CASE
		WHEN (([ValueInt] BETWEEN 65 AND 90) OR
			([ValueInt] BETWEEN 1040 AND 1071)) THEN 1
		ELSE 0
	END AS [IsUpperCase]
   ,CASE
		WHEN (([ValueInt] BETWEEN 65 AND 90) OR
			([ValueInt] BETWEEN 97 AND 122)) THEN 1
		ELSE 0
	END AS [IsLatin]
   ,CASE
		WHEN (([ValueInt] BETWEEN 1040 AND 1071) OR
			([ValueInt] BETWEEN 1072 AND 1103)) THEN 1
		ELSE 0
	END AS [IsRus]
   ,CASE
		WHEN (([ValueInt] BETWEEN 33 AND 47) OR
			([ValueInt] BETWEEN 58 AND 64)) THEN 1
		ELSE 0
	END AS [IsExtra]
INTO [test].[SelectCharactersCode]
FROM [test].[TblCharactersCode]
WHERE ([ValueInt] BETWEEN 48 AND 57)
OR ([ValueInt] BETWEEN 65 AND 90)
OR ([ValueInt] BETWEEN 97 AND 122)
OR ([ValueInt] BETWEEN 1040 AND 1071)
OR ([ValueInt] BETWEEN 1072 AND 1103)
OR ([ValueInt] BETWEEN 33 AND 47)
OR ([ValueInt] BETWEEN 58 AND 64);

Now, let’s examine the contents of this table using the following script:

SELECT [ValueInt]
      ,[ValueNChar]
      ,[ValueChar]
      ,[IsNumeral]
      ,[IsUpperCase]
      ,[IsLatin]
      ,[IsRus]
      ,[IsExtra]
  FROM [test].[SelectCharactersCode];

We receive the following result:

Selected characters
Picture 8. Selected characters

This way, we have the [test].[SelectCharactersCode] table, where:

  • ValueInt – the numerical value of the character
  • ValueNChar – the NCHAR-type character
  • ValueChar – the CHAR-type character
  • IsNumeral – the criterion of a character being a digit
  • IsUpperCase – the criterion of a character in the uppercase
  • IsLatin – the criterion of a character being a Latin character;
  • IsRus – the criterion of a character being a Russian character
  • IsExtra – the criterion of a character being an additional character

Now, we can get the code for the necessary characters’ insertion. For instance, this is how to do it for the Latin characters in the lower case:

SELECT 'SELECT '+CAST([ValueInt] AS NVARCHAR(255))+' AS [ValueInt], '+''''+[ValueNChar]+''''+' AS [ValueNChar], '+''''+[ValueChar]+''''+' AS [ValueChar] UNION ALL'
  FROM [test].[SelectCharactersCode]
  WHERE [IsUpperCase]=0
    AND [IsLatin]=1;

We receive the following result:

Latin characters for insertion
Picture 9. Latin characters for insertion

It is the same for the Russian characters in the lower case:

SELECT 'SELECT '+CAST([ValueInt] AS NVARCHAR(255))+' AS [ValueInt], '+''''+[ValueNChar]+''''+' AS [ValueNChar], '+COALESCE(''''+[ValueChar]+'''', 'NULL')+' AS [ValueChar] UNION ALL'
  FROM [test].[SelectCharactersCode]
  WHERE [IsUpperCase]=0
    AND [IsRus]=1;

We get the following result:

Russian characters for insertion
Picture 10. Russian characters for insertion

It is the same for the characters:

SELECT 'SELECT '+CAST([ValueInt] AS NVARCHAR(255))+' AS [ValueInt], '+''''+[ValueNChar]+''''+' AS [ValueNChar], '+''''+[ValueChar]+''''+' AS [ValueChar] UNION ALL'
  FROM [test].[SelectCharactersCode]
  WHERE [IsNumeral]=1;

The result is as follows:

Digits for insertion
Picture 11. Digits for insertion

So, we have codes to insert the following data separately: 

  • The Latin characters in the lower case.
  • The Russian characters in the lower case.
  • The digits.

It works for both the NCHAR and CHAR types.

Similarly, we can prepare an insertion script for any set of characters. Besides, each set will get its own tabulation function.

To be simple, we implement the common tabulation function that will return the necessary dataset for the previously selected data in the following way:

SELECT
	'SELECT ' + CAST([ValueInt] AS NVARCHAR(255)) + ' AS [ValueInt], '
	+ '''' + [ValueNChar] + '''' + ' AS [ValueNChar], '
	+ COALESCE('''' + [ValueChar] + '''', ‘NULL’) + ' AS [ValueChar], '
	+ CAST([IsNumeral] AS NCHAR(1)) + ' AS [IsNumeral], ' +
	+CAST([IsUpperCase] AS NCHAR(1)) + ' AS [IsUpperCase], ' +
	+CAST([IsLatin] AS NCHAR(1)) + ' AS [IsLatin], ' +
	+CAST([IsRus] AS NCHAR(1)) + ' AS [IsRus], ' +
	+CAST([IsExtra] AS NCHAR(1)) + ' AS [IsExtra]' +
	+' UNION ALL'
FROM [test].[SelectCharactersCode];

The final result is as follows:

Script for the characters’ selection
Picture 12. Script for the characters’ selection

The ready script is wrapped up in the tabulation function [test].[GetSelectCharacters].

It is important to remove an extra UNION ALL in the end of the script generated, and in the [ValueInt]=39, we need to change ”’ to  ””:

SELECT 39 AS [ValueInt], '''' AS [ValueNChar], '''' AS [ValueChar], 0 AS [IsNumeral], 0 AS [IsUpperCase], 0 AS [IsLatin], 0 AS [IsRus], 1 AS [IsExtra] UNION ALL

This tabulation function returns the following set of fields:

  • Num – the line number in the returned set of data;
  • ValueInt – the numerical value of the character;
  • ValueNChar – the NCHAR-type character;
  • ValueChar – the CHAR-type character;
  • IsNumeral – the criterion of the character being a digit;
  • IsUpperCase – the criterion defining that the character is in the uppercase;
  • IsLatin – the criterion defining that the character is a Latin character;
  • IsRus – the criterion defining that the character is a Russian character;
  • IsExtra – the criterion defining that the character is an extra one.

For the input, you have the following parameters:

  1. @IsNumeral – if it should return the numbers;
  2. @IsUpperCase:
    • 0 – it must return only the lower case for letters;
    • 1 – it must return only the uppercase letters;
    • NULL – it must return letters in all cases.
  3. @IsLatin – it must return the Latin characters
  4. @IsRus – it must return the Russian characters
  5. @IsExtra – it must return additional characters.

All flags are used according to the logical OR. E.g., if you need to have digits and Latin characters in the lower case returned, you call the tabulation function in the following way:

declare
@IsNumeral BIT=1
   ,@IsUpperCase BIT=0
   ,@IsLatin BIT=1
   ,@IsRus BIT=0
   ,@IsExtra BIT=0;

SELECT *
FROM [test].[GetSelectCharacters](@IsNumeral, @IsUpperCase, @IsLatin, @IsRus, @IsExtra);

We get the following result:

Digits and Latin letters in the lower case
Picture 13. Digits and Latin letters in the lower case

We implement the [test].[GetRandString] function that will replace the line with pseudorandom characters, keeping the initial string length. This function must include the possibility of operating those characters only that are digits. E.g., it can be useful when you alter the ID-card’s series and number. 

When we implement the [test].[GetRandString] function, we first get the set of characters necessary to generate a pseudorandom line of the specified length in the input parameter @Length. The rest of the parameters work as described above.

Then, we put the received set of data into the tabulation variable @tbl. This table saves the fields [ID] – the order number in the resulting table of characters, and [Value] – the character’s presentation in the NCHAR type.

After that, in a cycle, it generates a pseudorandom number in the range of 1 to the cardinality of the @tbl characters received earlier. We put this number into the [ID] of the tabulation variable @tbl for search. When the search returns the line, we take the [Value] character and “glue” it to the resulting line @res.

When the cycle’s work ends, the line received gets back via the @res variable.

You can change both the first and last names of the candidate in the following way:

UPDATE [dbo].[Employee]
SET [FirstName] = [test].[GetRandString](LEN([FirstName])),
        [LastName] = [test].[GetRandString](LEN([LastName]));

Thus, we’ve examined the function’s implementation and its usage for the NCHAR and NVARCHAR types. We can do the same easily for the CHAR and VARCHAR types.

Sometimes, however, we need to generate a line according to the characters set, not the alphabetical characters or numbers. In this way, we first need to use the following multi-operator function [test].[GetListCharacters].

The [test].[GetListCharacters] function gets the two following parameters for the input:

  1. @str – the line of characters itself;
  2. @IsGroupUnique – it defines if it needs to group unique characters in the line.

With the recursive CTE, the input line @str is transformed into the table of characters – @ListCharacters. That table contains the following fields: 

  1. ID – the order number of the line in the resulting table of characters;
  2. Character – the presentation of the character in NCHAR(1)
  3. Count – the number of the character’s repetitions in the line (it is always 1 if the parameter @IsGroupUnique=0)

Let’s take two examples of this function’s usage to understand its work better:

  • Transformation of the line into the list of non-unique characters:
SELECT *
FROM [test].[GetListCharacters]('123456888  789 0000', 0);

We get the result:

The list of non-unique characters
Picture 14. The list of non-unique characters

This example shows that the line is transformed into the list of characters “as is,” without grouping it by the characters’ uniqueness (the [Count] field always contains 1).

  • The transformation of the line into the list of unique characters
SELECT *
FROM [test].[GetListCharacters]('123456888  789 0000', 1);

The result is as follows:

The list of unique characters
Picture 15. The list of unique characters

This example shows that the line is transformed into the list of characters grouped by their uniqueness. The [Count] field displays the number of findings of each character in the input line.

Basing on the multi-operator function [test].[GetListCharacters], we create a scalar function [test].[GetRandString2].

The definition of the new scalar function shows its alikeness to the [test].[GetRandString] scalar function. The only difference is that it uses the [test].[GetListCharacters] multi-operator function instead of the [test].[GetSelectCharacters] tabulation function.

Here, let’s review two examples of the implemented scalar function usage:

We generate a pseudorandom line of 12 characters length from the input line of characters not grouped by uniqueness:

SELECT [test].[GetRandString2](12, '123456789!!!!!!!!0!!!', DEFAULT);

The result is:

64017!!5!!!7

The keyword is DEFAULT. It states that the default value sets the parameter. Here, it is zero (0).

Or

We generate a pseudorandom line in 12 characters length from the input line of characters grouped by uniqueness:

SELECT [test].[GetRandString2](12, '123456789!!!!!!!!0!!!', 1);

The result is the following:

35792!428273

Read Also

Synthetic Data Generation. Part 1: Data Copying

Synthetic Data Generation. Part 3: Backup and Restore

Evgeniy Gribkov

Evgeniy Gribkov

Evgeniy is a MS SQL Server database analyst, developer and administrator. He is involved in development and testing of tools for SQL Server database management. Evgeniy also writes SQL Server-related articles.