Synthetic Data Generation. Part 3: Backup and Restore


Implementing the general script for data sanitization and confidential data changes

We’ve examined simple examples for each type of data alteration in the previous articles (Part 1: Data Copying, Part 2: Data Changing):

  1. Changing the date and time;
  2. Changing a numerical value;
  3. Changing a byte sequence;
  4. Changing character data.

However, the examples described above don’t meet criteria 2 and 3 for data-altering scripts:

  • Criterion 2: the selectivity of values must not change significantly in the altered data. You can’t simply set a field to NULL; instead, you must ensure that identical values in the real data map to identical values in the altered data. E.g., if the real data contain the value 103785 twelve times in a field subject to change, the altered data must contain a different (changed) value that also appears twelve times in that field.
  • Criterion 3: the length and size of values must not change significantly in the altered data. E.g., you replace each character/byte with a pseudorandom character/byte.

Thus, we need a script that takes the selectivity of values in the table’s fields into account.

Let’s have a look at our recruiting service database. Personal data are present only in the candidates’ table [dbo].[Employee].

Assume that the table includes the following fields:

Picture 16. The fields of the candidates’ table [dbo].[Employee]

Descriptions:

  • FirstName – the first name, NVARCHAR(255) string
  • LastName – the last name, NVARCHAR(255) string
  • BirthDate – the date of birth, DATE
  • DocNumber – the ID-card number: the first two digits are the passport series, the next seven digits are the document number, with a hyphen between them; an NCHAR(10) string
  • DocDate – the issue date of the ID-card, DATE
  • CountRequest – the number of requests for this candidate during resume searches, INT
  • PaymentAmount – the resume promotion service fee received, DECIMAL(18,2)
  • RemoteAccessCertificate – the remote access certificate, VARBINARY byte sequence
  • Address – the residential or registration address, NVARCHAR(MAX) string

Then, to keep the initial selectivity, we need to implement the following algorithm:

  1. Extract all unique values for each field and keep the results in temporary tables or table variables;
  2. Generate a pseudorandom value for each unique value. The generated value must not differ significantly in length and size from the original. Save the results in the same place as the step 1 results, so that each original value maps to exactly one newly generated value;
  3. Replace all values in the table with the new values from step 2.

To begin, we depersonalize the candidates’ first and last names. We assume that both names are always present and are at least two characters long.

First, we select the unique names. Then we generate a pseudorandom string for each name: the length stays the same, the first character is uppercase, and the remaining characters are lowercase. We use the previously created [test].[GetRandString] scalar function to generate a pseudorandom string of a specific length according to the defined character criteria.

Then, we update the names in the candidates’ table according to their unique values, as sketched below. The same approach applies to last names.
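The following is a minimal sketch of this mapping for first names. The exact signature of [test].[GetRandString] comes from Part 2 and is assumed here as (length, lower character bound, upper character bound); adjust the calls to the real function.

DECLARE @tbl_FirstName TABLE (
    [OldName] NVARCHAR(255) NOT NULL PRIMARY KEY, -- original unique first name
    [NewName] NVARCHAR(255) NOT NULL              -- generated replacement
);

-- one pseudorandom replacement per unique first name: same length,
-- first character uppercase, the rest lowercase
-- (names are assumed to be at least two characters long)
INSERT INTO @tbl_FirstName ([OldName], [NewName])
SELECT [FirstName],
       UPPER([test].[GetRandString](1, N'a', N'z'))
       + [test].[GetRandString](LEN([FirstName]) - 1, N'a', N'z')
FROM (SELECT DISTINCT [FirstName] FROM [dbo].[Employee]) AS src;

-- replace every first name through the one-to-one mapping,
-- keeping the original selectivity
UPDATE emp
SET emp.[FirstName] = map.[NewName]
FROM [dbo].[Employee] AS emp
INNER JOIN @tbl_FirstName AS map ON map.[OldName] = emp.[FirstName];

A production script would also guard against two different originals receiving the same replacement; the sketch omits that check for brevity.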

Next, we depersonalize the DocNumber field, the ID-card (passport) number. The first two characters are the document series, the last seven digits are the document number, and a hyphen sits between them. Then we perform the sanitization.

We collect all unique document numbers and generate a pseudorandom string for each one, in the format ‘XX-XXXXXXX’, where X is a digit from 0 to 9. Here, we again use the [test].[GetRandString] scalar function to generate a pseudorandom string of the specified length according to the character parameters set.

After that, the [DocNumber] field is updated in the candidates’ table [dbo].[Employee].
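The generated value itself can look like this, under the same assumed [test].[GetRandString] signature (here bounded to digits); the unique-value collection and the UPDATE through the mapping table repeat the pattern shown for the names:

-- 'XX-XXXXXXX': two digits, a hyphen, seven digits
SELECT [test].[GetRandString](2, N'0', N'9') + N'-'
     + [test].[GetRandString](7, N'0', N'9') AS [NewDocNumber];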

We depersonalize the DocDate field (the ID-card issue date) and the BirthDate field (the candidate’s date of birth).

First, we select all unique pairs of “date of birth, ID-card issue date.” For each pair, we generate a pseudorandom date of birth, and the pseudorandom issue date is derived from it: the document’s issue date must not be earlier than the date of birth.

After that, these data are updated in the respective fields of the candidates’ table [dbo].[Employee].
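A minimal sketch of generating one such pair follows; the generation window and the 14-to-45-year issue offset are assumptions for illustration:

DECLARE @StartBirthDate DATE = '1950-01-01', -- assumed generation window
        @FinishBirthDate DATE = '2000-12-31';

DECLARE @rndBirthDate DATE, @rndDocDate DATE;

-- pseudorandom date of birth inside the window
SET @rndBirthDate = DATEADD(DAY,
    ABS(CHECKSUM(NEWID())) % (DATEDIFF(DAY, @StartBirthDate, @FinishBirthDate) + 1),
    @StartBirthDate);

-- issue date derived from the birth date, so it can never precede it
-- (assumed to fall between 14 and 45 years after birth)
SET @rndDocDate = DATEADD(DAY,
    ABS(CHECKSUM(NEWID())) % (DATEDIFF(DAY, DATEADD(YEAR, 14, @rndBirthDate),
                                       DATEADD(YEAR, 45, @rndBirthDate)) + 1),
    DATEADD(YEAR, 14, @rndBirthDate));

SELECT @rndBirthDate AS [BirthDate], @rndDocDate AS [DocDate];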

Then we update the remaining fields of the table.

The CountRequest value is the number of requests companies made for this candidate during resume searches.

PaymentAmount is the total resume promotion service fee paid. We process these numbers similarly to the previous fields.

Note that we generate a pseudorandom integer in the first case and a pseudorandom decimal in the second. In both cases, the pseudorandom number is drawn from the range between half and double the original value, so the selectivity of values in the fields does not change too much.

After that, the script writes the values into the fields of the candidates’ table [dbo].[Employee].
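A sketch of the unique-value mapping for [CountRequest] is below; the same pattern with DECIMAL(18,2) arithmetic applies to [PaymentAmount]:

DECLARE @tbl_CountRequest TABLE (
    [OldValue] INT NOT NULL PRIMARY KEY,
    [NewValue] INT NOT NULL
);

-- one pseudorandom replacement per unique value,
-- drawn from the range [old / 2 .. old * 2]
INSERT INTO @tbl_CountRequest ([OldValue], [NewValue])
SELECT [CountRequest],
       ([CountRequest] / 2)
       + ABS(CHECKSUM(NEWID())) % (([CountRequest] * 2) - ([CountRequest] / 2) + 1)
FROM (SELECT DISTINCT [CountRequest] FROM [dbo].[Employee]) AS src;

-- apply the mapping, preserving the original selectivity
UPDATE emp
SET emp.[CountRequest] = map.[NewValue]
FROM [dbo].[Employee] AS emp
INNER JOIN @tbl_CountRequest AS map ON map.[OldValue] = emp.[CountRequest];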

Next, we collect the unique values of the RemoteAccessCertificate field (the remote access certificate) and generate a pseudorandom byte sequence of the same length as the original for each value, using the previously created [test].[GetRandVarbinary] scalar function.

Then we write the result into the [RemoteAccessCertificate] field of the candidates’ table [dbo].[Employee].
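The per-value generation can look as follows, assuming [test].[GetRandVarbinary] takes the desired length in bytes as its single argument:

-- same mapping pattern: one same-length pseudorandom sequence per unique value
SELECT [RemoteAccessCertificate],
       [test].[GetRandVarbinary](DATALENGTH([RemoteAccessCertificate])) AS [NewValue]
FROM (SELECT DISTINCT [RemoteAccessCertificate] FROM [dbo].[Employee]) AS src;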

The last step is collecting the unique addresses from the [Address] field. For each value, we generate a pseudorandom string of the same length as the original. Note that if a value was NULL originally, it must remain NULL in the generated data: keeping NULL instead of replacing it with an empty string minimizes the mismatch in value selectivity between the production database and the altered data.

We use the previously created [test].[GetRandString] scalar function to generate a pseudorandom string of the specified length according to the defined character parameters.

The data are then written into the [Address] field of the candidates’ table [dbo].[Employee].
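A hedged fragment showing the NULL preservation inside the mapping (the [test].[GetRandString] signature is assumed, as before):

-- NULL stays NULL; any other address gets a same-length pseudorandom string
SELECT [Address],
       CASE WHEN [Address] IS NULL THEN NULL
            ELSE [test].[GetRandString](LEN([Address]), N'a', N'z')
       END AS [NewAddress]
FROM (SELECT DISTINCT [Address] FROM [dbo].[Employee]) AS src;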

This way, we get the full script for depersonalizing and altering the confidential data.

Finally, we get a database with altered personal and confidential data. The algorithms we used make it impossible to restore the original data from the altered data, and the values’ selectivity, length, and size are not changed significantly. So, all three criteria for the personal and secret data altering scripts are met.

We won’t review criterion 4 separately here. Our database keeps all the data subject to change in a single candidates’ table, [dbo].[Employee], so data conformity is required only within this table, and criterion 4 is satisfied. Still, remember that criterion 4 requires all interrelations to remain the same in the altered data.

Other conditions for personal and confidential data altering algorithms come up often, but we won’t review them here. The four criteria described above, however, are always present, and in many cases they are enough to judge whether an algorithm is suitable for use.

Now, we need to make a backup of the created database, check it by restoring it on another instance, and transfer that copy into the necessary environments for development and testing. For this, we examine full database backup creation.

Full database backup creation

We can make a database backup with the BACKUP DATABASE statement, as in the example below.

We make a compressed full backup of the database JobEmplDB_Test into the file E:\Backup\JobEmplDB_Test.bak with checksum calculation. Further, we check the backup created.
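Below is a sketch of the statement reconstructed from this description; the INIT and STATS options and the RESTORE VERIFYONLY check are assumptions added for illustration:

-- compressed full backup with checksum calculation
BACKUP DATABASE [JobEmplDB_Test]
TO DISK = N'E:\Backup\JobEmplDB_Test.bak'
WITH COMPRESSION, CHECKSUM, INIT, STATS = 5; -- INIT (overwrite old sets) is an assumption
GO

-- optional quick check of the backup media before the full restore test
RESTORE VERIFYONLY
FROM DISK = N'E:\Backup\JobEmplDB_Test.bak'
WITH CHECKSUM;
GO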

Then, we check the created backup by restoring the database from it. Let’s examine the database restore.

Restoring the database

You can restore the database with the RESTORE DATABASE statement in the following way:

USE [master]
RESTORE DATABASE [JobEmplDB_Test]
FROM  DISK = N'E:\Backup\JobEmplDB_Test.bak' WITH  FILE = 1,
MOVE N'JobEmplDB' TO N'E:\DBData\JobEmplDB.mdf',
MOVE N'JobEmplDB_log' TO N'F:\DBLog\JobEmplDB_log.ldf',
NOUNLOAD,
REPLACE,
STATS = 5;
GO

We restore the JobEmplDB_Test database from the backup E:\Backup\JobEmplDB_Test.bak. The files will be overwritten: the data file is moved to E:\DBData\JobEmplDB.mdf, and the transaction log file is moved to F:\DBLog\JobEmplDB_log.ldf.

After we successfully check that the database restores from the backup, we forward the backup to the development and testing environments, where it is restored again by the same method as described above.

We’ve examined the first approach to populating a database for testing and development: copying and altering the data from the production database. Now, we’ll examine the second approach – synthetic data generation.

Synthetic data generation

General algorithm

The sequence of actions is the following:

  1. Create a new empty database, or clear a previously created database by purging all its data.
  2. Create or update the schema in the new database so that it matches the schema of the production database.
  3. Copy or refresh the guides and regulatory references from the production database into the new database.
  4. Generate synthetic data in the necessary tables of the new database.
  5. Make a backup of the new database.
  6. Deliver and restore the backup in the necessary environment.

This is the general algorithm of synthetic data generation. Next, we’ll briefly describe how to implement it.

Implementing the general algorithm

Creating a new empty database

You can create an empty database with the CREATE DATABASE statement, as reconstructed after the description below. It creates the JobEmplDB_Test database with three filegroups:

  • PRIMARY – the primary (default) filegroup with two files: JobEmplDB_Test1 (path D:\DBData\JobEmplDB_Test1.mdf) and JobEmplDB_Test2 (path D:\DBData\JobEmplDB_Test2.ndf). The initial size of each file is 64 MB, with a growth step of 8 MB.
  • DBTableGroup – a user filegroup with two files: JobEmplDB_TestTableGroup1 (path D:\DBData\JobEmplDB_TestTableGroup1.ndf) and JobEmplDB_TestTableGroup2 (path D:\DBData\JobEmplDB_TestTableGroup2.ndf). The initial size of each file is 8 GB, with a growth step of 1 GB.
  • DBIndexGroup – a user filegroup with two files: JobEmplDB_TestIndexGroup1 (path D:\DBData\JobEmplDB_TestIndexGroup1.ndf) and JobEmplDB_TestIndexGroup2 (path D:\DBData\JobEmplDB_TestIndexGroup2.ndf). The initial size of each file is 16 GB, with a growth step of 1 GB.

The transaction log is a single file, JobEmplDB_Testlog (path E:\DBLog\JobEmplDB_Testlog.ldf), with an initial size of 8 GB and a growth step of 1 GB.
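A reconstruction of the statement from the description above; the file sizes, paths, and growth steps match the listing, while the exact layout is a sketch rather than the author’s original script:

CREATE DATABASE [JobEmplDB_Test]
ON PRIMARY
    (NAME = N'JobEmplDB_Test1', FILENAME = N'D:\DBData\JobEmplDB_Test1.mdf',
     SIZE = 64MB, FILEGROWTH = 8MB),
    (NAME = N'JobEmplDB_Test2', FILENAME = N'D:\DBData\JobEmplDB_Test2.ndf',
     SIZE = 64MB, FILEGROWTH = 8MB),
FILEGROUP [DBTableGroup]
    (NAME = N'JobEmplDB_TestTableGroup1', FILENAME = N'D:\DBData\JobEmplDB_TestTableGroup1.ndf',
     SIZE = 8GB, FILEGROWTH = 1GB),
    (NAME = N'JobEmplDB_TestTableGroup2', FILENAME = N'D:\DBData\JobEmplDB_TestTableGroup2.ndf',
     SIZE = 8GB, FILEGROWTH = 1GB),
FILEGROUP [DBIndexGroup]
    (NAME = N'JobEmplDB_TestIndexGroup1', FILENAME = N'D:\DBData\JobEmplDB_TestIndexGroup1.ndf',
     SIZE = 16GB, FILEGROWTH = 1GB),
    (NAME = N'JobEmplDB_TestIndexGroup2', FILENAME = N'D:\DBData\JobEmplDB_TestIndexGroup2.ndf',
     SIZE = 16GB, FILEGROWTH = 1GB)
LOG ON
    (NAME = N'JobEmplDB_Testlog', FILENAME = N'E:\DBLog\JobEmplDB_Testlog.ldf',
     SIZE = 8GB, FILEGROWTH = 1GB);
GO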

Clearing the database by purging the data

To clear the database of all its data, we need to do the following:

  1. Save the definitions of all foreign keys.
  2. Disable all constraints and triggers.
  3. Drop all foreign keys.
  4. Clear the tables with the TRUNCATE TABLE statement.
  5. Re-create the foreign keys dropped in step 3.
  6. Re-enable the constraints and triggers disabled in step 2.

You can save the definitions of all foreign keys with the following script:

1. The foreign keys’ definitions are saved in the table variable @tbl_create_FK.
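A simplified sketch of that script follows; it handles single-column foreign keys only and omits ON DELETE/ON UPDATE rules, so treat it as an illustration rather than the author’s full version:

DECLARE @tbl_create_FK TABLE ([Script] NVARCHAR(4000) NOT NULL);

-- build one ALTER TABLE ... ADD CONSTRAINT statement per foreign key
INSERT INTO @tbl_create_FK ([Script])
SELECT N'ALTER TABLE [' + SCHEMA_NAME(t.[schema_id]) + N'].[' + t.[name]
     + N'] ADD CONSTRAINT [' + fk.[name] + N'] FOREIGN KEY (['
     + COL_NAME(fkc.[parent_object_id], fkc.[parent_column_id])
     + N']) REFERENCES [' + SCHEMA_NAME(rt.[schema_id]) + N'].[' + rt.[name]
     + N'] ([' + COL_NAME(fkc.[referenced_object_id], fkc.[referenced_column_id]) + N']);'
FROM sys.foreign_keys AS fk
INNER JOIN sys.tables AS t ON t.[object_id] = fk.[parent_object_id]
INNER JOIN sys.tables AS rt ON rt.[object_id] = fk.[referenced_object_id]
INNER JOIN sys.foreign_key_columns AS fkc
    ON fkc.[constraint_object_id] = fk.[object_id];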

2. You can disable the constraints and triggers with the following script:

EXEC sp_MSforeachtable "ALTER TABLE ? NOCHECK CONSTRAINT ALL";
EXEC sp_MSforeachtable "ALTER TABLE ? DISABLE TRIGGER ALL";

To enable the constraints and triggers, you can use this script:

EXEC sp_MSforeachtable "ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL";
EXEC sp_MSforeachtable "ALTER TABLE ? ENABLE TRIGGER ALL";

Here, we use the stored procedure sp_MSforeachtable, which applies the given statement to all tables in the database.

3. To drop the foreign keys, use a dedicated script. It reads the foreign key information from the INFORMATION_SCHEMA.TABLE_CONSTRAINTS system view and drops the keys one by one through a cursor, building a dynamic T-SQL statement for each and passing it to the system stored procedure sp_executesql.
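A sketch of that cursor, mirroring the loop used below for restoring the keys:

DECLARE @tsql NVARCHAR(4000);

DECLARE FK_Drop CURSOR LOCAL FOR SELECT
    N'ALTER TABLE [' + [TABLE_SCHEMA] + N'].[' + [TABLE_NAME]
    + N'] DROP CONSTRAINT [' + [CONSTRAINT_NAME] + N'];'
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE [CONSTRAINT_TYPE] = 'FOREIGN KEY';

OPEN FK_Drop;

FETCH NEXT FROM FK_Drop INTO @tsql;

WHILE (@@fetch_status = 0)
BEGIN
    EXEC sp_executesql @tsql; -- run one generated DROP CONSTRAINT statement
    FETCH NEXT FROM FK_Drop INTO @tsql;
END

CLOSE FK_Drop;
DEALLOCATE FK_Drop;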

4. To clear the tables with the TRUNCATE TABLE statement, use another dedicated script. It works the same way, but it collects the list of tables and then clears them one by one through the cursor with TRUNCATE TABLE.
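The statement generation differs only in its SELECT; the generated statements run through the same cursor loop as above:

-- one TRUNCATE TABLE statement per user table
SELECT N'TRUNCATE TABLE [' + [TABLE_SCHEMA] + N'].[' + [TABLE_NAME] + N'];'
FROM INFORMATION_SCHEMA.TABLES
WHERE [TABLE_TYPE] = 'BASE TABLE';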

5. Restoring the foreign keys is possible with the script below (earlier, we saved the foreign keys’ definitions in the table variable @tbl_create_FK):

DECLARE @tsql NVARCHAR(4000);

DECLARE FK_Create CURSOR LOCAL FOR SELECT
	[Script]
FROM @tbl_create_FK;

OPEN FK_Create;

FETCH NEXT FROM FK_Create INTO @tsql;

WHILE (@@fetch_status = 0)
BEGIN
	EXEC sp_executesql @tsql;
	FETCH NEXT FROM FK_Create INTO @tsql;
END

CLOSE FK_Create;
DEALLOCATE FK_Create;

The script works in the same way as the two scripts mentioned above, but it restores the foreign keys’ definitions through the cursor, one per iteration.

A useful particular case of these purge scripts is printing the generated statements instead of, or in addition to, executing them. The scripts execute each generated statement through the following call:

EXEC sp_executesql @tsql;

Before this call, or instead of it, we output the generated statement so that it can be reviewed and then run manually, or through a dynamic T-SQL query via the system stored procedure sp_executesql.

That is, instead of the following code fragment in all the scripts:

WHILE (@@fetch_status = 0)
BEGIN
	EXEC sp_executesql @tsql;
...

We write:

WHILE (@@fetch_status = 0)
BEGIN
	PRINT @tsql;

	EXEC sp_executesql @tsql;
...

Or:

WHILE (@@fetch_status = 0)
BEGIN
	PRINT @tsql;
...

The first case both outputs the statements and executes them. The second case is for output only, which is helpful for debugging the scripts.

Thus, we get the general database cleaning script.

Copying the schema and the necessary data from the production database into the newly created database

To copy the schema and the necessary data from the production database into the newly created database, you can use Visual Studio (SSDT) or third-party schema comparison utilities.

Copying the guides and regulatory references from the production database into the newly created database

When you need to copy the guides and regulatory references from the production database into the newly created database, you can use T-SQL scripts.

Our example recruitment service database includes five reference tables:

  • [dbo].[Company] – companies
  • [dbo].[Skill] – skills
  • [dbo].[Position] – positions (occupation)
  • [dbo].[Project] – projects
  • [dbo].[ProjectSkill] – project-to-skill correlations

The “skills” table [dbo].[Skill] serves to show how to build a script that inserts data from the production database into the test database.

We form the following script:

-- REPLACE doubles any embedded quotes so each generated literal stays valid
SELECT 'SELECT '+CAST([SkillID] AS NVARCHAR(255))+' AS [SkillID], '+''''+REPLACE([SkillName],'''','''''')+''''+' AS [SkillName] UNION ALL'
  FROM [dbo].[Skill]
ORDER BY [SkillID];

We execute it on a copy of the production database, which is usually available in read-only mode – a replica of the production database.

The result is as follows:

Picture 17. Generated scripts for skills selection

Now, we wrap the result up into a data insertion script, as shown below. This gives us the population script for the skills reference table; scripts for the other reference tables are made in the same way.
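A hedged illustration of the wrapped result: the skill values are made up, the trailing UNION ALL produced by the generator on the last line must be removed, and the IDENTITY_INSERT lines apply only if [SkillID] is an IDENTITY column (an assumption):

SET IDENTITY_INSERT [dbo].[Skill] ON; -- only if [SkillID] is an IDENTITY column

INSERT INTO [dbo].[Skill] ([SkillID], [SkillName])
SELECT 1 AS [SkillID], 'C#' AS [SkillName] UNION ALL
SELECT 2 AS [SkillID], 'T-SQL' AS [SkillName];

SET IDENTITY_INSERT [dbo].[Skill] OFF;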

However, it is much easier to copy the reference data through data export and import in SSMS, or with the Data Import and Export Wizard.

Synthetic data generation

We have already set up pseudorandom value generation for strings, numbers, and byte sequences while implementing the data sanitization and confidential data altering algorithms for approach 1. The same functions and scripts are used for synthetic data generation.

The recruitment service database requires synthetic data in only two tables:

  • [dbo].[Employee] – candidates
  • [dbo].[JobHistory] – a candidate’s work history (experience), i.e., the resume itself

We can fill the candidates’ table [dbo].[Employee] with synthetic data using a dedicated script; a sketch follows after the parameter list below.

At the beginning of the script, we set the following parameters:

  • @count – the number of rows to generate
  • @LengthFirstName – the name’s length
  • @LengthLastName – the last name’s length
  • @StartBirthDate – the lower limit of the date for the date of birth
  • @FinishBirthDate – the upper limit of the date for the date of birth
  • @StartCountRequest – the lower limit for the field [CountRequest]
  • @FinishCountRequest – the upper limit for the field [CountRequest]
  • @StartPaymentAmount – the lower limit for the field [PaymentAmount]
  • @FinishPaymentAmount – the upper limit for the field [PaymentAmount]
  • @LengthRemoteAccessCertificate – the byte sequence’s length for the certificate
  • @LengthAddress – the length for the field [Address]
  • @count_get_unique_DocNumber – the number of attempts to generate the unique document’s number [DocNumber]

The script ensures the uniqueness of the [DocNumber] field’s values.
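A condensed sketch of such a script, covering only a few of the fields; the generator signatures and the loop structure are assumptions, and the [DocNumber] uniqueness retries (@count_get_unique_DocNumber) are omitted for brevity:

DECLARE @count INT = 10000,             -- rows to generate
        @LengthFirstName INT = 8,
        @LengthLastName INT = 10,
        @StartBirthDate DATE = '1950-01-01',
        @FinishBirthDate DATE = '2000-12-31',
        @StartCountRequest INT = 0,
        @FinishCountRequest INT = 2000;

DECLARE @i INT = 1;

WHILE (@i <= @count)
BEGIN
    -- remaining fields omitted; a real run must supply every NOT NULL column
    INSERT INTO [dbo].[Employee]
        ([FirstName], [LastName], [BirthDate], [CountRequest])
    SELECT
        UPPER([test].[GetRandString](1, N'a', N'z'))
            + [test].[GetRandString](@LengthFirstName - 1, N'a', N'z'),
        UPPER([test].[GetRandString](1, N'a', N'z'))
            + [test].[GetRandString](@LengthLastName - 1, N'a', N'z'),
        DATEADD(DAY, ABS(CHECKSUM(NEWID()))
            % (DATEDIFF(DAY, @StartBirthDate, @FinishBirthDate) + 1), @StartBirthDate),
        @StartCountRequest + ABS(CHECKSUM(NEWID()))
            % (@FinishCountRequest - @StartCountRequest + 1);

    SET @i += 1;
END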

Now, let’s fill the [dbo].[JobHistory] table with synthetic data as follows.

The start date of work [StartDate] is later than the issuing date of the candidate’s document [DocDate]. The end date of work [FinishDate] is later than the start date of work [StartDate].
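A hedged fragment showing how both orderings can be enforced per candidate; the key column name [EmployeeID] and the day-offset ranges are assumptions:

-- StartDate strictly after DocDate, FinishDate strictly after StartDate
SELECT emp.[EmployeeID], -- assumed key column
       x.[StartDate],
       DATEADD(DAY, 1 + ABS(CHECKSUM(NEWID())) % 1095, x.[StartDate]) AS [FinishDate]
FROM [dbo].[Employee] AS emp
CROSS APPLY (SELECT DATEADD(DAY, 1 + ABS(CHECKSUM(NEWID())) % 365,
                            emp.[DocDate]) AS [StartDate]) AS x;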

It is important to note that the current script is simplified: it does not expose parameters for configuring the selectivity of the generated data.

Making a full database backup

We can make a database backup with the BACKUP DATABASE statement, using the same script as shown earlier.

We create a compressed full backup of the JobEmplDB_Test database with checksum calculation into the file E:\Backup\JobEmplDB_Test.bak; the checksum also supports further verification of the backup.

Let’s check the backup by restoring the database from it. We’ll examine the database restore next.

Restoring the database

You can restore the database with the RESTORE DATABASE statement, as shown below:

USE [master]
RESTORE DATABASE [JobEmplDB_Test]
FROM  DISK = N'E:\Backup\JobEmplDB_Test.bak' WITH  FILE = 1,
MOVE N'JobEmplDB' TO N'E:\DBData\JobEmplDB.mdf',
MOVE N'JobEmplDB_log' TO N'F:\DBLog\JobEmplDB_log.ldf',
NOUNLOAD,
REPLACE,
STATS = 5;
GO

We restore the JobEmplDB_Test database from the backup E:\Backup\JobEmplDB_Test.bak. The files are overwritten: the data file is moved to E:\DBData\JobEmplDB.mdf, and the transaction log file is moved to F:\DBLog\JobEmplDB_log.ldf.

After successfully checking the database restore from the backup, we transfer the backup to the necessary environments. It is used for testing and development, with further deployment through a database restore, as described above.

This way, we’ve examined the second approach to filling the database with data for testing and development purposes: synthetic data generation.

Evgeniy Gribkov

Evgeniy is an MS SQL Server database analyst, developer, and administrator. He is involved in the development and testing of tools for SQL Server database management. Evgeniy also writes SQL Server-related articles.