Synthetic Data Generation. Part 1: Data Copying

Introduction

Sooner or later, any information system gets a database, and often more than one. Over time, that database accumulates a lot of data, from several GBs to dozens of TBs. To understand how the system’s functionality will behave as data volumes grow, we need to generate data to fill that database.

All scripts presented here run against the JobEmplDB database of a recruiting service. The database implementation is available here

Approaches to filling databases with data for testing and development

Database development and testing involve two primary approaches to filling a database with data:

  1. Copy the entire database from the production environment and change the personal and other sensitive data. This way, you keep realistic data while removing the confidential parts.
  2. Generate synthetic data. This means generating test data that resemble the real data in appearance, properties, and interrelations.

The advantage of Approach 1 is that the data and their distribution by various criteria stay close to the production database. It allows us to analyze everything precisely and, therefore, to draw conclusions and make forecasts accordingly.

However, this approach does not let you grow the database many times over, which makes it hard to predict how the entire information system’s functionality will behave in the future.

On the other hand, you can analyze the sanitized, depersonalized data taken from the production database. Based on it, you can define how to generate test data that resemble the real data in appearance, properties, and interrelations. In this way, Approach 1 leads to Approach 2.

Now, let’s review both approaches to filling databases for testing and development in detail.

Data copying and altering in a production database

First, let’s define the general algorithm of copying and changing the data from the production environment.

The general algorithm

The general algorithm is as follows:

  1. Create a new empty database.
  2. Create a schema in that newly created database – the same schema as in the production database.
  3. Copy the necessary data from the production database into the newly created database.
  4. Sanitize and change the secret data in the new database.
  5. Make a backup of the newly created database.
  6. Deliver and restore the backup in the necessary environment.

However, the algorithm becomes more complicated after step 5. For instance, step 6 requires a specific, protected environment for preliminary testing. That stage must ensure that all data are depersonalized and the sensitive data are changed.

After that stage, you can return to step 5 again for the tested database in the protected non-production environment. Then, you forward the tested backup to the necessary environments to restore it and use it for development and testing. 
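
For reference, steps 5 and 6 come down to the standard BACKUP and RESTORE commands, which Part 3 covers in detail. Here is a minimal sketch with an assumed backup path:

-- Step 5: back up the prepared database (the path is an assumption)
BACKUP DATABASE [JobEmplDB_Test]
TO DISK = N'E:\Backup\JobEmplDB_Test.bak'
WITH COMPRESSION, INIT;

-- Step 6: restore it in the target environment
-- (add MOVE clauses if the drive layout differs there)
RESTORE DATABASE [JobEmplDB_Test]
FROM DISK = N'E:\Backup\JobEmplDB_Test.bak'
WITH REPLACE;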

We’ve presented the general algorithm for copying and changing the production database’s data. Let’s describe how to implement it.

Implementation of the general algorithm

Creating a new empty database

You can create an empty database using the CREATE DATABASE statement, as shown here; a sketch of the statement, based on the file layout described below, follows the list.

The database is named JobEmplDB_Test. It has three filegroups:

  • PRIMARY – the default primary filegroup. It defines two files: JobEmplDB_Test1 (path D:\DBData\JobEmplDB_Test1.mdf) and JobEmplDB_Test2 (path D:\DBData\JobEmplDB_Test2.ndf). Each file’s initial size is 64 MB, and the growth step is 8 MB.
  • DBTableGroup – a custom filegroup that defines two files: JobEmplDB_TestTableGroup1 (path D:\DBData\JobEmplDB_TestTableGroup1.ndf) and JobEmplDB_TestTableGroup2 (path D:\DBData\JobEmplDB_TestTableGroup2.ndf). The initial size of each file is 8 GB, and the growth step is 1 GB.
  • DBIndexGroup – a custom filegroup that defines two files: JobEmplDB_TestIndexGroup1 (path D:\DBData\JobEmplDB_TestIndexGroup1.ndf) and JobEmplDB_TestIndexGroup2 (path D:\DBData\JobEmplDB_TestIndexGroup2.ndf). The initial size of each file is 16 GB, and the growth step is 1 GB.

Also, this database includes one transaction log file: JobEmplDB_Testlog, path E:\DBLog\JobEmplDB_Testlog.ldf. The initial size of the file is 8 GB, and the growth step is 1 GB.
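
Put together from the file layout above, the CREATE DATABASE statement could look like the following sketch. The names, sizes, and paths are the ones listed; treat it as an illustration rather than the author’s original script.

CREATE DATABASE [JobEmplDB_Test]
ON PRIMARY
    (NAME = N'JobEmplDB_Test1', FILENAME = N'D:\DBData\JobEmplDB_Test1.mdf', SIZE = 64MB, FILEGROWTH = 8MB),
    (NAME = N'JobEmplDB_Test2', FILENAME = N'D:\DBData\JobEmplDB_Test2.ndf', SIZE = 64MB, FILEGROWTH = 8MB),
FILEGROUP [DBTableGroup]
    (NAME = N'JobEmplDB_TestTableGroup1', FILENAME = N'D:\DBData\JobEmplDB_TestTableGroup1.ndf', SIZE = 8GB, FILEGROWTH = 1GB),
    (NAME = N'JobEmplDB_TestTableGroup2', FILENAME = N'D:\DBData\JobEmplDB_TestTableGroup2.ndf', SIZE = 8GB, FILEGROWTH = 1GB),
FILEGROUP [DBIndexGroup]
    (NAME = N'JobEmplDB_TestIndexGroup1', FILENAME = N'D:\DBData\JobEmplDB_TestIndexGroup1.ndf', SIZE = 16GB, FILEGROWTH = 1GB),
    (NAME = N'JobEmplDB_TestIndexGroup2', FILENAME = N'D:\DBData\JobEmplDB_TestIndexGroup2.ndf', SIZE = 16GB, FILEGROWTH = 1GB)
LOG ON
    (NAME = N'JobEmplDB_Testlog', FILENAME = N'E:\DBLog\JobEmplDB_Testlog.ldf', SIZE = 8GB, FILEGROWTH = 1GB);
GO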

Copying the schema and the necessary data from the production database into the newly created database

To copy the schema and the necessary data from the production database into the new one, you can use several tools. First, there is Visual Studio (SSDT). Alternatively, you can use third-party utilities.

Making scripts for data changes

Essential requirements for the data change scripts

1. It must be impossible to restore the real data from the changed data.
For example, reversing the strings won’t do, as it allows restoring the real data. Usually, the method is to replace each character or byte with a pseudorandom character or byte. The same applies to date and time values.

2. The data change must not alter the selectivity of the values.
Simply assigning NULL to a table field won’t work. Instead, you have to ensure that equal values in the real data remain equal in the altered data. For instance, if the value 103785 occurs 12 times in a table column of the real data, the value that replaces it must also occur 12 times in the same column of the altered data (see the sketch after this list).

3. The size and length of the values should not differ significantly in the altered data. E.g., if you replace each byte or character with a pseudorandom byte or character, the string keeps its original size and length.

4. Interrelations in the data must not be broken by the changes. This concerns foreign keys and all other cases where the altered data are referenced. Changed data must keep the same relations that the real data had.
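
As an illustration of requirement 2, here is one possible way (a sketch, not the author’s script) to replace values consistently: build a mapping with exactly one pseudorandom substitute per distinct source value and apply it, so equal values stay equal and the frequencies are preserved. The [dbo].[Employee].[CountRequest] column from the sample database is used for the example:

-- Materialize one row per distinct source value
SELECT DISTINCT [CountRequest] AS [OldValue],
       CAST(NULL AS INT)       AS [NewValue]
INTO #CountRequestMap
FROM [dbo].[Employee];

-- Assign a pseudorandom substitute to every distinct source value
UPDATE #CountRequestMap
SET [NewValue] = ABS(CHECKSUM(NEWID()) % 100000);

-- Apply the mapping: equal source values get equal substitutes
UPDATE emp
SET emp.[CountRequest] = m.[NewValue]
FROM [dbo].[Employee] AS emp
INNER JOIN #CountRequestMap AS m ON m.[OldValue] = emp.[CountRequest];

DROP TABLE #CountRequestMap;

Collisions (two different source values mapped to the same substitute) are possible with such a simple formula, so for columns where exact selectivity matters you may want a collision-free mapping.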

Implementing the data change scripts

Now, let’s review a particular case of changing the data to depersonalize it and hide the sensitive information. The sample is the recruitment database.

The sample database includes the following personal data that you need to depersonalize:

  1. Last and first name;
  2. Date of birth;
  3. The ID-card issue date;
  4. The remote access certificate as a byte sequence;
  5. The service fee for resume promotion.

First, let’s look at simple examples for each type of data to be altered:

  1. Date and time change;
  2. Numerical value change;
  3. Changing the byte sequences;
  4. Character data change.

Date and time change

You can get a random date and time using the following script:

DECLARE @dt DATETIME;

-- @StartDate and @FinishDate must be declared and assigned beforehand;
-- they set the bounds of the range for the pseudorandom date and time.
SET @dt = CAST(CAST(@StartDate AS FLOAT) + (CAST(@FinishDate AS FLOAT) - CAST(@StartDate AS FLOAT)) * RAND(CHECKSUM(NEWID())) AS DATETIME);

Here, @StartDate and @FinishDate are the lower and upper bounds of the range for the pseudorandom date and time generation.

The value is generated with the system functions RAND, CHECKSUM, and NEWID. For example, you can change the document issue date as follows:

UPDATE [dbo].[Employee]
SET [DocDate] = CAST(CAST(CAST(CAST([BirthDate] AS DATETIME) AS FLOAT) + (CAST(GETDATE() AS FLOAT) - CAST(CAST([BirthDate] AS DATETIME) AS FLOAT)) * RAND(CHECKSUM(NEWID())) AS DATETIME) AS DATE);

The [DocDate] field stands for the document issue date. We replace it with a pseudorandom date, keeping in mind the date ranges and their limitations.

The lower limit is the candidate’s date of birth, and the upper limit is the current date. We don’t need the time here, so the final cast converts the pseudorandom DATETIME value to DATE. You can get pseudorandom values for any part of the date and time in the same way.
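
For instance, a pseudorandom time of day can be obtained by casting a random day fraction to the TIME type (a small sketch, not taken from the article):

DECLARE @t TIME;

-- RAND returns a fraction of a day; casting it to DATETIME and then to TIME
-- keeps only the time-of-day part.
SET @t = CAST(CAST(RAND(CHECKSUM(NEWID())) AS DATETIME) AS TIME);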

Numerical value change

You can get a random integer with the help of the following script:

DECLARE @Int INT;

-- @MinVal and @MaxVal must be declared and assigned beforehand;
-- they set the bounds of the range for the pseudorandom integer.
SET @Int = CAST(((@MaxVal + 1) - @MinVal) *
    RAND(CHECKSUM(NEWID())) + @MinVal AS INT);

@MinVal and @MaxVal are the start and end values of the range for the pseudorandom number generation. It is generated using the system functions RAND, CHECKSUM, and NEWID. For example, you can change the number of requests like this:

-- Example bounds (assumed values; choose a range that fits your data)
DECLARE @MinVal INT = 0, @MaxVal INT = 100;

UPDATE [dbo].[Employee]
SET [CountRequest] = CAST(((@MaxVal + 1) - @MinVal) *
    RAND(CHECKSUM(NEWID())) + @MinVal AS INT);

The [CountRequest] field stands for the number of requests that companies make for this candidate’s resume.

Similarly, you can get pseudorandom values for any numerical type. E.g., here is how to generate a random number of the DECIMAL(18,2) type:

DECLARE @Dec DECIMAL(18,2);

SET @Dec=CAST(((@MaxVal + 1) - @MinVal) *
    RAND(CHECKSUM(NEWID())) + @MinVal AS DECIMAL(18,2));

Thus, you can update the resume promotion service fee in the following way:

-- Example bounds (assumed values; choose a range that fits your data)
DECLARE @MinVal DECIMAL(18,2) = 0, @MaxVal DECIMAL(18,2) = 100000;

UPDATE [dbo].[Employee]
SET [PaymentAmount] = CAST(((@MaxVal + 1) - @MinVal) *
    RAND(CHECKSUM(NEWID())) + @MinVal AS DECIMAL(18,2));

Changing the byte sequences

You can get a random byte sequence using the following script:

DECLARE @res VARBINARY(MAX);

SET @res = CRYPT_GEN_RANDOM(@Length, CAST(NEWID() AS VARBINARY(16)));

@Length stands for the sequence’s length. It defines the number of bytes returned. Here, @Length must not be larger than 16.

Generation is done with the help of the system functions CRYPT_GEN_RANDOM and NEWID.

E.g., you can update the remote access certificate for each candidate in the following way:

UPDATE [dbo].[Employee]
SET [RemoteAccessCertificate] = CRYPT_GEN_RANDOM(CAST(LEN([RemoteAccessCertificate]) AS INT), CAST(NEWID() AS VARBINARY(16)));

We generate a pseudorandom byte sequence of the same length as the value stored in the [RemoteAccessCertificate] field at the time of the change. We assume here that the byte sequence length does not exceed 16.

Similarly, we can create our own function that returns pseudorandom byte sequences of any length. It would concatenate the results of the CRYPT_GEN_RANDOM system function with the simple “+” operator. However, 16 bytes are usually enough in practice.

Let’s make a sample function that returns a pseudorandom byte sequence of a given length, where it is possible to set a length of more than 16 bytes. For this, create the following view:

-- The [test] schema is assumed to exist (e.g., created with CREATE SCHEMA [test];)
CREATE VIEW [test].[GetNewID]
AS
SELECT NEWID() AS [NewID];
GO

We need it to work around the restriction that forbids using NEWID inside a function.

In the same way, create the next view for the same purpose:

CREATE VIEW [test].[GetRand]
AS
SELECT RAND(CHECKSUM((SELECT TOP(1) [NewID] FROM [test].[GetNewID]))) AS [Value];
GO

Create one more view:

CREATE VIEW [test].[GetRandVarbinary16]
AS
SELECT CRYPT_GEN_RANDOM(16, CAST((SELECT TOP(1) [NewID] FROM [test].[GetNewID]) AS VARBINARY(16))) AS [Value];
GO

All three view definitions are here. And here is the implementation of the function that returns a pseudorandom byte sequence of a given length.

First, we check whether the necessary function already exists. If not, we create a stub first. In any case, the code then alters the function’s definition appropriately. In the end, we add the function’s description via extended properties. More details on documenting the database are in this article.
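
The linked implementation is not reproduced here, but a minimal sketch of such a [test].[GetRandVarbinary] function could look as follows. It follows the stub-then-alter pattern described above and assumes that the view-based workaround also lifts the restriction on calling CRYPT_GEN_RANDOM inside a function, as it does for NEWID:

-- Create a stub first if the function does not exist yet
IF OBJECT_ID(N'[test].[GetRandVarbinary]', N'FN') IS NULL
    EXEC(N'CREATE FUNCTION [test].[GetRandVarbinary](@Length INT) RETURNS VARBINARY(MAX) AS BEGIN RETURN 0x; END');
GO

ALTER FUNCTION [test].[GetRandVarbinary] (@Length INT)
RETURNS VARBINARY(MAX)
AS
BEGIN
    DECLARE @res VARBINARY(MAX) = 0x;

    -- Append 16 pseudorandom bytes per iteration until the requested length is reached
    WHILE (DATALENGTH(@res) < @Length)
        SET @res = @res + (SELECT TOP(1) [Value] FROM [test].[GetRandVarbinary16]);

    -- Trim the result to exactly @Length bytes
    RETURN SUBSTRING(@res, 1, @Length);
END;
GO

The author’s version additionally documents the function through extended properties, which this sketch omits.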

To update the remote access certificate for each candidate, you can do as follows:

UPDATE [dbo].[Employee]
SET [RemoteAccessCertificate] = [test].[GetRandVarbinary](CAST(LEN([RemoteAccessCertificate]) AS INT));

As you see, there are no limitations to the byte sequence length here.

Read also

Synthetic Data Generation. Part 2: Data Changing

Synthetic Data Generation. Part 3: Backup and Restore

Evgeniy Gribkov

Evgeniy is an MS SQL Server database analyst, developer, and administrator. He is involved in the development and testing of tools for SQL Server database management. Evgeniy also writes SQL Server-related articles.