Synthetic Data Generation. Part 4: Tools

Total: 2 Average: 5

In the previous part of the series, we’ve examined the second approach to filling the database in with data for testing and development purposes. It is the synthetic data generation approach.

Data generation tools (for external resources)

Full list of tools

DATPROF

DATPROF
Picture 18. DATPROF

IRI RowGen

IRI RowGen
Picture 19. IRI RowGen

Data Generator for SQL Server

Data Generator for SQL Server
Picture 20. Data Generator for SQL Server

Redgate SQL Data Generator

Redgate SQL Data Generator
Picture 21. Redgate SQL Data Generator

DTM Data Generator

DTM Data Generator
Picture 22. DTM Data Generator

Datanamic Data Generator MultiDB

Datanamic Data Generator MultiDB
Picture 23. Datanamic Data Generator MultiDB

Now, let’s examine one of these tools more precisely.

An overview of the employees’ generation by the Data Generator for SQL Server

The Data Generator for SQL Server utility is embedded in SSMS, and also it is a part of dbForge Studio. We reviewed this utility here. Let’s now examine how it works for synthetic data generation. As examples, we use the [dbo].[Employee] and the [dbo].[JobHistory] tables.

This generator can quickly generate first and last names of candidates for the [FirstName] and [LastName] fields respectively:

Configuring the synthetic data generation for the FirstName field
Picture 24. Configuring the synthetic data generation for the FirstName field
Configuring the synthetic data generation for the LastName field
Picture 25. Configuring the synthetic data generation for the LastName field

Note that FirstName requires choosing the “First Name” value in the “Generator” section. For LastName, you need to select the “Last Name” value from the “Generator” section.

It is important to note that the generator automatically determines which generation type it needs to apply to every field. The settings above were set by the generator itself, without manual correction.

You can configure distribution of values for the date of birth [BirthDate]:

Configuration of the synthetic data generation for the BirthDate field
Picture 26. Configuration of the synthetic data generation for the BirthDate field

Set the distribution for the document’s date of issue [DocDate] through the Phyton generator using the below script:

import random
from System import DateTime
# receive the value from the Birthday field 
bd = DateTime.Parse(str(BirthDate))
# receive the current date 
current = DateTime.Now
# calculate the age in years 
timeSpan = current - bd
age = (int)(timeSpan.TotalDays / 365);
# passport’s date of issue 
releaseDate = 0
if age >= 45:
releaseDate = bd.AddDays(45 * 365 + random.randint(1, 30)) 
# randomize the issue during the month
elif age >= 20:
  releaseDate  = bd.AddDays(20 * 365 + random.randint(1, 30))
 # randomize the issue during the month
else:
  releaseDate  = bd.AddDays(14 * 365 + random.randint(1, 30)) 
# randomize the issue during the month
releaseDate

This way, the [DocDate] configuration will look as follows:

Configuration of the synthetic data generation for the DocDate field
Picture 27. Configuration of the synthetic data generation for the DocDate field

For the document’s number [DocNumber], we can select the necessary type of unique data generation, and edit the generated data format, if needed:

Configuring the synthetic data generation for the DocNumber field
Picture 28. Configuring the synthetic data generation for the DocNumber field

E.g., instead of the format

(0[1-6]|1[0-6]|2[0-7]|3[0-9]|4[0-68]|5[0-9]|6[0-8]|7[1-7]|8[0-8]|9[0-589])-(\d{7})

We can set the following format:

(\d{2})-(\d{7})

This format means that the line will be generated in format XX-XXXXXXX (X – is a digit in the range of 0 to 9).

We set up the generator for [CountRequest] and [PaymentAmount] fields in the same way, according to the generated data type:

Configuring the synthetic data generation for the CountRequest field
Picture 29. Configuring the synthetic data generation for the CountRequest field
Configuring the synthetic data generation for the PaymentAmount field
Picture 30. Configuring the synthetic data generation for the PaymentAmount field

In the first case, we set the values’ range of 0 to 2048 for [CountRequest]. In the second case, it is the range of 0 to 100000 for [PaymentAmount].

We configure generation for [RemoteAccessCertificate] and [Address] fields in the same way:

Configuring the synthetic data generation for RemoteAccessCertificate field
Picture 31. Configuring the synthetic data generation for RemoteAccessCertificate field
Configuring the synthetic data generation for the Address field
Picture 32. Configuring the synthetic data generation for the Address field

In the first case, we limit the byte sequence [RemoteAccessCertificate] with the range of lengths of 16 to 32. In the second case, we select values for [Address] as real addresses. It makes the generated values looking like the real ones.

This way, we’ve configured the synthetic data generation settings for the candidates’ table [dbo].[Employee]. Let’s now set up the synthetic data generation for the [dbo].[JobHistory] table.

We set it to take the data for the [EmployeeID] field from the candidates’ table [dbo].[Employee] in the following way:

Configuring the synthetic data generation for the EmployeeID field
Picture 33. Configuring the synthetic data generation for the EmployeeID field

We select the generator’s type from the table or presentation. We then define the sample of MS SQL Server, the database, and the table to take the data from. We can also configure filters in the “WHERE filter” section, and select the [EmployeeID] field.

Here we suppose that we generate the “employees” first, and then we generate the data for the [dbo].[JobHistory] table, basing on the filled [dbo].[Employee] reference.

However, if we need to generate the data for both [dbo].[Employee] and [dbo].[JobHistory] at the same time, we need to select “Foreign Key (manually assigned) – references a column from the parent table,” referring to the [dbo].[Employee].[EmployeeID] column:

Selecting the generator: Foreign Key (manually assigned) - references a column from parent table
Picture 33-1. Selecting the generator: Foreign Key (manually assigned) – references a column from parent table

Similarly, we set up the data generation for the following fields.

[CompanyID] – from [dbo].[Company], the “companies” table:

Configuring the synthetic data generation for the CompanyID.
Picture 34. Configuring the synthetic data generation for the CompanyID.
[PositionID] – from the table of positions [dbo].[Position]:

Configuring the synthetic data generation for the PositionID field
Picture 35. Configuring the synthetic data generation for the PositionID field
[ProjectID] – from the table of projects [dbo].[Project]:

Configuring the synthetic data generation for the ProjectID field
Picture 36. Configuring the synthetic data generation for the ProjectID field

The tool cannot link the columns from different tables and shift them in some way. However, the generator can shift the date within one table – the “date” generator – fill with date values with Range – Offset from the column. Also, it can use data from a different table, but without any transformation (Table or View, SQL query, Foreign key generators).

That’s why we resolve the dates’ problem (BirthDate < DocDate и StartDate < DocDate) in a different way. It will be by division of the time range for every column.

E.g., we limit the BirthDate with the 40-50 years’ interval. Then, we restrict the DocDate with 20-40 years’ interval. The StartDate is, respectively, limited with 25-35 years’ interval, and we set up the FinishDate with the offset from StartDate.

We set up the date of birth:

Setting up the date of birth within 40-50 years' range.
Picture 37-1. Setting up the date of birth within 40-50 years’ range.

Set up the date of the document’s issue

Setting up the document's issue date in the 20-40 years' interval.
Picture 37-2. Setting up the document’s issue date in the 20-40 years’ interval.

Then, the StartDate will match the age from 35 to 45:

Setting up the date of start the work in the 25-35 years' range.
Picture 37-3. Setting up the date of start the work in the 25-35 years’ range.

The simple offset generator sets FinishDate:

Setting up the finish date of work with the offset generator.
Picture 37-4. Setting up the finish date of work with the offset generator.

The result is, a person has worked for three months till the current date.

Also, to configure the date of the working end, we can use a small Python script:

import random
from System import DateTime
  
bd = DateTime.Parse(str(StartDate))

releaseDate = bd.AddDays(random.randint(1, 30))
 
releaseDate

This way, we receive the below configuration for the dates of work end [FinishDate] data generation:

Setting up the synthetic data configuration for the FinishDate field
Picture 38. Setting up the synthetic data configuration for the FinishDate field

Similarly, we fill in the rest of fields. We set the generator type – string, and set the range for generated lines’ lengths:

Configuring the synthetic data generation for the Description field
Picture 39. Configuring the synthetic data generation for the Description field

Also, you can save the data generation project as dgen-file consisting of:

  1. Connection;
  2. The database name;
  3. All settings for bases, tables, and columns;
  4. All settings of generators by columns, etc.

We can save all these settings: it is enough to keep the project’s file and work with the database further, using that file:

Saving the data generation project
Picture 39-1. Saving the data generation project

There is also the possibility to both save the new generators from scratch and save the custom settings in a new generator:

Options to save the new generators from scratch and save the custom settings in a new generator

Thus, we’ve configured the synthetic data generation settings used for the jobs’ history table [dbo].[JobHistory].

Conclusion

This article examines two approaches to filling the data in the database for testing and development:

  1. Copying and changing the data from the production database;
  2. Synthetic data generation.

We’ve defied the objects for each approach and each script implementation. These objects are here. We’ve also provided scripts for changing the data from the production database and synthetic data generation. An example is the database of recruitment services. In the end, we’ve examined popular data generation tools.

We’ve also reviewed the Data Generator for SQL Server solution for the synthetic data generation into the recruitment service database in detail. You can see it yourself that using the ready solution reduces the synthetic data generation preparation time significantly.

References

Read also

Synthetic Data Generation. Part 1: Data Copying

Synthetic Data Generation. Part 2: Data Changing

Synthetic Data Generation. Part 3: Backup and Restore

Evgeniy Gribkov

Evgeniy Gribkov

Evgeniy is a MS SQL Server database analyst, developer and administrator. He is involved in development and testing of tools for SQL Server database management. Evgeniy also writes SQL Server-related articles.