In the previous part of the series, we’ve examined the second approach to filling the database in with data for testing and development purposes. It is the synthetic data generation approach.
Data generation tools (for external resources)
Full list of tools
Now, let’s examine one of these tools more precisely.
An overview of the employees’ generation by the Data Generator for SQL Server
The Data Generator for SQL Server utility is embedded in SSMS, and also it is a part of dbForge Studio. We reviewed this utility here. Let’s now examine how it works for synthetic data generation. As examples, we use the [dbo].[Employee] and the [dbo].[JobHistory] tables.
This generator can quickly generate first and last names of candidates for the [FirstName] and [LastName] fields respectively:
Note that FirstName requires choosing the “First Name” value in the “Generator” section. For LastName, you need to select the “Last Name” value from the “Generator” section.
It is important to note that the generator automatically determines which generation type it needs to apply to every field. The settings above were set by the generator itself, without manual correction.
You can configure distribution of values for the date of birth [BirthDate]:
Set the distribution for the document’s date of issue [DocDate] through the Phyton generator using the below script:
import random from System import DateTime # receive the value from the Birthday field bd = DateTime.Parse(str(BirthDate)) # receive the current date current = DateTime.Now # calculate the age in years timeSpan = current - bd age = (int)(timeSpan.TotalDays / 365); # passport’s date of issue releaseDate = 0 if age >= 45: releaseDate = bd.AddDays(45 * 365 + random.randint(1, 30)) # randomize the issue during the month elif age >= 20: releaseDate = bd.AddDays(20 * 365 + random.randint(1, 30)) # randomize the issue during the month else: releaseDate = bd.AddDays(14 * 365 + random.randint(1, 30)) # randomize the issue during the month releaseDate
This way, the [DocDate] configuration will look as follows:
For the document’s number [DocNumber], we can select the necessary type of unique data generation, and edit the generated data format, if needed:
E.g., instead of the format
We can set the following format:
This format means that the line will be generated in format XX-XXXXXXX (X – is a digit in the range of 0 to 9).
We set up the generator for [CountRequest] and [PaymentAmount] fields in the same way, according to the generated data type:
In the first case, we set the values’ range of 0 to 2048 for [CountRequest]. In the second case, it is the range of 0 to 100000 for [PaymentAmount].
We configure generation for [RemoteAccessCertificate] and [Address] fields in the same way:
In the first case, we limit the byte sequence [RemoteAccessCertificate] with the range of lengths of 16 to 32. In the second case, we select values for [Address] as real addresses. It makes the generated values looking like the real ones.
This way, we’ve configured the synthetic data generation settings for the candidates’ table [dbo].[Employee]. Let’s now set up the synthetic data generation for the [dbo].[JobHistory] table.
We set it to take the data for the [EmployeeID] field from the candidates’ table [dbo].[Employee] in the following way:
We select the generator’s type from the table or presentation. We then define the sample of MS SQL Server, the database, and the table to take the data from. We can also configure filters in the “WHERE filter” section, and select the [EmployeeID] field.
Here we suppose that we generate the “employees” first, and then we generate the data for the [dbo].[JobHistory] table, basing on the filled [dbo].[Employee] reference.
However, if we need to generate the data for both [dbo].[Employee] and [dbo].[JobHistory] at the same time, we need to select “Foreign Key (manually assigned) – references a column from the parent table,” referring to the [dbo].[Employee].[EmployeeID] column:
Similarly, we set up the data generation for the following fields.[CompanyID] – from [dbo].[Company], the “companies” table: [PositionID] – from the table of positions [dbo].[Position]: [ProjectID] – from the table of projects [dbo].[Project]:
The tool cannot link the columns from different tables and shift them in some way. However, the generator can shift the date within one table – the “date” generator – fill with date values with Range – Offset from the column. Also, it can use data from a different table, but without any transformation (Table or View, SQL query, Foreign key generators).
That’s why we resolve the dates’ problem (BirthDate < DocDate и StartDate < DocDate) in a different way. It will be by division of the time range for every column.
E.g., we limit the BirthDate with the 40-50 years’ interval. Then, we restrict the DocDate with 20-40 years’ interval. The StartDate is, respectively, limited with 25-35 years’ interval, and we set up the FinishDate with the offset from StartDate.
We set up the date of birth:
Set up the date of the document’s issue
Then, the StartDate will match the age from 35 to 45:
The simple offset generator sets FinishDate:
The result is, a person has worked for three months till the current date.
Also, to configure the date of the working end, we can use a small Python script:
import random from System import DateTime bd = DateTime.Parse(str(StartDate)) releaseDate = bd.AddDays(random.randint(1, 30)) releaseDate
This way, we receive the below configuration for the dates of work end [FinishDate] data generation:
Similarly, we fill in the rest of fields. We set the generator type – string, and set the range for generated lines’ lengths:
Also, you can save the data generation project as dgen-file consisting of:
- The database name;
- All settings for bases, tables, and columns;
- All settings of generators by columns, etc.
We can save all these settings: it is enough to keep the project’s file and work with the database further, using that file:
There is also the possibility to both save the new generators from scratch and save the custom settings in a new generator:
Thus, we’ve configured the synthetic data generation settings used for the jobs’ history table [dbo].[JobHistory].
This article examines two approaches to filling the data in the database for testing and development:
- Copying and changing the data from the production database;
- Synthetic data generation.
We’ve defied the objects for each approach and each script implementation. These objects are here. We’ve also provided scripts for changing the data from the production database and synthetic data generation. An example is the database of recruitment services. In the end, we’ve examined popular data generation tools.
We’ve also reviewed the Data Generator for SQL Server solution for the synthetic data generation into the recruitment service database in detail. You can see it yourself that using the ready solution reduces the synthetic data generation preparation time significantly.