The current article focuses on the conceptual understanding of Text Mining. It gives clear examples and teaches us to use the T-SQL script to simulate the basic Text Mining purpose and process.
Additionally, we review the importance of identifying frequently used words in Text Mining.
About Text Mining
First, we need to define the meaning of Text Mining. Without it, the usage of tools will be inappropriate and impossible. Moreover, without a proper understanding of the concept, it is hard to grasp the potential objectives and business value of Text Mining.
If you are new to this subject, try to define Text Mining how you understand it. Keep it in mind while reading this article. When we explain this concept, you’d see how close or far you were.
Note that Text Mining may have several definitions. As long as the central concept is clear, it is fine.
What is Text Mining
According to Wikipedia, Text Mining is “the discovery by computer of new, previously unknown information, by automatically extracting information from different written resources.”
However, the meaning of the term is broader. It ranges from the simple text or textual analysis to complex data mining where you apply modern tools and technologies.
What is Text Analysis
Text analysis or textual analysis is a method of analysing the text to understand it using suitable tools and technologies.
Why we need to understand Text
Understanding of the text brings us valuable information that the document contains.
What are the steps in Text Mining
Text Mining consists of two phases:
- Analysing the text.
- Understanding the text.
One of the Text Mining forms is information extraction.
What is Information Extraction
The term “Information extraction” stands for the process of extracting meaningful information from unstructured or textual data. It is valuable, as it compensates for the impossibility to get much from the numeric data extraction.
Prospects of Text Mining
According to Wikipedia, Text Mining applies to the following tasks:
- Information Extraction.
- Data Mining.
- Knowledge Discovery in Databases (KDD).
Text Analysis vs. Text Mining
We may consider using tools and technologies for text analysis as the first step in Text Mining. When we know for sure why we extract the data and then analyse them manually to get valuable information, we apply the basic form of Text Mining.
Text Analysis and Text Mining concepts are close, and the difference is mostly in the approach. Text Mining suggests using specific tools for data collection and preparing them for analysis.
Microsoft Text Mining Tools
The available Text Mining tools are various. They range from SQL Server Full-Text Search to building Business Intelligence Semantic Models.
How Text Analysis works
Several methodologies are applicable for Text Analysis. One of the most common methods is to extract words of interest from a text of a document to determine the nature of the source.
For example, if the text contains lots of data, it mostly includes some words describing its essence. These words are going to be used as keywords needed to understand the text.
Word Frequency
Word Frequency is one of the Text Mining methods. A word becomes a keyphrase, depending on the number of occurrences of that word in the text. It is one of the approaches helping to achieve the Text Mining objectives.
The process starts with the search within the text.
Example 1: Search with Incorrect Spelling
A good example of Text Mining or Textual Analysis is the attempt to search for a misspelled word.
Searching “text minding” instead of “text mining” on Google brings us automatically the best recommendation based on textual analysis:
As soon as we type a text into the search field, that text undergoes some form of Textual Analysis or Text Mining process.
It does not only help to rectify the spelling mistake. Also, the search engine redirects the surfer to the correct topic of interest. Note that the option to continue the search for the misspelled word remains there.
Example 2: Sports Word Tag Cloud
Another example of Text Mining is when you need to define the popularity of a particular sport in a region. The research grounds on extracting and analysing an individual’s favourite game:
This word tag cloud clearly tells that Football is the most popular sport. Cycling and Cricket are equally liked, and they share the second position. Badminton is the third favourite game in a region, city, or town.
These examples look appealing, but they may become quite complex to implement, as we focus on the basics of Text Mining.
Example 3: Human Intelligence
It is rather an abstract example, but a good one to understand the concept of Text Mining better.
In our daily lives, we use memory cues to recall something quite often. For example, you are going to meet a friend who lives in the street next to a children’s park, and that park is perfectly visible from far away. That park becomes the first “benchmark” you search for before taking a turn to the necessary house.
“Park” is a keyphrase. It reminds you that your friend’s home is not only nearby, but also in that direction.
However, the human mind is far too complex and unimaginably marvellous to limit its even simplest operations to finding keyphrases.
Importance of Keyphrases
We now understand the importance of keyphrases in a text or even in our daily lives. In no way, we can ignore it.
It makes identifying keyphrases an excellent option for being the first step in Text Mining. In some cases, it becomes an elementary form of Text Mining itself.
Simulating Text Mining with T-SQL
Now that we get the meaning of Text Mining or Textual Analysis, it is time to proceed to practice. The task is to understand the primary process of the keyphrases’ identification in Text Mining.
For that, we need a T-SQL script.
Please bear in mind that we are trying to simulate the keyphrase identification process in Text Mining. It is proper training that allows us to use available tools later on (in the upcoming articles) when we actually perform that process.
Setup Sample Database (SQLTextMining)
To work on essential Text Mining or Textual Analysis simulation, we have first to set up a sample database.
Please create a sample database named SQLTextMining as follows:
-- Create sample database (SQLTextMining)
CREATE DATABASE SQLTextMining;
GO
USE SQLTextMining;
-- (1) Create Article table in the sample database
CREATE TABLE [dbo].[Article] (
[ArticleId] INT IDENTITY (1, 1) NOT NULL,
[Category] VARCHAR (50) NULL,
[Author] VARCHAR (50) NULL,
[Title] VARCHAR (150) NULL,
[Published] DATETIME2 (7) NULL,
CONSTRAINT [PK_Article] PRIMARY KEY (ArticleId)
);
GO
-- Reset tabular data
TRUNCATE TABLE dbo.Article
GO
-- Add data to the table
SET IDENTITY_INSERT [dbo].[Article] ON
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (1, N'Development', N'Atif', N'Introduction to T-SQL Programming ', N'2017-01-01 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (2, N'Testing', N'Peter', N'Database Unit Testing Fundamentals', N'2017-01-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (3, N'Data Analysis', N'Haroon', N'Learn Basic Data Analysis with SQL Window Functions', N'2019-06-12 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (4, N'Development', N'Peter', N'Common Table Expressions (CTE)', N'2017-02-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (5, N'Testing', N'Sadaf', N'Manual Testing vs. Automated Testing', N'2017-03-20 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (6, N'Testing', N'Atif', N'Beyond Database Unit Testing', N'2017-11-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (7, N'Testing', N'Sadaf', N'Cross Database Unit Testing', N'2017-12-20 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (8, N'Development', N'Haroon', N'How to Properly Use the T-SQL IsNumeric Function', N'2018-01-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (9, N'Testing', N'Sadaf', N'Scripting and Testing Database for beginners ', N'2018-02-15 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (10, N'Development', N'Atif', N'Advanced Database Development Methods', N'2018-07-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (11, N'Testing', N'Sadaf', N'How to Write Unit Tests for your Database', N'2018-11-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (12, N'Development', N'Peter', N'Database Development using Modern Tools', N'2018-12-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (13, N'DLM', N'Atif', N'Designing, Developing and Deploying Database', N'2019-01-01 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (14, N'DLM', N'Peter', N'How to Apply Database Lifecycle Management', N'2019-02-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (15, N'Testing', N'Saqib', N'SQL Unit Testing Stored Procedures', N'2019-03-10 00:00:00')
INSERT INTO [dbo].[Article] ([ArticleId], [Category], [Author], [Title], [Published]) VALUES (16, N'Reporting', N'Haroon', N'SSRS Reports Development in Simple Terms', N'2019-07-10 22:26:42')
SET IDENTITY_INSERT [dbo].[Article] OFF
Viewing Article Table
Run the following T-SQL script to view the data in that sample database’s article table:
USE SQLTextMining
-- View Top 5 records from article table
SELECT TOP 5 a.ArticleId,a.Category,a.Title FROM dbo.Article a
order by a.ArticleId
The output is as follows:
Selecting Title for Text Mining Simulation
Choose the column Title to test the Text Mining / Textual Analysis simulation. We make it as simple as possible for better understanding.
The title column is the one that interests us. We test it against our manual Text Mining method applied with T-SQL script(s):
Finding Keyphrase Logic (Simulation)
As already mentioned, one of the first things in Text Mining is identifying keyphrases from the text analysed.
Now, we are going to extract these keyphrases manually using the T-SQL string function CHARINDEX().
The logic we are going to apply is to find the space in a string (text), which actually gives us a keyphrase. Each word is a keyphrase.
Extracting Keyphrase 1 from Title
We simulate the Text Mining process of identifying the first keyphrase by using charindex() function as follows:
USE SQLTextMining
-- Declare a table variable for keyphrase1
DECLARE @ArticleKeyphrase1 TABLE
(
ArticleId INT ,
TITLE VARCHAR(100),
Position INT,
Keyphrase VARCHAR(30)
)
-- Extract keyphrase using charindex() function and insert into the table (variable)
INSERT INTO @ArticleKeyphrase1
SELECT ArticleId,TITLE,CHARINDEX(' ',Title,1) AS Position,SUBSTRING(TITLE,1,CHARINDEX(' ',Title,1)) as KeyPhrase
FROM dbo.Article
ORDER BY ArticleId
-- View the first keyphrase extracted from the Title of the document
SELECT * FROM @ArticleKeyphrase1
The output is as follows:
Extracting Keyphrase 2 from Title
Extract the second keyphrase using the following T-SQL script. Continue the script in the same query window used for the previous script. Otherwise, the referenced table variable will not be recognized:
-- Creating a support table for keyphrase 2
DECLARE @ArticleKeyphrase2Support TABLE
(
ArticleId INT ,
TITLE VARCHAR(100)
)
-- Populating support table for keyphrase 2
INSERT INTO @ArticleKeyphrase2Support
SELECT ArticleId,SUBSTRING(TITLE,Position+1,len(title))
FROM @ArticleKeyphrase1
order by ArticleId
-- Creating keyphrase 2 table variable
DECLARE @ArticleKeyphrase2 TABLE
(
ArticleId INT ,
TITLE VARCHAR(100),
Position INT,
Keyphrase VARCHAR(30)
)
-- Inserting data into keyphrase 2 table after extracting second keyphrase
insert into @ArticleKeyphrase2
SELECT ArticleId,TITLE,CHARINDEX(' ',Title,1) AS Position,SUBSTRING(TITLE,1,CHARINDEX(' ',Title,1)) as KeyPhrase
FROM @ArticleKeyphrase2Support
-- View the second keyphrase extracted from Title column
SELECT * FROM @ArticleKeyphrase2
The output is as follows:
Extracting Keyphrase 3 from Title
Extract the third keyphrase using the T-SQL script as follows (type the script in the same query window used by previous scripts to make sure referenced variables are recognized):
-- Creating a support table (variable) for the third keyphrase extraction
DECLARE @ArticleKeyphrase3Support TABLE
(
ArticleId INT ,
TITLE VARCHAR(100)
)
-- Populating support table
INSERT INTO @ArticleKeyphrase3Support
SELECT ArticleId,SUBSTRING(TITLE,Position+1,len(title))
FROM @ArticleKeyphrase2
order by ArticleId
-- Creating table (variable) for keyphrase3
DECLARE @ArticleKeyphrase3 TABLE
(
ArticleId INT ,
TITLE VARCHAR(100),
Position INT,
Keyphrase VARCHAR(30)
)
-- Inserting third keyphrase into the table using charindex()
insert into @ArticleKeyphrase3
SELECT ArticleId,TITLE,CHARINDEX(' ',Title,1) AS Position,SUBSTRING(TITLE,1,CHARINDEX(' ',Title,1)) as KeyPhrase
FROM @ArticleKeyphrase3Support
-- View third extracted keyphrase from the Title column
SELECT * FROM @ArticleKeyphrase3
The results are below:
Summarising Extracted Keyphrases
We now see all the (three) keyphrases extracted manually with the following T-SQL script. Again, write and execute it from the same query window used for previous scripts:
-- Declare text mining table variable
DECLARE @TextMining TABLE
(ArticleId INT,
Title VARCHAR(100),
Keyphrase1 VARCHAR(50),
Keyphrase2 VARCHAR(50),
Keyphrase3 VARCHAR(50)
)
-- Add id, title, and first keyphrase to the table
INSERT INTO @TextMining
SELECT ArticleId,TITLE,Keyphrase,NULL,NULL FROM @ArticleKeyphrase1
-- Add the second keyphrase to the table
UPDATE @TextMining
SET Keyphrase2=Keyphrase
FROM @ArticleKeyphrase2 key2
inner join @TextMining tm
on key2.ArticleId=tm.ArticleId
-- Add the third keyphrase to the table
UPDATE @TextMining
SET Keyphrase3=Keyphrase
FROM @ArticleKeyphrase3 key3
inner join @TextMining tm
on key3.ArticleId=tm.ArticleId
-- View all the manually extracted keyphrases to simulate text mining initial steps
SELECT ArticleId,Keyphrase1,Keyphrase2,Keyphrase3 FROM @TextMining
The output is below:
It is easy to understand that most articles are about Database(s). Also, a couple of pieces are about Testing:
Congratulations! You have successfully simulated the process of keyphrase identification used in Text Mining or Textual Analysis. It helps to understand better why it is so crucial in analysing texts of documents, especially in the context of Text Mining.
Things to do
Now that you are familiar with the basics of Text Mining and successfully simulated keyphrase extraction, try the following exercises to train the skills further:
- Extract the fourth keyphrase from the Title column using the script referenced in this article.
- Add the score for particular keyphrases so that the keyphrase occurring most frequently would get the highest score.
- Convert the script simulating keyphrase extraction into a stored procedure to reuse it.