Learn to Store and Analyze Documents on Windows File System With SQL Server Semantic Search – Part 2

Total: 3 Average: 5

The focus of the article is on comparing documents that can be stored on Windows File System in one respect and in the other respect their comparative analysis that can be performed with Semantic Search in SQL Server.

Additionally, the readers will learn how to store unstructured data by exploring File Table and creating MS Word documents on the fly (instantly) to be consumed by Semantic Search.

This part of the article is related to the use of Semantic Search on unstructured data for the extraction of basic level business-crucial information provided standard naming is in place.

Prerequisites

The readers need to go through Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1 to proceed further with it.

This article assumes the readers are familiar with most (if not all) of the following:

This article also assumes that following steps have already been completed.

  • FILESTREAM Enabled Database EmployeesFilestreamSample has been set up.
  • File Table has been set up.
  • MS Word document Asif Permanent Employee and Peter Permanent Employee has been stored in a windows folder managed through File Table.
  • Querying the database File Table shows the information about the stored documents.

Please refer to the article Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1 to complete any of the above missing steps, which are mandatory to cover the walkthrough completely.

Performing Name-Based Documents Comparative Analysis

As we are preparing to perform documents comparative analysis based on their standard naming, at this point it is worth doing a quick check by querying the FILESTREAM Enabled database we set up in the first part of the article.

Run the following script against EmployeesFilestreamSample database:

-- View stored documents managed by File Table to check
SELECT stream_id
      ,[name]
      ,file_type
      ,creation_time
FROM EmployeesFilestreamSample.dbo.EmployeesDocumentStore

The results must show the stored documents as follows:

Semantic Search Checklist

We have successfully set up a FILESTREAM enabled database and created and stored two sample MS Word documents on File System using File Table.

However, that does not automatically qualify the documents for the Semantic Search usage scenario.

If you were following the earlier Semantic Search and Full-Text Search articles referenced in this article, then you must be well aware of why we cannot yet apply Semantic Search.

Semantic Search can be enabled in one of the following ways:

  1. If you have already setup Full-Text Search, then you can enable Semantic Search with a single step.
  2. You can directly set up Semantic Search, but that means you also have to take the steps required to set up Full-Text Search.

Full-Text Search Test Before Semantic Search Setup

If a Full-Text query works, then we just need to enable Semantic Search, because this means we have met most of the requirements.

To check that, you can also run a Full-Text query against the desired table as follows:

-- Searching word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
  FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
  WHERE CONTAINS(name,'Employee')

The output is as follows:

This quick check shows us that we first need to fulfill the requirements of Full-Text Search and then enable Semantic Search.

Enabling Semantic Search for Use

We require at least the two following points before we can use Semantic Search to perform document comparative analysis:

  1. Unique Index.
  2. A Full-Text catalog.
  3. A Full-Text index.

Write the following T-SQL script to first create a unique index:

-- Create unique index required for Semantic Search
CREATE UNIQUE INDEX UQ_Stream_Id  
    ON EmployeesDocumentStore(stream_id)  
GO

Then create a Full-Text catalog based on the newly created unique index followed by creating a Full-Text index as shown below:

-- Getting Semantic Search ready to be used with File Table
CREATE FULLTEXT CATALOG EmployeesFileTableCatalog WITH ACCENT_SENSITIVITY = ON;
CREATE FULLTEXT INDEX ON EmployeesDocumentStore
(
name LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_type LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_stream TYPE COLUMN file_type LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX UQ_Stream_Id
ON EmployeesFileTableCatalog WITH CHANGE_TRACKING AUTO, STOPLIST=SYSTEM;

The results are as follows:

Full-Text Search Test After Semantic Search Setup

Let us now run the same Full-Text query to search for the word Employee in the stored documents:

-- Searching (after Semantic Search setup) word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
  FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
  WHERE CONTAINS(name,'Employee')

The output is as follows:

It is all fine for Full-Text queries to work against the File Table while we are getting it ready for Semantic Search. As mentioned earlier, Semantic Search is an add-on to Full-Text Search, that’s why you get an added benefit of using Full-Text queries.

Add More MS Word Documents

Now, go to the EmployeesDocumentStore File Table and click Explore FileTable Directory:

Create and store a new document called Sadaf Contract Employee as follows:

Next, add the following text to the newly-created document after ensuring the first line is the title of the document:

Sadaf Contract Employee (title)
Sadaf is a very efficient business analyst who does contact based work. She is fully capable of handling business requirements and turning them into technical specifications for the developers to work on. She is a very experienced business analyst.

Add another document called Mike Permanent Employee:

Update the document with the following text:

Mike Permanent Employee (Title of the document)
Mike is a fresh programmer whose expertise include web development. He is a quick learner and happy to work on any project. He has strong problem solving skills but he has less business knowledge. He requires assistance from other developers or business analysts to understand the problem and to meet the requirements.
He is good when he works on small projects but he struggles if he is given big or complex project.

We now have four documents stored on Windows File System managed by File Table, which are to be consumed by Semantic Search (including Full-Text Search).

Although we have just stored four MS Word documents in the folder as a sample, you can imagine the importance of using Semantic Search when hundreds of such documents are maintained by a SQL Server database, and you need to query those documents to find valuable information.

Please note that the standard naming of the documents matters a lot for the successful implementation of this approach keeping in mind the scope (title based Semantic Search) of this article.

Simple Counting of Documents

We can now compare these documents to find out the differences and similarities between them based on their standard naming using Semantic Search.

For example, a simple query can tell us the total number of documents stored in Windows Folder:

-- Getting total number of stored documents
SELECT COUNT(*) AS Total_Documents FROM EmployeesDocumentStore

Permanent vs Contract-Based Employees Comparison

This time we are using Semantic Search to compare how many permanent and contract-based employees we have got in our sample organization:

-- Creating a summary table variable
DECLARE @Documents TABLE
(DocumentType VARCHAR(100),
DocumentsCount INT)

INSERT INTO @Documents -- Storing total number of stored documents into summary table
SELECT 'Total Documents',COUNT(*) AS Total_Documents FROM EmployeesDocumentStore

INSERT INTO @Documents -- Storing total number of permananent employees documents stored into summary table
SELECT 'Total Permanent Employees',COUNT(*) 
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Permanent' 

INSERT INTO @Documents --Storing total number of permananent employees documents stored
SELECT 'Total Contract Employees',COUNT(*) 
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Contract'

SELECT DocumentType,DocumentsCount FROM @Documents

The output is as follows:

Documents Keyphrase Score

Please remember the relative score is a very solid criterion to find out the relevance of key phrases in your documents, but it may get complicated as well.

Let us run a simple (document name-based) Semantic Search query to view the keyphrase and its relative score for each of the documents:

-- Getting keyphrase and relative score for all the documents
SELECT * FROM semantickeyphrasetable(EmployeesDocumentStore, NAME) 
ORDER BY score

The output is as follows:

Add More Details to the Names of Documents

Let us add a little more details to the names of the documents by renaming the documents as follows:

  1. Asif Permanent Employee – Experienced Project Manager
  2. Mike Permanent Employee – Fresh Programmer
  3. Peter Permanent Employee – Fresh Project Manager
  4. Sadaf Contract Employee – Experienced Business Analyst

Finding Fresh Employees (Documents)

Let us now find the documents related to the fresh employees based on their (standard naming of ) titles:

-- Getting document name-based scoring to find fresh employees for a new project
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName
,keyphrase,score  FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)  where keyphrase='fresh'
order by DocumentName desc

The results are as follows:

Finding Experienced Employees (Documents)

Now, suppose we want to quickly review all the experienced employees’ details for quite a complex project ahead.

You can find this out using the following Semantic Search query:

-- Getting document name-based scoring to find all experienced employees
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score  FROM semantickeyphrasetable(EmployeesDocumentStore, NAME) 
where keyphrase='experienced' order by DocumentName

The output is as follows:

Finding all Project Managers (Documents)

Finally, if we would like to quickly go through the documents for all the project managers, then the following Semantic Search query needs to be issued:

-- Getting document name based scoring to find all project managers
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score  FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
where keyphrase='Project'

The results are shown below:

Limitation of Semantic Search with Default File Table in SQL Server 2016

As we used the default option of creating File Table, which is easy to create and to manage documents on the fly, there is a catch to this (unless you use some complex workarounds, which is beyond the scope of this article). More precisely, Semantic Search can be applied to the titles of the documents, but applying it within the documents requires more effort.

Nevertheless, this approach is an excellent starting point for storing and finding related documents using Semantic Search as it allows us to enjoy the benefits of both Windows File System and SQL Server databases. However, keep in mind that the examples mentioned in this article predominantly depend on your sensible naming strategy, which should be strictly followed.

Congratulations! You can easily use File Table to store your documents on Windows File System and apply Full-Text Search. Apart from that, with some limitations (mentioned), you can perform document comparative analysis based on the titles of documents using Semantic Search.

Stay tuned as there is more to come in the following article(s). We are going to learn advanced uses of Semantic Search by utilizing the full potential of File Table so that we can search within the documents and explore many more possibilities.

Things to do

After implementing the walkthrough in this article, you can successfully store unstructured data such as MS Word documents in a Windows Folder using File Table. Also, you know how to apply both Full-Text Search and Semantic Search (for name-based queries):

  • Try adding a plain text document called Adil Contract Employee – Fresh Business Analyst.txt and apply Semantic Search to query the documents for fresh candidates by using the keyphrase fresh with the help of the code given in this article. The output should be the following:
  • Check if you can successfully remove the newly stored text file from the File Table after issuing this command DELETE FROM [dbo]. EmployeesDocumentStore] WHERE file_type=’txt’
  • Try adding more text documents and try different key phrases keeping the titles of the text documents in mind, and see if you can explore further.
Haroon Ashraf

Haroon Ashraf

Haroon's deep interest in logic and reasoning at an early age of his academic career paved his path to become a data professional. He began his professional life as a computer programmer more than 10 years ago working on his first data venture to migrate and rewrite a public sector database driven examination system from IBM AS400 (DB2) to SQL Server 2000 using VB 6.0 and Classic ASP along with developing reports and archiving many years of data. His work and interest revolves around Database-Centric Architectures and his expertise include database and reports design, development, testing, implementation and migration along with Database Life Cycle Management (DLM).