Learn to Store and Analyze Documents on Windows File System with SQL Server Semantic Search – Part 3

Total: 2 Average: 5

Being the final part of the article, it is going to take you to the next level of analyzing word documents stored in Windows folders, managed by File Table, and consumed by Semantic Search.

Additionally, the readers are going to gain more understanding of Semantic Search and how to make it work with MS Word documents for analysis.

This article provides a name-based analysis of the documents with equal attention to both theory and practice.

Prerequisites

This article assumes the following:

  1. The readers are familiar with the following concepts:
    1. File Table(s).
    2. Semantic Search.
    3. Full-Text Search.
  2. All the prerequisites and FILESTREAM Enabled EmployeesFilestreamSample database have been set up.
  3. EmployeesDocumentStore File Table has been set up
  4. The following documents have been added to the File Table:
    1. Asif Permanent Employee – Experienced Project Manager.
    2. Mike Permanent Employee – Fresh Programmer.
    3. Peter Permanent Employee – Fresh Project Manager.
    4. Sadaf Contract Employee – Experienced Business Analyst.
  5. The readers can perform some simple Semantic Search operations against the stored documents.

I suggest that you go through the following articles to fill any gaps in your understanding of the above prerequisites and to implement the walkthrough(s) in the final part of the article:

  1. Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 1.
  2. Learn to Store and Analyse Documents on Windows File System with SQL Server Semantic Search Part 2.

Part 1 Review

In the first part of the series we learned about the following things:

  1. Setting up a FILESTREAM database.
  2. Setting up a File Table.
  3. Creating MS Word Documents in Windows Folder.
  4. Saving MS Word Documents in a File Table.
  5. Viewing MS Word Document by running a T-SQL script.

Part 2 Review

In the second part of the article we implemented the following things:

  1. Enabling Semantic Search by defining a Full-Text index, a Catalog, and a unique index.
  2. Running Full-Text queries against the stored documents.
  3. Adding more MS Word Documents.
  4. Counting of Documents.
  5. Comparing Permanent vs Contract-based employees.
  6. Adding more details to the names of the documents.
  7. Finding fresh employees’ documents using the name (column) based Semantic Search.
  8. Finding experienced employees’ documents using the name (column) based Semantic Search.
  9. Finding all Project Manager’s documents using the name (column) based Semantic Search.

Finally, we learned some limitations of Semantic Search, which we may not overcome in this article, but we are going to improve our approach of how we use Semantic Search against the stored documents.

Please refer to the previous parts of the article if you lack any of the knowledge or skills mentioned above.

Name-Based Analysis Review

So far, we have learned how to perform a name-based analysis of the documents stored in a File Table using Semantic Search.

However, to perform such an analysis, you should meet the following conditions:

  1. Standard naming should be in place.
  2. Names should be detailed enough to provide the information required for analysis.

The above conditions are also the limitations of a name-based analysis, but this does not mean we cannot do much with this kind of analysis.

Our focus remains on the name (column) based Semantic Search approach to make it simple for understanding. So let’s delve into it a bit more to gain a solid understanding of how it works and how to use it.

View Name Column of the Documents

Let us view some of the main columns of the documents table including the Name column by running the following T-SQL script:

USE EmployeesFilestreamSample

-- View name column with file type of the stored documents in File Table for analysis
SELECT name,file_type 
FROM dbo.EmployeesDocumentStore

The output is as follows:

Understanding SEMANTICKEYPHRASETABLE Function

SQL Server offers the SEMANTICKEYPHRASETABLE function to perform a document analysis with Semantic Search.

According to Microsoft documentation, the syntax of this function is as follows:

SEMANTICKEYPHRASETABLE  
    (  
    table,  
    { column | (column_list) | * }  
     [ , source_key ]  
    )

This function gives us the key phrases associated with the document that can be used to analyze documents based on their name or content.

Since we are focused on the name (column) based analysis of the stored documents, we need to not only use this function but also understand how to utilize it properly in real-time scenarios.

Without going into much detail, let us briefly specify the required information for this function:

  1. Name of the File Table to be used for Semantic Search analysis.
  2. Name of the column to be used for Semantic Search analysis.

This function then returns the following data:

  1. Column_id.
  2. Document_Key.
  3. Keyphrase.
  4. Score.

About Column_id and document_key

The column_id is the column number and the document_key is the default primary key for the File Table document.

About Key Phrase

The key phrase is the key in any Semantic Search analysis that is based on what distinguishes one document from another.

It determines the meaning of the document or, in other words, it is a phrase that Semantic Search decides to index for analysis.

For example, a document about cars may contain several instances of the word speed, and this may well become a key phrase of that document.

This applies to both name and content of the document depending on which column we want to see the key phrases for.

About Score

The score helps us determine the strength of a key phrase associated with a document, such as how a document is best recognized by its key phrase.

The range of score is between 0.0 to 1.0 where 1.0 is the highest score.

Analyzing All Documents Using SEMANTICKEYPHRASETABLE Function

Let us use the SEMANTICKEYPHRASETABLE function for a name-based analysis of the documents stored in the windows folder managed by File Table.

Type the following T-SQL script:

USE EmployeesFilestreamSample
-- View key phrases and their score for the name column
SELECT * FROM SEMANTICKEYPHRASETABLE(EmployeesDocumentStore,name)
order by score desc

The output is as follows:

The returned results contain a list of all the key phrases attached to all the documents along with their scores.

The column_id 3 in the top row is clearly the name column, plus we also called the function by supplying this column (name):

You can find the document_key: 0xFD89E1811D4F3B2FEB1012DF0C8016F9ACEB2F3260 running the following script (although it is clear that this document is the one whose name contains the key phrase sadaf):

USE EmployeesFilestreamSample
-- Finding document name by its key (path_locator)
SELECT name,path_locator FROM dbo.EmployeesDocumentStore
WHERE path_locator=0xFD89E1811D4F3B2FEB1012DF0C8016F9ACEB2F3260

The output is shown below:

The key phrase sadaf has been given the best score, which is 1.0.

This means, if a standard naming of the documents takes place where the name of the document provides sufficient information for the document to become part of the Semantic Search analysis, then, the key phrase sadaf is the best match for that particular document name.

Let me demonstrate a few more examples to clarify this.

Analyzing Specific Document Using SEMANTICKEYPHRASETABLE Function

We can also narrow down our Semantic Search analysis based on the name column for a particular document.

For example, if we only need to view all the name column based key phrases of a particular document, we can specify the document key in the SEMANTICKEYPHRASETABLE Function.

First, we need to identify the document key for the document where we want to see all the key phrases.

You can run the following T-SQL script to get the document_key:

-- Find document_key of the document where the name contains Peter
SELECT name,path_locator as document_key From EmployeesDocumentStore
WHERE name like '%Peter%'

The output is as follows:

The document key is 0xFF6A92952500812FF013376870181CFA6D7C070220.

Now, let us view this document in terms of all the key phrases that can define the name of the document (as we are applying a name-based analysis):

-- View all the key phrases and their score for a document related to Peter permanent employee
SELECT column_id,name,keyphrase,score FROM SEMANTICKEYPHRASETABLE(EmployeesDocumentStore,name,0xFF6A92952500812FF013376870181CFA6D7C070220)
INNER JOIN dbo.EmployeesDocumentStore on path_locator=document_key
order by score desc

The results are as follows:

The key phrase employee gets the highest score in this document. We can clearly see that all the words of the column ultimately become key phrases, and then, these key phrases determine the meaning of the document.

The purpose of key phrases will get much clearer as we move on to the next important function, which can actually assist us in comparing a document with other documents.

Understanding SEMANTICSIMILARITYTABLE Function

This function helps us compare one document with all the other documents based on their key phrases.

According to Microsoft documentation, the syntax of this function is as follows:

SEMANTICSIMILARITYTABLE  
    (  
    table,  
    { column | (column_list) | * },  
    source_key  
    )

This function at least requires the name of the table, the column, and the document key to match other documents.

For instance, we can state that two documents are similar if they have a good keyphrase matching score.

Comparing Documents Using SEMANTICSIMILARITYTABLE Function

Now, we can compare a document with all the other documents using the SEMANTICSIMILARITYTABLE Function.

Comparing All Project Managers’ Documents

Let’s say we need to see all the documents related to project managers using Semantic Search.

From the above examples, we know that the document key for the specified document is 0xFF6A92952500812FF013376870181CFA6D7C070220, so we can use this key to find many other matches including project managers:

USE EmployeesFilestreamSample

-- View all the documents closely related to Peter project manager
SELECT SST.source_column_id,SST.matched_column_id,EDS.name,SCORE FROM SEMANTICSIMILARITYTABLE(EmployeesDocumentStore,name,0xFF6A92952500812FF013376870181CFA6D7C070220) SST
INNER JOIN dbo.EmployeesDocumentStore EDS on EDS.path_locator=SST.matched_document_key
order by score desc

The output is as follows:

The document that turned out to be most closely related is Asif Permanent Employee – Experienced Project Manager.docx, which makes sense since both employees have the following common features:

  1. Both are permanent employees.
  2. Both are project managers.

Comparing Experienced Business Analyst’s Documents

Now, we are going to compare the document related to an experienced business analyst to find the closest match for this document using Semantic Search provided we are limited to document name based analysis:

USE EmployeesFilestreamSample

-- Finding document_key for experienced business analyst
select name,path_locator as document_key from EmployeesDocumentStore
where name like '%experienced business analyst%'

-- View all the documents closely related to experienced business analyst
SELECT SST.source_column_id,SST.matched_column_id,EDS.name,SCORE FROM SEMANTICSIMILARITYTABLE(EmployeesDocumentStore,name,0xFD89E1811D4F3B2FEB1012DF0C8016F9ACEB2F3260) SST
INNER JOIN dbo.EmployeesDocumentStore EDS on EDS.path_locator=SST.matched_document_key
order by score desc

The output is as follows:

We can see from the results above that the closest match for the document related to an experienced business analyst is the document of the experienced project manager because they are both experienced. Nevertheless, if you look at the score, which is 0.3, it indicates that there is not much in common between the two documents (apart from the fact that they are both experienced).

Similarly, there are many more possibilities and applications for analyzing the documents stored on the Windows file system, and you can search them with Semantic Search as we have seen in the above examples.

Congratulations, you have successfully learned how to store documents in windows folders and analyze them by means of Semantic Search provided you comply with some standard naming of your documents. As well as that, you have come to know some additional information helpful in the analysis.

Things to do

I recommend you to try the following things to improve your skill and incorporate your knowledge:

  1. Try adding another MS Word document called Adil Contract Employee – Fresh Business Analyst.docx and use the SEMANTICKEYPHRASETABLE function to view its keyphrases alongside the score.
  2. Try the SEMANTICSIMILARITYTABLE function to find documents similar to the newly added document above using its document_key.
Haroon Ashraf

Haroon Ashraf

Haroon's deep interest in logic and reasoning at an early age of his academic career paved his path to become a data professional. He began his professional life as a computer programmer more than 10 years ago working on his first data venture to migrate and rewrite a public sector database driven examination system from IBM AS400 (DB2) to SQL Server 2000 using VB 6.0 and Classic ASP along with developing reports and archiving many years of data. His work and interest revolves around Database-Centric Architectures and his expertise include database and reports design, development, testing, implementation and migration along with Database Life Cycle Management (DLM).