Written by 11:40 Database administration, Search

Storing and Analyzing Documents on Windows File System with SQL Server Semantic Search – Part 2

This is the second part of the material dedicated to SQL Server Semantic Search. In the previous article, we explored the basics. Now, we are going to focus on comparing documents stored on Windows File System and the comparative analysis with the Semantic Search in SQL Server.

Performing Name-Based Documents Comparative Analysis

We are going to perform a comparative analysis of documents based on their standard naming. At this point, let’s do a quick check by querying the EmployeesFilestreamSampledatabase we set up earlier:

-- View stored documents managed by File Table to check
SELECT stream_id
      ,[name]
      ,file_type
      ,creation_time
FROM EmployeesFilestreamSample.dbo.EmployeesDocumentStore

The results must show us stored documents:

Semantic Search Checklist

We already have the database and two sample MS Word documents on File System using File Table (you might refer to Part 1 to refresh the knowledge if needed). However, that does not automatically qualify our documents for the Semantic Search scenario.

Semantic Search can be enabled in one of the following ways:

  1. If you have already set up the Full-Text Search, you can enable Semantic Search in a single step.
  2. You can set up the Semantic Search directly, but before you have to set up the Full-Text Search too.

Full-Text Search Test Before the Semantic Search Setup

If a Full-Text query works, we only need to enable Semantic Search. To check that, run a Full-Text query against the desired table:

-- Searching word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
  FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
  WHERE CONTAINS(name,'Employee')

The output:

Thus, we need first to fulfill the Full-Text Search requirements and then enable Semantic Search.

Enabling Semantic Search for Use

At least two of the following points are necessary for using Semantic Search:

  • Unique Index
  • A Full-Text catalog
  • A Full-Text index

Execute the following T-SQL script to create a unique index:

-- Create unique index required for Semantic Search
CREATE UNIQUE INDEX UQ_Stream_Id  
    ON EmployeesDocumentStore(stream_id)  
GO

Create a Full-Text catalog based on the newly-created unique index. And then, create a Full-Text index as shown below:

-- Getting Semantic Search ready to be used with File Table
CREATE FULLTEXT CATALOG EmployeesFileTableCatalog WITH ACCENT_SENSITIVITY = ON;
CREATE FULLTEXT INDEX ON EmployeesDocumentStore
(
name LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_type LANGUAGE 1033 STATISTICAL_SEMANTICS,
file_stream TYPE COLUMN file_type LANGUAGE 1033 STATISTICAL_SEMANTICS
)
KEY INDEX UQ_Stream_Id
ON EmployeesFileTableCatalog WITH CHANGE_TRACKING AUTO, STOPLIST=SYSTEM;

The results:

Full-Text Search Test After Semantic Search Setup

Let us run the same Full-Text query to search for the word Employee in the stored documents:

-- Searching (after Semantic Search setup) word Employee using Full-Text search against EmployeesDocumentStore File Table
SELECT [name]
  FROM [EmployeesFilestreamSample].[dbo].[EmployeesDocumentStore]
  WHERE CONTAINS(name,'Employee')

The output:

It is fine for Full-Text queries to work against the File Table while we are getting it ready for Semantic Search.

Add More MS Word Documents

We go to the EmployeesDocumentStore File Table and click Explore FileTable Directory:

Create and store a new document called Sadaf Contract Employee:

Next, add the following text to the newly-created document. The first line must be the title of the document!

Sadaf Contract Employee (title)

Sadaf is a very efficient business analyst who does contact-based work. She is fully capable of handling business requirements and turning them into technical specifications for the developers to work on. She is a very experienced business analyst.

Add another document called Mike Permanent Employee:

Update the document with the following text:

Mike Permanent Employee (Title of the document)

Mike is a fresh programmer whose expertise includes web development. He is a quick learner and happy to work on any project. He has strong problem-solving skills but he has less business knowledge. He requires assistance from other developers or business analysts to understand the problem and to meet the requirements.

He is good when he works on small projects but he struggles if he is given a big or complex project.

We have four documents stored on the Windows File System managed by File Table. These documents should be consumed by Semantic Search (including Full-Text Search).

Important: Although we have just stored four MS Word documents in the folder as a sample, you can imagine the importance of using Semantic Search when hundreds of such documents are maintained by a SQL Server database, and you need to query those documents to find valuable information.

The standard naming of documents matters a lot for the successful implementation of this approach.

Simple Counting of Documents

We can compare these documents and define the differences and similarities based on their standard naming using Semantic Search. For example, a simple query can tell us the total number of documents stored in the Windows Folder:

-- Getting total number of stored documents
SELECT COUNT(*) AS Total_Documents FROM EmployeesDocumentStore

Permanent vs Contract-Based Employees Comparison

This time, we are using Semantic Search to compare the number of permanent and contract-based employees in our organization:

-- Creating a summary table variable
DECLARE @Documents TABLE
(DocumentType VARCHAR(100),
DocumentsCount INT)

INSERT INTO @Documents -- Storing total number of stored documents into summary table
SELECT 'Total Documents',COUNT(*) AS Total_Documents FROM EmployeesDocumentStore

INSERT INTO @Documents -- Storing total number of permanent employees documents stored into summary table
SELECT 'Total Permanent Employees',COUNT(*) 
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Permanent' 

INSERT INTO @Documents --Storing total number of permanent employees documents stored
SELECT 'Total Contract Employees',COUNT(*) 
FROM semantickeyphrasetable (EmployeesDocumentStore, *)
WHERE keyphrase = 'Contract'

SELECT DocumentType,DocumentsCount FROM @Documents

The output:

Let us run a simple (document name-based) Semantic Search query to view the keyphrase and its relative score for each document:

-- Getting keyphrase and relative score for all the documents
SELECT * FROM semantickeyphrasetable(EmployeesDocumentStore, NAME) 
ORDER BY score

The output:

Let us add more details to the document names. We’ll rename them as follows:

  • Asif Permanent Employee – Experienced Project Manager
  • Mike Permanent Employee – Fresh Programmer
  • Peter Permanent Employee – Fresh Project Manager
  • Sadaf Contract Employee – Experienced Business Analyst

Finding Fresh Employees (Documents)

Find the documents related to fresh employees based on their titles (standard naming):

-- Getting document name-based scoring to find fresh employees for a new project
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName
,keyphrase,score  FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)  where keyphrase='fresh'
order by DocumentName desc

The results:

Finding Experienced Employees (Documents)

Suppose we want to quickly review all details of experienced employees for the complex project ahead. Use the following Semantic Search query:

-- Getting document name-based scoring to find all experienced employees
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score  FROM semantickeyphrasetable(EmployeesDocumentStore, NAME) 
where keyphrase='experienced' order by DocumentName

The output:

Finding all Project Managers (Documents)

Finally, if we want to quickly go through the documents for all project managers, we need the following Semantic Search query:

-- Getting document name-based scoring to find all project managers
SELECT (SELECT name from EmployeesDocumentStore where path_locator=document_key) as DocumentName ,keyphrase,score  FROM semantickeyphrasetable(EmployeesDocumentStore, NAME)
where keyphrase='Project'

The results:

After implementing the walkthrough, you can successfully store unstructured data, such as MS Word documents, in a Windows Folder using File Table.

Name-Based Analysis Review

So far, we have learned how to perform a name-based analysis of documents stored in a File Table using Semantic Search. However, we need the following conditions fulfilled:

  • Standard naming should be in place.
  • Names should provide the information required for analysis.

These conditions are also limitations of the name-based analysis. But this does not mean we cannot do much with it.

Our focus remains on the name/column-based Semantic Search approach.

View the Name Columns of the Documents

Let us view some of the main columns of the Documents table including the Name column:

USE EmployeesFilestreamSample

-- View name column with the file types of the stored documents in File Table for analysis
SELECT name,file_type 
FROM dbo.EmployeesDocumentStore

The output:

Understanding the SEMANTICKEYPHRASETABLE Function

SQL Server offers the SEMANTICKEYPHRASETABLE function to analyze the document with Semantic Search. The syntax is as follows:

SEMANTICKEYPHRASETABLE  
    (  
    table,  
    { column | (column_list) | * }  
     [ , source_key ]  
    )

This function gives us key phrases associated with the document. We can use them to analyze documents based on their names or content. In our case, we need to not only use this function but also understand how to utilize it properly.

The function requires the following data:

  • Name of the File Table to be used for Semantic Search analysis.
  • Name of the column to be used for Semantic Search analysis.

Then it returns the following data:

  • Column_id – the column number
  • Document_Key – the default primary key for the File Table document
  • Keyphrase – is a phrase that Semantic Search decides to index for analysis. It applies to both name and content of the document depending on which column we want to see the key phrases for
  • Score – determines the strength of a key phrase associated with a document, such as how a document is best recognized by its key phrase. The score can be between 0.0 to 1.0.

Analyzing All Documents Using the SEMANTICKEYPHRASETABLE Function

We use the SEMANTICKEYPHRASETABLE function for the name-based analysis of the documents stored in the Windows folder managed by the File Table.

Execute the following T-SQL script:

USE EmployeesFilestreamSample
-- View key phrases and their score for the name column
SELECT * FROM SEMANTICKEYPHRASETABLE(EmployeesDocumentStore,name)
order by score desc

The output:

We have a list of all key phrases attached to all documents and their scores. The column_id 3 in the top row is the name column. Plus, we also called the function by supplying this column (name):

You can find the document_key:0xFD89E1811D4F3B2FEB1012DF0C8016F9ACEB2F3260 running the following script (although it is clear that this document is the one where the name contains the key phrase sadaf):

USE EmployeesFilestreamSample
-- Finding document name by its key (path_locator)
SELECT name,path_locator FROM dbo.EmployeesDocumentStore
WHERE path_locator=0xFD89E1811D4F3B2FEB1012DF0C8016F9ACEB2F3260

The output:

The key phrase sadaf has been given the best score: 1.0.

Thus, in the case of the standard document naming with sufficient information for the Semantic Search analysis, our key phrase sadaf is the best match for that particular document name.

Analyzing Specific Document Using the SEMANTICKEYPHRASETABLE Function

We can narrow our Semantic Search analysis based on the name column. For example, we only need to view the name column-based key phrases of a particular document. We can specify the document key in the SEMANTICKEYPHRASETABLE Function.

First, we identify the document key for that document where we want to see all key phrases. Run the following T-SQL script:

-- Find document_key of the document where the name contains Peter
SELECT name,path_locator as document_key From EmployeesDocumentStore
WHERE name like '%Peter%'

The document key is 0xFF6A92952500812FF013376870181CFA6D7C070220

Now, let us view this document concerning all key phrases that can define the document name:

-- View all the key phrases and their score for a document related to Peter permanent employee
SELECT column_id,name,keyphrase,score FROM SEMANTICKEYPHRASETABLE(EmployeesDocumentStore,name,0xFF6A92952500812FF013376870181CFA6D7C070220)
INNER JOIN dbo.EmployeesDocumentStore on path_locator=document_key
order by score desc

The results:

The key phrase employee gets the highest score in this document. We can see that all words of the column are key phrases that determine the document’s meaning.

Understanding the SEMANTICSIMILARITYTABLE Function

This function helps us compare one document with all other documents based on key phrases. The syntax of this function is as follows:

SEMANTICSIMILARITYTABLE  
    (  
    table,  
    { column | (column_list) | * },  
    source_key  
    )

It requires the name of the table, the column, and the document key to match other documents. For instance, we can state that two documents are similar if they have a good keyphrase matching score.

Comparing Documents Using the SEMANTICSIMILARITYTABLE Function

Let’s compare a document to other documents using the SEMANTICSIMILARITYTABLE Function.

Comparing All Project Managers’ Documents

We need to see all documents related to project managers. From the above examples, we know that the document key for the specified document is 0xFF6A92952500812FF013376870181CFA6D7C070220. Hence, we can use this key to find other matches including project managers:

USE EmployeesFilestreamSample

-- View all the documents closely related to Peter project manager
SELECT SST.source_column_id,SST.matched_column_id,EDS.name,SCORE FROM SEMANTICSIMILARITYTABLE(EmployeesDocumentStore,name,0xFF6A92952500812FF013376870181CFA6D7C070220) SST
INNER JOIN dbo.EmployeesDocumentStore EDS on EDS.path_locator=SST.matched_document_key
order by score desc

The output:

The most closely related document is Asif Permanent Employee – Experienced Project Manager.docx. It makes sense since both employees are permanent workers, and both are project managers.

Comparing Experienced Business Analyst’s Documents

Now, we are going to compare the documents related to experienced business analysts and find the closest match using Semantic Search. We are limited to the document name-based analysis:

USE EmployeesFilestreamSample

-- Finding document_key for experienced business analyst
select name,path_locator as document_key from EmployeesDocumentStore
where name like '%experienced business analyst%'

-- View all the documents closely related to experienced business analyst
SELECT SST.source_column_id,SST.matched_column_id,EDS.name,SCORE FROM SEMANTICSIMILARITYTABLE(EmployeesDocumentStore,name,0xFD89E1811D4F3B2FEB1012DF0C8016F9ACEB2F3260) SST
INNER JOIN dbo.EmployeesDocumentStore EDS on EDS.path_locator=SST.matched_document_key
order by score desc

The output:

As we can see from the results above, the closest match for the document related to the experienced business analyst is the document of the experienced project manager because they are both experienced. Nevertheless, the score of 0.3 indicates that there is not much in common between these two documents.

Conclusion

Congratulations! We have successfully learned how to store documents in Windows folders and analyze them utilizing Semantic Search. We also explored the functions to use in practice. Now you can apply the new knowledge and try the following exercises to

Stay tuned for further materials!

Tags: , , , Last modified: October 13, 2021
Close