Creating and managing stoplists serves as one of the main components responsible for improving the Full-Text Search performance alongside reducing the size of the index. This article aims to help you master the job in-depth and implement multiple strategies to create stoplists on simple but interesting examples.
We’ll also highlight the importance of different methods to generate stoplists and clarify how to choose the most suitable method.
Why Mastering Stoplists
When we talk about using stoplists concerning the Full-Text Search, the first question is why we learn so much about these stoplists. The answer lies in both the evident and hidden benefits of using stoplists with Full-Text Search. There are also long-term gains they may bring once implemented successfully.
Importance of Stoplists
A stoplist is a language-specific component of Full-Text Search containing user-defined or system-provided stopwords. It has to exclude such words from becoming a part of Full-Text Search.
A Full-Text Search design without a stoplist is not the optimum use of language-specific components that should improve Full-Text Search efficiency and response time.
Prerequisites
- You can write and run T-SQL scripts.
- Full-Text Search must be installed into your SQL Server instance.
- You are familiar with basic Full-Text Search concepts and implementation.
- You have basic knowledge of stopwords and stoplist.
In case you lack the information to ensure the above requirements, refer to the below articles:
- Implementing Full-Text Search in SQL Server 2016 for beginners
- Implementing Full-Text Search in SQL Server 2016 for Advanced Users
- How to use Stopwords and Stoplist to improve SQL Server Full-Text Search (FTS)
Multiple Strategies of the Stoplist Creation
There are many different methods or strategies, but some of them are far better than others. Any database developer with the Full-Text Search implementation skills should be familiar with all the methods to pick the best one when required.
The most effective way to understand these different strategies is to apply them against a sample database.
Setup Sample Database
The first step is to set up the database to run Full-Text queries. Create a sample database called WatchReviewsMasterStoplist:
-- Create WatchReviewsMasterStoplist database
CREATE DATABASE WatchReviewsMasterStoplist;
GO
-- Connect to the sample database
USE WatchReviewsMasterStoplist
-- (2) Create WatchReview table
CREATE TABLE [dbo].[WatchReview]
(
[ReviewId] INT NOT NULL IDENTITY ,
[Date] DATETIME2 NULL,
[Person] VARCHAR(50) NULL,
[Details] VARCHAR(1000) NULL,
CONSTRAINT [PK_WatchReview] PRIMARY KEY (ReviewId)
)
-- (3) Populate WatchReview table
SET IDENTITY_INSERT [dbo].[WatchReview] ON
INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (1, N'2020-06-01 00:00:00', N'Asif', N'Hi, I have just bought this Casio black digital watch which is excellent and has date, alarm, stopwatch and timer as well.')
INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (2, N'2020-07-02 00:00:00', N'Asim', N'Hi, I have just bought this Casio black analog watch which is average and slightly discomforting and just got date and time.')
INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (3, N'2020-08-03 00:00:00', N'Martin', N'Hi, I have just purchased this Casio black digital watch which is excellent and has features like an alarm, stopwatch, date, and timer. ')
INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (4, N'2020-09-10 00:00:00', N'Brian ', N'Hi, I have just ordered this Seiko black digital watch which is excellent and has date, alarm and timer.')
INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (5, N'2020-10-11 00:00:00', N'Peter', N'Hi, I have just ordered this Seiko white analog watch which is average and has alarm, date and timer.')
SET IDENTITY_INSERT [dbo].[WatchReview] OFF
Multiple Stoplist Strategies Overview
You can use the following three ways/strategies to create a stoplist and prevent noise words from becoming a part of your Full-Text Search queries:
- Create a custom stoplist
- Create a system stoplist
- Create a stoplist from a custom (or system) stoplist
Stoplist Prerequisites
Remember that before you create a stoplist, you have to do the following:
- Create a unique index on one of the table columns if there is no key column available (Primary Key).
- Create a Full-Text Catalog.
- Create a Full-Text Index.
Create a Custom Stoplist
Assume that a unique index or primary key column is available (as in our sample database). We create a custom stoplist as follows:
- Create a Full-Text Catalog.
- Create an empty custom stoplist.
- Create a Full-Text Index with the custom stoplist created in step 2.
For detailed information, refer to the article How to use Stopwords and Stoplist to improve SQL Server Full-Text Search (FTS)
Pros and Cons of Custom Stoplists
The biggest advantage of creating a custom stoplist is that you can have full control over it and keep track of it by adding and removing words that you think are noise words.
There is a downside to using this approach, though. The stoplist is very limited, and it won’t include the system recommended stopwords to improve the Full-Text queries performance further.
Creating a System Stoplist
Create a system stoplist (if the primary key column or unique index is provided) as follows:
- Create a Full-Text Catalog.
- Create a Full-Text Index with the system stoplist.
We prepare to use the system stoplist by creating a Full-Text Catalog first:
-- Create Full-Text catalog
CREATE FULLTEXT CATALOG [WatchReviewCatalog] AS DEFAULT;
GO
Create a Full-Text Index on the reviews column (Details) of the table WatchReview with the system stoplist. Use the following T-SQL script:
-- Create Full-Text index with System Stoplist
CREATE FULLTEXT INDEX ON dbo.WatchReview(Details LANGUAGE [British English])
KEY INDEX PK_WatchReview
WITH STOPLIST = System;
GO
View the English Language System Stoplist
You can view the list of stopwords included in the system stoplist of British English. Run the following SQL query:
-- View system stoplist of English language
SELECT name,stopword FROM sys.fulltext_system_stopwords sw
INNER JOIN sys.fulltext_languages fl
on sw.language_id=fl.lcid
where name='British English'
The output is as follows:
View the Top 3 Most Recent Reviews
Run a quick check against the newly created sample database. Do it by getting the top 3 most recent watch reviews from the WatchReview table:
-- Top 3 most recent watch reviews
SELECT TOP 3 wr.Person,FORMAT(wr.Date,'dd-MMM-yyyy') as ReviewDate,wr.Details FROM dbo.WatchReview wr
ORDER BY wr.Date DESC
The output is as follows:
Checking Stopwords by running Full-Text queries
We can run Full-Text queries against the nominated table. Don’t be surprised if many noise words (included in the system stoplist to improve the performance) don’t show up in the Full-Text queries. It means that our system stoplist is doing its job correctly.
As we can see, the word ‘this’ is present in the top 3 reviews. However, it a noise word itself recognized by the system stoplist.
Let us check if a Full-Text Search returns the noise word ‘this’ included in the system stoplist:
--Run Full-Text query to Search for Noise word 'this'
SELECT * FROM dbo.WatchReview
WHERE CONTAINS(Details,'this')
The result set is below:
Now we can run the Full-Text query to search for the word ‘is’ which is another noise word:
--Run Full-Text query to Search for Noise word 'is'
SELECT * FROM dbo.WatchReview
WHERE CONTAINS(Details,'is')
The query output is below:
We can see that the word ‘is’ has also been successfully prevented from becoming a part of the Full-Text query result by saving valuable resources.
Full-Text query to search for the word ‘Hi’
We run a Full-Text query to search for the word ‘Hi.’ It is not a part of the system stoplist. Thus, the query must return it.
--Search Noise word 'Hi'
SELECT * FROM dbo.WatchReview
WHERE CONTAINS(Details,'Hi')
The output is as follows:
We got all the records containing the word ‘Hi.’ However, we would like to exclude it, since it is a noise word. If we use Full-Text Search in our textual analysis, this word may distract us from the objective.
This leads us to the third method: creating a stoplist from a custom or system stoplist.
Creating a Custom Stoplist from the System Stoplist
Create a custom stoplist called WatchMasterStoplist from the system stoplist with the following T-SQL script:
--Creating a custom stoplist WatchMasterStoplist from system stoplist
CREATE FULLTEXT STOPLIST [WatchMasterStoplist] FROM SYSTEM STOPLIST;
GO
Associating a new Custom Stoplist with Full-Text index
Associate the newly created custom stoplist with the Full-Text index as follows:
USE [WatchReviewsMasterStoplist]
GO
-- Associating new custom stoplist with Full-Text index
ALTER FULLTEXT INDEX ON [dbo].[WatchReview] SET STOPLIST = [WatchMasterStoplist]
Checking the Custom Stoplist by Searching for Word ‘this’
You can quickly check if your custom stoplist includes all the noise words from the system stoplist. Run the same code we used earlier to search for the word ‘this.’
--Run Full-Text query to Search for Noise word 'this'
SELECT * FROM dbo.WatchReview
WHERE CONTAINS(Details,'this')
This output must be blank.
Adding a new noise word ‘Hi’ to the Custom Stoplist
We need to add the noise word ‘Hi’ to the newly created custom stoplist. Run the following script:
--Alter customer stoplist to add Hi noise word
ALTER FULLTEXT STOPLIST WatchMasterStoplist
ADD 'Hi' LANGUAGE 'British English';
The result set is below:
The Final Check
Now we will run the final Full-Text query check for some noise words – we compare it to the original result set containing those noise words.
For example, if we focus on the fourth review (ordered by ReviewId), which does contain some noise words, and then compare the results with the Full-Text query results, we should see no noise words.
Run the following script to conduct the final check against the sample database:
USE WatchReviewsMasterStoplist
-- View the record which contains fourth review
SELECT ReviewId
,[Date]
,Person
,Details
FROM dbo.WatchReview
where ReviewId=4
--Run Full-Text query to Search for Noise words in the fourth review
SELECT * FROM dbo.WatchReview
WHERE CONTAINS(Details,'"Hi" OR "this" OR "I" OR "is" OR "and"')
and ReviewId=4
The output is as follows:
Full-Text query to search for the digital watches reviews
Now we can run our Full-Text Search queries with the stoplist in place. The following T-SQL script will inform us about the digital watches reviews (sample):
USE WatchReviewsMasterStoplist
--Run Full-Text query to Search for digital watch reviews sample
SELECT * FROM dbo.WatchReview WHERE CONTAINS(Details,'digital')
The results are below:
Congratulations! We have successfully implemented all three methods of creating and associating stoplists. We also examined adding new words to stoplists and checking the performance.
Still, I strongly recommend the third method. Create a custom stoplist from a system stoplist, and then add noise words to improve your Full-Text Search.
Things to do
Now that you are well aware of three methods of creating stoplists, you might do some exercises to improve your textual analysis skills, like the following:
- Add a custom stoplist from the system stoplist against the sample database, like in this article.
- Create a custom stoplist from the system stoplist against the sample database, like in this article.
- Run a T-SQL query to see the added stopwords after getting familiar with this article.