The current article is about Stopwords and Stoplist usage. The aim is to make Full-Text Search more efficient in terms of storage and performance. Additionally, the readers of this article will get a walkthrough of implementing Full-Text Search with Stopwords and Stoplist.
The article also highlights the importance of Stopwords and Stoplist(s) to keep the Full-Text Search precise and avoid noise words.
Understanding Stopwords and Stoplist
First, let’s clarify the essence of Stopwords and Stoplist. Then we’ll proceed to use them to improve Full-Text Search.
A stoplist, as the name implies, is a list of stopwords. When associated with Full-Text Search, the Stoplist can filter out meaningless words or terms, thus improving search results.
A stopword is a word that has a minor role in Full-Text Search, despite being important grammatically. Therefore, a stopword is not essential from the Full-Text Search perspective.
According to Microsoft documentation, a stopword can be a word with some meaning in a specific language, or it may be some token with no linguistic value. In both cases, it is useless for the Full-Text Search.
Examples of Stopwords
The following are stopwords in the (British/American) English language:
Since the above words do not contribute to Full-Text Search, they are stopwords, despite their significance in understanding a complete sentence.
Each supported language, including British English, has a default or system stoplist that can be modified according to particular requirements.
Users can create their own/custom stoplist, using the system stoplist, or from scratch. It is helpful to prevent noise words from becoming a part of the Full-Text Search results.
Languages and Stoplists
Full-Text search supports many different languages. Each of the supported languages has at least one system stoplist by default. However, there can be multiple stoplists for any language, including both system and custom stoplists.
List of supported languages
We can quickly check the list of all supported languages used with Full-Text Search. For this, run the following T-SQL script:
-- List of Full-Text Search supported languages SELECT ftl.lcid,ftl.name FROM sys.fulltext_languages ftl
The results are as follows:
The query on the SQL Server 2016 instance returns a total of 53 languages.
Using Stopwords and Stoplist with Full-Text Search
Now, we’ll get some hands-on experience of creating a stoplist of stopwords defined by the user. Then, we’ll use it for the Full-Text Search as a walkthrough. It will be just like a real-time scenario.
First of all, you need to ensure the following requirements to implement the walkthrough correctly:
- Having a basic understanding of Full-Text Search
- The ability to implement the Full-Text Search in SQL Server
- The presence the Full-Text Search option enabled/installed on the SQL instance you plan to use
To ensure those prerequisites, go through the below articles if you are already familiar with the T-SQL scripting:
- Implementing Full-Text Search in SQL Server 2016 for beginners
- Implementing Full-Text Search in SQL Server 2016 for Advanced Users
Check Full-Text Search status
Run the following query to check if you have the Full-Text Search installed into your SQL instance:
-- Is Full-Text Search installed then 1 or 0 SELECT fulltextserviceproperty('IsFulltextInstalled') as [Full-Text Search]
The result of the above script should return 1 as follows:
If you are getting any number other than 1, refer to the articles mentioned above again.
Setup Sample Database (WatchReviewsStoplist)
First, you need to set up a sample database named WatchReviewsStoplist. Use the below script:
-- Create WatchReviewsStoplist database CREATE DATABASE WatchReviewsStoplist; GO -- Connect to the sample database USE WatchReviewsStoplist -- (2) Create WatchReview table CREATE TABLE [dbo].[WatchReview] ( [ReviewId] INT NOT NULL IDENTITY , [Date] DATETIME2 NULL, [Person] VARCHAR(50) NULL, [Details] VARCHAR(1000) NULL, CONSTRAINT [PK_WatchReview] PRIMARY KEY (ReviewId) ) -- (3) Populate WatchReview table SET IDENTITY_INSERT [dbo].[WatchReview] ON INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (1, N'2020-01-01 00:00:00', N'Atif', N'Hi, I have just bought this Casio black digital watch which is excellent and has date, alarm, stopwatch and timer as well.') INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (2, N'2020-01-02 00:00:00', N'Qasim', N'Hi, I have just bought this Casio black analog watch which is average and slightly discomforting and just got date and time.') INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (3, N'2020-01-03 00:00:00', N'Martin', N'Hi, I have just purchased this Casio black digital watch which is excellent and has features like alarm, stopwatch, date and timer. ') INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (4, N'2020-01-10 00:00:00', N'Brian ', N'Hi, I have just ordered this Seiko black digital watch which is excellent and has date, alarm and timer.') INSERT INTO [dbo].[WatchReview] ([ReviewId], [Date], [Person], [Details]) VALUES (5, N'2020-01-11 00:00:00', N'Sarah', N'Hi, I have just ordered this Seiko white analog watch which is average and has alarm, date and timer.') SET IDENTITY_INSERT [dbo].[WatchReview] OFF
Check Sample Database
Run the following T-SQL script to view the table contents in the sample database:
-- Check sample database SELECT wr.ReviewId,wr.Person,wr.Details FROM dbo.WatchReview wr
The output is as follows:
Assume that we’ve asked the development team to improve the Full-Text Search for the customers’ reviews. They should do it by excluding noise words. This way, the Full-Text Search remains efficient and also focused.
To fulfill these requirements, we need first to set up the Full-Text Search. It means taking the following actions:
- Create a Full-Text Catalog
- Create an empty custom or user-defined Stoplist
- Create a Full-Text index
Create Full-Text Catalog
Create a full-text catalog with the following script:
-- Create Full-Text catalog CREATE FULLTEXT CATALOG [WatchReviewCatalog] AS DEFAULT; GO
Create an empty Custom Stoplist
Create a blank custom stoplist with the following script:
-- Create empty custom stoplist CREATE FULLTEXT STOPLIST [WatchReviewStoplist]; GO
A custom stoplist is now created.
Create Full-Text Index
Finally, create a Full-Text Index in the Details column for the British English language. It will point to the custom stoplist created earlier. Use the following T-SQL script for this action:
-- Create Full-Text index pointing to the previously created Stoplist CREATE FULLTEXT INDEX ON dbo.WatchReview(Details LANGUAGE [British English]) KEY INDEX PK_WatchReview WITH STOPLIST = WatchReviewStoplist; GO
Full-Text query to get customer rated excellent watch
Run the following Full-Text query to find out which product (watch) got excellent marks by customers:
--Search customer reviews where the product was rated excellent SELECT * FROM dbo.WatchReview WHERE CONTAINS(Details,'excellent')
The results are below:
We can see that customers have given excellent ratings to the black digital watch. Therefore, we know the best product based on reviews, and we understand that the Full-Text Search works fine.
Still, there are few concerns that you might ignore for a table of five rows, but it can get us performance and storage benefits in a relatively large result set if addressed promptly.
Search noise word ‘Hi’ with Full-Text query
There are many noise words attached to the Full-Text index. Soon, they can become a burden. Plus, they are no use to the search.
Let’s check if the noise word Hi is present in the Full-Text Search results because it is better to exclude it.
Run the Full-Text query as follows:
--Run Full-Text query to Search for Noise word 'Hi' SELECT * FROM dbo.WatchReview WHERE CONTAINS(Details,'Hi')
The results are below:
The results returned show that the word “Hi” is part of the Full-Text search results. Thus, we need to exclude it – it is meaningless to the search and does not help any analysis.
Alter Stoplist to add Stopword Hi
We will add the stopword Hi by altering our custom Stoplist so that it cannot be indexed by Full-Text search and should not be returned by Full-Text queries to save space and improve the search:
--Alter customer stoplist to add Hi noise word ALTER FULLTEXT STOPLIST WatchReviewStoplist ADD 'Hi' LANGUAGE 'British English';
Search noise word ‘Hi’ again after adding it as Stopword
Search for the noise word Hi after adding it to the Stoplist:
--Search Noise word 'Hi' after it has been added as a stopword SELECT * FROM dbo.WatchReview WHERE CONTAINS(Details,'Hi')
The output is as follows:
Check added Stopwords
Anytime, you can check the stopwords added to the Stoplist. Run the following script for this:
-- Check added stopwords SELECT * FROM sys.fulltext_stopwords
The results are below:
Congratulations! You have successfully created a stoplist and added a stopword to it. This way, you make the Full-Text Search more efficient in both performance and storage.
Things to do
Now that you can create stoplists and add stopwords to them, try the following to further improve your skills:
- Add words ‘I,’ ‘this,’ ‘just,’ ‘as noise words to your custom stoplist created in this walkthrough.
- Check the added stopwords by running the script we specified at the end of the walkthrough.
- How to use Stopwords and Stoplist to improve SQL Server Full-Text Search (FTS) - September 23, 2020
- Learn to Store and Analyze Documents on Windows File System with SQL Server Semantic Search – Part 3 - August 14, 2020
- Learn to Store and Analyze Documents on Windows File System With SQL Server Semantic Search – Part 2 - August 10, 2020