At the heart of search engines lies an important component called the tokenizer.
In this blog post, we will delve into the world of tokenizers, their purpose, and how they work in the context of full-text search engines. We'll also explore some examples using JavaScript to help solidify your understanding.
A tokenizer is a component of a full-text search engine that processes raw text and breaks it into individual tokens.
Tokens are the smallest units of text that the search engine will analyze and index.
In most cases, a token represents a single word, but in some instances, it can represent phrases, numbers, or other types of text.
The process of converting raw text into tokens is called tokenization.
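As a quick, illustrative sketch (the sentence and the naive split are made up for this example), tokenization turns a raw string into a list of searchable units:

```javascript
// Illustrative only: a raw sentence and the tokens a very naive tokenizer might produce
const rawText = "Search engines love tokens";
const tokens = rawText.split(" "); // Split on single spaces
console.log(tokens); // ["Search", "engines", "love", "tokens"]
```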
Tokenizers are crucial for building efficient and accurate search indexes. They help search engines:

- Break raw text into consistent, comparable units
- Normalize terms so that queries and documents can be matched reliably
- Build compact indexes that are fast to query
There are several types of tokenizers, each with its own strengths and weaknesses.
In this section, we will explore three common tokenizers: Whitespace Tokenizer, Pattern Tokenizer, and Standard Tokenizer.
The Whitespace Tokenizer is the simplest type of tokenizer. It splits the input text into tokens based on whitespace characters, such as spaces, tabs, and newlines.
```javascript
function whitespaceTokenizer(text) {
  return text.split(/\s+/);
}

const sampleText = "Full-text search engines are powerful tools.";
console.log(whitespaceTokenizer(sampleText));
// Output: ["Full-text", "search", "engines", "are", "powerful", "tools."]
```
Pros of the Whitespace Tokenizer:

- Very simple to implement and understand
- Fast, since it only splits on whitespace characters

Cons of the Whitespace Tokenizer:

- Punctuation and special characters stay attached to tokens (for example, "tools.")
- No normalization, so "Tools" and "tools" are treated as different tokens
In summary, while the Whitespace Tokenizer is simple and fast, its limitations in handling punctuation, special characters, and text normalization may result in less accurate search results and indexing. More sophisticated tokenizers, such as the Standard Tokenizer, can provide better accuracy and consistency by taking into account various linguistic rules and preprocessing techniques.
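To make that limitation concrete, here is a small illustrative snippet that reuses the whitespaceTokenizer defined above and shows how punctuation prevents an exact match:

```javascript
// Reusing whitespaceTokenizer from the example above
const docTokens = whitespaceTokenizer("Search engines are powerful tools.");
const queryTokens = whitespaceTokenizer("tools");

// "tools." (with the trailing period) never equals "tools",
// so a naive exact-match lookup misses the document.
console.log(docTokens.includes(queryTokens[0])); // false
```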
The Pattern Tokenizer uses a regular expression pattern to split the input text into tokens. This allows for more fine-grained control over the tokenization process and can handle more complex cases, such as splitting on punctuation marks or special characters.
```javascript
function patternTokenizer(text, pattern) {
  // Filter out empty strings left behind by leading or trailing separators
  return text.split(pattern).filter((token) => token.length > 0);
}

const sampleText = "Full-text search engines are powerful tools.";
const pattern = /[\s,.!?]+/;
console.log(patternTokenizer(sampleText, pattern));
// Output: ["Full-text", "search", "engines", "are", "powerful", "tools"]
```
Pros of the Pattern Tokenizer:

- Flexible: the splitting behavior is fully controlled by the regular expression
- Can handle punctuation, special characters, and language-specific separation rules

Cons of the Pattern Tokenizer:

- Requires familiarity with regular expressions and can be harder to implement and maintain
- Does not normalize or preprocess tokens (for example, lowercasing) by default
In summary, the Pattern Tokenizer offers more flexibility and control over the tokenization process compared to the Whitespace Tokenizer, allowing for better handling of special characters, punctuation, and language-specific rules. However, it may be more complex to implement and maintain, and it does not address token normalization or preprocessing issues by default.
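As an illustration of that flexibility, the same patternTokenizer can be pointed at a different pattern; the one below is a hypothetical choice that also treats hyphens and underscores as separators, which can be handy when indexing identifiers or file names:

```javascript
// Hypothetical pattern that also splits on hyphens and underscores
const codePattern = /[\s,.!?_-]+/;
console.log(patternTokenizer("full_text-search engines", codePattern));
// Output: ["full", "text", "search", "engines"]
```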
The Standard Tokenizer is a more sophisticated tokenizer that takes into account various linguistic rules, such as handling punctuation, special characters, and compound words. It usually combines multiple tokenization strategies to provide a more accurate and meaningful representation of the text.
```javascript
function standardTokenizer(text) {
  return text
    .replace(/[\.,!?\n]+/g, " ")          // Replace punctuation marks and newlines with spaces
    .split(/\s+/)                         // Split on whitespace characters
    .filter((token) => token.length > 0)  // Drop empty tokens left by leading/trailing whitespace
    .map((token) => token.toLowerCase()); // Convert all tokens to lowercase
}

const sampleText = "Full-text search engines are powerful tools.";
console.log(standardTokenizer(sampleText));
// Output: ["full-text", "search", "engines", "are", "powerful", "tools"]
```
Pros of the Standard Tokenizer:

- Better accuracy and consistency thanks to punctuation handling and lowercasing
- Produces normalized tokens, so queries and documents match more reliably
- Can combine multiple strategies and language-specific rules

Cons of the Standard Tokenizer:

- More complex to implement and maintain
- Slower than simpler tokenizers because of the extra processing steps
In summary, the Standard Tokenizer offers a more sophisticated approach to tokenization, taking into account various linguistic rules and providing better accuracy, consistency, and normalization. However, it can be more complex to implement and maintain, and it may have slower processing times compared to simpler tokenizers. Despite these drawbacks, the improved search accuracy and performance often make the Standard Tokenizer a preferred choice for full-text search engines.
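The practical payoff is easiest to see when the same tokenizer is applied to both a document and a query; this illustrative snippet reuses the standardTokenizer defined above:

```javascript
const documentTokens = standardTokenizer("Full-text search engines are powerful tools.");
const queryTokens = standardTokenizer("Tools!");

// Both "tools." and "Tools!" normalize to the same token "tools",
// so the query now matches the indexed document.
console.log(documentTokens.includes(queryTokens[0])); // true
```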
Orama uses the Standard Tokenizer to tokenize text for indexing and searching.
The Standard Tokenizer is a good choice for Orama because it provides a more accurate and consistent representation of the text, which is important for search accuracy and performance. It also allows for customization to handle specific language characteristics or word separation rules, making it versatile across different languages.
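To see this end to end, here is a minimal sketch using the core @orama/orama API (create, insert, and search); the schema and sample document are invented for illustration, and the exact result shape may vary between Orama versions:

```javascript
import { create, insert, search } from '@orama/orama'

const db = await create({
  schema: {
    title: 'string',
  },
})

// The document is tokenized by Orama's default (standard) tokenizer at insert time.
await insert(db, { title: 'Full-text search engines are powerful Tools.' })

// The query is tokenized the same way, so "tools" matches "Tools." despite casing and punctuation.
const results = await search(db, { term: 'tools' })
console.log(results.hits.length) // 1
```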
To customize the tokenizer used by Orama, provide an object with at least the following properties:

- `tokenize`: a function that accepts a string and a language and returns a list of tokens.
- `language` (string): the language supported by the tokenizer.
- `normalizationCache` (Map): a map that can be used to cache token normalization.

In other words, a tokenizer must satisfy the following interface:
```typescript
interface Tokenizer {
  language: string
  normalizationCache: Map<string, string>
  tokenize: (raw: string, language?: string) => string[] | Promise<string[]>
}
```
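As a sketch of how these pieces fit together, here is a hypothetical tokenizer that satisfies the interface above and uses normalizationCache to avoid re-normalizing tokens it has already seen; the splitting and lowercasing rules are arbitrary choices for illustration:

```javascript
// Hypothetical custom tokenizer that satisfies the Tokenizer interface above
const cachingTokenizer = {
  language: 'english',
  normalizationCache: new Map(),
  tokenize(raw) {
    return raw
      .split(/\s+/)                        // Naive whitespace split, for illustration only
      .filter((token) => token.length > 0)
      .map((token) => {
        // Reuse previously normalized tokens instead of recomputing them
        const cached = cachingTokenizer.normalizationCache.get(token)
        if (cached !== undefined) return cached

        const normalized = token.toLowerCase().replace(/[.,!?]+$/, '') // Lowercase and strip trailing punctuation
        cachingTokenizer.normalizationCache.set(token, normalized)
        return normalized
      })
  },
}

console.log(cachingTokenizer.tokenize('Full-text search engines are powerful Tools.'))
// Output: ["full-text", "search", "engines", "are", "powerful", "tools"]
```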
For instance, with the following configuration, only the first character of each string will be indexed and only the first character of a search term will be matched:

```javascript
import { create } from '@orama/orama'

const movieDB = await create({
  schema: {
    title: 'string',
    director: 'string',
  },
  components: {
    tokenizer: {
      language: 'english',
      normalizationCache: new Map(),
      tokenize(raw) {
        // Return a single-element array containing only the first character of the input
        return [raw[0]]
      }
    }
  }
})
```
Orama's default tokenizer is exported via `@orama/orama/components` and can be customized as follows:
```javascript
import { create } from '@orama/orama'
import { tokenizer as defaultTokenizer } from '@orama/orama/components'

const movieDB = await create({
  schema: {
    title: 'string',
    director: 'string',
  },
  components: {
    tokenizer: await defaultTokenizer.createTokenizer({ language: 'english', stemming: false })
  }
})
```
Optionally, you can pass the customization options without using `createTokenizer`:
```javascript
import { create } from '@orama/orama'

const movieDB = await create({
  schema: {
    title: 'string',
    director: 'string',
  },
  components: {
    tokenizer: {
      language: 'english',
    }
  }
})
```
Read more in the official docs.
In conclusion, tokenization is a fundamental aspect of full-text search engines, providing the basis for efficient text analysis and indexing. As we've explored, there are various tokenizers available, each with its own strengths and weaknesses. The Whitespace Tokenizer is simple and fast but lacks accuracy; the Pattern Tokenizer is flexible but requires more knowledge of regular expressions; and the Standard Tokenizer is more sophisticated, offering better accuracy and consistency, but at the cost of increased complexity and potentially slower processing times.
When choosing a tokenizer for your search engine, it is crucial to consider the specific requirements and characteristics of the data and languages you are working with, as well as the performance and accuracy trade-offs you are willing to make. In the case of Orama, the Standard Tokenizer is used due to its improved search accuracy, performance, and adaptability to different languages.
By understanding the different types of tokenizers and their implications, you can make more informed decisions about the tokenization strategies to employ in your search engine, ultimately improving the effectiveness and efficiency of your text-based searches.