Optimizing Orama: Schema Optimization
Posted by
Michele Riva
CTO @OramaSearch
Product Updates
July 11, 2023

Orama is a runtime-agnostic, edge-ready, in-memory full-text search engine that works both on client and server.

Through the implementation of optimized data structures and some clever tweaks, Orama can perform searches through millions of entries in microseconds.

One area that must be handled correctly for Orama to deliver on its performance promises is schema optimization. By carefully designing your data schema, you can significantly improve the speed and accuracy of searches. This involves identifying the most important search fields, normalizing data to reduce redundancy, and choosing appropriate data types.

With a well-optimized schema, Orama can become even more powerful and efficient.

What is a Schema

An Orama schema is an immutable JavaScript object that tells the search engine how to index the data. It should therefore contain only the properties we want to search through at query time.

Let's imagine we have to build an application that lists dog breeds and uses the following JSON as its data source. We want to index it for full-text search, allowing users to search by breed, country of origin, character, and more:

[
  {
    "breed": "Labrador Retriever",
    "country": "Canada",
    "longevity": 10,
    "character": [
      "Loyal",
      "Friendly",
      "Intelligent",
      "Energetic",
      "Good-natured"
    ],
    "colors": {
      "fur": [
        "Yellow",
        "Black",
        "Chocolate"
      ],
      "eyes": [
        "Brown"
      ]
    }
  },
  {
    "breed": "German Shepherd",
    "country": "Germany",
    "longevity": 7,
    "character": [
      "Loyal",
      "Intelligent",
      "Protective",
      "Confident",
      "Trainable"
    ],
    "colors": {
      "fur": [
        "Black",
        "Tan"
      ],
      "eyes": [
        "Brown"
      ]
    }
  }
]

We could easily write an Orama schema definition for the data above by just typing:

import { create } from '@orama/orama'

const db = await create({
  schema: {
    breed: 'string',
    country: 'string',
    longevity: 'number',
    character: 'string[]',
    colors: {
      fur: 'string[]',
      eyes: 'string[]'
    }
  }
})

With the schema defined above, Orama will create a Radix Tree for every string or string[] property and an AVL Tree for every number or number[] property.

That means that eventually, Orama will create the following data structures to index your data:

  1. A Radix Tree for the breed property
  2. A Radix Tree for the country property
  3. An AVL Tree (but might be a Zip Tree or a Red-Black Tree in the future) for the longevity property
  4. A Radix Tree for the character property
  5. A Radix Tree for the colors.fur property
  6. A Radix Tree for the colors.eyes property

As you can see, in this case we create six data structures, each of which must receive values at index time (that is, when using the Orama insert function to feed the database).
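To make the count concrete, here is a minimal sketch that walks a schema object and lists one entry per leaf property. The describeIndexes helper and the STRUCTURE_BY_TYPE table are hypothetical names written for this post; they are not part of the Orama API, and the mapping only restates the rule described above.

```javascript
// Hypothetical helper, not part of the Orama API.
// Maps each schema type to the data structure the post says Orama builds for it.
const STRUCTURE_BY_TYPE = {
  'string': 'Radix Tree',
  'string[]': 'Radix Tree',
  'number': 'AVL Tree',
  'number[]': 'AVL Tree'
}

function describeIndexes(schema, prefix = '') {
  const result = []
  for (const [key, type] of Object.entries(schema)) {
    const path = prefix ? `${prefix}.${key}` : key
    if (typeof type === 'object' && type !== null) {
      // Nested object: recurse and flatten into dotted paths such as "colors.fur".
      result.push(...describeIndexes(type, path))
    } else {
      result.push({ path, structure: STRUCTURE_BY_TYPE[type] ?? 'unknown' })
    }
  }
  return result
}

const fullSchema = {
  breed: 'string',
  country: 'string',
  longevity: 'number',
  character: 'string[]',
  colors: {
    fur: 'string[]',
    eyes: 'string[]'
  }
}

const indexes = describeIndexes(fullSchema)
console.log(indexes.length) // 6
for (const { path, structure } of indexes) {
  console.log(`${path}: ${structure}`)
}
```

Every entry in that list is a structure that has to be populated on each insert, which is why the number of schema properties matters.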

With the example above we are able to perform search, group, sort, and filter operations on every field written in the schema. But are we sure that's what we want? Most of the time, that's really unlikely, so let's see how we can optimize the Orama schema.

Understanding Your Data

With the example above, we've seen how easy it can be to create a simple schema definition for your data, even when its shape can be pretty complex.

But every property we add to our schema leads to a bigger index and to slower insertion and search times, because there are more data structures (typically trees) to populate and then traverse.

Therefore, it is important to understand that the schema definition does not represent your data in its completeness. It only represents the data you want to search through.

So, given the example above where we want to index a dictionary of dog breeds and their characteristics, when creating a schema we should ask ourselves: what data do I want to search?

Given six total properties:

  1. breed
  2. country
  3. longevity
  4. character
  5. colors.fur
  6. colors.eyes

Are you sure you want to perform full-text search on the country property? And what about the longevity property? Are you sure you want to facet, filter, or group by it?

If the answer is yes, then you should definitely put these into your schema definition. But if there's a property you're not interested in using as a criterion at search time, then putting it into the schema definition will only lead to slower insertion and search performance.

Define Your Use Case

We previously saw an example of a quite complex schema, where we indexed every property that appeared in the original JSON documents:

import { create } from '@orama/orama'

const db = await create({
  schema: {
    breed: 'string',
    country: 'string',
    longevity: 'number',
    character: 'string[]',
    colors: {
      fur: 'string[]',
      eyes: 'string[]'
    }
  }
})

Now let's consider a use case for full-text search in the context of a dictionary of dog breeds. What do we want to show to our users? Let's make a very simple example.

Adopt a Puppy!

Let's pretend for a moment that you're running a website where you help people to adopt a new, rescued dog. What information will people need to understand what kind of breed can be a good match for them?

  1. breed. People will certainly be searching by breed.
  2. character. If I want to explore the world, I might want an energetic dog to come with me!

All the other properties might not be so important at search time, but they might be when we want to display the final data.

Therefore, our schema might now be:

import { create } from '@orama/orama'

const db = await create({
  schema: {
    breed: 'string',
    character: 'string[]'
  }
})
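If the full schema already lives somewhere in your code, the trimmed one can be derived from it rather than retyped. The sketch below assumes a hypothetical pickSchema helper (not an Orama function) that copies only the listed properties, with dotted paths for nested ones.

```javascript
// Hypothetical helper, not part of the Orama API.
// Builds a smaller schema containing only the listed property paths.
function pickSchema(schema, paths) {
  const picked = {}
  for (const path of paths) {
    const keys = path.split('.')
    let source = schema
    let target = picked
    keys.forEach((key, i) => {
      if (typeof source !== 'object' || source === null || !(key in source)) {
        throw new Error(`Unknown schema property: ${path}`)
      }
      if (i === keys.length - 1) {
        // Last segment: copy the type definition as-is.
        target[key] = source[key]
      } else {
        // Intermediate segment: descend into (or create) the nested object.
        target[key] = target[key] ?? {}
        source = source[key]
        target = target[key]
      }
    })
  }
  return picked
}

const fullSchema = {
  breed: 'string',
  country: 'string',
  longevity: 'number',
  character: 'string[]',
  colors: {
    fur: 'string[]',
    eyes: 'string[]'
  }
}

const trimmed = pickSchema(fullSchema, ['breed', 'character'])
console.log(trimmed) // { breed: 'string', character: 'string[]' }
```

The resulting object has the same shape as the schema literal above, so it could be passed as the schema option in its place.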

We can still feed Orama with the entire JSON document we saw at the beginning of this post, but we will only be able to search, filter, group, and create facets based on breed and character properties.

When performing a search operation, Orama will return the entire document containing all the properties that originally appeared in the data source file.
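The distinction between what is indexed and what is stored can be illustrated with a toy model. This is not how Orama is implemented: a plain Map from token to document ids stands in for its per-property trees, and createToyIndex, toyInsert, and toySearch are made-up names for this sketch.

```javascript
// Toy model only. None of these functions belong to the Orama API.
function createToyIndex(schemaKeys) {
  return { schemaKeys, docs: [], tokens: new Map() }
}

function toyInsert(index, doc) {
  const id = index.docs.length
  index.docs.push(doc) // the whole document is stored, indexed or not
  for (const key of index.schemaKeys) {
    // Only properties named in the schema contribute tokens.
    const values = [].concat(doc[key] ?? [])
    for (const value of values) {
      for (const token of String(value).toLowerCase().split(/\s+/)) {
        if (!index.tokens.has(token)) index.tokens.set(token, new Set())
        index.tokens.get(token).add(id)
      }
    }
  }
}

function toySearch(index, term) {
  const ids = index.tokens.get(term.toLowerCase()) ?? new Set()
  return [...ids].map(id => index.docs[id])
}

const index = createToyIndex(['breed', 'character'])
toyInsert(index, {
  breed: 'Labrador Retriever',
  country: 'Canada',
  longevity: 10,
  character: ['Loyal', 'Friendly']
})

console.log(toySearch(index, 'loyal')[0].country) // "Canada": stored, so returned
console.log(toySearch(index, 'canada').length)    // 0: country was never indexed
```

The second lookup finds nothing because country is not in the schema, yet the same value comes back intact with any document matched through breed or character.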

Is Orama Semi-Schemaless then?

Yes, in a way, Orama is semi-schemaless. While a schema is required to tell the search engine how to index the data, it is only a representation of the data we want to search through. It doesn't have to contain every property in the data source file, and including unnecessary properties will only slow down insertion and search performance. Therefore, it's important to carefully consider which properties to include in the schema for optimal performance.

Another great advantage of being semi-schemaless is that we can store different data depending on the document. Let's take a look at the following example:

import { create, insert } from '@orama/orama'

const db = await create({
  schema: {
    breed: 'string',
    character: 'string[]'
  }
})

await insert(db, {
  breed: 'Golden Retriever',
  character: ['Loyal', 'Friendly'],
  customProperty: 'foo',
  customMetadata: 'bar'
})

await insert(db, {
  breed: 'German Shepherd',
  character: ['Loyal', 'Intelligent'],
  customInfo: 'my custom info',
  customNumericData: 1000
})

As you can see, once we define a basic schema and ensure that the data we insert respects its types, we can insert different properties in different documents. Orama won't index them, and will simply return them at search time without applying any computation.
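The rule "schema properties must respect their types, everything else is free" can be sketched as a small check. The conformsToSchema function below is a hypothetical illustration and Orama's own validation may behave differently; in particular, treating schema properties as optional is an assumption of this sketch.

```javascript
// Hypothetical validator written for illustration; not part of the Orama API.
function matchesType(value, type) {
  if (type.endsWith('[]')) {
    const itemType = type.slice(0, -2)
    return Array.isArray(value) && value.every(item => typeof item === itemType)
  }
  return typeof value === type
}

function conformsToSchema(schema, doc) {
  // Only keys declared in the schema are checked; extra document keys are ignored.
  return Object.entries(schema).every(([key, type]) => {
    const value = doc[key]
    if (value === undefined) return true // assumption: schema properties are optional
    if (typeof type === 'object') {
      return typeof value === 'object' && value !== null && conformsToSchema(type, value)
    }
    return matchesType(value, type)
  })
}

const schema = { breed: 'string', character: 'string[]' }

console.log(conformsToSchema(schema, {
  breed: 'Golden Retriever',
  character: ['Loyal', 'Friendly'],
  customProperty: 'foo'
})) // true: the extra property is not checked

console.log(conformsToSchema(schema, {
  breed: 42,
  character: ['Loyal']
})) // false: breed is declared as a string
```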

Conclusion

Schema optimization is crucial for Orama's performance, and involves identifying important search fields, normalizing data, and choosing appropriate data types. The schema definition should only represent the data you want to search through, and unnecessary properties will slow down insertion and search performance. Orama is semi-schemaless, allowing for different properties to be stored in different documents.