OpenSearch is a distributed (runs on multiple nodes) search and analytics engine.
OpenSearch is a registered trademark of Amazon Web Services.
Document
A document is a unit that stores information (text or structured data). In OpenSearch, documents are stored in JSON format.
When you search for information, OpenSearch returns documents related to your search.
A document is analogous to a row in a traditional database. For example, in a school database, a document might represent one student and contain that student's data.
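As an illustration, a student document could be built and serialized as shown in this minimal Python sketch. The field names and values are hypothetical, chosen only to show the shape of a JSON document. They are not a schema defined by OpenSearch.

```python
import json

# Hypothetical student record; field names and values are illustrative
# assumptions, not part of any OpenSearch-defined schema.
student = {
    "name": "John Doe",
    "gpa": 3.89,
    "grad_year": 2022,
}

# Documents are stored as JSON, so the record serializes to a JSON object.
student_json = json.dumps(student, indent=2)
print(student_json)
```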
Index
An index is a collection of documents.
When you search for information, you query data contained in an index.
An index is analogous to a table in a traditional database.
Clusters and nodes
OpenSearch is designed to be a distributed search engine, meaning that it can run on one or more nodes—servers that store your data and process search requests. An OpenSearch cluster is a collection of nodes.
In each cluster, there is an elected cluster manager node, which orchestrates cluster-level operations, such as creating an index. Nodes communicate with each other, so if your request is routed to a node, that node sends requests to other nodes, gathers the nodes’ responses, and returns the final response.
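The request flow described above can be sketched as a toy scatter-gather routine. This is a simplified model, not OpenSearch's actual transport or search code: the `search_node` and `coordinate` functions and the `cluster` data are hypothetical, and each "node" is just an in-memory dictionary.

```python
import heapq

def search_node(local_docs, term):
    """Return (match_count, doc_id) pairs for documents held by one node."""
    hits = []
    for doc_id, text in local_docs.items():
        count = text.lower().split().count(term)
        if count:
            hits.append((count, doc_id))
    return hits

def coordinate(nodes, term, size=3):
    """Fan the query out to every node, gather the partial results,
    and return the best `size` hits overall."""
    partial = []
    for local_docs in nodes.values():
        partial.extend(search_node(local_docs, term))
    return heapq.nlargest(size, partial)

# Hypothetical three-node cluster, each node holding a few documents.
cluster = {
    "node-1": {"a": "dog dog cat"},
    "node-2": {"b": "dog"},
    "node-3": {"c": "cat"},
}

results = coordinate(cluster, "dog")
print(results)
```

The node that receives the request plays the role of `coordinate` here: it does not need to hold the matching documents itself, only to merge what the other nodes return.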
Shards
OpenSearch splits indexes into shards. Each shard stores a subset of all documents in an index.
Shards let OpenSearch distribute an index evenly across the nodes in a cluster. For example, a 400 GB index might be too large for any single node in your cluster to handle, but split into 10 shards of 40 GB each, OpenSearch can distribute the shards across 10 nodes and manage each shard individually.
Despite being one piece of an OpenSearch index, each shard is actually a full Lucene index. This detail is important because each instance of Lucene is a running process that consumes CPU and memory. More shards are not necessarily better. Splitting a 400 GB index into 1,000 shards, for example, would unnecessarily strain your cluster. A good rule of thumb is to limit shard size to 10–50 GB.
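The arithmetic behind this rule of thumb can be checked with a short sketch. The helper names below are hypothetical, and the 10 GB and 50 GB bounds are taken directly from the guideline above rather than from any OpenSearch setting.

```python
import math

MIN_SHARD_GB = 10  # lower bound from the rule of thumb
MAX_SHARD_GB = 50  # upper bound from the rule of thumb

def shard_size_gb(index_size_gb, num_shards):
    """Average shard size if the index is split evenly."""
    return index_size_gb / num_shards

def within_guideline(index_size_gb, num_shards):
    """True if the average shard size falls in the 10-50 GB range."""
    size = shard_size_gb(index_size_gb, num_shards)
    return MIN_SHARD_GB <= size <= MAX_SHARD_GB

def shard_count_range(index_size_gb):
    """Smallest and largest shard counts that keep shards within the range."""
    fewest = math.ceil(index_size_gb / MAX_SHARD_GB)
    most = math.floor(index_size_gb / MIN_SHARD_GB)
    return fewest, most

print(shard_size_gb(400, 10))      # 40.0 GB per shard
print(within_guideline(400, 10))   # True
print(within_guideline(400, 1000)) # False: 0.4 GB per shard
print(shard_count_range(400))      # (8, 40)
```

For the 400 GB index in the example, any count from 8 to 40 shards keeps the average shard inside the suggested range, while 1,000 shards produces shards of only 0.4 GB each.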
Primary and replica shards
In OpenSearch, a shard may be either a primary (original) shard or a replica (copy) shard. By default, OpenSearch creates a replica shard for each primary shard. These replica shards act as backups in the event of a node failure, because OpenSearch places each replica on a different node than its corresponding primary shard. Replicas also improve the speed at which the cluster processes search requests, so you might specify more than one replica per index for a search-heavy workload.
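The placement constraint, that a replica never shares a node with its primary, can be illustrated with a simple round-robin sketch. This is not OpenSearch's real shard allocator, which weighs many more factors. The `allocate` function and node names are hypothetical.

```python
def allocate(num_primaries, num_replicas, nodes):
    """Place every copy of a shard on a different node, round-robin."""
    copies = num_replicas + 1  # one primary plus its replicas
    if copies > len(nodes):
        raise ValueError("need at least one node per copy of a shard")
    placement = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(copies):
            # Offsetting by the copy number guarantees distinct nodes
            # for all copies of the same shard.
            node = nodes[(shard + copy) % len(nodes)]
            role = "primary" if copy == 0 else f"replica{copy}"
            placement[node].append((shard, role))
    return placement

nodes = ["node-1", "node-2", "node-3"]
placement = allocate(num_primaries=3, num_replicas=1, nodes=nodes)
for node, shards in placement.items():
    print(node, shards)
```

With three primaries and one replica each, the cluster holds six shards in total, and losing any single node still leaves one copy of every shard on the remaining nodes.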
Inverted index
An inverted index maps words to the documents in which they occur.
In addition to the document ID, OpenSearch stores the position of the word within the document for running phrase queries, where words must appear next to each other.
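A minimal sketch of a positional inverted index follows. It assumes naive lowercase whitespace tokenization, which is far simpler than the analysis a real search engine performs. The function names and sample documents are hypothetical.

```python
from collections import defaultdict

def build_index(docs):
    """Map each word to {doc_id: [positions where the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_query(index, phrase):
    """Return IDs of documents where the words appear next to each other."""
    words = phrase.lower().split()
    if not words:
        return []
    postings = [index.get(w, {}) for w in words]
    # Only documents that contain every word can match the phrase.
    candidates = set(postings[0])
    for p in postings[1:]:
        candidates &= set(p)
    matches = []
    for doc_id in sorted(candidates):
        # The phrase matches if word i sits exactly i positions after word 0.
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id]
                   for i in range(1, len(words))):
                matches.append(doc_id)
                break
    return matches

docs = {
    1: "Beauty is in the eye of the beholder",
    2: "Beauty and the beast",
}
index = build_index(docs)
print(dict(index["beauty"]))             # word -> documents and positions
print(phrase_query(index, "the beast"))  # words adjacent in document 2
print(phrase_query(index, "eye the"))    # both words occur, but not adjacent
```

Storing positions is what distinguishes the last two queries: both words of "eye the" occur in document 1, but never consecutively in that order, so the phrase does not match.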
Relevance
When you search for a document, OpenSearch matches the words in the query to the words in the documents. Each document is assigned a relevance score that tells you how well the document matched the query.
Individual words in a search query are called search terms. Each search term is scored:
- A search term that occurs more frequently in a document will tend to be scored higher. A document about dogs that uses the word dog many times is likely more relevant than a document that contains the word dog fewer times. This is the term frequency component of the score.
- A search term that occurs in more documents will tend to be scored lower. A query for the terms blue and axolotl should prefer documents that contain axolotl over the likely more common word blue. This is the inverse document frequency component of the score.
- A match on a longer document should tend to be scored lower than a match on a shorter document. A document that contains a full dictionary would match on any word but is not very relevant to any particular word. This corresponds to the length normalization component of the score.
OpenSearch uses the BM25 ranking algorithm to calculate document relevance scores and then returns the results sorted by relevance. See Okapi BM25.
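The three components can be seen together in a small sketch of the textbook Okapi BM25 formula. This is an illustration under stated assumptions, not OpenSearch's or Lucene's actual implementation, which may differ in details. The tokenization is a naive whitespace split, the parameter values `k1=1.2` and `b=0.75` are commonly used defaults assumed here, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query using textbook Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: longer-than-average documents are penalized.
        norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if tf[term] == 0:
                continue
            # Inverse document frequency: rarer terms weigh more.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency with saturation controlled by k1.
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores.append(score)
    return scores

docs = [
    "blue sky blue sea",
    "blue axolotl",
    "the blue whale swims in the deep blue sea far from any shore",
]
scores = bm25_scores("blue axolotl", docs)
ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranked)
```

In this example the second document ranks first because it contains the rare term axolotl. The first and third documents both contain blue twice, but the shorter one scores higher because of length normalization.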