Managing Indexes

OpenSearch® is a database engine designed to handle massive amount of data. To do so, it relies on important concepts such as indexes, shards and replicas. Understanding what they are and how they work together is crucial to manage indexes efficiently, which is key to maintaining the performance and scalability of your addon.
OpenSearch® also comes with powerful features that help managing indexes and keeping them in control. Leveraging them allows to ease maintenance work while ensuring compliance with technical or legal requirements.
Here are a few insights and options that might be worth considering.

Understanding Indexes, Shards and Replicas

In OpenSearch®, an index designates a thing that holds all the documents and data you want to analyze and search. It can be viewed as an organizational container that stores documents.

Shards are the building blocks of an OpenSearch® index, each containing a subset of the documents stored in the index. Splitting indexes this way, in several shards, allows OpenSearch® to distribute data horizontally across the different nodes of the cluster. An index is therefore a collection of shards spread on multiple nodes.

This design gives OpenSearch® three major benefits:

  • It allows OpenSearch® to handle enormous volumes of data that would not fit on a single server (scalability)
  • It improves performances by processing search and indexing operations in parallel
  • It adds fault tolerance, thanks to replicas.

There are two types of shards:

  • Primary shards hold the original data. This is were data are written.
  • Replica shards are read-only copies of the primary shards. Each replica shard is always located on a different node of the cluster, ensuring redundancy and high availability in case of a node failure.

Technically speaking, shards are Apache Lucene indexes. Theorically, each one can contain a maximum of about 2.1 billion documents. OpenSearch® itself can handle up to 4000 shards per cluster node!

Choosing the Number of Shards

Choosing an appropriate number of shards before creating an index is primordial for multiple reasons:

  • It’s quite hard to modify the number of primary shards of an index once it has been created
  • The size of the shards has an impact on the database performances:
    • Splitting data into too many small shards causes resource waste and can cause performance issue, as well as out of memory errors.
    • On the other side, too large shards are simply inefficient performances-wise and are hard to recover in case of a failure.

Setting the number of shards also depends on your usecase. AWS officially advises to stick to the following guidelines:

  • Keep shard size between 10-30GiB when search latency is a key performance goal.
  • Keep shard size between 30-50GiB for write intensive workflows such as log analytics.

Finally, all these considerations can be summarized by the following formula, suggested by AWS. It allows to estimate the appropriate number of primary shards:

approximate number of primary shards = (source data + room to grow) × 1.1 / desired shard size

Make sure that the result doesn’t create unnecessarily small shards. That would be counter-productive!

Choosing the Number of Replicas

With Starter Plans

Since Starter plans only feature a single node, OpenSearch® is not able to fulfill the default replica configuration. This results in OpenSearch® returning a yellow health status for all user-created indexes: the data are stored in three shards instead of six.

We are keeping these defaults on purpose for two main reasons:

  • To emphasize the fact that your data are not replicated with Starter plans.
  • To make your indexes ready for a possible future plan upgrade (switching to a Business plan).

If you don’t plan on moving to a Business plan, or if you don’t care about replication, there is no benefit in setting the number of replicas to a value above 0.

With Business Plans

Business plans, on the other hand, are indeed able to replicate the data over the cluster nodes. Therefore, user-created indexes should always all be green.

Since OpenSearch® clusters provided by Scalingo feature three nodes, the number of replicas should most probably be set between 0 and 3, depending on the replication level you want.

Here is an example request creating an index named my-index with two primary shards and three replicas:

PUT my-index
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 3
    }
  }
}

Using Index Templates

Index templates allows to initialize new indexes with predefined values for settings, mappings, analyzers, shards, replicas, etc. If the name of the new index matches one of the patterns of one of the templates, OpenSearch® applies the template to the newly created index.

Leveraging this feature allows to prevent duplication of settings and therefore avoids inconsistencies.

The following example creates an index template named logs_template and applies it to any new index whose name matches the pattern logs-*. The newly created index will use two primary shards and two replicas. It will also have a mapping configured by default.

Here is an example request creating an index template named logs_template, setting the number of shards and replicas to 2 for all new indexes which name matches logs-*. The newly created index will also have a mapping.

PUT _index_template/logs_template
{
  "index_patterns": ["logs-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 2
    },
    "mappings": {
      "properties": {
        "timestamp": {
          "type": "date",
          "format": "yyyy-MM-dd HH:mm:ss"
        }
      }
    }
  }
}

Using Time-Based Indexes

When dealing with time-based data such as logs, metrics, economic indicators or even orders, a good practice generally consists in using time-based indexes, like something-2025.05.06. The granularity of the name pattern is up to you: it can be hours, days, weeks, or even months, depending on your usecase and the amount of data you plan to ingest.

Using such a pattern to name indexes has multiple benefits:

  • Capacity planning becomes more precise and easier. For example, if you know the average size of a document, and the number of documents you are ingesting per unit of time (e.g. 250 per hour), it becomes quite easy to estimate the size of the index for any unit of time. It also becomes easier to track the growth rate of the indexes over time.
  • When using index templates, it allows to update the number of primary shards of an index without having to reindex all data. For instance, if you are using a day-wise index created via an index template, any change applied to this template becomes effective the next day. Existing indexes on their side are left untouched.
  • Deleting old data also becomes easier: time-based name allows to quickly identify which indexes are to be kept, and which are to be removed.
  • It makes implementing lifecycle policies easier. Here again, the naming convention makes things clear and pretty straightforward.

Using Aliases

In OpenSearch®, an alias is like a nickname or a shortcut that points to one or more indexes. It’s a flexible and powerful way to manage how applications or users access the documents stored in your database. By obfuscating the underlying indexes, aliases create a simpler experience. For instance, instead of querying a time-based index named something-2025.05.06, an application could use an alias like something-today pointing to something-2025.05.06.

Since they abstract index names, aliases avoid downtime when transitioning from an index to another. For example, if your application queries the database via an alias named something pointing to something-a, switching to querying something-b only consists in updating the alias. The queried name remains the same and there is no need to update the application code. Moreover, the operation is be completely transparent for the application. This logic also applies when rotating time-based indexes.

Because an alias can point to multiple indexes, it can be used as an aggregate of multiple indexes. For instance, an alias named something-2024 could point to something-2024.01, something-2024.02, something-2024.03, etc. This allows to easily create sliding windows of data from the n previous unit of time (e.g. last-6-months).

An alias can also be configured to be read-only or read-write, which can be convenient to prevent data loss due to wrong processing.

Finally, an alias can be filtered based on multiple criterias (terms, dates, time ranges, …) to only expose a subset of the underlying indexes.

Here is an example of a request creating an alias named logs-2-weeks aggregating the indexes logs-week-32 and logs-week-33. This alias is read-only (by default):

POST _aliases
{
  "actions": [
    { "add": { "index": "logs-week-32", "alias": "logs-2-weeks" } },
    { "add": { "index": "logs-week-33", "alias": "logs-2-weeks" } }
  ]
}

Implementing Index State Management

Index State Management (ISM) is a plugin that helps automate routine administrative tasks by triggering actions based on conditions like index age, size, or document count. ISM allows to create policies to automatically manage actions such as index rollovers or deletions, tailored to your specific needs.

Leveraging such a feature allows to keep indexes size under limits and avoids keeping useless old data in your database. It’s generally a good way to keep your data under control and prevents indexes from growing indefinitely.

The following requests shows an example leveraging the rollover API to create a rollover policy for an alias. This policy automatically creates a new index once the current one contains at least one document. The new index is automatically named after the first one (in this example, log-001, log-002, log-003, etc.). The new index is also added to the alias and is set to be the one writable, so that adding a new document to the alias actually adds it to the newly created index. The previous index is updated to be marked as non-writable for the alias.

# Creates the policy:

PUT _plugins/_ism/policies/rollover_policy
{
  "policy": {
    "description": "Example rollover.",
    "default_state": "rollover",
    "states": [
      {
        "name": "rollover",
        "actions": [
          {
            "rollover": {
              "min_doc_count": "1"
            }
          }
        ],
        "transitions": []
      }
    ],
    "ism_template": {
      "index_patterns": ["log*"],
      "priority": 100
    }
  }
}


# Creates an index template and link the policy:

PUT _index_template/ism_rollover
{
  "index_patterns": ["log*"],
  "template": {
    "settings": {
      "plugins.index_state_management.rollover_alias": "log"
    }
  }
}


# Creates the first index and the `log` alias:

PUT log-001
{
  "aliases": [
    "log": {
      "is_write_index": true
    }
  ]
}

Optimizing Mappings and Data Types

Mappings and data types are essential components of OpenSearch® that define how documents and their properties are structured and indexed.

A mapping is a schema definition that tells OpenSearch® how to store and index documents. It mostly sets data types to properties of the documents, which helps OpenSearch® understand how to efficiently store, index and search them.

Using static mappings with the appropriate data types allows to optimize storage, improves query performance and thus makes your indexes more powerful. It also helps maintain data consistency by preventing invalid data to be added to the belonging index.

Without explicit mappings, OpenSearch falls back on dynamic mappings, which are often more costly and less efficient. Avoiding them and specifying your documents structure is generally a better practice to optimize your indexes.

The following table presents the main available data types:

Type Description Usecase
text Analyzed and tokenized Full-text search
keyword Not analyzed Filtering, exact matching, and sorting
long, integer, float, etc Numeric types Range queries and aggregations
date Datetime values Time-based queries
boolean True / False  
object For nested JSON objects  
nested For nested JSON objects Allows querying inner objects independently

Here is an example query creating an index named movies, with some mappings:

PUT movies
{
  "mappings": {
    "properties": {
      "title":    { "type": "text" },
      "release":  { "type": "date", "format": "strict_year_month" },
      "synopsis": { "type": "text" },
      "duration": { "type": "short" },
      //...
    }
  }
}

Reindexing Data When Needed

In OpenSearch®, reindexing data consists in copying data from one index to another. Reindexing can really help improve performance by creating a new, compact, optimized and consistent index. It’s necessary in several situations such as:

  • When changing mappings, analyzers, settings, data types, etc. after an index is created.
  • When trying to fix corrupted or malformes data, for instance, if data has been ingested incorrectly (wrong field types, inconsistent format, etc.).
  • When in the need to modify the sharding strategy.

Please keep in mind that reindexing can be resource-intensive. As a consequence, we usually advise to run this process during off-peak hours to limit the impact on your application and users.

If data is still being written to the old index during reindexing, please also consider steps to capture or reprocess the new data afterward.

Finally, keep in mind that aliases can help seamlessly switch from an old to a new index without requiring a client update.

Here is an example query asking OpenSearch® to reindex an index called old_index into a new one called new_index:

POST _reindex
{
    "source": { "index": "old_index" },
    "dest": { "index": "new_index" }
}

Suggest edits

Managing Indexes

©2025 Scalingo