Suggestions on Using Elasticsearch

Elasticsearch is an open-source search and analytics engine. This section provides suggestions on using Elasticsearch so that you can get more out of CSS.

Improving Indexing Efficiency

Selecting Appropriate Number of Shards and Replicas

When you create an index, you are advised to specify the number of shards and replicas explicitly. Otherwise, the default settings are used: five shards and one replica in versions earlier than 7.x, and one shard and one replica in 7.x and later versions.

The shard quantity strongly affects indexing speed. Too many or too few shards both slow indexing down. If you specify too many shards, numerous files are opened during retrieval, slowing down the communication between servers. If you specify too few shards, the index of a single shard may grow too large, slowing down the indexing speed.

Specify the shard quantity based on the node quantity, disk quantity, and index size. It is recommended that the size of a single shard not exceed 30 GB. The shard size is calculated using the following formula: Size of a shard = Total amount of data/Shard quantity. For example, 600 GB of data at no more than 30 GB per shard works out to 600/30 = 20 shards. The following request creates an index with one shard and no replicas; setting number_of_replicas to 0 during bulk loading and raising it afterwards is a common way to speed up the initial import:

PUT /my_index
{
  "settings": {
    "number_of_shards":   1,
    "number_of_replicas":  0
  }
}

Storing Data in Different Indices

Elasticsearch relies on Lucene to index and store data, and Lucene works best with dense data, that is, data in which all documents carry the same fields. If your documents have very different sets of fields, store them in separate indices rather than mixing them in one index.
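As a sketch (the index names and fields here are illustrative), documents with different shapes can be written to separate indices so that each index stays dense:

```
PUT /app_logs/_doc/1
{ "timestamp": "2023-10-24T12:00:00Z", "level": "INFO", "message": "service started" }

PUT /app_metrics/_doc/1
{ "timestamp": "2023-10-24T12:00:00Z", "cpu_percent": 41.5, "mem_bytes": 123456789 }
```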

Creating Indices by Time Range

You are advised to create indices to store time-related data, such as log data, by time range, instead of storing all data in a super large index.

For example, you can store data in an index named by year (example: logs_2014) or by month (example: logs_2014-10). When the volume of data becomes very large, you can store data in an index named by day (example: logs_2014-10-24).

Creating indices by time range has the following advantages: expired data can be deleted quickly by dropping an entire index, which is far cheaper than deleting individual documents, and the shard quantity of each new index can be adjusted as the data volume grows or shrinks.
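For example, with daily indices, the day's data is written to that day's index, and expired data is removed by deleting the whole index (the document body below is illustrative):

```
PUT /logs_2014-10-24/_doc/1
{ "timestamp": "2014-10-24T09:00:00Z", "message": "user logged in" }

DELETE /logs_2014-09-24
```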

Optimizing Index Configurations

Using Index Templates

Elasticsearch allows you to use index templates to control the settings and mappings of newly created indices whose names match a pattern, for example, setting the shard quantity to 1 and disabling the _all field (versions earlier than 7.x only; the _all field was removed in 7.0). Any settings in a matching template are applied automatically when the index is created.

In the following example, indices matching logstash-* use the my_logs template, whose order value is 1. When several templates match the same index, those with higher order values override those with lower values.

Versions earlier than 7.x

PUT /_template/my_logs 
{
  "template": "logstash-*", 
  "order":    1, 
  "settings": {
    "number_of_shards": 1 
  },
  "mappings": {
    "_default_": { 
      "_all": {
        "enabled": false
      }
    }
  },
  "aliases": {
    "last_3_months": {} 
  }
}

7.x and later versions

PUT /_template/my_logs
{
  "index_patterns": ["logstash-*"],
  "order": 1,
  "settings": {
    "number_of_shards": 1
  },
  "aliases": {
    "last_3_months": {}
  }
}

The _all field was removed in Elasticsearch 7.0, so it no longer needs to be disabled in this version of the template.
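To confirm that a template is in place, retrieve it by name, or create a matching index and inspect the settings it inherited. The index name below is illustrative, and this uses the legacy _template API shown above; Elasticsearch 7.8 and later also offer composable templates under the _index_template endpoint:

```
GET /_template/my_logs

PUT /logstash-2014-10-24

GET /logstash-2014-10-24/_settings
```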

Data Backup and Restoration

Elasticsearch replicas provide high availability during runtime: if a node fails and a primary shard is lost, a replica on another node can be promoted so that service continues.

However, replicas do not protect against every failure, for example, the simultaneous loss of multiple nodes or the accidental deletion of an index. For such cases, you need a backup of your cluster so that you can restore data.

To back up cluster data, create snapshots and save them in OBS buckets. This backup process is incremental: your first snapshot is a complete copy of the data, and each subsequent snapshot saves only the differences between the data already snapshotted and the current data. Incremental segments are added or deleted as snapshots are created and removed, so subsequent backups complete faster because only a small volume of data needs to be transferred.
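The snapshot APIs involved look like the following sketch. The repository name my_backup and snapshot name snapshot_1 are illustrative, and the repository pointing to your OBS bucket must already be registered (in CSS, this is typically configured through the management console):

```
PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

GET /_snapshot/my_backup/snapshot_1

POST /_snapshot/my_backup/snapshot_1/_restore
```

The restore call rebuilds the snapshotted indices in the cluster; indices that already exist and are open must be closed or renamed before they can be restored.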

Improving Query Efficiency by Filtering

Filters are important because they are fast. They do not calculate relevance (skipping the entire scoring phase) and are easily cached.

Usually, when you look for an exact value, you do not need to score the query; you simply want to include or exclude documents. Use a constant_score query to execute the term query in non-scoring mode and apply a uniform score of 1.

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : { 
            "filter" : {
                "term" : { 
                    "city" : "London"
                }
            }
        }
    }
}

Retrieving Large Amounts of Data Through the Scroll API

In the scenario where a large amount of data is returned, the query-then-fetch process supports pagination with the from and size parameters, but within limits. Results are sorted on each shard before being returned. However, with larger from values, the sorting process can become very heavy, using vast amounts of CPU, memory, and bandwidth. For this reason, deep pagination is not recommended.

You can use a scroll query to retrieve large numbers of documents from Elasticsearch efficiently, without affecting system performance. Scrolling allows you to do an initial search and to keep pulling batches of results from Elasticsearch until there are no more results left.
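The scroll flow looks like the following sketch (my_index and the batch size are illustrative). The initial search opens a scroll context that is kept alive for one minute; each follow-up request returns the next batch together with a new _scroll_id:

```
GET /my_index/_search?scroll=1m
{
  "size": 1000,
  "sort": ["_doc"],
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<the _scroll_id returned by the previous response>"
}
```

Sorting by _doc disables scoring and returns documents in the most efficient order for scrolling.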

Differences Between Query and Filter

In general, a filter will outperform a scoring query.

When used in filtering context, the query is said to be a non-scoring or filtering query. That is, the query simply asks the question: Does this document match? The answer is always a simple, binary yes|no.

Typical filtering cases include matching exact values, such as status flags, user IDs, dates, and numbers, where a document either matches or it does not and relevance ranking adds nothing.

When used in a querying context, the query becomes a "scoring" query. Similar to the non-scoring query, this query determines whether a document matches, and also how well it matches. A typical use for a scoring query is to find documents that best match a full-text search and rank them by relevance.
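Both contexts can appear in a single bool query. In the following sketch (the index and field names are illustrative), the match clause runs in query context and scores documents, while the term clause runs in filter context and only includes or excludes them:

```
GET /my_store/_search
{
  "query": {
    "bool": {
      "must": {
        "match": { "description": "really powerful" }
      },
      "filter": {
        "term": { "city": "London" }
      }
    }
  }
}
```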

Checking Whether a Query Is Valid

Queries can become quite complex. Especially when they are combined with different analyzers and field mappings, they can be difficult to follow. You can use the validate-query API to check whether a query is valid.

For example, on the Kibana Console page, run the following command to check whether a query is valid. The request in this example is intentionally malformed.

Versions earlier than 7.x

GET /gb/tweet/_validate/query
{
  "query": {
    "tweet": {
      "match": "really powerful"
    }
  }
}

7.x and later versions

GET /gb/_validate/query
{
  "query": {
    "productName": {
      "match": "really powerful"
    }
  }
}

The response to the preceding validate request tells us that the query is invalid. To find out why it is invalid, add the explain parameter to the query string and execute the following command.

Versions earlier than 7.x

GET /gb/tweet/_validate/query?explain
{
  "query": {
    "tweet": {
      "match": "really powerful"
    }
  }
}

7.x and later versions

GET /gb/_validate/query?explain
{
  "query": {
    "productName": {
      "match": "really powerful"
    }
  }
}

According to the following command output, the name of the field (tweet) has been used in place of a query type such as match, so Elasticsearch cannot recognize the query.

{
  "valid": false,
  "error": "org.elasticsearch.common.ParsingException: no [query] registered for [tweet]"
}

Using the explain parameter has the added advantage of returning a human-readable description of the (valid) query, which helps in understanding exactly how CSS interprets your query.