Elasticsearch’s red status means at least one primary shard (and all of its replicas) is missing. In other words, data is missing: searches will return partial results, and indexing into that shard will return an exception.
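
A quick way to check the overall status from the command line is the cat health API (shown here against the same cluster used later in this post):

curl -XGET 'http://xs333:19201/_cat/health?v'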

X-Pack bug

When I opened the Kibana web UI this morning, I found this page:

No need to panic, since I have handled this many times. There must be some unassigned shards belonging to the X-Pack plugin. Let’s see which indices these are.

curl -XGET http://xs333:19201/_cluster/health?level=indices | json_pp

The output is abbreviated so that we can focus on the monitoring indices only.

{
  "cluster_name": "fusiones-v2",
  "status": "red",
  "timed_out": false,
  "number_of_nodes": 38,
  "number_of_data_nodes": 28,
  "active_primary_shards": 5043,
  "active_shards": 5064,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 165,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 96.84452094090649,
  "indices": {
     ".monitoring-kibana-6-2017.08.04" : {
       "active_shards" : 0,
       "unassigned_shards" : 2,
       "relocating_shards" : 0,
       "number_of_shards" : 1,
       "active_primary_shards" : 0,
       "number_of_replicas" : 1,
       "status" : "red",
       "initializing_shards" : 0
      },
    ".monitoring-es-6-2017.08.04" : {
       "initializing_shards" : 0,
       "status" : "red",
       "number_of_replicas" : 1,
       "active_primary_shards" : 0,
       "number_of_shards" : 1,
       "relocating_shards" : 0,
       "unassigned_shards" : 2,
       "active_shards" : 0
      }
  }
}
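
If the full health output is too long, the cat indices API with a health filter gives a shorter list of just the problem indices (a quick alternative against the same host and port):

curl -XGET 'http://xs333:19201/_cat/indices?v&health=red'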

X-Pack keeps monitoring indices for the most recent 7 days by default, so why does an index from 2017.08.04 still exist and end up unassigned? I can only speculate that this is an X-Pack bug, and delete these red indices.

curl -XDELETE http://xs333:19201/.monitoring-kibana-6-2017.08.04
curl -XDELETE http://xs333:19201/.monitoring-es-6-2017.08.04

After these commands, we can see a normal Kibana page again, like this.

Unfortunately, the status is still red, and there are 165 unassigned shards.
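
We can confirm this from the command line as well, using the same health endpoint as before:

curl -XGET 'http://xs333:19201/_cluster/health?pretty'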

Explain API

GET _cluster/health?level=indices told us which indices are red. To explain the shard allocation for one of them, we can use the cluster allocation explain API:

GET _cluster/allocation/explain

{
  "index": "rtlogindex_2017-10-09-21_part-00008",
  "shard": 0,
  "primary": true
}

Specify the index and shard id of the shard you would like an explanation for, as well as the primary flag to indicate whether to explain the primary shard for the given shard id or one of its replica shards. These three request parameters are required.
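
As a side note, if the request body is omitted, the API picks an arbitrary unassigned shard and explains that one, which is convenient when you only know that something is unassigned:

GET _cluster/allocation/explain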

For primary shard 0 of rtlogindex_2017-10-09-21_part-00008, the request produces the following output:

{
  "index": "rtlogindex_2017-10-09-21_part-00008",
  "shard": 0,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {
    "reason": "INDEX_CREATED",
    "at": "2017-10-09T18:39:56.971Z",
    "last_allocation_status": "no"
  },
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions": [
    {
      "node_id": "SLcx_8U7S-C-5aaS39-iTw",
      "node_name": "xs328-i3",
      "transport_address": "10.34.41.52:39301",
      "node_attributes": {
        "rack": "xs-r1",
        "ml.enabled": "true"
      },
      "node_decision": "no",
      "weight_ranking": 1,
      "deciders": [
        {
          "decider": "filter",
          "decision": "NO",
          "explanation": """node does not match index setting [index.routing.allocation.include] filters [_name:"xs393-i4"]"""
        }
      ]
    }
  ]
}

The explain API found primary shard 0 of rtlogindex_2017-10-09-21_part-00008 to explain:

  1. which is in the unassigned state (see current_state) due to the index having just been created (see unassigned_info).
  2. The shard cannot be allocated (see can_allocate) due to none of the nodes permitting allocation of the shard (see allocate_explanation).
  3. When drilling down to each node’s decision (see node_allocation_decisions), we observe that node xs328-i3 received a decision not to allocate (see node_decision) due to the filter decider (see decider) preventing allocation, with the reason that the node does not match the index setting index.routing.allocation.include (see explanation inside the deciders section). The explanation also contains the exact setting to change to allow the shard to be allocated in the cluster.

As for allocators and deciders: allocators try to find the best nodes to hold the shard, while deciders decide whether allocating the shard to a given node is allowed.
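
Before changing anything, it is worth double-checking the index’s current allocation filter (same console style as the request above):

GET rtlogindex_2017-10-09-21_part-00008/_settings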

Given this reason, we should clear the allocation filter in the index’s settings:

PUT rtlogindex_2017-10-09-21_part-00008/_settings
{
  "index.routing.allocation.include._name": null
}
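
To verify the result, we can ask for the health of just this index:

GET _cluster/health/rtlogindex_2017-10-09-21_part-00008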

Afterwards, this index becomes green.

So far we have fixed the unassigned primary shard of an index; the same approach also works for unassigned replica shards, and the explain API can be used on assigned shards as well.
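
For example, to explain a replica shard instead of the primary, only the primary flag changes (a sketch reusing the same index):

GET _cluster/allocation/explain
{
  "index": "rtlogindex_2017-10-09-21_part-00008",
  "shard": 0,
  "primary": false
}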
