[[health-api]]
=== Health API
++++
<titleabbrev>Health</titleabbrev>
++++

An API that reports the health status of an {es} cluster.

[[health-api-request]]
==== {api-request-title}

`GET /_health_report` +

`GET /_health_report/<indicator>` +

[[health-api-prereqs]]
==== {api-prereq-title}

* If the {es} {security-features} are enabled, you must have the `monitor` or
`manage` <<privileges-list-cluster,cluster privilege>> to use this API.

[[health-api-desc]]
==== {api-description-title}

The health API returns a report with the health status of an {es} cluster. The report
contains a list of indicators that compose {es} functionality.

Each indicator has a health status of `green`, `unknown`, `yellow`, or `red`. The indicator
provides an explanation and metadata describing the reason for its current health status.

The cluster's status is controlled by the worst indicator status.

If an indicator's status is non-green, the indicator result may include a list of impacts
that detail the functionality negatively affected by the health issue.
Each impact carries a severity level, the area of the system that is affected, and a simple
description of the impact on the system.

Some health indicators can determine the root cause of a health problem and prescribe a set of
steps that can be performed to improve the health of the system. The root cause and remediation
steps are encapsulated in a `diagnosis`.
A diagnosis contains a cause detailing a root cause analysis, an action containing a brief description
of the steps to take to fix the problem, the list of affected resources (if applicable), and a detailed
step-by-step troubleshooting guide to fix the diagnosed problem.

NOTE: The health indicators perform root cause analysis of non-green health statuses. This can
be computationally expensive when called frequently. When setting up automated polling of the API
for health status, set `verbose` to `false` to disable the more expensive analysis logic.

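The rule that the cluster's status follows the worst indicator status can be sketched in Python as shown below. This is only an illustration, not {es} code: the severity ordering `green` < `unknown` < `yellow` < `red` is an assumption based on the order in which the statuses are listed above, and the `worst_status` helper and its sample data are hypothetical.

```python
# Assumed severity ordering (lowest to highest); not confirmed by this page.
SEVERITY = {"green": 0, "unknown": 1, "yellow": 2, "red": 3}


def worst_status(indicators):
    """Return the most severe status found among the indicator results."""
    statuses = [result["status"] for result in indicators.values()]
    # An empty report has nothing to aggregate, so treat it as unknown.
    if not statuses:
        return "unknown"
    return max(statuses, key=SEVERITY.__getitem__)


# Hypothetical indicator results shaped like the `indicators` object.
sample = {
    "disk": {"status": "green"},
    "ilm": {"status": "yellow"},
    "shards_availability": {"status": "red"},
}
print(worst_status(sample))  # red
```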
[[health-api-path-params]]
==== {api-path-parms-title}

`<indicator>`::
(Optional, string) Limit the information returned to
a specific indicator. Supported indicators are:
+
--
`master_is_stable`::
Reports health issues regarding
the stability of the node that is seen as the master by the node handling
the health request. If enough master changes are observed in a short period of time,
this indicator aims to diagnose and report back useful information
regarding the cluster formation issues it detects.

`shards_availability`::
Reports health issues regarding shard assignments.

`disk`::
Reports health issues caused by lack of disk space.

`ilm`::
Reports health issues related to
Index Lifecycle Management.

`repository_integrity`::
Tracks repository integrity and reports health issues
that arise if repositories become corrupted.

`slm`::
Reports health issues related to
Snapshot Lifecycle Management.

`shards_capacity`::
Reports health issues related to the shard
capacity of the cluster.
--

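As a small sketch of how a client might build the request path from the optional `<indicator>` parameter, the hypothetical helper below validates the name against the indicators listed above before appending it. The function name and the choice to raise `ValueError` are illustrative assumptions, not part of any {es} client.

```python
# Indicator names as listed in the path parameter description above.
SUPPORTED_INDICATORS = {
    "master_is_stable",
    "shards_availability",
    "disk",
    "ilm",
    "repository_integrity",
    "slm",
    "shards_capacity",
}


def health_report_path(indicator=None):
    """Build the request path, optionally limited to one indicator."""
    if indicator is None:
        return "/_health_report"
    if indicator not in SUPPORTED_INDICATORS:
        raise ValueError(f"unsupported indicator: {indicator!r}")
    return f"/_health_report/{indicator}"


print(health_report_path())        # /_health_report
print(health_report_path("disk"))  # /_health_report/disk
```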
[[health-api-query-params]]
==== {api-query-parms-title}

`verbose`::
(Optional, Boolean) If `true`, the response includes additional details that help explain the status of each non-green indicator.
These details include additional troubleshooting metrics and sometimes a root cause analysis of a health status.
Defaults to `true`.

`size`::
(Optional, integer) The maximum number of affected resources to return.
Because a diagnosis can return multiple types of affected resources, this parameter limits the number of resources returned for each type to the configured value (for example, a diagnosis could return
`1000` affected indices and `1000` affected nodes).
Defaults to `1000`.

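The two query parameters can be encoded with Python's standard library, as in the sketch below. The helper name `health_report_query` is hypothetical; the defaults mirror the documented defaults of `true` and `1000`.

```python
from urllib.parse import urlencode


def health_report_query(verbose=True, size=1000):
    """Encode the `verbose` and `size` query parameters."""
    # Send the Boolean as a lowercase string, like `?verbose=false` in the examples.
    params = {"verbose": "true" if verbose else "false", "size": size}
    return urlencode(params)


# A polling-friendly request that skips the more expensive analysis.
print("/_health_report?" + health_report_query(verbose=False))
# /_health_report?verbose=false&size=1000
```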
[role="child_attributes"]
[[health-api-response-body]]
==== {api-response-body-title}

`cluster_name`::
(string) The name of the cluster.

`status`::
(Optional, string) Health status of the cluster, based on the aggregated status of all indicators
in the cluster. If the health of a specific indicator is being requested, this
top-level status is omitted. Statuses are:

`green`:::
The cluster is healthy.

`unknown`:::
The health of the cluster could not be determined.

`yellow`:::
The functionality of a cluster is in a degraded state and may need remediation
to avoid the health becoming `red`.

`red`:::
The cluster is experiencing an outage or certain features are unavailable for use.

`indicators`::
(object) Information about the health of the cluster indicators.
+
.Properties of `indicators`
[%collapsible%open]
====
`<indicator>`::
(object) Contains health results for an indicator.
+
.Properties of `<indicator>`
[%collapsible%open]
=======
`status`::
(string) Health status of the indicator. Statuses are:

`green`:::
The indicator is healthy.

`unknown`:::
The health of the indicator could not be determined.

`yellow`:::
The functionality of an indicator is in a degraded state and may need remediation
to avoid the health becoming `red`.

`red`:::
The indicator is experiencing an outage or certain features are unavailable for use.

`symptom`::
(string) A message providing information about the current health status.

`details`::
(Optional, object) An object that contains additional information about the cluster that
has led to the current health status result. This data is unstructured, and each
indicator returns <<health-api-response-details, a unique set of details>>. Details are not calculated if the
`verbose` property is set to `false`.

`impacts`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of
impacts that this health status will have on the cluster.
+
.Properties of `impacts`
[%collapsible%open]
========
`severity`::
(integer) How important this impact is to the functionality of the cluster. A value of 1
is the highest severity, with larger values indicating lower severity.

`description`::
(string) A description of the impact on the cluster.

`impact_areas`::
(array of strings) The areas of cluster functionality that this impact affects.
Possible values are:
+
--
* `search`
* `ingest`
* `backup`
* `deployment_management`
--

========

`diagnosis`::
(Optional, array) If a non-healthy status is returned, indicators may include a list of
diagnoses that encapsulate the cause of the health issue and an action to take to remediate the problem.
The diagnosis is not calculated if the `verbose` property is `false`.
+
.Properties of `diagnosis`
[%collapsible%open]
========
`cause`::
(string) A description of a root cause of this health problem.

`action`::
(string) A brief description of the steps that should be taken to remediate the problem.
A more detailed step-by-step guide to remediate the problem is provided by the
`help_url` field.

`affected_resources`::
(Optional, array of strings) If the root cause pertains to multiple resources in the
cluster (such as indices, shards, or nodes), this holds all resources that this
diagnosis applies to.

`help_url`::
(string) A link to the troubleshooting guide for fixing the health problem.
========
=======
====

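To show how the `impacts` and `diagnosis` arrays of a single indicator result might be consumed, the sketch below turns one result into readable lines. The `summarize_indicator` helper and every value in the sample are invented for illustration; only the field names come from the response body described above.

```python
def summarize_indicator(name, result):
    """Return human-readable lines for one indicator result."""
    lines = [f"{name}: {result['status']} - {result.get('symptom', '')}"]
    # A severity of 1 is the highest, so an ascending sort lists the worst impact first.
    for impact in sorted(result.get("impacts", []), key=lambda i: i["severity"]):
        areas = ", ".join(impact["impact_areas"])
        lines.append(
            f"  impact (severity {impact['severity']}, {areas}): {impact['description']}"
        )
    # `diagnosis` is optional, for example when `verbose` is false.
    for diagnosis in result.get("diagnosis", []):
        lines.append(f"  cause: {diagnosis['cause']}")
        lines.append(f"  action: {diagnosis['action']} (see {diagnosis['help_url']})")
    return lines


# Hypothetical result; all values are placeholders.
sample = {
    "status": "red",
    "symptom": "example symptom",
    "impacts": [
        {"severity": 2, "description": "example impact B", "impact_areas": ["search"]},
        {"severity": 1, "description": "example impact A", "impact_areas": ["ingest"]},
    ],
    "diagnosis": [
        {
            "cause": "example cause",
            "action": "example action",
            "help_url": "https://example.invalid/help",
        },
    ],
}
for line in summarize_indicator("shards_availability", sample):
    print(line)
```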
[role="child_attributes"]
[[health-api-response-details]]
==== Indicator Details

Each health indicator in the health API returns a set of details that further explains the state of the system. The
details have contents and a structure that is unique to each indicator.

[[health-api-response-details-master-is-stable]]
===== master_is_stable

`current_master`::
(object) Information about the currently elected master.
+
.Properties of `current_master`
[%collapsible%open]
====
`node_id`::
(string) The node id of the currently elected master, or null if no master is elected.

`name`::
(string) The node name of the currently elected master, or null if no master is elected.
====

`recent_masters`::
(Optional, array) A list of nodes that have been elected or replaced as master in a recent
time window. This field is present if the master
is changing rapidly enough to cause problems, and also present as additional information
when the indicator is `green`. This array includes only elected masters, and does _not_
include empty entries for periods when there was no elected master.
+
.Properties of `recent_masters`
[%collapsible%open]
====
`node_id`::
(string) The node id of a recently active master node.

`name`::
(string) The node name of a recently active master node.
====

`exception_fetching_history`::
(Optional, object) If the node being queried sees that the elected master has stepped down
repeatedly, the master history is requested from the most recently elected master node for
diagnosis purposes. If fetching this remote history fails, the exception information is
returned in this detail field.
+
.Properties of `exception_fetching_history`
[%collapsible%open]
====
`message`::
(string) The exception message for the failed history fetch operation.

`stack_trace`::
(string) The stack trace for the failed history fetch operation.
====

`cluster_formation`::
(Optional, array) If there has been no elected master node recently, the node being queried attempts to
gather information about why the cluster has been unable to form, or why the node being queried has been
unable to join the cluster if it has formed. This array could contain an entry for each master-eligible
node's view of cluster formation.
+
.Properties of `cluster_formation`
[%collapsible%open]
====
`node_id`::
(string) The node id of a master-eligible node.

`name`::
(Optional, string) The node name of a master-eligible node.

`cluster_formation_message`::
(string) A detailed description explaining what went wrong with cluster formation, or why this node was
unable to join the cluster if it has formed.
====

[[health-api-response-details-shards-availability]]
===== shards_availability

`unassigned_primaries`::
(int) The number of primary shards that are unassigned for reasons other than initialization or relocation.

`initializing_primaries`::
(int) The number of primary shards that are initializing or recovering.

`creating_primaries`::
(int) The number of primary shards that are unassigned because they have been very recently created.

`restarting_primaries`::
(int) The number of primary shards that are relocating because of a node shutdown operation.

`started_primaries`::
(int) The number of primary shards that are active and available on the system.

`unassigned_replicas`::
(int) The number of replica shards that are unassigned for reasons other than initialization or relocation.

`initializing_replicas`::
(int) The number of replica shards that are initializing or recovering.

`restarting_replicas`::
(int) The number of replica shards that are relocating because of a node shutdown operation.

`started_replicas`::
(int) The number of replica shards that are active and available on the system.

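When reading these counters programmatically, it can help to surface only the ones that describe shards not yet started. The sketch below does that with a hypothetical `non_started_counters` helper; the sample numbers are invented, and the field names are those listed above.

```python
def non_started_counters(details):
    """Return the non-zero counters for shards that are not in the started state."""
    return {
        key: count
        for key, count in details.items()
        if not key.startswith("started_") and count > 0
    }


# Hypothetical `details` object for the shards_availability indicator.
sample = {
    "unassigned_primaries": 0,
    "initializing_primaries": 1,
    "creating_primaries": 0,
    "restarting_primaries": 0,
    "started_primaries": 12,
    "unassigned_replicas": 3,
    "initializing_replicas": 0,
    "restarting_replicas": 0,
    "started_replicas": 9,
}
print(non_started_counters(sample))
# {'initializing_primaries': 1, 'unassigned_replicas': 3}
```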
[[health-api-response-details-disk]]
===== disk

`indices_with_readonly_block`::
(int) The number of indices on which the system enforced a read-only index block (`index.blocks.read_only_allow_delete`)
because the cluster is running out of space.

`nodes_with_enough_disk_space`::
(int) The number of nodes that have enough available disk space to function.

`nodes_over_high_watermark`::
(int) The number of nodes that are running low on disk space and are likely to run out. Their disk usage
has tripped the <<cluster-routing-watermark-high, high watermark threshold>>.

`nodes_over_flood_stage_watermark`::
(int) The number of nodes that have run out of disk space. Their disk usage has tripped the <<cluster-routing-flood-stage, flood stage
watermark threshold>>.

`unknown_nodes`::
(int) The number of nodes for which it was not possible to determine their disk health.

[[health-api-response-details-repository-integrity]]
===== repository_integrity

`total_repositories`::
(Optional, int) The number of currently configured repositories on the system. If there are no repositories
configured then this detail is omitted.

`corrupted_repositories`::
(Optional, int) The number of repositories on the system that have been determined to be corrupted. If there are
no corrupted repositories detected, this detail is omitted.

`corrupted`::
(Optional, array of strings) If corrupted repositories have been detected in the system, the names of up to ten of
them are displayed in this field. If no corrupted repositories are found, this detail is omitted.

[[health-api-response-details-ilm]]
===== ilm

`ilm_status`::
(string) The current status of the Index Lifecycle Management feature. Either `STOPPED`, `STOPPING`, or `RUNNING`.

`policies`::
(int) The number of index lifecycle policies that the system is managing.

`stagnating_indices`::
(int) The number of indices managed by {ilm} that have been stagnant longer than expected.

`stagnating_indices_per_action`::
(Optional, map) Summary of the number of indices, grouped by action, that have been stagnant longer than
expected.
+
.Properties of `stagnating_indices_per_action`
[%collapsible%open]
=======
`downsample`::
(int) The number of stagnant indices in the `downsample` action.

`allocate`::
(int) The number of stagnant indices in the `allocate` action.

`shrink`::
(int) The number of stagnant indices in the `shrink` action.

`searchable_snapshot`::
(int) The number of stagnant indices in the `searchable_snapshot` action.

`rollover`::
(int) The number of stagnant indices in the `rollover` action.

`forcemerge`::
(int) The number of stagnant indices in the `forcemerge` action.

`delete`::
(int) The number of stagnant indices in the `delete` action.

`migrate`::
(int) The number of stagnant indices in the `migrate` action.

=======

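The optional `stagnating_indices_per_action` map lends itself to a quick "where are indices stuck" check. The following sketch picks the action with the most stagnant indices; the `most_stagnant_action` helper and the sample values are hypothetical.

```python
def most_stagnant_action(details):
    """Return (action, count) for the action with the most stagnant indices, or None."""
    # The map is optional, so fall back to an empty dict when it is missing.
    per_action = details.get("stagnating_indices_per_action") or {}
    if not per_action:
        return None
    action = max(per_action, key=per_action.get)
    return action, per_action[action]


# Hypothetical `details` object for the ilm indicator.
sample = {
    "ilm_status": "RUNNING",
    "policies": 4,
    "stagnating_indices": 5,
    "stagnating_indices_per_action": {"rollover": 3, "shrink": 2, "delete": 0},
}
print(most_stagnant_action(sample))  # ('rollover', 3)
```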
[[health-api-response-details-slm]]
===== slm

`slm_status`::
(string) The current status of the Snapshot Lifecycle Management feature. Either `STOPPED`, `STOPPING`, or `RUNNING`.

`policies`::
(int) The number of snapshot policies that the system is managing.

`unhealthy_policies`::
(map) A detailed view of the policies that are considered unhealthy due to having
several consecutive unsuccessful invocations.
The `count` key represents the number of unhealthy policies (int).
The `invocations_since_last_success` key reports a map where the unhealthy policy
name is the key and its corresponding number of failed invocations is the value.

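As an illustration of the `unhealthy_policies` structure, the sketch below ranks policies by their number of failed invocations. The `failing_policies` helper, its `min_failures` parameter, and the policy names are assumptions made for this example.

```python
def failing_policies(details, min_failures=1):
    """List (policy, failed_invocations) pairs, most failures first."""
    unhealthy = details.get("unhealthy_policies", {})
    failures = unhealthy.get("invocations_since_last_success", {})
    ranked = sorted(failures.items(), key=lambda item: item[1], reverse=True)
    return [(name, count) for name, count in ranked if count >= min_failures]


# Hypothetical `details` object for the slm indicator.
sample = {
    "slm_status": "RUNNING",
    "policies": 3,
    "unhealthy_policies": {
        "count": 2,
        "invocations_since_last_success": {"nightly": 5, "weekly": 7},
    },
}
print(failing_policies(sample))  # [('weekly', 7), ('nightly', 5)]
```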
[[health-api-response-details-shards-capacity]]
===== shards_capacity

`data`::
(map) A view with information about the current capacity of shards for data nodes that do not belong to the frozen tier.
+
.Properties of `data`
[%collapsible%open]
=====
`max_shards_in_cluster`::
(int) Indicates the maximum number of shards that the cluster can hold.

`current_used_shards`::
(Optional, int) The total number of shards held by the cluster. Only displayed if the indicator's status is `red` or `yellow`.

=====

`frozen`::
(map) A view with information about the current capacity of shards for data nodes that belong to the frozen tier.
+
.Properties of `frozen`
[%collapsible%open]
=====
`max_shards_in_cluster`::
(int) Indicates the maximum number of shards the cluster can hold for the partially mounted indices.

`current_used_shards`::
(Optional, int) The total number of shards the partially mounted indices have in the cluster. Only displayed if the indicator's status is `red` or `yellow`.

=====

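Because `current_used_shards` is only reported for `red` or `yellow` statuses, code that computes shard usage has to tolerate its absence. The sketch below shows one way to do so; the `shard_usage` helper and the sample numbers are hypothetical.

```python
def shard_usage(details, group="data"):
    """Return used/max as a fraction for a node group, or None if not reported."""
    view = details.get(group, {})
    used = view.get("current_used_shards")  # optional field
    maximum = view.get("max_shards_in_cluster")
    if used is None or not maximum:
        return None
    return used / maximum


# Hypothetical `details` object for the shards_capacity indicator.
sample = {
    "data": {"max_shards_in_cluster": 1000, "current_used_shards": 990},
    "frozen": {"max_shards_in_cluster": 3000},
}
print(shard_usage(sample, "data"))    # 0.99
print(shard_usage(sample, "frozen"))  # None
```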
[[health-api-example]]
==== {api-examples-title}

[source,console]
--------------------------------------------------
GET _health_report
--------------------------------------------------

The API returns a response with all the indicators regardless
of current status.

[source,console]
--------------------------------------------------
GET _health_report/shards_availability
--------------------------------------------------

The API returns a response for just the shard availability indicator.

[source,console]
--------------------------------------------------
GET _health_report?verbose=false
--------------------------------------------------

The API returns a response with all health indicators but does
not calculate details or root cause analysis for the response. This is helpful
if you would like to monitor the health API and do not want the overhead of
calculating additional troubleshooting details on each call.