Health Rules

Health rules let you specify the parameters that represent what you consider normal or expected operations for your environment, for example, the CPU Utilization for a host. You can create health rules to monitor one entity or a group of entities. See Create Health Rules to Monitor Entities or a Group of Entities.

View Health Rules

To view the list of configured health rules, click Configure > Health Rules. The number in brackets indicates the total number of configured health rules. The Description column includes the health rollup details. This list also presents other health rule details such as:

Health rule name
Entity type monitored by the health rule
Description
Number of monitored entities
Actions linked to the health rule
Health rule creation date and time
Health rule last edited date and time
Health rule evaluation status

Use the search box to search for a health rule by name. Use the check box in the first column to select health rules to:

Enable health evaluation for all health rules
Delete one or more health rules
Make a copy of a health rule

Enable or Disable Health Rules

In the Health Rules list, use the toggle button to enable or disable the evaluation of health rules. You can enable or disable the evaluation of a single health rule or of all the health rules associated with the entity. Once you disable a health rule, its evaluation is suspended until you re-enable it.

A health rule does not appear on the list of health rules on the <entity> details page if it is disabled. View the disabled health rules on the Health Rules list.

View Status of Monitored Objects

In the Health Rules list, click the number of entities in the Monitored Entities column to view the list of all monitored objects associated with a health rule. The health status of each monitored object displays next to the object.

A health rule can evaluate a maximum of 100,000 entities. If the number of monitored entities exceeds this limit, the health rule stops evaluating.

Add or Update Actions Triggered by a Health Violation

In the Health Rules list, click the Link Action associated with the health rule. The Edit Health Rule wizard appears. Add or update actions as required. These actions are triggered when the health rule is violated. See Edit a Health Rule.

You must create an action before you link it to the health rule. See Actions.

Health Violation

A health violation occurs when the performance of an entity being monitored by the health rule violates the conditions set by the rule. The health statuses are represented as critical, warning, normal, NA, and unknown. See View Details of Violation Health Rule.

A health violation event occurs when the health status of an entity changes. Examples of health violation events are:

Violation Started: Warning
Violation Started: Critical
Violation Upgraded: Warning to Critical
Violation Downgraded: Critical to Warning
Violation Continues: Warning
Violation Updated: Warning
Violation Continues: Critical
Violation Updated: Critical
Violation Ended: Critical to Normal
Violation Ended: Warning to Normal
Violation Cancelled: Warning
Violation Cancelled: Critical
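
The naming pattern in the list above can be sketched as a mapping from a status change to an event name. This is only an illustrative sketch: the function name violation_event and the severity ordering are assumptions, only the Normal, Warning, and Critical statuses are handled, and the Updated and Cancelled events are not modelled because the conditions that produce them are not described here.

```python
from typing import Optional

# Assumed severity order, used only for this illustration.
SEVERITY = {"Normal": 0, "Warning": 1, "Critical": 2}


def violation_event(previous: str, current: str) -> Optional[str]:
    """Return the event name for a status transition, or None if no violation is involved."""
    prev, curr = SEVERITY[previous], SEVERITY[current]
    if prev == 0 and curr == 0:
        return None  # the entity was and remains healthy
    if prev == 0:
        return f"Violation Started: {current}"
    if curr == 0:
        return f"Violation Ended: {previous} to Normal"
    if curr > prev:
        return f"Violation Upgraded: {previous} to {current}"
    if curr < prev:
        return f"Violation Downgraded: {previous} to {current}"
    return f"Violation Continues: {current}"


print(violation_event("Warning", "Critical"))
```

For example, a change from Critical to Warning maps to Violation Downgraded: Critical to Warning, and an unchanged Warning status maps to Violation Continues: Warning.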

The health statuses of entities and health violations are displayed on the Observe page. You can trigger an action, such as an HTTP Request Action, for a health violation.

Entity Types and Entities

You can choose the entity types that a health rule monitors. Entities are instances of entity types. For example, k8s:pod (Kubernetes namespace) is an entity type, while o2-k8s-monitoring-appdynamics-otel-collector-lg29c (Kubernetes pod instance) is an instance of k8s:pod. Note that you define a health rule for an entity type, but the health rule is evaluated at the entity level.

You can monitor the performance of an individual entity or group the entities and monitor their aggregate performance. See Health Evaluation.

Filters

You can filter the list of entities by attributes (A), tags (T), or parent entity type to fetch the entities you want to monitor. Depending on the entity you select, a list of attribute keys, tags, and their values is available for you to choose from.

The attributes are the properties associated with the entities. The tags are the metadata that enhance organization and discoverability of resources in a highly dynamic and complex cloud environment. Using attributes and tags, you can quickly identify the entities and troubleshoot the root cause of issues.

For each attribute and tag, you can select multiple values from the list. The filter expression returns true if at least one of the values selected for each argument (attribute or tag) is true.

You can also select multiple attributes, tags, or a combination of attributes and tags and apply the filter. The filter expression returns true if all the arguments (selected attributes and tags) are true.
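
The matching logic described above (any selected value within one attribute or tag, all selected attributes and tags together) can be sketched as follows. This is a minimal sketch, not the product's implementation: the function name matches, the flat dictionary representation of an entity, and the keys namespace and team are all hypothetical.

```python
def matches(entity: dict, selected: dict) -> bool:
    """True if, for every selected key, the entity's value is one of the selected values.

    `entity` maps attribute or tag keys to a single value.
    `selected` maps each chosen key to the set of values picked for it.
    """
    # OR across the values of one key (membership test), AND across keys (all).
    return all(entity.get(key) in values for key, values in selected.items())


# Hypothetical entities and filter selection, for illustration only.
pods = [
    {"name": "pod-a", "namespace": "prod", "team": "payments"},
    {"name": "pod-b", "namespace": "staging", "team": "search"},
    {"name": "pod-c", "namespace": "dev", "team": "payments"},
    {"name": "pod-d", "namespace": "staging", "team": "payments"},
]
selected = {"namespace": {"prod", "staging"}, "team": {"payments"}}

matched = [pod["name"] for pod in pods if matches(pod, selected)]
print(matched)  # ['pod-a', 'pod-d']
```

Here pod-b is excluded because its team value is not selected, and pod-c is excluded because its namespace value is not selected.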

Similarly, a list of parent entities is also available to choose from. You can narrow down the criteria further by specifying the operators.

You can filter the entities by the parent entity only if you have configured the aggregated (rollup) health evaluation for the entities when defining the entity details for a health rule.

Health Evaluation

The performance or the health of the entity is evaluated at:

The individual entity level (granular): alerts are triggered based on the performance of a single entity, for example, a service instance. See Entity Types and Entities.
The parent entity level (aggregation of a group of entities): alerts are triggered based on the aggregate performance of a group of entities, such as service instances grouped by the parent service. Aggregate performance is calculated based on the metric you select. For example, alerts are triggered only if the performance of all the grouped entities deteriorates, that is, the performance of all the service instances within the service deteriorates.

If entities in a configured aggregated (rollup) health evaluation path are missing, the health rule is not evaluated.

Health Rule Evaluation Schedule

You can ingest metrics defined for your environment and start evaluating the health rules using those metrics. The health rules are evaluated every minute. By default, all health rules are enabled.

Health Rule Wait Time After Violation

The health rule Wait Time After Violation setting enables you to control how often a violation event is generated while the conditions that violate a health rule continue. If the health rule is violated, with a status of either Critical or Warning, a Violation Open: Critical or Violation Open: Warning event is generated. This event is used to initiate any required actions.

Once an Open event has occurred, the status of the health rule is evaluated every minute. If the same violation is detected, the violation remains open with the same status. A corresponding Violation Continues: Critical or Violation Continues: Warning event may be generated.

A Violation Continues event every minute might be too noisy for your health rule. The health rule's Wait Time after Violation setting is used to throttle how often these Continues events are generated for continuing health rule violations. The default is every 30 minutes.

To use Violation Continues Critical and Violation Continues Warning events, adjust the default Wait Time after Violation value to the desired frequency.
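
The throttling effect of this setting can be sketched with a small simulation. It assumes one evaluation per minute, a violation that persists with an unchanged status, and one Continues event each time the wait time has elapsed since the previous event. The function name continues_event_minutes is hypothetical, and the exact timing used by the product may differ.

```python
def continues_event_minutes(duration: int, wait_time: int = 30) -> list:
    """Minutes after the Open event (minute 0) at which a Continues event is emitted."""
    emitted = []
    last_event = 0  # the Open event itself
    for minute in range(1, duration + 1):  # one evaluation per minute
        if minute - last_event >= wait_time:
            emitted.append(minute)
            last_event = minute
    return emitted


print(continues_event_minutes(90))          # [30, 60, 90]
print(len(continues_event_minutes(90, 1)))  # 90
```

With the default of 30 minutes, a violation that lasts 90 minutes produces three Continues events. With a wait time of 1 minute, it produces one per evaluation.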

The violations displayed in the Entity Health Timeline section are updated only when a health rule violation event is triggered.

If the health rule is not evaluated for some reason, for example, if a pod stops reporting, the Evaluation Status of the health rule is shown as a gray question mark or Unknown in the Current Evaluation Status tab in the right panel of the Health Rules list. The current violation event remains open until the Wait Time after Violation period has elapsed. At that point, the violation event is closed and a new event is triggered, which causes the health status of the rule to display as Unknown.

Health Evaluation for Delayed Metrics

Metrics ingested from services such as Amazon CloudWatch are often delayed at the source itself. Sometimes the delay is too long to be ignored. This leads to a gap between the observed timestamp and the ingested timestamp, meaning there could be a delay between an event occurring in your environment and it being reported on the Cisco Cloud Observability UI. This delay depends on the Splunk AppDynamics namespace. The following table provides the maximum delay you can expect for metrics to be reported for various Splunk AppDynamics namespaces:

Namespace	Approximate Time Delay
K8	4-5 mins
APM	4-5 mins
CNI	10-15 mins
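
As a rough illustration of how these delays might be used, the sketch below computes the latest time at which a metric observed at a given moment would be expected to appear. The upper bound of each range in the table is used as the maximum delay. The names MAX_DELAY_MINUTES and latest_expected_report are hypothetical.

```python
from datetime import datetime, timedelta

# Upper bounds of the ranges in the table above, in minutes.
MAX_DELAY_MINUTES = {"K8": 5, "APM": 5, "CNI": 15}


def latest_expected_report(observed_at: datetime, namespace: str) -> datetime:
    """Latest time by which a metric observed at `observed_at` is expected to be reported."""
    return observed_at + timedelta(minutes=MAX_DELAY_MINUTES[namespace])


observed = datetime(2024, 1, 1, 12, 0)
print(latest_expected_report(observed, "CNI"))  # 2024-01-01 12:15:00
```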

Health Rollup

You can aggregate (rollup) the health of the entities to define the health of a group of entities. This means a child entity can define the health of a parent entity by rollup relationship. You can define the health rollup relationship between the entities and the parents.
For example, you can define the health rollup relationship between namespace and cluster as illustrated in this image; if 40% of the namespaces are reported unhealthy, the cluster health should also be reported as unhealthy:

However, namespaces within a cluster can have different health statuses, such as healthy, warning, or critical. In the preceding image, if 40% of the namespaces are unhealthy with a status of critical, then the cluster health is reported as critical. Note that the warning count also includes the namespaces in the critical state. The following table illustrates the various health statuses of a cluster with 10 namespaces:

Total number of namespaces in the cluster	Number of namespaces in critical state	Number of namespaces in warning state	Cluster health
10	3	2	Critical count = 3; Warning count = (3 + 2) = 5; 40% of 10 = 4; Cluster health is Warning
10	2	3	Critical count = 2; Warning count = (2 + 3) = 5; 40% of 10 = 4; Cluster health is Warning
10	4	4	Critical count = 4; Warning count = (4 + 4) = 8; 40% of 10 = 4; Cluster health is Critical
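
The arithmetic in the table above can be sketched as a short function. This is a minimal sketch, not the product's implementation. The function name cluster_health is hypothetical, the "at least" comparison against the threshold is inferred from the last row (a critical count of 4 against a threshold of 4 yields Critical), and Normal as the result when neither count reaches the threshold is an assumption.

```python
def cluster_health(total: int, critical: int, warning: int, threshold_pct: float = 40.0) -> str:
    """Roll up namespace health into cluster health using a percentage threshold."""
    threshold = total * threshold_pct / 100  # for example, 40% of 10 = 4
    critical_count = critical
    warning_count = critical + warning  # critical namespaces also count toward warning
    if critical_count >= threshold:
        return "Critical"
    if warning_count >= threshold:
        return "Warning"
    return "Normal"  # assumed label for a healthy cluster


print(cluster_health(10, 3, 2))  # Warning
print(cluster_health(10, 2, 3))  # Warning
print(cluster_health(10, 4, 4))  # Critical
```

The critical check runs first so that a cluster meeting both thresholds is reported with the more severe status.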