Health rules let you specify the parameters that represent what you consider normal or expected operations for your environment, for example, the CPU Utilization for a host. You can create health rules to monitor one entity or a group of entities. See Create Health Rules to Monitor Entities or a Group of Entities.
View Health Rules
To view the list of configured health rules, click Configure > Health Rules. The number in brackets indicates the total number of configured health rules. The Description column includes the health rollup details. This list also presents other health rule details such as:
- Health rule name
- Entity type monitored by the health rule
- Number of monitored entities
- Actions linked to the health rule
- Health rule creation date and time
- Health rule last edited date and time
- Health rule evaluation status
Use the search box to search for a health rule by name. Use the check box in the first column to select health rules to:
- Enable health evaluation for all health rules
- Delete one or more health rules
- Make a copy of a health rule
A health violation occurs when the performance of an entity being monitored by the health rule violates the conditions set by the rule. The health statuses are represented as critical, warning, normal, NA, and unknown. See View Details of Violation Health Rule.
A health violation event occurs when the health status of an entity changes. Examples of health violation events are:
- Violation Started: Warning
- Violation Started: Critical
- Violation Upgraded: Warning to Critical
- Violation Downgraded: Critical to Warning
- Violation Continues: Warning
- Violation Updated: Warning
- Violation Continues: Critical
- Violation Updated: Critical
- Violation Ended: Critical to Normal
- Violation Ended: Warning to Normal
- Violation Cancelled: Warning
- Violation Cancelled: Critical
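The event names above follow from comparing the previous and current health status of an entity. The following Python sketch illustrates that mapping. The function name `violation_event` and the severity ordering are assumptions made for illustration, not part of the product. The Updated and Cancelled events are omitted because the text does not state what triggers them.

```python
from typing import Optional

# Assumed severity ordering, used only for this illustration.
SEVERITY = {"Normal": 0, "Warning": 1, "Critical": 2}


def violation_event(previous: str, current: str) -> Optional[str]:
    """Return an event name for a status transition, or None if no violation is involved."""
    prev, cur = SEVERITY[previous], SEVERITY[current]
    if prev == 0 and cur == 0:
        return None  # entity was healthy and stays healthy
    if prev == 0:
        return f"Violation Started: {current}"
    if cur == 0:
        return f"Violation Ended: {previous} to Normal"
    if cur > prev:
        return f"Violation Upgraded: {previous} to {current}"
    if cur < prev:
        return f"Violation Downgraded: {previous} to {current}"
    return f"Violation Continues: {current}"


print(violation_event("Normal", "Warning"))
print(violation_event("Warning", "Critical"))
print(violation_event("Critical", "Normal"))
```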
The health statuses of entities and health violations are displayed on the Observe page. You can trigger an action, such as an HTTP Request Action, for a health violation.
Entity Types and Entities
You can choose the entity types that a health rule monitors. Entities are instances of entity types. For example, k8s:pod (Kubernetes pod) is an entity type, while o2-k8s-monitoring-appdynamics-otel-collector-lg29c (Kubernetes pod instance) is an instance of k8s:pod. Note that you define a health rule for an entity type, but the health rule is evaluated at the entity level.
You can monitor the performance of an individual entity or group the entities and monitor their aggregate performance. See Health Evaluation.
You filter the list of entities based on attributes (A), tags (T), or parent entity type to fetch the desired entity to monitor. Depending on the entity you select, a list of attribute keys, tags, and their values is available for you to choose from.
The attributes are the properties associated with the entities. The tags are the metadata that enhance organization and discoverability of resources in a highly dynamic and complex cloud environment. Using attributes and tags, you can quickly identify the entities and troubleshoot the root cause of issues.
For each attribute and tag, you can select multiple values from the list. The filter expression returns true if at least one of the values selected for each argument (attribute or tag) is true.
You can also select multiple attributes, tags, or a combination of attributes and tags, and then apply the filter. The filter expression returns true if all the arguments (selected attributes and tags) are true.
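The filter semantics described here (OR across the values selected for one attribute or tag, AND across the selected attributes and tags) can be sketched in Python as follows. This is an illustration only: the function `matches_filter`, the entity dictionaries, and the keys `k8s.namespace.name` and `team` are hypothetical and do not reflect the product's API or data model.

```python
from typing import Dict, Set


def matches_filter(entity: Dict[str, str], selected: Dict[str, Set[str]]) -> bool:
    """Return True if the entity satisfies every selected attribute or tag.

    Within one key, any selected value may match (OR).
    Across keys, every key must match (AND).
    """
    return all(entity.get(key) in values for key, values in selected.items())


# Hypothetical entities with made-up attribute and tag keys.
entities = [
    {"name": "pod-a", "k8s.namespace.name": "prod", "team": "payments"},
    {"name": "pod-b", "k8s.namespace.name": "staging", "team": "payments"},
    {"name": "pod-c", "k8s.namespace.name": "prod", "team": "search"},
]

selected = {
    "k8s.namespace.name": {"prod", "staging"},  # either value is acceptable
    "team": {"payments"},                       # and the team must be payments
}

matched = [e["name"] for e in entities if matches_filter(e, selected)]
print(matched)  # ['pod-a', 'pod-b']
```

An entity that lacks a selected key does not match, because `entity.get(key)` returns `None`, which is not among the selected values.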
Similarly, a list of parent entities is also available to choose from. You can narrow down the criteria further by specifying the operators.
You can filter the entities by the parent entity only if you have configured the aggregated (rollup) health evaluation for the entities when defining the entity details for a health rule.
The performance or the health of the entity is evaluated at:
- The individual entity level (granular): the alerts are triggered based on the performance of a single entity, for example, a service instance. See Entity Types and Entities.
- The parent entity level (aggregation of a group of entities): the alerts are triggered based on the aggregate performance of a group of entities, such as service instances grouped by the parent service. Aggregate performance is calculated based on the metric you select. For example, for average response time, alerts are triggered only if the performance of all the grouped entities deteriorates, that is, the performance of all the service instances within the service deteriorates.
If entities in a configured aggregated (rollup) health evaluation path are missing, the health rule is not evaluated.
Health Rule Evaluation Schedule
You can ingest metrics defined for your environment and start evaluating the health rules using those metrics. The health rules are evaluated every minute. By default, all health rules are enabled.
Health Rule Wait Time After Violation
The health rule Wait Time After Violation setting enables you to control how often a violation event is generated while the conditions that violate a health rule continue. If the health rule is violated, with a status of either Critical or Warning, a Violation Open: Critical or Violation Open: Warning event is generated. This event is used to initiate any required actions.
Once an Open event has occurred, the status of the health rule is evaluated every minute. If the same violation is detected, the violation remains open with the same status. A corresponding Violation Continues: Critical or Violation Continues: Warning event may be generated.
A Violation Continues event every minute might be too noisy for your health rule. The health rule's Wait Time after Violation setting is used to throttle how often these Continues events are generated for continuing health rule violations. The default is every 30 minutes.
To use Violation Continues: Critical and Violation Continues: Warning events, adjust the default Wait Time after Violation value to the desired frequency.
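The throttling effect of Wait Time after Violation can be sketched as a small simulation. This is a simplified model under stated assumptions, not the product's implementation: the function `throttle_events` is hypothetical, it takes one evaluated status per minute, and it treats any change of status as a new Open event rather than modelling Upgraded or Downgraded events.

```python
from typing import List, Optional, Tuple


def throttle_events(statuses: List[str], wait_minutes: int = 30) -> List[Tuple[int, str]]:
    """Simulate per-minute evaluation and return (minute, event name) pairs.

    Assumption: a Continues event is emitted only when at least
    `wait_minutes` have passed since the last event for the open violation.
    """
    events: List[Tuple[int, str]] = []
    open_status: Optional[str] = None
    last_event_minute = 0
    for minute, status in enumerate(statuses):
        if status not in ("Warning", "Critical"):
            open_status = None  # no violation in this minute
            continue
        if status != open_status:
            events.append((minute, f"Violation Open: {status}"))
            open_status, last_event_minute = status, minute
        elif minute - last_event_minute >= wait_minutes:
            events.append((minute, f"Violation Continues: {status}"))
            last_event_minute = minute
    return events


# A violation that stays Critical for 65 one-minute evaluations.
events = throttle_events(["Critical"] * 65, wait_minutes=30)
for minute, name in events:
    print(minute, name)
```

With the default of 30 minutes, the 65-minute violation produces one Open event and two Continues events (at minutes 30 and 60) instead of one event per minute.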
The violations displayed in the Health Violations section are updated only when a health rule violation event is triggered.
If the health rule is not evaluated for some reason (for example, if a pod stops reporting), the Evaluation Status of the health rule is marked as a gray question mark or Unknown in the Current Evaluation Status tab in the right panel of the health rules list. The current violation event remains open until the Wait Time after Violation period has elapsed. At that point, the violation event is closed and a new event is triggered, causing the health status of the rule itself to display as Unknown.
Health Evaluation for Delayed Metrics
Metrics ingested from services such as Amazon CloudWatch are often delayed at the source itself. Sometimes the delay is too long to be ignored. This leads to a time gap between the observed timestamp and the ingested timestamp, meaning there can be a delay between an event occurring in your environment and it being reported on the AppDynamics Cloud UI. This delay depends on the AppDynamics namespace. The following table lists the maximum delay you can expect before metrics are reported for various AppDynamics namespaces:
|Namespace||Approximate Time Delay|
You can aggregate (rollup) the health of the entities to define the health of a group of entities. This means a child entity can define the health of a parent entity by rollup relationship. You can define the health rollup relationship between the entities and the parents.
For example, you can define the health rollup relationship between namespace and cluster as illustrated in this image; if 40% of the namespaces are reported unhealthy, the cluster health should also be reported as unhealthy:
However, namespaces within a cluster can have different health statuses such as healthy, warning, or critical. In the preceding image, if 40% of the namespaces are unhealthy with the status critical, then the cluster health is reported as critical. Note that when the warning count is calculated, the namespaces in the critical state are also included in it. The following table illustrates the various health statuses of a cluster with 10 namespaces:
|Total number of namespaces in the cluster||Number of namespaces in critical state||Number of namespaces in warning state||Cluster health|
|10||3||2||Critical count = 3; Warning count = (3 + 2) = 5; 40% of 10 = 4; Cluster health is Warning|
|10||2||3||Critical count = 2; Warning count = (2 + 3) = 5; 40% of 10 = 4; Cluster health is Warning|
|10||4||4||Critical count = 4; Warning count = (4 + 4) = 8; 40% of 10 = 4; Cluster health is Critical|
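The rollup arithmetic in the table can be reproduced with a short Python sketch. The function `rollup_health` is hypothetical and written only to illustrate the calculation. It assumes that a status applies when its count reaches or exceeds the threshold (which matches the row where 4 critical namespaces meet the 40% threshold of 4), and that the cluster is Normal when neither count reaches it.

```python
def rollup_health(total: int, critical: int, warning: int,
                  threshold_percent: float = 40.0) -> str:
    """Return the rolled-up health of a parent from the states of its children."""
    threshold = total * threshold_percent / 100  # 40% of 10 = 4
    critical_count = critical
    # Children in the critical state are also included in the warning count.
    warning_count = critical + warning
    if critical_count >= threshold:
        return "Critical"
    if warning_count >= threshold:
        return "Warning"
    return "Normal"  # assumption: below both thresholds means healthy


# The three rows of the table: (total, critical, warning).
for row in [(10, 3, 2), (10, 2, 3), (10, 4, 4)]:
    print(row, rollup_health(*row))
```

Checking the critical count first means a cluster that qualifies for both statuses is reported with the more severe one.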