A condition consists of a boolean statement that compares the current value of a metric against one or more static or dynamic thresholds based on a selected baseline. If the condition is true, the health rule violates. You can configure the rules for evaluating a condition using multiple thresholds.
Static thresholds are straightforward. For example, is the Memory Utilization for a pod greater than 80%? If it is, the condition evaluates as true and the health rule violates. You can also select the source from which you want to query the data. The health evaluation varies depending on the data source you choose because metrics from different sources have different granularity and properties.
Dynamic thresholds are based on a percentage in relation to, or a standard deviation from, a baseline built on a rolled-up baseline trend pattern.
You can define a threshold for a health rule based on a single metric value or on a mathematical expression built from multiple metric values.
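The two threshold types can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the function names are hypothetical, and using a plain mean and population standard deviation over a list of samples as the "baseline" is an assumption made for the example.

```python
from statistics import mean, pstdev

def violates_static(value, threshold):
    """Static threshold: true when the value exceeds a fixed limit."""
    return value > threshold

def violates_baseline_percent(value, baseline_values, percent):
    """Dynamic threshold: true when the value is more than `percent` above the baseline mean."""
    baseline = mean(baseline_values)
    return value > baseline * (1 + percent / 100)

def violates_baseline_stddev(value, baseline_values, n_stddev):
    """Dynamic threshold: true when the value is more than `n_stddev`
    standard deviations above the baseline mean."""
    baseline = mean(baseline_values)
    return value > baseline + n_stddev * pstdev(baseline_values)

# Hypothetical baseline samples (percent memory utilization)
history = [60, 62, 58, 61, 59]

print(violates_static(85, 80))                    # True: 85 > 80
print(violates_baseline_percent(85, history, 20)) # True: 85 > 60 * 1.2
print(violates_baseline_stddev(61, history, 3))   # False: within 3 standard deviations
```

A static threshold ignores history entirely, while both dynamic variants move with the baseline, which is why the same metric value can violate under one rule and not under another.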
The following are some examples of health rule conditions:
- To know if there are pods with readiness/liveness issues affecting your services, define a condition:
  readiness probe status = 0 for 80% of pods in a workload
  liveness probe status = 0 for more than 30% of pods in a workload
- To know if any services are impacted by pod restarts, define a condition:
  Pod Restarts are greater than 3 for 80% of pods in a workload
- To know about failed or pending pods, define a condition:
  Sum of Failed pods over a workload is greater than 10%
  Sum of Pending Pods over a workload is greater than 10%
- If the value of Errors per Minute/Calls per Minute over the last 15 days > 0.2.
  This example combines two metrics in a single condition. You can use the expression builder embedded in the health rules wizard to create conditions based on a complex expression comprising multiple interdependent metrics.
- If the (average response time > baseline OR errors per minute > baseline) AND (calls per minute > the defined threshold).
  This example uses multiple conditions to evaluate the health rule. You can use the CUSTOM option to define a boolean expression to evaluate the conditions.
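To make the example conditions concrete, here is a rough Python sketch of how a "for N% of pods" condition and a two-metric expression could be computed. The helper `fraction_matching`, the sample data, and the reading of "for 80% of pods" as "at least 80%" are all assumptions for illustration; they do not describe how the product computes these values.

```python
def fraction_matching(values, predicate):
    """Fraction of items (for example, pods in a workload) for which predicate holds."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)

# Hypothetical readiness probe status per pod: 0 = failing, 1 = passing
readiness = [0, 0, 0, 0, 1]
readiness_condition = fraction_matching(readiness, lambda s: s == 0) >= 0.80

# Hypothetical restart counts per pod
restarts = [5, 4, 0, 1, 7]
restart_condition = fraction_matching(restarts, lambda r: r > 3) >= 0.80

# Expression built from two metrics: Errors per Minute / Calls per Minute > 0.2
errors_per_minute, calls_per_minute = 30, 120
error_ratio_condition = (errors_per_minute / calls_per_minute) > 0.2

print(readiness_condition)    # True: 4 of 5 pods fail the readiness probe
print(restart_condition)      # False: only 3 of 5 pods restarted more than 3 times
print(error_ratio_condition)  # True: 30 / 120 = 0.25
```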
Critical and Warning Conditions
Conditions are classified as either critical or warning conditions.
Critical conditions are evaluated before warning conditions. If you have defined a critical condition and a warning condition in the same health rule, the warning condition is evaluated only if the critical condition is not true.
The configuration procedures for critical and warning conditions are identical, but you configure these two types of conditions in separate panels. You can copy a critical condition configuration to a warning configuration and vice-versa and then adjust the metrics in the copy to differentiate them. For example, in the Critical Condition panel you can create a critical condition based on the rule:
- If the Request Count is greater than 40
Then from the Warning Condition panel, copy that condition and edit it to be:
- If the Request Count is greater than 35
As performance changes, a health rule violation can be upgraded from warning to critical if performance deteriorates to the higher threshold or downgraded from critical to warning if performance improves to the warning threshold.
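The ordering of critical and warning conditions can be illustrated with a short Python sketch that uses the Request Count thresholds from the example above. The function name and the status strings are hypothetical, chosen only for this illustration.

```python
def evaluate_severity(value, critical_threshold, warning_threshold):
    """Check the critical condition first; check the warning condition
    only if the critical condition is not true."""
    if value > critical_threshold:
        return "critical"
    if value > warning_threshold:
        return "warning"
    return "normal"

# Hypothetical Request Count readings over successive evaluations
readings = [30, 37, 45, 38, 20]
statuses = [evaluate_severity(v, critical_threshold=40, warning_threshold=35)
            for v in readings]
print(statuses)  # ['normal', 'warning', 'critical', 'warning', 'normal']
```

The sequence shows a violation being upgraded from warning to critical as the value passes 40, and downgraded again once it falls back between the two thresholds.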
When metric levels exceed the acceptable range, conditions violate, and a health rule violates. The details of the violation are displayed on the Health Violation pane in the Observe UI. This pane displays the following details:
- Number of violating health rules
- List of all the violating health rules along with the status
- The start time of the violation
- End time (depending on the time period for data collection)
Condition Evaluation Criteria
When you define multiple conditions for a health rule, they are evaluated based on the criteria you define. You can use the following options to define the evaluation criteria:
- All: the health rule violates if all the conditions defined in the criteria evaluate as true.
- Any: the health rule violates if any one of the conditions defined in the criteria evaluates as true.
- Custom: the health rule violates if the boolean expression with multiple conditions evaluates as true.
For information on how to configure evaluation criteria, see Condition Evaluation Criteria.
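The three evaluation criteria map naturally onto boolean aggregation. The following Python sketch is one possible way to express them; the function `evaluate_criteria`, the criteria strings, and the representation of a custom expression as a callable are assumptions made for the example, not product API.

```python
def evaluate_criteria(results, criteria, custom=None):
    """Decide whether a health rule violates.

    results:  dict mapping condition names ("A", "B", ...) to True/False
    criteria: "ALL", "ANY", or "CUSTOM"
    custom:   callable that takes `results` and returns a bool (CUSTOM only)
    """
    if criteria == "ALL":
        return all(results.values())
    if criteria == "ANY":
        return any(results.values())
    if criteria == "CUSTOM":
        if custom is None:
            raise ValueError("CUSTOM criteria requires a boolean expression")
        return custom(results)
    raise ValueError(f"unknown criteria: {criteria}")

results = {"A": True, "B": False, "C": True}

def expression(r):
    # Custom boolean expression: (A OR B) AND C
    return (r["A"] or r["B"]) and r["C"]

print(evaluate_criteria(results, "ALL"))                 # False: B is not true
print(evaluate_criteria(results, "ANY"))                 # True: A is true
print(evaluate_criteria(results, "CUSTOM", expression))  # True
```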
The following table uses examples to illustrate how a health rule is evaluated based on the criteria and when it is considered to violate.
| Health Rule Configuration | The Health Rule Violates If | Example |
|---|---|---|
| Single condition | the condition evaluates as true | A health rule that compares 'average response time' with a defined baseline. |
| Multiple conditions with Any | one of the health rule conditions evaluates as true | A health rule that monitors the health of a Kubernetes pod and violates if any one of the metrics it measures breaches its condition. |
| Multiple conditions with All | all of the health rule conditions evaluate as true | A health rule that monitors the health of an APM service and violates only if all of the metrics it measures breach their conditions. |
| Multiple conditions with Custom | the boolean expression with multiple conditions evaluates as true. The expression is evaluated only if a valid combination of conditions using AND and OR operators is defined. To ensure that alerts are triggered quickly before there is any significant business impact, the health rule does not necessarily evaluate every condition in a boolean expression. It starts with the first condition and continues to evaluate the following conditions until it can deterministically mark the expression as true or false. As soon as the evaluation determines that the expression violates, an alert is triggered. | A health rule that monitors the health of an APM service based on a boolean expression that combines several conditions. |
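The early-exit behavior described for custom boolean expressions resembles short-circuit evaluation in most programming languages. The Python sketch below uses the language's own `and`/`or` operators as a stand-in; the `condition` wrapper and the sample results are hypothetical, and the product's actual evaluation order is not confirmed by this example.

```python
evaluated = []  # records which conditions were actually evaluated

def condition(name, result):
    """Wrap a condition result so that each evaluation is recorded."""
    def run():
        evaluated.append(name)
        return result
    return run

A = condition("A", True)   # e.g. average response time > baseline
B = condition("B", True)   # e.g. errors per minute > baseline
C = condition("C", False)  # e.g. calls per minute > defined threshold

# (A OR B) AND C: evaluation stops as soon as the outcome is determined
violates = (A() or B()) and C()

print(violates)   # False
print(evaluated)  # ['A', 'C']: B is skipped because A already made (A OR B) true
```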
Custom Boolean Expression
A condition consists of one or more statements that evaluate different metrics. You can define a single condition or multiple conditions to evaluate the performance metrics of your application. When you define multiple conditions, you may want to define the evaluation criteria using a boolean expression.
Advantages of using a boolean expression are:
- Eliminates the need to create multiple health rules to monitor various performance metrics. A boolean expression lets you evaluate complex criteria for multiple conditions in a single health rule.
- A well-calibrated boolean expression reduces false alerts.
- Makes it easy to create and maintain health rules with complex evaluation criteria using simple condition names. Conditions are named A, B, C, and so on.
- Allows the use of AND and OR operators to define a highly complex boolean expression.
Temporary spikes in metric performance data are a major cause of false alerts. Persistence thresholds allow you to define a sensitivity level for a health rule and thereby reduce the number of false alerts. You can define the number of times metric performance data must exceed the defined threshold during the evaluation time frame to constitute a violation and subsequently trigger an alert.
You can define a persistence threshold for a condition only if you have defined an evaluation time frame of 30 minutes or less.
For example, when monitoring the CPU utilization, you would not want to receive a notification of a single violation of the threshold. However, if the violation of the threshold continues and occurs multiple times during the evaluation time period, you would want a notification.
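A persistence threshold amounts to counting how many data points in the evaluation time frame exceed the metric threshold. The following Python sketch illustrates the idea under the assumption of one data point per minute; the function name and the CPU readings are hypothetical.

```python
def persistent_violation(data_points, threshold, min_occurrences):
    """True when at least `min_occurrences` data points in the
    evaluation time frame exceed `threshold`."""
    occurrences = sum(1 for value in data_points if value > threshold)
    return occurrences >= min_occurrences

# Hypothetical CPU utilization (%) over a 5-minute evaluation time frame
cpu_spike = [45, 50, 95, 48, 52]      # a single temporary spike
cpu_sustained = [91, 50, 95, 93, 52]  # threshold exceeded three times

print(persistent_violation(cpu_spike, threshold=90, min_occurrences=3))      # False
print(persistent_violation(cpu_sustained, threshold=90, min_occurrences=3))  # True
```

With a persistence threshold of 3, the single spike does not trigger a notification, while the sustained pattern does.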
Health Rule Evaluation Time Frame
The health rule evaluation time frame is the period of time over which the data used to evaluate the health rule is collected.
Different kinds of metrics provide better results using different sets of data. You can manage how much data AppDynamics Cloud uses when it evaluates a particular health rule by setting the data collection time period. You can define an evaluation time frame between 1 and 120 minutes. The default value is 30 minutes. You can select the following values in the Use data from last drop-down:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120
How are Conditions Evaluated if No Data is Reported?
The Evaluate to true on no data option controls how the condition is evaluated when any metric on which the condition is based does not report data. By default, the condition evaluates to unknown when no data is returned. If the health rule is based on all the conditions evaluating to true, having no data returned may affect whether the health rule triggers an action.
When you define a health rule evaluation time frame, reference data is collected for each data point. If the configured metric fails to report data during the time frame, the health rule condition is evaluated as follows:
Evaluate to true on no data
Trigger only when a violation occurs x times in last y min(s)
The condition is evaluated for each data point in the evaluation time frame. The condition evaluates as true for each data point for which no data is reported.
For example, suppose you set the persistence threshold to X = 3 for an evaluation time frame of Y = 5 minutes, so 5 data points are required to evaluate the condition. Data is reported for 4 data points, no data is reported for 1 data point, and the metric exceeds the threshold twice. The condition evaluates as true for the minute when no data is reported, so it counts 3 violations and meets the persistence threshold.
The condition does not evaluate as
The condition does not evaluate as
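The worked example above (X = 3, Y = 5, one missing data point, two threshold breaches) can be reproduced with a small Python sketch. It assumes one data point per minute and represents a missing data point as `None`. The function name is hypothetical, and treating a missing point as "not a violation" when the option is disabled is a simplification of the default unknown result.

```python
def evaluate_with_no_data(data_points, threshold, min_occurrences, true_on_no_data):
    """Count violations per data point. A missing data point (None)
    counts as a violation only when `true_on_no_data` is enabled."""
    occurrences = 0
    for value in data_points:
        if value is None:
            if true_on_no_data:
                occurrences += 1
        elif value > threshold:
            occurrences += 1
    return occurrences >= min_occurrences

# 5 data points: two exceed the threshold of 80, one reports no data
points = [85, None, 60, 90, 55]

print(evaluate_with_no_data(points, 80, 3, true_on_no_data=True))   # True: 2 breaches + 1 no-data point
print(evaluate_with_no_data(points, 80, 3, true_on_no_data=False))  # False: only 2 breaches
```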