Health Rule Evaluation Conditions
A health rule condition is an acceptable performance range for an identified metric. A condition defines the metric levels that constitute a Warning status or a Critical status.
A condition consists of a boolean statement that compares the current value of a metric against one or more static or dynamic thresholds based on a selected baseline. If the condition is true, the health rule violates. You can configure the rules for evaluating a condition using multiple thresholds.
Static thresholds are straightforward. For example, is the Memory Utilization for a pod greater than 80%? If the Memory Utilization is greater than 80%, the condition evaluates as true and the health rule violates. You can also select the source from which to query the data. The health evaluation varies depending on the data source you choose, because metrics from different sources have different granularity and properties.
Dynamic thresholds are based on a percentage in relation to, or a standard deviation from, a baseline built on a rolled-up baseline trend pattern.
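The two kinds of dynamic threshold can be sketched as follows. This is only an illustration: the function names are hypothetical, and the baseline here is a plain mean over a short list of sample values, whereas the product derives it from a rolled-up baseline trend pattern.

```python
from statistics import mean, stdev

def exceeds_by_percentage(value, baseline, percent):
    """True when value is more than `percent` percent above the baseline."""
    return value > baseline * (1 + percent / 100)

def exceeds_by_std_devs(value, history, n_devs):
    """True when value is more than n_devs sample standard deviations above the mean of history."""
    baseline = mean(history)
    return value > baseline + n_devs * stdev(history)

# Made-up response times (ms) standing in for a baseline; the mean is 100.
history = [100, 110, 90, 105, 95]

print(exceeds_by_percentage(125, mean(history), 20))  # 125 vs. 20% above the baseline
print(exceeds_by_std_devs(120, history, 2))           # 120 vs. 2 standard deviations above the baseline
```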
You can define a threshold for a health rule based on a single metric value or on a mathematical expression built from multiple metric values.
The following are some examples of health rule conditions:
- To know if there are pods with readiness/liveness issues affecting your services, define a condition:
readiness probe status =0 for 80% pods in a workload
liveness probe status =0 for more than 30% pods in a workload
- To know if any services are impacted by pod restarts, define a condition:
Pod Restarts are greater than 3 for 80% pods on a workload
- To know about failed or pending pods, define a condition:
Sum of Failed pods over a workload is greater than 10%
Sum of Pending Pods over a workload is greater than 10%
- To combine two metrics in a single condition, define a condition:
Errors per Minute/Calls per Minute over the last 15 days is greater than 0.2
You can use the expression builder embedded in the health rules wizard to create conditions based on a complex expression comprising multiple interdependent metrics.
- To evaluate a health rule using multiple conditions, define a boolean expression:
(average response time > baseline OR errors per minute > baseline) AND (calls per minute > the defined threshold)
You can use the CUSTOM option to define a boolean expression to evaluate the conditions.
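As a rough illustration of the examples above, the following sketch evaluates a "percentage of pods" condition and a condition built from an expression over two metrics. The function name, the sample values, and the reading of "for 80% pods" as "at least 80% of pods" are assumptions made for this sketch, not product behavior.

```python
def fraction_matching(values, predicate):
    """Fraction of values for which predicate(value) is true (0.0 for an empty list)."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)

# Made-up restart counts for the five pods of one workload.
pod_restarts = [5, 4, 0, 6, 7]

# "Pod Restarts are greater than 3 for 80% pods on a workload"
restart_condition = fraction_matching(pod_restarts, lambda r: r > 3) >= 0.8

# "Errors per Minute/Calls per Minute > 0.2", using made-up metric values.
errors_per_minute = 30
calls_per_minute = 100
error_ratio_condition = (errors_per_minute / calls_per_minute) > 0.2

print(restart_condition, error_ratio_condition)
```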
Critical and Warning Conditions
Conditions are classified as either critical or warning conditions.
Critical conditions are evaluated before warning conditions. If you have defined a critical condition and a warning condition in the same health rule, the warning condition is evaluated only if the critical condition is not true.
The configuration procedures for critical and warning conditions are identical, but you configure these two types of conditions in separate panels. You can copy a critical condition configuration to a warning configuration and vice-versa and then adjust the metrics in the copy to differentiate them. For example, in the Critical Condition panel you can create a critical condition based on the rule:
- If the Request Count is greater than 40
Then from the Warning Condition panel, copy that condition and edit it to be:
- If the Request Count is greater than 35
As performance changes, a health rule violation can be upgraded from warning to critical if performance deteriorates to the higher threshold or downgraded from critical to warning if performance improves to the warning threshold.
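The evaluation order and the upgrade and downgrade behavior can be sketched with the Request Count thresholds from the example. The function and the status labels (including "NORMAL") are hypothetical names chosen for this illustration.

```python
def evaluate_status(value, critical_threshold, warning_threshold):
    """Check the critical condition first; check warning only if critical is not true."""
    if value > critical_threshold:
        return "CRITICAL"
    if value > warning_threshold:
        return "WARNING"
    return "NORMAL"

# Made-up Request Count readings over successive evaluations.
readings = [30, 37, 45, 38]
statuses = [evaluate_status(r, critical_threshold=40, warning_threshold=35) for r in readings]
print(statuses)
```

In this sample the status moves from warning to critical when the count passes 40, then back to warning when it drops to 38.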
Condition Violation
When metric levels exceed the acceptable range, conditions violate, and a health rule violates. The details of the violation are displayed in the Entity Health Timeline section of the entity-centric page. This section displays the following details:
- The number of violations of the type Alert and Anomaly
- The start time of the violation
- The end time of the violation (depending on the time period for data collection)
See Health Violation Timeline.
Condition Evaluation Criteria
When you define multiple conditions for a health rule, they are evaluated based on the criteria you define. You can use the following options to define the evaluation criteria:
- All: the health rule violates if all the conditions defined in the criteria evaluate as true.
- Any: the health rule violates if one of the conditions defined in the criteria evaluates as true.
- Custom: the health rule violates if the boolean expression with multiple conditions evaluates as true.
For information on how to configure evaluation criteria, see Condition Evaluation Criteria.
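A minimal sketch of the three criteria, assuming each condition has already been evaluated to a boolean. The function name, the dictionary of results, and the use of a Python callable for the Custom case are all illustrative choices, not the product's interface.

```python
def health_rule_violates(results, criteria, custom=None):
    """Combine per-condition results (name -> bool) using ALL, ANY, or CUSTOM criteria."""
    if criteria == "ALL":
        return all(results.values())
    if criteria == "ANY":
        return any(results.values())
    if criteria == "CUSTOM":
        if custom is None:
            raise ValueError("CUSTOM criteria need a boolean expression")
        return custom(results)
    raise ValueError(f"unknown criteria: {criteria!r}")

# Made-up results for three conditions.
results = {"A": True, "B": False, "C": True}

print(health_rule_violates(results, "ALL"))
print(health_rule_violates(results, "ANY"))
print(health_rule_violates(results, "CUSTOM", lambda r: (r["A"] or r["B"]) and r["C"]))
```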
The following table uses examples to illustrate how a health rule is evaluated based on the criteria and when it is considered to violate:
Health Rule Configuration | Evaluation | Example |
---|---|---|
Single condition | The condition evaluates as true | A health rule that compares 'average response time' with a defined baseline. |
Multiple conditions with ANY evaluation criteria | One of the health rule conditions evaluates as true | A health rule that monitors the health of a Kubernetes pod may measure any of the following performance metrics:
|
Multiple conditions with ALL evaluation criteria | All of the health rule conditions evaluate as true | A health rule that monitors the health of an APM service measures all of the following metrics:
|
Multiple conditions with CUSTOM evaluation criteria | The boolean expression with multiple conditions evaluates as true. The condition is evaluated only if a valid combination of conditions using AND and OR operators is defined. To ensure that alerts are triggered quickly, before there is any significant business impact, the health rule does not necessarily evaluate all the conditions in a boolean expression. It starts with the first condition and continues to evaluate the following conditions until it can deterministically mark the expression as true or false. As soon as the evaluation determines that the expression violates, an alert is triggered. | A health rule that monitors the health of an APM service measures the performance based on the following conditions:
AND
|
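
The early-exit behavior described for CUSTOM criteria resembles short-circuit evaluation in a programming language. The sketch below uses Python's own `and`/`or` to show that a condition can be skipped once the outcome is determined. The `condition` helper and the `evaluated` list are hypothetical and exist only to record which conditions were actually checked.

```python
evaluated = []  # names of the conditions that were actually evaluated

def condition(name, result):
    """Build a lazily evaluated condition that records when it is checked."""
    def check():
        evaluated.append(name)
        return result
    return check

A = condition("A", False)
B = condition("B", True)
C = condition("C", True)

# (A AND B) OR C: A is false, so B is skipped; C then decides the outcome.
violates = (A() and B()) or C()

print(violates, evaluated)
```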
Custom Boolean Expression
A condition consists of one or more statements that evaluate different metrics. You can define a single condition or multiple conditions to evaluate the performance metrics of your application. When you define multiple conditions, you may want to define the evaluation criteria using a boolean expression.
Advantages of using a boolean expression are:
- Eliminates the need to create multiple health rules to monitor various performance metrics. A boolean expression lets you evaluate complex criteria for multiple conditions at once.
- Reduces false alerts when the boolean expression is well calibrated.
- Makes it easy to create and maintain health rules with complex evaluation criteria using simple condition names. Conditions are named A, B, C, and so on.
- Allows the use of AND and OR operators to define a highly complex boolean expression.
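To make the idea of an expression over named conditions concrete, here is a small recursive-descent evaluator for strings such as `(A OR B) AND C`. It is a sketch under stated assumptions, not the product's parser: the function name is hypothetical, and giving AND higher precedence than OR is a choice made for this example, so use parentheses when the grouping matters.

```python
import re

def evaluate_expression(expression, results):
    """Evaluate a boolean expression over named conditions, e.g. "(A OR B) AND C"."""
    tokens = re.findall(r"[()]|\w+", expression)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        value = parse_factor()
        while peek() == "AND":
            take()
            right = parse_factor()
            value = value and right
        return value

    def parse_factor():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token not in results:
            raise ValueError(f"unknown condition: {token!r}")
        return results[token]

    value = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return value

print(evaluate_expression("(A OR B) AND C", {"A": False, "B": True, "C": True}))
```

Unlike the early-exit evaluation described for CUSTOM criteria, this evaluator looks up every condition; it only illustrates how condition names map to results.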
Persistence Thresholds
Temporary spikes in metric performance data are a major cause of false alerts. Persistence thresholds allow you to define a sensitivity level for a health rule and thereby reduce the number of false alerts. You can define the number of times that metric performance data must exceed the defined threshold during the evaluation time frame to constitute a violation and subsequently trigger an alert.
You can define a persistence threshold for a condition only if you have defined an evaluation time frame of 30 minutes or less.
For example, when monitoring the CPU utilization, you would not want to receive a notification of a single violation of the threshold. However, if the violation of the threshold continues and occurs multiple times during the evaluation time period, you would want a notification.
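The CPU utilization example can be sketched as a count of threshold breaches within the evaluation time frame. The function name and sample values are hypothetical, and treating "x times" as "at least x times" is an assumption.

```python
def violates_with_persistence(data_points, threshold, min_violations):
    """True when the metric exceeds the threshold at least min_violations times."""
    violations = sum(1 for value in data_points if value > threshold)
    return violations >= min_violations

# Made-up CPU utilization readings (percent), one per minute.
single_spike = [95, 60, 62, 58, 61]
sustained = [95, 92, 60, 96, 61]

print(violates_with_persistence(single_spike, threshold=90, min_violations=3))
print(violates_with_persistence(sustained, threshold=90, min_violations=3))
```

With a persistence threshold of 3, the single spike does not produce a violation, while the sustained series does.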
Health Rule Evaluation Time Frame
The health rule evaluation time frame is the period of time over which the data used to evaluate the health rule is collected.
Different kinds of metrics provide better results using different sets of data. You can manage how much data Cisco Cloud Observability uses when it evaluates a particular health rule by setting the data collection time period. You can define an evaluation time frame from 1 minute to 120 minutes. The default value is 30 minutes. You can select the following values in the Use data from last drop-down:
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120
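If you generate health rule settings from a script, a check such as the following could catch unsupported time frames before they are submitted. The constant and function names are hypothetical; the values, the 30-minute default, and the 30-minute limit for persistence thresholds come from this page.

```python
# 1-10 in steps of 1, then 15 and 20, then 30-120 in steps of 10 (minutes).
ALLOWED_TIME_FRAMES = list(range(1, 11)) + [15, 20] + list(range(30, 121, 10))
DEFAULT_TIME_FRAME = 30

def validate_time_frame(minutes=DEFAULT_TIME_FRAME):
    """Return minutes if it is one of the selectable values, else raise ValueError."""
    if minutes not in ALLOWED_TIME_FRAMES:
        raise ValueError(f"unsupported evaluation time frame: {minutes} minutes")
    return minutes

def supports_persistence_threshold(minutes):
    """A persistence threshold requires a time frame of 30 minutes or less."""
    return minutes <= 30

print(validate_time_frame(), supports_persistence_threshold(60))
```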
How are Conditions Evaluated if No Data is Reported?
The Evaluate to true on no data option controls how the condition is evaluated when any metric on which the condition is based does not report data. By default, the condition evaluates to unknown when no data is returned. If the health rule is based on all the conditions evaluating to true, having no data returned may affect whether the health rule triggers an action.
When you define a health rule evaluation time frame, reference data is collected for each data point. If the configured metric fails to report data during the time frame, the health rule condition is evaluated as follows:
Evaluate to true on no data | Trigger only when a violation occurs x times in last y min(s) | Condition Evaluation |
---|---|---|
Enabled | Enabled | The condition is evaluated for each data point in the evaluation time frame. The condition evaluates as true for any data point that reports no data. For example, suppose you set the persistence threshold X = 3 for an evaluation time frame Y = 5, so 5 data points are required to evaluate the condition. Data is reported for 4 data points, no data is reported for 1 data point, and the metric exceeds the threshold twice. The condition evaluates as true for the minute when no data is reported, which brings the count to 3 and meets the persistence threshold. When both options are enabled, the event message in the user interface (UI) does not specify whether the health rule violation was triggered because of no data or because the persistence threshold was genuinely breached. |
Enabled | Disabled | The condition evaluates as true when no data is reported. |
Disabled | Disabled | The condition does not evaluate as true when no data is reported. |
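
The worked example in the first table row (X = 3, Y = 5) can be reproduced with a short sketch. Missing data points are represented as `None`, and the function name and sample values are hypothetical. The case with the option turned off is included only for contrast and reflects an assumption about how missing points would be counted.

```python
def count_violations(data_points, threshold, true_on_no_data):
    """Count data points that violate; None marks a minute with no data."""
    count = 0
    for value in data_points:
        if value is None:
            if true_on_no_data:
                count += 1  # a missing data point is treated as a violation
        elif value > threshold:
            count += 1
    return count

# Five one-minute data points: two breaches, one minute with no data.
points = [95.0, None, 70.0, 92.0, 65.0]
persistence_threshold = 3

with_option = count_violations(points, threshold=90.0, true_on_no_data=True)
without_option = count_violations(points, threshold=90.0, true_on_no_data=False)

print(with_option >= persistence_threshold, without_option >= persistence_threshold)
```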