A health rule condition is an acceptable performance range for an identified metric. A condition defines the metric levels that constitute a Warning status or a Critical status.

A condition consists of a boolean statement that compares the current value of a metric against one or more static or dynamic thresholds based on a selected baseline. If the condition is true, the health rule violates. You can configure the rules for evaluating a condition using multiple thresholds.

Static thresholds are straightforward. For example, is the Memory Utilization for a pod greater than 80%? If the Memory Utilization is greater than 80%, the condition evaluates as true and the health rule violates. You can also select the source from which you want to query the data. The health evaluation varies depending on the data source you choose because metrics from different sources have different granularity and properties.

Dynamic thresholds are based on a percentage in relation to, or a standard deviation from, a baseline built on a rolled-up baseline trend pattern.
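
As a rough illustration of the two threshold types, the following sketch checks a value against a static threshold and against two dynamic thresholds. The function names, the sample values, and the choice to represent a baseline by the mean and standard deviation of a short history are assumptions made for this example; they do not describe how the product computes baselines.

```python
from statistics import mean, pstdev

def violates_static(value: float, threshold: float) -> bool:
    """Static threshold: true when the value is greater than a fixed limit."""
    return value > threshold

def violates_baseline_percentage(value: float, baseline: float, percent: float) -> bool:
    """Dynamic threshold: true when the value is more than `percent` percent above the baseline."""
    return value > baseline * (1 + percent / 100)

def violates_baseline_stddev(value: float, history: list[float], deviations: float) -> bool:
    """Dynamic threshold: true when the value is more than `deviations`
    standard deviations above the mean of the baseline history."""
    return value > mean(history) + deviations * pstdev(history)

# Hypothetical Memory Utilization samples (percent) that form the baseline.
history = [60.0, 62.0, 58.0, 61.0, 59.0]

static_result = violates_static(85.0, 80.0)                                  # 85 > 80
percentage_result = violates_baseline_percentage(85.0, mean(history), 20.0)  # 85 > 60 * 1.2
stddev_result = violates_baseline_stddev(61.0, history, 3.0)                 # 61 is within 3 deviations
```

In this sketch the first two checks evaluate as true and the third as false, because 61 lies within three standard deviations of the baseline mean.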

You can define a threshold for a health rule based on a single metric value or on a mathematical expression built from multiple metric values. 

The following are some examples of health rule conditions:

  • To know if there are pods with readiness/liveness issues affecting your services, define a condition:

readiness probe status = 0 for 80% of pods in a workload
liveness probe status = 0 for more than 30% of pods in a workload

  • To know if any services are impacted by pod restarts, define a condition:

Pod Restarts are greater than 3 for 80% of pods in a workload

  • To know about failed or pending pods, define a condition:

Sum of Failed pods over a workload is greater than 10%
Sum of Pending Pods over a workload is greater than 10%

  • If the value of Errors per Minute / Calls per Minute over the last 15 days is greater than 0.2.
    This example combines two metrics in a single condition. You can use the expression builder embedded in the health rules wizard to create conditions based on a complex expression comprising multiple interdependent metrics.
  • If the (average response time > baseline OR errors per minute > baseline) AND (calls per minute > the defined threshold).
    This example uses multiple conditions to evaluate the health rules. You can use the CUSTOM option to define a boolean expression to evaluate the conditions.
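
The pod-percentage and metric-ratio examples above can be sketched as follows. The helper names and sample data are hypothetical, and reading "for 80% of pods" as "at least 80% of pods" is an assumption.

```python
def fraction_matching(values, predicate):
    """Fraction of entities (for example, pods) whose metric satisfies the predicate."""
    if not values:
        return 0.0
    return sum(1 for v in values if predicate(v)) / len(values)

# Hypothetical readiness probe status per pod in one workload (0 = not ready).
readiness = [0, 0, 0, 0, 1]
# "readiness probe status = 0 for 80% of pods in a workload"
readiness_condition = fraction_matching(readiness, lambda s: s == 0) >= 0.80

# Hypothetical restart counts per pod in one workload.
restarts = [5, 4, 0, 1, 7]
# "Pod Restarts are greater than 3 for 80% of pods in a workload"
restart_condition = fraction_matching(restarts, lambda r: r > 3) >= 0.80

def error_ratio_condition(errors_per_minute, calls_per_minute, threshold=0.2):
    """Expression built from two metrics: Errors per Minute / Calls per Minute > threshold."""
    if calls_per_minute == 0:
        return False  # assumption: with no calls, the ratio cannot violate
    return errors_per_minute / calls_per_minute > threshold
```

With this sample data, four of five pods are not ready, so the readiness condition is true, while only three of five pods restarted more than three times, so the restart condition is false.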

Critical and Warning Conditions

Conditions are classified as either critical or warning conditions. 

Critical conditions are evaluated before warning conditions. If you have defined a critical condition and a warning condition in the same health rule, the warning condition is evaluated only if the critical condition is not true.

The configuration procedures for critical and warning conditions are identical, but you configure these two types of conditions in separate panels. You can copy a critical condition configuration to a warning configuration and vice versa, and then adjust the metrics in the copy to differentiate them. For example, in the Critical Condition panel you can create a critical condition based on the rule:

  • If the Request Count is greater than 40

Then from the Warning Condition panel, copy that condition and edit it to be:

  • If the Request Count is greater than 35

As performance changes, a health rule violation can be upgraded from warning to critical if performance deteriorates to the higher threshold or downgraded from critical to warning if performance improves to the warning threshold.
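
The evaluation order described above, critical first and warning only if critical is not true, can be sketched with the Request Count example. The function name and the sample values are hypothetical.

```python
from typing import Optional

CRITICAL_THRESHOLD = 40  # "If the Request Count is greater than 40"
WARNING_THRESHOLD = 35   # "If the Request Count is greater than 35"

def health_status(request_count: float) -> Optional[str]:
    """Return "critical", "warning", or None when neither condition is true."""
    # The critical condition is evaluated first.
    if request_count > CRITICAL_THRESHOLD:
        return "critical"
    # The warning condition is evaluated only if the critical condition is not true.
    if request_count > WARNING_THRESHOLD:
        return "warning"
    return None

# Hypothetical Request Count values over successive evaluations.
statuses = [health_status(v) for v in (30, 37, 45, 38, 20)]
```

Over these sample values the status moves from no violation to warning, is upgraded to critical, is downgraded to warning, and then clears.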

Condition Violation

When metric levels exceed the acceptable range, conditions violate, and a health rule violates. The details of the violation are displayed in the Entity Health Timeline section of the entity-centric page. This section displays the following details:

  • Number of violations of the type Alert and Anomaly.
  • The start time of the violation
  • End time (depending on the time period for data collection)

See Health Violation Timeline.

Condition Evaluation Criteria

When you define multiple conditions for a health rule, they are evaluated based on the criteria you define. You can use the following options to define the evaluation criteria:

  • All: the health rule violates if all the conditions defined in the criteria evaluate as true.
  • Any: the health rule violates if one of the conditions defined in the criteria evaluates as true.
  • Custom: the health rule violates if the boolean expression with multiple conditions evaluates as true.

For information on how to configure evaluation criteria, see Condition Evaluation Criteria.
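
As a minimal sketch of the All and Any criteria, the following function combines the results of named conditions. The function name, the condition names, and the sample results are assumptions for illustration.

```python
def health_rule_violates(criteria: str, results: dict[str, bool]) -> bool:
    """Combine named condition results using the All or Any criteria."""
    if not results:
        raise ValueError("at least one condition is required")
    if criteria == "ALL":
        return all(results.values())
    if criteria == "ANY":
        return any(results.values())
    raise ValueError(f"unsupported criteria: {criteria!r}")

# Hypothetical condition outcomes for one evaluation.
results = {"CPU Requests": True, "Memory Requests": False}

any_result = health_rule_violates("ANY", results)  # one true condition is enough
all_result = health_rule_violates("ALL", results)  # every condition must be true
```

With one condition true and one false, the rule violates under Any but not under All.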

The following table uses examples to illustrate how a health rule is evaluated based on the criteria and when it is considered to violate:

Health Rule Configuration: Single condition
Evaluation: The condition evaluates as true.
Example: A health rule that compares 'average response time' with a defined baseline.

Health Rule Configuration: Multiple conditions with ANY evaluation criteria
Evaluation: One of the health rule conditions evaluates as true.
Example: A health rule that monitors the health of a Kubernetes pod may measure any of the following performance metrics:

  • CPU Requests or
  • Memory Requests

Health Rule Configuration: Multiple conditions with ALL evaluation criteria
Evaluation: All of the health rule conditions evaluate as true.
Example: A health rule that monitors the health of an APM service measures all of the following metrics:

  • Calls Per Minute
  • Average Response Time greater than a baseline value
  • Errors Per Minute

Health Rule Configuration: Multiple conditions with CUSTOM evaluation criteria
Evaluation: The boolean expression with multiple conditions evaluates as true.

The condition is evaluated only if a valid combination of conditions using AND and OR operators is entered; otherwise, the evaluation fails.

To ensure that alerts are triggered quickly before there is any significant business impact, the health rules do not evaluate all the conditions in a boolean expression. The health rule starts evaluating the first condition and continues to evaluate the following conditions until it can deterministically mark the expression as true or false. As soon as the evaluation determines the expression to violate, an alert is triggered.

Example: A health rule that monitors the health of an APM service measures the performance based on the following conditions:

  • (Average Response Time greater than baseline OR Errors Per Minute greater than baseline)

AND

  • (Calls Per Minute greater than threshold)
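
The early-exit evaluation of a custom expression can be illustrated with Python's own short-circuiting `or` and `and` operators. This is only a sketch of the idea for the expression (A OR B) AND C; the function name and the recording of evaluated conditions are hypothetical, and in a real system each check would compare a metric against its threshold.

```python
def evaluate_custom(a: bool, b: bool, c: bool):
    """Evaluate (A OR B) AND C and record which conditions were actually checked."""
    evaluated = []

    def check(name: str, result: bool) -> bool:
        # Stand-in for evaluating one condition against its metric.
        evaluated.append(name)
        return result

    # `or` skips B when A is true; `and` skips C when (A OR B) is false.
    violates = (check("A", a) or check("B", b)) and check("C", c)
    return violates, evaluated

# A: Average Response Time > baseline, B: Errors Per Minute > baseline,
# C: Calls Per Minute > threshold (all sample outcomes are hypothetical).
first = evaluate_custom(True, False, True)
second = evaluate_custom(False, False, True)
```

In the first call B is never checked because A already makes the OR group true; in the second call C is never checked because the OR group is false.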

Custom Boolean Expression

A condition consists of one or more statements that evaluate different metrics. You can define a single condition or multiple conditions to evaluate the performance metrics of your application. When you define multiple conditions, you may want to define the evaluation criteria using a boolean expression.

Advantages of using a boolean expression are:

  • Eliminates the need to create multiple health rules to monitor various performance metrics. A boolean expression lets you evaluate complex criteria for multiple conditions at once.
  • A well-calibrated boolean expression reduces false alerts.
  • Makes it easy to create and maintain health rules with complex evaluation criteria using simple condition names. Conditions are named A, B, C, and so on.
  • Allows the use of AND and OR operators to define a highly complex boolean expression.
You can use a maximum of 8 operators in your boolean expression.
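
The following sketch shows one way an expression over named conditions could be validated and evaluated. It is not the product's parser: the function name is hypothetical, the precedence of AND over OR is an assumption, and the only rule taken from the text is the limit of 8 operators.

```python
MAX_OPERATORS = 8  # limit stated above

def evaluate_expression(expression: str, results: dict[str, bool]) -> bool:
    """Evaluate an expression such as "(A OR B) AND C" over named condition results."""
    tokens = expression.replace("(", " ( ").replace(")", " ) ").split()
    if sum(t in ("AND", "OR") for t in tokens) > MAX_OPERATORS:
        raise ValueError("expression uses more than 8 operators")
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def parse_or():
        value = parse_and()
        while peek() == "OR":
            take()
            right = parse_and()
            value = value or right
        return value

    def parse_and():
        value = parse_operand()
        while peek() == "AND":
            take()
            right = parse_operand()
            value = value and right
        return value

    def parse_operand():
        token = take()
        if token == "(":
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        if token not in results:
            raise ValueError(f"unexpected token: {token!r}")
        return results[token]

    value = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return value

# Hypothetical results for conditions named A, B, and C.
outcome = evaluate_expression("(A OR B) AND C", {"A": False, "B": True, "C": True})
```

An invalid combination, such as a trailing operator or an unknown condition name, raises an error instead of producing a result, which mirrors the statement that an invalid expression cannot be evaluated.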

Persistence Thresholds

Temporary spikes in metric performance data are a major cause of false alerts. Persistence thresholds allow you to define a sensitivity level for a health rule and thereby reduce the number of false alerts. You can define the number of times metric performance data should exceed the defined threshold during the evaluation time frame to constitute a violation and subsequently trigger an alert.

You can define a persistence threshold for a condition only if you have defined an evaluation time frame of 30 minutes or less.

For example, when monitoring the CPU utilization, you would not want to receive a notification of a single violation of the threshold. However, if the violation of the threshold continues and occurs multiple times during the evaluation time period, you would want a notification.

Persistence Threshold - CPU Utilization
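
A persistence threshold of the form "x times in the last y minutes" can be sketched as a simple count over the evaluation time frame. The function name, the 90% threshold, and the sample CPU values are assumptions for illustration.

```python
def persistence_violation(samples, threshold, required_count):
    """True when at least `required_count` data points in the evaluation
    time frame exceed the threshold."""
    exceed_count = sum(1 for value in samples if value > threshold)
    return exceed_count >= required_count

# Hypothetical CPU utilization (percent), one data point per minute for 5 minutes.
single_spike = [55, 92, 60, 58, 57]  # one temporary spike
sustained = [55, 92, 95, 91, 60]     # the threshold is exceeded three times

no_alert = persistence_violation(single_spike, threshold=90, required_count=3)
alert = persistence_violation(sustained, threshold=90, required_count=3)
```

The single spike does not reach the required count of 3, so no alert is triggered, while the sustained series does.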

Health Rule Evaluation Time Frame

The health rule evaluation time frame is the period of time over which the data used to evaluate the health rule is collected. 

Different kinds of metrics provide better results using different sets of data. You can manage how much data Cisco Cloud Observability uses when it evaluates a particular health rule by setting the data collection time period. You can define an evaluation time frame from 1 minute to 120 minutes. The default value is 30 minutes. You can select the following values (in minutes) in the Use data from last drop-down:

1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120

You can define a persistence threshold for a condition only if you have defined the evaluation time frame of 30 mins or less.
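
The constraints on the evaluation time frame can be summarized in a short validation sketch. The constant and function names are hypothetical; the allowed values, the default, and the 30-minute limit for persistence thresholds come from the text above.

```python
# Selectable values in minutes: 1 to 10, then 15 and 20, then 30 to 120 in steps of 10.
ALLOWED_MINUTES = list(range(1, 11)) + [15, 20] + list(range(30, 121, 10))
DEFAULT_MINUTES = 30
MAX_MINUTES_FOR_PERSISTENCE = 30

def validate_time_frame(minutes: int = DEFAULT_MINUTES, use_persistence: bool = False) -> int:
    """Return the time frame if it is selectable and compatible with persistence."""
    if minutes not in ALLOWED_MINUTES:
        raise ValueError(f"{minutes} is not a selectable evaluation time frame")
    if use_persistence and minutes > MAX_MINUTES_FOR_PERSISTENCE:
        raise ValueError("a persistence threshold requires a time frame of 30 minutes or less")
    return minutes
```

For example, a 20-minute time frame with a persistence threshold is accepted, while a 60-minute time frame is accepted only without one.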

How are Conditions Evaluated if No Data is Reported?

The Evaluate to true on no data option controls how the condition is evaluated when any metric on which the condition is based does not report data. By default, the condition evaluates to unknown when no data is returned. If the health rule is based on all the conditions evaluating to true, having no data returned may affect whether the health rule triggers an action.

When you define a health rule evaluation time frame, reference data is collected for each data point. If the configured metric fails to report data during the time frame, the health rule condition is evaluated as follows:

Evaluate to true on no data: Enabled
Trigger only when a violation occurs x times in last y min(s): Enabled
Condition Evaluation: The condition is evaluated for each data point in the evaluation time frame. The condition evaluates as true when metric fails to report any data for a given data point.

For example, when you set the persistence threshold, X = 3 for an evaluation time frame, Y = 5. This means that 5 data points are required to evaluate the condition. Data is reported for 4 data points, no data is reported for 1 data point and the metric exceeds the threshold twice. The condition evaluates as true for the minute when no data is reported.

When both the options are enabled, the event message on the User Interface (UI) does not specify if the health rule violation is triggered because of no data or if the persistence threshold is genuinely breached.

Evaluate to true on no data: Enabled
Trigger only when a violation occurs x times in last y min(s): Disabled
Condition Evaluation: The condition evaluates as true if a metric fails to report data for any data point during the evaluation time frame.

Evaluate to true on no data: Disabled
Trigger only when a violation occurs x times in last y min(s): Disabled
Condition Evaluation: The condition does not evaluate as true if the configured metric fails to report data for any data point during the evaluation time frame.
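
The three combinations above can be sketched as follows, using the X = 3, Y = 5 example. This is an interpretation and not the product's implementation: the function name and sample values are hypothetical, the combination of the no-data option disabled with persistence enabled is not covered by the text and is modeled here as ignoring missing data points, and the behavior with complete data and no persistence threshold is also an assumption.

```python
from typing import Optional, Sequence

def condition_is_true(samples: Sequence[Optional[float]], threshold: float,
                      true_on_no_data: bool,
                      persistence_count: Optional[int] = None) -> bool:
    """`samples` holds one entry per data point; None means no data was reported."""
    missing = sum(1 for s in samples if s is None)
    exceeded = sum(1 for s in samples if s is not None and s > threshold)

    if persistence_count is not None:
        # Persistence enabled: each data point is evaluated on its own, and a
        # missing data point counts as true only when the option is enabled.
        true_points = exceeded + (missing if true_on_no_data else 0)
        return true_points >= persistence_count

    if missing:
        # Persistence disabled: missing data alone decides the outcome.
        return true_on_no_data

    # Assumption: with complete data and no persistence threshold, the
    # condition is true if any data point exceeds the threshold.
    return exceeded > 0

# Y = 5 data points: 4 reported, 1 missing, and the threshold (90) exceeded twice.
samples = [95.0, None, 70.0, 92.0, 65.0]

both_enabled = condition_is_true(samples, 90.0, true_on_no_data=True, persistence_count=3)
no_data_only = condition_is_true(samples, 90.0, true_on_no_data=True)
both_disabled = condition_is_true(samples, 90.0, true_on_no_data=False)
```

With both options enabled, the missing data point counts as a third true evaluation, so the persistence threshold of 3 is met even though the metric exceeded the threshold only twice.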