This topic introduces health rules, the policy statements that define triggers in AppDynamics policies.
What is a Health Rule?
Health rules let you specify the parameters that represent what you consider normal or expected operations for your environment. The parameters rely on metric values, for example, the average response time for a business transaction or CPU utilization for a node.
The health statuses are critical, warning, normal, and unknown. When the performance of an entity affected by the rule violates the rule's conditions, a health rule violation exists.
When the health status of an entity changes, a health rule violation event occurs. Examples of health rule violation events are a health rule violation starting, ending, upgrading from warning to critical, or downgrading from critical to warning.
The health statuses of entities and health rule violations are surfaced in the controller user interface. A health rule violation event can also be used to trigger a policy, which can initiate automatic actions, such as sending alerting emails or running remedial scripts.
You create health rules using the health rule wizard, described in Configure Health Rules. The wizard groups commonly-used system entities and related metrics to simplify setting up health rules. You can also use, as is or modified, the default health rules provided by AppDynamics.
Health Rule Scopes
The health rule scope determines the set of default health rule types. You can choose the scope to get a set of default health rule types for applications, servers, or databases. For example, when you choose a mobile application as the scope, you're given health rules such as crash rates and HTTP/network error rates.
If the health rule scope is for an application, the health rules would be for business transactions, CPU/memory utilization, etc.
From Alert & Respond > Health Rules, you can select one of the following health rule scopes from the drop-down list:
- User Experience: Browser Apps
- User Experience: Mobile Apps
You can also create new health rules to add to the default set for each scope. For example, you may want a health rule that monitors app starts for your mobile application. That health rule is not part of the default set for the mobile app scope, so you would need to create it as a new health rule.
Health Rule Types
The health rule wizard groups health rules into types that are categorized by the entity that the health rule covers. This allows the wizard to display appropriate configuration items during the health rule creation process.
The health rule types are:
- Transaction Performance
- Overall Application Performance: Groups metrics related to load, response time, slow calls, stalls, etc. with applications
- Business Transaction Performance: Groups metrics related to load, response time, slow calls, stalls, etc. with business transactions
- Node Health
- Node Health-Hardware, JVM, CLR: Groups metrics like CPU and heap usage, disk I/O, etc. with nodes
- Node Health-Transaction Performance: Groups metrics related to load, response time, slow calls, stalls, etc. with nodes
- Node Health-JMX: Java only; groups metrics related to connection pools, thread pools, etc. with specific JMX instances and objects in specific nodes and tiers
- User Experience-Browser Apps
- IFrames: Groups metrics like first-byte time, requests per minute, etc. with the performance of iframes for the end user
- AJAX Requests: Groups metrics like Ajax callback execution time, errors per minute, etc. with the performance of Ajax requests for the end user
- Virtual Pages: Groups metrics like End User Response Time, Digest Cycles, HTML Download Time, DOM Building Time, etc. for virtual pages created with Angular. See information on what these metrics mean in the context of virtual pages.
- User Experience-Mobile Apps
- Mobile Apps: Groups metrics related to mobile app crashes, starts, and server calls as well as network requests and errors
- Network Requests: Groups metrics like HTTP and network errors, request time, and requests per minute with network requests
- Servers: Groups metrics related to hardware resources
- Databases & Remote Services: Groups metrics related to response time, load, or errors with databases and other backends
- Advanced Network: Groups metrics related to Network Visibility, such as PIE (performance impact events), zero window, data retransmission, and errors. (New in 4.5.2)
- Error Rates: Groups metrics related to exceptions, return codes, and other errors with applications or tiers
- Information Points: Groups metrics like response time, load, or errors with information points
- Service Endpoints: Java and .NET only; groups metrics like average response time, calls per minute, and errors per minute with service endpoints
- Custom: Presents all the metrics collected by the agent that could affect a single business transaction, a single node or overall application performance. Use this type to create rules that evaluate custom metrics.
When you select one of these health rule types, the wizard offers you the metrics commonly associated with that type in an embedded browser.
Health Rule Schedules
The metrics associated with a health rule are evaluated according to a schedule that you control. You can configure:
- when a health rule is in effect
- which data set should be used, based on time
- what special rules should be in place during a violation event
Time evaluation for health rule schedules is based on the time zone of the Controller, regardless of where an app agent is situated. For example, if a Controller is in San Francisco but the app agent is in Dubai, Pacific Time applies to the health rule schedule.
All SaaS Controllers use Pacific Time (PT).
Health Rule Enabled Schedule
By default, health rules are always enabled. Alternatively, you can define a schedule that limits when a health rule is in effect.
Built-in schedules exist for:
- End of business hours
- Weekday lunch
- Weekday mornings
You can also configure your own schedules based on UNIX cron expressions using custom values.
Health Rule Evaluation Window
The health rule evaluation window is the period of time over which the data used to evaluate the health rule is collected.
Different kinds of metrics may provide better results using different sets of data. You can manage how much data AppDynamics uses when it evaluates a particular health rule by setting the data collection time period. The default value is 30 minutes.
- For metrics based on an average calculation, such as average response time, AppDynamics averages the response time over the evaluation window. A five-minute window means that the last five minutes of data are used to evaluate whether the health rule was violated.
- For metrics based on a sum calculation, such as the number of calls, AppDynamics uses the total number of calls counted during the evaluation window.
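The two aggregation styles above can be sketched as follows. This is a minimal illustration, not the Controller's implementation: the function name `evaluate_window`, the per-minute list input, and the sample values are assumptions made for the example, and a plain mean of per-minute values is used where the real product may weight or roll up data differently.

```python
from statistics import mean


def evaluate_window(samples, window_minutes, aggregation):
    """Aggregate the most recent per-minute samples (hypothetical helper).

    samples: per-minute metric values, oldest first.
    window_minutes: size of the evaluation window in minutes.
    aggregation: "average" or "sum".
    """
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    window = samples[-window_minutes:]
    if not window:
        return None  # no data in the window, so nothing can be evaluated
    if aggregation == "average":
        return mean(window)
    if aggregation == "sum":
        return sum(window)
    raise ValueError(f"unsupported aggregation: {aggregation}")


# Illustrative data: one value per minute, oldest first.
response_times_ms = [120, 130, 110, 400, 420, 410, 430, 440]
calls_per_minute = [50, 60, 55, 70, 65, 80, 75, 90]

avg_rt = evaluate_window(response_times_ms, 5, "average")  # last five minutes
total_calls = evaluate_window(calls_per_minute, 5, "sum")
```

With a five-minute window, only the last five samples contribute, so the earlier low response times do not dilute the average.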
Health Rule Wait Time After Violation
The health rule wait time setting lets you control how often an event is generated while the conditions found to violate a health rule continue. If the Controller determines that a health rule has been violated, with a status of either Critical or Warning, an Open Critical or Open Warning event is generated. That event can be used to trigger any policies that match the health rule, and thus to initiate any actions that the policies require.
Once an Open event has occurred, the Controller continues to evaluate the status of the health rule every minute. If the Controller continues to detect the same violation, the violation remains open with the same status. A corresponding Continues Critical or Continues Warning event may be generated to link to any related policies.
But a Continues event every minute might be too noisy for your situation. The health rule's Wait Time after Violation setting is used to throttle how often these Continues events are generated for continuing health rule violations. The default is every 30 minutes.
To use Continues Critical and Continues Warning events, adjust the default Wait Time after Violation value to the desired frequency. Then configure a policy matching that health rule with the Health Rule Violation Continues - Warning and/or Health Rule Violation Continues - Critical events selected in the Health Rule Violation Events section of the policy settings.
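The throttling behavior described above can be sketched as a small simulation. This is an assumption-laden illustration: the function `throttle_events`, the boolean-per-minute input, and the short event labels ("Open", "Continues", "Ended") are hypothetical, and the sketch ignores the distinction between Warning and Critical.

```python
def throttle_events(violating_by_minute, wait_minutes=30):
    """Return (minute, event) pairs for a per-minute evaluation timeline.

    violating_by_minute: one boolean per minute, True while the rule's
    conditions are violated (hypothetical input shape).
    wait_minutes: stands in for the Wait Time after Violation setting.
    """
    events = []
    open_violation = False
    last_event_minute = None
    for minute, violating in enumerate(violating_by_minute):
        if violating and not open_violation:
            events.append((minute, "Open"))
            open_violation = True
            last_event_minute = minute
        elif violating and minute - last_event_minute >= wait_minutes:
            # The violation is still open; emit a throttled Continues event.
            events.append((minute, "Continues"))
            last_event_minute = minute
        elif not violating and open_violation:
            events.append((minute, "Ended"))
            open_violation = False
    return events


# One healthy minute, 65 violating minutes, then recovery.
timeline = [False] + [True] * 65 + [False]
events = throttle_events(timeline, wait_minutes=30)
```

Although the rule is evaluated every minute, the 65-minute violation in this example produces only two Continues events, one per elapsed wait period.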
Note that the violations displayed in the Health Rules Violations page, under Troubleshoot, are updated only when a health rule violation event is triggered.
If the Controller is unable to evaluate the rule—for example, if a node simply stops reporting—the Evaluation Status of the health rule is marked as a grey question mark or Unknown in the Current Evaluation Status tab in the right panel of the health rules list. The current violation event remains open until the Wait Time after Violation period has elapsed, at which point the violation event is closed and a new event is triggered, causing the Health status itself of the rule to display as Unknown.
Default Health Rules
AppDynamics provides a default set of health rules for some products, such as applications and servers. These default health rules vary depending on the entity. To see the default rules, before any health rules have been added to your AppDynamics installation:
- Select the Alert & Respond tab at the top.
- Click Health Rules in the left panel.
- From the drop-down list in the right panel select the entity.
The default health rules are displayed.
If any of these predefined health rules are violated, the affected items are marked in the UI as yellow-orange if it is a Warning violation and red if it is a Critical violation.
In many cases, the default health rules may be the only health rules that you need. If the conditions are not configured appropriately for your application, you can edit them. You can also disable the default health rules.
Health Rule Entities
A health rule can evaluate metrics associated with an entire application or a limited set of entities. For example, you can create business transaction performance health rules that evaluate certain metrics for all business transactions in the application or node health rules that cover all the nodes in the application or all the nodes in specified tiers. The default health rules are in this category.
You can also create health rules that are narrowly applied to a limited set of entities in the application, or even a single entity such as a node or a JMX object or an error. For example, you can create a JMX health rule that evaluates the initial pool size and number of active connections for specific connection pools in nodes that share certain system properties.
Monitoring Serverless Entities
Serverless functions are tracked at the tier level. A serverless function is indicated by a lambda (λ) icon inside each tier. When you configure a health rule for an application comprising serverless entities, you can choose to monitor the serverless tiers in the Affected Entities tab. For information on how various health rules are evaluated for serverless entities comprising tiers for AWS Lambda, see Evaluating Serverless Tiers.
The health rule wizard lets you specify precisely which entities the health rule affects, enabling the creation of very specific health rules. For example, for a business transaction, you can limit the tiers that the health rule applies to, or limit the health rule application to specific business transactions by name or by names that match certain criteria.
For node health rules, you can specify the type of the node, such as Java, .NET, PHP, and so on.
You can specify that a health rule applies only to nodes that meet certain criteria.
The Type of Node pulldown menu does not allow you to specify Node.js, Python, or Web Service nodes. To restrict a health rule to these types of nodes, you can specify the affected entity as a tier and then select only Node.js, Python, or Web Service tiers as needed. To fine-tune the affected nodes further, use the Nodes matching the following criteria menu item to specify node names, matching environment variables, or meta-info that restrict the health rule to the nodes you want.
Entities Affected by a Health Rule
For an Overall Application Performance health rule type, the health rule applies to the entire application, regardless of the business transaction, tier, or node.
If you configure your health rule to work with tiers, you must also configure the corresponding policy to work with tiers. If the health rule is configured with tiers but the policy is configured with nodes, no actions or notifications are triggered. The inverse is also true. The following screenshots show examples of a health rule and a policy created in the correct order.
The following table lists the entities that you can apply health rules to.
|Health rule type||Applicable Entities|
|Business Transaction Performance|
|Databases & Remote Services health|
|Node Health—Transaction Performance or Node Health—Hardware, JVM, CLR|
|User Experience - Mobile Apps|
|User Experience - Mobile Network Requests|
|User Experience - Browser Apps—Pages, iframes, Ajax Requests, Virtual Pages, Synthetic jobs|
Health Rule Conditions
You define the acceptable range for a metric by establishing health rule conditions. A health rule condition sets the metric levels that constitute a Warning status and a Critical status.
A condition consists of a Boolean statement that compares the current value of a metric against one or more static or dynamic thresholds based on a selected baseline. If the condition is true, the health rule violates. The rules for evaluating a condition using multiple thresholds depend on configuration.
Static thresholds are straightforward. For example, is a business transaction's average response time greater than 200 ms?
Dynamic thresholds are based on a percentage in relation to, or a standard deviation from, a baseline built on a rolled-up baseline trend pattern. A daily trend baseline rolls up values for a particular hour of the day during the last thirty days, whereas a weekly trend baseline rolls up values for a particular hour of the day, for a particular day of the week, for the last 90 days. For more information about baselines, see Dynamic Baselines.
You can define a threshold for a health rule based on a single metric value or on a mathematical expression built from multiple metric values.
The following are typical health rule conditions:
- If the value of the Average Response Time is greater than the default baseline by 3 X the Baseline Standard Deviation . . .
- If the count of the Errors Per Minute is greater than 1000 . . .
- If the number of MB of Free Memory is less than 2 X the Default Baseline . . .
- If the value of Errors per Minute/Calls per Minute over the last 15 days > 0.2 . . .
The last example combines two metrics in a single condition. You can use the expression builder embedded in the health rules wizard to create conditions based on a complex expression comprising multiple interdependent metrics.
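The typical conditions listed above can be expressed as simple Boolean checks. The sketch below is illustrative only: the function names are hypothetical, the baseline mean and standard deviation are assumed to be supplied by the caller, and "greater than the baseline by N standard deviations" is interpreted as `value > baseline + N * stddev`.

```python
def exceeds_baseline_by_stddev(value, baseline_mean, baseline_stddev, multiplier):
    """Dynamic threshold: value is above the baseline by N standard deviations."""
    return value > baseline_mean + multiplier * baseline_stddev


def exceeds_static(value, threshold):
    """Static threshold: value is above a fixed number."""
    return value > threshold


def ratio_exceeds(numerator, denominator, threshold):
    """Expression over two metrics, such as errors per minute / calls per minute."""
    if denominator == 0:
        return False  # no calls, so the ratio is undefined; treat as not violating
    return numerator / denominator > threshold


# Illustrative values: baseline average response time of 300 ms, stddev of 40 ms.
slow = exceeds_baseline_by_stddev(450, 300, 40, 3)   # 450 > 300 + 3 * 40
noisy = exceeds_static(1200, 1000)                   # errors per minute > 1000
error_ratio_high = ratio_exceeds(30, 100, 0.2)       # 0.3 > 0.2
```

Guarding the division keeps the combined-metric condition from failing when there is no load in the evaluation window.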
Often a condition consists of multiple statements that evaluate different metrics. A health rule is violated either when any one of its conditions evaluates to true or when all of its conditions evaluate to true, depending on how the condition is configured.
For example, a health rule that measures response time (average response time greater than some baseline value) makes more business sense if it is correlated with the application load on the system (for example, 50 concurrent users or 10,000 calls per minute). You may not want to use the response time condition alone in a policy that initiates a remedial action if the load is low, even if the response time threshold is reached. The first part of the condition would evaluate the response time performance measurement and the second part would ensure that the health rule is violated only when there is sufficient load.
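The response-time-plus-load example can be sketched as two conditions combined with an "all" or "any" match. The `Condition` class, the `rule_violated` function, the metric names, and the thresholds are all hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Condition:
    name: str
    test: Callable[[Dict[str, float]], bool]


def rule_violated(conditions: List[Condition], metrics: Dict[str, float], match: str = "all") -> bool:
    """Combine condition results; match is "all" or "any"."""
    if not conditions:
        return False  # a rule with no conditions never violates in this sketch
    results = [c.test(metrics) for c in conditions]
    if match == "all":
        return all(results)
    if match == "any":
        return any(results)
    raise ValueError(f"unsupported match mode: {match}")


conditions = [
    Condition("slow response", lambda m: m["avg_response_time_ms"] > 500),
    Condition("sufficient load", lambda m: m["calls_per_minute"] > 10_000),
]

busy = {"avg_response_time_ms": 800, "calls_per_minute": 12_000}
quiet = {"avg_response_time_ms": 800, "calls_per_minute": 200}

violated_when_busy = rule_violated(conditions, busy, match="all")
violated_when_quiet = rule_violated(conditions, quiet, match="all")
```

Requiring all conditions means the slow response time under low load does not violate the rule, which avoids triggering remedial actions when few users are affected.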
Health Rule Evaluation Scope
The health rule evaluation scope defines how many nodes in the affected entities must violate the condition before the health rule is considered violated.
Evaluation scope applies only to business transaction performance type health rules and node health type health rules in which the affected entities are defined at the tier level.
For example, you may have a critical condition in which the condition is unacceptable for any node, or you may want to consider the condition a violation only if the condition is true for 50% or more of the nodes in a tier.
Options for this evaluation scope are:
- The tier average: Evaluation is performed on the tier average instead of the individual nodes.
- Any node: If any node exceeds the thresholds, the rule is violated.
- Percentage of the nodes: If x% of the nodes exceed the thresholds, the rule is violated.
- Number of nodes: If x nodes exceed the thresholds, the rule is violated.
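The four evaluation scope options can be compared side by side in a short sketch. The function `tier_violated`, the scope labels, and the node values are assumptions for illustration. The tier average here is a plain mean of node values, which may differ from how the Controller computes it, and "x%" is read as "x% or more".

```python
def tier_violated(node_values, threshold, scope, x=None):
    """Decide whether a tier-level rule is violated (hypothetical helper).

    node_values: mapping of node name to the metric value for that node.
    scope: "tier_average", "any_node", "percentage", or "number".
    x: percentage or node count, required for the last two scopes.
    """
    if not node_values:
        return False
    values = list(node_values.values())
    violating = sum(1 for v in values if v > threshold)
    if scope == "tier_average":
        return sum(values) / len(values) > threshold
    if scope == "any_node":
        return violating >= 1
    if scope == "percentage":
        return 100 * violating / len(values) >= x
    if scope == "number":
        return violating >= x
    raise ValueError(f"unsupported scope: {scope}")


# Illustrative CPU utilization per node, with a threshold of 80%.
cpu = {"node1": 95, "node2": 40, "node3": 50, "node4": 45}

by_average = tier_violated(cpu, 80, "tier_average")      # mean is 57.5
by_any = tier_violated(cpu, 80, "any_node")              # node1 exceeds 80
by_percent = tier_violated(cpu, 80, "percentage", x=50)  # only 25% exceed
by_count = tier_violated(cpu, 80, "number", x=2)         # only 1 node exceeds
```

The same data violates the rule under the "any node" scope but not under the other three, which shows why the scope choice matters for noisy single-node spikes.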
Critical and Warning Conditions
Conditions are classified as either critical or warning conditions.
Critical conditions are evaluated before warning conditions. If you have defined a critical condition and a warning condition in the same health rule, the warning condition is evaluated only if the critical condition is not true.
The configuration procedures for critical and warning conditions are identical, but you configure these two types of conditions in separate panels. You can copy a critical condition configuration to a warning configuration and vice-versa and then adjust the metrics in the copy to differentiate them. For example, in the Critical Condition panel you can create a critical condition based on the rule:
- If the Average Response Time is greater than 1000
Then from the Warning Condition panel, copy that condition and edit it to be:
- If the Average Response Time is greater than 500
As performance changes, a health rule violation can be upgraded from warning to critical if performance deteriorates to the higher threshold or downgraded from critical to warning if performance improves to the warning threshold.
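The evaluation order and the resulting upgrade and downgrade transitions can be sketched as follows. The function names and the event labels are hypothetical; the thresholds come from the example above, with Average Response Time assumed to be in milliseconds.

```python
def evaluate_status(avg_response_time_ms, critical_ms=1000, warning_ms=500):
    """Check the critical condition first; check warning only if critical is not true."""
    if avg_response_time_ms > critical_ms:
        return "critical"
    if avg_response_time_ms > warning_ms:
        return "warning"
    return "normal"


def violation_event(previous, current):
    """Name the change between two consecutive statuses (illustrative labels)."""
    if previous == current:
        return None
    if previous == "normal":
        return "started"
    if current == "normal":
        return "ended"
    if current == "critical":
        return "upgraded"
    return "downgraded"


# Illustrative sequence of average response times, one per evaluation.
readings_ms = [300, 700, 1200, 800, 400]
statuses = [evaluate_status(r) for r in readings_ms]
events = [violation_event(a, b) for a, b in zip(["normal"] + statuses, statuses)]
```

Because the critical check runs first, a reading of 1200 is reported as critical even though it also satisfies the warning threshold.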
Preparing to Set Up Health Rules
AppDynamics recommends the following process to set up health rules for your application:
- Identify the key metrics on the key entities that you need to monitor.
- Click Alert & Respond > Health Rules to examine any default health rules that were provided by AppDynamics.
- Compare your list of metrics with the metrics configured in these rules.
- If the default health rules cover all the key metrics you need, determine whether the pre-configured conditions are applicable to your environment. If necessary, modify the conditions for your needs.
- You can also view the list of affected entities for each of the default health rules and modify the entities.
- If default health rules do not cover all your needs or if you need very finely-applied health rules to cover specific use cases, create new health rules.
- Create schedules for health rules, if needed.
In some situations, a health rule is more useful if it runs at a particular time. See Health Rule Schedules.
Health Rule Management
To view current health rules, including the default health rules, and to access the health rule wizard, click Alert & Respond > Health Rules. Then choose the type of entity for which you want health rules from the pulldown menu at the top.
Current health rules are listed in the left panel. If you click one of these rules, a list appears in the right panel showing which entities this selected health rule affects and what the status of the latest evaluation is. You can also select the Evaluation Events tab to see a detailed list of evaluation events.
In the left panel, you can directly delete or duplicate a health rule. From here you can also access the health rule wizard to add a new rule or edit an existing one.
You can turn off evaluation of all health rules in the selected entity by clearing the Evaluate Health Rules check box. Check it when you want health rule evaluation to start again.
See Configure Health Rules for details on using the health rule wizard.
View Health Rule Status in the UI
Across the UI, health rule status is color-coded:
- Green is healthy
- Yellow/orange is a warning condition
- Red is a critical condition
- Grey indicates that the status of the health rule is unknown (for example, if the Controller cannot gather the data necessary to evaluate the rule)
If you see a health rule violation reported in the UI, you can click it to get more information about the violation.
Health summary bars are displayed on the built-in dashboards.
A health column is displayed in various lists, such as the tier list.
In the dashboards, health rule violations are displayed in the Events panel.