Alerts let you know when problems exist and help you anticipate problems that might be developing. Responses let you automate preventative actions to address those problems before they cause a severe slowdown or outage. Think of alert and respond as the automation of your runbooks.
The alert and respond system is made up of three parts:
- Health rules: define key performance metric thresholds for your application, across the stack.
- Policies: link health rule violations, and other performance-based events, with appropriate actions.
- Actions: automate what should be done in a wide variety of situations, including sending alerts and performing diagnostic and remedial tasks.
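The relationship between the three parts can be pictured as a small pipeline: a health rule decides whether something is wrong, and a policy runs its linked actions only when that rule is violated. The sketch below is a hypothetical Python model written to illustrate that flow. The class names, the single-threshold rule, and the two sample actions are assumptions made for illustration. They are not AppDynamics APIs, and this is not how the Controller actually evaluates rules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical model of health rules, policies, and actions; not an AppDynamics API.

# An action receives the name of the violated rule and returns a description of what it did.
Action = Callable[[str], str]


@dataclass
class HealthRule:
    name: str
    metric: str
    threshold: float  # violated when the metric's value is above this threshold

    def is_violated(self, metrics: Dict[str, float]) -> bool:
        # A missing metric is treated as "no data", not as a violation.
        return self.metric in metrics and metrics[self.metric] > self.threshold


@dataclass
class Policy:
    rule: HealthRule
    actions: List[Action] = field(default_factory=list)

    def evaluate(self, metrics: Dict[str, float]) -> List[str]:
        # Run every linked action only when the health rule is violated.
        if not self.rule.is_violated(metrics):
            return []
        return [action(self.rule.name) for action in self.actions]


def notify_on_call(rule_name: str) -> str:
    # Stand-in for an alert action (whom to notify).
    return f"alert sent: {rule_name}"


def collect_diagnostics(rule_name: str) -> str:
    # Stand-in for a diagnostic or remedial action (what to do).
    return f"diagnostics collected: {rule_name}"


policy = Policy(
    rule=HealthRule("Response time too high", "avg_response_ms", 500.0),
    actions=[notify_on_call, collect_diagnostics],
)
print(policy.evaluate({"avg_response_ms": 820.0}))  # both actions run
print(policy.evaluate({"avg_response_ms": 120.0}))  # no violation, nothing runs
```

Keeping the rule separate from the actions mirrors the separation described above: the same health rule can be reused by several policies, each pairing it with different actions.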
Sample Use Cases
The AppDynamics platform recognizes some broad-based health issues commonly experienced by applications, such as "Business Transaction response time is much higher than normal" or "Memory utilization is too high". These are configured as default health rules, which define how high is "much higher than normal" or "too high". Use policies to attach these rules to alerts (whom to notify) and responses (what to do) when these problems exist. You can use these rules "as is" or modify them for your environment. See Default Health Rules.
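One common way to make "much higher than normal" precise is to compare the current value against a baseline built from recent history and to flag it when it exceeds that baseline by some number of standard deviations. The sketch below shows that idea in Python. It assumes the baseline is the plain mean of a list of recent samples and uses k = 3 as an illustrative multiplier. Neither choice is taken from the actual default health rules, which should be checked in Default Health Rules.

```python
from statistics import mean, stdev
from typing import Sequence


def much_higher_than_normal(history: Sequence[float], current: float, k: float = 3.0) -> bool:
    """Return True if `current` is more than k standard deviations above the baseline.

    The baseline is the mean of `history`; k = 3.0 is an illustrative default,
    not a value taken from the product's default health rules.
    """
    if len(history) < 2:
        return False  # stdev needs at least two points, so there is no baseline yet
    baseline = mean(history)
    spread = stdev(history)
    return current > baseline + k * spread


recent_ms = [200.0, 210.0, 190.0, 205.0, 195.0]  # mean 200 ms, sample stdev about 7.9 ms
print(much_higher_than_normal(recent_ms, 260.0))  # True: above roughly 223.7 ms
print(much_higher_than_normal(recent_ms, 215.0))  # False
```

Modifying a default rule "for your environment" then amounts to changing inputs such as k or the window of history the baseline is built from.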
In addition to the broad-based rules, you can customize precise automatic alerts and responses for narrowly circumscribed situations. This lets you fine-tune your system so that the right alert goes to the right person and the right action is taken for the right problem on the right cluster or server.
- You do not want to alert your team if performance in a few clusters is lagging, but you do want to trigger an alert if more than 20% of the clusters are unhealthy, or if servers in particular clusters, or servers that meet certain criteria, are performing poorly. You can define health rules that apply to specific tiers or nodes. If such a rule is violated, the system knows exactly which entity is experiencing problems and therefore whom to alert. For example, a rule can be scoped so that it affects only one node, such as the order processing server.
- Performance is deteriorating in one business transaction, so you want to view snapshots for that transaction. Create a diagnostic action that collects snapshots for it.
- You want to send an alert whenever an app agent stops reporting to the Controller. Create a node health rule based on the value of the Availability metric reported by the agent. If Availability is less than 1, the agent is not reporting.
- You have a large operation with several development teams, each responsible for a different service. Create a health rule for one service and copy it. Then create a separate policy for each copy, pairing that copy of the health rule with an alert addressed to the appropriate team.
- You have an application that performs well under normal load, but peak loads can cause it to slow down. During peak load, AppDynamics not only detects connection pool contention, but also lets you create a remediation script that automates increasing or decreasing the size of the connection pool. You can require human approval before this script runs, or configure it to execute automatically when it is triggered. Create a runbook and associate it with a policy so that it fires when the connection pool is exhausted.
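Two of the conditions in these use cases, the "more than 20% of the clusters are unhealthy" threshold and the "Availability is less than 1" agent check, can be written as simple predicates. The sketch below is hypothetical Python meant only to make the arithmetic concrete. The function names and the dictionary of per-cluster health flags are assumptions. In the product, these conditions are configured in health rules, not written as code.

```python
from typing import Mapping


def too_many_clusters_unhealthy(cluster_is_healthy: Mapping[str, bool], max_fraction: float = 0.20) -> bool:
    """Return True when more than max_fraction of the clusters are unhealthy."""
    if not cluster_is_healthy:
        return False  # no clusters reported, nothing to evaluate
    unhealthy = sum(1 for healthy in cluster_is_healthy.values() if not healthy)
    return unhealthy / len(cluster_is_healthy) > max_fraction


def agent_not_reporting(availability: float) -> bool:
    """An Availability value below 1 means the agent has stopped reporting."""
    return availability < 1


clusters = {f"cluster-{i}": True for i in range(10)}
clusters["cluster-3"] = False
clusters["cluster-7"] = False
print(too_many_clusters_unhealthy(clusters))  # False: 2 of 10 is exactly 20%, not more
clusters["cluster-8"] = False
print(too_many_clusters_unhealthy(clusters))  # True: 3 of 10 is 30%
print(agent_not_reporting(0))  # True
```

The strict "greater than" comparison matches the wording "more than 20%": a few lagging clusters stay below the alert threshold, as the first use case intends.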