The Tier Metric Correlator enables you to identify load and performance anomalies across all nodes in a tier. Suppose you have a tier composed of a cluster of nodes running on containers or servers. You expect all the nodes to behave exactly the same under the same load conditions. How can you monitor this cluster for anomalies and outliers? The Tier Metric Correlator makes it easy to answer the following questions:

  • Are all the nodes behaving within the expected band of performance, or do some nodes have outliers (slow calls, stalls, and errors)?
  • In what time windows, and on which nodes, are these outliers occurring?
  • Are outliers associated with a specific node cluster—for example, a cluster running a canary release?
  • Are any resource issues (such as high CPU I/O or paging) correlated with these outliers?
  • Are calls getting distributed evenly across all nodes in the tier?

This topic describes two example use cases:

  • Using Tier Metric Correlation to quickly identify balances and imbalances in the distribution of calls across all nodes in a tier.
  • Using Tier Metric Correlation to monitor and troubleshoot a canary deployment scenario. This example shows how you can easily compare performance across node clusters, identify nodes with transaction outliers, and see if any resource issues are causing these outliers.

Key Concepts

Transaction Outliers

The first step in metric correlation is to determine the transaction outliers for a tier: the number of Slow Calls, Very Slow Calls, and Stalled Calls whose response times are significantly outside the norm for that tier. The Transaction Outliers Heatmap visualizes the rate of these outliers and their distribution across all nodes in the tier.

A heatmap is a time-series chart with an extra dimension: the color intensity of each bar shows the distribution of outliers across all nodes. The darker the hue, the more nodes have outliers. The chart colors the bars in shades of gray (fewer outliers) and orange (more outliers). 
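To make the idea of "color intensity as node count" concrete, the following is a minimal sketch of how such a chart could be bucketed. It is not the Controller's implementation: the function name `heatmap_cells`, the `(node, minute, value)` sample shape, the node names, and the `band_width` parameter are all assumptions made for illustration.

```python
from collections import defaultdict


def heatmap_cells(samples, band_width):
    """Group samples into heatmap cells and count distinct nodes per cell.

    samples: iterable of (node, minute, value) tuples (hypothetical shape).
    Returns {(minute, band_start): number_of_nodes_in_that_cell}.
    """
    nodes_in_cell = defaultdict(set)
    for node, minute, value in samples:
        # Snap the value down to the start of its band, e.g. 11 -> 10 for width 5.
        band_start = int(value // band_width) * band_width
        nodes_in_cell[(minute, band_start)].add(node)
    return {cell: len(nodes) for cell, nodes in nodes_in_cell.items()}


# Illustrative data: three nodes, two one-minute intervals.
samples = [
    ("node-1", 0, 2), ("node-2", 0, 3), ("node-3", 0, 11),
    ("node-1", 1, 1), ("node-2", 1, 4), ("node-3", 1, 2),
]
cells = heatmap_cells(samples, band_width=5)
```

In this sketch, a cell with a higher node count would be drawn in a darker hue. At minute 0, node-3 falls into a separate band from the other two nodes, which is the kind of outlier a heatmap makes visible.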

Heatmaps make it easy to identify the normal bands of performance and any outliers for all nodes in a tier. In the following example, all nodes fall within two bands:

This heatmap highlights two time windows where the performance metric is noticeably higher for 50% of the nodes:

Correlated Metrics

The Correlated Metrics heatmaps enable you to correlate transaction outliers with specific resource metrics. The Correlate Metrics window makes it easy to identify performance metrics that correlate (and do not correlate) with transaction outliers of interest. The following example shows three heatmaps:

  1. Transaction Outliers 
  2. Correlated metric
  3. Uncorrelated metric (no outliers)

Troubleshooting Nodes and Servers

When you select a set of transaction outliers, the Selected Nodes and Impacted Servers charts show the distribution of these outliers across all nodes in the tier. This makes it easy to see if these outliers are associated with specific node clusters. Double-click on a pie chart to troubleshoot the node or server.

Example Use Cases

Comparing Load Distributions

In this example, a DevOps engineer wants to ensure that transaction calls are getting distributed evenly to all the nodes in a tier. She goes to the Tiers & Nodes view, right-clicks on a tier, and chooses Correlate Metrics. Looking at the Calls Per Minute heatmap, she can immediately see that the band of performance is 20-22 calls per minute, but for some nodes, the rate is higher during certain intervals. She decides to investigate the relevant load balancer and finds that a simple misconfiguration is causing the device to distribute calls unevenly at certain times. Using heatmaps, she can identify and fix a minor issue before it has a significant impact on her team's mission-critical applications.
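The check she performs visually can be sketched in a few lines. This is only an illustration under stated assumptions: the per-node series, the node names, and the helper `out_of_band` are hypothetical, and the 20-22 band is taken from the example above rather than computed.

```python
def out_of_band(calls_per_minute, low, high):
    """Return {node: [minute indexes]} where the call rate leaves the band.

    calls_per_minute: {node: [rate at minute 0, rate at minute 1, ...]}
    (hypothetical input shape). Nodes that stay within the band are omitted.
    """
    result = {}
    for node, series in calls_per_minute.items():
        minutes = [m for m, rate in enumerate(series) if not (low <= rate <= high)]
        if minutes:
            result[node] = minutes
    return result


# Illustrative data: node-2 receives extra calls during minutes 1 and 2.
calls = {
    "node-1": [21, 20, 22, 21],
    "node-2": [20, 29, 31, 21],
    "node-3": [22, 21, 20, 20],
}
imbalanced = out_of_band(calls, low=20, high=22)
```

Here only node-2 is reported, for minutes 1 and 2, which corresponds to the higher-rate intervals that stand out on the Calls Per Minute heatmap.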

Canary Deployment Testing

A DevOps engineer is responsible for a four-tier e-commerce application. The Order-Tier has five nodes running version 1.0 of the service. She deploys a "canary" (version 1.1 of the service) on one node. Before she deploys 1.1 on all nodes, she wants to see if there is any performance degradation on this node.

She opens the Controller, goes to the Tiers & Nodes view for the application of interest, right-clicks on the Order-Tier, and chooses Correlate Metrics. The Tier Metric Correlator appears.

The Transaction Outliers heatmap shows that some calls are outliers: Errors, Slow Calls, Very Slow Calls, or Stalled Calls whose response times are significantly higher than the band of performance for that tier.

Her first question is: Are these outliers associated with our "canary node" (ORD-N1)? She drag-selects a set of these outliers. The Node Distribution in Selection pie chart (right) shows that all outliers are associated with the canary node. Clearly, the new code is not performing as well as the old code.
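The pie chart is essentially a per-node share of the outliers inside the selected time window. A minimal sketch of that calculation follows. The function `node_distribution`, the `(node, minute)` event shape, and the node name ORD-N2 are assumptions for illustration; only ORD-N1 comes from the example.

```python
from collections import Counter


def node_distribution(outliers, start, end):
    """Share of outliers per node within the selected window [start, end].

    outliers: iterable of (node, minute) events (hypothetical shape).
    Returns {node: fraction_of_selected_outliers}, or {} if nothing is selected.
    """
    counts = Counter(node for node, minute in outliers if start <= minute <= end)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {node: count / total for node, count in counts.items()}


# Illustrative data: the canary node has three outliers in minutes 3-5,
# and another node has a single outlier later on.
outliers = [("ORD-N1", 3), ("ORD-N1", 4), ("ORD-N2", 9), ("ORD-N1", 5)]
selection = node_distribution(outliers, start=3, end=5)
```

For the selected window, every outlier belongs to ORD-N1, which matches a pie chart showing a single slice for the canary node.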

Her next question is: Are any resource issues causing these outliers? To answer this question, she examines the Correlated Metrics heatmaps to look for metrics that correlate with the outliers on the canary node. Most heatmaps show no correlation. For example, CPU Busy% shows that all nodes stay within the band of performance of 0-20%. 

However, the CPU I/O Wait 95th Percentile(%) heatmap shows a strong correlation: All the metric outliers occur on the canary node, while all other nodes remain within the band of performance. 

The Pages paged out 95th Percentile (pages) heatmap also shows a strong correlation with the transaction outliers on the canary node. With just a few clicks, she can immediately see that the canary node is performing worse; that the node has a CPU I/O problem; and that the CPU I/O problem is related to paging, which indicates a disk problem. 
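One simple way to reason about "correlated" versus "uncorrelated" heatmaps is to compare where the metric leaves its band with where the transaction outliers occur. The sketch below uses a Jaccard overlap of (node, minute) cells. This is a swapped-in illustration, not the product's correlation method. The helper names, the sample values, the node name ORD-N2, and the 0-5% band for I/O wait are assumptions; the 0-20% band for CPU Busy% comes from the example above.

```python
def cells_outside_band(series_by_node, low, high):
    """Return the set of (node, minute) cells where a metric leaves [low, high]."""
    return {
        (node, minute)
        for node, series in series_by_node.items()
        for minute, value in enumerate(series)
        if not (low <= value <= high)
    }


def jaccard(a, b):
    """Overlap of two sets: 1.0 means identical, 0.0 means disjoint or both empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Transaction outliers observed on the canary node in minutes 1 and 2.
txn_outliers = {("ORD-N1", 1), ("ORD-N1", 2)}

# Illustrative metric series, one value per minute.
cpu_busy = {"ORD-N1": [10, 12, 15], "ORD-N2": [8, 9, 11]}
io_wait = {"ORD-N1": [2, 40, 55], "ORD-N2": [1, 2, 3]}

cpu_score = jaccard(txn_outliers, cells_outside_band(cpu_busy, 0, 20))
io_score = jaccard(txn_outliers, cells_outside_band(io_wait, 0, 5))
```

In this sketch, CPU Busy% never leaves its band, so its score is 0.0, while the I/O wait outliers line up exactly with the transaction outliers and score 1.0. A score near 1.0 corresponds to the "strong correlation" case described above.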

To reduce the visual noise and highlight the correlations, she unchecks all the uncorrelated metrics. Her next step is to investigate and troubleshoot the underlying server. To drill down into the canary node, she double-clicks on the Server Distribution in Selection pie chart (bottom right). 

The Server Dashboard for the canary node appears. She goes to the Volumes tab and sees a lot of spikes in I/O operations and queue wait times. She decides that the canary code is not ready to deploy to the entire tier. She needs to re-examine the canary code, fix the regression, and re-test.

Enabling Tier Metric Correlation

Tier Metric Correlation must be enabled per account on the Controller. If you are using a SaaS Controller, contact AppDynamics Support. If you are using an on-premises Controller, do the following:

  1. Log in to the Controller administration console using the root user password:
    http://<controller host>:<port>/controller/admin.jsp
  2. Go to the Accounts page and select the account for which you want to enable this feature.
  3. Click Add Property in the account settings and add the following:
    ENABLE_SIM_HEATMAPS = true

To correlate percentile metrics, you must enable percentile metric reporting on both the Controller and the Machine Agent. By default, reporting is disabled on the Controller and enabled on the agent.

Workflow Description

The following steps outline the general workflow:

  • Go to the Tiers & Nodes view for the application of interest.
  • Right-click on the tier and choose Correlate Metrics. The Tier Metric Correlator appears.
  • Identify transaction outliers for the tier: calls flagged as Slow, Very Slow, Stalls, or Errors (see Business Transaction Performance).
  • Drag the cursor to select the outliers of interest. The pie charts on the right show the distribution of these outliers by node and server.
  • Identify any correlated metrics for the outliers. Clear the checkboxes for uncorrelated metrics.
  • Identify and troubleshoot the nodes and servers where the outliers are occurring. Double-click on a pie slice to open the dashboard view. You can use the correlated metrics to guide your troubleshooting.
