The Tier Metric Correlator enables you to identify load and performance anomalies across all nodes in a tier. Suppose you have a tier composed of a cluster of nodes running on containers or servers. You expect all the nodes to behave exactly the same under the same load conditions. How can you monitor this cluster for anomalies and outliers? You can use the Tier Metric Correlator to answer the following questions:

This page describes two example use cases

Key Concepts

Transaction Outliers

The first step in metric correlation is to determine the transaction outliers for a tier: the number of Slow Calls, Very Slow Calls, and Stalled Calls whose response times are significantly outside the norm for that tier. The Transaction Outliers Heatmap visualizes the rate of these outliers and their distribution across all nodes in the tier.

A heatmap is a time-series chart with an extra dimension: the color intensity of each bar shows the distribution of outliers across all nodes. The darker the hue, the more nodes have outliers. The chart colors the bars in shades of gray (fewer outliers) and orange (more outliers). 

Transaction Outliers in a Tier

Heatmaps enable you to easily identify the normal bands of performance and any outliers for all nodes in a tier. In this example, all nodes fall within two bands:

Heatmap

This heatmap highlights two time windows where the performance metric is noticeably higher for 50% of the nodes:

Heatmap Time Windows

Correlated Metrics

The Correlated Metrics heatmaps enable you to correlate transaction outliers with specific resource metrics. The Correlate Metrics panel enables you to easily identify performance metrics that correlate (and do not correlate) with transaction outliers of interest. This example shows three heatmaps:

  1. Transaction outliers 
  2. Correlated metric
  3. Uncorrelated metric (no outliers)

Correlated Metrics

Troubleshooting Nodes and Servers

When you select a set of transaction outliers, the Selected Nodes and Impacted Servers charts show the distribution of these outliers across all nodes in the tier. This makes it easy to determine if these outliers are associated with specific node clusters. Double-click a pie chart to troubleshoot the node or server.

Pie Chart

Example Use Cases

Comparing Load Distributions

In this example, a DevOps engineer wants to ensure that transaction calls are getting distributed evenly to all the nodes in a tier. The engineer accesses the Tiers & Nodes view, right-clicks a tier, and selects Correlate Metrics. Looking at the Calls Per Minute heatmap, the engineer can immediately see that the band of performance is 20-22 calls per minute, but for some nodes, the rate is higher during certain intervals. The engineer decides to investigate the relevant load balancer and finds that a simple misconfiguration is causing the device to distribute calls unevenly at certain times. Using heatmaps, they can identify and fix a minor issue before it has a significant impact on their team's mission-critical applications.

Comparing Load Distributions

Canary Deployment Testing

A DevOps engineer is responsible for a four-tier e-commerce application. The Order-Tier has five nodes running version 1.0 of the service. The engineer deploys a "canary" (version 1.1 of the service) on one node. Before the engineer deploys 1.1 on all nodes, they want to determine if there is any performance degradation on this node.

Canary Deployment Testing

The engineer opens the Controller, accesses the Tiers & Nodes view for the application of interest, right-clicks the Order-Tier, and selects Correlate Metrics. The Tier Metric Correlator displays.

The Transaction Outliers heatmap shows that some calls are outliers: Errors, Slow Calls, Very Slow Calls, or Stalled Calls whose response times are significantly higher than the band of performance for that tier.

Tier Metrics Correlator

The first question is: Are these outliers associated with our "canary node" (ORD-N1)? The engineer drag-selects a set of these outliers. The Node Distribution in Selection pie chart (right) shows that all outliers are associated with the canary node. Clearly, the new code is not performing as well as the old code.

   Tier Metrics with Pie Chart

The next question is: Are any resource issues causing these outliers? The engineer examines the Correlated Metrics heatmaps to look for metrics that correlate with the outliers on the canary node. Most heatmaps show no correlation. For example, CPU Busy% shows that all nodes stay within the band of performance of 0-20%. 

Correlated Metrics Heatmap 

However, the CPU I/O Wait 95th Percentile(%) heatmap shows a strong correlation: All the metric outliers occur on the canary node, while all other nodes remain within the band of performance. 

Tier Metrics Correlator

The Pages paged out 95th Percentile (pages) heatmap also shows a strong correlation with the transaction outliers on the canary node. With just a few clicks, they can immediately see that the canary node is performing worse; that the node has a CPU I/O problem; and that the CPU I/O problem is related to paging, which indicates a disk problem. 


   Pages paged out 95th Percentile

To reduce the visual noise and highlight the correlations, the engineer unchecks all the uncorrelated metrics. The next step is to investigate and troubleshoot the underlying server. To drill down into the canary node, they double-click on the Server Distribution in Selection pie chart (bottom right). 

  Server Distribution in Selection Chart

The Server Dashboard for the canary node displays. The engineer selects the Volumes tab and sees many spikes in I/O operations and queue wait times. They decide that the canary code is not ready to deploy to the entire tier, and need to re-examine the canary code, fix the regression, and re-test.

Canary Code

Enabling Tier Metric Correlation

You enable Tier Metric Correlation per account on the Controller.
If you are using a SaaS Controller, contact customer support
If you are using an on-premises Controller:

  1. Log in to the Controller administration console using the root user password:
    http://<controller host>:<port>/controller/admin.jsp
  2. Access the Accounts page, and select the account for which you want to enable this feature.
  3. Select Add Property in the accounting settings and add: 
    ENABLE_SIM_HEATMAPS = true

To correlate percentile metrics, you must enable percentile metric reporting on both the Controller and the Machine Agent. By default, reporting is disabled on the Controller and enabled on the Agent.

Workflow Description

These steps outline the workflow description:

Workflow Description for Correlate Metrics