The Tier Metric Correlator enables you to identify load and performance anomalies across all nodes in a tier. Suppose you have a tier composed of a cluster of nodes running on containers or servers. You expect all the nodes to behave exactly the same under the same load conditions. How can you monitor this cluster for anomalies and outliers? You can use the Tier Metric Correlator to answer the following questions:
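This page describes two example use cases: checking that calls are distributed evenly across the nodes in a tier, and evaluating the performance of a canary deployment before rolling it out to the whole tier.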
The first step in metric correlation is to determine the transaction outliers for a tier: the number of Slow Calls, Very Slow Calls, and Stalled Calls whose response times are significantly outside the norm for that tier. The Transaction Outliers Heatmap visualizes the rate of these outliers and their distribution across all nodes in the tier.
A heatmap is a time-series chart with an extra dimension: the color intensity of each bar shows the distribution of outliers across all nodes. The darker the hue, the more nodes have outliers. The chart colors the bars in shades of gray (fewer outliers) and orange (more outliers).
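To make the two dimensions of the heatmap concrete, here is a minimal Python sketch. It is an illustration only, not the Controller's implementation: the event format, the `outlier_heatmap` function, and the one-minute bucket size are all assumptions. For each time bucket it computes the number of outliers (the bar height) and the fraction of nodes that reported at least one outlier (the color intensity).

```python
from collections import defaultdict

def outlier_heatmap(events, node_count, bucket_seconds=60):
    """Return {bucket_start: (outlier_count, fraction_of_nodes_with_outliers)}.

    events is an iterable of (timestamp_seconds, node_name) pairs, one per
    outlier call. This input format is hypothetical.
    """
    counts = defaultdict(int)
    nodes = defaultdict(set)
    for timestamp, node in events:
        bucket = timestamp - (timestamp % bucket_seconds)
        counts[bucket] += 1
        nodes[bucket].add(node)
    return {
        bucket: (counts[bucket], len(nodes[bucket]) / node_count)
        for bucket in sorted(counts)
    }

# Hypothetical sample data for a four-node tier.
events = [(0, "N1"), (10, "N1"), (30, "N2"), (65, "N1"), (130, "N3"), (140, "N3")]
heatmap = outlier_heatmap(events, node_count=4)
```

In this sample, the first bucket would be drawn darker than the others because half of the nodes report outliers in it, even though the third bucket has almost as many outlier calls.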
Heatmaps enable you to easily identify the normal bands of performance and any outliers for all nodes in a tier. In this example, all nodes fall within two bands:
This heatmap highlights two time windows where the performance metric is noticeably higher for 50% of the nodes:
The Correlated Metrics heatmaps enable you to correlate transaction outliers with specific resource metrics. The Correlate Metrics panel enables you to easily identify performance metrics that correlate (and do not correlate) with transaction outliers of interest. This example shows three heatmaps:
When you select a set of transaction outliers, the Selected Nodes and Impacted Servers charts show the distribution of these outliers across all nodes in the tier. This makes it easy to determine if these outliers are associated with specific node clusters. Double-click a pie chart to troubleshoot the node or server.
In this example, a DevOps engineer wants to ensure that transaction calls are distributed evenly to all the nodes in a tier. The engineer accesses the Tiers & Nodes view, right-clicks a tier, and selects Correlate Metrics. Looking at the Calls Per Minute heatmap, the engineer can immediately see that the band of performance is 20-22 calls per minute, but for some nodes, the rate is higher during certain intervals. The engineer investigates the relevant load balancer and finds that a simple misconfiguration is causing the device to distribute calls unevenly at certain times. Using heatmaps, the engineer identifies and fixes a minor issue before it has a significant impact on the team's mission-critical applications.
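The check the engineer performs visually can be sketched in a few lines of Python. This is a hedged illustration, not a product API: the `nodes_outside_band` function, the node names, and the sample values are hypothetical, and only the 20-22 calls-per-minute band comes from the example above.

```python
def nodes_outside_band(series_by_node, low, high):
    """Return {node: [interval indexes whose value falls outside [low, high]]}."""
    flagged = {}
    for node, values in series_by_node.items():
        outside = [i for i, value in enumerate(values) if not low <= value <= high]
        if outside:
            flagged[node] = outside
    return flagged

# Hypothetical calls-per-minute samples, one value per interval.
calls_per_minute = {
    "node-a": [21, 20, 22, 21],
    "node-b": [20, 27, 21, 29],
    "node-c": [22, 21, 20, 21],
}
flagged = nodes_outside_band(calls_per_minute, low=20, high=22)
```

Here only `node-b` is flagged, in the second and fourth intervals, which corresponds to the pattern of a node receiving extra load at certain times.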
A DevOps engineer is responsible for a four-tier e-commerce application. The Order-Tier has five nodes running version 1.0 of the service. The engineer deploys a "canary" (version 1.1 of the service) on one node. Before the engineer deploys 1.1 on all nodes, they want to determine if there is any performance degradation on this node.
The engineer opens the Controller, accesses the Tiers & Nodes view for the application of interest, right-clicks the Order-Tier, and selects Correlate Metrics. The Tier Metric Correlator opens.
The Transaction Outliers heatmap shows that some calls are outliers: Errors, Slow Calls, Very Slow Calls, or Stalled Calls whose response times are significantly higher than the band of performance for that tier.
The first question is: Are these outliers associated with our "canary node" (ORD-N1)? The engineer drag-selects a set of these outliers. The Node Distribution in Selection pie chart (right) shows that all outliers are associated with the canary node. Clearly, the new code is not performing as well as the old code.
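The pie chart answers a simple counting question: of the outliers in the selected time window, what share belongs to each node? The following Python sketch shows that calculation under stated assumptions. The `node_distribution` function, the `(timestamp, node)` input format, and the sample data (including the node name ORD-N3) are hypothetical and do not reflect how the Controller stores or computes this data.

```python
from collections import Counter

def node_distribution(outliers, start, end):
    """Return each node's share of the outliers with start <= timestamp < end."""
    counts = Counter(node for timestamp, node in outliers if start <= timestamp < end)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {node: count / total for node, count in counts.items()}

# Hypothetical outlier calls as (timestamp_seconds, node_name) pairs.
outliers = [(100, "ORD-N1"), (105, "ORD-N1"), (130, "ORD-N1"), (400, "ORD-N3")]
selection = node_distribution(outliers, start=90, end=200)
```

With this sample, every outlier in the selected window belongs to ORD-N1, which matches the situation described above where the whole pie chart is attributed to the canary node.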
The next question is: Are any resource issues causing these outliers? The engineer examines the Correlated Metrics heatmaps to look for metrics that correlate with the outliers on the canary node. Most heatmaps show no correlation. For example, CPU Busy% shows that all nodes stay within the band of performance of 0-20%.
However, the CPU I/O Wait 95th Percentile(%) heatmap shows a strong correlation: All the metric outliers occur on the canary node, while all other nodes remain within the band of performance.
The Pages paged out 95th Percentile (pages) heatmap also shows a strong correlation with the transaction outliers on the canary node. With just a few clicks, the engineer can see that the canary node is performing worse, that the node has a CPU I/O problem, and that the CPU I/O problem is related to paging, which indicates a disk problem.
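One simple way to think about "correlates with the transaction outliers" is to compare two sets of nodes: the nodes that produced transaction outliers, and the nodes whose resource metric left its band of performance. The sketch below uses Jaccard similarity for that comparison. This is a swapped-in technique chosen for illustration, not the method the Tier Metric Correlator uses, and the function names, sample values, and the 0-10% band for I/O wait are assumptions. Only the 0-20% band for CPU Busy% comes from the example above.

```python
def metric_outlier_nodes(metric_by_node, low, high):
    """Return the set of nodes with at least one value outside [low, high]."""
    return {
        node
        for node, values in metric_by_node.items()
        if any(not low <= value <= high for value in values)
    }

def overlap_score(transaction_nodes, metric_nodes):
    """Jaccard similarity of two node sets: 0.0 is disjoint, 1.0 is identical."""
    union = transaction_nodes | metric_nodes
    if not union:
        return 0.0
    return len(transaction_nodes & metric_nodes) / len(union)

# Nodes that produced transaction outliers in the selection (hypothetical).
transaction_nodes = {"ORD-N1"}

# Hypothetical metric samples in percent, one value per interval.
cpu_busy = {"ORD-N1": [12, 15, 18], "ORD-N2": [10, 14, 11]}
io_wait = {"ORD-N1": [4, 35, 41], "ORD-N2": [3, 5, 4]}

busy_score = overlap_score(transaction_nodes, metric_outlier_nodes(cpu_busy, 0, 20))
wait_score = overlap_score(transaction_nodes, metric_outlier_nodes(io_wait, 0, 10))
```

In this sample, CPU busy stays inside its band on every node and scores 0.0, while I/O wait leaves its band only on the canary node and scores 1.0, mirroring the uncorrelated and correlated heatmaps in the walkthrough.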
To reduce the visual noise and highlight the correlations, the engineer unchecks all the uncorrelated metrics. The next step is to investigate and troubleshoot the underlying server. To drill down into the canary node, the engineer double-clicks the Server Distribution in Selection pie chart (bottom right).
The Server Dashboard for the canary node opens. The engineer selects the Volumes tab and sees many spikes in I/O operations and queue wait times. The engineer decides that the canary code is not ready to deploy to the entire tier and that the team needs to re-examine the code, fix the regression, and re-test.
You enable Tier Metric Correlation per account on the Controller.
If you are using a SaaS Controller, contact customer support.
If you are using an on-premises Controller:
Log in to the Controller administration console with the root user password: http://<controller host>:<port>/controller/admin.jsp
Set ENABLE_SIM_HEATMAPS = true.
To correlate percentile metrics, you must enable percentile metric reporting on both the Controller and the Machine Agent. By default, reporting is disabled on the Controller and enabled on the Agent.
On the Controller, set the sim.machines.percentile.percentileMonitoringAllowed property. See Controller Settings for Machine Agents.
On the Machine Agent, open the <machine_agent_home>/extensions/ServerMonitoring/conf/ServerMonitoring.yml file and set the percentileEnabled property. See Machine Agent Settings for Server Visibility.
These steps outline the workflow:
Access the Tiers & Nodes view for the application of interest.