To effectively determine the root cause of an anomaly, consider an example of a performance problem in a service (OrderServiceVodka). This example follows an anomaly from the moment it surfaces until the root cause is confirmed, and assumes that you start from the Health Violation section of the page.

This page describes several options for analyzing available data to determine the root cause of an anomaly.

View the Anomaly Details

To analyze an anomaly and determine why the service response is slow, start by looking at the anomaly details.

Select the appropriate time range to ensure that you view the latest data.

  1. Expand the Health Violation timeline to display the Anomaly Detected timeline. 
  2. Select the anomaly event on the timeline. The anomaly details display in the right panel.
  3. Examine the initial event. This simple anomaly starts in the Critical state and remains there for most of its lifecycle, so the initial event is sufficient. In some cases there might be several state changes; for such cases, it can be useful to examine each alert event (see the sketch after this list).
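
If it helps to reason about the timeline programmatically, here is a minimal Python sketch that models the anomaly events as a sequence of state changes. The AnomalyEvent class, state names, and timestamps are illustrative assumptions, not the product's actual event schema.

    from dataclasses import dataclass

    # Hypothetical representation of the state-change events on the timeline.
    @dataclass
    class AnomalyEvent:
        timestamp: str  # ISO 8601
        state: str      # e.g. "Critical", "Warning", "Normal"

    events = [
        AnomalyEvent("2023-05-01T10:00:00Z", "Critical"),
        AnomalyEvent("2023-05-01T10:45:00Z", "Warning"),
        AnomalyEvent("2023-05-01T11:00:00Z", "Critical"),
    ]

    # For a simple anomaly that starts Critical and stays there, the initial
    # event is enough; any later state changes suggest reviewing each alert.
    initial = events[0]
    later_changes = [e for e in events[1:] if e.state != initial.state]
    print(f"Start with: {initial}")
    if later_changes:
        print(f"Also review {len(later_changes)} later state change(s).")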

Examine the Dependency Flow Map

A dependency flow map shows the impacted service (OrderServiceVodka) and the dependencies that may be affected by the anomaly. In this example, the dependent remote service (backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569) is unhealthy, as indicated by its red color on the flow map.
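
Conceptually, the flow map is a directed dependency graph. The following Python sketch is a hypothetical model of that graph; the adjacency list and the health states of the intermediate services are assumed for illustration, with only OrderServiceVodka and the Redis backend known to be unhealthy from this example.

    # Hypothetical flow map as an adjacency list: entity -> dependencies.
    flow_map = {
        "OrderServiceVodka": ["AccountServiceVodka", "PaymentServiceVodka"],
        "AccountServiceVodka": ["RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569"],
        "PaymentServiceVodka": ["FulfillmentServiceVodka"],
    }
    health = {
        "OrderServiceVodka": "unhealthy",
        "AccountServiceVodka": "unhealthy",    # per the call path below
        "PaymentServiceVodka": "healthy",      # assumed
        "FulfillmentServiceVodka": "healthy",  # assumed
        "RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569": "unhealthy",  # red node
    }

    def unhealthy_dependencies(service: str) -> list[str]:
        """Depth-first walk collecting unhealthy entities reachable from service."""
        seen, stack, found = set(), [service], []
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            if node != service and health.get(node) == "unhealthy":
                found.append(node)
            stack.extend(flow_map.get(node, []))
        return found

    print(unhealthy_dependencies("OrderServiceVodka"))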

Examine the Suspected Causes

The Suspected Causes (top 3) list shows the likely root causes of the performance problem for the OrderServiceVodka service. The causes are ranked by probability: the first suspected cause is the most probable root cause, the second is the next most probable, and so on. In this example, the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 is the first suspected cause, and hence the most probable root cause. Start by examining the first suspected cause.
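
The product surfaces only the ranking, not numeric probabilities; the scores in this small Python sketch are invented solely to illustrate how a scored candidate list could be reduced to a ranked top 3.

    # Hypothetical (cause, score) pairs; scores are made up for illustration.
    suspected = [
        ("backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 (via AccountServiceVodka)", 0.81),
        ("service FulfillmentServiceVodka", 0.12),
        ("backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 (direct)", 0.07),
    ]

    # Rank by descending score and keep the top 3.
    top3 = sorted(suspected, key=lambda c: c[1], reverse=True)[:3]
    for rank, (cause, score) in enumerate(top3, start=1):
        print(f"Suspected Cause {rank}: {cause} (score {score:.2f})")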

Examine Suspected Cause 1

This suspected cause indicates that the problem could be with the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569. Notice the deviating metric, Average Response Time (ART).  

The Call Path helps you trace the propagation of the anomaly. Notice that OrderServiceVodka is unhealthy because AccountServiceVodka is unhealthy, and AccountServiceVodka is unhealthy because RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 is unhealthy. A short sketch after the following steps walks this chain.

  1. Click the OrderServiceVodka link to view the call path between the service OrderServiceVodka and AccountServiceVodka. Notice that OrderServiceVodka is unhealthy because AccountServiceVodka is unhealthy. 
  2. Click the AccountServiceVodka link to view the call path between the service AccountServiceVodka and RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569. Notice that AccountServiceVodka is unhealthy because RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 is unhealthy.
  3. Click the RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 link to view the call path between the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 and other services.
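
As noted above, the call path is a chain in which each entity is unhealthy because the next one is. A minimal Python sketch of that walk, assuming the three-hop path from this example:

    # Call path for Suspected Cause 1, taken from this example.
    call_path = [
        "OrderServiceVodka",
        "AccountServiceVodka",
        "RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569",
    ]

    # Each link on the path explains the previous entity's poor health.
    for caller, callee in zip(call_path, call_path[1:]):
        print(f"{caller} is unhealthy because {callee} is unhealthy")

    # The entity at the end of the path is the candidate root cause.
    print(f"Candidate root cause: {call_path[-1]}")
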
You can also view the endpoint details for each of the services: a tabular comparison of the performance values of the various endpoints, which helps you quickly narrow down the deviating metric and the corresponding endpoint.

Examine the Metric Performance

Notice the Suspected Cause 1 Metric graph: the metric value shoots up as the anomaly starts. Hover over a time point to view the metric value in numerical form. You can view the deviating metric for RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 and correlate it with the violating metric. Notice that the ART for the violating metric increased at the same time and in a similar pattern to the ART for RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569. The graphs help correlate the deviation and the pattern.
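
The visual correlation can also be checked numerically. The sketch below uses Python's statistics.correlation (available in Python 3.10 and later) on two invented ART series that spike together, standing in for the violating and deviating metrics:

    from statistics import correlation  # Python 3.10+

    # Illustrative ART samples (ms) around the anomaly window; real values
    # come from the metric graphs. Both series spike at the same points.
    art_order_service = [20, 22, 21, 95, 110, 105, 98]  # violating metric
    art_redis_backend = [5, 6, 5, 60, 72, 68, 61]       # deviating metric

    # A Pearson coefficient near 1.0 supports "same time, similar pattern".
    r = correlation(art_order_service, art_redis_backend)
    print(f"Pearson correlation: {r:.3f}")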

Examine the Entity Page

Click the RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 link on the call path of Suspected Cause 1 to view the entity page of the backend. Notice the entity-specific details, such as the health violations and the metric performance graphs.

After examining all the details, you can confirm the hypothesis:

The ART performance of the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 degraded, which in turn impacted the performance of the other services. Hence, the root cause of this anomaly is the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569.

Examine Suspected Cause 2

This suspected cause indicates that the problem could be with the service FulfillmentServiceVodka. The call path indicates that FulfillmentServiceVodka affects PaymentServiceVodka, which in turn affects OrderServiceVodka.

Examine Suspected Cause 3

This suspected cause indicates that the problem could be with the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569. Note that the entity listed as the root cause here is the same as in Suspected Cause 1; however, the call path is different. The call path indicates that OrderServiceVodka uses the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569 directly.

After analyzing all the suspected causes, you may deduce that all the dependent services were affected by the backend RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569.
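
One way to see this deduction: collect the entity at the end of each suspected cause's call path and check whether the causes converge. The mapping below restates the three causes from this walkthrough; the tallying itself is illustrative.

    from collections import Counter

    # Entity at the end of each suspected cause's call path in this example.
    root_entities = {
        "Suspected Cause 1": "RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569",
        "Suspected Cause 2": "FulfillmentServiceVodka",
        "Suspected Cause 3": "RedisDBVodka-e6032994-df1d-46d0-97e0-a860f70e8569",
    }

    entity, hits = Counter(root_entities.values()).most_common(1)[0]
    print(f"{entity} appears in {hits} of {len(root_entities)} suspected causes")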

Conclude

The OrderServiceVodka service has multiple dependencies, including other services and backends. With the anomaly details and suspected cause information, you can quickly eliminate:

  • All but a few possible origins of the performance problem associated with the service, and
  • All but the most relevant (deviating) metrics associated with the service.

This approach spares you the tedious process of investigating multiple metrics on each dependency. Instead, you confirm or rule out Suspected Causes with a quick glance at timelines, flow maps, and metric performance graphs. Anomaly Detection and suspected cause information help determine the root cause, presenting you with the information you need to quickly form and verify a hypothesis.