Log Collector Observability
The Log Collector's telemetry, exported as Filebeat metrics (and health check metrics if Agent Management is enabled) in OTLP format, provides observability into log collection activity on your cluster. You can send these metrics to Cisco Cloud Observability or to a metrics backend on your cluster for which an OTLP exporter is available, such as Prometheus. This page explains how to implement a metrics pipeline and how to use these metrics to troubleshoot log collection problems on your cluster.
Default Log Collector Telemetry
The following table describes the Filebeat metrics the Log Collector exports by default. You can configure it to export other metrics. For a full list of available metrics, see Elastic's documentation on "beat stats" fields.
Metric Name | Cisco AppDynamics Metric Content Type | Description
---|---|---
agent:health | Gauge | A value of 1 is reported every 5 minutes if the Log Collector instance is healthy. Reported only when Agent Management is enabled.
beat.memstats.memory_alloc | Gauge | Number of bytes allocated to heap objects
filebeat.events.active | Sum | Number of active events. See https://www.elastic.co/guide/en/beats/metricbeat/8.4/exported-fields-beat.html
filebeat.input.filestream.events.create | Sum | Number of file system "create" events
filebeat.input.filestream.events.delete | Sum | Number of file system "delete" events
filebeat.input.filestream.events.eof | Sum | Number of EOF errors while reading files
filebeat.input.filestream.events.rename | Sum | Number of file system "rename" events
filebeat.input.filestream.events.truncate | Sum | Number of file system "truncate" events
filebeat.input.filestream.events.write | Sum | Number of file system "write" events
filebeat.input.filestream.files.open | Sum | Number of open files according to the file system
filebeat.input.filestream.harvester.running | Sum | Number of running harvesters
filebeat.input.filestream.harvester.stopped | Sum | Number of stopped harvesters
libbeat.output.read.errors | Sum | Number of read errors
libbeat.output.write.bytes | Sum | Number of bytes written by the exporter
libbeat.output.write.errors | Sum | Number of write errors
system.load.norm.5 | Gauge | System load 5 minute normalized average
system.load.norm.15 | Gauge | System load 15 minute normalized average
Configure the Log Collector to Export Metrics to Cisco AppDynamics Distribution of OpenTelemetry Collector
Choose only one procedure in this section:
If your collectors-values.yaml is in the simplified layout (August 2022 or newer), modify or add these parameters in appdynamics-cloud-k8s-monitoring.logCollectorConfig. For details on each parameter, see Log Collector Settings. A minimal example of these settings appears after this procedure.
- Create a backup of collectors-values.yaml.
- Set monitoring.otlpmetric.enabled to true.
- (Optional) Modify the list of metrics to export in monitoring.otlpmetric.metrics.
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- To apply these changes to your cluster, run the helm upgrade command with the override settings you specified in collectors-values.yaml. See Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Check whether your Log Collector pods restarted. If they did not, find the name of their daemonset and restart it:
kubectl get ds
kubectl rollout restart ds <daemonset-name> -n appdynamics
BASH
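For reference, the following is a minimal sketch of these settings in the simplified layout. It assumes the monitoring block sits directly under appdynamics-cloud-k8s-monitoring.logCollectorConfig, as the dotted parameter paths above suggest, and the metric names shown are example values taken from the table of default metrics; verify the exact structure against Log Collector Settings.
appdynamics-cloud-k8s-monitoring:
  logCollectorConfig:
    monitoring:
      otlpmetric:
        # Enable export of Filebeat metrics in OTLP format
        enabled: true
        # Optional: restrict the exported set; omit to keep the default list
        metrics:
          - beat.memstats.memory_alloc
          - filebeat.events.active
          - libbeat.output.write.bytes
YML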
If your collectors-values.yaml is in a legacy layout, modify or add these parameters in appdynamics-cloud-k8s-monitoring.logCollectorConfig.filebeatYaml. For details on each parameter, see AppDynamics Log Collector Settings - Legacy YAML Layout. A minimal example of these settings appears after this procedure.
- Create a backup of collectors-values.yaml.
- Set monitoring.enabled to true.
- Set monitoring.otlpmetric.resource_attrs.k8.cluster.name to the name of your cluster.
- (Optional) Modify the list of metrics to export in monitoring.otlpmetric.metrics.
- If monitoring.otlpmetric.endpoint is missing, set it to the endpoint of the Cisco AppDynamics Distribution of OpenTelemetry Collector.
- If any parameters marked "Do not modify" in Log Collector Settings - Advanced YAML Layout are missing, add them.
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- To apply these changes to your cluster, see Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Check whether your Log Collector pods restarted. If they did not, find the name of their daemonset and restart it:
kubectl get ds
kubectl rollout restart ds <daemonset-name> -n appdynamics
BASH
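For reference, the following is a minimal sketch of these settings in the legacy layout. It assumes the dotted parameter paths above map onto nested keys inside filebeatYaml; the endpoint and cluster name are placeholders, and the metric names are example values. Verify the exact structure against AppDynamics Log Collector Settings - Legacy YAML Layout.
appdynamics-cloud-k8s-monitoring:
  logCollectorConfig:
    filebeatYaml: |-
      # ... existing Filebeat configuration, including parameters marked "Do not modify" ...
      monitoring:
        enabled: true
        otlpmetric:
          endpoint: <cisco-appdynamics-otel-collector-endpoint>   # placeholder
          resource_attrs:
            k8.cluster.name: <your-cluster-name>                  # placeholder
          # Optional: restrict the exported set; omit to keep the default list
          metrics:
            - beat.memstats.memory_alloc
            - libbeat.output.write.bytes
YML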
Send Metrics from Cisco AppDynamics Distribution of OpenTelemetry Collector to Cisco Cloud Observability
The Cisco AppDynamics Distribution of OpenTelemetry Collector has an otlphttp exporter with OAuth that is preconfigured to send metrics to Cisco Cloud Observability, so you don't need to configure anything.
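For orientation only, the sketch below shows roughly what an OTLP-over-HTTP exporter with OAuth looks like in a generic OpenTelemetry Collector configuration, using the upstream otlphttp exporter and oauth2client extension. The Cisco AppDynamics Distribution of OpenTelemetry Collector ships its own preconfigured equivalent (driven by the clientId, clientSecret, tokenUrl, and endpoint Helm values shown later on this page), so you do not add this yourself; all values shown are placeholders.
extensions:
  oauth2client:
    client_id: <client-id>            # placeholder
    client_secret: <client-secret>    # placeholder
    token_url: <token-url>            # placeholder

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    endpoint: <tenant-otlp-endpoint>  # placeholder
    auth:
      authenticator: oauth2client

service:
  extensions: [oauth2client]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
YML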
Troubleshooting Using Filebeat Metrics
Prerequisites
- Monitoring needs to be enabled in your Log Collector configuration for the metrics to be reported.
Find the Right Log Collector Entity
Follow these steps to find the ID of the logcollector:log_collector_instance entity you want to troubleshoot.
To list all the logcollector:log_collector_instance entities which have sent Filebeat metrics to your Tenant, run this query in Query Builder:
fetch id, attributes from entities(logcollector:log_collector_instance)
CODE
Add filters to your query to narrow down the results. You can filter by any of these attributes in a metric packet to get the correct logcollector:log_collector_instance:
Attribute | Description for Kubernetes-based Log Collector | Description for Non-Kubernetes-based Log Collector
---|---|---
agent.display.name | "Log Collector" | "Log Collector"
agent.name | Helm release name | Linux service name
agent.deployment.name | Kubernetes Deployment name. Identical to the output of the kubectl get deployment command. | Your own deployment name for this instance of the Log Collector
agent.platform.id | Kubernetes cluster ID | AWS EC2 instance ARN
agent.platform.type | "k8s" | "aws_ec2"
agent.version | Log Collector version | Log Collector version
agent.platform.name | Kubernetes cluster name | AWS EC2 instance name
agent.deployment.type | The Kubernetes workload type: "deployment", "statefulset", or "daemonset" | " " (an empty string of length 1)
agent.deployment.unit | Kubernetes pod name | " " (an empty string of length 1)
agent.deployment.scope | Kubernetes namespace name | " " (an empty string of length 1)
Sample query for Kubernetes-based Log Collector:
since now-2h fetch id, attributes from entities(logcollector:log_collector_instance)[attributes("agent.platform.name")="k8s-cluster-name" && attributes("agent.deployment.unit")="lca-pod-name"]
SQL
Sample query for EC2-based Log Collector:
since now-2h fetch id, attributes from entities(logcollector:log_collector_instance)[attributes("agent.deployment.name")="custom-deployment-name" && attributes("agent.platform.name")="EC2-instance-name"]
SQL
Create UQL Queries for Metrics
Follow these steps to query the system for Filebeat metrics using Explore > Query Builder.
Once you have the correct logcollector:log_collector_instance entity, you can create queries like the following to fetch metrics.
Query to fetch filebeat.input.filestream.events.write values for the past 2 hours with a granularity of 5 minutes:
since now - 2h fetch metrics(lca:filebeat.input.filestream.events.write) from entities(logcollector:log_collector_instance:5EZqxMcvPRKOkvpya3aJXA) limits metrics.granularityDuration(PT5M)
SQL
For metrics of type Sum, you might be interested in just the current value of the metric, not the deltas. For these, you can fetch the cumulative sum of the metric up to the present time like this:
since now - 2h until now fetch metrics(lca:filebeat.input.filestream.files.open).sumCumulative from entities(logcollector:log_collector_instance:5EZqxMcvPRKOkvpya3aJXA)
SQL
Query to fetch the agent:health metric for a health check:
since now-1h fetch metrics(agent:health) from entities(logcollector:log_collector_instance)[attributes("agent.deployment.name")="custom-deployment-name" && attributes("agent.platform.name")="EC2-instance-name"]
SQL
If the metric is of the type Sum, the values returned are delta values, so if you want to see the absolute value, you might need to fetch the cumulative sum. If the metric is of the type Gauge, the values returned are the absolute values. For more about metric content types, see Cisco AppDynamics Metrics Model.
Some Troubleshooting Scenarios Using Log Collector Metrics
agent:health
This metric is reported when Agent Management is enabled. It serves as a liveness check for the Log Collector instance. A value of "1" is reported every 5 minutes if the collector is healthy. You can configure health rules on this metric, with alerting enabled. This can help to pinpoint any Log Collector instance that is down in a large cluster.
filebeat.events.active
This metric represents the number of events (logs) that are in flight: logs that the Log Collector has ingested but not yet exported, and that are therefore still held in memory.
- Generally, if log load is consistent, this metric is also nearly constant.
- It is correlated to the number of logs coming in.
- If you see an abrupt spike in the metric, logs are piling up in memory in the Log Collector, usually because it cannot export them to Cisco AppDynamics Distribution of OpenTelemetry Collector. After a while the metric usually plateaus, because some logs are dropped once the maximum number of retries is reached.
- The metric then decreases to normal levels again once the Log Collector is able to export the logs to Cisco AppDynamics Distribution of OpenTelemetry Collector.
filebeat.input.filestream.files.open
This metric tracks the number of files the Log Collector is collecting logs from, so it should correspond to the number of containers that match the Log Collector configuration and are actively producing logs.
- An increase in the metric means the Log Collector is collecting logs from more containers or pods. If this is unintentional, investigate why new pods have come up; you may need to update the Log Collector configuration so that the Log Collector does not collect logs from these containers.
- A decrease in the metric means the Log Collector is collecting logs from fewer containers or pods. This is either because pods have gone down, or because the Log Collector is not collecting logs from pods it is supposed to collect from, possibly due to high load or incorrect configuration.
filebeat.input.filestream.events.create
This metric tracks the creation of the log files from which the Log Collector is collecting logs.
- This metric usually increases at a constant rate, because Kubernetes constantly creates and deletes files as part of log file rotation.
- Sudden increases can occur when new pods spin up and the Log Collector starts collecting logs from them. If that is not the case, it might be worth investigating what caused the spike; files might be getting rotated more frequently than normal.
filebeat.input.filestream.events.delete
This metric tracks the deletion of log files.
- Typically it is either 0 (and therefore not reported) or increases at a nearly constant rate (as log files are deleted at a constant rate).
- A steep increase in this metric might mean logs are being deleted more frequently, which could be because the retention period is shorter than required, or because of heavy load combined with space limitations.
filebeat.input.filestream.events.write
This metric tracks the number of write events taking place in the files from which the Log Collector is collecting logs.
- It is directly correlated to the log volume being ingested by the Log Collector.
- Any increase or decrease in this metric reflects a corresponding change in the volume of logs coming in.
libbeat.output.write.bytes
This metric tracks the total bytes exported by the Log Collector.
- It is directly correlated to the log volume being exported by the Log Collector.
- Any increase or decrease in this metric reflects a corresponding change in the volume of logs being exported by the Log Collector.
- In a stable environment with a healthy Log Collector pod running, the log volume coming into and going out of the Log Collector should be the same, so the metrics filebeat.input.filestream.events.write and libbeat.output.write.bytes are usually correlated: if you take their ratio, it should be nearly constant.
- However, if you observe that the ratio is changing, you can draw inferences from that. For example, in an environment with a high load, if the filebeat.input.filestream.events.write metric remains consistent but the libbeat.output.write.bytes metric is decreasing, the Log Collector might not be keeping up with the log volume, or there might be an issue in the processing pipeline, and a lag might be building up.
libbeat.output.write.errors
This metric keeps track of the number of errors occurring in the Log Collector exporter.
- Typically it is 0 (and therefore not reported) if there are no errors while exporting.
- Possible errors might be internal to the Log Collector, such as a failure to convert logs to OTLP format.
- Errors might also occur while exporting logs to Cisco AppDynamics Distribution of OpenTelemetry Collector, for example a network issue that prevents the Log Collector from reaching it, or a problem with the Cisco AppDynamics Distribution of OpenTelemetry Collector configuration, such as a misconfigured logs pipeline.
filebeat.input.filestream.events.truncate
This metric tracks the number of truncate events on the files from which the Log Collector is collecting logs. It is useful for tracking log file rotation in environments where rotation is done by truncating the log files, for example in a Java application where logging is handled by Log4j.
beat.memstats.memory_alloc
This metric tracks the total heap memory allocated to the Log Collector's Filebeat process. You can monitor Log Collector health by watching for spikes or dips in this metric compared to the established baseline.
system.load.norm.5 and system.load.norm.15
These metrics represent the normalized system CPU load over a given time range, 5 or 15 minutes. The value is always between 0 and 1. These metrics help you track the health of the node the Log Collector is running on: compared to the established baseline, abrupt spikes or dips might indicate an issue with the health of the node the Log Collector pod is running on.
Send Metrics from Cisco AppDynamics Distribution of OpenTelemetry Collector to Prometheus
- Deploy Prometheus on your cluster and set its scrape target to 'appdynamics-otel-collector-service:8889'. (A minimal scrape configuration sketch appears after this procedure.)
- Verify that you can access the Prometheus dashboard:
  - Port-forward to your prometheus service or pod:
kubectl port-forward <prometheus-service-or-pod-endpoint> 9090:9090 -n appdynamics
BASH
  - Verify that you can see the Prometheus dashboard in your web browser at localhost:9090.
- In collectors-values.yaml, add these sections to appdynamics-otel-collector. For details on each parameter, see Advanced Settings for the Cisco AppDynamics Distribution of OpenTelemetry Collector.
appdynamics-otel-collector:
  clientId: ...
  clientSecret: ...
  endpoint: ...
  tokenUrl: ...
  # add the part below to the config
  configOverride:
    exporters:
      prometheus/local:
        endpoint: "0.0.0.0:8889"
        resource_to_telemetry_conversion:
          enabled: true
    service:
      pipelines:
        metrics:
          exporters: [otlphttp, prometheus/local]
  # the service exposes the collector to external traffic
  service:
    name: "appdynamics-otel-collector-service"
    ports:
      - name: http
        port: 4318
        protocol: TCP
        targetPort: 4318
      - name: grpc
        port: 4317
        protocol: TCP
        targetPort: 4317
      - name: zpage
        port: 55679
        protocol: TCP
        targetPort: 55679
      - name: prometheus
        port: 8889
        protocol: TCP
        targetPort: 8889
YML
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- Apply the override YAML's changes to appdynamics-collectors: see "Upgrade Collectors" in Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Confirm that you see the additional port 8889 exposed in service/appdynamics-otel-collector-service:
NAME                                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
...
service/appdynamics-otel-collector-service   ClusterIP   10.111.179.127   <none>        4318/TCP,4317/TCP,55679/TCP,8889/TCP   9m57s
...
YML
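As referenced in step 1, here is a minimal sketch of a Prometheus scrape configuration targeting the collector's Prometheus port. The job name is arbitrary and the scrape interval is an example; adjust these to match how you deploy Prometheus (for example, via its Helm chart or Operator).
# prometheus.yml (sketch)
global:
  scrape_interval: 30s                        # example interval
scrape_configs:
  - job_name: appdynamics-otel-collector      # arbitrary job name
    static_configs:
      - targets: ['appdynamics-otel-collector-service:8889']
YML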
Optional: Write Metrics to the Log File
If you want to write metrics to the Log Collector's log files instead of, or in addition to, exporting them via the OTLP exporter, follow these steps. A minimal example of these settings appears after this procedure.
- Set monitoring.enabled to true.
- If you're using a legacy collectors-values.yaml, set monitoring.otlp.resource_attrs.k8.cluster.name to the name of your cluster.
- Do not modify the list of metrics to export in monitoring.otlp.metrics. This setting isn't supported when logging metrics.
- (Optional) To modify the monitoring period for metrics, set logging.metrics.period. Its default is 30s (30 seconds).
- (Optional) To enable the writing of metrics to the log files, set logging.files.enabled to true. If logging.metrics.enabled is true but logging.files.enabled is false, the Log Collector writes metrics logs (and other logs) to console/pod-logs.
- If you're using a legacy collectors-values.yaml and any parameters marked "Do not modify" in Log Collector Settings - Advanced YAML Layout are missing from your collectors-values.yaml, add them.
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- To apply these changes to your cluster, run the helm upgrade command with the override settings you specified in collectors-values.yaml. See Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Check whether your Log Collector pods restarted. If they did not, find the name of their daemonset and restart it:
kubectl get ds
kubectl rollout restart ds <daemonset-name> -n appdynamics
BASH
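For reference, here is a minimal sketch of these settings in the simplified layout. It assumes the monitoring and logging blocks sit under appdynamics-cloud-k8s-monitoring.logCollectorConfig, as the dotted parameter paths above suggest; verify the exact structure against Log Collector Settings, and use the legacy parameter paths if your collectors-values.yaml is in a legacy layout.
appdynamics-cloud-k8s-monitoring:
  logCollectorConfig:
    monitoring:
      enabled: true
    logging:
      metrics:
        enabled: true      # write metrics to logs
        period: 30s        # default monitoring period
      files:
        enabled: true      # write to log files instead of console/pod logs
YML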