Log Collector Observability
The Log Collector's telemetry, exported as Filebeat metrics (and health check metrics if Agent Management is enabled) in OTLP format, provides observability into log collection activity on your cluster. You can send these metrics to Cisco Cloud Observability or to a metrics backend on your cluster for which an OTLP exporter is available, such as Prometheus. This page explains how to implement a metrics pipeline and how to use these metrics to troubleshoot log collection problems on your cluster.
Default Log Collector Telemetry
The following table describes the Filebeat metrics the Log Collector exports by default. You can configure it to export other metrics. For a full list of available metrics, see Elastic's documentation on "beat stats" fields.
Metric Name | Cisco AppDynamics Metric Content Type | Description
---|---|---
agent:health | Gauge | A value of 1 is reported every 5 minutes if the Log Collector instance is healthy. Reported only when Agent Management is enabled.
beat.memstats.memory_alloc | Gauge | Number of bytes allocated to heap objects
filebeat.events.active | Sum | Number of active events. See https://www.elastic.co/guide/en/beats/metricbeat/8.4/exported-fields-beat.html
filebeat.input.filestream.events.create | Sum | Number of file system "create" events
filebeat.input.filestream.events.delete | Sum | Number of file system "delete" events
filebeat.input.filestream.events.eof | Sum | Number of EOF errors while reading files
filebeat.input.filestream.events.rename | Sum | Number of file system "rename" events
filebeat.input.filestream.events.truncate | Sum | Number of file system "truncate" events
filebeat.input.filestream.events.write | Sum | Number of file system "write" events
filebeat.input.filestream.files.open | Sum | Number of open files according to the file system
filebeat.input.filestream.harvester.running | Sum | Number of running harvesters
filebeat.input.filestream.harvester.stopped | Sum | Number of stopped harvesters
libbeat.output.read.errors | Sum | Number of read errors
libbeat.output.write.bytes | Sum | Number of bytes written by the exporter
libbeat.output.write.errors | Sum | Number of write errors
system.load.norm.5 | Gauge | System load 5 minute normalized average
system.load.norm.15 | Gauge | System load 15 minute normalized average
Configure the Log Collector to Export Metrics to Cisco AppDynamics Distribution of OpenTelemetry Collector
Choose only one procedure in this section:
If your collectors-values.yaml is in the simplified layout (August 2022 or newer), modify or add these parameters in appdynamics-cloud-k8s-monitoring.logCollectorConfig. For details on each parameter, see Log Collector Settings. A minimal example of these settings appears after this procedure.
- Create a backup of collectors-values.yaml.
- Set monitoring.otlpmetric.enabled to true.
- (Optional) Modify the list of metrics to export in monitoring.otlpmetric.metrics.
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- To apply these changes to your cluster, run the helm upgrade command with the override settings you specified in collectors-values.yaml. See Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Check whether your Log Collector pods restarted. If they did not, find the name of their daemonset and restart it:
kubectl get ds
kubectl rollout restart ds <daemonset-name> -n appdynamics
BASH
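For reference, the following is a minimal sketch of these settings in the simplified layout. It assumes the monitoring block sits directly under appdynamics-cloud-k8s-monitoring.logCollectorConfig, as the dotted parameter paths above suggest, and the metric names shown are example values taken from the table of default metrics; verify the exact structure against Log Collector Settings.
appdynamics-cloud-k8s-monitoring:
  logCollectorConfig:
    monitoring:
      otlpmetric:
        # Enable export of Filebeat metrics in OTLP format
        enabled: true
        # Optional: restrict the exported set; omit to keep the default list
        metrics:
          - beat.memstats.memory_alloc
          - filebeat.events.active
          - libbeat.output.write.bytes
YML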
If your collectors-values.yaml is in a legacy layout, modify or add these parameters in appdynamics-cloud-k8s-monitoring.logCollectorConfig.filebeatYaml. For details on each parameter, see AppDynamics Log Collector Settings - Legacy YAML Layout. A minimal example of these settings appears after this procedure.
- Create a backup of collectors-values.yaml.
- Set monitoring.enabled to true.
- Set monitoring.otlpmetric.resource_attrs.k8.cluster.name to the name of your cluster.
- (Optional) Modify the list of metrics to export in monitoring.otlpmetric.metrics.
- If monitoring.otlpmetric.endpoint is missing, set it to the endpoint of the Cisco AppDynamics Distribution of OpenTelemetry Collector.
- If any parameters marked "Do not modify" in Log Collector Settings - Advanced YAML Layout are missing, add them.
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- To apply these changes to your cluster, see Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Check whether your Log Collector pods restarted. If they did not, find the name of their daemonset and restart it:
kubectl get ds
kubectl rollout restart ds <daemonset-name> -n appdynamics
BASH
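For reference, the following is a minimal sketch of these settings in the legacy layout. It assumes the dotted parameter paths above map onto nested keys inside filebeatYaml; the endpoint and cluster name are placeholders, and the metric names are example values. Verify the exact structure against AppDynamics Log Collector Settings - Legacy YAML Layout.
appdynamics-cloud-k8s-monitoring:
  logCollectorConfig:
    filebeatYaml: |-
      # ... existing Filebeat configuration, including parameters marked "Do not modify" ...
      monitoring:
        enabled: true
        otlpmetric:
          endpoint: <cisco-appdynamics-otel-collector-endpoint>   # placeholder
          resource_attrs:
            k8.cluster.name: <your-cluster-name>                  # placeholder
          # Optional: restrict the exported set; omit to keep the default list
          metrics:
            - beat.memstats.memory_alloc
            - libbeat.output.write.bytes
YML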
Send Metrics from Cisco AppDynamics Distribution of OpenTelemetry Collector to Cisco Cloud Observability
The Cisco AppDynamics Distribution of OpenTelemetry Collector has an otlphttp exporter with OAuth that is preconfigured to send metrics to Cisco Cloud Observability, so you don't need to configure anything.
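For orientation only, the sketch below shows roughly what an OTLP-over-HTTP exporter with OAuth looks like in a generic OpenTelemetry Collector configuration, using the upstream otlphttp exporter and oauth2client extension. The Cisco AppDynamics Distribution of OpenTelemetry Collector ships its own preconfigured equivalent (driven by the clientId, clientSecret, tokenUrl, and endpoint Helm values shown later on this page), so you do not add this yourself; all values shown are placeholders.
extensions:
  oauth2client:
    client_id: <client-id>            # placeholder
    client_secret: <client-secret>    # placeholder
    token_url: <token-url>            # placeholder

receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  otlphttp:
    endpoint: <tenant-otlp-endpoint>  # placeholder
    auth:
      authenticator: oauth2client

service:
  extensions: [oauth2client]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp]
YML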
Troubleshooting Using Filebeat Metrics
Prerequisites
- Monitoring needs to be enabled in your Log Collector configuration for the metrics to be reported.
Find the Right Log Collector Entity
Follow these steps to find the ID of the logcollector:log_collector_instance entity you want to troubleshoot.
To list all the logcollector:log_collector_instance entities which have sent Filebeat metrics to your Tenant, run this query in Query Builder:
fetch id, attributes from entities(logcollector:log_collector_instance)
CODE
Add filters to your query to narrow down the results. You can filter by any of these attributes in a metric packet to get the correct logcollector:log_collector_instance:
Attribute | Description for Kubernetes-based Log Collector | Description for Non-Kubernetes-based Log Collector
---|---|---
agent.display.name | "Log Collector" | "Log Collector"
agent.name | Helm release name | Linux service name
agent.deployment.name | Kubernetes Deployment name. Identical to the output of the kubectl get deployment command. | Your own deployment name for this instance of the Log Collector
agent.platform.id | Kubernetes cluster ID | AWS EC2 instance ARN
agent.platform.type | "k8s" | "aws_ec2"
agent.version | Log Collector version | Log Collector version
agent.platform.name | Kubernetes cluster name | AWS EC2 instance name
agent.deployment.type | The Kubernetes workload type: "deployment", "statefulset", or "daemonset" | " " (an empty string of length 1)
agent.deployment.unit | Kubernetes pod name | " " (an empty string of length 1)
agent.deployment.scope | Kubernetes namespace name | " " (an empty string of length 1)
Sample query for Kubernetes-based Log Collector:
since now-2h fetch id, attributes from entities(logcollector:log_collector_instance)[attributes("agent.platform.name")="k8s-cluster-name" && attributes("agent.deployment.unit")="lca-pod-name"]
SQL
Sample query for EC2-based Log Collector:
since now-2h fetch id, attributes from entities(logcollector:log_collector_instance)[attributes("agent.deployment.name")="custom-deployment-name" && attributes("agent.platform.name")="EC2-instance-name"]
SQL
Create UQL Queries for Metrics
Follow these steps to query the system for Filebeat metrics using Explore > Query Builder.
Once you have the correct logcollector:log_collector_instance entity, you can create queries like the following to fetch metrics.
Query to fetch filebeat.input.filestream.events.write values for the past 2 hours with a granularity of 5 minutes:
since now - 2h fetch metrics(lca:filebeat.input.filestream.events.write) from entities(logcollector:log_collector_instance:5EZqxMcvPRKOkvpya3aJXA) limits metrics.granularityDuration(PT5M)
SQL
For metrics of type Sum, you might be interested in just the current value of the metric, not the deltas. For these, you can fetch the cumulative sum of the metric up to the present time like this:
since now - 2h until now fetch metrics(lca:filebeat.input.filestream.files.open).sumCumulative from entities(logcollector:log_collector_instance:5EZqxMcvPRKOkvpya3aJXA)
SQL
Query to fetch the agent:health metric for a health check:
since now-1h fetch metrics(agent:health) from entities(logcollector:log_collector_instance)[attributes("agent.deployment.name")="custom-deployment-name" && attributes("agent.platform.name")="EC2-instance-name"]
SQL
If the metric is of the type Sum, the values returned are delta values, so if you want to see the absolute value, you might need to fetch the cumulative sum. If the metric is of the type Gauge, the values returned are the absolute values. For more about metric content types, see Cisco AppDynamics Metrics Model.
Some Troubleshooting Scenarios Using Log Collector Metrics
agent:health
This metric is reported when Agent Management is enabled. It serves as a liveness check for the Log Collector instance. A value of "1" is reported every 5 minutes if the collector is healthy. You can configure health rules on this metric, with alerting enabled. This can help to pinpoint any Log Collector instance that is down in a large cluster.
filebeat.events.active
This metric represents the number of events (logs) that are in flight: logs that the Log Collector has ingested but not yet exported, and that are therefore still held in memory.
- Generally, if log load is consistent, this metric is also nearly constant.
- It is correlated to the number of logs coming in.
- If you see an abrupt spike in the metric, logs are piling up in memory in the Log Collector, usually because it cannot export them to Cisco AppDynamics Distribution of OpenTelemetry Collector. After a while the metric usually plateaus, because some logs are dropped once the maximum number of retries is reached.
- The metric then decreases to normal levels again once the Log Collector is able to export the logs to Cisco AppDynamics Distribution of OpenTelemetry Collector.
filebeat.input.filestream.files.open
This metric tracks the number of files the Log Collector is collecting logs from, so it should correspond to the number of containers that match the Log Collector configuration and are actively producing logs.
- An increase in the metric means the Log Collector is collecting logs from more containers or pods. If this is unintentional, investigate why new pods have come up; you may need to update the Log Collector configuration so that the Log Collector does not collect logs from these containers.
- A decrease in the metric means the Log Collector is collecting logs from fewer containers or pods. This is either because pods have gone down, or because the Log Collector is not collecting logs from pods it is supposed to collect from, possibly due to high load or incorrect configuration.
filebeat.input.filestream.events.create
This metric tracks the creation of the log files from which the Log Collector is collecting logs.
- This metric usually increases at a constant rate, because Kubernetes constantly creates and deletes files as part of log file rotation.
- Sudden increases can occur when new pods spin up and the Log Collector starts collecting logs from them. If that is not the case, it might be worth investigating what caused the spike; files might be getting rotated more frequently than normal.
filebeat.input.filestream.events.delete
This metric tracks the deletion of log files.
- Typically it is either 0 (and therefore not reported) or increases at a nearly constant rate (as log files are deleted at a constant rate).
- A steep increase in this metric might mean logs are being deleted more frequently, which could be because the retention period is shorter than required, or because of heavy load combined with space limitations.
filebeat.input.filestream.events.write
This metric tracks the number of write events taking place in the files from which the Log Collector is collecting logs.
- It is directly correlated to the log volume being ingested by the Log Collector.
- Any increase or decrease in this metric reflects a corresponding change in the volume of logs coming in.
libbeat.output.write.bytes
This metric tracks the total bytes exported by the Log Collector.
- It is directly correlated to the log volume being exported by the Log Collector.
- Any increase or decrease in this metric reflects a corresponding change in the volume of logs being exported by the Log Collector.
- In a stable environment with a healthy Log Collector pod running, the log volume coming into and going out of the Log Collector should be the same, so the metrics filebeat.input.filestream.events.write and libbeat.output.write.bytes are usually correlated: if you take their ratio, it should be nearly constant.
- However, if you observe that the ratio is changing, you can draw inferences from that. For example, in an environment with a high load, if the filebeat.input.filestream.events.write metric remains consistent but the libbeat.output.write.bytes metric is decreasing, the Log Collector might not be keeping up with the log volume, or there might be an issue in the processing pipeline, and a lag might be building up.
libbeat.output.write.errors
This metric keeps track of the number of errors occurring in the Log Collector exporter.
- Typically it is 0 (and therefore not reported) if there are no errors while exporting.
- Possible errors might be internal to the Log Collector, such as a failure to convert logs to OTLP format.
- Errors might also occur while exporting logs to Cisco AppDynamics Distribution of OpenTelemetry Collector, for example a network issue that prevents the Log Collector from reaching it, or a problem with the Cisco AppDynamics Distribution of OpenTelemetry Collector configuration, such as a misconfigured logs pipeline.
filebeat.input.filestream.events.truncate
This metric tracks the number of truncate events on the files from which the Log Collector is collecting logs. It is useful for tracking log file rotation in environments where rotation is done by truncating the log files, for example in a Java application where logging is handled by Log4j.
beat.memstats.memory_alloc
This metric tracks the total heap memory allocated to the Log Collector's Filebeat process. You can monitor Log Collector health by watching for spikes or dips in this metric compared to the established baseline.
system.load.norm.5 and system.load.norm.15
These metrics represent the normalized system CPU load over a given time range, 5 or 15 minutes. The value is always between 0 and 1. These metrics help you track the health of the node the Log Collector is running on: compared to the established baseline, abrupt spikes or dips might indicate an issue with the health of the node the Log Collector pod is running on.
Send Metrics from Cisco AppDynamics Distribution of OpenTelemetry Collector to Prometheus
- Deploy Prometheus on your cluster and set its scrape target to 'appdynamics-otel-collector-service:8889'. (A minimal scrape configuration sketch appears after this procedure.)
- Verify that you can access the Prometheus dashboard:
  - Port-forward to your prometheus service or pod:
kubectl port-forward <prometheus-service-or-pod-endpoint> 9090:9090 -n appdynamics
BASH
  - Verify that you can see the Prometheus dashboard in your web browser at localhost:9090.
- In collectors-values.yaml, add these sections to appdynamics-otel-collector. For details on each parameter, see Advanced Settings for the Cisco AppDynamics Distribution of OpenTelemetry Collector.
appdynamics-otel-collector:
  clientId: ...
  clientSecret: ...
  endpoint: ...
  tokenUrl: ...
  # add the part below to the config
  configOverride:
    exporters:
      prometheus/local:
        endpoint: "0.0.0.0:8889"
        resource_to_telemetry_conversion:
          enabled: true
    service:
      pipelines:
        metrics:
          exporters: [otlphttp, prometheus/local]
  # the service exposes the collector to external traffic
  service:
    name: "appdynamics-otel-collector-service"
    ports:
      - name: http
        port: 4318
        protocol: TCP
        targetPort: 4318
      - name: grpc
        port: 4317
        protocol: TCP
        targetPort: 4317
      - name: zpage
        port: 55679
        protocol: TCP
        targetPort: 55679
      - name: prometheus
        port: 8889
        protocol: TCP
        targetPort: 8889
YML
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- Apply the override YAML's changes to appdynamics-collectors: see "Upgrade Collectors" in Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Confirm that you see the additional port 8889 exposed in service/appdynamics-otel-collector-service:
NAME                                         TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                AGE
...
service/appdynamics-otel-collector-service   ClusterIP   10.111.179.127   <none>        4318/TCP,4317/TCP,55679/TCP,8889/TCP   9m57s
...
YML
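As referenced in step 1, here is a minimal sketch of a Prometheus scrape configuration targeting the collector's Prometheus port. The job name is arbitrary and the scrape interval is an example; adjust these to match how you deploy Prometheus (for example, via its Helm chart or Operator).
# prometheus.yml (sketch)
global:
  scrape_interval: 30s                        # example interval
scrape_configs:
  - job_name: appdynamics-otel-collector      # arbitrary job name
    static_configs:
      - targets: ['appdynamics-otel-collector-service:8889']
YML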
Optional: Write Metrics to the Log File
If you want to write metrics to the Log Collector's log files instead of, or in addition to, exporting them via the OTLP exporter, follow these steps. A minimal example of these settings appears after this procedure.
- Set monitoring.enabled to true.
- If you're using a legacy collectors-values.yaml, set monitoring.otlp.resource_attrs.k8.cluster.name to the name of your cluster.
- Do not modify the list of metrics to export in monitoring.otlp.metrics. This setting isn't supported when logging metrics.
- (Optional) To modify the monitoring period for metrics, set logging.metrics.period. Its default is 30s (30 seconds).
- (Optional) To enable the writing of metrics to the log files, set logging.files.enabled to true. If logging.metrics.enabled is true but logging.files.enabled is false, the Log Collector writes metrics logs (and other logs) to console/pod-logs.
- If you're using a legacy collectors-values.yaml and any parameters marked "Do not modify" in Log Collector Settings - Advanced YAML Layout are missing from your collectors-values.yaml, add them.
- Validate collectors-values.yaml with a YAML validator like YAML Lint.
- To apply these changes to your cluster, run the helm upgrade command with the override settings you specified in collectors-values.yaml. See Upgrade or Uninstall Kubernetes and App Service Monitoring.
- Check whether your Log Collector pods restarted. If they did not, find the name of their daemonset and restart it:
kubectl get ds
kubectl rollout restart ds <daemonset-name> -n appdynamics
BASH
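For reference, here is a minimal sketch of these settings in the simplified layout. It assumes the monitoring and logging blocks sit under appdynamics-cloud-k8s-monitoring.logCollectorConfig, as the dotted parameter paths above suggest; verify the exact structure against Log Collector Settings, and use the legacy parameter paths if your collectors-values.yaml is in a legacy layout.
appdynamics-cloud-k8s-monitoring:
  logCollectorConfig:
    monitoring:
      enabled: true
    logging:
      metrics:
        enabled: true      # write metrics to logs
        period: 30s        # default monitoring period
      files:
        enabled: true      # write to log files instead of console/pod logs
YML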