Overview of Infrastructure Visibility

You can determine the root cause of application issues by looking at application, network, server, and machine metrics that measure infrastructure utilization.
For example, the following infrastructure issues may slow down your application:

- Too much time spent in garbage collection of temporary objects (application metric)
- Packet loss between two nodes that results in re-transmissions and slow calls (network metric)
- Inefficient processes that result in high CPU utilization (server metric)
- Excessively high rates of reads/writes on a specific disk or partition (hardware metric)

Infrastructure Visibility enables you to isolate, identify, and troubleshoot these types of issues. Infrastructure Visibility is based on a Machine Agent that runs with an App Server Agent on the same machine. Together with the Network Agent, these agents provide multi-layer monitoring:

- The App Server Agent collects metrics about applications and identifies applications, tiers, and nodes with slow transactions, stalled transactions, and other application-performance issues.
- The Network Agent monitors the network packets sent and received on each node and identifies lost/re-transmitted packets, TCP bottlenecks, high round trip times, and other network issues.
- The Machine Agent collects metrics at two levels:
  - Server Visibility metrics for local processes, services, and resource utilization.
  - Basic machine metrics for disks, memory, CPU, and network interfaces.

This multi-layer monitoring enables you to determine possible correlations between application issues and service, process, hardware, network, or other issues on the machine.

Agent Monitoring Metrics

Network Visibility

Network Visibility monitors traffic flows, network packets, TCP connections, and TCP ports. Network Agents leverage the APM intelligence of App Server Agents to identify the TCP connections used by each application. Network Visibility includes:

- Detailed metrics about dropped/re-transmitted packets, TCP window sizes (Limited/Zero), connection setup/tear down issues, high round trip times, and other performance-impacting issues
- Network Dashboard that highlights network KPIs (Key Performance Indicators) for tiers, nodes, and network links
- Right-click dashboards for tiers, nodes, and network links that enable quick drill-downs from transaction outliers to network root causes
- Automatic mapping of TCP connections with application flows
- Automatic detection of intermediate load balancers that split TCP connections
- Diagnostic mode for collecting advanced diagnostic information for individual connections

Server Visibility

Server Visibility monitors local processes, services, and resource utilization. You can use these metrics to identify time windows when problematic application performance correlates with problematic server performance on one or more nodes.
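
The idea of correlating application and server performance over time windows can be illustrated with a short sketch. It is not how the product computes anything; it only shows one way to check whether two metric series move together. The sample values, the thresholds, and the function names `pearson` and `suspicious_windows` are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def suspicious_windows(response_ms, cpu_pct, slow_ms=500, busy_pct=85):
    """Indexes of time windows where both series exceed their threshold."""
    return [
        i
        for i, (resp, cpu) in enumerate(zip(response_ms, cpu_pct))
        if resp > slow_ms and cpu > busy_pct
    ]


# Hypothetical one-minute samples for a single node.
response_ms = [120, 130, 640, 710, 125, 118]
cpu_pct = [35, 38, 92, 95, 40, 33]

print(round(pearson(response_ms, cpu_pct), 2))
print(suspicious_windows(response_ms, cpu_pct))  # [2, 3]
```

A coefficient close to 1 together with a non-empty list of windows suggests that the slow periods coincide with high CPU on that node, which is the kind of time window worth investigating further.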

Server Visibility is an add-on module to the Machine Agent. With Server Visibility enabled, the Machine Agent provides the following functionality:

- Extended hardware metrics such as machine availability, disk/CPU/virtual-memory utilization, and process page faults
- Monitoring of application nodes that run inside Docker containers, and identification of container issues that impact application performance
- The Tier Metric Correlator, which enables you to identify load and performance anomalies across all nodes in a tier
- Support for importing and defining server tags used to query, filter, and compare related servers using custom metadata
- Monitoring of internal or external HTTP and HTTPS services
- Support for grouping servers so you can apply health rules to specific server groups
- Support for defining alerts that trigger when certain conditions are met or exceeded based on monitored server hardware metrics
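
To make the notion of a load anomaly across the nodes of a tier concrete, here is a minimal sketch that flags nodes whose metric value is far from the tier median. This is an illustrative technique chosen for its simplicity, not the algorithm used by the Tier Metric Correlator; the node names, the values, and the `tolerance` parameter are hypothetical.

```python
from statistics import median


def tier_outliers(node_values, tolerance=0.5):
    """Return the nodes whose value deviates from the tier median by more
    than `tolerance`, expressed as a fraction of the median."""
    if not node_values:
        return []
    mid = median(node_values.values())
    if mid == 0:
        # With a zero median, any non-zero node stands out.
        return sorted(n for n, v in node_values.items() if v != 0)
    return sorted(
        n for n, v in node_values.items() if abs(v - mid) / mid > tolerance
    )


# Hypothetical CPU utilization (%) for the nodes of one tier.
cpu_by_node = {"node-1": 41.0, "node-2": 44.0, "node-3": 93.0, "node-4": 39.0}
print(tier_outliers(cpu_by_node))  # ['node-3']
```

The median is used rather than the mean so that a single overloaded node does not shift the baseline it is compared against.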

Basic Machine Metrics

The Machine Agent collects basic hardware metrics from the server's OS and provides the following functionality:

- Basic hardware metrics from the server's OS such as CPU and memory utilization, throughput on network interfaces, and disk and network I/O
- Support for creating extensions to generate custom metrics
- Support for running remediation scripts to automate your runbook procedures. You can optionally configure the remediation action to require human approval before starting the script.
- JVM Crash Guard for monitoring JVM crashes and optionally running remediation scripts
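
As an illustration of what an extension that generates a custom metric might look like, the sketch below measures disk usage and prints one metric line to standard output. The `name=...,aggregator=...,value=...` line format, the `Custom Metrics|...` metric path, and the helper names are assumptions made for this example; check the Machine Agent extension documentation for the exact format and supported aggregator values before relying on it.

```python
import shutil


def format_metric(path, value, aggregator="OBSERVATION"):
    """Build one metric line. The line format is an assumption for this
    sketch, not a confirmed specification."""
    return f"name={path},aggregator={aggregator},value={int(round(value))}"


def disk_used_percent(mount="/"):
    """Percentage of the file system at `mount` that is in use."""
    usage = shutil.disk_usage(mount)
    return 100 * usage.used / usage.total


# Hypothetical metric path; the agent would be configured to run this
# script periodically and read the line from standard output.
print(format_metric("Custom Metrics|Disk|Root|Used %", disk_used_percent("/")))
```

Keeping the measurement (`disk_used_percent`) separate from the formatting (`format_metric`) makes it easy to emit additional metrics from the same script.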

Java and .NET Infrastructure Monitoring

Infrastructure Visibility uses different agents to monitor Java and .NET environments:

- The Java Agent collects metrics for business applications and JVMs. The Machine Agent collects Server Visibility and hardware/OS metrics.
- The .NET Agent collects metrics for business applications and instrumented CLRs. The .NET Agent includes a .NET Machine Agent that collects IIS and hardware/OS metrics (see Monitor Windows Hardware Resources). The Machine Agent collects Server Visibility metrics.

Infrastructure Visibility Strategies

You can use these strategies to locate infrastructure issues that affect application performance:

- Transaction snapshots for slow or stalled transactions – Use snapshots to correlate infrastructure metrics for the specific node so that you can identify the root cause of slow or stalled transactions.
- Metric correlation –
  - One example workflow is to open the Node Dashboard for a mission-critical server that has a Machine Agent installed, and then compare the data in the following tabs:
    - JVM (application performance)
    - JMX (server performance)
    - Server (hardware resource consumption)
  - The Network Dashboard includes right-click dashboards for tiers, nodes, and network links. Use these dashboards to find correlations between application issues and network root causes.
  - The Tier Metric Correlator enables you to identify load and performance anomalies in a tier composed of a cluster of nodes running on containers or servers.
- Infrastructure rules, policies, and alerts –
  - Create health rules on metrics such as garbage collection time, connection pool contention, or CPU usage to catch issues early, before they affect your business transactions.
  - Define policies that trigger actions (such as sending an email, starting diagnostics, or performing a thread dump) when infrastructure metrics reach a critical level.
  - Configure alerts for JVM and CLR crashes using JVM Crash Guard and the .NET Machine Agent, respectively.
  - Configure the agent to run scripts in response to critical events (for example, restart an application or JVM in response to a crash).
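
The relationship between health rules, policies, and actions described above can be sketched as a small model. It only illustrates the concept of evaluating a metric against warning and critical thresholds and mapping the result to actions; it is not the product's rule engine or API. The class `HealthRule`, the metric names, the threshold values, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthRule:
    """A threshold rule on one metric (hypothetical model)."""
    metric: str
    warning: float
    critical: float

    def evaluate(self, value):
        if value >= self.critical:
            return "CRITICAL"
        if value >= self.warning:
            return "WARNING"
        return "NORMAL"


# Hypothetical policy: which actions run for each status.
POLICY = {
    "WARNING": ["send_email"],
    "CRITICAL": ["send_email", "thread_dump"],
}


def actions_for(rules, sample):
    """Return (metric, status, action) tuples for every rule that fires."""
    triggered = []
    for rule in rules:
        if rule.metric not in sample:
            continue  # no data for this metric in the current sample
        status = rule.evaluate(sample[rule.metric])
        for action in POLICY.get(status, []):
            triggered.append((rule.metric, status, action))
    return triggered


rules = [
    HealthRule("gc_time_ms_per_min", warning=2000, critical=5000),
    HealthRule("cpu_pct", warning=75, critical=90),
]
sample = {"gc_time_ms_per_min": 2600, "cpu_pct": 94}

for item in actions_for(rules, sample):
    print(item)
```

With the sample above, the garbage collection rule reaches the warning level and the CPU rule reaches the critical level, so the CPU rule is the one that would trigger the thread dump.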

With the right monitoring strategy in place, you can be alerted to problems and fix them before user transactions are affected.