Back pressure is common in microservices and service-oriented applications, especially in distributed environments. It occurs when two services communicate and one of them receives data faster than it can process it. At the network level, TCP flow control handles this through the window that the receiver advertises. This diagram shows the sequence of events:
- The receiver has a TCP receive buffer for incoming packets. It advertises the available buffer space (the receive window) to the sender.
- The sender sends a chunk of data no larger than the advertised window.
- The application on the receiver processes the packets in the buffer, freeing up space.
- The receiver advertises the new window size.
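The four steps above can be sketched as a small round-based simulation. This is a simplified model, not TCP itself: it assumes a single connection, ignores congestion control, retransmission, and timing, and treats each round as one advertise, send, and process cycle. The names `Receiver` and `transfer` and all byte counts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Receiver:
    """Hypothetical model of a receiver's TCP receive buffer (sizes in bytes)."""
    capacity: int
    buffered: int = 0

    def window(self) -> int:
        # The advertised window is the free space left in the buffer.
        return self.capacity - self.buffered

    def accept(self, nbytes: int) -> None:
        # A well-behaved sender never exceeds the advertised window.
        if nbytes > self.window():
            raise ValueError("sender exceeded the advertised window")
        self.buffered += nbytes

    def consume(self, nbytes: int) -> None:
        # The application reads data out of the buffer, freeing space.
        self.buffered -= min(nbytes, self.buffered)


def transfer(total: int, capacity: int, read_per_round: int) -> list[tuple[int, int]]:
    """Simulate rounds of send -> process -> re-advertise.

    Returns (bytes_sent, window_advertised_afterwards) for each round.
    """
    if capacity <= 0 or read_per_round <= 0:
        raise ValueError("capacity and read_per_round must be positive")
    rx = Receiver(capacity)
    remaining = total
    history = []
    while remaining > 0 or rx.buffered > 0:
        # Steps 1-2: the sender sends no more than the advertised window.
        chunk = min(rx.window(), remaining)
        rx.accept(chunk)
        remaining -= chunk
        # Step 3: the receiving application drains part of the buffer.
        rx.consume(read_per_round)
        # Step 4: the receiver advertises its new window.
        history.append((chunk, rx.window()))
    return history


if __name__ == "__main__":
    for sent, window in transfer(total=10_000, capacity=4_000, read_per_round=1_000):
        print(f"sent {sent:>5} bytes, new window {window:>5} bytes")
```

With a 4,000-byte buffer and an application that reads 1,000 bytes per round, the sender transmits 4,000 bytes in the first round and then only 1,000 bytes per round: it can never send more than the application has just freed.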
This process works smoothly until the receiving node cannot process incoming packets quickly enough. The buffer then fills up, and the receiver advertises a small window (TCP Limited) or a window of zero (TCP Zero Window). The sender must slow down or pause until space frees up, so the transfer slows or stalls.
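A minimal sketch of how a slow receiver drives its window from open to limited to zero. It records the window the receiver would advertise right after data arrives, before the application reads, and labels each advertisement. The function names, the per-round rates, and the 25% cutoff for a "limited" window are hypothetical assumptions for illustration; they are not how any particular monitoring product defines these events.

```python
def advertised_windows(capacity: int, send_per_round: int,
                       read_per_round: int, rounds: int) -> list[int]:
    """Return the window advertised after data arrives in each round (bytes)."""
    buffered = 0
    windows = []
    for _ in range(rounds):
        # The sender offers send_per_round bytes but never exceeds the free space.
        buffered += min(send_per_round, capacity - buffered)
        # The receiver acknowledges the data and advertises what space is left.
        windows.append(capacity - buffered)
        # The (possibly slow) application then reads from the buffer.
        buffered -= min(read_per_round, buffered)
    return windows


def classify(window: int, capacity: int, limited_fraction: float = 0.25) -> str:
    """Label an advertised window; the 25% threshold is a hypothetical choice."""
    if window == 0:
        return "zero"
    if window < capacity * limited_fraction:
        return "limited"
    return "open"


if __name__ == "__main__":
    windows = advertised_windows(capacity=4000, send_per_round=2000,
                                 read_per_round=500, rounds=5)
    print([classify(w, 4000) for w in windows])
```

With a 4,000-byte buffer, a sender offering 2,000 bytes per round, and an application reading only 500 bytes per round, the advertisements run `['open', 'limited', 'zero', 'zero', 'zero']`: after the third round the sender can push only as much as the application frees. If the application instead reads 2,000 bytes per round, the window stays open.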
For example, a DevOps engineer responsible for monitoring a mission-critical app scans the Application Dashboard and notices that a link and a tier have turned yellow and another link has turned red. She decides to investigate.
- She switches to the Network Dashboard and sees a significant increase in Performance Impacting Events (PIE) between the Order-Tier and the Payment-Tier. While the Application Dashboard shows performance issues on the upstream Order-Tier, the Network Dashboard suggests that the problem lies further downstream, between the Order-Tier and the Payment-Tier.
- To troubleshoot further, she goes to the Transaction Snapshots list, filters on stalled transactions, and double-clicks on a transaction to drill down.
- She drills down into the Order-Tier, since nearly all of the transaction time (99.7%) occurred at this tier.
- On the Transaction Drilldown page, she switches to the Network view. Scanning it, she immediately sees that:
  - The transaction correlates with a spike in stalled calls and Performance Impacting Events.
  - All of these events (Client Limited and Client Zero Window) took place on the client (Order-Tier) node.
- She returns to the Network Dashboard, right-clicks on the Order-Tier, and chooses View Metrics. She immediately sees that the Client Limited and Client Zero Window events are spiking for the entire tier and correlate with the spike in stalled calls.