Network administrators have traditionally relied on packet capture when troubleshooting complex problems, and some have invested in expensive packet capture appliances to record and retain network traffic for later analysis. When they receive reports of performance degradations, they extract trace files and load them into a network analyzer (e.g., Wireshark or a commercial equivalent), where they can examine individual packets for clues.
The evolution of IT systems is challenging old ways of performing network diagnostics
While traffic data analysis will remain a viable source of troubleshooting information for performance degradations, network teams must take a fresh look at the problem, particularly in the context of complex IT infrastructure deployments combining physical, virtualized, and cloud environments, as well as software-defined networks (SDN).
What are the specific trends affecting packet capture that make it difficult for network administrators to quickly diagnose performance issues with stream-to-disk appliances and network analyzers, particularly in real time? Let’s find out.
Six reasons why traditional packet capture won’t work like before
1) Overall system complexity
The first challenge facing packet capture is due to the overall complexity of IT environments themselves. The ongoing deployment of physical, virtualized, and cloud environments, as well as software-defined networks (SDN) and software-as-a-service (SaaS) applications, is only adding to the underlying complexity of today’s IT architectures. These aspects include:
- IT systems are becoming more and more redundant, leading to multiple paths for data streams
- The number of applications in the largest organizations is now counted in the hundreds or thousands. While SaaS applications add to the total, the use of Citrix environments prolongs the life cycle of the oldest applications.
- The complexity of each application chain is such that simply determining where to tap traffic is a challenge in itself, particularly in multi-cloud and hybrid public-private cloud environments.
2) Traffic volumes
Traffic volumes continue to grow rapidly, making meaningful packet capture a real challenge. Whereas we used to measure actual network usage in hundreds of Mbps, many corporate networks now run at 10Gbps and higher.
When you consider that a 100MB trace file loaded into a software sniffer corresponds to only 80ms of a 10Gbps traffic stream, you begin to get a sense of the magnitude of the problem. Is it really possible to arrive at a proper diagnosis in a complex IT environment on the basis of such a short time interval?
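The 80ms figure above is simple arithmetic, and it is worth being able to reproduce it for your own link speeds and trace sizes. Here is a minimal sketch (the function name is our own, not from any tool):

```python
# How much wall-clock traffic does a given trace file represent
# on a fully utilized link?

def trace_duration_ms(trace_bytes: float, link_bps: float) -> float:
    """Milliseconds of traffic a trace file holds on a saturated link."""
    link_bytes_per_sec = link_bps / 8  # convert bits/s to bytes/s
    return trace_bytes / link_bytes_per_sec * 1000

# 100 MB trace file on a 10 Gbps link:
print(f"{trace_duration_ms(100e6, 10e9):.0f} ms")  # -> 80 ms
```

Swap in your own numbers to see how quickly a "large" trace file shrinks to a sliver of time as link speeds grow.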
3) History: throughput and data retention
Stream-to-disk appliances work by storing production traffic for later analysis. There is a direct relationship between the throughput to be analyzed, the storage available, and the retention time of the data. As an example, for an average of 500Mbps of traffic, if you require seven days of packet capture and traffic history, you will need an appliance with approximately 12TB of storage. Running on a fully saturated 10Gbps link, that same appliance will hold no more than about an hour and a half of history.
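The same storage/throughput/retention trade-off can be sketched as a raw upper bound: storage divided by sustained capture rate. Note that a real appliance retains less than this bound, since indexing, metadata, and capture overhead also consume disk (the helper name below is our own illustration):

```python
# Raw upper bound on retention for a stream-to-disk appliance.
# Actual retention is lower once indexing and metadata overhead
# are accounted for.

def max_retention_hours(storage_bytes: float, link_bps: float) -> float:
    """Upper bound on capture history for a saturated link, in hours."""
    bytes_per_sec = link_bps / 8  # convert bits/s to bytes/s
    return storage_bytes / bytes_per_sec / 3600

# 12 TB appliance on a fully saturated 10 Gbps link:
print(f"{max_retention_hours(12e12, 10e9):.1f} h upper bound")  # -> 2.7 h
```

Even this optimistic bound is measured in hours, not days, which is why saturated high-speed links so quickly exhaust stream-to-disk history.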
There are several additional challenges raised by the lack of sufficient time retention:
- Events are not reported in real time, and they are often intermittent. If you are not fortunate enough to see them go by during the short retention window, you cannot diagnose them. (And end-users will continue to be affected and unhappy.)
- Drawing conclusions on response times without being in a position to compare to a baseline—where the application is running properly—does not enable you to understand if there is a degradation and where it comes from.
4) “It’s not the network” isn’t acceptable anymore
The number of mission-critical applications has grown in proportion to the organization’s operations. A performance degradation now has business consequences like never before. Tools that can only show the network is performing fine, and is therefore not the cause of the degradation, are no longer enough.
Only those solutions that can pinpoint the root cause of the degradation end-to-end (i.e., wherever it comes from) can actually deliver business value.
5) Changes in data center topologies
With the need for redundancy and resilience, sustainability, and energy efficiency, and the introduction of new design architectures (such as OpenCompute) in support of those objectives, data center topologies have changed radically in the last five years, dramatically affecting packet capture opportunities. Network flows can take different paths across physical, virtualized, and software-defined networks (SDN), while most applications include load balancing and high availability (HA) mechanisms. This makes it that much more challenging to capture the precise traffic sample you need without ingesting an overwhelming volume of traffic.
New technologies have also brought new challenges:
- Virtualization is now broadly used. It also means that most of the key traffic may not travel through any physical cables or switches at any point in time.
- The instant, on-the-fly provisioning and the automated deployment of virtualized infrastructure create an additional challenge for legacy NPM solutions that are designed to capture traffic over a limited set of physical network segments.
6) Human resources and expertise
Unfortunately, IT staffing has not grown commensurate with the growth of most organizations, so there is not necessarily more time or expertise available to inspect trace files. And as we have discussed, manual handling of packet data cannot keep up with the scale of new data centers, presenting IT teams with a major challenge.
Given these six factors, how can IT teams maintain their ability to efficiently troubleshoot and monitor the performance of the networks and applications in their infrastructure?