You only need a few measurements to determine the health of a network.
Problems that affect performance inevitably arise in computer networks. When they occur, network engineers use a combination of skill, experience, and tools to decide which measurements to review, and in what order, so that they can segment the problem and get to the root cause. Once the problem is found, network engineers can remediate it and restore full network and application services.
Only a few measurements of networks and the applications delivered across them are needed. They are:
- availability and reliability (of nodes and apps),
- network performance primitives (jitter, latency, loss & throughput), and
- application performance (response time).
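As a rough illustration of how the network performance primitives and availability in the list above can be derived from raw data, the sketch below summarizes one run of synthetic probe packets. It is a minimal sketch under stated assumptions: the `Probe` record format, `summarize_probes()`, and `availability_pct()` are hypothetical names; one-way latency assumes synchronized sender and receiver clocks; and jitter is computed as the mean absolute change in delay between consecutive packets, a simplification of the smoothed estimator defined in RFC 3550.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional, Sequence


@dataclass
class Probe:
    """One synthetic test packet (hypothetical record format)."""
    seq: int                 # sequence number assigned by the sender
    sent_s: float            # send timestamp in seconds
    recv_s: Optional[float]  # receive timestamp in seconds, or None if lost
    size_bytes: int          # packet size in bytes


def summarize_probes(probes: Sequence[Probe]) -> dict:
    """Derive loss, latency, jitter and throughput from one probe run.

    Assumes at least one probe was sent.
    """
    ordered = sorted(probes, key=lambda p: p.seq)
    received = [p for p in ordered if p.recv_s is not None]
    loss_pct = 100.0 * (len(ordered) - len(received)) / len(ordered)
    if not received:
        return {"loss_pct": loss_pct, "latency_ms": None,
                "jitter_ms": None, "throughput_bps": None}

    # One-way delay per packet; assumes sender and receiver clocks are synchronized.
    delays = [p.recv_s - p.sent_s for p in received]
    latency_ms = 1000.0 * mean(delays)

    # Jitter: mean absolute change in delay between consecutive received packets
    # (a simplification of the running estimator in RFC 3550).
    steps = [abs(b - a) for a, b in zip(delays, delays[1:])]
    jitter_ms = 1000.0 * mean(steps) if steps else 0.0

    # Rough throughput: bits delivered between the first send and the last receipt.
    span_s = max(p.recv_s for p in received) - min(p.sent_s for p in ordered)
    bits = 8 * sum(p.size_bytes for p in received)
    throughput_bps = bits / span_s if span_s > 0 else 0.0

    return {"loss_pct": loss_pct, "latency_ms": latency_ms,
            "jitter_ms": jitter_ms, "throughput_bps": throughput_bps}


def availability_pct(poll_ok: Sequence[bool]) -> float:
    """Share of polls (e.g. pings or health checks) that succeeded, in percent."""
    return 100.0 * sum(poll_ok) / len(poll_ok)


run = [Probe(1, 0.000, 0.020, 1000), Probe(2, 0.100, 0.125, 1000),
       Probe(3, 0.200, None, 1000), Probe(4, 0.300, 0.320, 1000)]
print(summarize_probes(run))  # loss 25%, latency ~21.7 ms, jitter ~5 ms, ~75,000 b/s
print(availability_pct([True, True, False, True]))  # 75.0
```

In practice each primitive would be tracked over time rather than from a single run; the point here is only that all four can come from the same set of timestamped test packets.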
Of course, there are myriad useful measures that help determine the root cause of a network or application slowdown, but these are derivative, or most useful once a fundamental measurement indicates there is an issue. A discovered map of infrastructure and dependencies, while not a measurement per se, is vital to visualizing and understanding one's network. Too often, undocumented changes are made, and even responsible network engineers can be surprised by how their networks, servers, and apps are configured.
When measuring networks, one makes tradeoffs between cost and visibility in selecting the data sources from which the fundamental measurements are derived. Primary data sources include SNMP (Simple Network Management Protocol) data and MIBs (Management Information Bases) gathered from network nodes; NetFlow/IPFIX flow data from routers and switches; the results of synthetic network tests (squirt bits onto the network and measure/compare at a receiver); and information gleaned from passive packet inspection.
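One common way SNMP data is turned into a performance measurement is link utilization derived from interface octet counters (for example `ifHCInOctets`, a 64-bit counter in the IF-MIB, or the 32-bit `ifInOctets`) polled twice, a known interval apart. The sketch below assumes the counter values and the interface speed have already been retrieved by whatever SNMP tool is in use; it does not query devices, and `counter_delta()` and `utilization_pct()` are hypothetical helper names.

```python
def counter_delta(prev: int, curr: int, counter_bits: int = 64) -> int:
    """Octets counted between two polls, allowing for a single counter wrap.

    A device reboot also makes the counter go backwards; this sketch cannot
    tell that apart from a wrap, so such samples should be discarded upstream.
    """
    if curr >= prev:
        return curr - prev
    return curr + (1 << counter_bits) - prev


def utilization_pct(prev_octets: int, curr_octets: int, interval_s: float,
                    speed_bps: float, counter_bits: int = 64) -> float:
    """Average link utilization over one polling interval, in percent."""
    delta = counter_delta(prev_octets, curr_octets, counter_bits)
    # Octets -> bits, divided by the bits the link could have carried.
    return 100.0 * delta * 8 / (interval_s * speed_bps)


# Two polls 10 s apart on a 100 Mb/s interface: 12.5 MB transferred.
print(utilization_pct(1_000_000, 13_500_000, 10, 100_000_000))  # 10.0
```

Because the result is an average over the polling interval, short bursts are smoothed away; this is one reason SNMP gives only modest performance insight.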
While these sources provide somewhat different information, they overlap and can sometimes substitute for each other. SNMP is the least expensive but provides only modest performance insight. Flow data from the infrastructure is an excellent surrogate for traffic analysis in WANs and is less expensive than looking at packets. Synthetic tests are useful when instrumentation cannot be placed in the required locations (as when applications or infrastructure are outsourced and there is no access to the data center) or when special projects, such as assessing network readiness for high-definition videoconferencing, call for them. However, they report only end-to-end totals and give little insight into where along the network path a problem lies, which limits their value for root cause analysis. Packet inspection provides the most visibility and the most accurate reflection of user experience, but it is also often the most expensive option (high-performance hardware is needed to capture and analyze high-volume network traffic) unless the packet analyzer is focused on a specific protocol of interest (VoIP, DNS, multicast, and so on). At one time or another, network engineers need all of these measurement sources, and analyzers to interpret them, to assure the performance of their networks.
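To make "surrogate for traffic analysis" concrete, a typical question answered from flow data is who and what is consuming WAN bandwidth. The sketch below assumes flow records have already been decoded by a collector into a simple tuple; the `FlowRecord` field names are hypothetical (they loosely correspond to IPFIX information elements such as `octetDeltaCount` and `destinationTransportPort`), as are `top_talkers()` and `top_ports()`.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class FlowRecord(NamedTuple):
    """Minimal subset of a decoded flow record (hypothetical field names)."""
    src: str
    dst: str
    dst_port: int
    octets: int


def top_talkers(flows: Iterable[FlowRecord], n: int = 5) -> List[Tuple[str, int]]:
    """Source addresses ranked by total bytes sent."""
    totals: Counter = Counter()
    for f in flows:
        totals[f.src] += f.octets
    return totals.most_common(n)


def top_ports(flows: Iterable[FlowRecord], n: int = 5) -> List[Tuple[int, int]]:
    """Destination ports (a rough proxy for applications) ranked by bytes."""
    totals: Counter = Counter()
    for f in flows:
        totals[f.dst_port] += f.octets
    return totals.most_common(n)


flows = [
    FlowRecord("10.0.0.5", "10.0.1.9", 443, 5_000_000),
    FlowRecord("10.0.0.7", "10.0.1.9", 443, 1_200_000),
    FlowRecord("10.0.0.5", "10.0.2.3", 53, 40_000),
    FlowRecord("10.0.0.8", "10.0.1.4", 5060, 300_000),
]
print(top_talkers(flows, 2))  # [('10.0.0.5', 5040000), ('10.0.0.7', 1200000)]
print(top_ports(flows, 1))    # [(443, 6200000)]
```

Note what this cannot show: flow records summarize volume per conversation, but not per-packet timing, so they say little about latency or jitter. That gap is where packet inspection earns its extra cost.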
Depending on the network media type, other measurements can also help indicate the health of the network or point to the root cause of failures or slowdowns. On some Ethernet LANs, one might need to verify the power draw of the PoE system that supports cameras, or assess alien crosstalk to prepare for an upgrade to 10G. On Wi-Fi networks, signal-to-noise, RF interferer, and hand-off time measurements can show network engineers how to improve wireless performance. On fiber runs, the cleanliness of end faces or the incidence of micro-bends might explain why throughput is lower than expected. Cellular networks, too, have media-specific measures, not unlike the Wi-Fi measurements. These and other measurements provide detailed insight, but first an indicator measurement should be monitored and used to alert on issues.
The best operating practice is to monitor leading indicator measurements, especially end-user response time, which should point to the problem domain (application, server, network, or client) when there is an issue. Then follow the outliers, using detailed measurements, toward the associated degradation factors to reach the root cause.
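The monitor-then-follow-the-outliers practice can be sketched as a small triage routine. This is a minimal illustration under several assumptions: each transaction's response time is already broken into client, network, server, and application components (as some packet-based analyzers report); the baseline is the median over a known-good period; and an alert fires when the total exceeds twice that baseline. `DOMAINS`, `baseline()`, `triage()`, and the `factor=2.0` threshold are illustrative choices, not a standard.

```python
from statistics import median
from typing import Dict, Mapping, Optional, Sequence

# Hypothetical breakdown of one transaction's response time, in milliseconds.
DOMAINS = ("client", "network", "server", "application")


def baseline(history: Sequence[Mapping[str, float]]) -> Dict[str, float]:
    """Median of each component, and of the total, over a known-good period."""
    base = {d: median(t[d] for t in history) for d in DOMAINS}
    base["total"] = median(sum(t[d] for d in DOMAINS) for t in history)
    return base


def triage(txn: Mapping[str, float], base: Mapping[str, float],
           factor: float = 2.0) -> Optional[str]:
    """Return the likely problem domain, or None if response time is normal."""
    total = sum(txn[d] for d in DOMAINS)
    if total <= factor * base["total"]:
        return None  # leading indicator within its normal range: no alert
    # Follow the outlier: blame the component that grew most over its baseline.
    return max(DOMAINS, key=lambda d: txn[d] - base[d])


history = [
    {"client": 5, "network": 20, "server": 10, "application": 15},
    {"client": 6, "network": 22, "server": 12, "application": 14},
    {"client": 4, "network": 18, "server": 11, "application": 16},
]
base = baseline(history)
print(triage({"client": 5, "network": 25, "server": 12, "application": 15}, base))   # None
print(triage({"client": 5, "network": 140, "server": 12, "application": 16}, base))  # network
```

The domain returned here is only a starting point; the detailed, media- or protocol-specific measurements would then be used to find the actual degradation factor within that domain.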