concept  #Observability  #Reliability  #Platform  #Security

Network Monitoring

Continuous monitoring of networks to detect outages, performance issues and security incidents.

Established
Medium

Classification

  • Medium
  • Technical
  • Architectural
  • Intermediate

Technical context

  • SNMP-based polling and traps
  • Flow systems (NetFlow/IPFIX) for traffic analysis
  • Metric pipelines like Prometheus and exporters
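
To make the metric-pipeline item more concrete, the following is a minimal sketch of how an exporter might render a single gauge in the Prometheus text exposition format. The metric name, label names and device names are hypothetical, and a real exporter would normally use an official client library rather than hand-built strings.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline inside a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_gauge(name: str, help_text: str,
                 samples: list[tuple[dict[str, str], float]]) -> str:
    """Render one gauge metric family as Prometheus exposition text.

    Assumes help_text contains no backslashes or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical metric and labels, for illustration only.
text = render_gauge(
    "interface_utilization_ratio",
    "Fraction of interface capacity in use.",
    [({"device": "core-sw-1", "interface": "Gi0/1"}, 0.42)],
)
print(text, end="")
```

A scraper would fetch this text over HTTP at a regular interval; serving it is left out here to keep the sketch short.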

Principles & goals

  • Measurability: Key KPIs must be instrumented and reproducible.
  • Correlation: Correlate metrics, logs and traces for effective troubleshooting.
  • Alert quality over quantity: Optimize signal-to-noise to avoid fatigue.
Run
Enterprise, Domain, Team

Use cases & scenarios

Compromises

  • Alert fatigue from too many irrelevant notifications.
  • Security risks from unsecured monitoring interfaces.
  • Wrong decisions due to incomplete or biased data.

  • Define sensible baselines before setting alerts.
  • Route alerts by on-call roles and escalation rules.
  • Set retention policy aligned with compliance and analytics needs.
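
Routing by on-call role with escalation can be sketched as a lookup from team and severity to an ordered chain of recipients. Everything named here (the teams, the recipient identifiers, the 15-minute escalation step) is an illustrative assumption, not a recommendation for specific values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    severity: str  # "critical", "warning" or "info"
    team: str


# Hypothetical escalation chains: first entry is notified first.
ESCALATION = {
    ("network", "critical"): [
        "network-oncall-primary",
        "network-oncall-secondary",
        "network-manager",
    ],
    ("network", "warning"): ["network-oncall-primary"],
}
DEFAULT_CHAIN = ["noc-inbox"]  # fallback for unmapped team/severity pairs


def recipient(alert: Alert, minutes_unacknowledged: int,
              step_minutes: int = 15) -> str:
    """Pick who to notify, moving one level up the chain per elapsed step."""
    chain = ESCALATION.get((alert.team, alert.severity), DEFAULT_CHAIN)
    level = min(minutes_unacknowledged // step_minutes, len(chain) - 1)
    return chain[level]


link_down = Alert("backbone link down", "critical", "network")
print(recipient(link_down, 0))   # primary on-call
print(recipient(link_down, 20))  # escalated once
```

Keeping the routing table as data rather than code makes it easier to review and document, which also addresses the risk of undocumented alert rules.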

I/O & resources

  • SNMP/gNMI access to network devices
  • Flow data (NetFlow, sFlow, IPFIX)
  • Topology and inventory information

  • Dashboards with metrics and trend views
  • Configurable alerts and escalation paths
  • Regular capacity and incident reports

Description

Network monitoring is the continuous observation of network devices, links and traffic to detect faults, measure performance, and ensure availability. It combines metric collection, alerting and visualization to support troubleshooting and capacity planning. Effective monitoring enables SLA verification, incident response and trend-based optimization.

  • Early detection of outages and faster recovery times.
  • Data-driven capacity planning and cost optimization.
  • Improved incident response and documented SLAs.

  • Blind spots when telemetry or devices are not instrumented.
  • High data overhead when retaining large metric and flow datasets.
  • False-positive alerts with overly coarse thresholding.
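
One common way to reduce false positives from a simple threshold is to require that the breach persists for several consecutive samples before alerting. The sketch below illustrates that idea; the threshold of 90 and the streak length of 3 are arbitrary example values.

```python
def sustained_breaches(samples: list[float], threshold: float,
                       min_consecutive: int) -> list[int]:
    """Return the sample indices at which an alert would fire.

    An alert fires once per episode, at the moment the value has been
    above the threshold for min_consecutive samples in a row.
    """
    fired = []
    streak = 0
    for index, value in enumerate(samples):
        if value > threshold:
            streak += 1
            if streak == min_consecutive:
                fired.append(index)
        else:
            streak = 0  # any sample at or below the threshold resets the episode
    return fired


# Utilization in percent; the short spikes at index 1 and 8 are ignored.
utilization = [50, 95, 60, 92, 93, 94, 96, 40, 91]
print(sustained_breaches(utilization, threshold=90, min_consecutive=3))  # [5]
```

The trade-off is detection delay: with a 60-second polling interval, three consecutive samples mean the alert fires roughly two to three minutes after the problem starts.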

  • Packet loss

    Proportion of packets sent over a link that never arrive; a key factor in service quality.

  • Latency (round-trip time)

    Time for a packet to reach its destination and for the reply to return; an indicator of delay.

  • Interface utilization

    Percentage utilization of a network interface over time.
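
As a rough sketch of how two of these metrics are typically derived from raw counters: utilization comes from the difference between two octet-counter readings (for example the 64-bit IF-MIB counter ifHCInOctets) divided by the link capacity over the polling interval, and packet loss from sent versus received packet counts. The function names and sample numbers below are illustrative.

```python
COUNTER64_MODULUS = 2**64


def counter_delta(previous: int, current: int,
                  modulus: int = COUNTER64_MODULUS) -> int:
    """Difference between two counter readings, tolerating a single wrap.

    Note: a counter reset (e.g. device reboot) is indistinguishable from
    a wrap here and would need separate handling.
    """
    return (current - previous) % modulus


def utilization_percent(octets_prev: int, octets_curr: int,
                        interval_seconds: float, link_speed_bps: int) -> float:
    """Average utilization over the interval, as a percentage of link speed."""
    bits = counter_delta(octets_prev, octets_curr) * 8
    return 100.0 * bits / (interval_seconds * link_speed_bps)


def packet_loss_percent(sent: int, received: int) -> float:
    """Share of sent packets that were not received, in percent."""
    if sent == 0:
        return 0.0
    return 100.0 * (sent - received) / sent


# 3.75 GB transferred in 60 s on a 1 Gbit/s link.
print(utilization_percent(1_000, 3_750_001_000, 60, 1_000_000_000))  # 50.0
print(packet_loss_percent(sent=1000, received=990))                  # 1.0
```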

ISP: end-to-end availability monitoring

An ISP monitors backbone links and peering points with active and passive checks to verify that its SLAs are met.
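
A minimal sketch of the SLA arithmetic behind such a scenario: availability is computed from the outcome of active probes, and the SLA target is translated into a downtime budget for the reporting period. The 99.9% target and the 30-day period are example values only.

```python
def availability_percent(probe_results: list[bool]) -> float:
    """Share of successful probes, in percent."""
    if not probe_results:
        raise ValueError("no probe results to evaluate")
    return 100.0 * sum(probe_results) / len(probe_results)


def allowed_downtime_minutes(sla_percent: float, period_minutes: float) -> float:
    """Downtime budget implied by an availability target over a period."""
    return period_minutes * (100.0 - sla_percent) / 100.0


# 1000 probes with one failure.
probes = [True] * 999 + [False]
measured = availability_percent(probes)

# A 99.9 % target over 30 days leaves a budget of roughly 43 minutes.
budget = allowed_downtime_minutes(99.9, period_minutes=30 * 24 * 60)

print(f"availability: {measured:.2f} %, downtime budget: {budget:.1f} min")
```

Probe-based availability only reflects the paths and intervals that are actually probed, so it is usually combined with passive telemetry from the devices themselves.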

Enterprise: security monitoring via flow analysis

An enterprise uses NetFlow/IPFIX to detect lateral movement by attackers and feeds the findings into its SIEM.
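
One simple heuristic for this kind of analysis is to count how many distinct internal hosts each internal source talks to and flag unusually high fan-out. The sketch below assumes flow records have already been decoded into (source, destination, destination port) tuples and that 10.0.0.0/8 is the internal range; both are assumptions for illustration, and production detection would compare against per-host baselines rather than a fixed limit.

```python
from collections import defaultdict
from ipaddress import ip_address, ip_network

INTERNAL = ip_network("10.0.0.0/8")  # assumed internal address range


def internal_fanout(flows) -> dict[str, int]:
    """Count distinct internal destinations contacted by each internal source."""
    peers = defaultdict(set)
    for src, dst, _dst_port in flows:
        if ip_address(src) in INTERNAL and ip_address(dst) in INTERNAL:
            peers[src].add(dst)
    return {src: len(dsts) for src, dsts in peers.items()}


def suspicious_sources(flows, max_peers: int) -> list[str]:
    """Sources whose internal fan-out exceeds max_peers, sorted for stable output."""
    return sorted(src for src, count in internal_fanout(flows).items()
                  if count > max_peers)


# Hypothetical decoded flow records: (source, destination, destination port).
flows = [
    ("10.0.0.5", "10.0.1.1", 445),
    ("10.0.0.5", "10.0.1.2", 445),
    ("10.0.0.5", "10.0.1.3", 445),
    ("10.0.0.5", "10.0.1.1", 3389),
    ("10.0.0.9", "10.0.1.1", 443),
    ("10.0.0.9", "8.8.8.8", 53),   # external destination, ignored
]
print(suspicious_sources(flows, max_peers=2))  # ['10.0.0.5']
```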

Data center: capacity planning using long-term metrics

A data center aggregates interface and switch metrics to plan hardware upgrades proactively.
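
Trend-based capacity planning can be illustrated with a least-squares line fitted to monthly peak utilization, extrapolated to the point where it crosses an upgrade threshold. This is a deliberately simple sketch: it assumes roughly linear growth, and the sample data and the 80% threshold are made up.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def months_until(threshold: float, monthly_peaks: list[float]):
    """Months from the latest sample until the trend reaches the threshold.

    Returns None when the trend is flat or falling.
    """
    xs = list(range(len(monthly_peaks)))
    slope, intercept = fit_line(xs, monthly_peaks)
    if slope <= 0:
        return None
    crossing = (threshold - intercept) / slope
    return max(0.0, crossing - xs[-1])


# Peak utilization (percent) of an uplink over five months.
peaks = [40, 45, 50, 55, 60]
print(months_until(80, peaks))  # 4.0
```

Using peaks rather than averages is a design choice: links are typically upgraded because of congestion at busy times, which averages tend to hide.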

  1. Inventory: record devices, interfaces and KPIs.
  2. Connect data sources: configure SNMP, flow, exporters.
  3. Define storage and retention; plan metric aggregations.
  4. Implement dashboards and alerts, fine-tune thresholds.
  5. Introduce regular review cycles and test playbooks.
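
The aggregation planning in the storage and retention step can be sketched as downsampling: raw samples are grouped into fixed time buckets, and only a summary per bucket is kept for long-term storage. The 5-minute bucket size and the choice of average plus maximum are assumptions for illustration.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds: int):
    """Aggregate (unix_timestamp, value) samples into fixed-size buckets.

    Returns a list of (bucket_start, average, maximum), ordered by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values), max(values))
        for start, values in sorted(buckets.items())
    ]


# One-minute samples reduced to 5-minute buckets.
raw = [(0, 10), (60, 20), (120, 30), (300, 50), (360, 70)]
print(downsample(raw, bucket_seconds=300))
# [(0, 20.0, 30), (300, 60.0, 70)]
```

Keeping the maximum alongside the average preserves short peaks that matter for capacity planning, at the cost of one extra stored value per bucket.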

⚠️ Technical debt & bottlenecks

  • Outdated exporters/agents that do not provide modern metrics.
  • Monolithic DB storage without partitioning for retention.
  • Undocumented, unstructured alert rules.

  • Telemetry pipeline throughput limits
  • Storage and retention costs
  • Alert noise and configuration overhead

  • Setting alerts solely on absolute thresholds without context.
  • Relying only on synthetic checks without real telemetry.
  • Treating monitoring as an afterthought instead of considering it in the design phase.
  • Using baseline windows that are too short, which makes alerts prone to false alarms.
  • Relying on a single metric without cross-validation with flows or logs.
  • Failing to test alert playbooks before production rollout.
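
Two of the pitfalls above, absolute thresholds without context and baseline windows that are too short, can be illustrated together. The sketch below derives a threshold from historical samples (mean plus k standard deviations) and refuses to do so when the history is too short. The defaults of k = 3 and 288 samples (one day at 5-minute intervals) are assumptions; traffic with weekly patterns would need a longer window.

```python
from statistics import mean, stdev


def baseline_threshold(history: list[float], k: float = 3.0,
                       min_samples: int = 288) -> float:
    """Alert threshold derived from history: mean + k * sample std deviation.

    Raises ValueError when the history is shorter than min_samples, so a
    threshold is never built on an unrepresentative window.
    """
    if len(history) < min_samples:
        raise ValueError(
            f"need at least {min_samples} samples, got {len(history)}"
        )
    return mean(history) + k * stdev(history)


# Hypothetical latency history in milliseconds, alternating 10 and 12.
history = [10, 12] * 200
threshold = baseline_threshold(history)
print(f"alert above {threshold:.2f} ms")
```

A mean-and-deviation baseline assumes the metric is reasonably stable; for strongly seasonal or skewed data, percentiles per time-of-day are a common alternative.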

  • Knowledge of network protocols (TCP/IP, SNMP, NetFlow)
  • Skills with monitoring and observability tools
  • Experience in alert design and incident management

  • Availability and fault tolerance
  • Scalability to handle high telemetry volumes
  • Security and integrity of monitoring data

  • Limited budget and storage resources
  • Legacy devices with limited protocols
  • Network access restrictions and firewall policies