New Series: Real-Time Network Analysis at Scale

DIY Network Analysis That Scales

Open-source methods for building production-grade network telemetry pipelines. No vendor lock-in. No black boxes. Just Kafka, OpenSearch, and Valkey.

500K
events / second ingest
5 min
sliding hot window
< 5ms
Valkey query latency
100×
pre-aggregation reduction
3
anomaly detection layers

Why network analysis at scale is hard

Enterprise and service-provider networks generate hundreds of thousands of flow records per second. Naive approaches collapse under the weight of the data before a single query can run.

🌊

Volume Problem

500,000 events/sec × 400 bytes = 200 MB/sec sustained ingest. A 5-minute window holds 150 million raw events.

150M events → 60 GB in 5 minutes
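These figures follow directly from the stated 400-byte average event size; a quick sanity check:

```java
// Back-of-envelope sizing for the hot window, using the article's
// assumptions: 500K events/sec at an average of 400 bytes each.
public class WindowSizing {
    static final long EVENTS_PER_SEC = 500_000;
    static final long BYTES_PER_EVENT = 400;
    static final long WINDOW_SECONDS = 5 * 60;

    // Sustained ingest rate in MB/sec (1 MB = 1e6 bytes here).
    static long ingestMBperSec() {
        return EVENTS_PER_SEC * BYTES_PER_EVENT / 1_000_000;
    }

    // Raw events held by a 5-minute sliding window.
    static long eventsInWindow() {
        return EVENTS_PER_SEC * WINDOW_SECONDS;
    }

    // Raw bytes in the window, in GB (1 GB = 1e9 bytes here).
    static long windowGB() {
        return eventsInWindow() * BYTES_PER_EVENT / 1_000_000_000;
    }
}
```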
💾

Memory Problem

Storing raw events in Redis/Valkey for the hot window requires ~75 GB of RAM (60 GB of raw events plus per-key overhead) and sustains 500K write operations per second.

$400+/mo just for the hot store

⚡

Latency Problem

Network ops teams need answers in milliseconds. Scanning 150M documents in OpenSearch for every dashboard refresh is not an option.

Need <5ms, not 500ms
🔍

Detection Problem

Anomalies like slow exfiltration, port scans, and novel protocols require different algorithms — no single detector catches everything.

3 layers needed for full coverage

The five-layer pipeline

The key insight: pre-aggregate 100× in Kafka Streams before touching any persistent store. One-second tumbling windows cut 500K events/sec down to 5K records/sec, which keeps every downstream store small and fast.

🌐
Network Devices
NetFlow · IPFIX · sFlow
500K events/sec
UDP
📨
Apache Kafka
8 partitions
10-min retention
200 MB/sec
consume
⚙️
Kafka Streams
1-sec tumbling window
100:1 reduction
→ 5K records/sec
Valkey
TTL 301 sec
9 MB data
<5ms queries
🔎
OpenSearch
ISM: delete 6 min
RCF anomaly
Alerting plugin

🚨 Anomaly Detection Path

Kafka Streams also runs Z-score detection inline (<1 sec). Alerts flow to a separate Kafka topic → OpenSearch alerts index → Alerting plugin → PagerDuty / Slack.

📊 Dashboard Query Path

Last 5 minutes → query Valkey directly (sub-5ms). Historical analysis and ad-hoc queries → OpenSearch (<200ms). Both served from OpenSearch Dashboards or Grafana.

Every component, open-source

No proprietary services. The entire pipeline runs on Apache-licensed or equivalent open-source software. Every component can be self-hosted on your own hardware.

Message Buffer

Apache Kafka

The backbone. Decouples network collectors from all downstream consumers and provides replay capability. KRaft mode eliminates ZooKeeper dependency.

KRaft · 8 partitions · 10-min retention · 200 MB/sec
Aggregation

Kafka Streams

Embedded in a plain Java app — no separate cluster needed. Performs stateful 1-second tumbling window aggregation per dimension key, reducing 500K→5K records/sec.

Embedded JVM · 1-sec windows · Z-score inline · Zero ops
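Stripped of the Streams machinery, the stateful reduction amounts to grouping events by (dimension, second) and keeping only per-bucket aggregates. A stdlib sketch of that core step; srcIp as the dimension key and the field names are illustrative, and the real pipeline runs this inside a Kafka Streams topology:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the 100:1 reduction: bucket raw flow events into
// 1-second tumbling windows keyed by (dimension, epochSecond) and keep
// only flow counts and byte totals per bucket.
public class TumblingAggregator {
    public record FlowEvent(String srcIp, long timestampMillis, long bytes) {}
    public record Bucket(long flows, long bytes) {}

    private final Map<String, Bucket> buckets = new HashMap<>();

    public void accept(FlowEvent e) {
        long epochSecond = e.timestampMillis() / 1000;   // tumbling window id
        String key = e.srcIp() + ":" + epochSecond;      // dimension + window
        buckets.merge(key, new Bucket(1, e.bytes()),
            (a, b) -> new Bucket(a.flows() + b.flows(), a.bytes() + b.bytes()));
    }

    public Bucket bucketFor(String srcIp, long epochSecond) {
        return buckets.get(srcIp + ":" + epochSecond);
    }
}
```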
Hot Store

Valkey

Redis-compatible, fully open-source. Stores pre-aggregated 1-second metric buckets with 301-second TTL. Keys expire automatically — no cleanup job needed.

TTL 301s · Hash per bucket · 9 MB data · <5ms query
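One plausible key scheme for those buckets, sketched below; the net:agg prefix and the one-hash-per-second layout are assumptions, not a prescribed format:

```java
// Hypothetical hot-store key scheme: one hash per (dimension, second)
// bucket, expired automatically by Valkey's TTL.
public class HotStoreKeys {
    static final int TTL_SECONDS = 301;   // 5-min window plus 1s of slack

    static String bucketKey(String srcIp, long epochSecond) {
        return "net:agg:" + srcIp + ":" + epochSecond;
    }

    // The 300 keys a dashboard reads to cover the last 5 minutes.
    static String[] windowKeys(String srcIp, long nowEpochSecond) {
        String[] keys = new String[300];
        for (int i = 0; i < 300; i++) {
            keys[i] = bucketKey(srcIp, nowEpochSecond - 299 + i);
        }
        return keys;
    }
}
```

With any Redis-compatible client, each bucket would be written with HSET followed by EXPIRE 301, so stale seconds vanish without a cleanup job.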
Analytics

OpenSearch

Receives aggregated data via the built-in ingest-kafka plugin. Provides ISM auto-deletion, pipeline aggregations for pattern detection, and the RCF anomaly detection plugin.

ingest-kafka · ISM 6-min · Pipeline aggs · Search pipeline
ML Detection

Random Cut Forest

OpenSearch anomaly detection plugin. Unsupervised ML — no labelled data needed. Trains a separate model per src_ip. Detects novel patterns no rule would catch.

Per-entity model · Multi-variate · Anomaly grade 0-1 · Attribution
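For reference, a per-entity detector definition for the anomaly detection plugin looks roughly like this; index, field, and interval values are illustrative, so consult the plugin documentation for the full schema:

```json
POST _plugins/_anomaly_detection/detectors
{
  "name": "per-source-traffic",
  "time_field": "@timestamp",
  "indices": ["network-metrics-*"],
  "category_field": ["src_ip"],
  "feature_attributes": [
    {
      "feature_name": "bytes_sum",
      "feature_enabled": true,
      "aggregation_query": { "bytes_sum": { "sum": { "field": "bytes" } } }
    }
  ],
  "detection_interval": { "period": { "interval": 1, "unit": "Minutes" } }
}
```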
Collector

pmacct / Rust Collector

Receives raw NetFlow/IPFIX/sFlow datagrams at line rate. Decodes protocol, serialises to JSON/Avro, and produces to Kafka. Replace with any UDP collector.

NetFlow v9 · IPFIX · sFlow · syslog
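The first decoding step is fixed and simple. A stdlib sketch of parsing the 20-byte NetFlow v9 packet header (RFC 3954) from a raw datagram; template and flow-record parsing, which follow the header, are omitted:

```java
import java.nio.ByteBuffer;

// Decode the 20-byte NetFlow v9 packet header from a raw UDP datagram.
// All fields are big-endian (network byte order), which is ByteBuffer's
// default ordering.
public class NetflowV9Header {
    public final int version;      // always 9 for NetFlow v9
    public final int count;        // number of FlowSets in this packet
    public final long sysUptime;   // ms since device boot
    public final long unixSecs;    // export timestamp (epoch seconds)
    public final long sequence;    // packet sequence number
    public final long sourceId;    // observation domain id

    public NetflowV9Header(byte[] datagram) {
        ByteBuffer buf = ByteBuffer.wrap(datagram);
        version   = Short.toUnsignedInt(buf.getShort());
        count     = Short.toUnsignedInt(buf.getShort());
        sysUptime = Integer.toUnsignedLong(buf.getInt());
        unixSecs  = Integer.toUnsignedLong(buf.getInt());
        sequence  = Integer.toUnsignedLong(buf.getInt());
        sourceId  = Integer.toUnsignedLong(buf.getInt());
    }
}
```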

What's actually inside OpenSearch?

A common question: is Apache Flink built into OpenSearch? No — and you don't need it. Here's exactly what OpenSearch provides natively versus what lives outside.

opensearch-capabilities.yaml
# Built-in to OpenSearch core (plugins/ and modules/)
native:
  ingestion:
    - ingest-kafka          # plugins/ingest-kafka
    - ingest-kinesis        # plugins/ingest-kinesis
    - ingest-common         # modules/ingest-common
  aggregations:
    - extended_stats        # server/ core
    - moving_avg            # Holt-Winters seasonal model
    - serial_diff           # sudden change detection
    - derivative            # rate-of-change
    - bucket_selector       # filter anomalous buckets
  lifecycle:
    - ISM                   # auto-delete at 6 minutes
    - search-pipeline       # enforce 5-min window server-side

# Separate plugins (opensearch-project org, not core)
plugins:
  - anomaly-detection     # Random Cut Forest (RCF)
  - alerting              # scheduled queries + notifications

# External — completely separate projects
external:
  flink:   NOT part of OpenSearch  # not needed — use Kafka Streams
  kafka:   NOT part of OpenSearch  # Apache project
  valkey:  NOT part of OpenSearch  # separate open-source project

Three layers, three latency tiers

No single algorithm catches every class of network anomaly. Three complementary layers operate independently — the failure of one doesn't affect the others.

Layer 1 — Kafka Streams

Statistical Z-Score

< 1 second
  • Welford online mean + variance per metric key
  • No historical data stored — pure streaming
  • Flags samples more than 3.5 standard deviations from the running mean (|z| > 3.5)
  • Catches: DDoS, port scans, link failures
  • Adapts to baseline drift via EWMA
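The Layer-1 detector can be sketched with nothing but the stdlib: Welford's online algorithm keeps a running mean and variance in O(1) memory per key, and each new sample is scored before being folded into the baseline. The 3.5 threshold matches the article; the 30-sample warm-up is an illustrative choice.

```java
// Online z-score via Welford's algorithm: O(1) state per metric key,
// no history stored.
public class ZScoreDetector {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;      // running sum of squared deviations

    /** Returns true if the sample deviates > 3.5 sigma from the baseline. */
    public boolean observe(double x) {
        boolean anomalous = false;
        if (n >= 30) {            // warm up before alerting
            double stdDev = Math.sqrt(m2 / (n - 1));
            anomalous = stdDev > 0 && Math.abs(x - mean) / stdDev > 3.5;
        }
        // Welford update: fold the sample into mean and variance.
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return anomalous;
    }
}
```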
Layer 2 — OpenSearch Aggs

Pattern Aggregations

1 – 2 minutes
  • Holt-Winters seasonal decomposition
  • Detects time-of-day / day-of-week anomalies
  • extended_stats_bucket + bucket_selector
  • Catches: 3am traffic, correlated multi-host events
  • No ML plugin required — built into core
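For illustration, roughly what such a query looks like against the aggregated index; field names and the fixed threshold are placeholder values. The date_histogram builds per-minute buckets, bucket_selector keeps only buckets over the threshold, and extended_stats_bucket summarises the series:

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "bytes_sum": { "sum": { "field": "bytes" } },
        "spikes_only": {
          "bucket_selector": {
            "buckets_path": { "total": "bytes_sum" },
            "script": "params.total > 1000000000"
          }
        }
      }
    },
    "bytes_stats": {
      "extended_stats_bucket": { "buckets_path": "per_minute>bytes_sum" }
    }
  }
}
```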
Layer 3 — RCF Plugin

Random Cut Forest

2 – 5 minutes
  • Unsupervised ML — zero labelled training data
  • Separate model per src_ip (category_field)
  • Multi-variate: bytes + flows + latency together
  • Catches: slow exfiltration, novel protocols, drift
  • relevant_attribution explains which feature fired

Match your server to your event rate

The disk type matters more than you think. An HDD caps the entire pipeline at ~150K events/sec regardless of CPU or RAM.

Event Rate          CPU            RAM        Disk            NIC           Status          Bottleneck
Up to 80K/sec       8 cores        32 GB      1 TB HDD        1 GbE         ✓ Fits          Disk I/O
Up to 250K/sec      8 cores        32 GB      1 TB NVMe       10 GbE        ✓ Comfortable   CPU (8 cores)
Up to 400K/sec      16 cores       64 GB      2 TB NVMe       10 GbE        ~ Tight         CPU at peak
500K/sec (target)   32 cores       96 GB      2 × 1 TB NVMe   25 GbE        ✓ Recommended   None at 70%
500K/sec + HA       3 × 16 cores   3 × 64 GB  3 × 2 TB NVMe   10 GbE each   ✓ Production    Fully redundant

⚠ An 8-core / 32 GB server cannot sustain 500K events/sec; its comfortable ceiling is ~250K/sec. Watch the NIC as well: even with NVMe disks, a 1 GbE link caps ingest at ~312K events/sec at 400 bytes per event.
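The NIC ceiling is straightforward to derive; this is line rate only, and protocol overhead lowers it further:

```java
// A NIC's event ceiling: link speed in bits/sec, divided by 8 for
// bytes, divided by the average event size.
public class NicCeiling {
    static long maxEventsPerSec(long nicGigabits, long bytesPerEvent) {
        long bytesPerSec = nicGigabits * 1_000_000_000L / 8;  // 1 GbE = 125 MB/sec
        return bytesPerSec / bytesPerEvent;
    }
}
```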

Coming soon — DIY guides

Step-by-step tutorials for building every component of this pipeline yourself, from a single server PoC to a production three-node cluster.

📨
Guide 8 min read

Setting Up Kafka in KRaft Mode for Network Telemetry

Configure a single-broker Kafka cluster optimised for 200 MB/sec sustained write throughput — no ZooKeeper required.

kafka  networking  DIY
⚙️
Tutorial 12 min read

100× Aggregation: Kafka Streams for Network Flow Data

Write a Kafka Streams application that reduces 500K raw flow events per second to 5K aggregated metrics using 1-second tumbling windows.

kafka-streams  aggregation
Deep Dive 15 min read

Three-Layer Anomaly Detection on Network Traffic

Z-score inline detection, Holt-Winters seasonal aggregations in OpenSearch, and Random Cut Forest ML — how they work together and what each one catches.

anomaly-detection  ML

Start building your pipeline today

All guides use open-source tools only. No cloud account required for the PoC.