New Series: Real-Time Network Analysis at Scale

DIY Network Analysis That Scales

Open-source methods for building production-grade network telemetry pipelines. No vendor lock-in. No black boxes. Just Kafka, OpenSearch, and Valkey.

500K
events / second ingest
5 min
sliding hot window
< 5ms
Valkey query latency
100×
pre-aggregation reduction
3
anomaly detection layers

Why network analysis at scale is hard

Enterprise and service-provider networks generate hundreds of thousands of flow records per second. Naive approaches collapse under the weight of the data before a single query can run.

🌊

Volume Problem

500,000 events/sec × 400 bytes = 200 MB/sec sustained ingest. A 5-minute window holds 150 million raw events.

150M events → 60 GB in 5 minutes
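These figures follow directly from the stated 400-byte average event size; a quick sanity check:

```java
// Back-of-envelope sizing for the hot window, using the article's
// assumptions: 500K events/sec at an average of 400 bytes each.
public class WindowSizing {
    static final long EVENTS_PER_SEC = 500_000;
    static final long BYTES_PER_EVENT = 400;
    static final long WINDOW_SECONDS = 5 * 60;

    // Sustained ingest rate in MB/sec (1 MB = 1e6 bytes here).
    static long ingestMBperSec() {
        return EVENTS_PER_SEC * BYTES_PER_EVENT / 1_000_000;
    }

    // Raw events held by a 5-minute sliding window.
    static long eventsInWindow() {
        return EVENTS_PER_SEC * WINDOW_SECONDS;
    }

    // Raw bytes in the window, in GB (1 GB = 1e9 bytes here).
    static long windowGB() {
        return eventsInWindow() * BYTES_PER_EVENT / 1_000_000_000;
    }
}
```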
💾

Memory Problem

Storing raw events in Redis/Valkey for the hot window requires ~75 GB of RAM (60 GB of raw events plus per-key overhead) and sustains 500K write operations per second.

$400+/mo just for the hot store

⚡

Latency Problem

Network ops teams need answers in milliseconds. Scanning 150M documents in OpenSearch for every dashboard refresh is not an option.

Need <5ms, not 500ms
🔍

Detection Problem

Anomalies like slow exfiltration, port scans, and novel protocols require different algorithms — no single detector catches everything.

3 layers needed for full coverage

The five-layer pipeline

The key insight: pre-aggregate 100× in Kafka Streams before touching any persistent store. One-second tumbling windows cut 500K events/sec down to 5K records/sec, which keeps every downstream store small and fast.

🌐
Network Devices
NetFlow · IPFIX · sFlow
500K events/sec
UDP
📨
Apache Kafka
8 partitions
10-min retention
200 MB/sec
consume
⚙️
Kafka Streams
1-sec tumbling window
100:1 reduction
→ 5K records/sec
Valkey
TTL 301 sec
9 MB data
<5ms queries
🔎
OpenSearch
ISM: delete 6 min
RCF anomaly
Alerting plugin

🚨 Anomaly Detection Path

Kafka Streams also runs Z-score detection inline (<1 sec). Alerts flow to a separate Kafka topic → OpenSearch alerts index → Alerting plugin → PagerDuty / Slack.

📊 Dashboard Query Path

Last 5 minutes → query Valkey directly (sub-5ms). Historical analysis and ad-hoc queries → OpenSearch (<200ms). Both served from OpenSearch Dashboards or Grafana.

Every component, open-source

No proprietary services. The entire pipeline runs on Apache-licensed or equivalent open-source software. Every component can be self-hosted on your own hardware.

Message Buffer

Apache Kafka

The backbone. Decouples network collectors from all downstream consumers and provides replay capability. KRaft mode eliminates ZooKeeper dependency.

KRaft · 8 partitions · 10-min retention · 200 MB/sec
Aggregation

Kafka Streams

Embedded in a plain Java app — no separate cluster needed. Performs stateful 1-second tumbling window aggregation per dimension key, reducing 500K→5K records/sec.

Embedded JVM · 1-sec windows · Z-score inline · Zero ops
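Stripped of the Streams machinery, the stateful reduction amounts to grouping events by (dimension, second) and keeping only per-bucket aggregates. A stdlib sketch of that core step; srcIp as the dimension key and the field names are illustrative, and the real pipeline runs this inside a Kafka Streams topology:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the 100:1 reduction: bucket raw flow events into
// 1-second tumbling windows keyed by (dimension, epochSecond) and keep
// only flow counts and byte totals per bucket.
public class TumblingAggregator {
    public record FlowEvent(String srcIp, long timestampMillis, long bytes) {}
    public record Bucket(long flows, long bytes) {}

    private final Map<String, Bucket> buckets = new HashMap<>();

    public void accept(FlowEvent e) {
        long epochSecond = e.timestampMillis() / 1000;   // tumbling window id
        String key = e.srcIp() + ":" + epochSecond;      // dimension + window
        buckets.merge(key, new Bucket(1, e.bytes()),
            (a, b) -> new Bucket(a.flows() + b.flows(), a.bytes() + b.bytes()));
    }

    public Bucket bucketFor(String srcIp, long epochSecond) {
        return buckets.get(srcIp + ":" + epochSecond);
    }
}
```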
Hot Store

Valkey

Redis-compatible, fully open-source. Stores pre-aggregated 1-second metric buckets with 301-second TTL. Keys expire automatically — no cleanup job needed.

TTL 301s · Hash per bucket · 9 MB data · <5ms query
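One plausible key scheme for those buckets, sketched below; the net:agg prefix and the one-hash-per-second layout are assumptions, not a prescribed format:

```java
// Hypothetical hot-store key scheme: one hash per (dimension, second)
// bucket, expired automatically by Valkey's TTL.
public class HotStoreKeys {
    static final int TTL_SECONDS = 301;   // 5-min window plus 1s of slack

    static String bucketKey(String srcIp, long epochSecond) {
        return "net:agg:" + srcIp + ":" + epochSecond;
    }

    // The 300 keys a dashboard reads to cover the last 5 minutes.
    static String[] windowKeys(String srcIp, long nowEpochSecond) {
        String[] keys = new String[300];
        for (int i = 0; i < 300; i++) {
            keys[i] = bucketKey(srcIp, nowEpochSecond - 299 + i);
        }
        return keys;
    }
}
```

With any Redis-compatible client, each bucket would be written with HSET followed by EXPIRE 301, so stale seconds vanish without a cleanup job.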
Analytics

OpenSearch

Receives aggregated data via the built-in ingest-kafka plugin. Provides ISM auto-deletion, pipeline aggregations for pattern detection, and the RCF anomaly detection plugin.

ingest-kafka · ISM 6-min · Pipeline aggs · Search pipeline
ML Detection

Random Cut Forest

OpenSearch anomaly detection plugin. Unsupervised ML — no labelled data needed. Trains a separate model per src_ip. Detects novel patterns no rule would catch.

Per-entity model · Multi-variate · Anomaly grade 0-1 · Attribution
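For reference, a per-entity detector definition for the anomaly detection plugin looks roughly like this; index, field, and interval values are illustrative, so consult the plugin documentation for the full schema:

```json
POST _plugins/_anomaly_detection/detectors
{
  "name": "per-source-traffic",
  "time_field": "@timestamp",
  "indices": ["network-metrics-*"],
  "category_field": ["src_ip"],
  "feature_attributes": [
    {
      "feature_name": "bytes_sum",
      "feature_enabled": true,
      "aggregation_query": { "bytes_sum": { "sum": { "field": "bytes" } } }
    }
  ],
  "detection_interval": { "period": { "interval": 1, "unit": "Minutes" } }
}
```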
Collector

pmacct / Rust Collector

Receives raw NetFlow/IPFIX/sFlow datagrams at line rate. Decodes protocol, serialises to JSON/Avro, and produces to Kafka. Replace with any UDP collector.

NetFlow v9 · IPFIX · sFlow · syslog
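The first decoding step is fixed and simple. A stdlib sketch of parsing the 20-byte NetFlow v9 packet header (RFC 3954) from a raw datagram; template and flow-record parsing, which follow the header, are omitted:

```java
import java.nio.ByteBuffer;

// Decode the 20-byte NetFlow v9 packet header from a raw UDP datagram.
// All fields are big-endian (network byte order), which is ByteBuffer's
// default ordering.
public class NetflowV9Header {
    public final int version;      // always 9 for NetFlow v9
    public final int count;        // number of FlowSets in this packet
    public final long sysUptime;   // ms since device boot
    public final long unixSecs;    // export timestamp (epoch seconds)
    public final long sequence;    // packet sequence number
    public final long sourceId;    // observation domain id

    public NetflowV9Header(byte[] datagram) {
        ByteBuffer buf = ByteBuffer.wrap(datagram);
        version   = Short.toUnsignedInt(buf.getShort());
        count     = Short.toUnsignedInt(buf.getShort());
        sysUptime = Integer.toUnsignedLong(buf.getInt());
        unixSecs  = Integer.toUnsignedLong(buf.getInt());
        sequence  = Integer.toUnsignedLong(buf.getInt());
        sourceId  = Integer.toUnsignedLong(buf.getInt());
    }
}
```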

What's actually inside OpenSearch?

A common question: is Apache Flink built into OpenSearch? No — and you don't need it. Here's exactly what OpenSearch provides natively versus what lives outside.

opensearch-capabilities.yaml
# Built-in to OpenSearch core (plugins/ and modules/)
native:
  ingestion:
    - ingest-kafka          # plugins/ingest-kafka
    - ingest-kinesis        # plugins/ingest-kinesis
    - ingest-common         # modules/ingest-common
  aggregations:
    - extended_stats        # server/ core
    - moving_avg            # Holt-Winters seasonal model
    - serial_diff           # sudden change detection
    - derivative            # rate-of-change
    - bucket_selector       # filter anomalous buckets
  lifecycle:
    - ISM                   # auto-delete at 6 minutes
    - search-pipeline       # enforce 5-min window server-side

# Separate plugins (opensearch-project org, not core)
plugins:
  - anomaly-detection     # Random Cut Forest (RCF)
  - alerting              # scheduled queries + notifications

# External — completely separate projects
external:
  flink:   NOT part of OpenSearch  # not needed — use Kafka Streams
  kafka:   NOT part of OpenSearch  # Apache project
  valkey:  NOT part of OpenSearch  # separate open-source project

Three layers, three latency tiers

No single algorithm catches every class of network anomaly. Three complementary layers operate independently — the failure of one doesn't affect the others.

Layer 1 — Kafka Streams

Statistical Z-Score

< 1 second
  • Welford online mean + variance per metric key
  • No historical data stored — pure streaming
  • Flags samples more than 3.5 standard deviations from the running mean (|z| > 3.5)
  • Catches: DDoS, port scans, link failures
  • Adapts to baseline drift via EWMA
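The Layer-1 detector can be sketched with nothing but the stdlib: Welford's online algorithm keeps a running mean and variance in O(1) memory per key, and each new sample is scored before being folded into the baseline. The 3.5 threshold matches the article; the 30-sample warm-up is an illustrative choice.

```java
// Online z-score via Welford's algorithm: O(1) state per metric key,
// no history stored.
public class ZScoreDetector {
    private long n = 0;
    private double mean = 0.0;
    private double m2 = 0.0;      // running sum of squared deviations

    /** Returns true if the sample deviates > 3.5 sigma from the baseline. */
    public boolean observe(double x) {
        boolean anomalous = false;
        if (n >= 30) {            // warm up before alerting
            double stdDev = Math.sqrt(m2 / (n - 1));
            anomalous = stdDev > 0 && Math.abs(x - mean) / stdDev > 3.5;
        }
        // Welford update: fold the sample into mean and variance.
        n++;
        double delta = x - mean;
        mean += delta / n;
        m2 += delta * (x - mean);
        return anomalous;
    }
}
```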
Layer 2 — OpenSearch Aggs

Pattern Aggregations

1 – 2 minutes
  • Holt-Winters seasonal decomposition
  • Detects time-of-day / day-of-week anomalies
  • extended_stats_bucket + bucket_selector
  • Catches: 3am traffic, correlated multi-host events
  • No ML plugin required — built into core
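For illustration, roughly what such a query looks like against the aggregated index; field names and the fixed threshold are placeholder values. The date_histogram builds per-minute buckets, bucket_selector keeps only buckets over the threshold, and extended_stats_bucket summarises the series:

```json
{
  "size": 0,
  "aggs": {
    "per_minute": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
      "aggs": {
        "bytes_sum": { "sum": { "field": "bytes" } },
        "spikes_only": {
          "bucket_selector": {
            "buckets_path": { "total": "bytes_sum" },
            "script": "params.total > 1000000000"
          }
        }
      }
    },
    "bytes_stats": {
      "extended_stats_bucket": { "buckets_path": "per_minute>bytes_sum" }
    }
  }
}
```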
Layer 3 — RCF Plugin

Random Cut Forest

2 – 5 minutes
  • Unsupervised ML — zero labelled training data
  • Separate model per src_ip (category_field)
  • Multi-variate: bytes + flows + latency together
  • Catches: slow exfiltration, novel protocols, drift
  • relevant_attribution explains which feature fired

Match your server to your event rate

The disk type matters more than you think. An HDD caps the entire pipeline at ~150K events/sec regardless of CPU or RAM.

Event Rate          CPU            RAM        Disk            NIC           Status          Bottleneck
Up to 80K/sec       8 cores        32 GB      1 TB HDD        1 GbE         ✓ Fits          Disk I/O
Up to 250K/sec      8 cores        32 GB      1 TB NVMe       10 GbE        ✓ Comfortable   CPU (8 cores)
Up to 400K/sec      16 cores       64 GB      2 TB NVMe       10 GbE        ~ Tight         CPU at peak
500K/sec (target)   32 cores       96 GB      2 × 1 TB NVMe   25 GbE        ✓ Recommended   None at 70%
500K/sec + HA       3 × 16 cores   3 × 64 GB  3 × 2 TB NVMe   10 GbE each   ✓ Production    Fully redundant

⚠ An 8-core / 32 GB server cannot sustain 500K events/sec; its comfortable ceiling is ~250K/sec. Watch the NIC as well: even with NVMe disks, a 1 GbE link caps ingest at ~312K events/sec at 400 bytes per event.
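The NIC ceiling is straightforward to derive; this is line rate only, and protocol overhead lowers it further:

```java
// A NIC's event ceiling: link speed in bits/sec, divided by 8 for
// bytes, divided by the average event size.
public class NicCeiling {
    static long maxEventsPerSec(long nicGigabits, long bytesPerEvent) {
        long bytesPerSec = nicGigabits * 1_000_000_000L / 8;  // 1 GbE = 125 MB/sec
        return bytesPerSec / bytesPerEvent;
    }
}
```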

Coming soon — DIY guides

Step-by-step tutorials for building every component of this pipeline yourself, from a single server PoC to a production three-node cluster.

📨
Guide 8 min read

Setting Up Kafka in KRaft Mode for Network Telemetry

Configure a single-broker Kafka cluster optimised for 200 MB/sec sustained write throughput — no ZooKeeper required.

kafka  networking  DIY
⚙️
Tutorial 12 min read

100× Aggregation: Kafka Streams for Network Flow Data

Write a Kafka Streams application that reduces 500K raw flow events per second to 5K aggregated metrics using 1-second tumbling windows.

kafka-streams  aggregation
Deep Dive 15 min read

Three-Layer Anomaly Detection on Network Traffic

Z-score inline detection, Holt-Winters seasonal aggregations in OpenSearch, and Random Cut Forest ML — how they work together and what each one catches.

anomaly-detection  ML

Start building your pipeline today

All guides use open-source tools only. No cloud account required for the PoC.