Open-source methods for building production-grade network telemetry pipelines. No vendor lock-in. No black boxes. Just Kafka, OpenSearch, and Valkey.
Enterprise and service-provider networks generate hundreds of thousands of flow records per second. Naive approaches collapse under the weight of the data before a single query can run.
500,000 events/sec × 400 bytes = 200 MB/sec sustained ingest. A 5-minute window holds 150 million raw events.
Storing raw events in Redis/Valkey for the hot window requires 75 GB RAM (150M events × 400 bytes ≈ 60 GB of payload, plus per-key overhead) and 500K write operations per second.
Network ops teams need answers in milliseconds. Scanning 150M documents in OpenSearch for every dashboard refresh is not an option.
Anomalies like slow exfiltration, port scans, and novel protocols require different algorithms — no single detector catches everything.
The key insight: pre-aggregate 100× in Kafka Streams before touching any persistent store. This reduces 500K events/sec to 5K records/sec — making everything downstream trivial.
Kafka Streams also runs Z-score detection inline (<1 sec). Alerts flow to a separate Kafka topic → OpenSearch alerts index → Alerting plugin → PagerDuty / Slack.
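A minimal sketch of what that inline check can look like, assuming one detector instance per dimension key (in a real topology this state would live in a Kafka Streams state store); the 60-sample window and 3-sigma threshold are illustrative, not values from this pipeline:

```java
import java.util.ArrayDeque;

// Rolling Z-score detector over 1-second buckets. Window size and
// threshold are illustrative assumptions.
public final class ZScoreDetector {
    private final int window;          // number of 1-second buckets to keep
    private final double threshold;    // how many sigmas count as anomalous
    private final ArrayDeque<Double> samples = new ArrayDeque<>();

    public ZScoreDetector(int window, double threshold) {
        this.window = window;
        this.threshold = threshold;
    }

    /** Returns true if value deviates more than threshold sigmas from the rolling mean. */
    public boolean isAnomalous(double value) {
        boolean anomalous = false;
        if (samples.size() == window) {
            double mean = samples.stream().mapToDouble(Double::doubleValue).average().orElse(0);
            double var = samples.stream()
                    .mapToDouble(v -> (v - mean) * (v - mean)).sum() / window;
            double std = Math.sqrt(var);
            anomalous = std > 0 && Math.abs(value - mean) / std > threshold;
            samples.removeFirst();     // slide the window forward
        }
        samples.addLast(value);
        return anomalous;
    }
}
```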
Last 5 minutes → query Valkey directly (sub-5ms). Historical analysis and ad-hoc queries → OpenSearch (<200ms). Both served from OpenSearch Dashboards or Grafana.
No proprietary services. The entire pipeline runs on Apache-licensed or equivalent open-source software. Every component can be self-hosted on your own hardware.
The backbone. Decouples network collectors from all downstream consumers and provides replay capability. KRaft mode eliminates ZooKeeper dependency.
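For illustration, a sketch of provisioning the raw-flow topic with Kafka's AdminClient; the topic name flows.raw, the partition count, and the 10-minute retention are assumptions, not prescriptions:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

// Creates a raw-flow topic sized for the 200 MB/sec firehose.
public final class CreateFlowTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            NewTopic raw = new NewTopic("flows.raw", 32, (short) 1)
                    .configs(Map.of(
                            "retention.ms", "600000",    // 10 min: Kafka is a buffer, not the archive
                            "compression.type", "lz4")); // cheap CPU, big wire savings
            admin.createTopics(List.of(raw)).all().get();
        }
    }
}
```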
Embedded in a plain Java app — no separate cluster needed. Performs stateful 1-second tumbling window aggregation per dimension key, reducing 500K→5K records/sec.
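A minimal sketch of that topology, assuming records arrive keyed by a dimension string (e.g. src_ip|dst_port) with the flow's byte count as the value; real records would be JSON/Avro and need a decode step first:

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.WindowedSerdes;

// 1-second tumbling windows per dimension key: the 500K -> 5K reduction.
// Topic names and the Long byte-count value are illustrative assumptions.
public final class FlowAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "flow-aggregator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("flows.raw", Consumed.with(Serdes.String(), Serdes.Long()))
               .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
               .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofSeconds(1)))
               .reduce(Long::sum, Materialized.with(Serdes.String(), Serdes.Long()))
               .toStream()
               .to("flows.agg.1s", Produced.with(
                       WindowedSerdes.timeWindowedSerdeFrom(String.class, 1000L),
                       Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}
```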
Redis-compatible, fully open-source. Stores pre-aggregated 1-second metric buckets with 301-second TTL. Keys expire automatically — no cleanup job needed.
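A sketch of the hot-window access pattern using the Jedis client (Valkey speaks the Redis protocol); the agg:{dimension}:{epochSecond} key layout is an assumption:

```java
import java.time.Instant;
import redis.clients.jedis.JedisPooled;

// Hot-window store. The 301-second TTL gives the 300-second query
// window one second of slack before keys expire.
public final class HotWindowStore {
    private final JedisPooled valkey = new JedisPooled("localhost", 6379);

    /** Called once per aggregated 1-second bucket coming out of Kafka Streams. */
    public void writeBucket(String dimension, long epochSecond, long bytes) {
        valkey.setex("agg:" + dimension + ":" + epochSecond, 301, Long.toString(bytes));
    }

    /** Dashboard path: fetch the last 5 minutes with a single MGET. */
    public String[] lastFiveMinutes(String dimension) {
        long now = Instant.now().getEpochSecond();
        String[] keys = new String[300];
        for (int i = 0; i < 300; i++) {
            keys[i] = "agg:" + dimension + ":" + (now - 299 + i);
        }
        return valkey.mget(keys).toArray(new String[0]);
    }
}
```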
Receives aggregated data via the built-in ingest-kafka plugin. Provides ISM (Index State Management) auto-deletion, pipeline aggregations for pattern detection, and the RCF anomaly detection plugin.
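For example, a Holt-Winters seasonal baseline can be queried with a moving_avg pipeline aggregation. The sketch below uses the low-level REST client; the index name, field names, and model settings are illustrative and should be tuned against the moving_avg docs:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

// Queries a Holt-Winters forecast over per-minute traffic buckets.
public final class SeasonalBaseline {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            Request search = new Request("GET", "/flows-agg/_search");
            search.setJsonEntity("""
                {
                  "size": 0,
                  "aggs": {
                    "per_minute": {
                      "date_histogram": { "field": "@timestamp", "fixed_interval": "1m" },
                      "aggs": {
                        "traffic": { "sum": { "field": "bytes" } },
                        "forecast": {
                          "moving_avg": {
                            "buckets_path": "traffic",
                            "model": "holt_winters",
                            "window": 60,
                            "settings": { "period": 10, "type": "add" }
                          }
                        }
                      }
                    }
                  }
                }""");
            Response response = client.performRequest(search);
            System.out.println(response.getStatusLine());
        }
    }
}
```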
OpenSearch anomaly detection plugin. Unsupervised ML — no labelled data needed. Trains a separate model per src_ip. Detects novel patterns no rule would catch.
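Registering a high-cardinality detector is a single API call. In this sketch the detector name, feature, and interval are illustrative; category_field is what gives one model per src_ip:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

// Registers an RCF detector via the anomaly-detection plugin API.
public final class RegisterDetector {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(
                new HttpHost("localhost", 9200, "http")).build()) {
            Request create = new Request("POST", "/_plugins/_anomaly_detection/detectors");
            create.setJsonEntity("""
                {
                  "name": "per-source-traffic",
                  "time_field": "@timestamp",
                  "indices": ["flows-agg"],
                  "category_field": ["src_ip"],
                  "feature_attributes": [{
                    "feature_name": "bytes_sum",
                    "feature_enabled": true,
                    "aggregation_query": { "bytes_sum": { "sum": { "field": "bytes" } } }
                  }],
                  "detection_interval": { "period": { "interval": 1, "unit": "Minutes" } }
                }""");
            System.out.println(client.performRequest(create).getStatusLine());
        }
    }
}
```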
Receives raw NetFlow/IPFIX/sFlow datagrams at line rate. Decodes protocol, serialises to JSON/Avro, and produces to Kafka. Replace with any UDP collector.
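The hot path of such a collector is small. This sketch shows the receive-and-produce shape only; the actual NetFlow/IPFIX template decoding is elided, and the port and topic name are assumptions:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.util.Arrays;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.ByteArraySerializer;

// Receives flow datagrams on UDP and forwards the payload to Kafka.
public final class FlowCollector {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", ByteArraySerializer.class.getName());
        props.put("value.serializer", ByteArraySerializer.class.getName());

        try (KafkaProducer<byte[], byte[]> producer = new KafkaProducer<>(props);
             DatagramSocket socket = new DatagramSocket(2055)) {   // common NetFlow port
            byte[] buf = new byte[65535];
            DatagramPacket packet = new DatagramPacket(buf, buf.length);
            while (true) {
                socket.receive(packet);
                byte[] payload = Arrays.copyOf(packet.getData(), packet.getLength());
                // decode-to-JSON/Avro step would go here before producing
                producer.send(new ProducerRecord<>("flows.raw", payload));
            }
        }
    }
}
```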
A common question: is Apache Flink built into OpenSearch? No — and you don't need it. Here's exactly what OpenSearch provides natively versus what lives outside.
```yaml
# Built-in to OpenSearch core (plugins/ and modules/)
native:
  ingestion:
    - ingest-kafka      # plugins/ingest-kafka
    - ingest-kinesis    # plugins/ingest-kinesis
    - ingest-common     # modules/ingest-common
  aggregations:
    - extended_stats    # server/ core
    - moving_avg        # Holt-Winters seasonal model
    - serial_diff       # sudden change detection
    - derivative        # rate-of-change
    - bucket_selector   # filter anomalous buckets
  lifecycle:
    - ISM               # Index State Management: auto-delete at 6 minutes
    - search-pipeline   # enforce 5-min window server-side

# Separate plugins (opensearch-project org, not core)
plugins:
  - anomaly-detection   # Random Cut Forest (RCF)
  - alerting            # scheduled queries + notifications

# External — completely separate projects
external:
  flink: NOT part of OpenSearch   # not needed — use Kafka Streams
  kafka: NOT part of OpenSearch   # Apache project
  valkey: NOT part of OpenSearch  # separate open-source project
```
No single algorithm catches every class of network anomaly. Three complementary layers operate independently — the failure of one doesn't affect the others.
The disk type matters more than you think. An HDD caps the entire pipeline at ~150K events/sec regardless of CPU or RAM.
| Event Rate | CPU | RAM | Disk | NIC | Status | Bottleneck |
|---|---|---|---|---|---|---|
| Up to 80K/sec | 8 cores | 32 GB | 1TB HDD | 1GbE | ✓ Fits | Disk I/O |
| Up to 250K/sec | 8 cores | 32 GB | 1TB NVMe | 10GbE | ✓ Comfortable | CPU (8 cores) |
| Up to 400K/sec | 16 cores | 64 GB | 2TB NVMe | 10GbE | ~ Tight | CPU at peak |
| 500K/sec (target) | 32 cores | 96 GB | 2×1TB NVMe | 25GbE | ✓ Recommended | None at 70% |
| 500K/sec + HA | 3 × 16 core | 3 × 64 GB | 3 × 2TB NVMe | 10GbE each | ✓ Production | Fully redundant |
⚠ An 8-core / 32 GB server cannot sustain 500K events/sec. On a 1GbE NIC the wire itself is the first bottleneck: 125 MB/sec ÷ 400 bytes ≈ 312K events/sec, well short of the 200 MB/sec target. Even with 10GbE, eight cores top out at a comfortable ~250K events/sec.
Step-by-step tutorials for building every component of this pipeline yourself, from a single server PoC to a production three-node cluster.
Configure a single-broker Kafka cluster optimised for 200 MB/sec sustained write throughput — no ZooKeeper required.
kafka networking DIY

Write a Kafka Streams application that reduces 500K raw flow events per second to 5K aggregated metrics using 1-second tumbling windows.

kafka-streams aggregation

Z-score inline detection, Holt-Winters seasonal aggregations in OpenSearch, and Random Cut Forest ML — how they work together and what each one catches.

anomaly-detection ML