Data Engineering · 04 of 05

Scale that just runs. Streams that just stream.

Distributed compute and event-driven pipelines that survive the day — Black Friday, gameday, the surprise migration. We design for throughput, backpressure and recovery first; the dashboards take care of themselves.

140k/s Peak ingest · top client
p99 380ms End-to-end on stream
−47% Compute cost · same SLA

What you get

Pipelines that don't flinch at 100k/sec.

Six muscles for high-throughput, low-latency data — engineered for the day the dashboard goes red.

Stream processing

Exactly-once Spark Streaming / Flink jobs with checkpointing, watermarks and stateful windows. Late data handled — not dropped.

exactly-once · watermark
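To make "late data handled, not dropped" concrete, here is a minimal pure-Python sketch of event-time tumbling windows with a watermark. It is not the Flink or Spark API. The window size, the lateness bound and the names `process` and `window_start` are illustrative assumptions, and the watermark simply trails the highest event time seen so far.

```python
from collections import defaultdict

WINDOW_MS = 10_000           # tumbling window size (assumed value)
ALLOWED_LATENESS_MS = 5_000  # how far the watermark trails the max event time (assumed value)


def window_start(event_time_ms):
    """Align an event timestamp to the start of its tumbling window."""
    return event_time_ms - (event_time_ms % WINDOW_MS)


def process(events):
    """Count events per event-time window.

    `events` is an iterable of (event_time_ms, key) pairs in arrival order.
    Returns (emitted, late): final counts per window start, and the events
    that arrived after the watermark had already passed their window.
    """
    open_windows = defaultdict(int)  # window start -> running count
    emitted = {}                     # window start -> final count
    late = []                        # side output: kept for reconciliation, not dropped
    watermark = float("-inf")

    for event_time_ms, key in events:
        start = window_start(event_time_ms)
        if start + WINDOW_MS <= watermark:
            # The window is already closed: route the event to the side output.
            late.append((event_time_ms, key))
            continue
        open_windows[start] += 1
        watermark = max(watermark, event_time_ms - ALLOWED_LATENESS_MS)
        # Close every open window whose end is at or before the watermark.
        for s in sorted(open_windows):
            if s + WINDOW_MS <= watermark:
                emitted[s] = open_windows.pop(s)

    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        emitted[s] = open_windows.pop(s)
    return emitted, late


events = [(1_000, "a"), (4_000, "b"), (12_000, "c"), (9_000, "d"),
          (16_000, "e"), (3_000, "f"), (21_000, "g")]
emitted, late = process(events)
```

In this run the out-of-order event at 9,000 ms still lands in its window because the watermark has not passed it yet, while the event at 3,000 ms arrives after its window closed and goes to the side output instead of being discarded.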

Distributed batch

Petabyte-scale Spark jobs on Kubernetes or managed runtimes. Tuned for shuffle, spill and skew — not for the demo dataset.

PySpark · Scala · k8s

Event backbone

Kafka or Pulsar as the durable bus. Topics designed by domain, partitions designed for throughput, retention designed for replay.

Kafka · Pulsar · Schema Reg
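As a rough illustration of "partitions designed for throughput", the sketch below applies a common rule of thumb: size the partition count from whichever side, producer or consumer, is slower per partition, then add headroom. The per-partition throughput figures, the 1 KB event size and the 1.5 headroom factor are assumptions for the example; real numbers come from load tests on your own cluster.

```python
import math


def partitions_needed(events_per_s, event_bytes,
                      producer_mb_s_per_partition,
                      consumer_mb_s_per_partition,
                      headroom=1.5):
    """Estimate a partition count for a target ingest rate.

    All throughput inputs are measured values you supply; nothing here
    queries a broker. `headroom` leaves room for spikes and rebalances.
    """
    if producer_mb_s_per_partition <= 0 or consumer_mb_s_per_partition <= 0:
        raise ValueError("per-partition throughput must be positive")
    target_mb_s = events_per_s * event_bytes / 1_000_000
    by_producer = target_mb_s / producer_mb_s_per_partition
    by_consumer = target_mb_s / consumer_mb_s_per_partition
    # The slower side dictates how many partitions are needed.
    return math.ceil(max(by_producer, by_consumer) * headroom)


# 140k events/s at an assumed 1 KB per event, with assumed per-partition rates.
count = partitions_needed(140_000, 1_000,
                          producer_mb_s_per_partition=10,
                          consumer_mb_s_per_partition=5)
```

With these assumed figures the consumer side is the bottleneck (28 partitions before headroom) and the estimate comes out at 42.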

CDC at scale

Production database changes streamed to the lakehouse via Debezium / Maxwell — with backfill that doesn't take down the source.

Debezium · log-based

Cost & performance

Spot-friendly autoscaling, partition pruning, broadcast hints, column pruning. Same SLA, smaller bill.

spot · skew · prune

Reliability & replay

Idempotent sinks, checkpointed state, replayable topics. When something breaks, you rewind — you don't reprocess by hand.

idempotent · replay · checkpoint

How it works

A streaming spine, with batch when batch is right.

Reference architecture we've hardened across telemetry, fraud, gaming and ad-tech workloads.

Producers: Apps, IoT · edge, CDC, Webhooks
Bus · Kafka / Pulsar: domain topics · partitioned · schema-enforced · retention · replay · backpressure
Stream processing: Flink, Spark Streaming, Stateful ops
Sinks: Lakehouse, OLAP, Search, APIs · KV
Observability · lag, throughput, exactly-once, cost

Stream wherever it pays back. Batch wherever it doesn't.

Not every workload deserves a stream. We'll stream the latency-critical 20% and batch the rest — the savings, and the simplicity, are both real.

  • 01
    Topics by domain

    Orders, users, telemetry — each its own topic, each its own contract. No "everything-events" topic that nobody owns.

  • 02
    Exactly-once where it matters

    Idempotent sinks plus transactional writes — reconciliations stop being a quarterly fire drill.

  • 03
    State as a first-class concept

    Windowed aggregations, joins, sessionization — with RocksDB-backed state, checkpointed and replayable.

  • 04
    Run cheap, scale fast

    Spot pools for steady-state, on-demand for spikes. Autoscaling lives in the orchestration layer, not in tickets.
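The "exactly-once where it matters" principle above leans on idempotent sinks: if every write is keyed by a stable event ID, replaying a topic cannot double-count. A minimal in-memory sketch follows, assuming each event carries an `event_id` field. The class and function names are hypothetical, and a real sink would use a keyed upsert or transactional write in the target store rather than a Python dict.

```python
class IdempotentSink:
    """In-memory stand-in for a table keyed by event ID."""

    def __init__(self):
        self.rows = {}  # event_id -> amount

    def write(self, event):
        """Apply the event once; return False if it was already applied."""
        event_id = event["event_id"]
        if event_id in self.rows:
            return False
        self.rows[event_id] = event["amount"]
        return True

    def total(self):
        return sum(self.rows.values())


def replay(log, sink, from_offset=0):
    """Re-deliver the log from an offset; return how many writes took effect."""
    applied = 0
    for event in log[from_offset:]:
        if sink.write(event):
            applied += 1
    return applied


log = [
    {"event_id": "o-1", "amount": 10},
    {"event_id": "o-2", "amount": 20},
    {"event_id": "o-3", "amount": 30},
    {"event_id": "o-4", "amount": 40},
]
sink = IdempotentSink()
first_pass = replay(log, sink)                  # initial delivery
second_pass = replay(log, sink, from_offset=1)  # rewind after a failure
```

The first pass applies all four events. The rewind re-delivers three of them, none takes effect a second time, and the total stays at 100.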

Tech stack

The real distributed systems toolkit.

Tools that have survived a few peak seasons. Each one we've carried a pager for.

Compute

Spark · Flink · Beam · Ray · Dataflow

Event backbone

Kafka · Confluent · Pulsar · Redpanda · Kinesis

OLAP / serving

ClickHouse · Pinot · Druid · StarRocks

CDC

Debezium · Maxwell · Striim · DMS

Orchestration

Airflow · Dagster · Argo · Prefect

Cloud runtimes

EMR · Dataproc · HDInsight · Databricks Jobs

State & serving

RocksDB · Redis · Cassandra · DynamoDB

Observability

Prometheus · Grafana · OpenTelemetry · Cribl

From vision to victory

From batch-only to live, deliberately.

A staged rollout that starts with the workload that pays back first. No rebuild of the world.

01
Week 1
Audit latency

Which workloads need seconds, which need hours, and which are fine on a daily run. An honest grid.

02
Week 2
Backbone

Kafka / Pulsar with topic design, schema registry, retention — the durable interface for everything else.

03
Weeks 3–4
First stream job

The highest-payback latency-critical workload migrated to Flink / Spark Streaming. End-to-end SLA proven.

04
Weeks 5–6
Scale & harden

Autoscaling, exactly-once, observability, replay drill. On-call ready.

05
Ongoing
Operate & expand

Migrate more workloads as they earn it; keep batch where it shines.

Where it pays back

Three places streaming actually earns its keep.

Latency-critical, throughput-heavy, regulated. The use cases where “real-time” is a feature, not a tagline.

Pattern · Fintech · Fraud

Fraud scoring before authorization.

Sub-200ms fraud decisioning on card swipes. A Flink job pulls features from the feature store, scores in-line and writes the result back to the ledger.

p95 180ms Score latency
−34% Fraud losses
Flink · Kafka · Cassandra

Pattern · IoT · Telemetry

140k events / sec without breaking a sweat.

Manufacturing telemetry from 32 plants. Spark Structured Streaming into Hudi, with anomaly detection in-flight.

140k/s Peak ingest
32 plants Live
Spark Streaming · Kafka · Hudi

Pattern · Ad-tech · Cost

Same SLA, half the bill.

Rewrote a Kinesis + Lambda pipeline as a Kafka + Flink topology on spot instances. Latency stayed, cost halved.

−52% Compute spend
p99 380ms End-to-end
Kafka · Flink · EKS (spot)

Why ETY

Engineers who've carried the pager.

140k/s Peak ingest sustained on a production pipeline we built.
99.97% Uptime across streaming platforms we operate end-to-end.
−47% Median compute-cost reduction after Spark / Flink tuning.
24×7 On-call coverage for systems we operate on behalf of clients.

Stream the 20% that pays back.

Tell us the workload that's slowest, most fragile, or about to outgrow its batch window. We'll come back with a streaming target architecture and a credible migration sequence.