Stream processing
Exactly-once Spark Streaming / Flink jobs with checkpointing, watermarks and stateful windows. Late data handled — not dropped.
exactly-once · watermarkDistributed compute and event-driven pipelines that survive the day — Black Friday, gameday, the surprise migration. We design for throughput, backpressure and recovery first; the dashboards take care of themselves.
Six muscles for high-throughput, low-latency data — engineered for the day the dashboard goes red.
Exactly-once Spark Streaming / Flink jobs with checkpointing, watermarks and stateful windows. Late data handled — not dropped.
exactly-once · watermarkPetabyte-scale Spark jobs on Kubernetes or managed runtimes. Tuned for shuffle, spill and skew — not for the demo dataset.
PySpark · Scala · k8sKafka or Pulsar as the durable bus. Topics designed by domain, partitions designed for throughput, retention designed for replay.
Kafka · Pulsar · Schema RegProduction database changes streamed to the lakehouse via Debezium / Maxwell — with backfill that doesn't take down the source.
Debezium · log-basedSpot-friendly autoscaling, partition pruning, broadcast hints, column pruning. Same SLA, smaller bill.
spot · skew · pruneIdempotent sinks, checkpointed state, replayable topics. When something breaks, you rewind — you don't reprocess by hand.
idempotent · replay · checkpointReference architecture we've hardened across telemetry, fraud, gaming and ad-tech workloads.
Not every workload deserves a stream. We'll stream the latency-critical 20% and batch the rest — the savings, and the simplicity, are both real.
Orders, users, telemetry — each its own topic, each its own contract. No "everything-events" topic that nobody owns.
Idempotent sinks plus transactional writes — reconciliations stop being a quarterly fire drill.
Windowed aggregations, joins, sessionization — with RocksDB-backed state, checkpointed and replayable.
Spot pools for steady-state, on-demand for spikes. Autoscaling lives in the orchestration layer, not in tickets.
Tools that have survived a few peak seasons. Each one we've carried a pager for.
A staged rollout that starts with the workload that pays back first. No rebuild of the world.
Which workloads need seconds, which need hours, which are fine at daily. Honest grid.
Kafka / Pulsar with topic design, schema registry, retention — the durable interface for everything else.
The highest-payback latency-critical workload migrated to Flink / Spark Streaming. End-to-end SLA proven.
Autoscaling, exactly-once, observability, replay drill. On-call ready.
Migrate more workloads as they earn it; keep batch where it shines.
Latency-critical, throughput-heavy, regulated. The use cases where “real-time” is a feature, not a tagline.
Sub-200ms fraud decisioning on card swipes. Flink job pulls features from feature store, scores in-line, writes back to ledger.
Manufacturing telemetry from 32 plants. Spark Structured Streaming into Hudi, with anomaly detection in-flight.
Rewrote a Kinesis + Lambda pipeline as a Kafka + Flink topology on spot instances. Latency stayed, cost halved.
Tell us the workload that's slowest, most fragile, or about to outgrow its batch window. We'll come back with a streaming target architecture and a credible migration sequence.