Data Engineering · 03 of 05

Keep the raw. Model the useful.

Open-format lakehouses on Delta, Iceberg or Hudi — one storage layer that serves BI, ML and AI from the same source of truth. ELT means you don't lose data on the way in. The lakehouse means you don't get locked in on the way out.

Plan a lakehouse · See the architecture

Open: Delta · Iceberg · Hudi

1 copy of data · all engines

−58% storage cost vs. warehouse

Medallion architecture · live · Delta · object store

Bronze

Raw · append-only

JSON · Avro · CDC · 312 sources · 4.8 TB / day

Silver

Cleansed · joined · conformed

schema-enforced · type-safe · SCD · 2,140 tables

Gold

Business-ready · curated

marts · features · BI · ML · AI · 0 copies elsewhere

Databricks · Snowflake · Trino · DuckDB · Athena

What you get

An open lakehouse — not a marketing slide.

The pieces that make a lakehouse work the way the diagrams promise.

Open table formats

Delta, Iceberg, Hudi: ACID transactions, time travel and schema evolution on plain object storage. No proprietary file format in sight.

Delta · Iceberg · Hudi

Streaming & batch in one

The same table is served by Spark Structured Streaming and by batch readers. Late-arriving data merges cleanly, so there is no separate Lambda-architecture pipeline to maintain.

Spark · Flink · Kafka
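
A minimal sketch of why a keyed merge makes late arrivals harmless. It assumes a toy in-memory table, not any real Spark, Flink or table-format API; the `Reading` type, the `(device_id, event_time)` key and `merge_batch` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Reading:
    device_id: str
    event_time: int  # when the event happened (epoch seconds), not when it arrived
    value: float


Key = Tuple[str, int]


def merge_batch(table: Dict[Key, Reading], batch: List[Reading]) -> Dict[Key, Reading]:
    """Upsert a batch into the table, keyed by (device_id, event_time)."""
    merged = dict(table)  # copy so the previous table state stays untouched
    for reading in batch:
        merged[(reading.device_id, reading.event_time)] = reading
    return merged


on_time = [Reading("pump-1", 100, 1.0), Reading("pump-1", 300, 3.0)]
# One genuinely late row (event_time 200) and one redelivered row (event_time 100).
late = [Reading("pump-1", 200, 2.0), Reading("pump-1", 100, 1.0)]

# Streaming style: two micro-batches, the second one arrives late.
streamed = merge_batch(merge_batch({}, on_time), late)
# Batch style: everything loaded in one pass.
batched = merge_batch({}, on_time + late)

ordered = [r.value for r in sorted(streamed.values(), key=lambda r: r.event_time)]
print(streamed == batched, ordered)
```

Because the merge is keyed on when the event happened rather than when it arrived, the streaming path and the batch path converge on the same table, and the redelivered row does not create a duplicate.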

Time travel & rollback

Query the table as of yesterday, last week, or just before that bad deploy. Reproduce experiments and recover from mistakes without restoring a backup, as long as the version is still inside the table's retention window.

time-travel · versioned
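
To make the idea concrete, here is a toy model of versioned reads and rollback. It is a sketch under stated assumptions, not the Delta, Iceberg or Hudi API: `VersionedTable` and its methods are hypothetical, and real formats store a commit log over data files rather than full in-memory snapshots.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, int]


class VersionedTable:
    """Toy table that keeps every committed snapshot, newest last."""

    def __init__(self) -> None:
        # Version 0 is the empty table.
        self._snapshots: List[Tuple[Row, ...]] = [()]

    def commit(self, rows: List[Row]) -> int:
        """Append rows as a new version and return its version number."""
        self._snapshots.append(self._snapshots[-1] + tuple(rows))
        return len(self._snapshots) - 1

    def read(self, version: Optional[int] = None) -> List[Row]:
        """Read the latest version, or the table as of an older version."""
        index = len(self._snapshots) - 1 if version is None else version
        return list(self._snapshots[index])

    def rollback(self, version: int) -> int:
        """Restore an older snapshot as a new version; history is kept."""
        self._snapshots.append(self._snapshots[version])
        return len(self._snapshots) - 1


table = VersionedTable()
good = table.commit([("sku-1", 10), ("sku-2", 20)])  # version 1
bad = table.commit([("sku-3", -999)])                # version 2, the bad deploy
restored = table.rollback(good)                      # version 3 mirrors version 1

print(table.read())      # current state, without the bad row
print(table.read(bad))   # the bad version is still queryable for debugging
```

In this sketch the rollback is itself a new version, so the bad state stays available for a post-mortem instead of being erased.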

Multi-engine reads

Snowflake reads the same Iceberg table as Spark, Trino and DuckDB. Pick the engine for the workload, not the vendor.

multi-engine · zero-copy

Governance & lineage

Unity Catalog or Tabular for object-level RBAC, column masking, row filters and lineage. Audit-ready out of the box.

Unity · Tabular · column ACL

Feature & vector store

Same lakehouse, extra surface. Offline feature store for training, low-latency online store for serving — both consistent.

Feast · Tecton · vector

How it works

One storage layer. Many engines. No copies.

The whole point of a lakehouse is that data lives once. The complexity is in the curation layer — that's where we spend the time.

The trick is what gets promoted, not what gets stored.

Anyone can dump JSON into S3 and call it a lakehouse. The actual work — and the value — is in the silver and gold layers, where messy reality turns into reliable data products.

01 · ELT into Bronze, never lose anything
Land raw, append-only, partition-friendly. The lakehouse is also your archive.
02 · Schema-enforced Silver
Type-safe, deduplicated, conformed. The contract every downstream model can rely on.
03 · Gold per-use-case
BI marts, ML features, RAG vectors. Same source, fit-for-purpose shapes.
04 · Catalog as the gatekeeper
Permissions, masking, lineage at the catalog, not in five different engines.
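
The promotion steps above can be sketched in a few lines of plain Python. This is an illustration under loud assumptions: the order records, the field names and the `to_silver` / `to_gold` helpers are all hypothetical, and a real pipeline would run on an engine such as Spark or dbt rather than on in-memory lists.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

# Bronze: raw payloads exactly as they landed, append-only, never edited.
BRONZE: List[str] = [
    '{"order_id": "A1", "amount": "19.90", "country": "de"}',
    '{"order_id": "A1", "amount": "19.90", "country": "de"}',  # duplicate delivery
    '{"order_id": "A2", "amount": "5.00", "country": "FR"}',
    '{"order_id": "A3", "amount": "oops", "country": "FR"}',   # violates the contract
]


def to_silver(bronze: List[str]) -> Tuple[List[Dict], List[str]]:
    """Enforce types, conform values and deduplicate on order_id."""
    silver: Dict[str, Dict] = {}
    rejected: List[str] = []
    for line in bronze:
        raw = json.loads(line)
        try:
            row = {
                "order_id": str(raw["order_id"]),
                "amount_cents": round(float(raw["amount"]) * 100),
                "country": str(raw["country"]).upper(),
            }
        except (KeyError, ValueError):
            rejected.append(line)  # quarantined; the raw line still lives in bronze
            continue
        silver[row["order_id"]] = row  # last write wins, so duplicates collapse
    return list(silver.values()), rejected


def to_gold(silver: List[Dict]) -> Dict[str, int]:
    """One fit-for-purpose shape: revenue per country, in cents."""
    revenue: Dict[str, int] = defaultdict(int)
    for row in silver:
        revenue[row["country"]] += row["amount_cents"]
    return dict(revenue)


silver_rows, rejected_rows = to_silver(BRONZE)
gold = to_gold(silver_rows)
print(gold)
```

Nothing is deleted from bronze at any step, so a rejected row can be replayed once the contract or the source is fixed.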

Tech stack

Open formats, credible engines.

Everything here works without proprietary file formats. You can leave anytime — that's the point.

Table formats

Delta Lake · Apache Iceberg · Apache Hudi · Parquet

Compute

Databricks · Spark · Trino · DuckDB · Athena

Streaming

Kafka · Flink · Spark Streaming · Debezium

Object storage

S3 · ADLS Gen2 · GCS · MinIO

Catalog & access

Unity Catalog · Tabular · Polaris · AWS Glue · Lake Formation

Transformation

dbt on Spark · DLT · SQLMesh · PySpark

Feature & vector

Feast · Tecton · pgvector · Lance

Quality & obs

Great Expectations · Soda · Monte Carlo · OpenLineage

From vision to victory

From silos to shared substrate, in six weeks.

A pragmatic rollout designed around your existing warehouse — not a rip-and-replace.

Week 1

Format choice

Delta vs. Iceberg vs. Hudi — decided on your workload mix, not on a Twitter argument.

Week 2

Land bronze

CDC + streams + files into raw bronze. Object storage and catalog stood up.

Week 3–4

Build silver

Cleansing, dedup, schema enforcement, SCDs. The trustworthy interior of the lakehouse.
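
As one example of the silver work, a Type 2 slowly changing dimension keeps history by closing the current row and opening a new one when an attribute changes. The sketch below is a simplified assumption of that pattern: `apply_scd2`, the customer/city fields and the date strings are hypothetical, and production versions are usually written as a MERGE in the engine of choice.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def apply_scd2(history: List[Row], customer_id: str, city: str, as_of: str) -> List[Row]:
    """Return a new history with the change applied as a Type 2 update."""
    current = next(
        (r for r in history if r["customer_id"] == customer_id and r["valid_to"] is None),
        None,
    )
    if current is not None and current["city"] == city:
        return history  # attribute unchanged, nothing to record
    # Close the open row (if any) and append the new current row.
    updated = [dict(r, valid_to=as_of) if r is current else r for r in history]
    updated.append(
        {"customer_id": customer_id, "city": city, "valid_from": as_of, "valid_to": None}
    )
    return updated


history: List[Row] = []
history = apply_scd2(history, "c-1", "Berlin", "2024-01-01")
history = apply_scd2(history, "c-1", "Berlin", "2024-02-01")  # no change, no new row
history = apply_scd2(history, "c-1", "Munich", "2024-03-01")  # closes Berlin, opens Munich

for row in history:
    print(row)
```

The open row (`valid_to` of `None`) is always the current truth, while the closed rows let a gold model answer "where was this customer on a given date".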

Week 5–6

Promote to gold

Marts for BI, features for ML, vectors for AI. All from the same silver source.

Ongoing

Operate & tune

Compaction, vacuum, retention policy, cost dashboard.
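
A rough sketch of how these maintenance jobs choose their targets, assuming a toy file listing rather than any real table-format metadata. `DataFile`, `plan_compaction`, `plan_vacuum`, the 32 MB small-file threshold and the 168-hour retention are illustrative assumptions, not recommended settings.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataFile:
    path: str
    size_mb: int
    # Hour at which a newer version stopped referencing this file;
    # None means the latest table version still uses it.
    removed_at_hour: Optional[int] = None


def plan_compaction(files: List[DataFile], small_mb: int = 32) -> List[str]:
    """Live files small enough to be worth rewriting into larger ones."""
    return [f.path for f in files if f.removed_at_hour is None and f.size_mb < small_mb]


def plan_vacuum(files: List[DataFile], now_hour: int, retention_hours: int = 168) -> List[str]:
    """Unreferenced files older than the retention window, safe to delete."""
    return [
        f.path
        for f in files
        if f.removed_at_hour is not None and now_hour - f.removed_at_hour > retention_hours
    ]


files = [
    DataFile("part-0.parquet", 4),
    DataFile("part-1.parquet", 6),
    DataFile("part-2.parquet", 256),
    DataFile("old-a.parquet", 128, removed_at_hour=10),
    DataFile("old-b.parquet", 128, removed_at_hour=190),
]

to_compact = plan_compaction(files)
to_delete = plan_vacuum(files, now_hour=200)
print(to_compact, to_delete)
```

The retention setting is the trade-off to watch: a shorter window lowers storage cost, but versions whose files have been vacuumed can no longer be queried with time travel.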

Where it shines

Three lakehouses that retired tool sprawl.

When BI, ML and AI start sharing one source of truth, the rest of the platform gets smaller.

Pattern · BI + ML · Retail

One lakehouse, two audiences.

Built a Delta lakehouse serving Tableau and PyTorch from the same gold tables. Killed three redundant warehouses on the way.

−58% storage cost

3 → 1 warehouses

Databricks · Delta · Unity

Pattern · Iceberg · Multi-engine

Snowflake and Spark, reading the same tables.

Migrated a SaaS analytics platform to Iceberg on S3. Snowflake for BI workloads, Spark for ML, zero copy between them.

0 copies between engines

p95 1.4 s · BI queries

Iceberg · Tabular · Snowflake · Trino

Pattern · Streaming · IoT

Telemetry at 80k events / sec.

Hudi lakehouse for a manufacturing telemetry platform. Sub-minute freshness, with late-arriving data merged cleanly.

80k/s ingest peak

< 60 s end-to-end

Hudi · Flink · Kafka

Why ETY

Format-agnostic. Outcome-obsessed.

14 production lakehouses on Delta, Iceberg or Hudi.

1 source of truth for BI, ML and AI. That's the goal every time.

−58% median storage cost reduction vs. a warehouse-only setup.

Multi-cloud: S3, ADLS, GCS. We'll meet you on the substrate you have.

Continue exploring

Big Data

The streaming and distributed-compute layer that keeps a lakehouse live.

ETL & Data Warehousing

When a classical, modeled warehouse is the right tool — without a lakehouse layer underneath.

One storage layer. Every workload.

Send us your current warehouse + lake estate. We'll come back with a target lakehouse design, a migration order, and the workloads that consolidate first.

Book a discovery call · Back to Data Engineering
