Data Engineering · 03 of 05

Keep the raw. Model the useful.

Open-format lakehouses on Delta, Iceberg or Hudi — one storage layer that serves BI, ML and AI from the same source of truth. ELT means you don't lose data on the way in. The lakehouse means you don't get locked in on the way out.

Open: Delta · Iceberg · Hudi
1 copy of data · all engines
−58% storage cost vs. warehouse
Medallion architecture · live
Delta · object store
Bronze
Raw · append-only
JSON · Avro · CDC · 312 sources · 4.8 TB / day
Silver
Cleansed · joined · conformed
schema-enforced · type-safe · SCD2,140 tables
Gold
Business-ready · curated
marts · features · BI · ML · AI · 0 copies elsewhere
Databricks · Snowflake · Trino · DuckDB · Athena
What you get

An open lakehouse — not a marketing slide.

The pieces that make a lakehouse work the way the diagrams promise.

Open table formats

Delta, Iceberg, Hudi. ACID transactions, time travel and schema evolution, all on plain object storage. No proprietary file format in sight.

Delta · Iceberg · Hudi

Streaming & batch in one

Same table, served by Spark Structured Streaming and batch readers. Late-arriving data merges cleanly — no separate Lambda pipeline.

Spark · Flink · Kafka
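
As a rough illustration of what "merges cleanly" can mean for late-arriving data, here is a minimal pure-Python sketch of a keyed upsert in which a record only replaces an existing row when its event time is newer. The field names (`id`, `event_ts`, `value`) and the function `merge_upsert` are hypothetical; on a real table this would typically be a MERGE operation in Delta, Iceberg or Hudi rather than a dictionary.

```python
from typing import Any, Dict, Iterable

Row = Dict[str, Any]

def merge_upsert(table: Dict[str, Row], batch: Iterable[Row]) -> Dict[str, Row]:
    """Upsert rows by `id`, keeping whichever row has the latest `event_ts`."""
    merged = dict(table)  # copy, so the input table is left untouched
    for row in batch:
        current = merged.get(row["id"])
        if current is None or row["event_ts"] > current["event_ts"]:
            merged[row["id"]] = row
    return merged

table = {"a": {"id": "a", "event_ts": 10, "value": 1}}
batch = [
    {"id": "a", "event_ts": 5, "value": 99},   # late and stale: ignored
    {"id": "b", "event_ts": 7, "value": 2},    # late, but a new key: inserted
    {"id": "a", "event_ts": 12, "value": 3},   # newer: replaces the old row
]
result = merge_upsert(table, batch)
```

The same rule applies whether the batch comes from a stream or a nightly file drop, which is why one table can serve both paths.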

Time travel & rollback

Query the table as of yesterday, last week, that bad deploy. Reproduce experiments, recover from mistakes — without a backup tape.

time-travel · versioned
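
To make the idea concrete, here is a toy sketch of versioned reads and rollback. It assumes a simplified model in which every commit stores a full snapshot; real table formats track snapshots through a transaction log or metadata files instead. The class `VersionedTable` and its methods are hypothetical names for illustration only.

```python
from typing import Any, Dict, Iterable, List

Row = Dict[str, Any]

class VersionedTable:
    """Toy table that keeps every committed snapshot."""

    def __init__(self) -> None:
        self._snapshots: List[List[Row]] = [[]]  # version 0 is the empty table

    @property
    def version(self) -> int:
        return len(self._snapshots) - 1

    def append(self, rows: Iterable[Row]) -> int:
        """Commit new rows on top of the latest snapshot; return the new version."""
        self._snapshots.append(self._snapshots[-1] + list(rows))
        return self.version

    def as_of(self, version: int) -> List[Row]:
        """Read the table exactly as it was at `version`."""
        if not 0 <= version <= self.version:
            raise ValueError(f"unknown version {version}")
        return list(self._snapshots[version])

    def rollback(self, version: int) -> int:
        """Restore an old snapshot as a new commit, so history is never rewritten."""
        self._snapshots.append(self.as_of(version))
        return self.version

t = VersionedTable()
v1 = t.append([{"id": 1}])
v2 = t.append([{"id": 2, "bad": True}])  # the "bad deploy"
v3 = t.rollback(v1)
```

After the rollback, `t.as_of(v3)` matches `t.as_of(v1)`, while `t.as_of(v2)` still returns the bad rows for debugging.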

Multi-engine reads

Snowflake reads the same Iceberg table as Spark, Trino and DuckDB. Pick the engine for the workload, not the vendor.

multi-engine · zero-copy

Governance & lineage

Unity Catalog or Tabular for object-level RBAC, column masking, row filters and lineage. Audit-ready out of the box.

Unity · Tabular · column ACL
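
A minimal sketch of how a row filter and a column mask combine at read time, assuming a simplified in-memory policy object. The names `Policy` and `read_with_policy`, and the sample columns, are hypothetical; in practice these rules are declared in the catalog and enforced by the engine, not written in application code.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Set

Row = Dict[str, Any]

@dataclass
class Policy:
    masked_columns: Set[str] = field(default_factory=set)
    row_filter: Callable[[Row], bool] = lambda row: True  # default: all rows visible

def read_with_policy(rows: List[Row], policy: Policy) -> List[Row]:
    """Apply the row filter first, then mask the restricted columns."""
    visible = [r for r in rows if policy.row_filter(r)]
    return [
        {k: ("***" if k in policy.masked_columns else v) for k, v in r.items()}
        for r in visible
    ]

rows = [
    {"region": "EU", "email": "a@example.com", "spend": 10},
    {"region": "US", "email": "b@example.com", "spend": 20},
]
eu_analyst = Policy(masked_columns={"email"}, row_filter=lambda r: r["region"] == "EU")
out = read_with_policy(rows, eu_analyst)
```

Here `out` contains only the EU row, with `email` replaced by `***`.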

Feature & vector store

Same lakehouse, extra surface. Offline feature store for training, low-latency online store for serving — both consistent.

Feast · Tecton · vector
How it works

One storage layer. Many engines. No copies.

The whole point of a lakehouse is that data lives once. The complexity is in the curation layer — that's where we spend the time.

Ingest (ELT): CDC · Streams · Files · SaaS APIs
Lakehouse (object storage)
Bronze: raw · append-only · all the data
Silver: cleansed · joined · type-safe
Gold: marts · features · vectors
Compute engines (read same tables): Spark · Trino · DuckDB · Snowflake
Catalog: Unity / Tabular / Polaris

The trick is what gets promoted, not what gets stored.

Anyone can dump JSON into S3 and call it a lakehouse. The actual work — and the value — is in the silver and gold layers, where messy reality turns into reliable data products.

  • 01
    ELT into Bronze, never lose anything

    Land raw, append-only, partition-friendly. The lakehouse is also your archive.

  • 02
    Schema-enforced Silver

    Type-safe, deduplicated, conformed. The contract every downstream model can rely on.

  • 03
    Gold per-use-case

    BI marts, ML features, RAG vectors. Same source, fit-for-purpose shapes.

  • 04
    Catalog as the gatekeeper

    Permissions, masking, lineage at the catalog — not in five different engines.
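
The promotion steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a production pipeline: the `SILVER_SCHEMA` contract, the order fields and the function names are all hypothetical, and a real implementation would run as Spark, dbt or SQL jobs against lakehouse tables.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]

# Hypothetical silver contract: column name -> required type.
SILVER_SCHEMA = {"order_id": str, "customer": str, "amount": float}

def to_silver(bronze: List[Row]) -> List[Row]:
    """Cast to the schema, drop rows that fail, deduplicate on order_id."""
    silver: Dict[str, Row] = {}
    for raw in bronze:
        try:
            row = {col: typ(raw[col]) for col, typ in SILVER_SCHEMA.items()}
        except (KeyError, TypeError, ValueError):
            continue  # in practice, route rejects to a quarantine table
        silver[row["order_id"]] = row  # last delivery wins
    return list(silver.values())

def to_gold(silver: List[Row]) -> Dict[str, float]:
    """One fit-for-purpose shape: revenue per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for row in silver:
        totals[row["customer"]] += row["amount"]
    return dict(totals)

bronze = [
    {"order_id": "o1", "customer": "acme", "amount": "10.5"},
    {"order_id": "o1", "customer": "acme", "amount": "10.5"},   # duplicate delivery
    {"order_id": "o2", "customer": "acme", "amount": "4.5"},
    {"order_id": "o3", "customer": "globex", "amount": "n/a"},  # fails the cast
    {"customer": "globex", "amount": "1.0"},                    # missing key
]
silver = to_silver(bronze)
gold = to_gold(silver)
```

Bronze keeps all five raw records, silver keeps the two that satisfy the contract, and gold is derived only from silver.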

Tech stack

Open formats, credible engines.

Everything here works without proprietary file formats. You can leave anytime — that's the point.

Table formats

Delta Lake · Apache Iceberg · Apache Hudi · Parquet

Compute

Databricks · Spark · Trino · DuckDB · Athena

Streaming

Kafka · Flink · Spark Streaming · Debezium

Object storage

S3 · ADLS Gen2 · GCS · MinIO

Catalog & access

Unity Catalog · Tabular · Polaris · AWS Glue · Lake Formation

Transformation

dbt on Spark · DLT · SQLMesh · PySpark

Feature & vector

Feast · Tecton · pgvector · Lance

Quality & obs

Great Expectations · Soda · Monte Carlo · OpenLineage
From vision to victory

From silos to shared substrate, in six weeks.

A pragmatic rollout designed around your existing warehouse — not a rip-and-replace.

01
Week 1
Format choice

Delta vs. Iceberg vs. Hudi — decided on your workload mix, not on a Twitter argument.

02
Week 2
Land bronze

CDC + streams + files into raw bronze. Object storage and catalog stood up.

03
Weeks 3–4
Build silver

Cleansing, dedup, schema enforcement, SCDs. The trustworthy interior of the lakehouse.

04
Weeks 5–6
Promote to gold

Marts for BI, features for ML, vectors for AI. All from the same silver source.

05
Ongoing
Operate & tune

Compaction, vacuum, retention policy, cost dashboard.
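
The SCDs mentioned in the silver step of the rollout are usually Type 2 slowly changing dimensions, where a change closes the current row and opens a new one rather than overwriting it. Below is a minimal sketch of that rule, assuming a simplified row layout (`key`, `attrs`, `valid_from`, `valid_to`); the function name `apply_scd2` and the sample dates are hypothetical.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]

def apply_scd2(history: List[Row], key: str, attrs: Dict[str, Any], as_of: str) -> List[Row]:
    """Return a new history with the change applied as an SCD Type 2 update."""
    out = [dict(r) for r in history]  # copy rows so the input is not mutated
    current = next(
        (r for r in out if r["key"] == key and r["valid_to"] is None), None
    )
    if current is not None:
        if current["attrs"] == attrs:
            return out  # nothing changed: keep the current version open
        current["valid_to"] = as_of  # close the old version
    out.append({"key": key, "attrs": dict(attrs), "valid_from": as_of, "valid_to": None})
    return out

h: List[Row] = []
h = apply_scd2(h, "c1", {"tier": "silver"}, "2024-01-01")
h = apply_scd2(h, "c1", {"tier": "gold"}, "2024-06-01")   # change: new version
h = apply_scd2(h, "c1", {"tier": "gold"}, "2024-07-01")   # no change: no new row
```

The result is two rows for `c1`: the silver tier closed on 2024-06-01, and the gold tier still open.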

Where it shines

Three lakehouses that retired tool sprawl.

When BI, ML and AI start sharing one source of truth, the rest of the platform gets smaller.

Pattern · BI + ML · Retail

One lakehouse, two audiences.

Built a Delta lakehouse serving Tableau and PyTorch from the same gold tables. Killed three redundant warehouses on the way.

−58% storage cost
3 → 1 warehouses
Databricks · Delta · Unity
Pattern · Iceberg · Multi-engine

Snowflake and Spark, reading the same tables.

Migrated a SaaS analytics platform to Iceberg on S3. Snowflake for BI workloads, Spark for ML, zero copies between them.

0 copies between engines
p95 1.4s BI queries
Iceberg · Tabular · Snowflake · Trino
Pattern · Streaming · IoT

Telemetry at 80k events / sec.

Hudi lakehouse for a manufacturing telemetry platform. Sub-minute freshness, with late-arriving data merging cleanly.

80k/s ingest peak
< 60s end-to-end
Hudi · Flink · Kafka
Why ETY

Format-agnostic. Outcome-obsessed.

14 production lakehouses on Delta, Iceberg or Hudi.
1 source of truth for BI, ML and AI. That's the goal every time.
−58% median storage cost reduction vs. a warehouse-only setup.
Multi-cloud: S3, ADLS, GCS. We'll meet you on the substrate you have.

One storage layer. Every workload.

Send us your current warehouse + lake estate. We'll come back with a target lakehouse design, a migration order, and the workloads that consolidate first.