AI / ML · 04 of 05

Frontier when it matters. Small when it doesn’t.

Most teams over-spend on the wrong model and under-spend on evaluation. We benchmark, fine-tune, distill and serve the right model per request — billion-parameter giants where they earn it, slim distilled SLMs where they don’t.

Right-size your model ↗ · See the selection grid

−74% · Median cost / call

3.4× · Tokens/sec on same GPU

27 · Models we’ve fine-tuned

What you get

Model intelligence as a discipline, not a vendor pick.

Six muscles we’ve built up across 27 model deployments. We use them together — they compound.

Benchmarking on your data

Public benchmarks lie. We build a domain-specific eval suite — your prompts, your rubric — then score 8–12 candidate models head to head.

eval-on-your-data
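A minimal sketch of what a domain-specific eval can look like, assuming a deliberately simple rubric (required terms per prompt). The `EvalCase` type, the scoring rule and the two stand-in models are hypothetical; a real suite would call actual model endpoints and use a richer rubric.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    required_terms: tuple[str, ...]  # hypothetical rubric: terms a good answer must contain


def score_answer(answer: str, case: EvalCase) -> float:
    """Fraction of the rubric's required terms found in the answer (0.0 to 1.0)."""
    text = answer.lower()
    hits = sum(term.lower() in text for term in case.required_terms)
    return hits / len(case.required_terms)


def run_bakeoff(
    models: dict[str, Callable[[str], str]],
    cases: list[EvalCase],
) -> list[tuple[str, float]]:
    """Score every candidate on every case; return (name, mean score), best first."""
    results = [
        (name, mean(score_answer(generate(case.prompt), case) for case in cases))
        for name, generate in models.items()
    ]
    return sorted(results, key=lambda item: item[1], reverse=True)


# Stand-ins for real model calls (assumption: each model is a prompt -> text function).
def small_model(prompt: str) -> str:
    return "Refunds are issued within 14 days."


def large_model(prompt: str) -> str:
    return "Refunds are issued within 14 days to the original payment method."


CASES = [EvalCase("What is the refund policy?", ("refund", "14 days", "payment method"))]
leaderboard = run_bakeoff({"small": small_model, "large": large_model}, CASES)

for name, score in leaderboard:
    print(f"{name}: {score:.2f}")
```

Keeping each model behind a plain function makes it cheap to add the next candidate to the same leaderboard.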

Fine-tuning (LoRA, QLoRA, full-SFT)

Style transfer, structured output, domain adaptation. We pick the method that earns its keep on your eval set.

LoRA · QLoRA · DPO

Quantization & distillation

FP16 → INT8/INT4 with measured quality loss. Distill a 70B teacher into a 7B student that fits on one GPU.

GGUF · AWQ · distill
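To make "measured quality loss" concrete at the smallest scale, here is a sketch of symmetric per-tensor INT8 quantization with the round-trip error reported. It is an illustration only: the weights are made up, and production formats such as AWQ or GGUF use more elaborate schemes. The quality loss that matters is still the one measured on your eval set, not this weight-level error.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: list[int], scale: float) -> list[float]:
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(original: list[float], restored: list[float]) -> float:
    """Largest per-weight difference introduced by the round trip."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Made-up weights for illustration.
weights = [0.02, -0.51, 0.33, 1.27, -0.90]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
error = max_abs_error(weights, restored)

print(f"scale={scale:.4f} max_abs_error={error:.6f}")
```

With round-to-nearest, the per-weight error is bounded by half the scale, which is why outlier weights (which inflate the scale) are what most quantization methods work to contain.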

Inference optimization

vLLM, TensorRT-LLM, speculative decoding, KV-cache reuse, batching. Same hardware, 3–4× the throughput.

vLLM · TensorRT · spec-decode

Model routing

Per-request routing — cheap SLM for easy, big LLM for hard, fallback for ambiguous. With a classifier you can read.

router · classifier · fallback
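One way to read "a classifier you can read" is a scoring rule whose every decision comes with a reason. The sketch below assumes a heuristic score built from prompt length and a few reasoning markers; the marker list, thresholds and target names are hypothetical, and a trained classifier could replace `difficulty_score` without changing the routing logic.

```python
from dataclasses import dataclass

# Hypothetical markers of reasoning-heavy requests; a real list would come
# from clustering your own traffic. Matching is by substring, so it is crude.
HARD_MARKERS = ("prove", "step by step", "analyze", "compare", "why")


@dataclass(frozen=True)
class Route:
    target: str  # "slm", "llm" or "fallback"
    reason: str  # human-readable explanation for logs and audits


def difficulty_score(prompt: str) -> float:
    """Longer prompts and reasoning markers raise the score, capped at 1.0."""
    text = prompt.lower()
    length_part = min(len(text.split()) / 200, 0.5)
    marker_part = 0.25 * sum(marker in text for marker in HARD_MARKERS)
    return min(length_part + marker_part, 1.0)


def route(prompt: str, easy_below: float = 0.2, hard_above: float = 0.5) -> Route:
    """Send easy prompts to the small model, hard ones to the large model,
    and everything in between to the fallback path."""
    score = difficulty_score(prompt)
    if score < easy_below:
        return Route("slm", f"score {score:.2f} below {easy_below}")
    if score > hard_above:
        return Route("llm", f"score {score:.2f} above {hard_above}")
    return Route("fallback", f"score {score:.2f} is ambiguous")


for prompt in (
    "What are your opening hours?",
    "Why was my order delayed?",
    "Compare these two contracts and analyze why the indemnity clauses differ, step by step.",
):
    decision = route(prompt)
    print(f"{decision.target:8s} {decision.reason}")
```

What the fallback path does (for example, try the small model first and escalate on a low-confidence answer) is left open here.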

Edge & on-device

llama.cpp, MLX, ONNX Runtime for laptop, mobile, kiosk. Sub-second inference where the network can't reach.

llama.cpp · MLX · ONNX

How we pick

A model selection grid we’ll walk with you.

Same conversation, every project: how hard is the task, how strict the latency, how sensitive the data. The grid falls out.

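|                  | Easy task     | Medium task | Hard task            | Reasoning-heavy  |
|------------------|---------------|-------------|----------------------|------------------|
| Latency < 300ms  | SLM 1B        | SLM 7B      | Llama 8B + LoRA      | Llama 70B (warm) |
| Volume-driven    | Distilled SLM | Mistral 7B  | Mixtral 8x7B         | Llama 70B        |
| Quality-critical | Llama 8B      | Llama 70B   | GPT-4o · Claude 3.5  | o1 · Claude 3.5  |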

The model is a deployment decision, not a religion.

We start from the request mix, the latency budget and the data class — not from a model preference. Then we route, fine-tune and distill until the unit economics work.

01
Profile your traffic
Cluster real requests by difficulty. 70–85% are usually easier than you assume — that's SLM territory.
02
Run the bake-off on your evals
Side-by-side scoring on the rubric you care about. Quality numbers in the design doc, not in marketing copy.
03
Fine-tune to close the gap
A 7B LoRA often matches a 70B base on your specific task, at a tenth of the cost. We measure the trade.
04
Route in production
Cheap by default, escalate on uncertainty. The big model becomes a safety net, not the main floor.
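The unit economics behind these four steps come down to a weighted average. The sketch below uses made-up prices (a 10× gap between the large and small model) and an 80% small-model share, which sits inside the 70–85% range mentioned above; none of these figures come from a real engagement.

```python
def blended_cost(small_share: float, small_cost: float, large_cost: float) -> float:
    """Average cost per request when `small_share` of traffic goes to the small model."""
    if not 0.0 <= small_share <= 1.0:
        raise ValueError("small_share must be between 0 and 1")
    return small_share * small_cost + (1 - small_share) * large_cost


def saving_vs_large_only(small_share: float, small_cost: float, large_cost: float) -> float:
    """Fractional saving compared with sending every request to the large model."""
    return 1 - blended_cost(small_share, small_cost, large_cost) / large_cost


# Hypothetical prices per 1,000 requests, in the same currency unit.
LARGE_COST = 10.0
SMALL_COST = 1.0

saving = saving_vs_large_only(0.8, SMALL_COST, LARGE_COST)
print(f"saving vs large-only: {saving:.0%}")  # about 72% under these assumptions
```

The saving is capped by the share of traffic that still needs the large model, which is why profiling the request mix comes first.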

Tech stack

Tooling for the full model lifecycle.

From a Friday-night benchmarking notebook to a Monday-morning production rollout.

Models we work with

Llama 3.1 / 3.3 · Mistral / Mixtral · Qwen 2.5 · Phi-3 · DeepSeek · Gemma

Fine-tuning

Axolotl · Unsloth · HF PEFT · LoRA / QLoRA · DPO / ORPO

Compression

GGUF · AWQ · GPTQ · SmoothQuant · Knowledge distill

Serving

vLLM · TGI · TensorRT-LLM · Triton · Ray Serve

Eval

Promptfoo · Braintrust · lm-eval-harness · MTEB

Edge runtime

llama.cpp · MLX · ONNX Runtime · Core ML · TFLite

Compute

H100 / H200 · A100 · L40S · Modal · Lambda Labs

Routing

Custom classifier · RouteLLM · Martian · Portkey

From vision to victory

From benchmark to production, in weeks.

A short, evidence-driven path from "which model?" to "ship and route."

Week 1

Build the eval

Domain rubric, 60–150 representative prompts, scoring scaffold. The artifact that decides every model trade.

Week 2

Bake-off

8–12 candidate models scored on cost, latency and quality. Pareto front in a single chart.
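A Pareto front keeps only the candidates that no other candidate beats on every axis at once. The sketch below assumes three axes (cost and latency lower is better, quality higher is better); the model names and numbers are placeholders, not benchmark results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    cost: float        # lower is better
    latency_ms: float  # lower is better
    quality: float     # higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if `a` is at least as good as `b` on every axis and strictly better on one."""
    no_worse = a.cost <= b.cost and a.latency_ms <= b.latency_ms and a.quality >= b.quality
    strictly_better = a.cost < b.cost or a.latency_ms < b.latency_ms or a.quality > b.quality
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Candidates that no other candidate dominates, in input order."""
    return [c for c in candidates if not any(dominates(other, c) for other in candidates)]


# Placeholder numbers for illustration only.
CANDIDATES = [
    Candidate("model-a", cost=1.0, latency_ms=120, quality=0.78),
    Candidate("model-b", cost=4.0, latency_ms=300, quality=0.90),
    Candidate("model-c", cost=5.0, latency_ms=350, quality=0.88),  # dominated by model-b
    Candidate("model-d", cost=0.5, latency_ms=90, quality=0.60),
]

front = pareto_front(CANDIDATES)
print([c.name for c in front])
```

Everything off the front can be dropped from the discussion; choosing among the models on it is a business decision about which axis matters most.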

Week 3

Fine-tune

LoRA / QLoRA on the best small candidate. Recover the quality gap, keep the cost win.

Week 4

Compress & serve

Quantize, deploy on vLLM, throughput-tune. Latency budget enforced in CI.
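Enforcing a latency budget in CI can be as simple as a percentile check that fails the build. The sketch below assumes latency samples have already been collected from a load test; the 95th percentile, the nearest-rank method and the sample values are illustrative choices, not a prescribed setup.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no latency samples collected")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def check_latency_budget(samples_ms: list[float], budget_ms: float, pct: float = 95.0) -> None:
    """Raise AssertionError (failing the CI job) if the observed percentile exceeds the budget."""
    observed = percentile(samples_ms, pct)
    if observed > budget_ms:
        raise AssertionError(f"p{pct:g} latency {observed:.0f} ms exceeds budget {budget_ms:.0f} ms")


# Made-up samples: mostly fast, with one slow outlier.
SAMPLES_MS = [180, 190, 200, 210, 220, 230, 240, 250, 260, 400]

print("p50:", percentile(SAMPLES_MS, 50))
print("p95:", percentile(SAMPLES_MS, 95))
check_latency_budget(SAMPLES_MS, budget_ms=500)  # passes under this budget
```

Checking a tail percentile rather than the mean keeps a single slow path from hiding behind many fast ones.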

Ongoing

Route & iterate

Production traffic feeds the eval. Quarterly bake-offs as the model landscape shifts.

Where this pays back

Three ways to get from a big model to a good one.

Patterns where right-sizing turned an unsustainable AI bill into healthy unit economics.

Pattern · SaaS · Routing

Most calls didn't need GPT-4.

Replaced a single-model GPT-4 setup with a router + fine-tuned Llama 8B for 78% of traffic. Quality flat, cost down 74%.

−74% · $ / 1M req

0 regressions · On golden set

Llama 8B + LoRA · Portkey · vLLM

Pattern · Retail · Edge

Kiosks that don't need the cloud.

Distilled a 70B teacher into a 1.3B student running on store-floor kiosks. Sub-second responses, zero network dependency.

1.3B · Student model

620ms · End-to-end

llama.cpp · GGUF Q4 · NUC hardware
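For readers new to distillation, here is a sketch of one common formulation: the student is trained to match the teacher's temperature-softened output distribution. This is the classic soft-target loss, shown as an assumption about how distillation can work in general; the method used for this kiosk model is not described here, and LLM distillation often trains on teacher-generated text instead. The logits are made up.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: list[float],
    teacher_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by temperature squared."""
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature**2


# Made-up logits over a three-token vocabulary.
teacher = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]
far_student = [0.5, 1.0, 4.0]

loss_close = distillation_loss(close_student, teacher)
loss_far = distillation_loss(far_student, teacher)
print(f"close={loss_close:.4f} far={loss_far:.4f}")
```

A higher temperature exposes more of the teacher's ranking among unlikely tokens, which is the extra signal a small student cannot get from hard labels alone.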

Pattern · Legal · Domain

A small model that beats the big one — on contracts.

QLoRA-tuned Mistral 7B on a curated contract corpus. Beats GPT-4o on the client's clause-extraction eval, runs on one A100.

+12 pts · F1 vs. GPT-4o

1× A100 · Inference

Mistral 7B · QLoRA · vLLM

Why ETY

Engineers who’ve been inside the model.

27 · Models we’ve fine-tuned and deployed in production.

3.4× · Median throughput improvement on the same hardware.

−74% · Median cost-per-call reduction after routing + fine-tune.

11 · Open-source contributions on PEFT, quantization and eval.

Continue exploring

Generative AI

The pipelines these models power, with the grounding and evaluation that make output worth shipping.


Private Enterprise AI

How we deploy these models inside your perimeter, with audit and tenancy controls.


Pay for the right intelligence.

Send us a sample of your real prompts and your latency budget. We’ll come back with a model shortlist, a cost projection, and the eval we’d score them on.

Book a discovery call ↗
