AI / ML · 04 of 05

Frontier when it matters. Small when it doesn’t.

Most teams over-spend on the wrong model and under-spend on evaluation. We benchmark, fine-tune, distill and serve the right model per request — billion-parameter giants where they earn it, slim distilled SLMs where they don’t.

−74% Median cost / call
3.4× Tokens/sec on same GPU
27 Models we’ve fine-tuned

What you get

Model intelligence as a discipline, not a vendor pick.

Six muscles we’ve built up across 27 model deployments. We use them together — they compound.

Benchmarking on your data

Public benchmarks lie. We build a domain-specific eval suite — your prompts, your rubric — then score 8–12 candidate models head to head.

eval-on-your-data
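A head-to-head bake-off of this kind can be sketched in a few lines. This is a minimal illustration, not our production harness: the `EvalCase` type, the phrase-matching rubric and the two stand-in "models" are all hypothetical, chosen so the sketch runs without any API key.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    must_include: list[str]  # hypothetical rubric: phrases a good answer contains


def score_answer(answer: str, case: EvalCase) -> float:
    """Fraction of rubric phrases found in the answer (case-insensitive)."""
    if not case.must_include:
        return 1.0
    text = answer.lower()
    hits = sum(1 for phrase in case.must_include if phrase.lower() in text)
    return hits / len(case.must_include)


def run_bakeoff(
    models: dict[str, Callable[[str], str]], cases: list[EvalCase]
) -> list[tuple[str, float]]:
    """Mean rubric score per model, best first."""
    results = []
    for name, generate in models.items():
        scores = [score_answer(generate(c.prompt), c) for c in cases]
        results.append((name, sum(scores) / len(scores)))
    return sorted(results, key=lambda r: r[1], reverse=True)


# Stand-in "models": plain functions instead of real inference calls.
def small_model(prompt: str) -> str:
    return "Refunds are accepted within 30 days."


def large_model(prompt: str) -> str:
    return "Refunds are accepted within 30 days with a receipt."


CASES = [EvalCase("What is the refund policy?", ["30 days", "receipt"])]
RANKING = run_bakeoff({"small": small_model, "large": large_model}, CASES)
```

In a real suite the rubric is usually richer than phrase matching (graded answers, schema checks, human review), but the shape stays the same: fixed prompts, one scoring function, every candidate run through both.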

Fine-tuning (LoRA, QLoRA, full-SFT)

Style transfer, structured output, domain adaptation. We pick the method that earns its keep on your eval set.

LoRA · QLoRA · DPO

Quantization & distillation

FP16 → INT8/INT4 with measured quality loss. Distill a 70B teacher into a 7B student that fits on one GPU.

GGUF · AWQ · distill
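The "fits on one GPU" claim comes down to simple arithmetic on weight storage. The sketch below is a back-of-envelope estimate only: it counts weights alone and ignores KV cache, activations and the per-block scale overhead that real INT4 formats carry. The function name is our own for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


TEACHER_FP16_GB = weight_memory_gb(70, 16)  # 70B teacher at 16 bits per weight
STUDENT_FP16_GB = weight_memory_gb(7, 16)   # 7B student before quantization
STUDENT_INT4_GB = weight_memory_gb(7, 4)    # 7B student at 4 bits per weight
```

Under these assumptions the 70B teacher needs about 140 GB for weights alone, more than a single 80 GB card holds, while the 7B student needs about 14 GB at FP16 and about 3.5 GB at INT4.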

Inference optimization

vLLM, TensorRT-LLM, speculative decoding, KV-cache reuse, batching. Same hardware, 3–4× the throughput.

vLLM · TensorRT · spec-decode

Model routing

Per-request routing — cheap SLM for easy, big LLM for hard, fallback for ambiguous. With a classifier you can read.

router · classifier · fallback
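A "classifier you can read" can be as plain as a scored heuristic with two thresholds. The sketch below is illustrative only: the cue words, the weights and the thresholds are hypothetical, and a production router would be fitted to labelled traffic rather than hand-set.

```python
from dataclasses import dataclass

# Hypothetical cue words that suggest a reasoning-heavy request.
REASONING_CUES = {"why", "compare", "prove", "derive"}


@dataclass
class Route:
    model: str   # "slm", "llm" or "fallback"
    score: float
    reason: str  # human-readable explanation of the decision


def difficulty_score(prompt: str) -> float:
    """Score in [0, 1]: up to 0.5 for length, plus 0.5 if a reasoning cue appears."""
    words = [w.strip(".,?!:;").lower() for w in prompt.split()]
    length_part = min(len(words) / 200, 1.0) * 0.5
    cue_part = 0.5 if REASONING_CUES.intersection(words) else 0.0
    return length_part + cue_part


def route(prompt: str, easy_below: float = 0.2, hard_above: float = 0.6) -> Route:
    """Cheap model below the easy threshold, big model above the hard one."""
    score = difficulty_score(prompt)
    if score < easy_below:
        return Route("slm", score, "short prompt, no reasoning cues")
    if score > hard_above:
        return Route("llm", score, "long prompt with reasoning cues")
    return Route("fallback", score, "ambiguous: between thresholds")
```

For example, `route("Reset my password").model` is `"slm"`, while a short question that starts with "Why" lands between the thresholds and goes to the fallback path. What the fallback does (for instance, try the small model and escalate on low confidence) is a separate design choice not shown here.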

Edge & on-device

llama.cpp, MLX, ONNX Runtime for laptop, mobile, kiosk. Sub-second inference where the network can't reach.

llama.cpp · MLX · ONNX

How we pick

A model selection grid we’ll walk with you.

Same conversation, every project: how hard is the task, how strict the latency, how sensitive the data. The grid falls out.

Constraint | Easy task | Medium task | Hard task | Reasoning-heavy
Latency < 300ms | SLM 1B | SLM 7B | Llama 8B + LoRA | Llama 70B (warm)
Volume-driven | Distilled SLM | Mistral 7B | Mixtral 8x7B | Llama 70B
Quality-critical | Llama 8B | Llama 70B | GPT-4o · Claude 3.5 | o1 · Claude 3.5
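The grid above reads naturally as a two-key lookup. The sketch below encodes the same cells; the key names (`"latency"`, `"volume"`, `"quality"` and the difficulty labels) are our own shorthand for illustration, and a real project would treat the result as a starting shortlist to be confirmed on the eval, not a final answer.

```python
# Rows: dominant constraint. Columns: task difficulty. Values: grid cells.
GRID = {
    "latency": {  # Latency < 300ms
        "easy": "SLM 1B",
        "medium": "SLM 7B",
        "hard": "Llama 8B + LoRA",
        "reasoning": "Llama 70B (warm)",
    },
    "volume": {  # Volume-driven
        "easy": "Distilled SLM",
        "medium": "Mistral 7B",
        "hard": "Mixtral 8x7B",
        "reasoning": "Llama 70B",
    },
    "quality": {  # Quality-critical
        "easy": "Llama 8B",
        "medium": "Llama 70B",
        "hard": "GPT-4o · Claude 3.5",
        "reasoning": "o1 · Claude 3.5",
    },
}


def pick_model(constraint: str, difficulty: str) -> str:
    """Return the grid's starting candidate; raises KeyError on unknown keys."""
    return GRID[constraint][difficulty]
```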

The model is a deployment decision, not a religion.

We start from the request mix, the latency budget and the data class — not from a model preference. Then we route, fine-tune and distill until the unit economics work.

  • 01
    Profile your traffic

    Cluster real requests by difficulty. 70–85% are usually easier than you assume — that's SLM territory.

  • 02
    Run the bake-off on your evals

    Side-by-side scoring on the rubric you care about. Quality numbers in the design doc, not in marketing copy.

  • 03
    Fine-tune to close the gap

    A 7B LoRA often matches a 70B base on your specific task at roughly a tenth of the cost. We measure the trade.

  • 04
    Route in production

    Cheap by default, escalate on uncertainty. The big model becomes a safety net, not the main floor.

Tech stack

Tooling for the full model lifecycle.

From a Friday-night benchmarking notebook to a Monday-morning production rollout.

Models we work with

Llama 3.1 / 3.3 · Mistral / Mixtral · Qwen 2.5 · Phi-3 · DeepSeek · Gemma

Fine-tuning

Axolotl · Unsloth · HF PEFT · LoRA / QLoRA · DPO / ORPO

Compression

GGUF · AWQ · GPTQ · SmoothQuant · Knowledge distill

Serving

vLLM · TGI · TensorRT-LLM · Triton · Ray Serve

Eval

Promptfoo · Braintrust · lm-eval-harness · MTEB

Edge runtime

llama.cpp · MLX · ONNX Runtime · Core ML · TFLite

Compute

H100 / H200 · A100 · L40S · Modal · Lambda Labs

Routing

Custom classifier · RouteLLM · Martian · Portkey

From vision to victory

From benchmark to production, in weeks.

A short, evidence-driven path from “which model?” to “ship and route.”

01
Week 1
Build the eval

Domain rubric, 60–150 representative prompts, scoring scaffold. The artifact that decides every model trade.

02
Week 2
Bake-off

8–12 candidate models scored on cost, latency and quality. Pareto front in a single chart.

03
Week 3
Fine-tune

LoRA / QLoRA on the best small candidate. Recover the quality gap, keep the cost win.

04
Week 4
Compress & serve

Quantize, deploy on vLLM, throughput-tune. Latency budget enforced in CI.

05
Ongoing
Route & iterate

Production traffic feeds the eval. Quarterly bake-offs as the model landscape shifts.
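The "Pareto front in a single chart" from the week 2 bake-off is the set of candidates that no other candidate beats on cost, latency and quality at once. A minimal sketch of that filter follows; the candidate names and all numbers are made up for illustration and are not results from any project.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    cost: float        # $ per 1k calls, lower is better
    latency_ms: float  # lower is better
    quality: float     # eval score, higher is better


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (
        a.cost <= b.cost and a.latency_ms <= b.latency_ms and a.quality >= b.quality
    )
    strictly_better = (
        a.cost < b.cost or a.latency_ms < b.latency_ms or a.quality > b.quality
    )
    return no_worse and strictly_better


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [
        c for c in candidates if not any(dominates(other, c) for other in candidates)
    ]


# Illustrative numbers only.
CANDIDATES = [
    Candidate("small-tuned", cost=1.0, latency_ms=120, quality=0.86),
    Candidate("mid-base", cost=3.0, latency_ms=200, quality=0.84),
    Candidate("large", cost=12.0, latency_ms=900, quality=0.93),
]
FRONT = pareto_front(CANDIDATES)
```

With these sample numbers the mid-sized base model drops out: the tuned small model is cheaper, faster and scores higher, so only the small and large candidates remain as real trade-offs.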

Where this pays back

Three ways to get from a big model to a good one.

Patterns where right-sizing turned an unsustainable AI bill into healthy unit economics.

Pattern · SaaS · Routing

Most calls didn't need GPT-4.

Replaced a single-model GPT-4 setup with a router + fine-tuned Llama 8B for 78% of traffic. Quality flat, cost down 74%.

−74% $ / 1M req
0 reg. On golden set
Llama 8B + LoRA · Portkey · vLLM
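The arithmetic behind a routing saving like this one can be checked in a few lines. The sketch below assumes, purely for illustration, that the 22% of traffic not routed to the small model stays on the original model at its original per-call cost, and that router overhead is negligible; neither assumption is stated in the pattern above.

```python
def blended_cost(share_small: float, small_cost_ratio: float) -> float:
    """Blended cost per call, relative to sending everything to the big model.

    small_cost_ratio is the small model's per-call cost as a fraction
    of the big model's per-call cost.
    """
    return (1 - share_small) * 1.0 + share_small * small_cost_ratio


def implied_small_cost_ratio(share_small: float, target_reduction: float) -> float:
    """Small-model cost ratio needed to hit a target overall cost reduction."""
    target_cost = 1 - target_reduction
    return (target_cost - (1 - share_small)) / share_small


# 78% of traffic on the small model, 74% overall cost reduction.
RATIO = implied_small_cost_ratio(share_small=0.78, target_reduction=0.74)
```

Under those assumptions the small model would need to cost roughly 5% of the big model per call. The remaining big-model traffic alone accounts for 22 of the 26 points of residual cost, which is why the routed share matters as much as the small model's price.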
Pattern · Retail · Edge

Kiosks that don't need the cloud.

Distilled a 70B teacher into a 1.3B student running on store-floor kiosks. Sub-second responses, zero network dependency.

1.3B Student model
620 ms End-to-end
llama.cpp · GGUF Q4 · NUC hardware
Pattern · Legal · Domain

A small model that beats the big one — on contracts.

QLoRA-tuned Mistral 7B on a curated contract corpus. Beats GPT-4o on the client's clause-extraction eval, runs on one A100.

+12 pts F1 vs. GPT-4o
1× A100 Inference
Mistral 7B · QLoRA · vLLM

Why ETY

Engineers who’ve been inside the model.

27 Models we’ve fine-tuned and deployed across production.
3.4× Median throughput improvement on the same hardware.
−74% Median cost-per-call reduction after routing + fine-tune.
11 Open-source contributions on PEFT, quantization and eval.

Pay for the right intelligence.

Send us a sample of your real prompts and your latency budget. We’ll come back with a model shortlist, a cost projection, and the eval we’d score them on.