MLOps engineer building reliable applied AI systems

Making AI workflows measurable after the demo stage.

I work on the production side of ML and AI: data pipelines, model services, retrieval systems, evaluation loops, observability, release gates, and the glue that makes models easier to operate. My flagship work is an enterprise-style RAG reliability platform with citations, eval gates, traces, reports, and a dashboard.

6 featured repos checked locally
168 local tests across featured AI/MLOps repos
0 API keys required for core demos

Proof ledger · Project map · Enterprise RAG Brief

Selected work

Projects with code, docs, and clear tradeoffs.

Applied AI Python · Next.js

Applied AI Eval Lab

Document intelligence workspace with retrieval, citations, evaluation metrics, answer-fact coverage, release-gate checks, experiment comparison, and a live static dashboard.

  • 21 backend tests, frontend audit/typecheck/static export, demo data checks, and tracked desktop/mobile demo QA.
  • Writes JSON/Markdown report artifacts with report detail endpoints.
  • Core verification is Docker-free; Docker smoke separately verifies indexing, grounded query, eval gate, reports, CORS, and dashboard readiness.
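
One plausible reading of "answer-fact coverage" is the share of expected facts that actually appear in a generated answer. The sketch below assumes that reading. The names `normalize` and `answer_fact_coverage`, the sample facts, and the 0.8 threshold are all hypothetical and are not taken from the project.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation so matching ignores formatting."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def answer_fact_coverage(answer, expected_facts):
    """Fraction of expected facts found verbatim (after normalizing) in the answer."""
    if not expected_facts:
        return 1.0
    # Pad with spaces so "ship" does not match inside "ships".
    haystack = f" {normalize(answer)} "
    covered = [f for f in expected_facts if f" {normalize(f)} " in haystack]
    return len(covered) / len(expected_facts)


# Hypothetical eval case: two of the three expected facts are present.
answer = "The refund window is 30 days, and returns ship free."
facts = ["refund window", "30 days", "ships free"]
coverage = answer_fact_coverage(answer, facts)

# Hypothetical release-gate rule: block when coverage falls below 0.8.
gate_passed = coverage >= 0.8
```

Exact phrase matching is deliberately strict and deterministic, which keeps the check runnable without an API key. A real suite would likely need a more forgiving matcher.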

Retrieval eval Python · RAG

RAG Forge

Retrieval benchmark runner for comparing chunking, embedding, dense, BM25, hybrid retrieval, and reranking choices with Markdown/JSON reports and a regression gate for retrieval-quality changes.

  • 37 tests cover ranking, gates, reports, and E5 query embedding behavior.
  • Blocks quality drops beyond configured hit-rate, MRR, and latency thresholds.
  • Sample check reruns the 24-configuration benchmark and self-comparison gate.
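
A regression gate on hit rate, MRR, and latency can be sketched in a few lines. This is a minimal illustration and not RAG Forge's actual code: `RetrievalMetrics`, `score_run`, `regression_gate`, and the default thresholds are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalMetrics:
    hit_rate: float    # fraction of queries with a relevant doc in the top k
    mrr: float         # mean reciprocal rank of the first relevant doc
    latency_ms: float  # mean retrieval latency per query


def score_run(ranked_ids, relevant_ids, latency_ms, k=5):
    """Compute hit rate@k and MRR@k from per-query ranked document ids."""
    if not ranked_ids:
        raise ValueError("need at least one query to score")
    hits = 0
    reciprocal_ranks = []
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        # 1-based rank of the first relevant document, or None if absent.
        rank = next(
            (i for i, doc in enumerate(ranked[:k], start=1) if doc in relevant),
            None,
        )
        if rank is not None:
            hits += 1
        reciprocal_ranks.append(1.0 / rank if rank is not None else 0.0)
    n = len(reciprocal_ranks)
    return RetrievalMetrics(hits / n, sum(reciprocal_ranks) / n, latency_ms)


def regression_gate(baseline, candidate, max_hit_rate_drop=0.02,
                    max_mrr_drop=0.02, max_latency_increase_ms=10.0):
    """Return the names of metrics that regressed beyond their threshold."""
    failures = []
    if baseline.hit_rate - candidate.hit_rate > max_hit_rate_drop:
        failures.append("hit_rate")
    if baseline.mrr - candidate.mrr > max_mrr_drop:
        failures.append("mrr")
    if candidate.latency_ms - baseline.latency_ms > max_latency_increase_ms:
        failures.append("latency_ms")
    return failures
```

Comparing a run against itself should always return an empty list, which makes a self-comparison a cheap sanity check on the gate.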

Inference serving Python · FastAPI

StreamInfer

Local inference-serving project with adaptive batching, backpressure, model hot-swap, metrics, load-test reports, and LLM-style benchmark sweeps.

  • Benchmark sweep compares batch size and timeout tradeoffs with JSON/Markdown reports.
  • 40 tests cover serving, backpressure, benchmark gates, and recommendation stability.
  • Docker smoke verifies container health, prediction, hot-swap, and metrics paths.
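
The batch-size and timeout tradeoff can be illustrated with a size-or-timeout batcher on a bounded queue. This is a simplified building block, not StreamInfer's implementation, and it does not adapt its parameters at runtime. `Batcher`, `Overloaded`, and every default value are hypothetical.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the queue is full and the request is shed."""


class Batcher:
    """Collects requests until the batch is full or the wait budget is spent.

    Construct it inside a running event loop so the queue binds to that loop
    on older Python versions.
    """

    def __init__(self, max_batch_size=4, max_wait_s=0.01, max_queue=8):
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue(maxsize=max_queue)

    def submit(self, item):
        # Backpressure: reject immediately instead of queueing without bound.
        try:
            self.queue.put_nowait(item)
        except asyncio.QueueFull:
            raise Overloaded("queue full; caller should retry later") from None

    async def next_batch(self):
        # Block for the first item, then fill the batch until the deadline.
        batch = [await self.queue.get()]
        loop = asyncio.get_running_loop()
        deadline = loop.time() + self.max_wait_s
        while len(batch) < self.max_batch_size:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                item = await asyncio.wait_for(self.queue.get(), timeout=remaining)
            except asyncio.TimeoutError:
                break
            batch.append(item)
        return batch
```

A larger `max_batch_size` tends to improve throughput, while a smaller `max_wait_s` caps the latency added to the first request in each batch. That is the tradeoff a benchmark sweep would measure.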

Reliability Python · ML validation

MLGuard

Pre-deployment checks for drift, performance regression, and latency regression before shipping model changes.

  • 26 tests cover CLI behavior, report summaries, regression checks, and action metadata.
  • CLI help and action metadata both avoid advertising unsupported PyTorch artifacts.
  • Missing baselines now fail fast unless drift-only mode is explicit.
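
The fail-fast behavior for missing baselines can be sketched alongside a simple drift check. This example assumes the Population Stability Index as the drift measure, which the project may not use. `psi`, `check_release`, the metric keys, and all thresholds are illustrative assumptions.

```python
import math


def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant samples

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    p, q = proportions(baseline), proportions(current)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def check_release(baseline_metrics, candidate_metrics, reference, live,
                  drift_only=False, max_psi=0.2, max_accuracy_drop=0.01,
                  max_latency_ratio=1.2):
    """Return failed check names; raise if a baseline is needed but missing."""
    if baseline_metrics is None and not drift_only:
        # Fail fast rather than silently skipping the regression checks.
        raise ValueError("baseline metrics missing; pass drift_only=True to skip")
    failures = []
    if psi(reference, live) > max_psi:
        failures.append("drift")
    if not drift_only:
        drop = baseline_metrics["accuracy"] - candidate_metrics["accuracy"]
        if drop > max_accuracy_drop:
            failures.append("performance_regression")
        limit = baseline_metrics["latency_ms"] * max_latency_ratio
        if candidate_metrics["latency_ms"] > limit:
            failures.append("latency_regression")
    return failures
```

Requiring an explicit `drift_only=True` means a misconfigured pipeline cannot pass the gate simply because its baseline file was never found.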

MLOps Python · Docker · Kubernetes

MLOps End-to-End Pipeline

Customer churn pipeline covering data ingestion, model training, FastAPI serving, monitoring, Docker, and Kubernetes-oriented deployment structure.

  • 15 tests cover API behavior, request validation, data cleaning, and feature prep.
  • Local verification includes strict lint/format checks, training import, and Prometheus parsing without requiring Docker.
  • Optional Docker/Compose checks verify container config, health, and prediction paths.

Supporting systems work

Smaller projects that show platform thinking.

Experimental Python · CLI · LLM routing

Prism CLI

Experimental model-routing CLI for exploring provider selection, local tool execution, cost tracking, project memory, and security-aware command boundaries.

  • Local verification passes Ruff, 5,817 tests, and CLI smoke checks.
  • Public status doc lists non-gating mypy, Bandit, format, and live-provider gaps.

Writing

Notes and case studies from building AI systems.

Flagship

Enterprise RAG Reliability Platform

A short map of the flagship RAG platform: ingestion, citations, eval gates, traces, reports, dashboard, live demo, and an honest statement of its portfolio-scale boundaries.

Read note

AI Reliability Platform

What I Learned Building Evals Before Adding an LLM

A short engineering note on why retrieval, evidence, refusal behavior, and regression checks should be visible before adding a stronger model provider.

Read note

Local verification

Verifying AI Systems Without API Keys

A practical note on deterministic baselines, report artifacts, browser QA, and keeping Docker smoke checks separate from core local verification.

Read note

Applied AI Eval Lab

Evaluation Workspace Case Study

A document intelligence walkthrough covering grounded answers, citations, answer-fact coverage, release gates, and static demo verification.

Read case study

RAG Forge

RAG Forge Case Study

A retrieval benchmark walkthrough covering the problem, benchmark grid, regression gate, sample results, and why the first version stays retrieval-only.

Read case study

StreamInfer

Inference Serving Case Study

A local serving walkthrough covering adaptive batching, backpressure, benchmark gates, and why the sample stays synthetic and keyless.

Read case study

MLGuard

Release Gate Case Study

A release-gate walkthrough for checking drift, performance regression, latency regression, report artifacts, and trusted model boundaries.

Read case study

MLOps Pipeline

Model Lifecycle Case Study

A churn-model workflow covering ingestion, training, serving, Prometheus metrics, Docker smoke checks, and honest portfolio-scale boundaries.

Read case study

How I work

Reliability habits I try to build into ML systems.

01

Start with a baseline

Use deterministic behavior first so failures are visible before model variance enters.

02

Carry evidence forward

Answers, metrics, and deployment decisions should point back to source data or eval results.

03

Make regressions cheap to catch

Small eval suites, release gates, and local workflows make iteration safer.

04

Document tradeoffs

Good systems explain why a choice was made, not only what code was written.

Contact

Interested in AI systems that are measurable, reliable, and useful.