MLOps engineer building reliable applied AI systems

Making AI workflows measurable after the demo stage.

I work on the production side of ML and AI: data pipelines, model services, retrieval systems, evaluation loops, observability, release gates, and the glue that makes models easier to operate. My flagship work is an enterprise-style RAG reliability platform with citations, eval gates, traces, reports, and a dashboard.

GitHub · Recruiter brief · LinkedIn · Email

6 featured repos checked locally

168 local tests across featured AI/MLOps repos

0 API keys required for core demos

Proof ledger · Project map · Enterprise RAG Brief

Selected work

Projects with code, docs, and clear tradeoffs.

Featured Python · FastAPI · Next.js

Enterprise RAG Reliability Platform

Local-first enterprise-style RAG reliability platform for MLOps runbooks and uploaded documents, with cited answers, provider comparison, eval diagnostics, traces, proof artifacts, and an operational dashboard.

Designed to run without API keys.
29 tests cover provider citation markers, secret-refusal behavior, PDF uploads, CORS configuration, and CLI smoke checks.
Tracked Playwright dashboard QA drives ingest, query, eval, console, and overflow checks on desktop/mobile.
Eval reports include latency, source coverage, cost, and pass/fail evidence.
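
A citation check of the kind described above can be sketched in a few lines. This is a minimal illustration, not the platform's code: the `[n]` marker format, the `check_citations` helper, and the definition of source coverage as "share of retrieved sources that the answer cites" are all assumptions.

```python
import re

# Assumed marker format: numeric citations such as [1] or [2].
CITATION_PATTERN = re.compile(r"\[(\d+)\]")


def check_citations(answer, num_sources):
    """Compare citation markers in an answer against the retrieved sources.

    `num_sources` is the number of retrieved chunks, numbered from 1.
    """
    cited = {int(n) for n in CITATION_PATTERN.findall(answer)}
    valid = {n for n in cited if 1 <= n <= num_sources}
    return {
        "cited": sorted(cited),
        # Markers that point at a source that was never retrieved.
        "invalid": sorted(cited - valid),
        # Hypothetical metric: fraction of retrieved sources actually cited.
        "source_coverage": len(valid) / num_sources if num_sources else 0.0,
        "has_citation": bool(cited),
    }


result = check_citations("Restart the worker [1], then drain the queue [3].", 2)
print(result)
```

With two retrieved sources, the marker `[3]` is reported as invalid and only one of the two sources counts as covered.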

Repo · Enterprise RAG note · Demo walkthrough · Case study · Verification

Applied AI Python · Next.js

Applied AI Eval Lab

Document intelligence workspace with retrieval, citations, evaluation metrics, answer-fact coverage, release-gate checks, experiment comparison, and a live static dashboard.

21 backend tests, frontend audit/typecheck/static export, demo data checks, and tracked desktop/mobile demo QA.
Writes JSON/Markdown report artifacts with report detail endpoints.
Core verification is Docker-free; Docker smoke separately verifies indexing, grounded query, eval gate, reports, CORS, and dashboard readiness.
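
Answer-fact coverage and a release gate can be illustrated with a small sketch. The substring matching rule, the `release_gate` helper, and the 0.8 threshold below are assumptions made for the example, not the lab's actual scoring logic.

```python
import json


def answer_fact_coverage(answer, expected_facts):
    """Share of expected facts found in the answer (case-insensitive substring match)."""
    if not expected_facts:
        return 1.0
    text = answer.lower()
    found = [fact for fact in expected_facts if fact.lower() in text]
    return len(found) / len(expected_facts)


def release_gate(cases, min_coverage=0.8):
    """Average coverage over eval cases and compare it to a hypothetical threshold."""
    scores = [answer_fact_coverage(c["answer"], c["expected_facts"]) for c in cases]
    mean = sum(scores) / len(scores)
    return {
        "mean_fact_coverage": round(mean, 3),
        "min_coverage": min_coverage,
        "passed": mean >= min_coverage,
    }


cases = [
    {
        "answer": "Rollbacks are triggered when the error rate exceeds 5%.",
        "expected_facts": ["rollback", "error rate exceeds 5%", "within 10 minutes"],
    },
    {"answer": "Indexes rebuild nightly.", "expected_facts": ["nightly"]},
]

# A JSON artifact like this could be written to disk next to a Markdown summary.
print(json.dumps(release_gate(cases), indent=2))
```

Substring matching is deliberately crude. It keeps the gate deterministic, which is the point of running it before any model-graded scoring is added.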

Repo · Live demo · Screenshot · Case study · Verification

Retrieval eval Python · RAG

RAG Forge

Retrieval benchmark runner for comparing chunking, embedding, dense, BM25, hybrid retrieval, and reranking choices with Markdown/JSON reports and a regression gate for retrieval-quality changes.

37 tests cover ranking, gates, reports, and E5 query embedding behavior.
Blocks quality drops beyond configured hit-rate, MRR, and latency thresholds.
Sample check reruns the 24-configuration benchmark and self-comparison gate.
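
The metrics and gate named above can be sketched as follows. Hit rate and MRR follow their standard definitions; the `GateThresholds` fields, their default values, and the metric dictionary keys are hypothetical and do not reflect RAG Forge's real configuration.

```python
from dataclasses import dataclass


def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(
        1
        for ranked, relevant in zip(ranked_ids, relevant_ids)
        if any(doc in relevant for doc in ranked[:k])
    )
    return hits / len(ranked_ids)


def mean_reciprocal_rank(ranked_ids, relevant_ids):
    """Mean of 1/rank of the first relevant document (0 when none is retrieved)."""
    total = 0.0
    for ranked, relevant in zip(ranked_ids, relevant_ids):
        for rank, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_ids)


@dataclass
class GateThresholds:
    # Hypothetical names and defaults, chosen only for this sketch.
    max_hit_rate_drop: float = 0.02
    max_mrr_drop: float = 0.02
    max_latency_increase_ms: float = 50.0


def regression_gate(baseline, candidate, thresholds):
    """Return the names of metrics that regressed; an empty list means pass."""
    failures = []
    if baseline["hit_rate"] - candidate["hit_rate"] > thresholds.max_hit_rate_drop:
        failures.append("hit_rate")
    if baseline["mrr"] - candidate["mrr"] > thresholds.max_mrr_drop:
        failures.append("mrr")
    if candidate["latency_ms"] - baseline["latency_ms"] > thresholds.max_latency_increase_ms:
        failures.append("latency_ms")
    return failures


ranked = [["a", "b", "c"], ["d", "e", "f"]]
relevant = [{"b"}, {"x"}]
baseline = {
    "hit_rate": hit_rate_at_k(ranked, relevant, k=2),
    "mrr": mean_reciprocal_rank(ranked, relevant),
    "latency_ms": 20.0,
}
# Comparing a run against itself should always pass the gate.
print(regression_gate(baseline, baseline, GateThresholds()))
```

A self-comparison is a useful smoke test for the gate itself: if it ever fails, the problem is in the gate rather than in retrieval quality.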

Repo · Case study · Sample benchmark · Regression gate · Verification

Inference serving Python · FastAPI

StreamInfer

Local inference-serving project with adaptive batching, backpressure, model hot-swap, metrics, load-test reports, and LLM-style benchmark sweeps.

Benchmark sweep compares batch size and timeout tradeoffs with JSON/Markdown reports.
40 tests cover serving, backpressure, benchmark gates, and recommendation stability.
Docker smoke verifies container health, prediction, hot-swap, and metrics paths.
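
One common way to combine batching with backpressure is a size-or-timeout micro-batcher in front of a bounded queue. The sketch below uses only `asyncio`; the `MicroBatcher` class, its parameters, and the doubling "model" are illustrative assumptions, not StreamInfer's implementation.

```python
import asyncio


class QueueFullError(Exception):
    """Raised when the bounded request queue is at capacity (backpressure)."""


class MicroBatcher:
    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01, max_queue=64):
        self._predict_batch = predict_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue = asyncio.Queue(maxsize=max_queue)

    async def submit(self, item):
        """Enqueue one request and wait for its result; reject when the queue is full."""
        future = asyncio.get_running_loop().create_future()
        try:
            self._queue.put_nowait((item, future))
        except asyncio.QueueFull:
            raise QueueFullError("request queue is full") from None
        return await future

    async def run(self):
        """Flush a batch when it is full or when max_wait_s has elapsed."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            try:
                outputs = self._predict_batch([item for item, _ in batch])
            except Exception as exc:
                # Propagate model failures to every caller in the batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)


async def demo():
    # Stand-in "model": doubles each input.
    batcher = MicroBatcher(lambda xs: [x * 2 for x in xs], max_batch_size=4)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(6)))
    worker.cancel()
    return results


print(asyncio.run(demo()))
```

Under light load the timeout dominates and batches stay small; under heavy load batches fill before the timeout expires. That is the batch-size versus timeout tradeoff a benchmark sweep would explore.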

Repo · Case study · Sample sweep · Benchmark gate · Verification

Reliability Python · ML validation

MLGuard

Pre-deployment checks for drift, performance regression, and latency regression before shipping model changes.

26 tests cover CLI behavior, report summaries, regression checks, and action metadata.
CLI help and action metadata both avoid advertising unsupported PyTorch artifacts.
A missing baseline fails fast unless drift-only mode is explicitly requested.
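
The three checks and the fail-fast rule can be sketched with the standard library alone. Everything here is an assumption made for illustration: the population stability index as the drift measure, the 0.2 PSI threshold (a common rule of thumb), the `run_checks` signature, and the metric names are not MLGuard's actual API.

```python
import math


class MissingBaselineError(Exception):
    """Raised when regression checks are requested without baseline metrics."""


def population_stability_index(reference, current, bins=10):
    """PSI between two numeric samples, binned on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0

    def proportions(values):
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(count / len(values), 1e-6) for count in counts]

    ref, cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def run_checks(reference, current, baseline=None, candidate=None, drift_only=False,
               psi_threshold=0.2, max_accuracy_drop=0.01, max_latency_ratio=1.10):
    """Run drift, performance, and latency checks; fail fast on missing metrics."""
    if not drift_only and (baseline is None or candidate is None):
        raise MissingBaselineError(
            "baseline and candidate metrics are required unless drift_only=True"
        )
    report = {"psi": population_stability_index(reference, current)}
    report["drift_ok"] = report["psi"] <= psi_threshold
    if not drift_only:
        drop = baseline["accuracy"] - candidate["accuracy"]
        report["performance_ok"] = drop <= max_accuracy_drop
        report["latency_ok"] = (
            candidate["latency_ms"] <= baseline["latency_ms"] * max_latency_ratio
        )
    report["passed"] = all(ok for key, ok in report.items() if key.endswith("_ok"))
    return report


reference = [i / 100 for i in range(100)]
shifted = [0.9 + i / 1000 for i in range(100)]
print(run_checks(reference, shifted, drift_only=True)["passed"])
```

Raising on a missing baseline, rather than silently skipping the regression checks, keeps a misconfigured pipeline from reporting a pass it never earned.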

Repo · Case study · Sample report · Verification

MLOps Python · Docker · Kubernetes

MLOps End-to-End Pipeline

Customer churn pipeline covering data ingestion, model training, FastAPI serving, monitoring, Docker, and Kubernetes-oriented deployment structure.

15 tests cover API behavior, request validation, data cleaning, and feature prep.
Local verification includes strict lint/format checks, training import, and Prometheus parsing without requiring Docker.
Optional Docker/Compose checks verify container config, health, and prediction paths.

Repo · Case study · Verification

Open source Ray · LightEval · BentoML

Open Source PRs

Open upstream PRs proposing focused fixes in AI infrastructure and evaluation tooling: RLlib documentation, LightEval typing, and BentoML server/model/testing docs.

Ray PR · LightEval PR · BentoML server PR · BentoML model PR · BentoML testing PR

Supporting systems work

Smaller projects that show platform thinking.

Experimental Python · CLI · LLM routing

Prism CLI

Experimental model-routing CLI for exploring provider selection, local tool execution, cost tracking, project memory, and security-aware command boundaries.

Local verification passes Ruff, 5,817 tests, and CLI smoke checks.
Public status doc lists non-gating mypy, Bandit, format, and live-provider gaps.

Repo · Showcase status

Writing

Notes and case studies from building AI systems.

Flagship

Enterprise RAG Reliability Platform

A short map of the flagship RAG platform: ingestion, citations, eval gates, traces, reports, dashboard, live demo, and an honest account of where the portfolio project's scope ends.

Read note

AI Reliability Platform

What I Learned Building Evals Before Adding an LLM

A short engineering note on why retrieval, evidence, refusal behavior, and regression checks should be visible before adding a stronger model provider.

Read note

Local verification

Verifying AI Systems Without API Keys

A practical note on deterministic baselines, report artifacts, browser QA, and keeping Docker smoke checks separate from core local verification.

Read note

Applied AI Eval Lab

Evaluation Workspace Case Study

A document intelligence walkthrough covering grounded answers, citations, answer-fact coverage, release gates, and static demo verification.

Read case study

RAG Forge

RAG Forge Case Study

A retrieval benchmark walkthrough covering the problem, benchmark grid, regression gate, sample results, and why the first version stays retrieval-only.

Read case study

StreamInfer

Inference Serving Case Study

A local serving walkthrough covering adaptive batching, backpressure, benchmark gates, and why the sample stays synthetic and keyless.

Read case study

MLGuard

Release Gate Case Study

A release-gate walkthrough for checking drift, performance regression, latency regression, report artifacts, and trusted model boundaries.

Read case study

MLOps Pipeline

Model Lifecycle Case Study

A churn-model workflow covering ingestion, training, serving, Prometheus metrics, Docker smoke checks, and honest portfolio-scale boundaries.

Read case study

How I work

Reliability habits I try to build into ML systems.

Start with a baseline

Use deterministic behavior first so failures are visible before model variance enters.

Carry evidence forward

Answers, metrics, and deployment decisions should point back to source data or eval results.

Make regressions cheap to catch

Small eval suites, release gates, and local workflows make iteration safer.

Document tradeoffs

Good systems explain why a choice was made, not only what code was written.

Contact

Interested in AI systems that are measurable, reliable, and useful.

ketgop2@gmail.com · GitHub · LinkedIn