sahildahiya.me / home

I'm Sahil Dahiya. I build AI evaluation systems and agent runtimes. Most recently at Workleap, previously five years at Microsoft.

Most of what I've shipped is on the eval side — frameworks, A/B tests on model choices, traces that survive a postmortem. I run two projects of my own to keep the work close to the runtime.

Background

At Workleap I worked on AI assistants and the systems that made them measurable: routing, RAG pipelines, eval frameworks, A/B testing on model choices, and conversation anonymization for product analysis.

Before that, five years at Microsoft on anomaly detection, forecasting, experimentation, and device health systems across Windows and data center ops.

Two projects of my own

Tapasya

A retrieval-augmented reading and writing environment for philosophy. Search across Nietzsche passages, inspect cited sources, continue a conversation at the passage level, and move into essay mode without leaving the same thinking workflow.

Right now: search across philosophical texts, passage-level conversation, and an essay mode that keeps citations attached to every paragraph.

FastAPI / HTMX / Claude API / Voyage AI
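The retrieval step behind passage search can be sketched roughly as a nearest-neighbor lookup over embedded passages. This is a minimal, self-contained illustration with hand-written toy vectors; the real project embeds passages with Voyage AI, and all names and data here are illustrative assumptions, not the project's actual code.

```python
import math

# Toy in-memory index mapping passage text -> embedding vector.
# In practice the vectors would come from an embedding API (here: stand-ins).
PASSAGES = {
    "What does not kill me makes me stronger.": [0.9, 0.1, 0.0],
    "God is dead. God remains dead. And we have killed him.": [0.1, 0.9, 0.2],
    "He who has a why to live can bear almost any how.": [0.7, 0.3, 0.1],
}

def cosine(a, b):
    # Cosine similarity: dot product normalized by vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, k=2):
    """Return the top-k passages ranked by cosine similarity to the query."""
    ranked = sorted(
        PASSAGES.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```

A production version would swap the dict for a vector store and attach each hit's citation metadata so it can follow the passage into essay mode.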

just-another-coding-agent

A terminal coding agent built around a Python runtime, JSON-over-stdio RPC, and a first-party Go interface. The project is focused on keeping the backend contract explicit, the TUI thin, and the runtime strict enough to support real coding sessions without fallback-heavy behavior. It also has a public Terminal-Bench 2 submission: a GLM-5 run that validated at 47.4% accuracy.

Python runtime, Go TUI, JSON-over-stdio between them. First public Terminal-Bench 2 submission validated at 47.4% with GLM-5.

Python / PydanticAI / Go / Bubble Tea
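The JSON-over-stdio contract between the runtime and the TUI can be sketched as newline-delimited JSON messages, one per line, so framing stays trivial on both ends. This is a hedged illustration of the general pattern; the field names and message shape below are assumptions for the sketch, not the project's actual wire format.

```python
import json

def encode_rpc(method, params, msg_id):
    """Serialize one request as a single newline-terminated JSON line.

    Newline-delimited JSON means the reader can frame messages by
    splitting on '\n' instead of parsing length prefixes.
    """
    return json.dumps({"id": msg_id, "method": method, "params": params}) + "\n"

def decode_rpc(line):
    """Parse one line back into a message dict, rejecting malformed input.

    A strict runtime fails loudly on a bad message rather than guessing.
    """
    msg = json.loads(line)
    for field in ("id", "method", "params"):
        if field not in msg:
            raise ValueError(f"missing field: {field}")
    return msg
```

In this shape the Go TUI only encodes requests and renders responses; all tool execution stays on the Python side of the pipe.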

Logs

Short notes I keep so model choices, eval runs, and deployment decisions stay findable later.

Elsewhere

Want to talk about AI evaluation, agent runtimes, or anything above? Email's on my LinkedIn.