kaffeeundkuchen‑claw

A terminal AI coding agent with a human in the loop. It reads, plans, and edits a codebase, then shows you a diff before anything touches disk.


What it is

You give it a goal in plain language. It uses tools to read and search your project, then stages the file changes it wants to make. Nothing is written until you review the diff and approve it. The same engine also runs as a Telegram bot, so you can drive it from your phone.

Built on Bun and TypeScript, with the Vercel AI SDK calling a model through OpenRouter.

Modes

Agent

Give it a goal. It edits files and runs commands, staging every change for your approval.

Ask

Ask a question about the codebase and get a grounded answer. It only reads, never writes.

Plan

Turn a goal into a short numbered plan, pick the steps you want, then run them.
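
As a rough illustration of the plan flow, the sketch below parses a numbered plan and picks steps from a selection string such as "1,3" or "2-3". The function names and the selection syntax are assumptions made for this example, not taken from the project.

```typescript
// Hypothetical helpers: parsePlan and selectSteps are illustrative names.

// Keep only lines that look like "1. do something" or "2) do something".
function parsePlan(text: string): string[] {
  const steps: string[] = [];
  for (const line of text.split("\n")) {
    const m = /^\s*\d+[.)]\s+(.*\S)\s*$/.exec(line);
    if (m) steps.push(m[1] ?? "");
  }
  return steps;
}

// Accept a selection like "1,3" or "2-3" (1-based) and return those steps in order.
function selectSteps(steps: string[], selection: string): string[] {
  const picked = new Set<number>();
  for (const part of selection.split(",")) {
    const m = /^\s*(\d+)(?:\s*-\s*(\d+))?\s*$/.exec(part);
    if (!m) throw new Error(`bad selection: "${part.trim()}"`);
    const from = Number(m[1]);
    const to = m[2] === undefined ? from : Number(m[2]);
    for (let i = from; i <= to; i++) {
      if (i < 1 || i > steps.length) throw new Error(`no step ${i}`);
      picked.add(i);
    }
  }
  return Array.from(picked)
    .sort((a, b) => a - b)
    .map((i) => steps[i - 1] ?? "");
}

const plan = parsePlan("1. Add a config loader\n2. Wire it into main\n3. Write tests");
const chosen = selectSteps(plan, "1,3");
console.log(chosen);
```

Only the chosen steps would then be handed to the agent, one at a time.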

Telegram

Run the agent, ask, and plan flows from a chat, with approve and reject buttons.

How it works

The defining idea: the model never writes to disk directly. Changes are staged, shown as a diff, and applied only after you approve.

Goal. You describe what you want in plain language.
Tools. The agent reads, searches, and lists files to understand the project.
Stage. Each create, edit, delete, or command is kept in an overlay and recorded in an action log, marked pending.
Review. You see a unified diff and either approve or reject.
Apply. The one step that writes to disk, and only for what you approved.
Report. A one-line summary shows LLM calls, tokens, and the real billed cost for the run.
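
The stage, review, and apply steps above can be sketched as a small overlay with an action log. This is a minimal sketch under assumptions: the names (StagedAction, Overlay), the in-memory Map standing in for the disk, and the omission of staged commands are all choices made for the example, not the project's actual types.

```typescript
// Hypothetical types: none of these names come from the project source.
type ActionKind = "create" | "edit" | "delete";
type Status = "pending" | "approved" | "rejected" | "applied";

interface StagedAction {
  id: number;
  kind: ActionKind;
  path: string;
  content: string | undefined; // new file content for create/edit
  status: Status;
}

class Overlay {
  private log: StagedAction[] = [];
  private nextId = 1;

  // Stage: record the change in the action log as pending. Nothing is written.
  stage(kind: ActionKind, path: string, content?: string): StagedAction {
    const action: StagedAction = { id: this.nextId++, kind, path, content, status: "pending" };
    this.log.push(action);
    return action;
  }

  // Review: the user approves or rejects a pending action.
  decide(id: number, status: "approved" | "rejected"): void {
    const action = this.log.find((a) => a.id === id);
    if (!action) throw new Error(`unknown action ${id}`);
    if (action.status !== "pending") throw new Error(`action ${id} already decided`);
    action.status = status;
  }

  // Apply: the only method that touches the backing store, and only for
  // approved actions. Returns how many actions were written.
  apply(disk: Map<string, string>): number {
    let written = 0;
    for (const a of this.log) {
      if (a.status !== "approved") continue;
      if (a.kind === "delete") disk.delete(a.path);
      else disk.set(a.path, a.content ?? "");
      a.status = "applied";
      written++;
    }
    return written;
  }

  pending(): StagedAction[] {
    return this.log.filter((a) => a.status === "pending");
  }
}
```

Because apply is the single write path, rejected or undecided actions cannot reach the store by accident.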

Engineering

Tested

A unit suite covers path safety, the staging and apply logic, globbing, and the benchmark scoring.

Continuous integration

Type checking, linting, and tests run on every push through GitHub Actions.

Containerized

A small image runs the whole thing with a single command.

Strict types

TypeScript in strict mode, with unused code flagged.

Sandboxed

Every path is resolved inside the workspace and rejected if it tries to escape.
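
One common way to enforce this kind of rule is to resolve the requested path against the workspace root and check the relative result. The sketch below assumes a hypothetical helper name, resolveInWorkspace, and does not follow symlinks, which a real check would probably also need to handle.

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a requested path against the workspace root
// and throw if the result lands outside it.
function resolveInWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  const target = path.resolve(root, requested);
  const rel = path.relative(root, target);
  // An escaping path shows up as ".." or "../..." relative to the root
  // (or as an absolute path on Windows when the drive differs).
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`path escapes workspace: ${requested}`);
  }
  return target;
}
```

Under this sketch, "src/index.ts" and "a/../b.ts" resolve inside the workspace, while "../secret" and an absolute path outside the root are rejected.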

Resilient

Model and network failures become short, clear messages instead of crashes.
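
A rough sketch of how failures could be turned into one-line messages. The function name, the status property on the error, and the message wording are assumptions made for this example. The TypeError branch relies on fetch rejecting with a TypeError on network failure.

```typescript
// Hypothetical mapping from a thrown value to a short user-facing message.
function toUserMessage(err: unknown): string {
  // Assumption: HTTP-style errors carry a numeric `status` property.
  const status =
    typeof err === "object" && err !== null && "status" in err
      ? Number((err as { status: unknown }).status)
      : undefined;

  if (status === 401) return "Authentication failed. Check your API key.";
  if (status === 429) return "Rate limited by the model provider. Try again shortly.";
  if (status !== undefined && status >= 500) return "The model provider had a server error. Try again.";
  if (err instanceof TypeError) return "Network error. Check your connection.";
  // Fall back to the first line of the error message, without the stack.
  if (err instanceof Error) return `Request failed: ${err.message.split("\n")[0]}`;
  return "Unknown error.";
}
```

A caller would wrap each model request in try/catch, print the returned message, and return to the prompt rather than let the exception end the process.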

Observability

Each run reports the tokens used and the real billed cost from OpenRouter, so the figures are correct for any model automatically.
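
As an illustration, the per-run summary could be built by adding up per-call usage records. The CallUsage shape and its field names are assumptions made for this sketch. They are not OpenRouter's or the Vercel AI SDK's actual response fields.

```typescript
// Hypothetical per-call usage record; field names are illustrative only.
interface CallUsage {
  inputTokens: number;
  outputTokens: number;
  costUsd: number; // billed cost as reported by the provider for this call
}

// Fold all calls of a run into a one-line summary.
function summarizeRun(calls: CallUsage[]): string {
  const tokens = calls.reduce((n, c) => n + c.inputTokens + c.outputTokens, 0);
  const cost = calls.reduce((n, c) => n + c.costUsd, 0);
  const noun = calls.length === 1 ? "LLM call" : "LLM calls";
  return `${calls.length} ${noun}, ${tokens} tokens, $${cost.toFixed(4)}`;
}

console.log(
  summarizeRun([
    { inputTokens: 1200, outputTokens: 300, costUsd: 0.0021 },
    { inputTokens: 800, outputTokens: 150, costUsd: 0.0013 },
  ]),
);
// prints "2 LLM calls, 2450 tokens, $0.0034"
```

Summing the cost the provider reports per call, rather than multiplying tokens by a local price table, is what keeps the figure correct when the model changes.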

Evaluated

A benchmark scores the agent across seeded tasks and reports a success rate.

Measured, not guessed

Most hobby agents never measure how well they work. This one keeps score. The benchmark runs the agent against a set of seeded tasks in isolated workspaces, auto-applies the changes, and checks the result, reporting pass rate, tokens, and real billed cost per task.
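
The scoring side of such a benchmark can be sketched as below. The TaskResult shape and the scoreBenchmark name are assumptions made for this example, and the step that actually runs the agent in an isolated workspace is left out.

```typescript
// Hypothetical result record for one seeded task.
interface TaskResult {
  name: string;
  passed: boolean;
  tokens: number;
  costUsd: number;
}

interface BenchmarkScore {
  total: number;
  passed: number;
  passRate: number; // fraction between 0 and 1
  tokens: number;
  costUsd: number;
}

// Fold per-task results into a pass rate plus token and cost totals.
function scoreBenchmark(results: TaskResult[]): BenchmarkScore {
  const passed = results.filter((r) => r.passed).length;
  return {
    total: results.length,
    passed,
    // An empty run scores 0 rather than dividing by zero.
    passRate: results.length === 0 ? 0 : passed / results.length,
    tokens: results.reduce((n, r) => n + r.tokens, 0),
    costUsd: results.reduce((n, r) => n + r.costUsd, 0),
  };
}

// Made-up task names and numbers, for illustration only.
const score = scoreBenchmark([
  { name: "rename-function", passed: true, tokens: 1000, costUsd: 0.002 },
  { name: "add-cli-flag", passed: false, tokens: 3000, costUsd: 0.006 },
]);
console.log(`${score.passed}/${score.total} passed (${(score.passRate * 100).toFixed(0)}%)`);
// prints "1/2 passed (50%)"
```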

bun start        run the agent
bun test         run the unit suite (50 tests)
bun run eval     run the benchmark

Built with

Bun, TypeScript, Vercel AI SDK, OpenRouter, Firecrawl, Telegraf, Biome, Commander