kaffeeundkuchen‑claw
A terminal AI coding agent with a human in the loop. It reads, plans, and edits a codebase, then shows you a diff before anything touches disk.
What it is
You give it a goal in plain language. It uses tools to read and search your project, then stages the file changes it wants to make. Nothing is written until you review the diff and approve it. The same engine also runs as a Telegram bot, so you can drive it from your phone.
Built on Bun and TypeScript, with the Vercel AI SDK over an OpenRouter model.
Agent
Give it a goal. It edits files and runs commands, staging every change for your approval.
Ask
Ask a question about the codebase and get a grounded answer. It only reads, never writes.
Plan
Turn a goal into a short numbered plan, pick the steps you want, then run them.
Telegram
Run the agent, ask, and plan flows from a chat, with approve and reject buttons.
How it works
The defining idea: the model never writes to disk directly. Changes are staged, shown as a diff, and applied only after you approve.
- Goal. You describe what you want in plain language.
- Tools. The agent reads, searches, and lists files to understand the project.
- Stage. Each create, edit, delete, or command is kept in an overlay and recorded in an action log, marked pending.
- Review. You see a unified diff and either approve or reject.
- Apply. The one step that writes to disk, and only for what you approved.
- Report. A one-line summary shows LLM calls, tokens, and the real billed cost for the run.
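The stage, review, and apply steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `Staging` class, the `Action` and `LogEntry` types, and the in-memory `Disk` map are all hypothetical names, chosen so the example runs without touching real files.

```typescript
// Hypothetical sketch of a staging overlay with an action log.
// None of these names come from the project itself.
type Action =
  | { kind: "write"; path: string; content: string }
  | { kind: "delete"; path: string };

type Status = "pending" | "approved" | "rejected";

interface LogEntry {
  id: number;
  action: Action;
  status: Status;
}

// Stand-in for the real filesystem so the sketch has no side effects.
type Disk = Map<string, string>;

class Staging {
  private disk: Disk;
  private log: LogEntry[] = [];
  private nextId = 1;

  constructor(disk: Disk) {
    this.disk = disk;
  }

  // Stage a change: record it as pending and write nothing.
  stage(action: Action): number {
    const id = this.nextId++;
    this.log.push({ id, action, status: "pending" });
    return id;
  }

  // Read through the overlay: staged changes shadow what is on disk.
  read(path: string): string | undefined {
    let value = this.disk.get(path);
    for (const entry of this.log) {
      const action = entry.action;
      if (entry.status === "rejected" || action.path !== path) continue;
      value = action.kind === "write" ? action.content : undefined;
    }
    return value;
  }

  // Record the reviewer's decision for one staged action.
  review(id: number, status: "approved" | "rejected"): void {
    const entry = this.log.find((e) => e.id === id);
    if (!entry) throw new Error(`unknown action ${id}`);
    entry.status = status;
  }

  // The only method that mutates the disk, and only for approved entries.
  apply(): number {
    let applied = 0;
    for (const entry of this.log) {
      if (entry.status !== "approved") continue;
      const action = entry.action;
      if (action.kind === "write") {
        this.disk.set(action.path, action.content);
      } else {
        this.disk.delete(action.path);
      }
      applied++;
    }
    this.log = this.log.filter((e) => e.status !== "approved");
    return applied;
  }
}
```

Keeping every mutation inside `apply` is the point of the design: any other code path can read the overlay freely, but cannot change the disk.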
Tested
A unit test suite covers path safety, staging and apply logic, globbing, and benchmark scoring.
Continuous integration
Type checking, linting, and tests run on every push through GitHub Actions.
Containerized
A small image runs the whole thing with a single command.
Strict types
TypeScript in strict mode, with unused code flagged.
Sandboxed
Every path is resolved inside the workspace and rejected if it tries to escape.
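One common way to implement this kind of check is to resolve the requested path against the workspace root and reject any result whose relative path climbs out of it. The sketch below assumes that approach; `resolveInWorkspace` is a hypothetical helper, not the project's function, and it does not follow symlinks (a real implementation would likely also need `fs.realpath`).

```typescript
import * as path from "node:path";

// Hypothetical helper: resolve a requested path against the workspace root
// and throw if the result lands outside it. Symlinks are NOT resolved here.
function resolveInWorkspace(workspaceRoot: string, requested: string): string {
  const root = path.resolve(workspaceRoot);
  const target = path.resolve(root, requested);
  const relative = path.relative(root, target);

  const escapes =
    relative === ".." ||
    relative.startsWith(".." + path.sep) ||
    path.isAbsolute(relative); // e.g. a different drive on Windows

  if (escapes) {
    throw new Error(`path escapes workspace: ${requested}`);
  }
  return target;
}
```

Comparing the relative path, rather than checking whether the target string starts with the root string, avoids treating a sibling directory such as `/work/project-old` as being inside `/work/project`.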
Resilient
Model and network failures become short, clear messages instead of crashes.
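As an illustration of the idea, a wrapper can catch whatever a model or network call throws and return a short message in its place. This sketch is built on assumptions: the numeric `status` property on the error, the chosen status codes, the message wording, and the names `toUserMessage` and `runSafely` are hypothetical and not taken from the project.

```typescript
// Hypothetical: assumes failed requests throw an error that may carry a
// numeric HTTP `status`. The messages are illustrative only.
function toUserMessage(err: unknown): string {
  if (typeof err === "object" && err !== null && "status" in err) {
    const status = (err as { status: unknown }).status;
    if (status === 401) return "Authentication failed. Check your API key.";
    if (status === 429) return "Rate limited. Wait a moment and try again.";
    if (typeof status === "number" && status >= 500) {
      return "The model provider had a server error. Try again.";
    }
  }
  if (err instanceof Error) return `Request failed: ${err.message}`;
  return "Request failed for an unknown reason.";
}

type Outcome<T> = { ok: true; value: T } | { ok: false; message: string };

// Run a task and turn any thrown error into a short message, not a crash.
async function runSafely<T>(task: () => Promise<T>): Promise<Outcome<T>> {
  try {
    return { ok: true, value: await task() };
  } catch (err) {
    return { ok: false, message: toUserMessage(err) };
  }
}
```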
Observability
Each run reports the tokens used and the real billed cost from OpenRouter, so the figure is correct for any model automatically.
Evaluated
A benchmark scores the agent across seeded tasks and reports a success rate.
Measured, not guessed
Most hobby agents have no idea how well they work. This one keeps score. The benchmark runs the agent against a set of seeded tasks in isolated workspaces, auto-applies the changes, and checks the result, reporting pass rate, tokens, and real billed cost per task.
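The per-task results described above could be rolled up into a summary along these lines. This is a sketch under assumptions: the `TaskResult` and `Summary` shapes, the `summarize` and `formatSummary` functions, and the output format are hypothetical, and `cost` is in whatever unit the billing source reports.

```typescript
// Hypothetical result shape for one benchmark task.
interface TaskResult {
  name: string;
  passed: boolean;
  tokens: number;
  cost: number; // unit is whatever the billing source reports (assumption)
}

interface Summary {
  total: number;
  passed: number;
  passRate: number; // fraction between 0 and 1
  totalTokens: number;
  totalCost: number;
}

// Aggregate per-task results into pass rate, token, and cost totals.
function summarize(results: TaskResult[]): Summary {
  const total = results.length;
  const passed = results.filter((r) => r.passed).length;
  const totalTokens = results.reduce((sum, r) => sum + r.tokens, 0);
  const totalCost = results.reduce((sum, r) => sum + r.cost, 0);
  return {
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total, // avoid dividing by zero
    totalTokens,
    totalCost,
  };
}

// Render the summary as a single line.
function formatSummary(s: Summary): string {
  const pct = (s.passRate * 100).toFixed(1);
  return `${s.passed}/${s.total} passed (${pct}%), ${s.totalTokens} tokens, cost ${s.totalCost.toFixed(4)}`;
}
```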
bun start       run the agent
bun test        run the unit suite (50 tests)
bun run eval    run the benchmark