The endpoint of watching a tutorial isn't notes. It's a working result.
WithVideo is an AI-powered learning and execution engine for engineers. It reads what the instructor says first, looks at the screen only when the transcript isn't enough, distills every concrete step into a previewable, approvable execution plan, and runs that plan directly inside your local shell, editor, or IDE.
Semantic-first, not frame-by-frame
Other video AI tools treat a video as a pile of frames. WithVideo reads what the instructor says first. Only when the transcript can't resolve something — a code screenshot, a UI state — does it analyze the screen. One LLM call classifies everything; 50–70% of segments skip visual analysis entirely.
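The transcript-first pass can be sketched as a single classification over all caption blocks, with only the unresolved ones queued for vision. This is an illustrative sketch, not WithVideo's actual API: `classify_blocks` and its labels are hypothetical, and a keyword heuristic stands in for the one LLM call so the example is runnable.

```python
def classify_blocks(caption_blocks):
    """Label every caption block in one pass.

    Returns one label per block: 'command', 'narration', or
    'needs_vision' (the transcript alone can't resolve it).
    In the real pipeline this would be a single LLM call; a
    keyword heuristic stands in here so the sketch runs as-is.
    """
    labels = []
    for block in caption_blocks:
        text = block.lower()
        if "run" in text or "$" in text:
            labels.append("command")
        elif "on the screen" in text or "as you can see" in text:
            labels.append("needs_vision")  # fall through to frame analysis
        else:
            labels.append("narration")
    return labels

blocks = [
    "First, run npm install in your project folder",
    "As you can see on the screen, the config looks like this",
    "This step just explains why we want a database",
]
labels = classify_blocks(blocks)

# Only 'needs_vision' blocks proceed to the expensive vision stage.
vision_queue = [b for b, l in zip(blocks, labels) if l == "needs_vision"]
```

The point of the shape is the cost profile: one cheap classification up front, and the per-frame analysis runs only for the minority of blocks the transcript can't settle.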
Not notes — executable diffs
Most video summarizer tools end with a Markdown file. WithVideo ends with actual changes in git status. Every step is a previewable diff. Nothing runs until you explicitly accept — your environment is never silently modified.
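A preview-then-accept gate like the one described can be sketched as follows. The `apply_step` helper is hypothetical (it is not WithVideo's interface); it shows the invariant that matters: the diff is printed before anything runs, and `git apply` executes only after an explicit accept.

```python
import os
import subprocess
import tempfile

def apply_step(diff_text, accept):
    """Preview a unified diff; apply it with git only if accepted."""
    print(diff_text)   # preview: nothing has touched the working tree yet
    if not accept:
        return False   # rejected: the environment stays untouched
    # Write the diff to a temp patch file and hand it to git.
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(diff_text)
        patch_path = f.name
    try:
        subprocess.run(["git", "apply", patch_path], check=True)
    finally:
        os.unlink(patch_path)
    return True
```

Because the change lands via `git apply`, every accepted step is visible in `git status` and reversible with ordinary git tooling.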
Local-first, video stays on your machine
WithVideo is designed from the start on the assumption that you don't want to send your tutorial videos to someone else's server. The default setup runs fully locally: mlx-whisper on Apple Silicon for transcription, a local ONNX OCR model for on-screen text, and an LLM that can be swapped to Ollama or any OpenAI-compatible local backend.
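A local-first setup might look like the configuration below. The keys and model names are illustrative, not WithVideo's actual config schema; the one concrete detail is real: Ollama exposes an OpenAI-compatible endpoint at `/v1` on its default port, which is what makes the LLM swappable.

```python
# Illustrative local-first configuration. Key names and models are
# hypothetical; the Ollama endpoint URL is its real default.
LOCAL_CONFIG = {
    "transcript": {
        "backend": "mlx-whisper",      # runs on Apple Silicon, no upload
        "model": "large-v3",
    },
    "ocr": {
        "backend": "onnx",             # local ONNX OCR model
        "model_path": "models/ocr.onnx",
    },
    "llm": {
        "base_url": "http://localhost:11434/v1",  # Ollama, OpenAI-compatible
        "api_key": "ollama",                      # placeholder; ignored locally
        "model": "llama3.1",                      # any locally pulled model
    },
}
```

Pointing `base_url` at any other OpenAI-compatible server (a cloud key included) is the only change needed to swap backends.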
Five-stage pipeline
1. Source: YouTube / Bilibili / local mp4, mov, or mkv
2. Transcript: Whisper speech recognition or platform captions
3. Semantic: one LLM call classifies all caption blocks (~5 seconds)
4. Vision: runs only when the transcript can't resolve a step; 50–70% of blocks skip this
5. Guide: guide.md + semantic.json + code/, three executable artifacts
Works great for
- ✓ CLI tool walkthroughs
- ✓ Framework bootstrapping (Next.js / FastAPI / Rails...)
- ✓ Deployment pipelines (Docker / Vercel / K8s)
- ✓ Reproducing vibe-coding projects
Not a great fit for
- – Pure theory lectures with no executable steps
- – Videos with no captions and poor audio quality