The endpoint of watching a tutorial isn't notes. It's a working result.
WithVideo is an AI-powered learning and execution engine for engineers. It reads what the instructor says first, looks at the screen only when the transcript isn't enough, distills every concrete step into a previewable, approvable execution plan, and runs that plan directly inside your local shell, editor, or IDE.
Semantic-first, not frame-by-frame
Other video AI tools treat a video as a pile of frames. WithVideo reads what the instructor says first. Only when the transcript can't resolve something — a code screenshot, a UI state — does it analyze the screen. One LLM call classifies everything; 50–70% of segments skip visual analysis entirely.
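The transcript-first pass can be sketched as a single classification over all caption blocks, with only the unresolved ones queued for vision. This is an illustrative sketch, not WithVideo's actual API: `classify_blocks` and its labels are hypothetical, and a keyword heuristic stands in for the one LLM call so the example is runnable.

```python
def classify_blocks(caption_blocks):
    """Label every caption block in one pass.

    Returns one label per block: 'command', 'narration', or
    'needs_vision' (the transcript alone can't resolve it).
    In the real pipeline this would be a single LLM call; a
    keyword heuristic stands in here so the sketch runs as-is.
    """
    labels = []
    for block in caption_blocks:
        text = block.lower()
        if "run" in text or "$" in text:
            labels.append("command")
        elif "on the screen" in text or "as you can see" in text:
            labels.append("needs_vision")  # fall through to frame analysis
        else:
            labels.append("narration")
    return labels

blocks = [
    "First, run npm install in your project folder",
    "As you can see on the screen, the config looks like this",
    "This step just explains why we want a database",
]
labels = classify_blocks(blocks)

# Only 'needs_vision' blocks proceed to the expensive vision stage.
vision_queue = [b for b, l in zip(blocks, labels) if l == "needs_vision"]
```

The point of the shape is the cost profile: one cheap classification up front, and the per-frame analysis runs only for the minority of blocks the transcript can't settle.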
Not notes — executable diffs
Most video summarizer tools end with a Markdown file. WithVideo ends with actual changes in git status. Every step is a previewable diff. Nothing runs until you explicitly accept — your environment is never silently modified.
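A preview-then-accept gate like the one described can be sketched as follows. The `apply_step` helper is hypothetical (it is not WithVideo's interface); it shows the invariant that matters: the diff is printed before anything runs, and `git apply` executes only after an explicit accept.

```python
import os
import subprocess
import tempfile

def apply_step(diff_text, accept):
    """Preview a unified diff; apply it with git only if accepted."""
    print(diff_text)   # preview: nothing has touched the working tree yet
    if not accept:
        return False   # rejected: the environment stays untouched
    # Write the diff to a temp patch file and hand it to git.
    with tempfile.NamedTemporaryFile("w", suffix=".patch", delete=False) as f:
        f.write(diff_text)
        patch_path = f.name
    try:
        subprocess.run(["git", "apply", patch_path], check=True)
    finally:
        os.unlink(patch_path)
    return True
```

Because the change lands via `git apply`, every accepted step is visible in `git status` and reversible with ordinary git tooling.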
Local-first, video stays on your machine
WithVideo is designed from the start on the assumption that you don't want to send your tutorial videos to someone else's server. The default setup runs fully locally: mlx-whisper on Apple Silicon for transcription, a local ONNX OCR model for on-screen text, and an LLM that can be swapped to Ollama or any OpenAI-compatible local backend.
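A local-first setup might look like the configuration below. The keys and model names are illustrative, not WithVideo's actual config schema; the one concrete detail is real: Ollama exposes an OpenAI-compatible endpoint at `/v1` on its default port, which is what makes the LLM swappable.

```python
# Illustrative local-first configuration. Key names and models are
# hypothetical; the Ollama endpoint URL is its real default.
LOCAL_CONFIG = {
    "transcript": {
        "backend": "mlx-whisper",      # runs on Apple Silicon, no upload
        "model": "large-v3",
    },
    "ocr": {
        "backend": "onnx",             # local ONNX OCR model
        "model_path": "models/ocr.onnx",
    },
    "llm": {
        "base_url": "http://localhost:11434/v1",  # Ollama, OpenAI-compatible
        "api_key": "ollama",                      # placeholder; ignored locally
        "model": "llama3.1",                      # any locally pulled model
    },
}
```

Pointing `base_url` at any other OpenAI-compatible server (a cloud key included) is the only change needed to swap backends.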
Five-stage pipeline
1. Source: YouTube / Bilibili / local mp4, mov, or mkv
2. Transcript: Whisper speech recognition or platform captions
3. Semantic: one LLM call classifies all caption blocks (~5 seconds)
4. Vision: runs only when the transcript can't resolve a step; 50–70% of blocks skip this
5. Guide: guide.md + semantic.json + code/, three executable artifacts
Works great for
- ✓ CLI tool walkthroughs
- ✓ Framework bootstrapping (Next.js / FastAPI / Rails...)
- ✓ Deployment pipelines (Docker / Vercel / K8s)
- ✓ Reproducing vibe-coding projects
Not a great fit for
- – Pure theory lectures with no executable steps
- – Videos with no captions and poor audio quality