WithVideo
About WithVideo

The end goal of watching a tutorial isn't notes. It's a working result.

WithVideo is an AI-powered learning and execution engine for engineers. It reads what the instructor says first, looks at the screen only when the transcript isn't enough, distills every concrete step into a previewable, approvable execution plan, and runs it directly inside your local shell, editor, or IDE.

Why we built this

Semantic-first, not frame-by-frame

Other video AI tools treat a video as a pile of frames. WithVideo reads what the instructor says first. Only when the transcript can't resolve something — a code screenshot, a UI state — does it analyze the screen. One LLM call classifies everything; 50–70% of segments skip visual analysis entirely.
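
The transcript-first gate can be sketched as follows. This is a toy illustration, not WithVideo's actual code: the label names and the keyword heuristic standing in for the batched LLM call are assumptions.

```python
from dataclasses import dataclass

@dataclass
class CaptionBlock:
    text: str
    label: str = "unclassified"

def classify_llm(blocks):
    """Stub for the single batched LLM call: one label per block.
    The real call would send every block in one prompt; here a
    keyword heuristic fakes the classification."""
    labels = []
    for b in blocks:
        if "this screenshot" in b.text or "shown on screen" in b.text:
            labels.append("needs_vision")   # transcript alone can't resolve
        elif any(w in b.text for w in ("run", "install", "type")):
            labels.append("command")
        else:
            labels.append("narration")
    return labels

def analyze(blocks):
    for block, label in zip(blocks, classify_llm(blocks)):
        block.label = label
    # Vision runs only on the blocks the transcript couldn't resolve.
    return [b for b in blocks if b.label == "needs_vision"]

blocks = [
    CaptionBlock("First, run npm install in your project"),
    CaptionBlock("As you can see in this screenshot, the config lives here"),
    CaptionBlock("That's why we prefer this framework"),
]
vision_queue = analyze(blocks)  # only 1 of 3 blocks needs the vision stage
```

The point of the shape: classification happens once for the whole transcript, and the expensive visual pass touches only the minority of blocks it flags.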

Not notes — executable diffs

Most video summarizer tools end with a Markdown file. WithVideo ends with actual changes in git status. Every step is a previewable diff. Nothing runs until you explicitly accept — your environment is never silently modified.
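
A minimal sketch of the preview-then-apply contract, using Python's `difflib` (not WithVideo's real implementation; the in-memory file tree and `accept` callback are illustrative assumptions):

```python
import difflib

def preview_diff(path, old, new):
    """Render the proposed change as a unified diff before anything runs."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=f"a/{path}", tofile=f"b/{path}",
    ))

def apply_step(files, path, new, accept):
    """Mutate the tree only when the user's accept() returns True."""
    diff = preview_diff(path, files.get(path, ""), new)
    if accept(diff):
        files[path] = new
        return True
    return False

files = {"app.py": "print('hello')\n"}
changed = apply_step(files, "app.py", "print('hello, world')\n",
                     accept=lambda d: d.startswith("--- a/app.py"))
```

The invariant this models: the diff exists and is shown before any write, and declining leaves the tree untouched.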

Local-first, video stays on your machine

WithVideo was designed from the start on the assumption that you don't want to send your tutorial videos to someone else's server. The default setup runs fully locally: mlx-whisper for transcription on Apple Silicon, a local ONNX OCR model for on-screen text, and an LLM that can be swapped for Ollama or any OpenAI-compatible local backend.

How it works

Five-stage pipeline

  1. Source: YouTube / Bilibili / local mp4·mov·mkv
  2. Transcript: Whisper speech recognition or platform captions
  3. Semantic: one LLM call classifies all caption blocks (~5 seconds)
  4. Vision: runs only when the transcript can't resolve a block (50–70% of blocks skip this)
  5. Guide: guide.md + semantic.json + code/ (three executable artifacts)
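
The handoff between stages can be sketched as below. Every stage body is a stub (the vision stage is omitted because no block needs it here); only the shape of the flow ending in the three artifacts is real to the description above.

```python
import json
import tempfile
from pathlib import Path

def transcript_stage(source):
    """Stage 2 stub: speech recognition yields timestamped caption blocks."""
    return [{"t": 0, "text": f"intro to {source}"},
            {"t": 5, "text": "run the install command"}]

def semantic_stage(blocks):
    """Stage 3 stub: one batched call labels every block at once."""
    return [{**b, "label": "command" if "run" in b["text"] else "narration"}
            for b in blocks]

def guide_stage(labeled, out_dir):
    """Stage 5 stub: emit guide.md, semantic.json, and code/."""
    out = Path(out_dir)
    (out / "code").mkdir(exist_ok=True)
    (out / "guide.md").write_text(
        "\n".join(f"- {b['text']}" for b in labeled if b["label"] == "command"))
    (out / "semantic.json").write_text(json.dumps(labeled))
    return sorted(p.name for p in out.iterdir())

with tempfile.TemporaryDirectory() as d:
    labeled = semantic_stage(transcript_stage("demo.mp4"))
    artifacts = guide_stage(labeled, d)  # ['code', 'guide.md', 'semantic.json']
```

Each stage consumes the previous stage's output unchanged, which is what lets the vision stage be skipped entirely when the semantic stage resolves everything.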

Benchmarks

Measured, not estimated

  • ~78x: OCR vision backend speedup vs. the previous vision pipeline
  • 7m49s: end-to-end for a 10-minute video (Apple Silicon, measured)
  • 8m28s: semantic stage for a 20-minute video (after 4-worker parallelization)
  • ~5s: full caption-block classification in a single LLM call

Use cases

Works great for

  • CLI tool walkthroughs
  • Framework bootstrapping (Next.js / FastAPI / Rails...)
  • Deployment pipelines (Docker / Vercel / K8s)
  • Reproducing vibe-coding projects

Not a great fit for

  • Pure theory lectures with no executable steps
  • Videos with no captions and poor audio quality