How I Built a YouTube Knowledge Base with AI
The architecture behind Skip: pgvector, semantic chunking, and RAG over video transcripts. A technical walkthrough of turning YouTube into a searchable knowledge base.
The Skip Team
Skip processes YouTube videos through a pipeline: transcript extraction → chunking → embedding with text-embedding-3-small → storage in pgvector. Users search by meaning, not keywords. Built with Next.js, FastAPI, Celery, Redis, and Supabase. This post walks through the architecture decisions and trade-offs.
I watch a lot of YouTube. Tutorials, conference talks, deep dives on system design. At some point I realized I had hundreds of hours of video knowledge trapped behind play buttons — impossible to search, impossible to reference.
So I built Skip: a system that turns YouTube videos into a searchable knowledge base you can query with natural language. This post covers the architecture, the trade-offs, and what I learned building it.
The Problem: Video Knowledge Is Locked
YouTube's search finds videos. It doesn't search inside videos. If you watched a 40-minute talk where the speaker explained connection pooling for 90 seconds, you'll never find that segment again unless you remember which video it was in and scrub through the timeline.
I wanted something simple: import a video, and be able to search it later by what was said, not by what the title or description happened to contain.
Architecture Overview
The system has four layers:
- Ingestion — extract metadata and transcripts from YouTube
- Processing — chunk transcripts, generate embeddings, store vectors
- Search — semantic similarity search over the vector store
- Chat — RAG pipeline that answers questions using retrieved context
The frontend is Next.js with TypeScript. The backend is FastAPI (Python). Async work runs on Celery with Redis as the broker. The database is Supabase — PostgreSQL with the pgvector extension for vector similarity search.
Layer 1: Ingestion
When a user imports a YouTube video, we kick off a Celery task that:
- Fetches video metadata (title, author, duration, thumbnail) via the YouTube Data API
- Extracts the transcript using youtube-transcript-api, which pulls YouTube's auto-generated or manual captions
- Rate-limits concurrent transcript fetches with a semaphore (max 5 at a time) to avoid getting IP-blocked
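A minimal sketch of the semaphore idea, assuming an asyncio-based fetcher. The post doesn't say whether the real limiter is asyncio, thread, or Redis based, so treat this as one possible shape. The names fetch_transcripts and fetch_one are hypothetical; fetch_one stands in for whatever actually calls the transcript library.

```python
import asyncio
from typing import Awaitable, Callable

MAX_CONCURRENT_FETCHES = 5  # the "max 5 at a time" limit described above


async def fetch_transcripts(
    video_ids: list[str],
    fetch_one: Callable[[str], Awaitable[list[dict]]],
    max_concurrent: int = MAX_CONCURRENT_FETCHES,
) -> dict[str, list[dict]]:
    """Fetch transcripts for many videos, at most `max_concurrent` at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(video_id: str) -> tuple[str, list[dict]]:
        # Only `max_concurrent` coroutines can be inside this block at once;
        # the rest wait here until a slot frees up.
        async with semaphore:
            return video_id, await fetch_one(video_id)

    results = await asyncio.gather(*(guarded(v) for v in video_ids))
    return dict(results)
```

Passing the fetcher in as an argument keeps the limiter independent of the transcript library and makes it easy to test with a fake.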
The transcript comes back as a list of timed segments — each with a start timestamp and text. We preserve these timestamps because they're gold: when a user finds a relevant passage later, we can link them directly to that moment in the video.
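To show why the timestamps matter, here is a small sketch that turns a segment's start time into a deep link. It assumes each segment is a dict with "start" (seconds) and "text" keys; timestamp_url is a hypothetical helper and VIDEO_ID is a placeholder.

```python
from urllib.parse import urlencode


def timestamp_url(video_id: str, start_seconds: float) -> str:
    """Build a YouTube watch URL that starts playback at `start_seconds`."""
    # The `t` parameter takes whole seconds, so fractional starts are truncated.
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


segment = {"start": 93.4, "text": "connection pooling keeps a set of open connections"}
print(timestamp_url("VIDEO_ID", segment["start"]))
# https://www.youtube.com/watch?v=VIDEO_ID&t=93s
```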
Layer 2: Chunking and Embedding
Raw transcripts are messy. A 30-minute video produces a wall of text with no structure. We need to break it into chunks that are small enough to embed meaningfully but large enough to carry context.
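The exact chunking algorithm isn't spelled out here, so the following is only an illustration of the constraint: group consecutive timed segments up to a size budget while keeping the start time of each chunk. It is a plain size-based grouper, not a semantic chunker. The Chunk type, chunk_segments, and the 1000-character budget are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    start: float  # seconds into the video where the chunk begins
    end: float    # start time of the last segment in the chunk


def _flush(buffer: list[dict]) -> Chunk:
    text = " ".join(seg["text"].strip() for seg in buffer)
    return Chunk(text=text, start=buffer[0]["start"], end=buffer[-1]["start"])


def chunk_segments(segments: list[dict], max_chars: int = 1000) -> list[Chunk]:
    """Group consecutive timed segments into chunks of roughly `max_chars`."""
    chunks: list[Chunk] = []
    buffer: list[dict] = []
    size = 0
    for seg in segments:
        # Close the current chunk before this segment would push it over budget.
        if buffer and size + len(seg["text"]) > max_chars:
            chunks.append(_flush(buffer))
            buffer, size = [], 0
        buffer.append(seg)
        size += len(seg["text"]) + 1  # +1 for the joining space
    if buffer:
        chunks.append(_flush(buffer))
    return chunks
```

Because chunks are built from whole segments, every chunk boundary lines up with a real caption timestamp, which is what makes the later jump-to-moment links possible.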
Each chunk gets embedded using OpenAI's text-embedding-3-small model, configured to return 512-dimensional vectors rather than its default 1,536. We chose this setup for the balance of quality vs. cost: at 512 dimensions, storage is reasonable and search is fast, while semantic quality is still strong for our use case.
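As a rough sketch of the embedding call: the OpenAI Python SDK exposes client.embeddings.create, and the text-embedding-3 models accept a dimensions argument to shorten the output. The embed_texts wrapper below is hypothetical, and the client is passed in (for example an openai.OpenAI() instance) so the snippet doesn't depend on credentials.

```python
from typing import Any

EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 512


def embed_texts(client: Any, texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts with an OpenAI-style client.

    `client` is assumed to be an `openai.OpenAI()` instance or a compatible fake.
    """
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=texts,
        dimensions=EMBEDDING_DIMENSIONS,  # shorter than the model's default output
    )
    # Each result carries the index of its input; sort on it so the returned
    # vectors line up with `texts` regardless of response ordering.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

Batching several chunks into one request keeps the number of API calls down when a long video produces many chunks.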
The embeddings go into pgvector, PostgreSQL's vector extension. This was a deliberate choice over dedicated vector databases like Pinecone or Weaviate. Why?
- Operational simplicity — one database for relational data and vectors. No sync issues.
- Transactional consistency — when we delete a video, its vectors disappear in the same transaction.
- Good enough performance — pgvector with IVFFlat or HNSW indexes handles millions of vectors. We're not at billions-scale.
- Supabase gives us pgvector for free — no extra infrastructure to manage.
The trade-off: pgvector is slower than purpose-built vector DBs at very high scale. For a personal knowledge base with thousands of videos, it's not even close to being a bottleneck.
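A quick back-of-the-envelope check on the storage side. pgvector's documentation gives the size of a vector value as 4 bytes per dimension plus 8 bytes; the one-million-chunk library is an assumed figure for illustration, and index and row overhead are not counted.

```python
def vector_storage_bytes(num_vectors: int, dimensions: int) -> int:
    """Raw storage for pgvector `vector` values: 4 bytes per dimension + 8 bytes."""
    return num_vectors * (4 * dimensions + 8)


for dims in (512, 1536):
    gib = vector_storage_bytes(1_000_000, dims) / 2**30
    print(f"{dims} dims, 1M chunks: {gib:.2f} GiB")
# 512 dims, 1M chunks: 1.91 GiB
# 1536 dims, 1M chunks: 5.73 GiB
```

Even at a million chunks, the raw vectors fit comfortably in an ordinary PostgreSQL instance, which is consistent with pgvector not being the bottleneck at this scale.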
Layer 3: Semantic Search
This is where it gets interesting. When a user searches "how to handle database migrations in production," we:
- Embed the query using the same model (text-embedding-3-small)
- Run a cosine similarity search against all chunk embeddings in the user's library
- Return the top-K most similar chunks, each with its source video and timestamp
The key insight: semantic search finds by meaning, not keywords. A query about "database migrations" will match a chunk where the speaker said "schema changes in production" — because the embeddings capture conceptual similarity.
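The search steps above boil down to "score every chunk against the query, keep the best K". In production that ranking happens inside PostgreSQL (pgvector's <=> operator computes cosine distance, which is 1 minus cosine similarity), but the logic is easy to see in plain Python. The function names and toy 3-dimensional vectors below are illustrative only.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k_chunks(
    query_vec: list[float], chunks: list[dict], k: int = 5
) -> list[tuple[float, dict]]:
    """Score every chunk against the query and return the k best, highest first."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy 3-dimensional vectors standing in for real 512-dimensional embeddings.
library = [
    {"video_id": "a", "start": 120.0, "embedding": [0.9, 0.1, 0.0]},
    {"video_id": "b", "start": 45.0, "embedding": [0.0, 1.0, 0.0]},
    {"video_id": "c", "start": 300.0, "embedding": [1.0, 0.0, 0.0]},
]
for score, chunk in top_k_chunks([1.0, 0.0, 0.0], library, k=2):
    print(chunk["video_id"], chunk["start"], round(score, 3))
# c 300.0 1.0
# a 120.0 0.994
```

Each result keeps its video_id and start time, so the caller can render a link to the exact moment.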
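We also cache embeddings in Redis with a 24-hour TTL. Since the same input text produces effectively the same embedding, a cache hit lets us skip the API call for repeated queries. Queries that are merely similar still need their own embedding unless they normalize to the same cache key.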
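One way the cache could look, as a sketch rather than the actual implementation. The key scheme (a SHA-256 of the text, prefixed with the model name) and the cached_embedding helper are assumptions. The cache argument only needs get and setex, which redis-py's Redis client provides.

```python
import hashlib
import json
from typing import Any, Callable

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL mentioned above


def cached_embedding(
    cache: Any,
    embed: Callable[[str], list[float]],
    text: str,
    model: str = "text-embedding-3-small",
) -> list[float]:
    """Return the embedding for `text`, calling `embed` only on a cache miss."""
    # Include the model in the key so switching models never serves stale vectors.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    key = f"emb:{model}:{digest}"

    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)  # json.loads accepts the bytes redis-py returns

    vector = embed(text)
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(vector))
    return vector
```

Hashing the text keeps keys short and uniform no matter how long the query is.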
Layer 4: Chat (RAG)
Search returns chunks. Chat turns those chunks into answers.
The RAG (Retrieval-Augmented Generation) pipeline works like this:
- User asks a question in the chat interface
- We retrieve the most relevant chunks from their library (same vector search as above)
- We construct a prompt with the retrieved context and the user's question
- An LLM generates an answer grounded in the actual video content
- The response includes citations — which video and timestamp each claim comes from
The citations are critical. Without them, it's just another chatbot making things up. With them, every answer is traceable back to a specific moment in a specific video.
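A sketch of the prompt-construction step, showing how numbered sources make citations possible. The prompt wording, the chunk fields (video_title, start, text), and both helper names are assumptions for illustration, not the production prompt.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS for videos over an hour."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes}:{secs:02d}"


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble a grounded prompt whose sources the model can cite by number."""
    sources = []
    for number, chunk in enumerate(chunks, start=1):
        title = chunk["video_title"]
        stamp = format_timestamp(chunk["start"])
        body = chunk["text"]
        sources.append(f'[{number}] "{title}" at {stamp}\n{body}')
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the sources below. "
        "Cite sources by number, like [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Since each [n] marker maps back to a chunk with a known video and start time, the application can turn the model's citations into clickable timestamp links after generation.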
The MCP Integration
One of the more interesting features: Skip exposes an MCP (Model Context Protocol) server that lets you query your video knowledge base directly from tools like Cursor or Claude.
MCP is a protocol that lets AI assistants call external tools. Our MCP server exposes a search_knowledge_base tool that runs the same semantic search pipeline. So while you're coding in Cursor, you can ask "what did that video say about React Server Components?" and get an answer pulled from your library — without leaving your editor.
The implementation is a FastMCP server that authenticates against the same user session and queries the same pgvector store. No separate index, no data duplication.
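To make the "same pipeline, different entry point" idea concrete, here is a hypothetical tool handler that simply delegates to an existing search function. The make_search_tool factory, the search signature, and the top_k clamp are assumptions. Binding one user per server is also a simplification; the real server authenticates against the user session.

```python
from typing import Callable


def make_search_tool(
    search: Callable[[str, str, int], list[dict]], user_id: str
) -> Callable[..., list[dict]]:
    """Wrap an existing search function as a tool handler for one user's library."""

    def search_knowledge_base(query: str, top_k: int = 5) -> list[dict]:
        """Search the user's video library by meaning and return matching chunks."""
        top_k = max(1, min(top_k, 20))  # keep responses small for the assistant
        return search(user_id, query, top_k)

    return search_knowledge_base


# Registration with the MCP Python SDK would look roughly like this
# (left as a comment so the sketch runs without the SDK installed):
#   from mcp.server.fastmcp import FastMCP
#   server = FastMCP("skip")
#   server.tool()(make_search_tool(search, user_id))
#   server.run()
```

FastMCP derives the tool's name and input schema from the function name, type hints, and docstring, which is why the inner function is named search_knowledge_base.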
What I'd Do Differently
A few things I've learned building this:
- Start with pgvector. I spent time evaluating Pinecone, Qdrant, and Weaviate before realizing that pgvector in Supabase was the right choice for our scale. Don't over-engineer your vector store.
- Timestamp preservation is non-negotiable. The ability to jump to the exact moment in a video transforms the UX. Don't throw away temporal metadata during chunking.
- Rate limiting matters more than you think. YouTube will block your IP if you hit the transcript API too aggressively. The semaphore-based approach works, but you also need exponential backoff and graceful degradation.
- Embedding caching pays for itself immediately. Users search for similar things repeatedly. Redis caching cut our embedding API costs significantly.
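For the rate-limiting lesson, a minimal sketch of retrying with exponential backoff and jitter. The with_backoff helper, its defaults, and the broad exception handling are assumptions; real code should catch only the errors that signal throttling or a transient failure.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying failures with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # narrow this to rate-limit/transient errors in real code
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the last error to the caller
            delay = min(max_delay, base_delay * 2**attempt)  # 1s, 2s, 4s, ...
            # Jitter spreads retries out so parallel workers don't retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

Wrapping a transcript fetch is then a one-liner such as with_backoff(lambda: fetch(video_id)), where fetch is whatever function performs the request. When the final attempt fails, the caller can mark the video as "transcript unavailable" instead of failing the whole import.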
Try It
Skip is live at getskip.dev. The free tier gives you 50 videos and 100 messages per month — enough to build a meaningful knowledge base and see if the workflow clicks for you.
If you're a developer who learns from YouTube (and who doesn't?), I'd love to hear what you think. The Chrome extension makes importing a video a one-click action, and the MCP integration is worth trying if you use Cursor or Claude.
The best part of building this has been watching the search results surface things I forgot I watched. That conference talk from six months ago where someone explained exactly the pattern I need right now? It's there. I just have to ask.
Frequently Asked Questions
What tech stack is Skip built with?
Skip uses Next.js (TypeScript) for the frontend, FastAPI (Python) for the backend, Celery with Redis for async task processing, and Supabase (PostgreSQL with pgvector) for the database and vector store. Embeddings are generated using OpenAI's text-embedding-3-small model.
Why use pgvector instead of a dedicated vector database?
pgvector keeps relational data and vectors in the same database, giving you transactional consistency (deleting a video removes its vectors atomically) and operational simplicity (one database to manage). For personal knowledge bases with thousands of videos, pgvector performance is more than sufficient.
How does semantic search work on video transcripts?
Video transcripts are chunked into segments, each embedded into a 512-dimensional vector using text-embedding-3-small. When you search, your query is embedded the same way and compared via cosine similarity against all chunks. This finds matches by meaning — 'database migrations' matches 'schema changes in production' — not just keywords.
What is RAG and how does Skip use it?
RAG (Retrieval-Augmented Generation) retrieves relevant transcript chunks via vector search, then passes them as context to an LLM that generates a grounded answer. Skip's RAG pipeline includes citations with video timestamps, so every answer is traceable back to its source.
Can I query my video knowledge base from my code editor?
Yes. Skip provides an MCP (Model Context Protocol) server that integrates with tools like Cursor and Claude. You can search your video library with natural language directly from your editor, using the same semantic search pipeline as the web interface.
