Engineering · March 30, 2026 · 10 min read

How I Built a YouTube Knowledge Base with AI

The architecture behind Skip: pgvector, semantic chunking, and RAG over video transcripts. A technical walkthrough of turning YouTube into a searchable knowledge base.

The Skip Team

TL;DR

Skip processes YouTube videos through a pipeline: transcript extraction → chunking → embedding with text-embedding-3-small → storage in pgvector. Users search by meaning, not keywords. Built with Next.js, FastAPI, Celery, Redis, and Supabase. This post walks through the architecture decisions and trade-offs.

I watch a lot of YouTube. Tutorials, conference talks, deep dives on system design. At some point I realized I had hundreds of hours of video knowledge trapped behind play buttons — impossible to search, impossible to reference.

So I built Skip: a system that turns YouTube videos into a searchable knowledge base you can query with natural language. This post covers the architecture, the trade-offs, and what I learned building it.

The Problem: Video Knowledge Is Locked

YouTube's search finds videos. It doesn't search inside videos. If you watched a 40-minute talk where the speaker explained connection pooling for 90 seconds, you'll never find that segment again unless you remember which video it was in and scrub through the timeline.

I wanted something simple: import a video, and be able to search it later by what was said, not by what the title or description happened to contain.

Architecture Overview

The system has four layers:

Ingestion — extract metadata and transcripts from YouTube
Processing — chunk transcripts, generate embeddings, store vectors
Search — semantic similarity search over the vector store
Chat — RAG pipeline that answers questions using retrieved context

The frontend is Next.js with TypeScript. The backend is FastAPI (Python). Async work runs on Celery with Redis as the broker. The database is Supabase — PostgreSQL with the pgvector extension for vector similarity search.

Layer 1: Ingestion

When a user imports a YouTube video, we kick off a Celery task that:

Fetches video metadata (title, author, duration, thumbnail) via the YouTube Data API
Extracts the transcript using youtube-transcript-api, which pulls YouTube's auto-generated or manual captions
Rate-limits concurrent transcript fetches with a semaphore (max 5 at a time) to avoid getting IP-blocked
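The concurrency cap in the last step can be sketched roughly as follows. This is a minimal illustration, not Skip's actual task code: `fetch_with_limit` and `fetch_many` are hypothetical names, and the `fetch` argument stands in for whatever function really downloads a transcript. An in-process semaphore like this only limits threads inside one worker process; a cap across several Celery worker processes would need a shared limiter instead.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_FETCHES = 5  # the "max 5 at a time" limit described above

# Hypothetical module-level semaphore shared by all threads in this process.
_fetch_slots = threading.BoundedSemaphore(MAX_CONCURRENT_FETCHES)


def fetch_with_limit(fetch, video_id):
    """Call fetch(video_id) while holding one of the limited slots."""
    with _fetch_slots:
        return fetch(video_id)


def fetch_many(fetch, video_ids, workers=20):
    """Fetch many transcripts; at most MAX_CONCURRENT_FETCHES run at once."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda vid: fetch_with_limit(fetch, vid), video_ids))
```

Even with 20 worker threads, only five calls to `fetch` are in flight at any moment; the rest block on the semaphore until a slot frees up.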

The transcript comes back as a list of timed segments — each with a start timestamp and text. We preserve these timestamps because they're gold: when a user finds a relevant passage later, we can link them directly to that moment in the video.
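As a small illustration of why the timestamps matter, here is a sketch of turning a segment's start time into a deep link. The helper names are hypothetical; the `t=<seconds>s` query parameter is the standard way to link to a moment in a YouTube video.

```python
from urllib.parse import urlencode


def timestamp_url(video_id: str, start_seconds: float) -> str:
    """Build a watch URL that starts playback at the given second."""
    query = urlencode({"v": video_id, "t": f"{int(start_seconds)}s"})
    return f"https://www.youtube.com/watch?{query}"


def format_timestamp(start_seconds: float) -> str:
    """Render seconds as M:SS, or H:MM:SS for videos longer than an hour."""
    total = int(start_seconds)
    hours, rem = divmod(total, 3600)
    minutes, seconds = divmod(rem, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{seconds:02d}"
    return f"{minutes}:{seconds:02d}"
```

For example, `timestamp_url("abc123", 754.6)` gives `https://www.youtube.com/watch?v=abc123&t=754s`, which a search result can display next to the label `12:34`.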

Layer 2: Chunking and Embedding

Raw transcripts are messy. A 30-minute video produces a wall of text with no structure. We need to break it into chunks that are small enough to embed meaningfully but large enough to carry context.
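One simple way to do this is to group consecutive timed segments until a size budget is reached, carrying a little overlap into the next chunk. The sketch below is an assumption for illustration, not the chunker Skip actually runs: the `Segment` and `Chunk` types, the 1,000-character budget, and the one-segment overlap are all made up here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the beginning of the video
    text: str


@dataclass
class Chunk:
    start: float  # start time of the first segment in the chunk
    end: float    # start time of the last segment in the chunk
    text: str


def _make_chunk(window):
    return Chunk(
        start=window[0].start,
        end=window[-1].start,
        text=" ".join(s.text for s in window),
    )


def chunk_segments(segments, max_chars=1000, overlap=1):
    """Group timed segments into chunks of roughly max_chars characters.

    The last `overlap` segments of each chunk are repeated at the start of
    the next one so that a sentence cut at a boundary keeps some context.
    """
    chunks, window, fresh = [], [], 0  # fresh = segments not yet emitted
    for seg in segments:
        window_len = sum(len(s.text) + 1 for s in window)
        if fresh and window_len + len(seg.text) > max_chars:
            chunks.append(_make_chunk(window))
            window = window[-overlap:] if overlap > 0 else []
            fresh = 0
        window.append(seg)
        fresh += 1
    if fresh:
        chunks.append(_make_chunk(window))
    return chunks
```

Each chunk keeps the start time of its first segment, so the timestamp survives chunking and can be stored alongside the embedding.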

Each chunk gets embedded using OpenAI's text-embedding-3-small model, with the output reduced from the model's default 1,536 dimensions to 512. We chose this setup for its balance of quality and cost: at 512 dimensions, storage is reasonable and search is fast, while semantic quality is still strong for our use case.
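A hedged sketch of the embedding call: the function name `embed_texts` is hypothetical, and the client is passed in so the snippet does not depend on how credentials are configured. The `model`, `input`, and `dimensions` arguments are parameters of the OpenAI Python SDK's `embeddings.create` method; `dimensions=512` is what asks the API for shortened vectors.

```python
EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIMENSIONS = 512


def embed_texts(client, texts):
    """Embed a batch of chunk texts and return one vector per text.

    `client` is expected to be an OpenAI SDK client, for example:
        from openai import OpenAI
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
    """
    response = client.embeddings.create(
        model=EMBEDDING_MODEL,
        input=list(texts),
        dimensions=EMBEDDING_DIMENSIONS,
    )
    # The API returns one item per input, in the same order as the inputs.
    return [item.embedding for item in response.data]
```

Sending chunks in batches rather than one request per chunk keeps the number of API round trips down for long videos.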

The embeddings go into pgvector, PostgreSQL's vector extension. This was a deliberate choice over dedicated vector databases like Pinecone or Weaviate. Why?

Operational simplicity — one database for relational data and vectors. No sync issues.
Transactional consistency — when we delete a video, its vectors disappear in the same transaction.
Good enough performance — pgvector with IVFFlat or HNSW indexes handles millions of vectors. We're not at billions-scale.
Supabase gives us pgvector for free — no extra infrastructure to manage.

The trade-off: pgvector is slower than purpose-built vector DBs at very high scale. For a personal knowledge base with thousands of videos, it's not even close to being a bottleneck.
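To make the "one database" point concrete, here is a sketch of what the schema and search query could look like. The table and column names are assumptions for illustration, not Skip's real schema. What comes from pgvector itself: the `vector(n)` column type, the `hnsw` index method with `vector_cosine_ops`, and the `<=>` cosine-distance operator. The placeholders use the named `%(name)s` style that psycopg accepts.

```python
# Hypothetical schema: deleting a video cascades to its chunks (and their
# vectors) inside the same transaction.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE videos (
    id       bigserial PRIMARY KEY,
    user_id  uuid NOT NULL,
    title    text NOT NULL
);

CREATE TABLE chunks (
    id         bigserial PRIMARY KEY,
    video_id   bigint NOT NULL REFERENCES videos(id) ON DELETE CASCADE,
    start_sec  real NOT NULL,
    content    text NOT NULL,
    embedding  vector(512) NOT NULL
);

CREATE INDEX chunks_embedding_idx
    ON chunks USING hnsw (embedding vector_cosine_ops);
"""

# <=> is cosine distance, so similarity = 1 - distance.
SEARCH_SQL = """
SELECT c.video_id, c.start_sec, c.content,
       1 - (c.embedding <=> %(query)s::vector) AS similarity
FROM chunks c
JOIN videos v ON v.id = c.video_id
WHERE v.user_id = %(user_id)s
ORDER BY c.embedding <=> %(query)s::vector
LIMIT %(top_k)s;
"""


def to_vector_literal(values):
    """Format a list of floats in pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"


def search_params(user_id, query_embedding, top_k=5):
    """Build the parameter dict for SEARCH_SQL."""
    return {
        "user_id": user_id,
        "query": to_vector_literal(query_embedding),
        "top_k": top_k,
    }
```

With a driver such as psycopg this would be run as `cursor.execute(SEARCH_SQL, search_params(user_id, embedding))`. The relational join and the vector ordering happen in a single query, which is the practical payoff of keeping both kinds of data in PostgreSQL.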

Layer 3: Semantic Search

This is where it gets interesting. When a user searches "how to handle database migrations in production," we:

Embed the query using the same model (text-embedding-3-small)
Run a cosine similarity search against all chunk embeddings in the user's library
Return the top-K most similar chunks, each with its source video and timestamp

The key insight: semantic search finds by meaning, not keywords. A query about "database migrations" will match a chunk where the speaker said "schema changes in production" — because the embeddings capture conceptual similarity.
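The ranking step itself is small. In production the database does this work, but a pure-Python sketch shows what "cosine similarity, top-K" means. The function names and the toy two-dimensional vectors are illustrative only; real embeddings here have 512 dimensions.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query, chunks, k=3):
    """Return the k (score, label) pairs most similar to the query.

    `chunks` is an iterable of (label, embedding) pairs.
    """
    scored = ((cosine_similarity(query, emb), label) for label, emb in chunks)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


library = [
    ("schema changes in production", [1.0, 0.0]),
    ("cooking pasta", [0.0, 1.0]),
    ("rolling out migrations", [1.0, 1.0]),
    ("unrelated", [-1.0, 0.0]),
]
results = top_k([1.0, 0.0], library, k=2)
```

Vectors that point in nearly the same direction score close to 1 regardless of their length, which is why two differently worded passages about the same idea can land next to each other.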

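We also cache query embeddings in Redis with a 24-hour TTL. Embeddings are effectively deterministic (same input → same output), so this avoids recomputing them for repeated queries. An exact-match cache only helps when the query text is identical; a query that is merely similar still needs a fresh embedding.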
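A minimal sketch of such a cache, assuming a redis-py style client (its `get` and `setex` methods are used below). The key format, the whitespace normalization, and the function names are hypothetical choices for this example.

```python
import hashlib
import json

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour TTL


def normalize(text):
    """Collapse runs of whitespace so trivially different inputs share a key."""
    return " ".join(text.split())


def cache_key(model, text):
    """Key on the model name plus a hash of the normalized text."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"


def get_or_compute_embedding(cache, model, text, compute):
    """Return a cached embedding, or compute it and store it with a TTL.

    `cache` needs get(key) and setex(key, ttl, value), as on redis.Redis.
    `compute` is any callable that maps text to a list of floats.
    """
    key = cache_key(model, text)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    vector = compute(normalize(text))
    cache.setex(key, CACHE_TTL_SECONDS, json.dumps(vector))
    return vector
```

Including the model name in the key means a later switch to a different embedding model cannot return stale vectors from the old one.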

Layer 4: Chat (RAG)

Search returns chunks. Chat turns those chunks into answers.

The RAG (Retrieval-Augmented Generation) pipeline works like this:

User asks a question in the chat interface
We retrieve the most relevant chunks from their library (same vector search as above)
We construct a prompt with the retrieved context and the user's question
An LLM generates an answer grounded in the actual video content
The response includes citations — which video and timestamp each claim comes from

The citations are critical. Without them, it's just another chatbot making things up. With them, every answer is traceable back to a specific moment in a specific video.
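The prompt-construction step can be sketched like this. The wording of the instructions, the `[n]` citation format, and the `RetrievedChunk` type are assumptions for illustration, not Skip's actual prompt.

```python
from dataclasses import dataclass


@dataclass
class RetrievedChunk:
    video_title: str
    start_seconds: int
    text: str


def build_prompt(question, chunks):
    """Assemble a grounded prompt with numbered, timestamped sources."""
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        minutes, seconds = divmod(chunk.start_seconds, 60)
        header = f"[{i}] {chunk.video_title} @ {minutes}:{seconds:02d}"
        sources.append(f"{header}\n{chunk.text}")
    context = "\n\n".join(sources)
    return (
        "Answer the question using only the sources below. "
        "Cite sources inline as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )
```

Because each source carries a number, a title, and a timestamp, a `[2]` in the model's answer can be mapped back to a specific moment in a specific video when the response is rendered.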

The MCP Integration

One of the more interesting features: Skip exposes an MCP (Model Context Protocol) server that lets you query your video knowledge base directly from tools like Cursor or Claude.

MCP is a protocol that lets AI assistants call external tools. Our MCP server exposes a search_knowledge_base tool that runs the same semantic search pipeline. So while you're coding in Cursor, you can ask "what did that video say about React Server Components?" and get an answer pulled from your library — without leaving your editor.

The implementation is a FastMCP server that authenticates against the same user session and queries the same pgvector store. No separate index, no data duplication.

What I'd Do Differently

A few things I've learned building this:

Start with pgvector. I spent time evaluating Pinecone, Qdrant, and Weaviate before realizing that pgvector in Supabase was the right choice for our scale. Don't over-engineer your vector store.
Timestamp preservation is non-negotiable. The ability to jump to the exact moment in a video transforms the UX. Don't throw away temporal metadata during chunking.
Rate limiting matters more than you think. YouTube will block your IP if you hit the transcript API too aggressively. The semaphore-based approach works, but you also need exponential backoff and graceful degradation.
Embedding caching pays for itself immediately. Users run the same searches repeatedly, and Redis caching cut our embedding API costs significantly.
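For the rate-limiting point, a generic retry helper with exponential backoff and jitter might look like the sketch below. The name `retry_with_backoff` and the default delays are assumptions. The helper catches every exception for brevity; real code would retry only on errors that indicate throttling or blocking.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep):
    """Call func() and retry on failure with exponentially growing waits.

    The wait before retry n is a random value between 0 and
    min(max_delay, base_delay * 2**n), often called "full jitter".
    The last failure is re-raised so the caller can degrade gracefully.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists so tests can pass a stub instead of really waiting. Randomizing the wait keeps several workers that were blocked at the same moment from all retrying at the same moment.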

Try It

Skip is live at getskip.dev. The free tier gives you 50 videos and 100 messages per month — enough to build a meaningful knowledge base and see if the workflow clicks for you.

If you're a developer who learns from YouTube (and who doesn't?), I'd love to hear what you think. The Chrome extension makes importing videos one click, and the MCP integration is worth trying if you use Cursor or Claude.

The best part of building this has been watching the search results surface things I forgot I watched. That conference talk from six months ago where someone explained exactly the pattern I need right now? It's there. I just have to ask.


Frequently Asked Questions

What tech stack is Skip built with?

Skip uses Next.js (TypeScript) for the frontend, FastAPI (Python) for the backend, Celery with Redis for async task processing, and Supabase (PostgreSQL with pgvector) for the database and vector store. Embeddings are generated using OpenAI's text-embedding-3-small model.

Why use pgvector instead of a dedicated vector database?

pgvector keeps relational data and vectors in the same database, giving you transactional consistency (deleting a video removes its vectors atomically) and operational simplicity (one database to manage). For personal knowledge bases with thousands of videos, pgvector performance is more than sufficient.

How does semantic search work on video transcripts?

Video transcripts are chunked into segments, each embedded into a 512-dimensional vector using text-embedding-3-small. When you search, your query is embedded the same way and compared via cosine similarity against all chunks. This finds matches by meaning — 'database migrations' matches 'schema changes in production' — not just keywords.

What is RAG and how does Skip use it?

RAG (Retrieval-Augmented Generation) retrieves relevant transcript chunks via vector search, then passes them as context to an LLM that generates a grounded answer. Skip's RAG pipeline includes citations with video timestamps, so every answer is traceable back to its source.

Can I query my video knowledge base from my code editor?

Yes. Skip provides an MCP (Model Context Protocol) server that integrates with tools like Cursor and Claude. You can search your video library with natural language directly from your editor, using the same semantic search pipeline as the web interface.

