Building SceneExtractor: A Semantic Search Engine for Video
Overview
SceneExtractor is a distributed video search engine that uses semantic embeddings to find scenes by natural language queries. Built with Go, Python, and vector embeddings, it processes videos asynchronously and returns frame-accurate clips in seconds.
Searching text is easy. Searching video is hard.
If I asked you to find the exact moment in a 2-hour movie where a character says "I need a million dollars," you'd probably spend 10 minutes scrubbing through the timeline. I built SceneExtractor to solve this. It's a distributed system that lets you search video content using natural language and retrieve frame-perfect clips instantly.
Here's a deep dive into how I built it using Go, Python, and vector embeddings.
The Architecture
SceneExtractor isn't a monolith; it's a distributed pipeline designed to handle heavy media processing without blocking user interactions.
The Stack
- Orchestrator (Go/Gin): The high-performance API gateway.
- Worker (Python): The heavy lifter for AI and FFmpeg operations.
- Message Broker (Redis): Handles asynchronous job queues.
- State (PostgreSQL): Tracks job status and metadata.
- Storage (MinIO): S3-compatible object storage for video files and clip results.
- Frontend (Next.js): The interface for uploading and searching.
How It Works
1. Ingestion & Extraction
Everything starts when a user uploads a video. The Go orchestrator streams this to MinIO and pushes a job to the queue:transcription Redis list.
The Python worker picks this up. Instead of trying to run heavy ML models on the fly during a search, we pre-process everything.
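The hand-off from Go to Python is nothing exotic: the worker just blocks on the same Redis list and deserializes whatever the orchestrator pushed. Here's a minimal sketch of that loop; the video_id and object_key payload fields are illustrative assumptions, not the real schema:

```python
import json
import redis

# Connect to the same Redis instance the orchestrator pushes to.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def run_worker():
    while True:
        # BLPOP blocks until the orchestrator pushes a job onto the list.
        _, raw_job = r.blpop("queue:transcription")
        job = json.loads(raw_job)
        # Illustrative payload fields; the real schema lives in the orchestrator.
        process_video(job["video_id"], job["object_key"])

def process_video(video_id: str, object_key: str) -> None:
    ...  # download from MinIO, extract audio, parse dialogue, embed
```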
- Audio Extraction: We use FFmpeg to rip the audio track.
- Transcription: (Currently utilizing subtitle tracks, with Whisper integration planned) We parse the dialogue into time-stamped segments.
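Both steps are small enough to sketch. Ripping the audio is a single FFmpeg call, and turning an SRT track into time-stamped segments is a short parser; the helper names and the 16 kHz mono WAV choice are my own assumptions:

```python
import re
import subprocess

def extract_audio(video_path: str, audio_path: str) -> None:
    # -vn drops the video stream; 16 kHz mono WAV is a common input format
    # for speech models, so it's a reasonable default for a future Whisper step.
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-vn", "-ac", "1", "-ar", "16000", audio_path],
        check=True,
    )

SRT_TIME = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+)")

def to_seconds(stamp: str) -> float:
    h, m, s, ms = SRT_TIME.match(stamp).groups()
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000

def parse_srt(path: str) -> list[dict]:
    """Parse an .srt file into [{'start': ..., 'end': ..., 'text': ...}, ...]."""
    segments = []
    for block in open(path, encoding="utf-8").read().strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, _, end = lines[1].partition(" --> ")
        segments.append({
            "start": to_seconds(start.strip()),
            "end": to_seconds(end.strip()),
            "text": " ".join(lines[2:]).strip(),
        })
    return segments
```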
2. The "Semantic" Magic
Traditional search matches keywords ("money" == "money"). Semantic search matches meaning. I used Sentence-Transformers (all-MiniLM-L6-v2) to convert every line of dialogue into a 384-dimensional vector. These vectors are stored in a local index (backed by NumPy for speed).
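The indexing step is only a few lines with sentence-transformers. A rough sketch, reusing the segments produced by the parser above; normalizing the embeddings up front is a choice I make here so the later similarity step becomes a plain dot product:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional output

def build_index(segments: list[dict]) -> np.ndarray:
    """Embed every dialogue line; returns an (n, 384) float array."""
    texts = [seg["text"] for seg in segments]
    # Normalizing here means cosine similarity later is just a dot product.
    return model.encode(texts, normalize_embeddings=True)
```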
When you search for "I need to buy a house", the system:
- Converts your query into a vector.
- Calculates the Cosine Similarity between your query vector and all the dialogue vectors (sketched below).
- Finds the closest match—even if the exact words are different (e.g., "I'm looking to purchase a home").
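Because the stored vectors are already unit-normalized, that matching step collapses to a single matrix-vector product in NumPy. A minimal sketch, continuing from the indexing snippet above (model, index, and segments carry over):

```python
import numpy as np

def search(query: str, index: np.ndarray, segments: list[dict], top_k: int = 3):
    """Return the top_k dialogue segments closest to the query in meaning."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = index @ query_vec          # cosine similarity, since both sides are unit vectors
    best = np.argsort(scores)[::-1][:top_k]
    return [(float(scores[i]), segments[i]) for i in best]
```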
3. Frame-Perfect Clipping
Once we find the best match (e.g., timestamp 00:45:12 to 00:45:15), the worker spins up an FFmpeg process to slice that exact segment from the source video file in MinIO and uploads the result.
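A sketch of that clipping step, assuming the source video has already been pulled from MinIO to local disk. I re-encode here because stream copy would snap cuts to keyframes; the endpoint, credentials, and "clips" bucket are placeholders:

```python
import subprocess
from minio import Minio

# Placeholder endpoint/credentials; in the real system these come from config.
client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

def cut_and_upload(source_path: str, start: float, end: float, object_name: str) -> None:
    clip_path = "/tmp/clip.mp4"
    # Re-encoding keeps the cut frame-accurate; "-c copy" would be faster
    # but can only cut on keyframes.
    subprocess.run(
        ["ffmpeg", "-y", "-i", source_path, "-ss", str(start), "-to", str(end),
         "-c:v", "libx264", "-c:a", "aac", clip_path],
        check=True,
    )
    client.fput_object("clips", object_name, clip_path)
```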
Challenges & Learnings
Handling Large Files
Video files are huge. Buffering an entire 2GB file in RAM just to upload it is a recipe for crashing your server. I implemented streaming uploads in the Go orchestrator, piping the multipart/form-data stream directly to MinIO.
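The actual orchestrator is Go, but the streaming idea translates directly. Sketched in Python with the minio client for consistency with the other snippets: passing an unknown length together with a part size makes the SDK perform a multipart upload as bytes arrive, so the file never sits fully in memory (the bucket name and chunk size are assumptions):

```python
from minio import Minio

client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

def stream_upload(file_stream, object_name: str, bucket: str = "videos") -> None:
    # length=-1 marks the size as unknown; part_size makes the client push
    # 16 MB multipart chunks as they arrive, so a 2 GB upload never needs
    # to be buffered in RAM.
    client.put_object(bucket, object_name, file_stream, length=-1,
                      part_size=16 * 1024 * 1024)
```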
Inter-Service Communication
Decoupling the Orchestrator and Worker was crucial. Initially, I considered HTTP calls between them, but that couples their availability: if the Worker is down, the Orchestrator shouldn't fail to accept uploads. Moving to a Redis-based job queue made the system resilient. The Go service can queue up 100 jobs instantly, and the Python workers can churn through them at their own pace.
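The producer side of that contract is tiny. Sketched in Python for consistency (the orchestrator really does this in Go), it's just a push of a JSON payload onto the same list that the worker loop shown earlier drains; the payload fields are illustrative:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_transcription_job(video_id: str, object_key: str) -> None:
    # Fire-and-forget: the API call returns as soon as the job is queued,
    # and workers drain queue:transcription at their own pace.
    job = {"video_id": video_id, "object_key": object_key}  # illustrative schema
    r.rpush("queue:transcription", json.dumps(job))
```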
The "Context" Problem
Sometimes a dialogue line is too short to have semantic meaning ("Yes", "No", "Okay"). I improved search relevance by implementing a sliding window approach, encoding the target sentence plus its neighbors to give the model more context.
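A minimal sketch of that windowing step; the window of one neighbor on each side is a tunable assumption:

```python
def windowed_texts(segments: list[dict], window: int = 1) -> list[str]:
    """Join each segment's text with its neighbors before embedding."""
    texts = []
    for i in range(len(segments)):
        lo = max(0, i - window)
        hi = min(len(segments), i + window + 1)
        # A lone "Okay" is now embedded together with the lines spoken around it.
        texts.append(" ".join(seg["text"] for seg in segments[lo:hi]))
    return texts

# Replaces the plain per-line encoding from earlier:
# index = model.encode(windowed_texts(segments), normalize_embeddings=True)
```

The timestamps used for clipping still come from the center segment, so only the text fed to the encoder gets wider.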
What's Next?
- Visual Search: Using CLIP models to search by visual content (e.g., "Find a scene with a red car").
- Whisper Integration: Moving away from SRT files to full audio-to-text generation.
- Vector DB: Migrating from in-memory NumPy arrays to a proper vector database like Chroma or Milvus for scaling to millions of videos.
Check out the code on GitHub.