SceneSeeker: Distributed Video Retrieval Engine
SceneSeeker is a distributed system that indexes video content by dialogue and enables precise scene retrieval. Users can input a text query (e.g., "Seven! Seven! Seven!"), and the engine returns a frame-perfect video clip of that specific scene.
🚀 Key Features
- Distributed Architecture: Decouples the API (Go) from the heavy processing (Python) using a message broker.
- SRT-Based Indexing: Uses the subtitle streams already embedded in the video, so indexing takes roughly 0.01x of the video's runtime.
- Full-Text Search: Uses PostgreSQL tsvector for high-performance dialogue matching.
- Smart Clipping: Performs frame-accurate video cutting and transcoding using FFmpeg.
- S3-Compatible Storage: Uses MinIO for scalable object storage.
🏗 System Architecture
The system follows a Controller-Worker pattern. The Go backend handles user interactions and state, while stateless Python workers handle media processing.
The Stack
| Component | Technology | Responsibility |
|---|---|---|
| Orchestrator | Go (Golang) | HTTP API, Job Scheduling, State Management. |
| Worker | Python | FFmpeg operations, Subtitle parsing (pysrt). |
| Broker | Redis | Asynchronous job queues (queue:ingest, queue:clip). |
| Database | PostgreSQL | Relational data & Full Text Search (tsvector). |
| Storage | MinIO | S3-compatible object storage for raw videos and clips. |
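As a rough illustration of how the two queues in the table could carry work between the orchestrator and the workers, the sketch below encodes jobs as JSON and routes them by type. The queue names come from the table; the payload fields (`type`, `video_id`, `s3_path`, `start_time`, `end_time`), the helper names, and the in-memory `deque` standing in for Redis are all assumptions made for the example, not the project's actual wire format.

```python
import json
from collections import deque

# Queue names come from the stack table; the payload fields are assumptions.
INGEST_QUEUE = "queue:ingest"
CLIP_QUEUE = "queue:clip"


def encode_job(job_type, **fields):
    """Serialize a job as JSON so the Go producer and Python consumer can share it."""
    return json.dumps({"type": job_type, **fields}, sort_keys=True)


def handle_job(raw, handlers):
    """Decode one job and route it to the handler registered for its type."""
    job = json.loads(raw)
    handler = handlers.get(job["type"])
    if handler is None:
        raise ValueError(f"unknown job type: {job['type']}")
    return handler(job)


# In-memory stand-in for the Redis lists (a real worker would block on the broker).
queues = {INGEST_QUEUE: deque(), CLIP_QUEUE: deque()}
queues[INGEST_QUEUE].append(encode_job("ingest", video_id=1, s3_path="raw/movie.mkv"))
queues[CLIP_QUEUE].append(encode_job("clip", video_id=1, start_time=12.5, end_time=15.0))

# Hypothetical handlers that only describe the work they would do.
handlers = {
    "ingest": lambda job: f"index subtitles for video {job['video_id']}",
    "clip": lambda job: f"cut {job['end_time'] - job['start_time']:.1f}s from video {job['video_id']}",
}

results = [handle_job(queues[name].popleft(), handlers) for name in (INGEST_QUEUE, CLIP_QUEUE)]
```

Keeping the payload as plain JSON means neither side needs to know the other's language, which fits the decoupling described above.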
Visual Workflow
sequenceDiagram
participant User
participant Go as Go Orchestrator
participant Redis
participant Py as Python Worker
participant S3 as MinIO
participant DB as Postgres
Note over Go, Py: Flow 1: Ingestion
User->>Go: Upload Video (Stream)
Go->>S3: Stream to Bucket
Go->>Redis: Push Job (Ingest)
Redis->>Py: Pop Job
Py->>S3: Read Header/Stream
Py->>Py: Extract SRT (FFmpeg)
Py->>DB: Index Subtitles
Note over Go, Py: Flow 2: Retrieval
User->>Go: Search "Seven!"
Go->>DB: FTS Query
DB-->>Go: Hit (VideoID, StartTime)
Go->>Redis: Push Job (Clip)
Redis->>Py: Pop Job
Py->>S3: Seek & Transcode (FFmpeg)
Py->>S3: Upload Clip
Py-->>Go: Job Complete
Go->>User: Return Clip URL
💾 Database Schema
We use PostgreSQL for its robust text search capabilities.
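To make the search side concrete, here is a minimal sketch of the kind of query a dialogue lookup might issue against the segments table defined in this section. In the architecture above the Go orchestrator runs the search; Python is used here only for illustration, since the SQL is the same either way. The function name, the use of `plainto_tsquery` with `ts_rank` ordering, the `%s` placeholder style, and the default limit are assumptions.

```python
def build_search_query(text, limit=5):
    """Return (sql, params) for a ranked full-text search over segments."""
    if not text.strip():
        raise ValueError("search text must not be empty")
    sql = (
        "SELECT s.video_id, s.start_time, s.end_time, s.content "
        # plainto_tsquery turns free text into a tsquery, ignoring punctuation.
        "FROM segments AS s, plainto_tsquery('english', %s) AS q "
        # The @@ match operator can use the GIN index on content_vector.
        "WHERE s.content_vector @@ q "
        "ORDER BY ts_rank(s.content_vector, q) DESC "
        "LIMIT %s"
    )
    # Values are passed separately so the driver can bind them safely.
    return sql, (text, limit)


sql, params = build_search_query("Seven! Seven! Seven!")
```

Passing the user's text as a bound parameter, rather than formatting it into the SQL string, avoids injection problems with quotes in dialogue.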
videos
Metadata for the raw files.
CREATE TABLE videos (
id SERIAL PRIMARY KEY,
filename TEXT NOT NULL,
s3_path TEXT NOT NULL,
duration FLOAT,
status VARCHAR(20) DEFAULT 'PENDING' -- PENDING, INDEXED, FAILED
);
segments
The searchable dialogue chunks.
CREATE TABLE segments (
id SERIAL PRIMARY KEY,
video_id INT REFERENCES videos(id),
start_time FLOAT NOT NULL,
end_time FLOAT NOT NULL,
content TEXT NOT NULL,
-- The Magic: Pre-computed search vector
content_vector TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
);
-- Index for lightning-fast text search
CREATE INDEX idx_segments_content ON segments USING GIN(content_vector);
🛠 Project Structure
.
├── docker-compose.yml      # Orchestrates Go, Python, Redis, Postgres, MinIO
├── Makefile                # Build scripts and Proto generation
├── api/                    # The Go Orchestrator
│   ├── cmd/main.go         # Entry point
│   └── internal/
│       ├── handlers/       # HTTP Controllers
│       ├── queue/          # Redis Producer logic
│       └── db/             # Postgres repositories
├── worker/                 # The Python Processor
│   ├── src/
│   │   ├── main.py         # Worker Loop (Redis Consumer)
│   │   ├── extractor.py    # Subtitle extraction logic
│   │   └── clipper.py      # FFmpeg wrapping logic
│   ├── Dockerfile
│   └── requirements.txt
└── protobuf/               # Shared Protocol Buffers (if using gRPC for status)
⚡️ Ingestion Logic (Python)
Instead of using heavy AI models (Whisper), we extract embedded tracks. This makes ingestion extremely lightweight.
- Probe: Check video for codec_type='subtitle'.
- Extract: Run ffmpeg -i video.mkv -map 0:s:0 subs.srt.
- Parse: Use pysrt to parse timestamps.
- Index: Bulk insert into Postgres.
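The Parse step can be sketched as follows. To keep the example runnable without dependencies it uses a small regex-based parser instead of pysrt, so it is a stand-in for the technique rather than the worker's actual code. The function names and the sample subtitle text are invented for illustration; the output tuples are shaped to match the start_time, end_time, and content columns of the segments table.

```python
import re

# Matches an SRT timing line such as "00:01:02,500 --> 00:01:04,000".
TIMING = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    """Convert the captured timestamp fields to seconds as a float."""
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of (start_time, end_time, content) tuples."""
    rows = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines() if line.strip()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                parts = match.groups()
                # Everything after the timing line is dialogue; join multi-line cues.
                content = " ".join(lines[i + 1:])
                if content:
                    rows.append((to_seconds(*parts[:4]), to_seconds(*parts[4:]), content))
                break
    return rows


# Invented sample data in SRT layout.
sample = """1
00:00:01,000 --> 00:00:02,500
Seven! Seven! Seven!

2
00:01:02,250 --> 00:01:04,000
Say it
again.
"""
rows = parse_srt(sample)
```

Each tuple can then be paired with the video's id and written to Postgres in a single batch.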
✂️ Clipping Logic (FFmpeg)
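To ensure clips are playable and frame-accurate, we re-encode the requested segment rather than stream-copying it. A stream copy can only cut cleanly at keyframes (I-frames), so a clip that starts between keyframes would begin at the wrong frame or fail to decode properly.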
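# The logic inside worker/src/clipper.py
# s3_url, start_time, and duration are supplied by the clip job.
import ffmpeg
(
    ffmpeg.input(s3_url, ss=start_time)  # seek to the start of the segment
    .output("clip.mp4", t=duration, vcodec="libx264", acodec="aac")  # re-encode video and audio
    .run()
)
🚀 Getting Started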
Start Infrastructure:
docker-compose up -d postgres redis minio
Run Migrations:
# (Assuming migrate tool is installed)
migrate -path ./migrations -database "postgres://user:pass@localhost:5432/sceneseeker" up
Start Services:
docker-compose up --build