What You Can Build
Semantic video search lets you find moments in video footage using natural language queries like "find where someone enters the frame" or "show me the car with a bicycle carrier." Instead of transcribing audio or captioning every frame, you embed video chunks directly into a vector space and search by semantic similarity. The key breakthrough: this now works entirely locally on consumer hardware using Qwen3-VL-Embedding models.
The Setup
SentrySearch is a CLI tool (github.com/ssrajadh/sentrysearch) that handles the pipeline: it splits videos into overlapping 30-second chunks, embeds each chunk using Qwen3-VL, stores embeddings in ChromaDB, and returns matching clips when you query in natural language.
Two Qwen3-VL embedding models are available on Hugging Face:
- Qwen3-VL-Embedding-8B — ~18GB VRAM, runs on consumer GPUs like RTX 4090
- Qwen3-VL-Embedding-2B — ~6GB VRAM, small enough for laptops, including integrated and unified-memory GPUs
Both run on Apple Silicon (via MPS backend) and NVIDIA CUDA. No API calls needed.
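As a sanity check on those VRAM figures, the weight footprint alone can be estimated from parameter count times bytes per parameter; the extra headroom in the numbers above covers activations and framework overhead. A rough sketch (the helper below is illustrative, not part of the tool):

```python
# Rough VRAM estimate for loading model weights in 16-bit precision.
# Actual usage is higher: activations, caches, and framework overhead.
def weight_vram_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    # fp16/bf16 weights take 2 bytes per parameter
    return params_billion * 1e9 * bytes_per_param / 1e9

print(weight_vram_gb(8))  # 16.0 -- weights alone; ~18GB total with overhead
print(weight_vram_gb(2))  # 4.0  -- weights alone; ~6GB total with overhead
```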
Step-by-Step Installation
First, clone the repo and install dependencies:
git clone https://github.com/ssrajadh/sentrysearch.git
cd sentrysearch
pip install -r requirements.txt
You'll need PyTorch, Hugging Face transformers, and ChromaDB; ffmpeg handles the video processing and must be installed separately, since it's a system binary rather than a pip package.
Indexing Your Videos
To index videos for search, run:
sentrysearch index --videos /path/to/your/videos --model qwen3-vl
This splits each video into overlapping 30-second segments (default 5-second overlap), generates embeddings using the local Qwen3-VL model, and stores them in a local ChromaDB vector database. The embedding model treats video chunks like images — it sees the frames directly without any text intermediate.
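The chunking step is simple to reason about: with a 30-second window and a 5-second overlap, each chunk starts 25 seconds after the previous one. A minimal sketch of the boundary math (chunk_bounds is a hypothetical helper; the CLI does this internally):

```python
# Compute (start, end) boundaries for overlapping chunks of a video.
def chunk_bounds(duration_s: float, chunk_s: float = 30.0, overlap_s: float = 5.0):
    step = chunk_s - overlap_s  # each chunk starts 25s after the previous one
    bounds, start = [], 0.0
    while start < duration_s:
        bounds.append((start, min(start + chunk_s, duration_s)))
        if start + chunk_s >= duration_s:
            break  # this chunk already reaches the end of the video
        start += step
    return bounds

print(chunk_bounds(70))  # [(0.0, 30.0), (25.0, 55.0), (50.0, 70.0)]
```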
If you prefer using Gemini's API instead, swap the flag to --model gemini.
Searching Your Footage
Once indexed, search using natural language:
sentrysearch search "a person walking toward the camera"
The tool returns matching video clips with timestamps. Under the hood, it computes cosine similarity between your text query embedding and the stored video chunk embeddings, then uses ffmpeg to trim the matching segments.
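The similarity step itself fits in a few lines. The vectors below are toy 2-D embeddings; real Qwen3-VL embeddings have far more dimensions, but the ranking logic is the same:

```python
import math

# Cosine similarity between two vectors: dot product over norms.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank stored chunk embeddings against a query embedding, best first.
def top_matches(query, chunks, k=3):
    scored = [(chunk_id, cosine(query, emb)) for chunk_id, emb in chunks]
    return sorted(scored, key=lambda s: s[1], reverse=True)[:k]

chunks = [
    ("clip_00:00", [1.0, 0.0]),
    ("clip_00:25", [0.7, 0.7]),
    ("clip_00:50", [0.0, 1.0]),
]
print(top_matches([1.0, 0.1], chunks, k=2))  # clip_00:00 ranks first
```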
For better precision on large datasets, you can add Qwen3-VL-Reranker as a second stage — it refines the top-100 initial matches with more accurate scoring.
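The two-stage pattern looks roughly like this; the rerank function and the scores are illustrative stand-ins, not the actual Qwen3-VL-Reranker API:

```python
# Two-stage retrieval: fast vector search proposes a shortlist, then a
# stronger (here simulated) reranker re-scores only those candidates.
def rerank(query, candidates, score_fn, top_k=100):
    shortlist = candidates[:top_k]  # e.g. the top-100 first-stage matches
    rescored = [(c, score_fn(query, c)) for c in shortlist]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

# Stand-in scores -- a real reranker would score (query, clip) pairs itself.
stage1_order = ["clip_a", "clip_b", "clip_c"]
fake_scores = {"clip_a": 0.4, "clip_b": 0.9, "clip_c": 0.6}
print(rerank("a person walking", stage1_order, lambda q, c: fake_scores[c]))
```

The point of the shortlist is cost: the reranker is slower per pair than a vector lookup, so it only sees the candidates the cheap first stage already surfaced.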
Performance Notes
Indexing runs at roughly 1 FPS for embedding generation. Search itself is near-instant since you're just doing vector similarity against stored embeddings. A 1-hour video indexes in under an hour on a single RTX 4090.
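That throughput figure makes indexing time easy to estimate. Assuming frames are sampled at about 1 fps (an assumption on my part; the tool's actual sampling rate may differ), the back-of-envelope math lines up with the roughly one-hour figure:

```python
# Estimate indexing time from sampling rate and embedding throughput.
def index_time_minutes(video_minutes, sample_fps=1.0, embed_fps=1.0):
    frames = video_minutes * 60 * sample_fps  # frames to embed
    return frames / embed_fps / 60            # minutes of processing

print(index_time_minutes(60))  # 60.0 -- about an hour for an hour of footage
```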
The 8B model produces noticeably better results than the 2B variant, particularly for complex queries involving multiple objects or actions. If your GPU has less than 16GB VRAM, stick with the 2B model or fall back to a cloud API instead.
Why This Matters
Traditional video search requires transcription, object detection, or manual tagging. This approach embeds visual content directly — no labels, no transcription, no text intermediate. You can search surveillance footage, find moments in recorded meetings, or locate specific actions in hours of b-roll without any preprocessing.
The local-only aspect is the key draw: no API costs, no data leaving your machine, and no rate limits.