AWS showed semantic video search on Amazon Bedrock with Nova Multimodal Embeddings
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
AWS demonstrated semantic video search on Amazon Bedrock using Amazon Nova Multimodal Embeddings and published a reference architecture that can be deployed on your own content. Rather than reducing everything to transcripts, as traditional approaches do, the system accounts for image, audio, speech, and structural metadata at the same time.
Why Text Alone Isn't Enough
Standard video search is typically built around text: speech transcripts, manual tags, or auto-generated captions. AWS takes a different approach and states explicitly that converting all video content to text loses important signals. If a user searches for "intense car chase with sirens," the query mixes visual events with audio cues. If the user is looking for a specific athlete, that person may be visible in the frame without their name ever being spoken. In cases like these, a transcript alone is not enough.
This is why the solution divides video into meaningful segments rather than arbitrary timer-based chunks. Nova Multimodal Embeddings supports up to 30 seconds per embedding, but the AWS example targets fragments of roughly 10 seconds and shifts their boundaries toward actual scene changes using FFmpeg. The algorithm keeps each segment between 5 and 15 seconds long: if there is a scene change nearby, the segment is cut there; if not, a hard boundary is set. This preserves context and avoids breaking a scene in the middle of an action or phrase.
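The boundary rule described above can be sketched in Python. This is a minimal illustration, not AWS's reference code: the function name `segment_boundaries`, the choice to place a hard cut exactly at the 10-second target, and the handling of the final remainder are all assumptions. Scene-change timestamps are taken as input; how they are extracted with FFmpeg is outside the scope of the sketch.

```python
from bisect import bisect_left, bisect_right


def segment_boundaries(duration, scene_changes,
                       target=10.0, min_len=5.0, max_len=15.0):
    """Split [0, duration] into segments of roughly `target` seconds.

    Each cut snaps to the scene change closest to the target length,
    provided one exists within [min_len, max_len] of the segment start.
    Otherwise a hard cut is placed at the target (an assumption; the
    article does not say where the hard boundary goes).
    """
    cuts = sorted(scene_changes)
    segments = []
    start = 0.0
    # Keep cutting while the remainder is too long for a single segment.
    while duration - start > max_len:
        lo = bisect_left(cuts, start + min_len)
        hi = bisect_right(cuts, start + max_len)
        candidates = cuts[lo:hi]
        ideal = start + target
        if candidates:
            end = min(candidates, key=lambda t: abs(t - ideal))
        else:
            end = ideal
        segments.append((start, end))
        start = end
    # Assumption: whatever is left becomes the final segment,
    # even if it is shorter than min_len.
    if duration > start:
        segments.append((start, duration))
    return segments
```

For a 40-second clip with scene changes at 8.0, 12.5, and 31.0 seconds, the sketch cuts at 8.0 (the change closest to the 10-second target), falls back to a hard cut at 18.0 because no change lies in the next window, and then snaps to 31.0.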
How the System Is Built
The architecture is divided into two workflows: ingestion and search. After a video is uploaded to Amazon S3, orchestration passes to Lambda and Step Functions, and the segments are then processed in parallel through multiple branches. For each fragment, the system builds separate representations for visual signals, audio, and speech, then combines them with metadata in an index. On the search side, the query is not encoded as a single unified vector: it is decomposed into multiple channels, and the results are re-ranked according to user intent.
- Video lands in S3 and triggers the pipeline through Lambda and Step Functions
- Fargate with FFmpeg finds scene changes and cuts the video into semantic segments
- Nova Multimodal Embeddings creates vectors for image and audio, and Amazon Transcribe provides the basis for speech embeddings
- Amazon Nova 2 Lite and Rekognition add segment captions, genre labels, and recognition of known people in the frame
- OpenSearch and S3 Vectors store the index so that vector search can be combined with exact text search
AWS emphasizes that visual, audio, and speech embeddings shouldn't be collapsed into a single vector if controlled precision is needed. In this scheme, the image channel covers objects, actions, and frame composition; the audio channel covers music, noise, and acoustic atmosphere; and the transcript channel carries the meaning of what is said. On top of this, a lexical channel is added via metadata: names, dates, genres, entities, and other data that semantic search may capture less effectively.
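One way to picture keeping the channels separate is a per-segment record that stores one vector per modality next to its lexical metadata. The sketch below is a toy illustration under stated assumptions: `SegmentRecord`, `cosine`, and `score_segment` are hypothetical names, the vectors are tiny hand-written lists rather than real embeddings, and the substring match stands in for what a search engine's text index would do.

```python
import math
from dataclasses import dataclass, field


@dataclass
class SegmentRecord:
    video_id: str
    start: float                 # segment start, seconds
    end: float                   # segment end, seconds
    visual: list                 # one vector per channel, never merged
    audio: list
    transcript: list
    metadata: dict = field(default_factory=dict)  # names, dates, genres


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_segment(query_vectors, query_terms, segment):
    """Return one score per channel instead of a single blended score.

    query_vectors maps a channel name ("visual", "audio", "transcript")
    to a query vector; channels that are absent are simply not scored.
    """
    scores = {ch: cosine(vec, getattr(segment, ch))
              for ch, vec in query_vectors.items()}
    # Toy lexical channel: share of query terms found in metadata values.
    haystack = " ".join(segment.metadata.values()).lower()
    hits = sum(1 for term in query_terms if term.lower() in haystack)
    scores["metadata"] = hits / len(query_terms) if query_terms else 0.0
    return scores
```

Because each channel keeps its own score, a later step can decide how much each one should count for a given query.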
How Accuracy Improves
The key element of the design is the query intent router. AWS uses Claude Haiku in Amazon Bedrock to return JSON with weights for four channels on each query: visual, audio, transcript, and metadata. The weights must sum to 1.0, and channels with a share below 5% aren't queried at all, which avoids unnecessary calls and added latency. Results from the different sources are then normalized to a 0–1 scale and combined with a weighted average instead of merging all signals equally.
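The weighting and fusion step can be sketched as follows. This is a hedged illustration rather than the AWS implementation: the function names, the channel keys, the renormalization of weights after low-share channels are dropped, the use of min-max scaling for the 0–1 normalization, and the treatment of a segment missing from a channel as scoring zero are all assumptions.

```python
def prune_weights(weights, min_share=0.05):
    """Drop channels below min_share and rescale the rest to sum to 1.0."""
    active = {ch: w for ch, w in weights.items() if w >= min_share}
    total = sum(active.values())
    if total == 0:
        raise ValueError("no channel reaches the minimum share")
    return {ch: w / total for ch, w in active.items()}


def min_max(scores):
    """Rescale raw scores to the 0-1 range (assumed normalization method)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {seg: 1.0 for seg in scores}
    return {seg: (s - lo) / (hi - lo) for seg, s in scores.items()}


def fuse(channel_scores, weights, min_share=0.05):
    """Weighted average of normalized per-channel scores.

    channel_scores maps channel -> {segment_id: raw_score}. Channels
    pruned by the weight threshold are ignored entirely, mirroring the
    idea that they are never queried.
    """
    fused = {}
    for ch, w in prune_weights(weights, min_share).items():
        for seg, s in min_max(channel_scores.get(ch, {})).items():
            fused[seg] = fused.get(seg, 0.0) + w * s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```

With router weights of 0.72 for visual, 0.24 for audio, 0.04 for transcript, and 0.0 for metadata, the last two channels are skipped and the remaining weights rescale to 0.75 and 0.25.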
In AWS's tests, the approach significantly outperforms the baseline AUDIO_VIDEO_COMBINED scheme. AWS ran a benchmark on 10 long internal videos, each 5 to 20 minutes in length, with 20 queries of different types. The hybrid scheme achieved Recall@5 of 90% versus 51%, Recall@10 of 95% versus 64%, MRR of 90% versus 48%, and NDCG@10 of 88% versus 54%. The company also highlights storage economics: according to AWS, Amazon S3 Vectors can reduce vector storage and query costs by up to 90% compared to specialized alternatives.
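For readers unfamiliar with these metrics, the sketch below shows their common textbook definitions with binary relevance. It is an assumption that AWS computed them exactly this way; the function names are illustrative, and the benchmark data itself is not reproduced here.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Share of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0.0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    dcg = sum(1.0 / math.log2(position + 1)
              for position, item in enumerate(ranked[:k], start=1)
              if item in relevant)
    ideal_hits = min(len(relevant), k)
    ideal = sum(1.0 / math.log2(position + 1)
                for position in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


def mean_over_queries(metric, runs):
    """Average a per-query metric; runs is a list of (ranked, relevant) pairs.

    Averaging reciprocal_rank this way gives MRR.
    """
    return sum(metric(ranked, relevant) for ranked, relevant in runs) / len(runs)
```

Recall@k rewards finding the relevant segments at all, while MRR and NDCG also reward placing them near the top of the list, which is why the benchmark reports all three.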
What This Means
AWS isn't just describing an embeddings model here; it is showing a practical template for product teams working with media libraries, broadcast archives, sports, educational content, or user-generated video. The core idea is straightforward: the less you force video into a single text or a single vector, the better your chances of finding the right moment accurately and quickly.