Amazon demonstrated natural-language search across large video archives with Nova
AI-processed from AWS Machine Learning Blog; edited by Hamidun News
Amazon has shown how to build search over large video archives without manual tagging or rigid keyword matching. Instead of relying on predefined tags, the system builds multimodal embeddings from each video's audio and visuals, then searches by meaning through OpenSearch.
How the Search Works
The solution combines Amazon Nova Multimodal Embeddings with Amazon OpenSearch Service. Videos are uploaded to S3, after which the asynchronous Nova API automatically cuts them into 15-second segments and builds 1024-dimensional vectors in AUDIO_VIDEO_COMBINED mode. This matters because the model takes into account both the picture and the sound, so the search captures the context of the scene itself rather than individual words in a caption: who is speaking, what is happening in the frame, and what the mood of the fragment is.
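To make the segmentation step concrete, here is a minimal sketch of how a video's duration maps to 15-second segment boundaries. Nova performs the actual splitting on the service side; the function name `segment_bounds` is hypothetical, and the handling of the last segment (simply shorter when the duration is not a multiple of 15) is an assumption, not something the source confirms.

```python
from typing import List, Tuple

SEGMENT_SECONDS = 15  # segment length reported in the article


def segment_bounds(duration_s: float,
                   segment_s: float = SEGMENT_SECONDS) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, in seconds, covering the whole video.

    Assumption: the final segment is shorter when the duration is not
    a multiple of the segment length.
    """
    bounds: List[Tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 45-second video yields three segments, each of which would get its own vector.
    for start, end in segment_bounds(45):
        print(f"{start:>5.1f}s to {end:>5.1f}s")
```

Storing these boundaries next to each vector is one way to let a search hit point to a specific moment in a video rather than to the whole file.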
Separately, AWS suggests running the videos through Nova Pro or Nova 2 Lite to generate 10–15 descriptive tags according to a predefined taxonomy. As a result, the system maintains two indexes: a vector index for semantic search and a text index for keyword search. This removes the need to choose between "smart" search and metadata filtering: both approaches are combined in a single interface.
In effect, the same archive can be searched with queries like "a person walking on a beach at sunset" as well as with strict text filters. The system supports three search modes:
- Text search across video: a natural language query is converted to an embedding and compared with video segments.
- Similar video search: the system takes the vector of an already known video and finds fragments similar in content.
- Hybrid search: k-NN and BM25 results are combined, by default with weights of 70% on semantics and 30% on text.
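The hybrid mode can be illustrated with a small client-side sketch that blends the two result lists using the 70/30 default. The source does not say how scores are normalized before weighting, so the min-max normalization here is an assumption, and the names `min_max` and `hybrid_scores` are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so k-NN and BM25 values become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_scores(knn: Dict[str, float],
                  bm25: Dict[str, float],
                  semantic_weight: float = 0.7) -> List[Tuple[str, float]]:
    """Blend semantic and text scores; a missing score counts as 0."""
    text_weight = 1.0 - semantic_weight
    knn_n, bm25_n = min_max(knn), min_max(bm25)
    combined = {
        doc_id: semantic_weight * knn_n.get(doc_id, 0.0)
        + text_weight * bm25_n.get(doc_id, 0.0)
        for doc_id in set(knn_n) | set(bm25_n)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Made-up scores for four segments, for illustration only.
    knn_hits = {"a": 0.9, "b": 0.5, "c": 0.1}
    bm25_hits = {"b": 12.0, "c": 4.0, "d": 8.0}
    for doc_id, score in hybrid_scores(knn_hits, bm25_hits):
        print(doc_id, round(score, 3))
```

With these made-up scores, the 70/30 default ranks segment "a" first on semantic similarity alone, while a 50/50 split moves "b" to the top because it matches both ways.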
Scale and Economics
AWS tested the design not on a demo set of a few files, but on a corpus of approximately 792,000 videos from the Multimedia Commons and MEVA datasets. This amounts to about 8,480 hours of content, or 30.5 million seconds.
Processing the full corpus took 41 hours on four c7i.48xlarge instances with 600 parallel workers. However, Bedrock limits each account to 30 simultaneous async tasks, so the example uses a job queue that polls task status and submits new videos as slots become available.
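The queue-with-polling pattern can be sketched as follows. This is not the AWS sample code: `run_with_slot_limit` is a hypothetical helper, and the `submit` and `get_status` callables stand in for whatever starts an async task and reads its state. The status strings "Completed" and "Failed" are assumptions of this sketch.

```python
import time
from collections import deque
from typing import Callable, Dict, Iterable

MAX_CONCURRENT_JOBS = 30  # per-account async limit cited in the article


def run_with_slot_limit(video_uris: Iterable[str],
                        submit: Callable[[str], str],
                        get_status: Callable[[str], str],
                        max_concurrent: int = MAX_CONCURRENT_JOBS,
                        poll_interval_s: float = 5.0) -> Dict[str, str]:
    """Process all videos while never exceeding max_concurrent active jobs.

    submit(uri) must return a job id; get_status(job_id) must return a
    status string (hypothetical stand-ins for the real API calls).
    """
    pending = deque(video_uris)
    in_flight: Dict[str, str] = {}  # job id -> video URI
    results: Dict[str, str] = {}    # video URI -> final status

    while pending or in_flight:
        # Fill every free slot with the next queued video.
        while pending and len(in_flight) < max_concurrent:
            uri = pending.popleft()
            in_flight[submit(uri)] = uri

        # Poll active jobs and release the slots of finished ones.
        for job_id in list(in_flight):
            status = get_status(job_id)
            if status in ("Completed", "Failed"):
                results[in_flight.pop(job_id)] = status

        if in_flight:
            time.sleep(poll_interval_s)

    return results
```

Passing the two callables in keeps the scheduling logic testable without touching a real service; in a real pipeline they would wrap the actual task submission and status calls, and failed videos would likely be retried rather than just recorded.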
The costs are also fairly transparent. AWS estimates the first year of such a system at approximately $23.6–27.3 thousand, depending on the chosen OpenSearch pricing model. Of this amount, about $18.1 thousand goes to one-time ingestion and embedding generation, while the rest covers a year of running the search layer.
The largest cost is not EC2 compute but the embeddings themselves, because Nova is billed by video duration.
- approximately $17,096: generation of multimodal embeddings in Amazon Bedrock
- approximately $571: auto-tagging via Nova Pro
- approximately $421: EC2 compute for batch processing
- from $5,544 to $9,240 per year: storage and search in OpenSearch, depending on the pricing model
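The line items above can be checked with a few lines of arithmetic. The figures are taken from the article; the per-hour embedding figure is only an implied average derived from them, not an official price.

```python
# One-time costs in USD, as listed in the article.
ONE_TIME_COSTS_USD = {
    "embeddings": 17_096,
    "auto_tagging": 571,
    "ec2_batch": 421,
}
# Annual OpenSearch cost range in USD, depending on the pricing model.
OPENSEARCH_ANNUAL_USD = (5_544, 9_240)
CORPUS_HOURS = 8_480

one_time_usd = sum(ONE_TIME_COSTS_USD.values())
first_year_low = one_time_usd + OPENSEARCH_ANNUAL_USD[0]
first_year_high = one_time_usd + OPENSEARCH_ANNUAL_USD[1]
total_seconds = CORPUS_HOURS * 3600

# Implied average, derived from the totals above (not a published rate).
embedding_usd_per_hour = ONE_TIME_COSTS_USD["embeddings"] / CORPUS_HOURS

print(f"one-time total:  ${one_time_usd:,}")
print(f"first year:      ${first_year_low:,} to ${first_year_high:,}")
print(f"corpus length:   {total_seconds:,} seconds")
print(f"embedding cost:  about ${embedding_usd_per_hour:.2f} per hour of video")
```

The totals come out to $18,088 one-time and $23,632 to $27,328 for the first year, which matches the rounded figures of $18.1 thousand and $23.6–27.3 thousand.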
AWS also explains why 1024-dimensional vectors were chosen over 3072-dimensional ones: the generation cost is the same, but storage becomes approximately three times cheaper with minimal loss of accuracy. On the search side, latency already looks production-grade: semantic k-NN takes approximately 76 ms, BM25 about 30 ms, and hybrid mode about 106 ms. Across the entire corpus, the indexes occupy about 29.8 GB, so even a large video archive does not require exotic infrastructure.
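The threefold storage difference follows directly from the vector length. The rough estimate below assumes 4-byte float32 values and counts only whole 15-second segments, so it is a lower bound on the raw vector payload, not a reconstruction of the 29.8 GB figure, which presumably also covers index structures and the text index.

```python
TOTAL_SECONDS = 8_480 * 3600   # corpus length from the article
SEGMENT_SECONDS = 15           # segment length from the article
BYTES_PER_VALUE = 4            # assumption: float32 storage


def raw_vector_gb(dim: int, n_vectors: int,
                  bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Raw size of n_vectors embeddings of length dim, in decimal GB."""
    return dim * n_vectors * bytes_per_value / 1e9


# Lower bound: ignores the shorter trailing segment of each video.
approx_segments = TOTAL_SECONDS // SEGMENT_SECONDS

size_1024 = raw_vector_gb(1024, approx_segments)
size_3072 = raw_vector_gb(3072, approx_segments)

print(f"segments (approx.): {approx_segments:,}")
print(f"1024-dim vectors:   {size_1024:.1f} GB")
print(f"3072-dim vectors:   {size_3072:.1f} GB")
print(f"ratio:              {size_3072 / size_1024:.1f}x")
```

Under these assumptions, roughly two million segments need about 8.3 GB of raw vectors at 1024 dimensions and about 25 GB at 3072.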
Practical Nuances
This material matters less as an announcement of yet another model than as a ready-made engineering template. AWS essentially shows how to move from manual video tagging to an AI data lake, where search is built around embeddings rather than human-written descriptions. For teams in media and entertainment companies, this can address several tasks at once: finding duplicates, navigating the archive, quickly selecting b-roll, and building internal tools for editors, producers, and archivists.
There are limitations, too. To launch it, you need Bedrock in the us-east-1 region, OpenSearch 2.11 or newer, S3, and configured IAM permissions.
Speed and price depend directly on video length: in the test, a 45-second video was processed in about 70 seconds. If your metadata is good, AWS recommends raising the weight of text search, up to a 50/50 split. And if your library keeps growing, the processing logic can be moved to AWS Batch and scaled incrementally.
What This Means
Amazon shows that multimodal video search is no longer a research toy but a well-understood infrastructure pattern. For media teams, this is a chance to stop depending on manual tags and finally search the archive the way people actually phrase their queries: in plain language.