Video Analytics in Cities: Why Classic Video Processing Is Ineffective
Classical video analysis struggles in cities because of occlusion, variable lighting, and sparse objects. Traditional algorithms cannot handle that complexity.
AI-processed from Habr AI; edited by Hamidun News
Relying on classical video processing in urban environments is wishful thinking. Traditional algorithms for motion detection and object tracking collapse against the reality of busy streets, variable lighting, and occlusion. Developers of smart video analytics found a way out: moving to neural network models and an architecture that adapts to each specific scenario.
Why the Classical Approach Doesn't Work
In cities, video analytics encounters a series of critical problems:
- Occlusion: people and cars block each other, so objects appear in and disappear from the frame
- Lighting variability: conditions range from low dawn sun to street lights at night
- Sparse objects: a rare target, such as a single cyclist, has to be tracked within a dense traffic flow
- Reflections and shadows: storefront windows and puddles on the pavement confuse algorithms
- Camera drift: vibration, wind, and seasonal shifts of the mounting move the field of view
Classical methods (threshold-based pixel change detection, optical flow) react to any change in pixel values, not to objects. As a result they produce dozens of false positives per hour and still miss suspicious events.
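To make the false-positive problem concrete, here is a minimal sketch of threshold-based pixel change detection on synthetic frames. The frame size, the threshold of 25 intensity levels, and the 5% trigger fraction are illustrative assumptions, not values from any real system.

```python
def changed_fraction(prev, curr, threshold=25):
    """Fraction of pixels whose absolute intensity change exceeds the threshold."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += 1
            if abs(c - p) > threshold:
                changed += 1
    return changed / total


def motion_detected(prev, curr, threshold=25, min_fraction=0.05):
    """Report motion when enough pixels changed between two frames."""
    return changed_fraction(prev, curr, threshold) >= min_fraction


# Synthetic 8x8 grayscale frames (values 0..255), purely illustrative.
background = [[100] * 8 for _ in range(8)]

# Case 1: a small bright object covering 2x2 pixels enters the scene.
with_object = [row[:] for row in background]
for y in range(3, 5):
    for x in range(3, 5):
        with_object[y][x] = 220

# Case 2: nothing moves, but the whole scene gets 40 levels brighter
# (for example, the sun comes out from behind a cloud).
brighter = [[min(255, v + 40) for v in row] for row in background]

print(motion_detected(background, with_object))  # True: 4 of 64 pixels changed
print(motion_detected(background, brighter))     # True: a false positive
```

The detector cannot tell the two cases apart, because it only counts changed pixels. Raising the threshold to suppress the lighting change would also hide dim or distant objects.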
Neural Network Models as Salvation
AI changes the rules. Modern YOLO-family detectors and Vision Transformers see objects, not pixels. They recognize people across a wide range of poses and clothing, vehicles regardless of viewing angle, faces and license plates, real-time actions (falls, fights), and anomalies (an abandoned suitcase, a person in the wrong place). This requires GPU acceleration. Urban surveillance systems use NVIDIA Jetson modules for edge computing, either at the camera itself or in a cabinet on a pole. A typical setup pairs RTX 4090 or A100 GPUs at the center with Jetson Orin devices at the perimeter.
The Stack Developers Choose
A modular architecture lets developers assemble systems from components:
- Object detection: YOLOv10, Faster R-CNN, or ViT-based detectors, with a target of 25-30 FPS even on 4K streams
- Tracking: Deep SORT (which complements the detector with appearance embeddings) or ByteTrack (which associates boxes by overlap and needs no appearance features)
- Behavior classification: separate models for attributes (gender, age, clothing type) and for actions (walking, standing, running, falling)
- Storage: video in H.265, which typically needs roughly half the bitrate of H.264 at comparable quality; metadata in SQL databases or in analytical stores suited to time-series data, such as ClickHouse
- Orchestration: Docker and Kubernetes at the network edge, Redis for caching hot data (current tracks), and Kafka or NATS for event streams between modules
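The tracking step without appearance features can be sketched as follows. This is a simplified illustration inspired by ByteTrack's idea of matching confident detections first and then using low-score detections to recover occluded objects. It is not the ByteTrack implementation: greedy matching stands in for the Hungarian algorithm, there is no Kalman motion prediction, and the function names and thresholds are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Greedily pair track ids with detection indices by descending IoU."""
    pairs = sorted(
        ((iou(box, det["box"]), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, tid, di in pairs:
        if score < min_iou:
            break  # pairs are sorted, so all remaining overlaps are too small
        if tid in matches or di in used_dets:
            continue
        matches[tid] = di
        used_dets.add(di)
    return matches


def associate(tracks, detections, high=0.5, min_iou=0.3):
    """Two-stage association: high-score detections first, then low-score ones."""
    strong = [d for d in detections if d["score"] >= high]
    weak = [d for d in detections if d["score"] < high]
    first = greedy_match(tracks, strong, min_iou)
    result = {tid: strong[di]["box"] for tid, di in first.items()}
    # Tracks left unmatched get a second chance against low-score detections.
    remaining = {tid: box for tid, box in tracks.items() if tid not in first}
    second = greedy_match(remaining, weak, min_iou)
    result.update({tid: weak[di]["box"] for tid, di in second.items()})
    return result


tracks = {1: (0, 0, 10, 10), 2: (20, 0, 30, 10)}
detections = [
    {"box": (1, 0, 11, 10), "score": 0.9},   # clearly visible object
    {"box": (21, 0, 31, 10), "score": 0.2},  # partly occluded, low confidence
]
print(associate(tracks, detections))
# {1: (1, 0, 11, 10), 2: (21, 0, 31, 10)}
```

Track 2 survives only because the low-score detection is not discarded, which is what helps under occlusion. A Deep SORT style tracker would additionally compare appearance embeddings when boxes overlap ambiguously.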
Adaptation Through Modularity
Each city and each intersection is unique. A modular architecture allows retraining detection models for local conditions in hours, changing feature weights via config, adding new detectors without rebuilding the pipeline, and switching off non-essential modules when resources are scarce. Some systems even use federated learning: models are trained across all city cameras at once, but the raw data stays local. This matters for GDPR compliance and privacy.
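One way to picture config-driven modularity is a registry of processing steps that a config file switches on or off. The sketch below is an assumption about how such a pipeline might be wired; the module names, the config layout, and the stand-in "detector" are hypothetical and do not describe any specific product.

```python
# Registry mapping module names to processing functions (all names hypothetical).
REGISTRY = {}


def register(name):
    """Decorator that adds a processing function to the registry."""
    def decorator(fn):
        REGISTRY[name] = fn
        return fn
    return decorator


@register("detector")
def detect(frame, params):
    # Stand-in for a real detector: keep candidates above a score threshold.
    frame["objects"] = [
        o for o in frame["candidates"] if o["score"] >= params["min_score"]
    ]
    return frame


@register("attributes")
def attributes(frame, params):
    # Stand-in for an attribute classifier.
    for obj in frame.get("objects", []):
        obj["attributes"] = {"clothing": "unknown"}
    return frame


def build_pipeline(config):
    """Turn a config into an ordered list of (function, params) steps."""
    steps = []
    for entry in config["modules"]:
        if not entry.get("enabled", True):
            continue  # disabled modules are skipped without code changes
        steps.append((REGISTRY[entry["name"]], entry.get("params", {})))
    return steps


def run(steps, frame):
    """Pass one frame through every enabled step in order."""
    for fn, params in steps:
        frame = fn(frame, params)
    return frame


config = {
    "modules": [
        {"name": "detector", "params": {"min_score": 0.5}},
        {"name": "attributes", "enabled": False},  # switched off to save resources
    ]
}
frame = {"candidates": [{"label": "person", "score": 0.8},
                        {"label": "car", "score": 0.3}]}
result = run(build_pipeline(config), frame)
print(result["objects"])  # [{'label': 'person', 'score': 0.8}]
```

Adding a new detector then means registering one more function and listing it in the config, while the rest of the pipeline stays untouched.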
What This Means
Video analytics in cities is no longer a black box. AI, modular architecture, and distributed computing together allow cities to build scalable smart surveillance systems that adapt to local conditions and don't require an army of developers on every corner.