Video Analytics in Cities: Why Classic Video Processing Is Ineffective

Q: What is the source?

Originally published on Habr AI. Hamidun News processes and adapts the material with AI.

Q: When was it published?

2026-05-25. Reading time: 3 min.

Classical video analysis is poorly suited to cities because of occlusion, variable lighting, and sparse objects of interest. Traditional algorithms cannot handle that complexity.

Hamidun News Editorial

Relying on classical video processing in urban environments is wishful thinking. Traditional motion-detection and object-tracking algorithms break down on busy streets with variable lighting and frequent occlusion. Developers of smart video analytics have found a way out: neural network models and an architecture that adapts to each specific scenario.

Why the Classical Approach Doesn't Work

In cities, video analytics runs into a series of critical problems:

Occlusion: people and cars block each other, so objects appear in and disappear from the frame
Lighting variability: from low dawn sun to street lights at night
Sparse objects: the system must track, for example, a single cyclist within a traffic flow
Reflections and shadows: storefront windows and puddles on the pavement confuse algorithms
Camera drift: vibration, wind, and seasonal shifts in the mounting

Classical methods (thresholded pixel-change detection, optical flow) react to any change in pixel values, whatever its cause. As a result, they produce dozens of false positives per hour and still miss suspicious events.
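
To make the false-positive problem concrete, here is a minimal sketch of thresholded pixel-change detection on tiny synthetic grayscale frames. The function names, the 8x8 frames, and both thresholds are assumptions chosen for illustration; they are not taken from the article or from any specific product.

```python
# Illustrative sketch only: names, frame sizes and thresholds are assumptions.

def changed_fraction(prev, curr, pixel_threshold=25):
    """Fraction of pixels whose grayscale value changed by more than pixel_threshold."""
    total = 0
    changed = 0
    for row_prev, row_curr in zip(prev, curr):
        for p, c in zip(row_prev, row_curr):
            total += 1
            if abs(c - p) > pixel_threshold:
                changed += 1
    return changed / total if total else 0.0


def motion_detected(prev, curr, pixel_threshold=25, area_threshold=0.05):
    """Raise an alarm when more than area_threshold of the frame has changed."""
    return changed_fraction(prev, curr, pixel_threshold) > area_threshold


# An empty street: an 8x8 frame of uniform brightness.
empty_street = [[100] * 8 for _ in range(8)]

# The same empty street after the light changes: every pixel is 40 levels brighter.
brighter_street = [[140] * 8 for _ in range(8)]

# A real object: a 2x2 bright patch appears (4 of 64 pixels).
with_object = [row[:] for row in empty_street]
for y in range(3, 5):
    for x in range(3, 5):
        with_object[y][x] = 220

lighting_alarm = motion_detected(empty_street, brighter_street)  # True, a false positive
object_alarm = motion_detected(empty_street, with_object)        # True, a real change
```

Both calls raise an alarm, so the detector cannot tell a passing cloud from an arriving object. Raising the thresholds to suppress the lighting alarm would also suppress small real objects, which is the trade-off the list above describes.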

Neural Network Models as Salvation

AI changes the rules. Modern detectors from the YOLO family and Vision Transformers work at the level of objects, not pixels. They recognize people across a wide range of poses and clothing, vehicles from different viewing angles, faces and license plates, actions in real time (falls, fights), and anomalies (an unattended suitcase, a person where they should not be). This requires GPU acceleration. Urban surveillance systems use NVIDIA Jetson modules for edge computing, either in the camera itself or in a cabinet on a pole. A typical stack: RTX 4090 or A100 GPUs at the center, Jetson Orin at the perimeter.
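
Once a model reports objects rather than pixels, anomaly rules can be written against tracks. The sketch below flags an unattended suitcase, assuming an upstream detector and tracker already supply a class label and timestamped box centres. The `Observation` type, the label names, and the 15-pixel and 60-second thresholds are hypothetical.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Observation:
    t: float  # timestamp in seconds
    x: float  # box centre, pixels
    y: float


def stationary_seconds(track, max_drift_px=15.0):
    """Duration of the trailing run of observations that stay near the latest position.

    Assumes the track is sorted by time.
    """
    if not track:
        return 0.0
    last = track[-1]
    start = last.t
    for obs in reversed(track):
        if hypot(obs.x - last.x, obs.y - last.y) > max_drift_px:
            break
        start = obs.t
    return last.t - start


def is_abandoned(label, track, watched_labels=("suitcase", "backpack"), min_seconds=60.0):
    """Flag a watched object class that has not moved for at least min_seconds."""
    return label in watched_labels and stationary_seconds(track) >= min_seconds


# A suitcase is carried for 20 seconds, then left at (200, 50) until t = 90.
suitcase_track = [Observation(0, 0, 50), Observation(10, 100, 50), Observation(20, 200, 50)]
suitcase_track += [Observation(t, 200, 50) for t in range(30, 100, 10)]

suitcase_flag = is_abandoned("suitcase", suitcase_track)  # True: stationary for 70 s
person_flag = is_abandoned("person", suitcase_track)      # False: class is not watched
```

A rule like this is only as good as the tracks it receives, which is why the detection and tracking components matter so much.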

What Stack Developers Choose

A modular architecture lets teams assemble systems from components:

Object detection: YOLOv10, Faster R-CNN, or a ViT-based detector, with a target of 25-30 FPS even on 4K streams
Tracking: Deep SORT (adds appearance embeddings on top of the detector) or ByteTrack (associates boxes without appearance features)
Behavior classification: separate models for attributes (gender, age, clothing type) and actions (walking, standing, running, falling)
Storage: video in H.265 (roughly half the bitrate of H.264 at comparable quality), metadata in SQL databases or column-oriented stores such as ClickHouse, which handle time-series data well
Orchestration: Docker and Kubernetes at the network edge, Redis for caching hot data (current tracks), and Kafka or NATS for event streams between modules
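
To illustrate how tracking can work without appearance features, here is a simplified sketch of box association by overlap (IoU). It borrows the two-stage idea associated with ByteTrack, matching confident detections first and then giving leftover tracks a chance with low-score detections, which are often partially occluded objects. It is not the ByteTrack implementation: it uses greedy matching and omits motion prediction. All function names and thresholds are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Pair track ids with detection indices, best overlap first.

    tracks: dict of track_id -> box; detections: dict of det_index -> box.
    """
    pairs = sorted(
        ((iou(tb, db), tid, did)
         for tid, tb in tracks.items()
         for did, db in detections.items()),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, did in pairs:
        if score < min_iou:
            break  # remaining pairs overlap even less
        if tid in matches or did in used:
            continue
        matches[tid] = did
        used.add(did)
    return matches


def associate(tracks, detections, high_score=0.5, min_iou=0.3):
    """Two-stage association; detections is a list of (box, score)."""
    high = {i: box for i, (box, s) in enumerate(detections) if s >= high_score}
    low = {i: box for i, (box, s) in enumerate(detections) if s < high_score}
    matches = greedy_match(tracks, high, min_iou)
    leftover = {tid: box for tid, box in tracks.items() if tid not in matches}
    matches.update(greedy_match(leftover, low, min_iou))
    return matches


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [
    ((1, 1, 11, 11), 0.9),        # confident detection near track 1
    ((51, 50, 61, 60), 0.2),      # weak detection near track 2, e.g. partly occluded
    ((200, 200, 210, 210), 0.8),  # new object, overlaps no existing track
]
result = associate(tracks, detections)  # {1: 0, 2: 1}
```

Detection 2 stays unmatched and would start a new track. Track 2 survives only because the second stage considers the low-score box instead of discarding it.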

Adaptation Through Modularity

Every city and every intersection is unique. A modular architecture allows retraining detection models for local conditions in hours, changing feature weights via config, adding new detectors without rebuilding the pipeline, and disabling modules that are not needed when resources are scarce. Some systems even use federated learning: models are trained across all of a city's cameras, while the raw data stays local. This matters for GDPR compliance and privacy.
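
A config-driven pipeline of this kind might look like the sketch below. The config schema, the module names, the abstract `cost` units, and the priority rule are all hypothetical, and the stages are stubs standing in for real models.

```python
# Hypothetical config: schema, module names and costs are illustrative only.
CONFIG = {
    "gpu_budget": 10,
    "modules": [
        {"name": "detector", "cost": 5, "priority": 0, "enabled": True},
        {"name": "tracker", "cost": 1, "priority": 1, "enabled": True},
        {"name": "action_classifier", "cost": 4, "priority": 2, "enabled": True},
        {"name": "attribute_classifier", "cost": 3, "priority": 3, "enabled": True},
        {"name": "plate_reader", "cost": 3, "priority": 4, "enabled": False},
    ],
}


def select_modules(config):
    """Pick enabled modules in priority order while they fit in the budget."""
    remaining = config["gpu_budget"]
    selected = []
    for module in sorted(config["modules"], key=lambda m: m["priority"]):
        if not module["enabled"]:
            continue
        if module["cost"] <= remaining:
            selected.append(module["name"])
            remaining -= module["cost"]
    return selected


def make_stub(name):
    """Stand-in for a real stage; it only records that it ran."""
    def stage(state):
        state["ran"].append(name)
        return state
    return stage


# Adding a new stage means adding one registry entry and one config entry.
REGISTRY = {m["name"]: make_stub(m["name"]) for m in CONFIG["modules"]}


def run(config, frame_id):
    """Run the selected stages in order on a single frame's state."""
    state = {"frame_id": frame_id, "ran": []}
    for name in select_modules(config):
        state = REGISTRY[name](state)
    return state


full = select_modules(CONFIG)                            # detector, tracker, action_classifier
constrained = select_modules({**CONFIG, "gpu_budget": 6})  # detector, tracker
```

With the budget reduced from 10 to 6, the lower-priority classifiers drop out while detection and tracking keep running, and no pipeline code changes.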

What This Means

Video analytics in cities is no longer a black box. The combination of AI, modular architecture, and distributed computing lets cities build scalable smart surveillance systems that adapt to local conditions and do not require an army of developers at every corner.
