
Netflix Open-Sources Void, a Model That Removes Objects from Video While Accounting for Scene Physics


AI-processed from MarkTechPost; edited by Hamidun News

Netflix has open-sourced Void, a video editing model that removes not only an object from the frame but also the consequences of its presence. If you remove a person holding a guitar, a standard editor often leaves the instrument "floating." Void attempts to reconstruct the scene as if the object had never been there: the guitar falls, the pillow is no longer pressed down, the collision never happens. For post-production, this marks an important step from simple pixel inpainting to causality-aware editing. The model was developed by researchers at Netflix and INSAIT at Sofia University, and a preprint of the work appeared on arXiv on April 2, 2026.

Handling such physical consequences is the main challenge in video inpainting. Most current systems can fill a hole in a frame and fix surface artifacts like shadows or reflections, but they fail where the removed object physically interacts with the scene. The Void paper and demo show typical cases: a person holding an object, a weight pressing on a pillow, one object colliding with another.

After standard removal, absurd traces of the original scene logic remain. Void targets exactly these scenarios and, according to the authors, preserves consistent scene dynamics better than ProPainter, DiffuEraser, Runway, MiniMax-Remover, ROSE, and Gen-Omnimatte. In other words, the model does not simply retouch the background but attempts to answer a harder question: what should happen next in the frame if the key object suddenly disappears?

Technically, Void is built on CogVideoX-Fun-V1.5-5b-InP from Alibaba PAI and fine-tuned for video inpainting. The base model is a 3D Transformer with 5 billion parameters. The key idea is to replace the binary "delete/keep" mask with a quadmask that takes four values: the object itself, the intersection zone, the area of affected interactions, and the unchanged background. As a result, the model receives not just a cut-out region but a more structured description of what in the scene should change after removal.
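To make the quadmask idea concrete, here is a minimal Python sketch that merges two binary masks into one four-valued mask. It is an illustration only, not code from the Void repository: the `Region` label values, the `build_quadmask` name, and the reading of the "intersection zone" as the overlap between the object and the affected area are all assumptions.

```python
from enum import IntEnum


class Region(IntEnum):
    """Hypothetical label values; the real encoding in the Void repo may differ."""
    OBJECT = 0      # pixels of the object being removed
    OVERLAP = 1     # assumed: pixels where object and affected area intersect
    AFFECTED = 2    # pixels whose content depends on the removed object
    BACKGROUND = 3  # pixels that should stay unchanged


def build_quadmask(object_mask, affected_mask):
    """Combine two binary masks (lists of rows of 0/1) into a four-valued mask."""
    if len(object_mask) != len(affected_mask):
        raise ValueError("masks must have the same height")
    quadmask = []
    for obj_row, aff_row in zip(object_mask, affected_mask):
        if len(obj_row) != len(aff_row):
            raise ValueError("masks must have the same width")
        row = []
        for obj, aff in zip(obj_row, aff_row):
            if obj and aff:
                row.append(Region.OVERLAP)
            elif obj:
                row.append(Region.OBJECT)
            elif aff:
                row.append(Region.AFFECTED)
            else:
                row.append(Region.BACKGROUND)
        quadmask.append(row)
    return quadmask


# Toy 2x4 frame: the object covers the left half, the affected area the middle.
object_mask = [[1, 1, 0, 0], [1, 1, 0, 0]]
affected_mask = [[0, 1, 1, 0], [0, 1, 1, 0]]
quadmask = build_quadmask(object_mask, affected_mask)
print([[region.name for region in row] for row in quadmask])
```

The sketch uses plain lists to stay dependency-free. A real pipeline would build such masks per frame with an array library.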

The model also takes a text description of the desired background state as input. The default inference resolution in the repository is 384 by 672 pixels, with clips of up to 197 frames. Before generation, the system needs to understand not only the boundaries of the object being removed but also which parts of the scene depend on it. The repository provides a separate pipeline for this: SAM2 segments the object, Gemini helps reason about interaction zones, and the resulting mask can be corrected manually in the built-in editor if necessary.
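The inference limits above imply some preprocessing for longer or larger clips. The sketch below shows one way to plan it. It assumes that 384 is the height and 672 the width, that extra frames are simply truncated, and that frames are scaled to cover the target size and then cropped. `InferenceLimits` and `plan_clip` are hypothetical names, and the repository may handle this differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceLimits:
    """Defaults taken from the figures quoted above; the axis order is assumed."""
    height: int = 384
    width: int = 672
    max_frames: int = 197


def plan_clip(num_frames, src_height, src_width, limits=InferenceLimits()):
    """Plan how to fit a source clip into the inference limits.

    Returns the number of frames to keep, the size after uniform scaling,
    and how many pixels must be cropped along each axis.
    """
    if min(num_frames, src_height, src_width) <= 0:
        raise ValueError("clip dimensions must be positive")
    frames = min(num_frames, limits.max_frames)
    # Scale so that the frame fully covers the target, preserving aspect ratio.
    scale = max(limits.height / src_height, limits.width / src_width)
    resized_h = max(limits.height, round(src_height * scale))
    resized_w = max(limits.width, round(src_width * scale))
    return {
        "frames": frames,
        "resized": (resized_h, resized_w),
        "crop": (resized_h - limits.height, resized_w - limits.width),
    }


# A 300-frame 1080p clip needs both truncation and a small horizontal crop.
print(plan_clip(num_frames=300, src_height=1080, src_width=1920))
```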

The authors also added two-pass inference. The first pass performs the main removal and scene reconstruction. The second pass is not cosmetic: it addresses a specific problem of video diffusion, the gradual deformation of objects between frames. It uses optical flow and warped noise derived from the first-pass output to stabilize shapes and trajectories over long segments.
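The general idea of warped noise can be illustrated with a toy example: instead of drawing fresh noise for every frame, the noise from the previous frame is moved along the optical flow so that it follows the content. The sketch below is a simplified, dependency-free illustration with nearest-neighbor sampling, not the authors' implementation. The `warp_noise` function and the flow convention (a per-pixel `(dx, dy)` displacement from the previous frame) are assumptions.

```python
import random


def warp_noise(prev_noise, flow, rng):
    """Backward-warp a 2D noise field along a per-pixel flow.

    flow[y][x] = (dx, dy) is assumed to mean that the content now at (x, y)
    was at (x - dx, y - dy) in the previous frame. Pixels whose source falls
    outside the frame receive fresh Gaussian noise.
    """
    height, width = len(prev_noise), len(prev_noise[0])
    warped = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[y][x]
            src_x, src_y = x - round(dx), y - round(dy)
            if 0 <= src_x < width and 0 <= src_y < height:
                # Copy the sample unchanged so it stays a unit Gaussian draw.
                row.append(prev_noise[src_y][src_x])
            else:
                row.append(rng.gauss(0.0, 1.0))
        warped.append(row)
    return warped


rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(3)]
# Uniform flow: everything moved one pixel to the right.
flow = [[(1, 0) for _ in range(4)] for _ in range(3)]
warped = warp_noise(noise, flow, rng)
print(warped[0][1] == noise[0][0])
```

Nearest-neighbor copying is chosen here because interpolating between noise samples would reduce their variance. Production systems typically use more careful schemes to keep the warped field statistically valid.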

The dataset is also notable: real paired videos of the form "with object / without object but with correct physics" hardly exist, so the team synthesized such data from HUMOTO and Kubric, recomputing the scene physics from scratch after removing a person or object. For HUMOTO, they took motion-capture scenes and re-ran the simulation in Blender, while Kubric covered collision and object-interaction scenarios.

The code and weights are open, the repository is distributed under Apache 2.0, and there is a demo on Hugging Face. For a quick start in Colab, however, the developers warn up front that a GPU with at least 40 GB of VRAM is required. Training ran on eight A100s with 80 GB each.

The practical significance of Void extends beyond impressive demos. For studios and creators, it could cut weeks of manual work on complex shots where the task is not just to remove an object but to rewrite how the scene behaves after it disappears. For researchers, it is another signal that video models are beginning to move from generating plausible frames to modeling causality.

But there is a downside: the more seamlessly such tools edit real video, the higher the bar for verifying the authenticity of footage. Void is thus both a powerful VFX tool and a reminder that the line between editing and rewriting events is getting thinner.
