Naver Revolutionizes AI Video with Seoul's Realistic Model

Key Takeaways

1. Naver has launched the Seoul World Model, which uses 1.2 million Street View images to create realistic videos.

2. The model outperforms six others in visual quality and adapts to cities it was not trained on, such as Busan.

3. Technical challenges include managing transient objects and interpolating videos from still images.

💡Why it matters — This advancement could transform sectors like urban planning and autonomous driving through realistic simulations.

Naver Innovates with the Seoul World Model

The South Korean company Naver recently unveiled the Seoul World Model (SWM), a video model that generates footage of real locations. The model is grounded in the actual geometry of cities and draws on a dataset of 1.2 million images from Naver's own Street View service. As a result, its videos reflect how places really look instead of depicting fictional environments.

The SWM stands out for its ability to differentiate between permanent structures, such as buildings, and transient objects like cars, through the analysis of recordings captured at different times. To address missing camera angles, Naver uses simulated videos and Street View images as visual anchors, ensuring consistency over long distances.

Model Performance and Generalization

In comparative tests, the SWM outperformed six contemporary video models in visual quality and temporal coherence. It also adapted to unseen cities such as Busan and Ann Arbor without additional training. This flexibility comes from anchoring the model in the real geometry of cities, which lets it generalize to other urban environments.

Traditional video models tend to create visually convincing but entirely fictional environments, where anything beyond the initial image is invented. In contrast, Naver's SWM anchors video generation in urban reality, following actual routes through Seoul. Users can even modify the generated videos with text prompts, adding elements like burning cars or fantastical creatures.

Overcoming Challenges of Real Data

Using real images presents unique challenges, notably that Street View images are snapshots frozen in time. The cars and pedestrians captured often have little to do with the dynamic scene the model needs to generate. To address this issue, researchers developed a cross-temporal coupling mechanism. This mechanism combines reference images and target sequences from different recording moments, allowing the model to distinguish between permanent structures and transient objects.
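The pairing idea behind this mechanism can be illustrated with a small data-preparation sketch. It only shows how a reference image and a target from the same place but different recording dates might be matched; the `Capture` fields and the function name are hypothetical, and the article does not describe how the model itself consumes such pairs.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Capture:
    location_id: str  # hypothetical key identifying a spot on the street
    captured_on: str  # ISO date of the recording pass
    image_path: str   # hypothetical path to the stored image


def cross_temporal_pairs(captures):
    """Return (reference, target) pairs that share a location but not a date."""
    by_location = {}
    for capture in captures:
        by_location.setdefault(capture.location_id, []).append(capture)

    pairs = []
    for group in by_location.values():
        # Ordered pairs: each capture can serve as reference or as target.
        for reference, target in permutations(group, 2):
            if reference.captured_on != target.captured_on:
                pairs.append((reference, target))
    return pairs
```

Because the buildings are the same in both images of a pair while parked cars and pedestrians differ, such pairs give a model a signal about which parts of a scene are permanent.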

Street View cameras mounted on vehicles capture images only every 5 to 20 meters, so the data contains no continuous video and no pedestrian or aerial perspectives. To fill this gap, Naver generated 12,700 synthetic videos using the CARLA simulator, covering various perspectives. A pipeline was also developed to interpolate coherent training videos from spatially dispersed images.
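To give a sense of the geometric half of that interpolation problem, here is a minimal sketch that places evenly spaced camera positions along a route whose captures are 5 to 20 meters apart. It assumes 2D coordinates in meters and plain linear interpolation; the function name and the 1-meter default are illustrative, and synthesizing the actual in-between frames is the part Naver's pipeline handles and this sketch does not.

```python
import math


def densify_route(points, step_m=1.0):
    """Linearly interpolate 2D positions every step_m meters along a polyline."""
    dense = [tuple(points[0])]
    carried = 0.0  # distance already travelled since the last emitted sample
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0.0:
            continue  # skip duplicate capture positions
        d = step_m - carried  # distance into this segment of the next sample
        while d <= seg + 1e-9:
            t = d / seg
            dense.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
            d += step_m
        carried = seg - (d - step_m)
    return dense
```

For example, captures at (0, 0), (10, 0), and (10, 20) with `step_m=5` yield seven positions, including the original three.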

Technical Improvements and Dynamic Anchoring

Minor errors can accumulate over long distances, as the model generates video section by section. Previous methods used the first image as a fixed anchor, but this became ineffective over long distances. The SWM introduces a "virtual viewpoint," retrieving a Street View image slightly further along the route for each new section, thus providing an error-free reference that moves with the camera.
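The moving-anchor idea can be sketched as a simple lookup. The version below assumes that capture positions are given as sorted distances along the route in meters and that the anchor for each section is the first capture at least `lookahead_m` beyond the section's end; both the function names and the 10-meter default are hypothetical, since the article does not say how far ahead the SWM looks.

```python
from bisect import bisect_left


def lookahead_anchor(capture_offsets_m, chunk_end_m, lookahead_m=10.0):
    """Index of the first capture at or beyond chunk_end_m + lookahead_m.

    Clamped to the last capture when the route ends sooner.
    """
    i = bisect_left(capture_offsets_m, chunk_end_m + lookahead_m)
    return min(i, len(capture_offsets_m) - 1)


def anchors_for_route(capture_offsets_m, chunk_len_m, lookahead_m=10.0):
    """Pick one look-ahead anchor per generated section of the route."""
    total = capture_offsets_m[-1]
    anchors = []
    start = 0.0
    while start < total:
        end = min(start + chunk_len_m, total)
        anchors.append(lookahead_anchor(capture_offsets_m, end, lookahead_m))
        start = end
    return anchors
```

Unlike a fixed first-frame anchor, the selected index advances with every section, so the reference always shows scenery the camera is about to reach.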

Combining Depth Maps and Original Images

Street View images enter the generation process through two complementary pathways. First, the model projects a spatially close reference image into the target perspective using its depth information, which establishes the spatial layout of the scene. Second, reference images are encoded into latent representations to capture additional details of the environment.
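The first pathway is a form of depth-based reprojection, which can be sketched for a single pixel. The example assumes a simple pinhole camera with intrinsics `(fx, fy, cx, cy)` and a known rotation `R` and translation `t` from the reference camera frame to the target camera frame; the article does not state the camera model or pose conventions the SWM actually uses, so all names here are illustrative.

```python
def reproject_pixel(u, v, depth, intrinsics, R, t):
    """Map a reference pixel with known depth into the target camera image.

    intrinsics is (fx, fy, cx, cy); R is a 3x3 nested list and t a 3-vector
    such that p_target = R @ p_reference + t. Returns None if the point ends
    up behind the target camera.
    """
    fx, fy, cx, cy = intrinsics

    # Back-project the pixel to a 3D point in the reference camera frame.
    p = ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

    # Move the point into the target camera frame.
    q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if q[2] <= 0:
        return None

    # Project onto the target image plane.
    return (fx * q[0] / q[2] + cx, fy * q[1] / q[2] + cy)
```

With an identity pose the pixel maps to itself, and moving the target camera 5 meters toward a point 10 meters away pushes its projection outward from the image center, as expected when approaching an object.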

The SWM is based on the Cosmos-Predict2.5-2B model from Nvidia, a diffusion transformer with two billion parameters. Training was conducted on 24 Nvidia H100 GPUs, using 440,000 images from Street View in Seoul, synthetic data from CARLA, and driving data from Waymo.

Adapting to New Cities

The SWM has been tested not only in Seoul but also in Busan and Ann Arbor, two cities not included in the initial training. According to researchers, the model surpasses six current video models, including Aether, DeepVerse, and HY-World1.5, in terms of visual quality and temporal coherence. Existing models tend to drift over long distances, producing blurry videos, while the SWM maintains stable output over hundreds of meters.

Limitations and Future Perspectives

The lack of continuous videos of entire cities is a limitation, as training relies on sequences interpolated from individual images. Errors in metadata can also lead to inconsistencies in the generated videos. On the privacy side, all Street View data has been processed in accordance with privacy regulations, with faces and license plates anonymized.

Researchers see potential applications in urban planning, autonomous driving, and location-based exploration. World models are an expanding research area, with initiatives such as GWM-1 from Runway and studies from Microsoft Research reporting that large language models can predict environmental conditions with over 99% accuracy.