NVIDIA vid2vid

Key Capabilities and How to Get Started

What Is the NVIDIA vid2vid Library?

NVIDIA vid2vid is an open source library for video-to-video synthesis. It uses generative adversarial networks (GANs) to translate one type of video into another while keeping the output temporally consistent. This opens up new possibilities for realistic video editing and generation.

The library is written in Python and enables use cases like label-to-street-view, edges-to-face, and pose-to-body. Its deep learning approach is based on the paper Video-to-Video Synthesis (Wang et al., 2018), which uses a GAN with a carefully designed generator and discriminator, together with a spatial-temporal adversarial objective, to create high-resolution, photorealistic video results with strong temporal synchronization to the source video. The model is capable of generating 2K resolution videos up to 30 seconds long.

Source for this and the following images: GitHub

This is part of a series of articles about AI open source projects


What Is Video-to-Video Synthesis?

Video-to-video synthesis involves the process of transforming a source video into a target video, where the content or style of the target video differs from that of the source while retaining structural or semantic consistency across frames. It's a complex task that requires understanding and processing spatial and temporal information from videos.
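One way to picture this is as a sequential process in which each output frame depends on the matching source frame and on a few previously generated frames. The sketch below is only an illustration of that idea: `synthesize_sequence`, `toy_generator`, and the `n_past` parameter are hypothetical names, and the toy generator stands in for a learned network.

```python
import numpy as np

def synthesize_sequence(source_frames, generate_frame, n_past=2):
    """Generate output frames one at a time, conditioning on recent outputs."""
    outputs = []
    for t, src in enumerate(source_frames):
        past = outputs[max(0, t - n_past):t]  # up to n_past previous outputs
        outputs.append(generate_frame(src, past))
    return outputs

def toy_generator(src, past):
    # Placeholder for a learned network (assumption): mix the source frame
    # with the mean of the previously generated frames.
    if not past:
        return src.astype(np.float32)
    return 0.5 * src + 0.5 * np.mean(past, axis=0)

# Five constant 4x4 "frames" with values 0, 1, 2, 3, 4
frames = [np.full((4, 4), float(i)) for i in range(5)]
out = synthesize_sequence(frames, toy_generator)
print(len(out), out[0].shape)
```

Because each frame sees earlier outputs, information about appearance can carry forward through the sequence, which is what makes temporal consistency possible in the first place.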

The primary goal of video-to-video synthesis is to achieve high fidelity and temporal coherence in the output videos. This means that not only should each frame be convincingly realistic on its own, but the transition from one frame to the next should also be smooth and plausible, preserving the dynamics of the original video.
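As a rough illustration of what temporal coherence means in practice, the sketch below computes the mean absolute difference between consecutive frames. This is a crude proxy chosen for this example, not a metric used by the vid2vid library; the function name `mean_frame_difference` is hypothetical.

```python
import numpy as np

def mean_frame_difference(video):
    """Mean absolute change between consecutive frames.

    video: array of shape (T, H, W) or (T, H, W, C).
    """
    video = np.asarray(video, dtype=np.float32)
    if video.shape[0] < 2:
        raise ValueError("need at least two frames")
    diffs = np.abs(video[1:] - video[:-1])
    return float(diffs.mean())

# A perfectly static clip versus a clip of independent random frames
static = np.full((8, 16, 16), 0.5, dtype=np.float32)
rng = np.random.default_rng(0)
noisy = rng.random((8, 16, 16))

print(mean_frame_difference(static))
print(mean_frame_difference(noisy))
```

A flickering video scores high on this measure, but a low score alone does not indicate quality: a frozen image scores zero. It is useful mainly for comparing two outputs generated from the same source.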

How Does Few-Shot Video-to-Video Synthesis Work?

Few-shot video-to-video synthesis is an approach that aims to generate realistic videos from a small set of example images or videos. The technique leverages deep learning models trained to understand and replicate the style and content of target videos from minimal examples. This is particularly challenging due to the limited data available for learning the complex mappings between input and output domains.

The process primarily involves training a model on a dataset to learn general video synthesis tasks, then fine-tuning this model with a few-shot learning approach. The fine-tuning enables the model to adapt to new tasks or styles with minimal examples, facilitating the generation of new video content that matches the target domain closely with less training data.

Use Cases of the vid2vid Library

The vid2vid library can be used for numerous applications. Let’s have a look at some of these.

Label-to-Street View

One application of the vid2vid library is the conversion of semantic label maps into photorealistic street views. This involves using labeled frames denoting different urban elements, like buildings, roads, and trees, as input, which are then translated into realistic street view videos. It demonstrates vid2vid’s potential in creating realistic environments for simulation and entertainment purposes.

This has significant implications for urban planning and virtual reality applications, allowing for accurate and varied simulations of urban landscapes from simple labels. The technology can capture complex urban scenes from basic representations and recreate them with lifelike detail and variation.
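To make the input format more concrete, here is a minimal sketch of turning a semantic label map into a one-hot tensor, a common way to feed class labels to a network. The class count of 35 mirrors the --label_nc 35 flag used in the commands later in this article; whether the library encodes labels exactly this way is an assumption, and `one_hot_label_map` is a hypothetical helper.

```python
import numpy as np

def one_hot_label_map(labels, num_classes=35):
    """Convert an (H, W) integer label map to a (num_classes, H, W) one-hot array."""
    labels = np.asarray(labels)
    if labels.min() < 0 or labels.max() >= num_classes:
        raise ValueError("label outside the range [0, num_classes)")
    # Row k of the identity matrix is the one-hot vector for class k
    one_hot = np.eye(num_classes, dtype=np.float32)[labels]
    return one_hot.transpose(2, 0, 1)

# A small random label map standing in for one annotated frame
rng = np.random.default_rng(0)
labels = rng.integers(0, 35, size=(4, 6))
encoded = one_hot_label_map(labels)
print(encoded.shape)
```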


Edge-to-Face

The edge-to-face use case transforms facial images into edge maps and then back into realistic faces. Initially, a facial image is converted into an abstract edge map, highlighting the outlines and major features of the face. This edge map serves as a simplified representation, stripping the image down to its basic contours.

Subsequently, the vid2vid library uses this edge map to generate a video of a realistic face. The output follows the contours of the original, but the identity, style, or facial features of the person in the video can be changed.
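The sketch below shows one generic way to reduce an image to an edge map, by thresholding the gradient magnitude. It is a stand-in for illustration only and is not the method the library uses to prepare face data; `edge_map` and its `threshold` parameter are hypothetical.

```python
import numpy as np

def edge_map(gray, threshold=0.2):
    """Return a binary edge map from a 2D grayscale image."""
    gray = np.asarray(gray, dtype=np.float32)
    gy, gx = np.gradient(gray)          # derivatives along rows, then columns
    magnitude = np.hypot(gx, gy)
    peak = magnitude.max()
    if peak > 0:
        magnitude = magnitude / peak    # normalize to [0, 1]
    return (magnitude > threshold).astype(np.uint8)

# Toy image: dark left half, bright right half
image = np.zeros((8, 8), dtype=np.float32)
image[:, 4:] = 1.0
edges = edge_map(image)
print(edges[0])
```

In the toy image, only the columns next to the dark-to-bright boundary are marked as edges; the flat regions on either side are not.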


Pose-to-Body

The pose-to-body functionality of the vid2vid library begins with the analysis of a person's movements within a video. From this analysis, the library generates a dynamic pose representation that captures the human motion as a series of skeletal diagrams. These diagrams abstract the person's movements, isolating the dynamics from the visual details of the person and the background.

The next step involves using these pose diagrams as a basis to synthesize a new video, potentially altering the appearance of the person or the environment while maintaining the original movements. This feature has significant implications for motion capture, animation, and the creation of synthetic training data for various applications in sports and entertainment.
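As an illustration of what a skeletal diagram can look like as data, the sketch below rasterizes a handful of 2D keypoints into a binary image by drawing line segments between connected joints. The five-joint skeleton, the `BONES` list, and `draw_skeleton` are all hypothetical; real pose estimators such as OpenPose define their own keypoint layouts.

```python
import numpy as np

# Hypothetical skeleton: 0 = head, 1 = torso, 2 = left hand, 3 = right hand, 4 = hips
BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]

def draw_skeleton(keypoints, height, width, bones=BONES):
    """Draw line segments between connected (x, y) keypoints on a blank canvas."""
    canvas = np.zeros((height, width), dtype=np.uint8)
    for a, b in bones:
        (x0, y0), (x1, y1) = keypoints[a], keypoints[b]
        steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
        xs = np.linspace(x0, x1, steps).round().astype(int)
        ys = np.linspace(y0, y1, steps).round().astype(int)
        inside = (xs >= 0) & (xs < width) & (ys >= 0) & (ys < height)
        canvas[ys[inside], xs[inside]] = 1
    return canvas

keypoints = np.array([[8, 2], [8, 6], [4, 10], [12, 10], [8, 12]])
pose = draw_skeleton(keypoints, 16, 16)
print(pose.sum())
```

A sequence of such images, one per frame, carries the motion but none of the person's appearance, which is what allows a synthesis model to swap in a different look.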

Frame Prediction

Frame prediction involves predicting future frames in a video sequence based on preceding frames. The vid2vid library does this by analyzing past footage and generating future frames that follow logically in terms of movement and appearance. This is valuable in video editing and post-production, allowing for the creation of smoother, more cohesive video sequences.

This technology can understand and replicate the complex dynamics of video content. It ensures seamless transitions and can be used to extend or enhance video clips with low effort.
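One common way to build a next frame from motion is to warp the previous frame along an estimated flow field and then blend in newly synthesized pixels where the warp is unreliable. The sketch below illustrates that composition under simplifying assumptions (integer-valued flow, a mask supplied by the caller); the function names are hypothetical and this is not the library's code.

```python
import numpy as np

def warp_with_flow(prev_frame, flow):
    """Backward warp: out[y, x] = prev_frame[y + flow_y, x + flow_x], clamped at borders.

    flow: integer array of shape (H, W, 2) holding (flow_x, flow_y) per pixel.
    """
    h, w = prev_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(ys + flow[..., 1], 0, h - 1)
    src_x = np.clip(xs + flow[..., 0], 0, w - 1)
    return prev_frame[src_y, src_x]

def compose_next_frame(prev_frame, flow, new_content, mask):
    """Blend warped pixels (mask = 0) with newly synthesized pixels (mask = 1)."""
    warped = warp_with_flow(prev_frame, flow)
    return (1.0 - mask) * warped + mask * new_content

prev = np.arange(16, dtype=np.float32).reshape(4, 4)
flow = np.zeros((4, 4, 2), dtype=int)
flow[..., 0] = -1                      # every pixel reads from its left neighbor
new_content = np.full((4, 4), -1.0, dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)

next_frame = compose_next_frame(prev, flow, new_content, mask)
print(next_frame[0])
```

With the mask set to zero everywhere, the result is simply the previous frame shifted one pixel to the right; setting the mask to one would replace the warped pixels with the new content instead.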

Getting Started with the NVIDIA vid2vid Library

Before you can start using the vid2vid library, you’ll need to make sure you have the following:

  • A macOS or Linux environment
  • Python 3
  • PyTorch 0.4
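A quick way to check these prerequisites is a short script like the one below. It is a convenience sketch, not part of the vid2vid repository; `check_prerequisites` is a hypothetical helper, and it only tests whether PyTorch is importable, not which version is installed.

```python
import importlib.util
import platform
import sys

def check_prerequisites():
    """Report whether the basic requirements listed above appear to be met."""
    return {
        # platform.system() returns "Darwin" on macOS
        "os_ok": platform.system() in ("Linux", "Darwin"),
        "python3": sys.version_info.major == 3,
        "torch_installed": importlib.util.find_spec("torch") is not None,
    }

print(check_prerequisites())
```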

Install the Relevant Libraries

Start by installing the dominate and requests Python libraries:

pip install dominate requests

For a training project that uses face datasets, you should install dlib:

pip install dlib

For a project trained on pose datasets, you should install libraries like OpenPose and DensePose.

Next, clone the following repository:

git clone https://github.com/NVIDIA/vid2vid
cd vid2vid

If you have trouble building this repository, you can find a Docker image in the docker folder.

Test the Model

Start by downloading the example dataset:

python scripts/download_datasets.py

Compile a snapshot of the FlowNet2 library by running the following script:

python scripts/download_flownet2.py


If your project requires a face dataset, download the following pre-trained model:

python scripts/face/download_models.py

You can test this model using the following command:

python test.py --name edge2face_512 --dataroot datasets/face/ --dataset_mode face --input_nc 15 --loadSize 512 --use_single_G

The test’s results should be saved in ./results/edge2face_512/test_latest/.
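To confirm that the test produced output, you could list the image files under the results directory. The sketch below assumes the generated frames are saved as .jpg or .png files, which is a guess; `list_result_frames` is a hypothetical helper.

```python
from pathlib import Path

def list_result_frames(results_dir, extensions=(".jpg", ".png")):
    """Return sorted paths of image files found anywhere under results_dir."""
    root = Path(results_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.rglob("*") if p.suffix.lower() in extensions)

frames = list_result_frames("./results/edge2face_512/test_latest/")
print(f"found {len(frames)} frame(s)")
```

An empty list means either that the directory does not exist yet or that the frames use a different file extension.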


For a Cityscapes project, download the relevant pre-trained model:

python scripts/street/download_models.py

Use the following command to test this model:

python test.py --name label2city_2048 --label_nc 35 --loadSize 2048 --n_scales_spatial 3 --use_instance --fg --use_single_G

The test’s results should be saved to ./results/label2city_2048/test_latest/.

Alternatively, you could use a smaller model, which has been trained on a single GPU and offers slightly reduced performance at a resolution of 1024 x 512. Download this model by running:

python scripts/street/download_models_g1.py

You can test the lower-resolution model with:

python test.py --name label2city_1024_g1 --label_nc 35 --loadSize 1024 --n_scales_spatial 3 --use_instance --fg --n_downsample_G 2 --use_single_G

Additional example scripts are available from the scripts/street/ directory.
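If you run several test configurations, it can help to assemble the command line programmatically. The sketch below builds the argument list for test.py from keyword arguments, using only flags that appear in the commands above; `build_test_command` is a hypothetical helper and not part of the repository.

```python
import shlex

def build_test_command(name, **options):
    """Build an argument list for test.py; True values become bare switches."""
    args = ["python", "test.py", "--name", name]
    for key, value in options.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)                      # boolean switch, e.g. --fg
        elif value is not False and value is not None:
            args.extend([flag, str(value)])        # flag with a value
    return args

cmd = build_test_command(
    "label2city_1024_g1",
    label_nc=35, loadSize=1024, n_scales_spatial=3,
    use_instance=True, fg=True, n_downsample_G=2, use_single_G=True,
)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the root of the cloned repository.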

Train Your Model with a Cityscapes Dataset

In this example, we’ll use the Cityscapes dataset to train the model. You’ll need to register and download it from the official website.

This example involves applying a pre-trained segmentation algorithm to retrieve instance maps (train_inst) and semantic maps (train_A). Next, add the retrieved images to the datasets folder.
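Before starting a long training run, it may be worth checking that both sets of maps are in place. The sketch below looks for train_A and train_inst subfolders under a dataset root; the root path datasets/Cityscapes is an assumption for illustration, and `missing_subfolders` is a hypothetical helper.

```python
from pathlib import Path

REQUIRED_SUBFOLDERS = ("train_A", "train_inst")

def missing_subfolders(dataset_root, required=REQUIRED_SUBFOLDERS):
    """Return the names of required subfolders that do not exist under dataset_root."""
    root = Path(dataset_root)
    return [name for name in required if not (root / name).is_dir()]

# Assumed location; adjust to wherever you placed the retrieved images
missing = missing_subfolders("datasets/Cityscapes")
if missing:
    print("missing:", ", ".join(missing))
else:
    print("dataset layout looks complete")
```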

Now, you can download the FlowNet2 checkpoint file:

python scripts/download_models_flownet2.py

To train with eight GPUs, you should use a coarse-to-fine approach. This increases the resolution in stages from 512 x 256 to 1024 x 512 and then to 2048 x 1024. You need to train at the lower resolution before you can train at a higher one. To train your model at the lowest resolution, use this command:

python train.py --name label2city_512 --label_nc 35 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 6 --n_frames_total 6 --use_instance --fg

For the next resolution, use:

python train.py --name label2city_1024 --label_nc 35 --loadSize 1024 --n_scales_spatial 2 --num_D 3 --gpu_ids 0,1,2,3,4,5,6,7 --n_gpus_gen 4 --use_instance --fg --niter_step 2 --niter_fix_global 10 --load_pretrain checkpoints/label2city_512
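The resolution stages in the coarse-to-fine approach double in width and height at each step. The short sketch below prints that schedule; `coarse_to_fine_schedule` is a hypothetical helper used only to make the progression explicit.

```python
def coarse_to_fine_schedule(base_width=512, base_height=256, levels=3):
    """Return (width, height) pairs, doubling both dimensions at each level."""
    return [(base_width * 2**i, base_height * 2**i) for i in range(levels)]

for width, height in coarse_to_fine_schedule():
    print(f"{width} x {height}")
```

Calling coarse_to_fine_schedule(256, 128) gives the corresponding single-GPU progression, which starts at 256 x 128 and ends at 1024 x 512.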

If TensorFlow is installed on your system, you can view TensorBoard logs in the ./checkpoints/label2city_1024/logs directory by including --tf_log in the training commands.

To train your model with one GPU, use a coarse-to-fine approach until you reach the maximum resolution (1024 x 512). Single-GPU training may result in lower performance.

To train the model on the lowest resolution video (256 x 128) using a single GPU:

python train.py --name label2city_256_g1 --label_nc 35 --loadSize 256 --use_instance --fg --n_downsample_G 2 --num_D 1 --max_frames_per_gpu 6 --n_frames_total 6

To train at the highest possible resolution (2048 x 1024), you will need at least eight GPUs with 24 GB of memory each:

bash ./scripts/street/train_2048.sh

If you only have access to GPUs with 12G or 16G memory, you should use the following script. It crops the images for training and doesn’t guarantee the same level of performance:


Managing AI Infrastructure with Run:ai

As an AI developer, you will need to manage large-scale computing architecture to train and deploy AI models. Run:ai automates resource management and orchestration for AI infrastructure. With Run:ai, you can automatically run as many compute intensive experiments as needed.

Here are some of the capabilities you gain when using Run:ai:

  • Advanced visibility—create an efficient pipeline of resource sharing by pooling GPU compute resources.
  • No more bottlenecks—you can set up guaranteed quotas of GPU resources, to avoid bottlenecks and optimize billing.
  • A higher level of control—Run:ai enables you to dynamically change resource allocation, ensuring each job gets the resources it needs at any given time.

Run:ai simplifies machine learning infrastructure pipelines, helping data scientists accelerate their productivity and the quality of their models.
