Next frame prediction (NFP) is a technique for generating videos by training an image translation network to predict the next frame of a video from the previous one. What sparked me to try it out were some of the videos that Mario Klingemann posted1 using the technique. These videos bear an eerie similarity to their training data’s motion, but often lack longer-range consistency. After a while of letting the network’s output feed back into itself, the images start to drift away from the original, fairly realistic first seconds. This often leads to long, strung-out wavefront patterns that whip and writhe back and forth across the screen.
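The core of the feedback loop is simple: seed with a real frame, then repeatedly feed the model’s own prediction back in as input. A minimal sketch (the `model` callable here is just a stand-in for the trained network):

```python
def feedback_loop(model, seed_frame, n_frames):
    """Generate a video by repeatedly feeding the model's own
    prediction back in as the next input frame."""
    frames = [seed_frame]
    for _ in range(n_frames - 1):
        frames.append(model(frames[-1]))  # output becomes the next input
    return frames
```

After the first few steps the model only ever sees its own output, which is exactly why the drift described above sets in.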

While fully-generative models can also be used to generate videos, one huge benefit of image translation networks is that they need a lot less training time than e.g. StyleGAN. Usually a single night of training is enough for short input videos, and the results are, in my opinion, just as interesting. The movement is often more varied than in purely generative GAN interpolations, at the cost of less diverse textures. This stems from the fact that the NFP approach explicitly tries to model motion over time.


The choice of training material, as always, is crucial, and scoping out good videos takes some getting used to. What the network learns are movement characteristics rather than any structure in the image. Hence, the best videos are those with distinctive movements that aren’t too reliant on long-range structural coherence. Nice textures are also important, though, as the network does not separate movement and appearance characteristics (although an approach that does2 would certainly be interesting).

One thing I quickly realized is that the texture and movement in videos are pretty easy to decouple. Style transfer3 can be used to transfer the textural information from one image to a video. This way I could use any texture that I liked to stylize videos that had movement that I liked.

For the experiments in this post, the input I used was this looping style transfer video:

The style for the video is the cover I made for Galva's track Orochi.

Next Frame Prediction

The network I trained to predict my frames was Pix2PixHD4, the state-of-the-art image translation architecture at the time. While you could write data-loading code that takes in a video and trains a next frame prediction model from it, the easiest way to get up and running is to dump the video’s frames to one folder and then copy them to a second folder with the filenames decremented by one. This way you can just use the default dataloader.
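As a sketch of that folder trick: pair frame t (input) with frame t+1 (target) by copying the dumped frames into two directories under matching names. The `train_A`/`train_B` folder names are an assumption here, they match what Pix2PixHD’s default loader reads when run with `--label_nc 0`:

```python
import os
import shutil

def make_nfp_dataset(frames_dir, out_dir):
    """Pair frame t (input, train_A) with frame t+1 (target, train_B).
    Identical filenames across the two folders form an input/target pair,
    so the default aligned dataloader can be used unchanged."""
    frames = sorted(os.listdir(frames_dir))
    dir_a = os.path.join(out_dir, "train_A")  # inputs: frames 0..N-2
    dir_b = os.path.join(out_dir, "train_B")  # targets: frames 1..N-1
    os.makedirs(dir_a, exist_ok=True)
    os.makedirs(dir_b, exist_ok=True)
    for i in range(len(frames) - 1):
        name = f"{i:06d}.png"  # shared name pairs the two files
        shutil.copy(os.path.join(frames_dir, frames[i]), os.path.join(dir_a, name))
        shutil.copy(os.path.join(frames_dir, frames[i + 1]), os.path.join(dir_b, name))
```

Dumping the frames in the first place is easy with ffmpeg, e.g. `ffmpeg -i input.mp4 frames/%06d.png`.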

The first results were less than stellar…

Input: a frame of the original video
Input: Rhodox training image 2864

The network quickly devolves into a bland texture without any of the smoother stroke patterns of the input. There is some interesting movement, especially in the right video with the little rocket offshoots or the sudden shift from the left at ~20 seconds. However, most NFP models converge to similar lightly undulating, smoky textures regardless of the input, so I wanted to find a way to get just a little bit more out of the network.

Improving Structure

The main thing that disappointed me about the outputs was the lack of even short-range coherence in the textures. The results were big patches of gray smudges and repeating stripes. I wanted more varying detail and cool shapes to look at.

One obvious way to improve this would be to run these outputs back through a style transfer, which would shape the blander textures right back into the style I wanted. However, the iterative optimization style transfer algorithm would probably need a minute or two per frame, which was a little too slow for my taste. Alternatively, I could train a fast style transfer5 model, but in my experience the results of those are a little underwhelming. (In hindsight, it could probably work pretty well in this case, seeing as the input is uniform and already similar in style. I’d say give it a try, as it’s probably faster to train and run than the approach I settled on.)

The solution I tried first probably also seems obvious, and it worked like a charm: I just trained another Pix2Pix network to translate predicted frames back towards the original frames. To keep the second network from stalling all movement by immediately undoing the work of the first, I trained it to predict multiple different outputs for each input frame. This results in a network that outputs a kind of non-linear average: some plausible frame.
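In the generation loop the two networks then alternate: the forward network predicts the next frame and the restoring network nudges it back towards the training video’s structure. How exactly to interleave them is a design choice; a minimal sketch with one restore pass per forward step, both models as stand-in callables:

```python
def two_model_loop(forward, restore, seed_frame, n_frames):
    """Feedback loop that alternates the next-frame ('forward') model
    with the structure-restoring model on every step."""
    frames = [seed_frame]
    for _ in range(n_frames - 1):
        frames.append(restore(forward(frames[-1])))
    return frames
```

Applying the restore network only every few steps is another knob to play with: it trades texture fidelity back for more of the raw forward model’s drift.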

I ran each of the original video frames through the first model 1–10 times. Then I selected one of the 1–10 frames after the starting frame to be the target for the new network. The motivation here is that I’m not necessarily interested in predicting the exact preceding frame, just any of the possible frames that might have preceded it (the same trick can be applied to the training set for the first network; this is especially helpful for videos with slower changes or higher frame rates). Also, even after multiple “forward” steps that corrupt the image further from its original state, the second network should be able to restore some plausible structure. This simulates the situation later in the feedback loop, where the structure is almost completely different from the actual video frames.
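That sampling scheme can be sketched as follows. Here `predict_next` stands in for the first, forward network and `frames` is the list of original video frames; the function names and the uniform sampling are my assumptions, not code from the post:

```python
import random

def make_restore_pairs(frames, predict_next, max_steps=10, rng=random):
    """Build (input, target) training pairs for the restoring network.
    Input: an original frame pushed 1..max_steps times through the
    forward model. Target: any one of the 1..max_steps original frames
    following the starting frame, not the exact predecessor."""
    pairs = []
    for i in range(len(frames) - max_steps):
        x = frames[i]
        for _ in range(rng.randint(1, max_steps)):
            x = predict_next(x)  # corrupt progressively, as in the loop
        pairs.append((x, frames[i + rng.randint(1, max_steps)]))
    return pairs
```

Sampling both the number of forward steps and the target offset keeps the restoring network from learning an exact inverse of the forward model.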

This time the feedback loops were a lot more like I imagined them, capturing the style of the input video much better.

Input: Rhodox training image 4
Input: Rhodox training image 2864

Now it was only a question of trying different inputs and finding the combination of checkpoints that gave the most interesting result.

Cropping to a 16:9 aspect ratio and upscaling the frames individually with Waifu2x, I finally had something I was really happy with.


Closing Thoughts

Next frame prediction is a great way to get a wide variety of results with a much smaller computational investment than fully-generative networks. Even when training the weird, stochastic, time-domain, CycleGAN-ish, next-frame-predicting monstrosity I’ve outlined above, you’ll sink far less time into waiting for training to converge.

While these feedback loop videos are inherently unpredictable and serendipitous, you can use a couple tricks to create looping videos from them. Also, the multi-network approach allows you to mix and match networks in the feedback loop which can create some cool hybrids with different movement or textural characteristics. You can read about both of these techniques in my follow up posts: Neural Lava Lamps (coming soon…) and It’s Raining (coming soon…).


1 Some very cool experiments with fireworks [twitter] [twitter]

2 Two-Stream Convolutional Networks for Dynamic Texture Synthesis [project page]

3 I’ve got a video style transfer implementation that automatically does multi-resolution, multi-pass, flow weighted transfer which can be found here: [github]

4 You can find a great implementation in NVIDIA’s official repository [github]

5 The neural style OG, Justin Johnson’s fast-neural-style repository is a good candidate [github]