I've started doing some preliminary work with an OpenPose to Pix2PixHD pipeline. Pix2Pix is a conditional adversarial network that learns a mapping from an input image to an output image and can synthesize new images from that mapping. I became interested in working with Pix2Pix and other GANs after I was asked to lead a series of workshops on Design and Artificial Intelligence at Duke University. Because Pix2Pix and other GANs produce such expressive output, I naturally gravitated toward this area.

These experiments combine OpenPose and Pix2Pix to let a user transfer their movement and gestures onto a different character. Methods already exist for transferring movement data to 3D character models using rigged models and OSC; the novelty of Pix2Pix is the ability to transfer movement onto realistic images of real people.

Pix2Pix is a conditional deep convolutional generative adversarial network. Two networks are trained: the generator (G) and the discriminator (D). For more information on deep convolutional generative adversarial networks, see my post 'Working with Generative Adversarial Networks'. Conditional adversarial networks differ from traditional GANs by conditioning the noise (z) that the generator samples from on an input image. This lets the network learn from the conditioning input as well as the training sample. The network is fed pairs of training images and conditioning input images (y), with each training image corresponding to a conditioning input image.
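
To make the conditioning concrete, the objective from the original Pix2Pix paper (Isola et al.) can be written as below. The notation is adapted to this post: y is the conditioning input image, z the noise, and x the paired training image. Pix2PixHD extends this with multi-scale discriminators and feature-matching losses, so treat this as the basic form rather than the exact loss used here.

$$
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\big[\log D(y, x)\big] + \mathbb{E}_{y,z}\big[\log\big(1 - D(y, G(y, z))\big)\big]
$$

$$
G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{cGAN}(G, D) + \lambda\, \mathbb{E}_{x,y,z}\big[\lVert x - G(y, z)\rVert_{1}\big]
$$

The discriminator sees the conditioning image alongside either a real or generated image, which is what forces the generator to respect the pose input rather than just producing any plausible image.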

In this work, I constructed the pipeline by pairing the output of OpenPose with the original frames of a video and training the network to learn a mapping from any OpenPose output to an image. Because the network generalizes over OpenPose mappings, we can take a new video and run inference on each frame to generate output for the new movements, as demonstrated below.
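
As a rough sketch of how the training pairs can be assembled (the folder names, file naming, and `--label_nc 0` layout here are assumptions based on the NVIDIA pix2pixHD reference implementation, not the exact scripts used):

```python
"""Pair each OpenPose skeleton rendering with the original video frame it came from."""
import shutil
from pathlib import Path

FRAMES_DIR = Path("frames")                  # original frames extracted from the source video
POSES_DIR = Path("openpose_out")             # OpenPose renderings, assumed to keep frame names
DATASET_DIR = Path("datasets/pose2person")   # hypothetical dataset root for pix2pixHD

def build_paired_dataset(frames_dir: Path, poses_dir: Path, out_dir: Path) -> int:
    """Copy matching pose/frame pairs into train_A (pose) and train_B (photo)."""
    train_a = out_dir / "train_A"   # conditioning input: OpenPose skeleton images
    train_b = out_dir / "train_B"   # target output: the real video frames
    train_a.mkdir(parents=True, exist_ok=True)
    train_b.mkdir(parents=True, exist_ok=True)

    n_pairs = 0
    for frame in sorted(frames_dir.glob("*.png")):
        pose = poses_dir / frame.name   # assumes OpenPose output reuses the frame's filename
        if not pose.exists():
            continue                    # skip frames where no pose rendering exists
        shutil.copy(pose, train_a / frame.name)
        shutil.copy(frame, train_b / frame.name)
        n_pairs += 1
    return n_pairs

if __name__ == "__main__":
    print(f"paired {build_paired_dataset(FRAMES_DIR, POSES_DIR, DATASET_DIR)} images")
```

Inference follows the same idea in reverse: run OpenPose over the new video's frames and feed those renderings to the trained generator frame by frame.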

I trained this model using Pix2PixHD for 54 epochs. Training took about 2.5 days on ~1300 image pairs. As this was my first attempt, the model overfit: if you watch the video closely, the character fades in and out of sharpness, since the model fits better on poses it has seen than on new poses. Still, the output of this model, while a little rough, is fairly accurate relative to the input images. Output of the model alone can be found below.

On my second attempt, I trained the model with more images and reduced the network size to combat overfitting. I also chose a cleaner source video (white background, no watermarks) to further clean up the model's output. For this attempt, I trained the model for ~40 epochs.
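
For reference, a run along these lines might be launched as sketched below. The flag names follow the NVIDIA pix2pixHD reference implementation's train.py, but the specific values (a halved generator width via `--ngf 32`, 20 + 20 epochs) are illustrative assumptions rather than a record of the run described above, and exact options may differ between versions or forks.

```python
"""Launch a pix2pixHD training run with a reduced-size generator (illustrative only)."""
import subprocess

cmd = [
    "python", "train.py",
    "--name", "pose2person_v2",            # hypothetical experiment name
    "--dataroot", "./datasets/pose2person",
    "--label_nc", "0",                     # RGB pose renderings instead of class label maps
    "--no_instance",                       # pose inputs have no instance maps
    "--ngf", "32",                         # halve generator filters to reduce capacity
    "--niter", "20",                       # epochs at the initial learning rate
    "--niter_decay", "20",                 # epochs while the learning rate decays (~40 total)
]
subprocess.run(cmd, check=True)
```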