My recent work with OpenPose and conditional adversarial networks led me to start thinking about the generation of fake news media. I decided to build a model from a video of someone who could pass as a news anchor in front of a green screen. The Google search took a few minutes, but I was able to find a reasonably good stock video of a woman talking in front of a green screen.
I applied OpenPose to generate the data, as I had in my previous work. This time, however, instead of using the TensorFlow port of OpenPose, I used the original CMU version, which allows easy access to head and hand data and gives more accurate skeleton tracking.
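For readers who want to work with the keypoints directly, here is a minimal sketch of reading one frame of CMU OpenPose output. It assumes OpenPose was run with the `--face`, `--hand`, and `--write_json` options, which write one JSON file per frame. The post does not say which options were actually used, and the helper names `parse_keypoints` and `load_frame` are hypothetical.

```python
import json


def parse_keypoints(flat):
    """Group a flat [x0, y0, c0, x1, y1, c1, ...] list into (x, y, confidence) triples."""
    usable = len(flat) - len(flat) % 3  # ignore a trailing partial triple, if any
    return [tuple(flat[i:i + 3]) for i in range(0, usable, 3)]


def load_frame(json_text):
    """Return body, face, and hand keypoints for each person in one frame's JSON."""
    frame = json.loads(json_text)
    people = []
    for person in frame.get("people", []):
        people.append({
            "body": parse_keypoints(person.get("pose_keypoints_2d", [])),
            "face": parse_keypoints(person.get("face_keypoints_2d", [])),
            "left_hand": parse_keypoints(person.get("hand_left_keypoints_2d", [])),
            "right_hand": parse_keypoints(person.get("hand_right_keypoints_2d", [])),
        })
    return people


# Hand-written sample with two body keypoints (not real OpenPose output).
sample = '{"people": [{"pose_keypoints_2d": [10.0, 20.0, 0.9, 30.0, 40.0, 0.8]}]}'
people = load_frame(sample)
```

Each keypoint carries a confidence score, so low-confidence joints can be filtered out before the skeleton is rendered.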
From that data I was able to develop a pretty solid model with Pix2PixHD. I generated about 2,000 frames of OpenPose data from the initial video, then trained a model on pairs of original frames and their OpenPose frames. From there I took a video of myself talking, ran OpenPose on it, and fed the resulting pose frames through the model of the speaker. Below is the output after only 7 epochs.
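On the training side, the frame pairing described above can be sketched roughly as follows. This is an illustration and not the exact script used here: it assumes the pose frames and the original frames share a frame number in their file names, and the `train_A` and `train_B` directory names and the `pair_frames` helper are assumptions made for the example.

```python
from pathlib import PurePath


def pair_frames(pose_paths, frame_paths):
    """Match pose frames to original frames by file stem (e.g. '0001')."""
    pose = {PurePath(p).stem: p for p in pose_paths}
    real = {PurePath(p).stem: p for p in frame_paths}
    # Keep only frame numbers present on both sides, in sorted order.
    shared = sorted(pose.keys() & real.keys())
    return [(pose[stem], real[stem]) for stem in shared]


# Hypothetical file names: pose frame 0003 has no matching original frame.
pairs = pair_frames(
    ["train_A/0001.png", "train_A/0002.png", "train_A/0003.png"],
    ["train_B/0002.jpg", "train_B/0001.jpg"],
)
```

Dropping unmatched frames up front avoids training on a skeleton that has no target image.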
I couldn't create "fake news" without putting the output into a fake newsroom. I knew nothing about green screens, Adobe Premiere Pro, or video editing, but after a couple of quick tutorials I was up and running. The video below was produced after training the Pix2PixHD model for 30 or so epochs on the 2,000 input frame pairs. It ran overnight and took about 12 hours. It's worrisome how easy it is to produce videos like these. For about $12 in GPU time on an AWS EC2 instance, a lone developer like myself can create a (not quite) passable model, decide what movements it will make, and potentially sub in some nefarious speech of their own.