One of my favorite deep learning papers is Learning to Generate Chairs, Tables, and Cars with Convolutional Networks. It’s a very simple concept: you give the network the parameters of the thing you want to draw, and it draws it. But it yields an incredibly interesting result. The network seems to learn concepts about 3D space and the structure of the objects it’s drawing, and because it generates images rather than just numbers, it also gives us a better sense of how the network “thinks.”

I happened to stumble upon the Radboud Faces Database some time ago, and wondered if something like this could be used to generate and interpolate between faces as well.

The results are actually pretty exciting!

Network Architecture

To implement this, I adapted a version of the “1s-S-deep” model from the chairs paper. In it, the authors feed the network one-hot encodings of the chair’s “style” along with parameters for the orientation and camera position. These pass through several fully-connected layers to build a representation of what to draw, which is then handed to a deconvolution network that renders the image and predicts its segmentation mask. To generate faces, we do a similar thing, except we drop the segmentation network entirely (since we don’t have ground-truth masks) and instead input the person’s identity, emotion, and orientation.

A diagram of the model used to generate chairs (from Dosovitskiy et al.)

The deconvolution network here is similar to the ones you see in other models for semantic segmentation and generative networks (such as here, here, or here). Essentially it is the inversion of the operations typically used in classification networks. Normally you would have a few layers of convolution, followed by a pooling layer to reduce the dimensionality of the input volume. For deconvolution networks, we do these operations backwards: first we upsample the inputs (a.k.a. unpooling), and then we apply our convolutions.

An illustration of 'deconvolution' as upsampling (unpooling) followed by a convolution operation (from Dosovitskiy et al.)

Essentially, when we unpool we dot our canvas with dabs of paint in a grid, and then use our convolution kernels as paintbrushes to spread and mix them around.
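
In Keras terms, one of these “deconvolution” steps is nothing more than an upsampling layer followed by an ordinary convolution. Here’s a minimal sketch using current Keras layer names; the kernel size and Leaky ReLU slope are illustrative, not the exact values from my code:

```python
from keras.layers import UpSampling2D, Conv2D, LeakyReLU

def upconv_block(x, filters):
    """One 'deconvolution' step: unpool by upsampling the feature map 2x,
    then convolve to spread and mix the upsampled activations."""
    x = UpSampling2D(size=(2, 2))(x)                # unpooling via simple upsampling
    x = Conv2D(filters, (5, 5), padding='same')(x)  # the "paintbrush" convolution
    return LeakyReLU(0.2)(x)
```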

While I was able to apply this architecture mostly as described in the paper, I had to play with the number of kernels per layer to get higher-resolution images without exhausting GPU memory, and I added batch normalization to make sure the Leaky ReLU activations would behave. The model is implemented in Keras, a high-level deep learning framework built on top of Theano and TensorFlow.
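
To make the overall wiring concrete, here’s a minimal sketch of the face model using the Keras functional API. The input sizes, layer widths, kernel counts, optimizer, and loss are placeholders rather than the exact values from my implementation:

```python
from keras.layers import (Input, Dense, Reshape, Concatenate,
                          BatchNormalization, LeakyReLU, UpSampling2D, Conv2D)
from keras.models import Model

# Illustrative sizes -- not the actual dataset dimensions.
NUM_IDENTITIES, NUM_EMOTIONS, ORIENTATION_DIMS = 57, 8, 2

identity = Input(shape=(NUM_IDENTITIES,))       # one-hot (or blended) identity
emotion = Input(shape=(NUM_EMOTIONS,))          # one-hot (or blended) emotion
orientation = Input(shape=(ORIENTATION_DIMS,))  # encoded head orientation

# Fully-connected layers build up a representation of "what to draw".
x = Concatenate()([identity, emotion, orientation])
for width in (512, 1024):                       # widths are placeholders
    x = Dense(width)(x)
    x = BatchNormalization()(x)                 # keeps the Leaky ReLUs well-behaved
    x = LeakyReLU(0.2)(x)

# Reshape into a small spatial volume, then repeatedly upsample and convolve
# until we reach the output resolution.
x = Dense(8 * 8 * 128)(x)
x = LeakyReLU(0.2)(x)
x = Reshape((8, 8, 128))(x)
for filters in (128, 64, 32):
    x = UpSampling2D((2, 2))(x)
    x = Conv2D(filters, (5, 5), padding='same')(x)
    x = BatchNormalization()(x)
    x = LeakyReLU(0.2)(x)
image = Conv2D(3, (5, 5), padding='same', activation='sigmoid')(x)  # 64x64 RGB output

model = Model([identity, emotion, orientation], image)
model.compile(optimizer='adam', loss='mse')     # optimizer and pixel-wise L2 loss are reasonable defaults
```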

Interpolating Faces

As you already saw, the network is able to interpolate between identities and emotions pretty smoothly.

It interpolates between modes fairly realistically, rather than resorting to anything contrived like cross-fading between examples. What surprised me is that it seems to have learned something about facial features: mouths open and close, cheekbones shift, eyebrows move up and down, and so on. You could potentially use this to animate complex expressions and transitions.

We can also break this down and look at how it interpolates between either just identities or just emotions.
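
Under the hood, these interpolations are just linear blends of the input vectors fed through the trained network. A sketch, assuming the three-input `model` from above (the frame count is arbitrary):

```python
import numpy as np

def interpolate_identities(model, id_a, id_b, emotion, orientation, n_frames=30):
    """Morph from identity `id_a` to `id_b` while holding the emotion and
    orientation inputs fixed. Each input is a 1-D numpy vector."""
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        identity = (1.0 - t) * id_a + t * id_b   # blended identity vector
        frames.append(model.predict([identity[None, :],
                                     emotion[None, :],
                                     orientation[None, :]], verbose=0)[0])
    return np.stack(frames)
```

Blending the emotion vectors instead (with the identity held fixed) gives the emotion-only interpolations, and vice versa.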

So far I’ve left out interpolating between orientations; unfortunately, the network wasn’t able to learn this nearly as well. This is likely because the orientations in the dataset aren’t granular enough for the network to develop a sense of 3D space (it only includes views at 45˚ intervals), so the network took the easier path of learning to draw faces at each discrete orientation instead.

What the network does here is still visually interesting, though, especially when it tries to draw something it has no knowledge of whatsoever, like the back of a person’s head. Perhaps there are other ways to exploit the network to create new, ultimately incorrect, but still visually interesting images…

Using “Illegal” Inputs

So far, all of the parameters we’ve given to the network have been more or less “legal.” The identity and emotion vectors have always been unit length (i.e., a valid mixture of identities/emotions), and the orientations have always represented a valid angle. But what happens when we break those rules and instead feed the network random values?

It’s pretty horrific:

From some combination of not knowing how to interpret invalid orientations and having “too much” identity or emotion to process, the network begins to stretch and contort faces in really unsettling, uncanny ways.

We can also create an animation out of this by shifting our inputs randomly little by little each frame:

Wild.
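
For the curious, here is roughly how those frames can be produced: start from random input vectors that don’t obey the one-hot/mixture rules, then nudge them a little each frame. Again a sketch; the noise scales are arbitrary and `model` is the network from earlier:

```python
import numpy as np

def random_walk_frames(model, input_dims, n_frames=120, step=0.05, seed=None):
    """Render frames from a random walk over "illegal" inputs: the vectors are
    neither one-hot nor valid mixtures, and drift a little more each frame."""
    rng = np.random.default_rng(seed)
    inputs = [rng.normal(size=(1, dim)) for dim in input_dims]  # random starting point
    frames = []
    for _ in range(n_frames):
        inputs = [v + step * rng.normal(size=v.shape) for v in inputs]  # small random shift
        frames.append(model.predict(inputs, verbose=0)[0])
    return np.stack(frames)

# e.g. random_walk_frames(model, (NUM_IDENTITIES, NUM_EMOTIONS, ORIENTATION_DIMS))
```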

Generating Images with Partially Trained Networks

Lastly, we can also explore generating images from partially-trained networks to get even more interesting results. For example, here are some random images from a partially-trained network using AdaGrad:

One thing I found incredibly interesting is how differently each optimizer draws the images, especially early in training. For example, generations from a partially-trained network using plain stochastic gradient descent look more like an abstract painting than a human face.
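
One way to make this kind of comparison is to recompile the same architecture with different optimizers and render a fixed probe batch after every epoch. A hedged sketch, where `build_model` stands in for the architecture defined earlier:

```python
from keras.optimizers import SGD, Adagrad

def snapshots_during_training(build_model, optimizer, train_inputs, train_images,
                              probe_inputs, epochs=20):
    """Train a freshly-built copy of the model with the given optimizer and
    render a fixed probe batch after each epoch to watch the drawings evolve."""
    model = build_model()
    model.compile(optimizer=optimizer, loss='mse')
    snapshots = []
    for _ in range(epochs):
        model.fit(train_inputs, train_images, epochs=1, verbose=0)
        snapshots.append(model.predict(probe_inputs, verbose=0))
    return snapshots

# e.g. compare snapshots_during_training(build_model, SGD(0.01), ...)
#      against snapshots_during_training(build_model, Adagrad(), ...)
```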

I’d like to dig deeper into how these networks learn to generate images and into the effects of different optimizers (and why they perform the way they do), but I’ll leave that for a future post.

Naturally, the code can be found here.

Edit 10/2/2016: Changed the description of the deconvolution operation to accurately reflect what this model is actually doing. Here we just upsample to unpool, rather than learn unpooling “switches.”

Edit 9/30/2016: Added extra animations with partially trained networks.