This is the TensorFlow implementation of the following paper:
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
Justin Johnson, Alexandre Alahi, Li Fei-Fei
The Neural Image Stylizer stylizes the source image using another image. Here is one example:
In what follows, I'll try to explain what makes this model work. Intuitively, we can think of the model as painting an artwork: it recreates the content of the content image, but does so in a way that mimics the artistic style of another image. Mathematically, the model finds a set of parameters that minimizes a loss function combining the differences in content and style.
$$\mathcal{L}_{total} = \lambda_c \,\mathcal{L}_{feat} + \lambda_s \,\mathcal{L}_{style} + \lambda_{tv} \,\mathcal{L}_{tv}$$

where $\lambda_c$, $\lambda_s$, and $\lambda_{tv}$ represent the importance we want to assign to each component. I will leave the discussion of $\mathcal{L}_{tv}$ for later; for now, let's focus on the most important parts.
For clarity, I'll use content images to refer to the input images whose content we want to keep, style images for the images whose style we want the final output to resemble, and output images for the artwork our neural network creates.
Look again at the content image and the output images above. What exactly makes you believe they represent the same thing? If you examine them carefully, you might notice the output image is somewhat tilted, with mosaic structures that are not present in the original. You might suggest that it is the structure of the images that helps you recognize them, like the corridor. Abstractly, the high-level features of both images are similar, even though their low-level pixel values have changed.
The tricky question remains: how do we define high level features? It turns out that there is a very interesting phenomenon:
A deep learning model, if well trained on other tasks, can somehow behave like a real human in that it forms its own understanding of images. Like real neurons, the artificial neurons in the model activate when they see a certain pattern in the input images. Even when we apply the model to images from a different domain, this property still holds. This is called Transfer Learning, but we will skip the details.
Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)
Zeiler, Matthew D., and Rob Fergus. "Visualizing and Understanding Convolutional Networks." Computer Vision - ECCV 2014, Lecture Notes in Computer Science (2014): 818-33.
Specifically, images are just huge 3D arrays with dimensions $height \times width \times 3$, where each value represents the intensity of a pixel in one channel. To obtain the high-level features of an image, we feed that array into a pre-trained model called the VGG network, and then extract the activations of the artificial neurons in its higher layers. We do this for both the content and output images, then compute the element-wise Euclidean distances between the two feature representations. We call the sum of these distances the feature loss.
Denote $\phi_l(y_c)_{ij}$ the activation of the content image at layer $l$, filter $i$, position $j$ of the VGG network, and $\phi_l(\hat{y})_{ij}$ that of the output image. The loss function for the feature representations is:

$$\mathcal{L}_{feat} = \frac{1}{C_l H_l W_l} \sum_{i,j} \left( \phi_l(\hat{y})_{ij} - \phi_l(y_c)_{ij} \right)^2$$
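As a sanity check on the feature loss, here is a minimal NumPy sketch. In the real implementation the feature maps would come from a pre-trained VGG network; the function name `feature_loss` and the `(C, H, W)` layout are illustrative assumptions, not the repository's actual code:

```python
import numpy as np

def feature_loss(content_features, output_features):
    """Mean squared difference between two (C, H, W) feature maps,
    normalized by the size of the map as in the equation above."""
    c, h, w = content_features.shape
    diff = output_features - content_features
    return np.sum(diff ** 2) / (c * h * w)
```

Identical feature maps yield zero loss, and the normalization keeps the value comparable across layers of different sizes.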
While we can easily say what makes two images similar in content, the subtle differences in style are much harder to understand. How do we quantify complex combinations of colors and textures that we might not even be able to interpret? In the original paper on image stylization, the author proposes a very clever approach. He defines style as the high-level relationships between pixels; specifically, style is encoded in the correlations between the values of one feature channel and the others. He captures these with the Gram matrix, whose entries are the inner products between pairs of vectorized feature maps. Similar to the way the content loss is computed, he extracts activations from multiple layers and computes the Gram matrices for both the style and output images.
$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

and

$$\mathcal{L}_{style} = \sum_l w_l \left\| G^l(\hat{y}) - G^l(y_s) \right\|_F^2$$

where $w_l$ are layer-wise weights.
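The Gram matrix and the per-layer style loss can be sketched in a few lines of NumPy; the function names and the normalization by map size here are my own choices for illustration, not necessarily the exact ones used in the implementation:

```python
import numpy as np

def gram_matrix(features):
    """Gram matrix of a (C, H, W) feature map: G[i, j] is the inner
    product between the flattened activations of filters i and j,
    normalized by the size of the map."""
    c, h, w = features.shape
    f = features.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(style_features, output_features, weight=1.0):
    """Weighted squared Frobenius distance between the Gram matrices
    of the style image and the output image at one layer."""
    diff = gram_matrix(style_features) - gram_matrix(output_features)
    return weight * np.sum(diff ** 2)
```

Note that the Gram matrix discards all spatial information: it only records which filters tend to activate together, which is exactly why it captures texture rather than layout.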
Total Variation Loss
If we think of the pixel values in the matrix as a distribution of values, it comes as no surprise that the distribution of a natural image should be very smooth: each pixel has almost the same value as its neighbors. To make our generated images more natural, we want a term that discourages large variations among nearby pixels. The total variation loss, with its roots in signal processing, is just the sum of squared differences between the intensities of neighboring pixels.
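A minimal sketch of the total variation loss in NumPy, using squared differences over both horizontal and vertical neighbors; the `(H, W, C)` image layout is an assumption for illustration:

```python
import numpy as np

def total_variation_loss(image):
    """Sum of squared differences between neighboring pixels of an
    (H, W, C) image; large values mean a noisy, non-smooth image."""
    dh = image[1:, :, :] - image[:-1, :, :]   # vertical neighbors
    dw = image[:, 1:, :] - image[:, :-1, :]   # horizontal neighbors
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```

A perfectly constant image has zero total variation; sharp edges and speckle noise are what this term penalizes.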
The following figure from the original paper summarizes what we have just described. A model is trained to generate images that minimize the sum of all three loss values at the same time. One interesting aspect of this architecture is that while the network is a deep learning model, its objective function is itself computed by another deep learning model. Finally, the trained model behaves like what you just saw.
Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. "Perceptual Losses for Real-Time Style Transfer and Super-Resolution." Computer Vision - ECCV 2016, Lecture Notes in Computer Science (2016): 694-711.
Finding the Right Model
While we have specified the objective function we want to minimize, finding the right model to accomplish this challenging task is where the real power of modern deep learning shows.
Lecun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep Learning.” Nature 521.7553 (2015): 436-44
The model used here is an 11-layer neural network with 3 convolutional layers followed by 5 residual layers and 3 deconvolutional layers. As shown in the image above, the convolutional layers map the image matrix to a hierarchy of feature maps, applying filters that summarize local information as it passes to higher and higher layers. While the equation for convolution is very notation-heavy, for now I will use $\ast$ to denote the convolution operation.
Denote $x$ the input matrix; each convolutional layer computes:

$$h = W \ast x + b$$

where $W$ are the filter weights and $b$ is the bias. We then normalize the hidden state with Instance Normalization:

$$\hat{h} = \frac{h - \mu}{\sqrt{\sigma^2 + \epsilon}}$$

where $\mu$ and $\sigma$ are the mean and standard deviation computed across that single image, and $\epsilon$ is used to prevent division by zero. The justification for instance normalization is better explained in the original paper, but we can think of it as improving the quality of the final image by whitening the feature maps. In the last step, we apply the ReLU function to add non-linearity to the model (otherwise, the model would just be a series of linear transformations, and so would the final output), as is standard practice. The output is passed on to the next layer.
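Instance normalization itself is only a few lines. This NumPy sketch assumes a single image stored as a `(C, H, W)` array, which may differ from the layout the actual implementation uses:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization of one (C, H, W) feature map: each
    channel is whitened with its own spatial mean and variance."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mu) / np.sqrt(var + eps)
```

Unlike batch normalization, the statistics here are computed per image and per channel, so the stylization of one image never depends on the other images in the batch.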
Following the 3 convolutional layers, 5 residual layers are attached. These layers compute the function $y = F(x) + x$, where $F$ is itself a convolutional function and the $+\,x$ term refers to copying the lower-layer output to the next layer. We can think of these residual blocks as providing a highway connection to the next layer, though the detailed justification is somewhat involved. Intuitively speaking, a deep neural network is very hard to train because of a phenomenon called exploding/vanishing gradients. The residual connections introduced here help alleviate that issue by enabling lower-level information to pass directly to higher layers.
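The residual connection can be sketched abstractly. The toy `residual_block` below stands in for the real block (two convolutions with instance normalization), replacing it with a two-matrix transform on a flattened feature vector purely for illustration:

```python
import numpy as np

def residual_block(x, w1, w2):
    """y = F(x) + x on a flattened feature vector, where
    F(x) = w2 @ relu(w1 @ x) stands in for the block's convolutions.
    The identity path lets information and gradients bypass F."""
    fx = w2 @ np.maximum(w1 @ x, 0.0)
    return fx + x
```

Note that if the learned weights are all zero, the block reduces to the identity function, which is exactly why residual layers are easy to train: the network only has to learn a correction on top of "do nothing".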
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
The final three layers of the neural network are deconvolutional layers. These layers convert the condensed high-level feature representations back to the dimensions of the desired output image.
It is important to note that while image pixel values are expected to lie in the range 0-255, that is not the case for the raw output of the convolutional layers. In the last step, the model applies a scaled tanh function to squash the unbounded values into that range.
If we think of the model as a graph of artificial neurons, it involves tens of thousands of connections and many parameters that need to be tuned. To find the set of parameters that minimizes the loss function we designed, we use standard backpropagation. Specifically, at each training step, we modify the model parameters by:
$$\theta \leftarrow \theta - \eta \, \frac{\partial \mathcal{L}}{\partial \theta}$$

where $\eta$ and $\mathcal{L}$ are the learning rate and the loss function, respectively. The partial derivative represents how to change each parameter so as to decrease the loss value, and the learning rate specifies how large an update step we want to take.
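The update rule above can be illustrated with a toy one-parameter example; real training would of course let TensorFlow compute the gradients via backpropagation rather than supply them by hand:

```python
def sgd_step(theta, grad, lr=0.1):
    """One gradient-descent update: theta <- theta - lr * dL/dtheta."""
    return theta - lr * grad

# Toy loss L(theta) = theta**2, whose gradient is 2 * theta.
theta = 5.0
for _ in range(100):
    theta = sgd_step(theta, 2.0 * theta)
# theta has been driven close to the minimizer at theta = 0
```

Each step shrinks the parameter toward the minimum; too large a learning rate would overshoot and diverge, which is why the step size matters as much as the gradient direction.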
The network was trained on roughly 80,000 images for two epochs. Training took approximately 13 hours on a single Tesla K80 GPU on an Amazon Web Services EC2 instance. After training, the model transforms images within a few seconds on a CPU, and in a fraction of a second on a GPU.