Stylizing Images using Deep Learning

Neural Style


This is a TensorFlow implementation of the following paper:
Perceptual Losses for Real-Time Style Transfer and Super-Resolution
Justin Johnson, Alexandre Alahi, Li Fei-Fei


The Neural Image Stylizer stylizes the source image using another image. Here is one example:

In what follows, I’ll try to explain what makes this model work. Intuitively, we can think of the model as painting an artwork that recreates the content of the content image while mimicking the artistic style of another image. Mathematically, the model finds a set of parameters \theta that minimizes a loss function combining the differences in content and in style:

\text{model} = \arg\min_{\theta} \lambda_{feature}\ell_{feature} +  \lambda_{style}\ell_{style} +  \lambda_{tv}\ell_{tv}

where each \lambda represents the importance we assign to the corresponding component. I will leave the discussion of \ell_{tv} for later; for now, let’s focus on the most important part.
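As a small illustration of the weighted sum above, here is a sketch in Python. The function name and the default weights are placeholders I made up for this example; they are not the values used to train the model.

```python
def total_loss(feature_loss, style_loss, tv_loss,
               lambda_feature=1.0, lambda_style=1.0, lambda_tv=1.0):
    """Weighted sum of the three loss terms.

    The default weights are placeholders (an assumption of this sketch),
    not tuned values.
    """
    return (lambda_feature * feature_loss
            + lambda_style * style_loss
            + lambda_tv * tv_loss)
```

Raising one \lambda relative to the others makes the optimizer care more about that term, for example a larger \lambda_{style} pushes the output further toward the style image.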

For clarity, I’ll use content images to refer to the input images whose content we want to keep, style images to refer to the images whose style we want the final output to imitate, and output images to refer to the artwork our neural network creates.

Feature Loss

Look again at the content image and the output image above. What exactly makes you believe they represent the same thing? If you look carefully, you might notice that the output image is somewhat tilted, with mosaic structures that are not present in the original image. You might suggest that it is the structure of the image, like the corridor, that helps you recognize it. More abstractly, the high-level features of the two images are similar, even though their low-level pixel values have changed.

The tricky question remains: how do we define high-level features? It turns out that there is a very interesting phenomenon:

A deep learning model, if well trained on other tasks, can behave somewhat like a human in that it forms its own understanding of images. Like real neurons, the artificial neurons in the model activate when they see certain patterns in the input images. These learned features remain useful even when we apply the model to images from a different domain. This is called transfer learning, but we will skip the details.

Long, Jonathan, Evan Shelhamer, and Trevor Darrell. “Fully Convolutional Networks for Semantic Segmentation.” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2015)

Visualizing the activations of artificial neurons at different layers: lower layers usually detect edges, while higher layers form their own representations of the images that are invariant to specific details.

Zeiler, Matthew D., and Rob Fergus. “Visualizing and Understanding Convolutional Networks.” Computer Vision – ECCV 2014 Lecture Notes in Computer Science (2014): 818-33.

Specifically, an image is just a large 3D array of dimension width \times height \times channels (RGB), where each value represents the intensity of a pixel in one channel. To obtain the high-level features of an image, we feed that array into a pre-trained model called the VGG network and extract the activations of the artificial neurons in its higher layers. We do the same for both the content and output images, and then compute the element-wise squared differences between the two feature representations. We call half the sum of these squared differences the feature loss.

Denote by _{C}F_{f,p}^{l} \in \mathbb{R} the activation of content image C at layer l, filter f, position p of the VGG network, and by _{X}F_{f,p}^{l} \in \mathbb{R} that of output image X. The loss function for the feature representations is:

\ell_{feature} (C, X, l) = \frac{1}{2}\sum _{f,p} (_{C}F_{f,p}^{l}-_{X}F_{f,p}^{l})^2
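The formula above can be sketched in a few lines of NumPy. This is only an illustration: it assumes the layer activations have already been extracted from the VGG network and are handed in as plain arrays of the same shape, and the function name is my own.

```python
import numpy as np


def feature_loss(content_features, output_features):
    """Half the sum of squared differences between two activation arrays.

    Both arrays hold the activations of one layer and must share a shape,
    for example (filters, positions).
    """
    content_features = np.asarray(content_features, dtype=np.float64)
    output_features = np.asarray(output_features, dtype=np.float64)
    if content_features.shape != output_features.shape:
        raise ValueError("activation arrays must have the same shape")
    diff = content_features - output_features
    return 0.5 * np.sum(diff ** 2)
```

The loss is zero only when the two activation arrays are identical, which is what we want: the output image should trigger the same high-level neurons as the content image.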

Style Loss

While we can easily say what makes two images similar in content, the subtle differences in style are much harder to pin down. How do we quantify complex combinations of colors and textures that we might not even be able to interpret? The original image stylization paper, A Neural Algorithm of Artistic Style, proposes a very smart approach. It defines style as the high-level relationships between features; specifically, style is encoded in the correlations between the activations of different filters. These correlations are captured by the Gram matrix, whose entries are the inner products between pairs of filter responses, summed over all positions. Similar to the way the content loss is computed, the method extracts the activations of multiple layers and computes Gram matrices for both the style images and the output images.

_{S}G_{f_1,f_2}^{l} = \sum _{p} {_{S}F_{f_1,p}^{l}} \cdot {_{S}F_{f_2,p}^{l}}
_{X}G_{f_1,f_2}^{l} = \sum _{p} {_{X}F_{f_1,p}^{l}} \cdot {_{X}F_{f_2,p}^{l}}
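In code, the sum over positions p is just a matrix product. Here is a minimal NumPy sketch; the helper names and the (height, width, filters) input layout are assumptions of this example, not something taken from the implementation.

```python
import numpy as np


def to_filter_rows(activations):
    """Reshape (height, width, filters) activations to (filters, positions)."""
    activations = np.asarray(activations, dtype=np.float64)
    if activations.ndim != 3:
        raise ValueError("expected an array of shape (height, width, filters)")
    filters = activations.shape[-1]
    return activations.reshape(-1, filters).T


def gram_matrix(features):
    """Gram matrix G[f1, f2] = sum over p of F[f1, p] * F[f2, p].

    `features` has shape (filters, positions); the result has shape
    (filters, filters).
    """
    features = np.asarray(features, dtype=np.float64)
    return features @ features.T
```

Note that the Gram matrix has one row and one column per filter, so its size does not depend on the number of positions. This is why a style image and an output image of different sizes can still be compared.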

What follows is almost the same as the way the content loss is computed, except that the per-layer losses are summed according to certain weights, which are just hyperparameters one can fine-tune.

\ell_{style}^{l} = \frac{1}{size}\sum _{f_1,f_2} (_{S}G_{f_1,f_2}^{l} - _{X}G_{f_1,f_2}^{l})^2

and \ell_{style} = \sum _{l}w_l\cdot \ell_{style}^{l} where w_l are layer-wise weights
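Putting the two style formulas together, a rough NumPy sketch could look like this. The text does not define the normalizer `size`, so I assume here that it is the number of entries in the style activation array; the function names are also my own.

```python
import numpy as np


def gram_matrix(features):
    """Gram matrix of a (filters, positions) activation array."""
    features = np.asarray(features, dtype=np.float64)
    return features @ features.T


def layer_style_loss(style_features, output_features):
    """Squared Gram-matrix difference for one layer, divided by `size`.

    Assumption of this sketch: `size` is the number of entries in the
    style activation array (filters * positions).
    """
    gram_style = gram_matrix(style_features)
    gram_output = gram_matrix(output_features)
    size = np.asarray(style_features).size
    return np.sum((gram_style - gram_output) ** 2) / size


def style_loss(style_layers, output_layers, layer_weights):
    """Weighted sum of the per-layer style losses."""
    if not (len(style_layers) == len(output_layers) == len(layer_weights)):
        raise ValueError("need one weight and one output array per layer")
    return sum(
        w * layer_style_loss(s, x)
        for s, x, w in zip(style_layers, output_layers, layer_weights)
    )
```

The layer weights w_l let us decide how much the fine textures captured by lower layers matter compared to the larger patterns captured by higher layers.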

Total Variation Loss

If we think of the pixel values in the image as a distribution, it comes as no surprise that natural images tend to be very smooth: a pixel usually has almost the same value as its neighbors. To make our generated images look more natural, we want a term that discourages large variations among nearby pixels. Total variation loss, which has its roots in signal processing, is just the sum of the distances between the intensities of neighboring pixels.
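Here is a minimal NumPy sketch of a total variation term. It uses the squared-difference variant, which is a common choice but an assumption on my part, and it expects an image laid out as (height, width, channels).

```python
import numpy as np


def total_variation_loss(image):
    """Sum of squared differences between vertically and horizontally
    adjacent pixels of a (height, width, channels) image.

    The squared variant is an assumption of this sketch; an absolute-value
    variant is equally common.
    """
    image = np.asarray(image, dtype=np.float64)
    if image.ndim != 3:
        raise ValueError("expected an array of shape (height, width, channels)")
    vertical = image[1:, :, :] - image[:-1, :, :]
    horizontal = image[:, 1:, :] - image[:, :-1, :]
    return np.sum(vertical ** 2) + np.sum(horizontal ** 2)
```

A perfectly flat image gives a loss of zero, while a noisy image gives a large one, so minimizing this term nudges the output toward smooth regions.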

The following figure from the original paper summarizes what we have just described. A model is created to generate images that minimize the sum of all three loss values at the same time. One interesting fact about this architecture is that while the image-generating network is a deep learning model, its objective function is itself computed by a deep learning model. Finally, the model is trained to behave like what you just saw.

Johnson, Justin, Alexandre Alahi, and Li Fei-Fei. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution.” Computer Vision – ECCV 2016 Lecture Notes in Computer Science (2016): 694-711

Finding the Right Model

We have specified the objective function we want to minimize. Finding a model that accomplishes this challenging task is where the real power of modern deep learning shows.

Inside a Convolutional Network: the output of each horizontal layer of a typical convolutional network architecture applied to the image of a Samoyed dog. Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up.

Lecun, Yann, Yoshua Bengio, and Geoffrey Hinton. “Deep Learning.” Nature 521.7553 (2015): 436-44

The model used here is an 11-layer neural network with 3 convolutional layers followed by 5 residual layers and 3 deconvolutional layers. As shown in the image above, the convolutional layers map the image matrix to a hierarchy of feature maps by applying filters that summarize local information as it passes to higher and higher layers. Since the full equation for convolution is very notation-heavy, for now I will use f_{conv}(\cdot) to denote the convolution function.
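To give a rough idea of what f_{conv} hides, here is a deliberately naive NumPy sketch for a single channel and a single filter, with no padding and a stride of 1. A real convolutional layer handles many channels and filters at once and is far more efficient; this toy version only shows the sliding-window idea.

```python
import numpy as np


def f_conv(image, kernel):
    """Slide `kernel` over a 2D `image` and sum the element-wise products.

    Toy version: one channel, one filter, no padding, stride 1. Like most
    deep learning libraries, it does not flip the kernel.
    """
    image = np.asarray(image, dtype=np.float64)
    kernel = np.asarray(kernel, dtype=np.float64)
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    if out_h <= 0 or out_w <= 0:
        raise ValueError("kernel must not be larger than the image")
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Each output value summarizes one local patch of the input.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out
```

Each output value only looks at a small patch of the input, which is what "summarizing local information" means here.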

Denoting by X \in \mathbb{R}^{width \times height \times channel} the image matrix, each convolutional layer computes:

h = f_{conv}(X) + b

where b is the bias. We then normalize the hidden state h by instance normalization:

_{normalized}h_{t,i,j,k} = \frac{h_{t,i,j,k} - \mu_{t,i} }{\sqrt{\sigma_{t,i}^{2}+\epsilon}}

where \mu and \sigma are the mean and standard deviation computed over the spatial positions of each feature map of each image, and \epsilon is a small constant that prevents division by zero. The justification for instance normalization is better explained in the paper Instance Normalization: The Missing Ingredient for Fast Stylization, but we can think of it as improving the quality of the final image by standardizing each feature map. In the last step, we apply the ReLU function max(h,0) to add non-linearity to the model, as is standard practice (otherwise the model would be just a series of linear transformations, and so would the final output). The output is passed on to the next layer.
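The normalization and the ReLU step can be sketched in NumPy as follows. I assume the hidden state is laid out as (image, channel, height, width), matching the indices t, i, j, k in the formula, and I leave out the learnable scale and shift that practical implementations usually add.

```python
import numpy as np


def instance_norm(h, eps=1e-5):
    """Normalize each feature map of each image to zero mean, unit variance.

    `h` has shape (image, channel, height, width). The mean and variance
    are taken over the two spatial axes only.
    """
    h = np.asarray(h, dtype=np.float64)
    mu = h.mean(axis=(2, 3), keepdims=True)
    var = h.var(axis=(2, 3), keepdims=True)
    return (h - mu) / np.sqrt(var + eps)


def relu(h):
    """Element-wise max(h, 0)."""
    return np.maximum(h, 0.0)
```

Unlike batch normalization, the statistics here never mix different images, so the result for one image does not depend on what else is in the batch.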

The 3 convolutional layers are followed by 5 residual layers. These layers compute the function \mathcal{F}(x) + x, where \mathcal{F} is itself a convolutional function and the + x term copies the lower layer's output to the next layer. We can think of these residual blocks as providing a highway connection to the next layer, though the detailed justification is somewhat complex. Intuitively speaking, a deep neural network is very hard to train because of a phenomenon called exploding / vanishing gradients. The residual connections introduced here help alleviate that issue by letting lower-level information pass directly to higher layers.


He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016)
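The skip connection itself is tiny. In this sketch, `transform` stands in for the convolutional function \mathcal{F}; it is a hypothetical placeholder, and any callable that preserves the input shape will do.

```python
import numpy as np


def residual_block(x, transform):
    """Return transform(x) + x, i.e. F(x) + x.

    `transform` is a placeholder for the convolutional function F and
    must return an array with the same shape as `x`.
    """
    x = np.asarray(x, dtype=np.float64)
    out = np.asarray(transform(x), dtype=np.float64)
    if out.shape != x.shape:
        raise ValueError("transform must preserve the input shape")
    return out + x
```

If the transform outputs all zeros, the block simply passes its input through unchanged, which is one way to see why stacking such blocks does not make the network harder to train.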

The final three layers of the neural network are deconvolutional layers. These layers convert the condensed high-level feature representations back to the dimensions of the desired output image.

It is important to note that while image pixel values are expected to lie within the range 0-255, this is not the case for the output of the convolutional layers. In the last step, the model applies tanh(\cdot) + C to squash the unbounded values into a fixed range.
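One possible way to do this squashing is shown below. The text only gives the form tanh(\cdot) + C, so the scale and offset of 127.5 used here are my own assumption, chosen so that the output lands in the range 0-255.

```python
import numpy as np


def to_pixel_range(h, scale=127.5, offset=127.5):
    """Map unbounded values into [offset - scale, offset + scale].

    The defaults (an assumption of this sketch) give the range 0-255.
    """
    h = np.asarray(h, dtype=np.float64)
    return scale * np.tanh(h) + offset
```

Because tanh saturates at -1 and 1, even very large activations can never produce a pixel value outside the valid range.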

If we think of the model as a graph of artificial neurons, it involves tens of thousands of connections and many parameters that need to be tuned. To find a set of parameters that minimizes the loss function we designed, we use gradient descent, with the gradients computed by standard backpropagation. Specifically, at each training step we update the model parameters \theta by:

\theta = \theta - \varepsilon \cdot \frac{\partial \mathcal{L}}{\partial \theta}

where \varepsilon and \mathcal{L} are the learning rate and the loss function respectively. The partial derivative tells us how to change the parameter \theta so that the loss value decreases, and the learning rate specifies how large an update step we take.
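To see the update rule in action without a neural network, here is a toy sketch that minimizes the one-parameter loss (\theta - 3)^2. The loss, the function names and the default settings are all made up for illustration; in the real model the gradient comes from backpropagation through the whole network.

```python
def gradient_step(theta, grad, learning_rate):
    """One update: theta <- theta - learning_rate * grad."""
    return theta - learning_rate * grad


def minimize_quadratic(theta0=0.0, target=3.0, learning_rate=0.1, steps=100):
    """Minimize the toy loss (theta - target)^2 by gradient descent."""
    theta = theta0
    for _ in range(steps):
        grad = 2.0 * (theta - target)  # derivative of the toy loss
        theta = gradient_step(theta, grad, learning_rate)
    return theta
```

With a learning rate of 0.1 each step shrinks the distance to the minimum by a constant factor, so \theta converges to 3. A learning rate that is too large would overshoot the minimum and diverge instead.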

The network was trained on roughly 80,000 images for two epochs. Training took approximately 13 hours on a single Tesla K80 GPU on an Amazon Web Services EC2 instance. After training, the model transforms an image within a few seconds on a CPU, and within a fraction of a second on a GPU.

More Examples


A Neural Algorithm of Artistic Style

Very Deep Convolutional Networks for Large-Scale Image Recognition

Deep Residual Learning for Image Recognition

Instance Normalization: The Missing Ingredient for Fast Stylization

Fully Convolutional Networks for Semantic Segmentation

Visualizing and Understanding Convolutional Networks

‘s Github Page

‘s Github Page

