Neural Style Transfer: Briefly explained and implemented in PyTorch.

Cyborg
9 min read · May 26, 2024


Brief History and Motivation Behind NST

Neural Style Transfer (NST) is a technique that emerged from the intersection of deep learning and computer vision. It gained significant attention in 2015 when researchers Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge introduced a groundbreaking paper titled “A Neural Algorithm of Artistic Style.” This paper showcased how convolutional neural networks (CNNs), which are typically used for image recognition tasks, could be repurposed to create stunning artistic images by blending the content of one image with the style of another.

You can read the published paper here: https://arxiv.org/abs/1508.06576

What is Neural Style Transfer?

Neural Style Transfer is a deep learning technique that manipulates an image by merging the content of one image with the artistic style of another. Imagine taking a photograph of a cityscape and transforming it to look as if it were painted by Vincent van Gogh or Pablo Picasso. This is achieved by leveraging the power of CNNs to separately extract and recombine the content and style features of images.

The process works by first extracting the content features from the target image and the style features from the reference style image. Through an optimization process, NST adjusts a third image, called the generated image, to minimize the differences in content and style between the target and reference images respectively. The result is an image that maintains the structural integrity of the content image while adopting the stylistic elements of the reference image.

Understanding Content and Style in Images

Definition of Content and Style in the Context of Images

In the context of image processing, “content” and “style” refer to two distinct aspects of an image.

  • Content: This pertains to the objects, structures, and the overall layout within an image. For example, in a photograph of a cityscape, the buildings, streets, and skyline represent the content. It captures the essence and the primary subjects of the image.
  • Style: Style, on the other hand, encompasses the artistic elements of an image, such as colors, textures, brushstrokes, and patterns. In a painting, the style is what defines the artist’s unique approach, be it the swirling strokes of Van Gogh or the cubist forms of Picasso.

How Humans Perceive Content and Style Differently

We humans have an innate ability to distinguish between content and style when viewing an image. This ability allows us to recognize the same scene (content) even when it is rendered in different artistic styles. For instance, a picture of a tree drawn in a cartoonish manner or painted realistically is still identifiable as a tree. Our brains can separate the structural components (content) from the aesthetic elements (style) seamlessly.

The Role of Convolutional Neural Networks (CNNs) in Distinguishing Content from Style

Convolutional Neural Networks (CNNs) play a pivotal role in the process of Neural Style Transfer by mimicking this human ability to separate content and style. CNNs are a class of deep learning models particularly adept at processing visual data. They are composed of multiple layers that extract features from images in a hierarchical manner.

  • Content Extraction: In the context of NST, deeper layers of a CNN are used to capture the content of an image. These layers respond to high-level features such as shapes and structures. By using these deeper layers, NST can effectively represent the primary subjects of an image without focusing on fine details.
  • Style Extraction: The style of an image is captured using the responses from several layers of the CNN, with an emphasis on the shallower ones. These layers detect lower-level features such as edges, textures, and colors. By examining the correlations between the activations within these layers, NST can encode the stylistic elements of an image.

The NST process involves optimizing a new image to have similar content representations (from the deeper layers) to the content image and similar style representations (from the shallower layers) to the style image. This optimization is done by minimizing a loss function that quantifies the differences in content and style between the generated image and the target images.

Implementing Neural Style Transfer in PyT🔥rch

Step 1: Install and import required packages

!pip install torch torchvision torch_snippets

# import the required packages
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import torchvision.transforms as T
import numpy as np
from torch_snippets import *

# set the device
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(device)

Step 2: Load and preprocess the images

# URL for the `style image`
!wget https://res.cloudinary.com/ezmonydevcloud/image/upload/v1710325675/turning-pattern.png

# URL for the `content image`
!wget https://res.cloudinary.com/ezmonydevcloud/image/upload/v1710322208/thomas-shelby.jpg

preprocess = T.Compose([
    T.ToTensor(),
    # we'll use mean and std used for VGG models
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    T.Lambda(lambda x: x.mul_(255))
])

postprocess = T.Compose([
    T.Lambda(lambda x: x.mul(1./255)),
    T.Normalize(mean=[-0.485/0.229, -0.456/0.224, -0.406/0.225], std=[1/0.229, 1/0.224, 1/0.225]),
])

Explanation

  • Preprocessing: Prepares the image data to match the distribution of the data used to train the VGG model. This involves normalizing the image tensor to have a specific mean and standard deviation, ensuring consistency in the input data.
  • Postprocessing: Reverts the transformations applied during preprocessing, bringing the image data back to its original form or a form suitable for further use, such as visualization.
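
As a quick sanity check (my own addition, not part of the original walkthrough), postprocess should approximately invert preprocess, recovering the original 0–1 image tensor up to 8-bit rounding:

# Hypothetical round-trip check: postprocess(preprocess(img)) ≈ ToTensor(img)
x = torch.rand(3, 64, 64)                       # stand-in for an image tensor in [0, 1]
roundtrip = postprocess(preprocess(T.ToPILImage()(x)))
print(torch.allclose(x, roundtrip, atol=1e-2))  # expect True (small 8-bit rounding error)
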
from PIL import Image
import matplotlib.pyplot as plt

imgs = [Image.open(path).resize((512, 512)).convert('RGB') for path in ['turning-pattern.png', 'thomas-shelby.jpg']]
style_image, content_image = [preprocess(img).to(device)[None] for img in imgs]

# Clone the content image as the image to optimize, with `requires_grad=True`
opt_img = content_image.data.clone()
opt_img.requires_grad = True

# Display the images
fig, axes = plt.subplots(nrows=1, ncols=len(imgs), figsize=(len(imgs) * 5, 5))

for idx in range(len(imgs)):
    axes[idx].imshow(imgs[idx])
    axes[idx].axis('off')  # do this, if you want to hide the axis

plt.show()

Step 3: Define the model class, vgg19_modified

from torchvision.models import vgg19, VGG19_Weights

class vgg19_modified(nn.Module):
    def __init__(self):
        super().__init__()
        features = list(vgg19(weights=VGG19_Weights.IMAGENET1K_V1).features)
        self.features = nn.ModuleList(features).eval()

    # The forward method takes a list of layer indices and returns the features corresponding to each layer
    def forward(self, x, layers=[]):
        order = np.argsort(layers)
        _results, results = [], []
        for ix, model in enumerate(self.features):
            x = model(x)
            if ix in layers: _results.append(x)
        for o in order: results.append(_results[o])
        return results if layers else x

# Define the model object
vgg = vgg19_modified().to(device)

This code defines a modified VGG19 model that allows extraction of features from specific layers. By creating a custom vgg19_modified class, it loads the pre-trained VGG19, converts its feature layers to a ModuleList, and enables the forward method to return features from specified layers.
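
As a quick check (my own addition, not in the original post), passing a dummy 512×512 image through the model confirms that it returns one feature map per requested layer, in ascending layer order:

with torch.no_grad():
    dummy = torch.randn(1, 3, 512, 512).to(device)
    feats = vgg(dummy, layers=[0, 21])
    print([f.shape for f in feats])  # expect (1, 64, 512, 512) and (1, 512, 64, 64)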

Step 4: Implementing loss functions

Content Loss Function

The content loss function is used to ensure that the generated image retains the primary subjects and structure of the content image. It measures how much the high-level features of the generated image differ from those of the content image.

In practice, this is achieved by passing both the content image and the generated image through a pre-trained Convolutional Neural Network (CNN) and extracting the feature maps from a specific layer. These feature maps represent the high-level content of the images.

The content loss L_content is calculated as the mean squared error (MSE) between the feature representations of the content image P and the generated image G:

$$L_{content} = \frac{1}{2} \sum_{i,j} \left( F^{l}_{ij} - P^{l}_{ij} \right)^{2}$$

where:

  • F^l is the feature map of the generated image from layer l.
  • P^l is the feature map of the content image from layer l.
  • i and j index the elements in the feature maps.

This loss function ensures that the generated image G has similar high-level features as the content image P, preserving the overall structure and objects in the scene.
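
To make this concrete, here is a minimal sketch of the content loss using the vgg model and the images loaded above (my own illustration; the full loss setup used by this post comes later):

# Minimal sketch: content loss as an MSE between feature maps from one deeper layer
with torch.no_grad():
    target_feat = vgg(content_image, [21])[0]  # conv4_2 features of the content image
gen_feat = vgg(opt_img, [21])[0]               # same layer for the image being optimized
content_loss = F.mse_loss(gen_feat, target_feat)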

Style Loss Function

The style loss function is designed to capture the stylistic elements of the style image and transfer them to the generated image. It measures how well the textures, colors, and patterns of the style image are replicated in the generated image.

To achieve this, the style of an image is represented using the Gram matrix, which captures the correlations between different feature maps in a layer of the CNN. The Gram matrix G^l for a layer l is defined as:

$$G^{l}_{ij} = \sum_{k} F^{l}_{ik} F^{l}_{jk}$$

The style loss L_style is calculated as the MSE between the Gram matrices of the style image S and the generated image G, summed over the chosen style layers:

$$L_{style} = \sum_{l} w_{l} \, \frac{1}{4 N_{l}^{2} M_{l}^{2}} \sum_{i,j} \left( G^{l}_{ij} - A^{l}_{ij} \right)^{2}$$

where G^l and A^l are the Gram matrices of the generated image and the style image at layer l, N_l is the number of feature maps in that layer, M_l is their spatial size, and w_l is the weight given to each layer.

This loss function ensures that the generated image G replicates the stylistic features of the style image S, including textures and patterns.
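
Connecting this to tensors (my own illustration; the post defines a reusable GramMatrix module in the next step), the style loss for a single VGG layer can be sketched as follows, with the Gram matrix normalized by the number of spatial positions, as in the module below:

# Illustrative single-layer style loss: Gram matrices are the channel-by-channel
# correlations of the flattened feature maps, compared with MSE.
def gram(f):                                    # f: (b, c, h, w)
    b, c, h, w = f.shape
    flat = f.view(b, c, h*w)
    return flat @ flat.transpose(1, 2) / (h*w)  # (b, c, c)

with torch.no_grad():
    style_feat = vgg(style_image, [5])[0]       # conv2_1 features of the style image
gen_feat = vgg(opt_img, [5])[0]
layer_style_loss = F.mse_loss(gram(gen_feat), gram(style_feat))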

Combining Content and Style Losses

In Neural Style Transfer, the total loss function is a weighted sum of the content loss and the style loss:

$$L_{total} = \alpha L_{content} + \beta L_{style}$$

where:

  • α and β are weights that control the relative importance of content and style.

By minimizing this total loss function, the NST algorithm generates an image that balances the structure of the content image with the artistic style of the style image.
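
Reusing the content_loss and layer_style_loss tensors from the two sketches above (the α and β values here are purely illustrative; the actual per-layer weights used in this post are defined in the code below), the combination is just a weighted sum:

alpha, beta = 1.0, 1e3                                   # illustrative weights only
total_loss = alpha * content_loss + beta * layer_style_loss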

Define the gram matrix module

class GramMatrix(nn.Module):
    def forward(self, input):
        b, c, h, w = input.size()
        feat = input.view(b, c, h*w)
        G = feat @ feat.transpose(1, 2)
        G.div_(h*w)
        return G

# Define the gram matrix's corresponding MSE loss, GramMSELoss
class GramMSELoss(nn.Module):
    def forward(self, input, target):
        out = F.mse_loss(GramMatrix()(input), target)
        return out

Specify the layers that define the content loss and style loss

style_layers = [0, 5, 10, 19, 28]  # conv1_1, conv2_1, conv3_1, conv4_1, conv5_1
content_layers = [21]              # conv4_2
loss_layers = style_layers + content_layers

# Define the loss function for content and style loss values
loss_fns = [GramMSELoss()] * len(style_layers) + [nn.MSELoss()] * len(content_layers)
loss_fns = [loss_fn.to(device) for loss_fn in loss_fns]

# Define the weights associated with the style and content losses
# (n is the number of feature maps/channels at each chosen style layer)
style_weights = [1000/n**2 for n in [64, 128, 256, 512, 512]]
content_weights = [1]
weights = style_weights + content_weights

We need to manipulate our image (opt_img) so that its style resembles style_image as closely as possible. Hence we compute the style_targets values from style_image by computing the GramMatrix of the features obtained from a few chosen layers of VGG. Since the overall content should be preserved, we use the content_layers variable to choose the layer at which the raw features are taken from VGG.

style_targets = [GramMatrix()(A).detach() for A in vgg(style_image, style_layers)]
content_targets = [A.detach() for A in vgg(content_image, content_layers)]
targets = style_targets + content_targets

Step 5: Define optimizer and number of iterations

Even though we could have used Adam or any other optimizer, LBFGS has been observed to work best in deterministic scenarios, and since we are optimizing exactly one image, there is nothing stochastic here. Many experiments have shown that LBFGS converges faster and to lower losses in neural style transfer settings, so we will use this optimizer.

max_iters = 1000
optimizer = optim.LBFGS([opt_img])
log = Report(max_iters)

Step 6: Train the model

Perform the optimization. In deterministic scenarios where we are iterating on the same tensor again and again, we can wrap the optimizer step as a function with zero arguments and repeatedly call it, as shown here.

iters = 0
while iters < max_iters:
    def closure():
        global iters
        iters += 1
        optimizer.zero_grad()
        out = vgg(opt_img, loss_layers)
        layer_losses = [weights[a] * loss_fns[a](A, targets[a]) for a, A in enumerate(out)]
        loss = sum(layer_losses)
        loss.backward()
        log.record(pos=iters, loss=loss, end='\r')
        return loss
    optimizer.step(closure)

# Plot the variation in the loss
log.plot(log=True)

# Display the generated image, which combines the content and style images
out_img = postprocess(opt_img[0]).permute(1, 2, 0)
show(out_img)
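
If you want to keep the result (my own addition, not in the original post), you can also write the stylized image to disk, for example:

# Save the stylized output as a PNG (values clamped to [0, 1] before saving)
from torchvision.utils import save_image
save_image(postprocess(opt_img[0].detach().cpu()).clamp(0, 1), 'stylized.png')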

As you can see, after training, our model is able to successfully blend the style with the content of the original image. You can try this out on your own images to see how it performs on them. You can also try tweaking some parameters or training for more iterations if you are not satisfied with the final result.

You can get the code from my Kaggle notebook.

You could also check out my other blog posts.

See you in the future in a different dimension (maybe…)

Written by Cyborg

Yours not so typical "Deep Learning Guy"
