[Review] Image2StyleGAN — Embedding An Image Into StyleGAN.

Oscar Guarnizo
Mar 26, 2022 · 10 min read
Image2StyleGAN — Optimization Framework

Several months ago, I was working on Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? during an internship at King Abdullah University of Science and Technology (KAUST). I remember struggling at the beginning because I didn’t understand some of the terms and concepts very well.

What I commonly do in those situations is explore the bigger picture of a problem through high-level explanations in videos, blogs, or reviews. However, I couldn’t find good introductory material at the time; I would have liked to have easy-to-understand material. For those reasons, I wrote this post, in which I intend to explain Image2StyleGAN (I2S) concisely and provide some additional references for further understanding.

In short, Image2StyleGAN is an optimization algorithm that maps an image into the latent space of a pre-trained StyleGAN. These mappings, known as latent codes, are useful for subsequent image-editing applications.

This post intends to explain the main concepts behind Image2StyleGAN. You can also find my code implementation [here] and a poster article [here].

Who am I? Hi there! 👋

I’m Oscar, a young computer scientist from Ecuador. During my bachelor’s program, I received an unconventional education in Artificial Intelligence and Machine Learning (AI/ML). Since then, I have been studying AI/ML on my own (recently with a more efficient plan).

My journey so far has been very fruitful and has allowed me to accomplish several milestones. I have published 4+ scientific articles, worked in industry on computer vision projects, and done an internship at KAUST, Saudi Arabia, working on StyleGAN projects. I was a member of the Scientific Computing Group back in college. Last but not least, I recently co-founded DeepARC, a non-profit research group that I am most proud of. It was created through the cooperation of alumni, professors, and students to encourage AI/ML research among undergraduate students.

1. StyleGAN Principles 📚

Before digging into Image2StyleGAN, it is essential to understand how StyleGAN architecture works.

Here, I do not intend to give an in-depth explanation of StyleGAN 1 or 2; instead, I will highlight only the main aspects you need in order to understand I2S. You can find a complete review [here] by Jonathan Hui. It’s also helpful to understand the fundamental algorithm of a vanilla GAN beforehand.

From my perspective, StyleGAN aims to generate high-quality images with a better sense of, and control over, the style/type of the generated image. StyleGAN differs from a vanilla GAN primarily in its generator architecture, where the authors replace/add components to regulate the image generation.

Commonly, a vanilla generator maps a latent code z into an image I, where z is sampled from a uniform or normal distribution. The latent code plays an essential role in generating an image; indeed, we can say that z contains the meta-information needed to generate it. Naturally, we can wonder how to leverage this latent code to control the style/type of a generated image. However, doing so directly from the latent code z is not very effective, because the factors of variation in Z are highly entangled.

GAN — StyleGAN & StyleGAN2 by Jonathan Hui

StyleGAN addresses the issue mentioned above by introducing a Mapping Network f, which maps the latent space Z into a latent space W. The network f consists of eight fully connected layers. The initial latent code z goes through the mapping network to get a new latent code w, which is more representative and contains more independent (disentangled) latent factors. This new representation can be roughly understood as the style meta-information of an image (more efficient and more disentangled than the Z space). The following image shows the mapping architecture.

A Style-Based Generator Architecture for Generative Adversarial Networks by Karras T et al.
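To make this concrete, here is a minimal sketch of the mapping network, assuming PyTorch; the 512-dimensional layer widths follow the paper, but other details of the official implementation (normalization, learning-rate scaling, etc.) are omitted.

```python
import torch
import torch.nn as nn

# Minimal sketch of the mapping network f: eight fully connected layers that
# turn a latent code z in Z into a latent code w in W (both 512-dimensional).
class MappingNetwork(nn.Module):
    def __init__(self, dim=512, num_layers=8):
        super().__init__()
        layers = []
        for _ in range(num_layers):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, z):
        return self.net(z)

z = torch.randn(4, 512)          # a batch of latent codes z ~ N(0, I)
w = MappingNetwork()(z)          # the corresponding latent codes w, shape (4, 512)
```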

We start with a 512-dimensional latent code z ∈ Z and get a 512-dimensional latent code w ∈ W. Then, w can be used to feed the synthesis network g, as shown below.

A Style-Based Generator Architecture for Generative Adversarial Networks by Karras T et al.

The synthesis network generates an image following a progressive-growth approach. It starts with a constant low-resolution 4x4x512 tensor and doubles the resolution at each computational block, going from 4x4 up to 1024x1024 (9 resolutions in total).

In contrast with a vanilla GAN, the latent code w is fed into the synthesis network at different layers and resolutions through an affine transformation (A) and an adaptive instance normalization (AdaIN). In total, the network applies this procedure at 18 different layers, corresponding to two layers per resolution from 4x4 to 1024x1024. Applying the transformation per layer lets us inject style information into the spatial data, unlike a conventional generator that feeds the latent code only at the first layer, diminishing its influence over the generated image.
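As a rough illustration of how w is injected, here is a hedged sketch of the affine transform A plus AdaIN (assuming PyTorch); the real implementation has additional details such as noise inputs that are left out here.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Sketch of AdaIN: the affine transform A maps w to a per-channel scale
    and bias that restyle the instance-normalized feature maps."""
    def __init__(self, w_dim=512, num_channels=512):
        super().__init__()
        self.affine = nn.Linear(w_dim, 2 * num_channels)   # the "A" block

    def forward(self, x, w):
        # x: feature maps (batch, C, H, W); w: style code (batch, 512)
        scale, bias = self.affine(w).chunk(2, dim=1)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-8
        return scale * (x - mean) / std + bias
```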

The main insights that you can get from this architecture are the following:

  • The architecture comprises two latent spaces, Z and W, with their respective latent codes z and w (both 512-dimensional).
  • The latent code w is more representative and corresponds to more independent (disentangled) factors.
  • The affine transformation (A) and the AdaIN blocks feed the given latent code w into different layers at different resolutions (18 layers in total).

2. Image2Style Objective 💡

Given a pre-trained StyleGAN, Image2StyleGAN intends to map a given image I into the latent space of the StyleGAN.

The usefulness of this process lies in the subsequent applications, where you can manipulate the resulting latent codes to generate a modified version of the given image. Additionally, this algorithm differs from others because it lets you modify a given image instead of an image randomly generated by the StyleGAN.

As we saw before, the StyleGAN architecture has different latent spaces, and we have to decide in which latent space to embed our given image. Two obvious options are the latent spaces Z and W, and W is more representative than Z. However, the authors of this article proposed an extended version, W+, which corresponds to an 18x512-dimensional latent code. This change allows better control of the generated image and the subsequent applications.

Recall that StyleGAN feeds w into the synthesis network 18 times (2 layers per resolution) to apply style transformations. So the idea is to have one 512-dimensional vector per layer connection in the original architecture, which allows you to control the generated image’s features per resolution by manipulating these vectors. Below, you can check how the synthesis network G works with our new latent code.
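To make the shape concrete, here is a small sketch (assuming PyTorch tensors) of a W+ code and of how an ordinary w would be broadcast into it:

```python
import torch

# A W+ latent code: one 512-dimensional style vector per AdaIN layer (18 in total).
w_plus = torch.randn(18, 512)

# An ordinary w in W embedded into W+ is simply the same vector repeated 18 times.
w = torch.randn(512)
w_plus_from_w = w.unsqueeze(0).repeat(18, 1)   # shape (18, 512)
```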

3. Optimization Framework 📈

Once we know our objective, we need a procedure to find the optimal w+ latent code. The authors proposed an optimization framework based on gradient descent. The steps are the following (a minimal code sketch follows the list):

  1. Start with an initial latent code w* (initial guess).
  2. Generate an image I* with the latent code w*.
  3. Compare the generated image I* with the reference image I using a loss function.
  4. Based on the loss (error), update the latent code w* via gradient descent.
  5. Repeat this process for a given number of iterations.
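Here is a minimal sketch of that loop (assuming PyTorch); `generator`, `perceptual_loss`, and the hyperparameters are placeholders for the pre-trained synthesis network and the loss described in the next section, not the authors’ exact implementation.

```python
import torch
import torch.nn.functional as F

def embed_image(generator, perceptual_loss, target, w_init, num_steps=5000, lr=0.01):
    # w* is the only thing being optimized; the generator stays frozen.
    w = w_init.clone().detach().requires_grad_(True)     # shape (18, 512)
    optimizer = torch.optim.Adam([w], lr=lr)
    for _ in range(num_steps):
        optimizer.zero_grad()
        generated = generator(w)                          # image I* from the current w*
        loss = perceptual_loss(generated, target) + F.mse_loss(generated, target)
        loss.backward()                                   # gradients flow into w only
        optimizer.step()
    return w.detach()
```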

The algorithm optimizes the latent code w* so that the error between the generated image and the reference (original) image is reduced; in other words, it searches for a code that makes both images similar.

If we look at the generated image at each iteration, we can see the progression until we find a latent code that generates an image close to the real one. We can also note how the loss decreases.

Image Generation per Iteration (Optimization Step) initialized with w mean.

At this point, the authors pointed out two concerns: latent code initialization and loss function.

Latent code initialization

The initial guess for the latent code can differ. Therefore, the authors explore two initializations: sampling uniformly at random from U(-1, 1), and using the mean latent code, which corresponds to the average face.

The mean latent code seems to work really well for finding faces because the optimization is like adjusting the mean face until it matches the given face. It helps to start with some facial features already in place instead of starting from a random position.
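A small sketch of how the mean latent code can be obtained (assuming PyTorch and the mapping network from before); the number of samples is an arbitrary choice:

```python
import torch

def mean_latent(mapping_network, num_samples=10000, dim=512):
    with torch.no_grad():
        z = torch.randn(num_samples, dim)
        w_mean = mapping_network(z).mean(dim=0)    # the "average face" code in W
    # Broadcast to W+ so all 18 layers start from the same vector.
    return w_mean.unsqueeze(0).repeat(18, 1)
```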

Loss Function

The loss function comprises two terms: a VGG-16 perceptual loss and a mean squared error (MSE).

The MSE loss is a standard pixel-wise error between the generated and reference images.

The perceptual loss is based on a pre-trained neural network (VGG-16 trained on ImageNet), which helps compute a similarity between two images in terms of hidden features. The idea is to pass both images through VGG-16 and capture the output feature maps (hidden features) from different hidden layers.

The authors realized that MSE loss alone could not find high-quality embeddings, so they added the perceptual loss. The perceptual loss acts as some sort of regularizer to guide the optimization into the right region of the latent space.

Note: A perceptual metric is a broader concept that is used in many other applications.

In practice, the feature maps F are taken from the VGG-16 layers conv1_1, conv1_2, conv3_2, and conv4_2. We then take the norm of the difference between the hidden features at the same positions and compute a weighted average. In this way, it is possible to obtain a similarity measure over both low- and high-level characteristics.
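Below is a hedged sketch of such a perceptual loss using torchvision’s VGG-16. The layer indices correspond to conv1_1, conv1_2, conv3_2, and conv4_2 in torchvision’s layout, and the equal per-layer weights are an assumption, not necessarily the paper’s exact values.

```python
import torch
import torch.nn.functional as F
from torchvision import models

class PerceptualLoss(torch.nn.Module):
    """Sketch of the VGG-16 perceptual loss used to compare hidden features."""
    def __init__(self, layer_ids=(0, 2, 12, 19)):   # conv1_1, conv1_2, conv3_2, conv4_2
        super().__init__()
        self.vgg = models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)

    def _features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, generated, target):
        loss = 0.0
        for f_gen, f_tgt in zip(self._features(generated), self._features(target)):
            loss = loss + F.mse_loss(f_gen, f_tgt)   # compare hidden features layer by layer
        return loss
```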

I was really excited the first time I understood the perceptual loss because it is an example of reusing already trained networks to learn new machine learning models (roughly speaking, doing an AI with another AI).

4. Semantic Editing Applications 💻

Finally, the authors studied three semantic image-editing applications: morphing, expression transfer, and style transfer. Each one can be performed by simple manipulation of the latent codes w, which demonstrates the usefulness of the resulting embeddings.

Morphing

Morphing is an image-processing technique used to metamorphose one image into another. Using two embedded images with their respective latent codes w1 and w2, morphing is computed by linear interpolation of the codes, w = λ·w1 + (1 − λ)·w2, with λ ∈ [0, 1].

The new code w is then used to generate the image with the pre-trained network. Additionally, you can adjust the morphing parameter λ to explore the transition from image 1 to image 2. Below, you can see the effect of this parameter.

Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? by Rameen Abdal et al.
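A minimal sketch of morphing in code; `generator` and the embedded codes are placeholders for the pre-trained synthesis network and the I2S embeddings.

```python
import torch

def morph(generator, w1, w2, lam):
    # Linear interpolation between two embedded latent codes; lam in [0, 1].
    w = lam * w1 + (1.0 - lam) * w2
    return generator(w)

# Sweeping lam produces a smooth transition between the two embedded images:
# frames = [morph(generator, w1, w2, lam) for lam in torch.linspace(0, 1, 8)]
```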

Expression Transfer

Expression Transfer is an image-processing technique used to transfer a facial expression from one image (or more) into another. Using three input latent codes w1, w2, and w3, expression transfer is computed as w = w1 + λ(w3 − w2).

The idea behind the equation is to apply a facial-expression change to a target image (with embedding w1). The change is captured by going from a neutral expression (embedding w2) to a more distinctive expression (embedding w3), and this difference is added to the target. In the example below, we can check these steps, where w2 corresponds to an expressionless face and w3 corresponds to a smiling face of the same person.

Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? by Rameen Abdal et al.
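The same arithmetic can be sketched in code (placeholder names again): the difference w3 − w2 isolates the expression, which is then added to the target embedding.

```python
import torch

def transfer_expression(generator, w1, w2, w3, lam=1.0):
    # Move the target embedding w1 along the "expression direction" w3 - w2.
    w = w1 + lam * (w3 - w2)
    return generator(w)
```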

Style Transfer

Style Transfer is an image-processing technique used to transfer style (color, texture, shadows, content, among others) from one image to another. Using two latent codes w1 and w2, style transfer is computed by a crossover operation (similar to the style mixing in the StyleGAN paper).

Given the reference image (with embedding w1), the crossover operation retains the latent codes of the first 9 layers (corresponding to resolutions from 4x4 to 64x64). It then overrides the latent codes of the last 9 layers with those of the style image (with embedding w2). Below, we can see a representation of the crossover operation.
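A small sketch of the crossover, assuming (18, 512) latent codes and a placeholder `generator`:

```python
import torch

def style_transfer(generator, w1, w2, crossover_layer=9):
    # Keep the first 9 style vectors (4x4 to 64x64) from the embedded content
    # image w1 and take the remaining 9 from the embedded style image w2.
    w = w1.clone()
    w[crossover_layer:] = w2[crossover_layer:]
    return generator(w)
```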

And some style-transfer examples are the following:

Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space? by Rameen Abdal et al.

Final Thoughts 🤔

Image2StyleGAN is an exciting algorithm with a variety of uses beyond those described here. If you are curious and want to learn more, you can check several resources along the same line. Below, I list some resources that you may find valuable.

I hope you enjoyed this post as much as I did when writing it, and that my explanation was clear and not too vague. Ultimately, I wish to generate interest in AI and Machine Learning for prospective learners.

Finally, any recommendations would be appreciated. I am constantly looking to improve the quality of my posts, so your comments are very valuable to me. Have a nice day! :D


Oscar Guarnizo

A young computer scientist with a great passion for machine learning. My page: https://zosov.github.io/ Github: https://github.com/ZosoV