Geometry-Aware Style Transfer: Implementation and Analysis

Perform texture-based (standard) and geometry-aware (novel) neural style transfer on human face images.

Azmarie Wang
Towards Data Science

--

Images from left to right: input content image (Audrey Hepburn in 1956, Public Domain), input style image (Girl with a Black Cat, by Henri Matisse, 1910, Public Domain), a standard style transfer output, proposed method output (generated by the Author)

In this blog post, I will walk through how to perform texture-based (standard) and geometry-aware (novel) neural style transfer on human face images, building on the implementations of DST [1] and the Face of Art [2].

DST is a one-shot, optimization-based method for geometry-aware style transfer. Combining DST with the Face of Art’s style images and facial landmarks, we will see how this method can produce a better representation of an artist’s style than standard style transfer methods.

🌟 Want to play with it yourself? Check out my GitHub repo for the demo and more visual results.

Introduction

Most existing successful work on style transfer focuses on colour and texture, which are only part of an artwork’s style. The geometric shapes and/or distortions in an artwork are also prime stylistic properties. In fact, some modern art schools treat the deconstruction of shapes and forms as a key characteristic of their style, such as analytic Cubism popularized by Picasso. Incorporating geometric information into neural style transfer is not only a compelling and interesting topic but also a crucial step toward making neural networks more artist-like and creative.

The goal of this project is to utilize and analyze a one-shot method for style transfer and face caricature generation that jointly moves the content image towards texture-based and geometry-based resemblance to the style image. I employed DST [1] for this purpose. DST is optimization-based and leverages a standard VGG feature extractor pre-trained on ImageNet, eliminating the need for large-scale paired datasets.

Problem Statement

Goal: Perform texture-based and geometry-aware neural style transformation on human face images.

Input: A content image with a human face and a style caricature artwork image

Output: A new image with the human face from the content image and the geometric and texture styling of the artwork image

To incorporate geometric warping into neural style transfer, it is important to define the alignment of spatial points from the input to the desired output. Given matching points from the content image to the style image, we can introduce a loss that encourages a warped output image in which those key points become spatially aligned. This way, we move the content image towards the colour and texture of the style image, as well as its geometric shape and form, at the same time.

We will use a stage-wise approach to achieve the goal of this project.

Stage 1: Obtain the facial landmarks of input content image and style image.

Stage 2: Conduct optimization with a style transfer loss that incorporates the content, the style, and the spatial alignment of the two sets of facial landmarks.

Stage 1: Obtain Facial Landmarks

In the first stage, we want to obtain an accurate set of facial landmarks from both the input content image and the input style image. Facial landmarks localize the face in the image and detect its key facial structures. A brute-force way is to manually select correspondences, which is time-consuming and prone to human error. In this project, I propose to use dlib to detect facial landmarks. To put this visually, as Figure 3 shows, the facial landmark detection algorithm estimates the locations of 68 (x, y) coordinate pairs that represent salient regions of the face, including the eyes, eyebrows, nose, mouth, and jawline.
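For reference, here is a minimal sketch of landmark extraction with dlib; the image filenames are placeholders, and the 68-point model file (shape_predictor_68_face_landmarks.dat) is downloadable from dlib.net:

```python
import cv2
import dlib

# Placeholder path: the dlib 68-point model, available from dlib.net
PREDICTOR_PATH = "shape_predictor_68_face_landmarks.dat"

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor(PREDICTOR_PATH)

def get_landmarks(image_path):
    """Return the 68 (x, y) landmark coordinates of the first detected face."""
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)  # upsample once to catch smaller faces
    if not faces:
        raise ValueError(f"No face detected in {image_path}")
    shape = predictor(gray, faces[0])
    return [(shape.part(i).x, shape.part(i).y) for i in range(68)]

content_pts = get_landmarks("content.jpg")
style_pts = get_landmarks("style.jpg")
```

Note that dlib is trained on photographs; for stylized artwork, the Face of Art [2] landmarks mentioned in the introduction can be the more reliable source.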

In the second stage, the style transfer network takes the two input images (content and style) and their corresponding sets of 68 facial landmarks. The goal is to jointly optimize the textural and geometric styling of the stylized output by designing a loss function that encourages content preservation and spatially aligned style transfer. The loss function is described in the following paragraphs.

Stage 2: Style Transfer Optimization

Style and Content Loss: The proposed method focuses on the geometry-aware aspect of style transfer, so I use the style and content losses from a standard style transfer method. DST adopts the style and content losses from STROTSS [3], which are inspired by the concept of self-similarity. STROTSS also allows user-specified point-to-point or region-to-region control over visual similarity between the style image and the output.

The content loss is based on the observation that robust pattern recognition can be built using local self-similarity descriptors. The style loss is derived from the Earth Mover’s Distance (EMD). This construction is convenient for this project because it carries the structural forms of the style image over to the output image. The style loss also includes a moment-matching term, which matches the mean and covariance of the feature vectors of the two images, and a colour-matching term, which encourages the output image and the style image to have a similar palette.
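To make these two terms concrete, here is a minimal PyTorch sketch of the core ideas: a self-similarity content term and a relaxed-EMD style term. This is an illustrative simplification (cosine distances on flat (N, D) feature matrices) that omits the moment-matching and palette terms; the real STROTSS implementation operates on multi-scale hypercolumn features.

```python
import torch

def content_loss(feat_out, feat_content):
    # Self-similarity: compare row-normalized pairwise cosine-distance
    # matrices of the output and content feature sets (each (N, D)).
    def self_sim(f):
        f = f / (f.norm(dim=1, keepdim=True) + 1e-8)
        d = 1.0 - f @ f.t()
        return d / (d.sum(dim=1, keepdim=True) + 1e-8)
    return (self_sim(feat_out) - self_sim(feat_content)).abs().mean()

def style_loss(feat_out, feat_style):
    # Relaxed Earth Mover's Distance with a cosine-distance cost matrix.
    fo = feat_out / (feat_out.norm(dim=1, keepdim=True) + 1e-8)
    fs = feat_style / (feat_style.norm(dim=1, keepdim=True) + 1e-8)
    cost = 1.0 - fo @ fs.t()
    return torch.max(cost.min(dim=1).values.mean(),
                     cost.min(dim=0).values.mean())
```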

Image Warping Loss: Even though geometry-aware style transfer hasn’t been systematically explored, geometric transformation has been studied intensively in both the 2D and 3D domains. For 2D images, automatic image warping is particularly relevant to this project. Cole et al. [4] proposed using spline interpolation over a sparse set of landmark points inside a neural network, making the warping operation differentiable and thus end-to-end trainable. The key idea is to construct a dense flow field from the correspondences using spline interpolation, and then apply the flow field to obtain the warped image.

Following Cole et al.’s work [4], the authors of WarpGAN [6] made use of this differentiable spline interpolation and turned it into an image warping module. The inverse mapping f(q) of a pixel q in the original image can then be computed via thin-plate spline interpolation as follows.

$$f(q) = \sum_{j=1}^{k} w_j \, \phi\!\left(\lVert q - p'_j \rVert\right) + v^{\top} q + b$$

Here $p$ and $p'$ are the source and target sets of correspondences (in our case, the content and style sets of facial landmarks), and $\phi(r) = r^2 \log r$ is the kernel function. The parameters $w$, $v$, and $b$ then have a closed-form solution minimizing $\sum_j \lVert f(p'_j) - p_j \rVert$. See DST [1] for more details.
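For intuition, the following NumPy sketch shows the closed-form TPS solve on a single set of correspondences. It is illustrative and non-differentiable; DST and WarpGAN use a batched, differentiable implementation.

```python
import numpy as np

def tps_params(src, dst):
    """Closed-form TPS solve: find w, v, b such that f maps each dst point
    (style landmark) back to its src point (content landmark).
    src, dst: (k, 2) arrays of corresponding points."""
    k = dst.shape[0]
    r = np.linalg.norm(dst[:, None, :] - dst[None, :, :], axis=-1)
    K = np.where(r > 0, r**2 * np.log(r + 1e-12), 0.0)  # phi(r) = r^2 log r
    P = np.hstack([np.ones((k, 1)), dst])               # affine part [1, x, y]
    A = np.zeros((k + 3, k + 3))
    A[:k, :k], A[:k, k:], A[k:, :k] = K, P, P.T
    B = np.zeros((k + 3, 2))
    B[:k] = src
    sol = np.linalg.solve(A, B)
    return sol[:k], sol[k:]  # w: (k, 2); the last 3 rows hold b and v
```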

Analysis and Evaluation

Finding 1: Results are better when the facial landmarks of two input images are roughly aligned.

From the experiments (shown in the More Results figure), I find that results are generally better when the facial landmarks of the two input images are roughly aligned: for example, a frontal content face with a frontal style face, or a side-portrait content face with a side-portrait style face. Using a frontal content image with a side-portrait style image, however, can produce distortions in the output.

Images from left to right: input content image (Marilyn Monroe in 1952, by New York Sunday News, Public Domain), input style image (Vincent van Gogh, Self-Portrait, Public Domain), proposed method output (generated by the Author)

Finding 2: This method works well only with mild to moderate shape deconstruction in the style; heavy shape deconstruction leads to distortion, or is simply ignored.

My hypothesis is that incorporating geometric information into style transfer affects the quality of the results when the style image deconstructs shape, and has little effect when it does not.

I studied this hypothesis with the following experiments. I organized the input style images into a spectrum of three categories of style (artwork) images: no shape deconstruction (such as photo-realistic painting), moderate shape deconstruction (post-impressionist painting), and heavy shape deconstruction (Cubism or heavily exaggerated caricatures). The outputs for the first two categories show considerably closer resemblance to the style than a standard style transfer method; with a heavily shape-deconstructed style image, however, the geometric styling may not transfer well to the stylized output.

Finding 3: Geometry-aware style transfer gives a better representation of the artist style compared to standard style transfer methods.

Overall, the proposed method works well when the input content and style images are roughly aligned and the style image has mild to moderate shape deconstruction.

Discussion, Limitation and Future work

Computational Creativity

One of my motivations for this project was to understand whether incorporating geometric information helps neural networks become more artist-like and creative. A common view is that creativity is innately linked with unpredictability, or an element of surprise [5]. Intuitively, in this project, the element of surprise could be attributed to the shape deconstruction in the output image and the added geometry-aware capabilities of the system. However, this is a subjective judgement based on human perception, and it is hard to quantify.

Are neural networks capable of human-level creativity? This is perhaps too large a topic to discuss within this project. Nonetheless, it is an interesting question I would like to continue exploring.

Limitation

As mentioned in the analysis section, the proposed method works well when the content and style image pair is carefully chosen: the face postures are aligned, the shape deconstruction in the style image is mild to moderate, etc. Misaligned postures in the content and style images can lead to distortion. Moreover, a heavily shape-deconstructed style image may simply be ignored in the stylized output, missing the point of geometry-aware style transfer.

On the other hand, inference takes around 1.5 minutes for the two stages of the method (at an image size of 256), which is far from real-time. One factor is that I followed the implementation settings of DST and STROTSS [3] and optimized the output image at multiple scales in a Laplacian pyramid.
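To illustrate what that multi-scale setup looks like, here is a small PyTorch sketch of a Laplacian pyramid decomposition; the level count and pooling choices are mine, not the exact STROTSS settings.

```python
import torch
import torch.nn.functional as F

def laplacian_pyramid(img, levels=4):
    """Decompose an image tensor (1, C, H, W) into a Laplacian pyramid;
    the optimizer can then update all scales of the output jointly."""
    bands = []
    cur = img
    for _ in range(levels - 1):
        down = F.avg_pool2d(cur, 2)
        up = F.interpolate(down, size=cur.shape[-2:],
                           mode="bilinear", align_corners=False)
        bands.append(cur - up)  # high-frequency residual at this scale
        cur = down
    bands.append(cur)           # low-frequency base
    return bands
```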

Lastly, this work is limited in the scope of its evaluation. Initially, I had planned to run a survey asking participants to rate the results across different categories. I showed the results to 10 fellow graduate students who are familiar with the problem of style transfer and have prior knowledge of neural style transfer from the Gatys paper, and used their comments and input to finalize the analysis section, the findings, and the failure cases. However, while analyzing the quantitative scores from the survey, I realized they were of limited value: the hand-picked sample images could reflect my own bias, since I had already formed key findings and hypotheses around the proposed method.

Moreover, it is difficult to evaluate this work without a proper comparison against other geometric style transfer methods, for example WarpGAN [6]. If time allowed, I could reproduce WarpGAN and test the sample images with that method. Within the scope of this project, I think the results already show the potential of geometric information in style transfer, so quantitative evaluation is perhaps less critical.

Future Work

To address the limitations above, some extensions could improve the stylized output by alleviating distortion, and improve the inference speed by making the optimization more efficient.

1. Improve inference speed with better feature extractor

After streamlining the pipeline of this one-shot method, the inference time across the two stages is 1.5 minutes per content/style pair. Future work could employ faster and lighter feature extractors for this purpose.
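As one illustrative option (my choice, not something DST prescribes), a lightweight backbone from torchvision could stand in for VGG as the feature extractor:

```python
import torch
import torchvision.models as models

# MobileNetV3-Small as an example drop-in feature extractor.
backbone = models.mobilenet_v3_small(weights="DEFAULT").features.eval()

with torch.no_grad():
    feats = backbone(torch.randn(1, 3, 256, 256))  # e.g. (1, 576, 8, 8)
```

Whether the loss terms built on such features transfer style as faithfully as VGG features is an open question and would need to be validated.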

2. Finding the “mean” face as a guide for image warping loss

This idea is inspired by face morphing (here’s an article going through it step by step). To morph between two images, we can define one triangulation to use for every step of the morph. To get the best triangulation throughout the morph, we can compute the Delaunay triangulation on the mid-way face: the mean of the two correspondence point sets. This prevents potential triangle deformations, since the triangles will be evenly shaped in the middle of the warp.
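A minimal sketch of that mid-way-face triangulation, assuming the two 68-point landmark sets from Stage 1:

```python
import numpy as np
from scipy.spatial import Delaunay

def mean_face_triangulation(content_pts, style_pts):
    """Triangulate the 'mean' face: the midpoint of two corresponding
    landmark sets, each of shape (68, 2)."""
    mean_pts = (np.asarray(content_pts) + np.asarray(style_pts)) / 2.0
    tri = Delaunay(mean_pts)
    # The same triangle indices apply to the content, style, and mean points.
    return mean_pts, tri.simplices
```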

I think the idea of a “mean” face could be used here as well. We could construct a “mean” face from the content-image face and the style-image face as a guide, and penalize how far the intermediate frame deviates from the “mean” face in facial spatial alignment. This could potentially be realized with an additional loss in the second stage; however, it may be tricky to introduce the “mean”-face technique into the end-to-end trainable optimization. A sketch of this proposed (hypothetical, untested) loss follows.
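```python
import torch

def mean_face_loss(warped_pts, content_pts, style_pts):
    """Hypothetical loss: penalize the warped landmarks' deviation from the
    'mean' face (midpoint of the content and style landmark sets).
    All inputs: (68, 2) tensors; warped_pts must require gradients."""
    mean_pts = (content_pts + style_pts) / 2.0
    return (warped_pts - mean_pts).norm(dim=1).mean()
```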

Reference

[1] Sunnie S. Y. Kim, Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Deformable style transfer. arXiv preprint arXiv:2003.11038, 2020.

[2] Jordan Yaniv, Yael Newman, and Ariel Shamir. The face of art: landmark detection and geometric style in portraits. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019.

[3] Nicholas Kolkin, Jason Salavon, and Gregory Shakhnarovich. Style transfer by relaxed optimal transport and self-similarity. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[4] Forrester Cole, David Belanger, Dilip Krishnan, Aaron Sarna, Inbar Mosseri, and William T Freeman. Synthesizing normalized faces from facial identity features. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3703–3712, 2017.

[5] Daniel Cohen-Or and Hao Zhang. From inspired modeling to creative modeling. The Visual Computer, 32(1):7–14, 2016.

[6] Yichun Shi, Debayan Deb, and Anil K Jain. Warpgan: Automatic caricature generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10762–10771, 2019.

--

I’m a Software Engineer, ☕️ and 🐱 lover. Previously, I was a graduate researcher in Computer Vision. Find me @ https://azmarie.github.io/