
Improving UI Layout Understanding with Hierarchical Positional Encodings

How can we modify transformers for UI-centric tasks?

Thoughts and Theory

Written by Jason Lee, Connor Johnson, and Varun Nair

User interfaces contain rich hierarchies of elements that can be encoded with tree-based representations. (Image by Authors)

Layout Understanding is a sub-field of AI that enables machines to better process the semantics and information within layouts such as user interfaces (UIs), text documents, forms, presentation slides, graphic design compositions, etc. Companies already invest extensive resources into their web and mobile applications' UI and user experience (UX), with Fast Company reporting that every $1 spent on UX can return 2–100x the investment. As such, AI- and deep learning-powered tools have much potential to aid and speed up the iterative design process.

In this blog post, we will share our findings and lessons harnessing deep-learning based layout understanding models for UI-centric tasks. This research was conducted in partnership with the machine intelligence team at Uizard.

Specifically, we focus on investigating how the positional encodings of transformer models (Vaswani et al., 2017) can be altered to encode better representations of layouts. We emphasize that:

  • The hierarchical structure of user interfaces is a rich source of information that can be injected into transformer models using novel positional embeddings (Shiv & Quirk, 2020).
  • Handling UI-domain specific characteristics well will be key to successfully training models.

This is a relatively unexplored space and there is still much work to be done – we hope to use this short piece to guide future research into layout understanding and unlock more of deep learning’s potential for the Design world.

We’ll first walk through some tasks relevant to UI layout understanding, then share some background on papers that we found to be useful, and finally discuss the points listed above in further detail.


Relevant Tasks

When seeking to build a good representation of UI layouts within transformer models, it's important to consider the datasets and the many types of tasks available to help build those representations. We use the RICO dataset, a collection of 9.3k Android apps with the visual, textual, structural, and interactive design properties of more than 66k unique UI screens and 3M+ UI elements. We also present four tasks, each studied independently so far, that are useful for building representations of layouts:

Examples of layout understanding pre-training tasks useful for later downstream tasks (Top Left and Top Right by Javier Fuentes Alonso, Bottom Left from Gupta et al., 2020, Bottom Right from Li et al., 2020).

Token Classification

In token classification, we are presented with the class names for all UI elements except for one target element and asked to predict that target element’s class. We are also given the bounding box locations of all elements, including the target element.
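
As a toy illustration, a single training example might look like the following (the class names and coordinates are invented for this example, not taken from RICO):

```python
# A toy token-classification example; class names and coordinates are invented
# for illustration (RICO provides real component labels and bounds).
example = {
    "classes": ["Toolbar", "Image", "[MASK]", "Button"],  # target element is masked
    "bboxes": [            # (x0, y0, x1, y1) for every element, including the target
        (0, 0, 1440, 168),
        (0, 168, 1440, 960),
        (60, 1000, 1380, 1160),
        (60, 1200, 1380, 1320),
    ],
    "label": "Text",       # the class the model should predict for the masked element
}
```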

Semantic Grouping

In this task, we are presented with class names and bounding box locations for UI elements and asked to predict a grouping for sets of elements in the input sequence. For example, given an image, paragraph, and icon, we may want to classify this group as a card element.

Layout Generation

Layout generation follows the work we studied in LayoutTransformer (Gupta et al., 2020): the model learns to generate new layouts similar to the examples in the training dataset. We could, for instance, use the RICO dataset to train a layout generation model that produces realistic UI layouts.

Hierarchical Tree Generation

This task follows work studied in Li et al., 2020, in which a transformer-based tree decoder model is used to take in a UI element as input and output a hierarchy of elements. This task can be thought of as a more complex version of semantic grouping, in which groups of other groupings can be found and assigned to a tree.

While we did not attempt all of these tasks in our work, we hope this gives a general idea of the types of tasks that are useful for layout understanding. By adopting a multi-task learning framework similar to the Text-To-Text Transfer Transformer (T5) model, these tasks could even be combined and solved by a single model. Our findings below will discuss the token classification (i.e. sequence labeling) task.


Related Work

In addition to the tasks above, here are a few papers that informed our work in layout understanding. We summarize the most influential of them and link to others below.

LayoutLM

The architecture of the LayoutLM model is heavily inspired by BERT, while also incorporating image embeddings from a Faster R-CNN model. LayoutLM input embeddings are generated as a combination of text and position embeddings, which are then combined with the image embeddings. Masked Visual-Language Modeling (inspired by the original MLM objective) and Multi-Label Document Classification (which measures how well the model clusters similar documents) are used as pre-training tasks for LayoutLM. The model is flexible enough for general layout understanding, with form understanding, receipt understanding, and document image classification included as downstream tasks in the paper. In our case, the main downstream task for fine-tuning was masked (element) language modeling, in which we mask the tokens representing each element of a UI.

Architecture of LayoutLM Model – (Image from Xu et al., 2020)
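
As a rough sketch of the text-plus-2-D-position combination shown above (heavily simplified; the real model also adds 1-D position, segment, and image embeddings and uses its own layer sizes):

```python
import torch
import torch.nn as nn

class SimpleLayoutEmbedding(nn.Module):
    """Simplified LayoutLM-style input embedding: a token embedding summed with
    embeddings of the element's bounding-box coordinates (normalized to 0-1000)."""

    def __init__(self, vocab_size, hidden_size=768, max_coord=1024):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.x_emb = nn.Embedding(max_coord, hidden_size)  # shared for x0 and x1
        self.y_emb = nn.Embedding(max_coord, hidden_size)  # shared for y0 and y1

    def forward(self, input_ids, bbox):
        # input_ids: (batch, seq_len); bbox: (batch, seq_len, 4) as (x0, y0, x1, y1)
        x0, y0, x1, y1 = bbox.unbind(dim=-1)
        return (self.token_emb(input_ids)
                + self.x_emb(x0) + self.y_emb(y0)
                + self.x_emb(x1) + self.y_emb(y1))
```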

The original LayoutLM model was pre-trained on the IIT-CDIP Test Collection 1.0, which contains more than 6 million documents and more than 11 million scanned document images. As introduced in the Relevant Tasks section, we fine-tuned the model on the RICO dataset for our purposes.

We saw room to improve on LayoutLM because it treats off-the-shelf OCR output as ground truth, which is not very practical for practitioners. Having a greater degree of control over the OCR could provide better, task-dependent downstream results.

CanvasEmb

Architecture of CanvasEmb Model – (Image from Xie et al., 2020).

CanvasEmb, like LayoutLM, is a large-scale, self-supervised, pre-trained model for learning contextual layout information; it breaks layout elements down into type, geometry, color, and content-related properties. The model can be applied to downstream tasks such as role labelling and image captioning (with SOTA performance) and can also be leveraged for layout auto-completion and layout retrieval.

In the model, each visual element x_i (i = 0, …, N) consists of properties 0 through M, which are projected and concatenated into an element embedding. For categorical properties (e.g. type, color), an embedding matrix is used; for numerical properties, the sinusoidal positional encoding is adopted.
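
A minimal sketch of this property-embedding step, based on our reading of the paper (module names and dimensions here are illustrative, not the authors' exact implementation):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_encoding(values, dim):
    """Sinusoidal encoding of a batch of scalar numerical properties (e.g. x, y)."""
    # values: (batch,) float tensor -> (batch, dim)
    freqs = torch.exp(torch.arange(0, dim, 2).float() * (-math.log(10000.0) / dim))
    angles = values.unsqueeze(-1) * freqs
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)

class PropertyEmbedding(nn.Module):
    """Embed categorical properties via lookup tables and numerical properties via
    sinusoids, then concatenate and project them into a single element embedding."""

    def __init__(self, num_types, num_colors, prop_dim=64, hidden_size=512):
        super().__init__()
        self.prop_dim = prop_dim
        self.type_emb = nn.Embedding(num_types, prop_dim)
        self.color_emb = nn.Embedding(num_colors, prop_dim)
        self.proj = nn.Linear(4 * prop_dim, hidden_size)  # type, color, x, y

    def forward(self, type_id, color_id, x, y):
        feats = torch.cat([
            self.type_emb(type_id),
            self.color_emb(color_id),
            sinusoidal_encoding(x, self.prop_dim),
            sinusoidal_encoding(y, self.prop_dim),
        ], dim=-1)
        return self.proj(feats)
```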

The model is pre-trained with masked language modeling like BERT or LayoutLM, and fine-tuning tasks are added via task-specific layers such as:

  • Element-Level – predicts specific features/properties of a single element
  • Element-to-Element – predicts relations between a pair of elements

The model is pre-trained on presentation slides and then fine-tuned on labeled data, meaning it incorporates a different type of semantic information than the scanned documents other models are pre-trained on.

Novel Positional Encodings To Enable Tree-Based Transformers

This paper (Shiv & Quirk, 2020) proposes a technique for generating novel positional encodings that capture hierarchical information. In datasets like RICO, where user interfaces carry rich hierarchical structure, the sequential positional embeddings of vanilla transformers are not sufficient. The paper develops the technique for regular trees: a stack-like data structure is built up, and a vector is pushed onto the stack for each step down the tree, with a single one in an n-long vector (where n is the number of children per node) indicating which branch the current node takes from its parent.

Since we are not working with regular trees (where every node has the same number of children), we had to modify the paper's technique to make it viable. A diagram of the novel positional encodings is shown below, and our changes are discussed in further detail in the findings.
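
A minimal sketch of the idea, adapted to non-regular trees by zero-padding every branch choice to the maximum degree n and every path to the maximum depth k (our simplification, not the paper's exact formulation):

```python
import numpy as np

def tree_position_encoding(path, max_degree, max_depth):
    """Encode a node by its root-to-node path of branch choices.

    path: e.g. [2, 0] means "take the 3rd child of the root, then that node's
    1st child". Each step becomes a one-hot vector of length max_degree, and
    unused depth levels are zero-padded, giving a fixed-size vector of length
    max_depth * max_degree regardless of where the node sits in the tree.
    """
    encoding = np.zeros((max_depth, max_degree), dtype=np.float32)
    for level, branch in enumerate(path):
        encoding[level, branch] = 1.0
    return encoding.reshape(-1)

# Example: a node two levels down, in a dataset whose widest node has
# 5 children and whose deepest path has 4 levels -> a 20-dimensional vector.
vec = tree_position_encoding([2, 0], max_degree=5, max_depth=4)
```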


Findings

#1 – The hierarchical structure of user interfaces is a rich source of information that can be injected into transformer models using novel positional embeddings.

User interfaces often contain a rich hierarchy within the sets of elements that make them up. For example, a list may contain several list items, each list item may contain a card object, and this card object may contain an image, a paragraph, and an icon. We don't need ground-truth hierarchy annotations to obtain this information either: if bounding box information is present in addition to element class names, we can infer which elements are children of which by looking at how the bounding boxes overlap and contain one another.
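
For example, a simple containment check can recover parent-child candidates from bounding boxes alone (a heuristic sketch; a real pipeline would also need to handle partial overlaps and pick the smallest enclosing element as the parent):

```python
def contains(parent_box, child_box):
    """True if child_box lies entirely inside parent_box.
    Boxes are (x0, y0, x1, y1) tuples."""
    px0, py0, px1, py1 = parent_box
    cx0, cy0, cx1, cy1 = child_box
    return px0 <= cx0 and py0 <= cy0 and cx1 <= px1 and cy1 <= py1

def candidate_children(boxes, i):
    """Indices of every element whose box sits inside the box of element i."""
    return [j for j, box in enumerate(boxes) if j != i and contains(boxes[i], box)]
```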

The first task that we attempted to solve was the token classification (i.e. sequence labeling) task from above. We modified LayoutLM, the model we initially studied, to accept both bounding-box and class-label information. LayoutLM, like most traditional transformer models, takes in fixed-size inputs encoded by their position in the sequence. However, since the hierarchy of a user interface is itself a rich source of information, we can exploit it by merging the LayoutLM approach with the tree-based positional embeddings described above.

The novel positional encoding for tree-based inputs to transformers (Image from [Shiv & Quirk, 2020](https://papers.nips.cc/paper/2019/file/6e0917469214d8fbd8c517dcdc6b8dcf-Paper.pdf)).

Intuitively, by injecting additional information about how the input tokens (which are UI elements) relate to one another, we should be able to perform better on downstream tasks. This information could otherwise have been learned by the model after multiple training steps or after seeing many examples, but by explicitly providing it in the input we allow the model’s parameters to encode other (and hopefully more useful) representations.

The relationships between elements are provided by the RICO dataset in JSON files, which tell us the ancestors and children of each UI element. We found the maximum degree of the trees (n) and their maximum depth (k) and formed vectors of size n × k for each element in each example, zero-padding the positions unused by an individual node. Keeping this size consistent across examples lets the model make sense of these embeddings.

While we believe there is good reason for performance to improve with these positional embeddings (as explained above), in practice they were quite difficult to implement. Some of our challenges included:

  • Choosing appropriate values for the dimensions (values of n and k) of the tree-based positional encoding.
  • Projecting the tree-based positional encoding to the dimension of the token representation (e.g. 512) for concatenation.

When implementing these hierarchical positional embeddings, it is important to select appropriate dimensions: simply using the maximum depth and degree across all examples makes the positional encoding vectors extremely sparse. We discuss these challenges further in the section below.
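
To make the second point concrete, here is a hypothetical sketch of such a projection, mapping the sparse n × k tree encoding into the token dimension before it is combined with the other embeddings:

```python
import torch
import torch.nn as nn

# Dimensions after the filtering discussed in the next section:
# max depth 7, at most 80 branches per node -> 560-dimensional tree encodings.
max_degree, max_depth, hidden_size = 80, 7, 512

# Learned projection from the sparse tree encoding into the token dimension,
# so it can be added to (or concatenated with) the other input embeddings.
tree_projection = nn.Linear(max_degree * max_depth, hidden_size)

tree_encodings = torch.zeros(1, 128, max_degree * max_depth)  # (batch, seq_len, 560)
tree_embeddings = tree_projection(tree_encodings)             # (batch, seq_len, 512)
```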

#2 – Handling UI-domain-specific characteristics well will be key to successfully training models.

For our exploration of layout understanding models, we used the aforementioned RICO dataset, consisting of the visual, textual, structural, and interactive design properties of more than 66k unique UI screens. An initial realization when exploring a dataset like this is that UI screens don't always look the way we expect: for example, they can have very high information density. The example below shows the UI with the maximum number of nodes on any single level across the RICO dataset: 421.

The example below shows the UI with maximum number of nodes on any given level across the RICO dataset – 421 in total. (Image by Authors).

One consideration when adapting the tree-based positional embeddings to non-regular trees (where each node can have a different number of children) is memory usage. The padding added to accommodate outlier examples, if excessive, places a large strain on compute when fine-tuning the transformer. Recall that the dimensionality of our tree-based position embedding vectors is the product of the overall maximum depth and the overall maximum number of branches off any single node (including the root) in the dataset.

In the box plot below, we see that the vast majority of UI screens have a maximum of 25 branches or fewer in their element hierarchies. If we retained the outliers, the dimensionality of the position embedding vectors would be much larger than it needs to be. With an 80/20 train/test split and no filtering, a text file containing the training examples' position embedding vectors is 3.09 gigabytes, and after passing through a projection layer the embeddings balloon to a projected 631.40 gigabytes. By filtering for examples with at most 80 branches, we reduce the dimensionality of each vector from 2947 (421 × 7) to 560 (80 × 7), making the vectors less sparse (less padding) and cutting the memory needed to store the embeddings by more than 80%. Even so, projecting the 560-dimensional vectors still yields 79.11 gigabytes of embeddings, far beyond the 16 gigabytes of memory available on (our) typical GPUs.

Box plot of the maximum number of branches off of a node in an example’s tree hierarchy from the RICO dataset (Image by Authors).

Given this information, one way to improve fine-tuning on the RICO dataset is to cap the maximum number of nodes per level even further (from 421 to, say, 25) so that the encoding better reflects the distribution of the dataset itself. Another option is to remove what are presumed to be 'decorative elements' from the dataset: analysis shows that a little over 39% of all leaves (nodes without children) sit on the first level of the trees in the dataset.

We can also filter elements from the RICO dataset by restricting the depth of the hierarchy. While the maximum depth is 7, over 99% of all leaves are concentrated in the first three levels of each example’s tree, meaning very few branches reach the 7th level or even the 4th (with 0.6% of leaves residing on that level).
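
A sketch of this kind of filtering over RICO-style view hierarchies (we assume each node is a dict with a "children" list, as in the RICO JSON; adjust the field names and thresholds to the exact schema and cut-offs you use):

```python
def max_degree_and_depth(node, depth=1):
    """Walk a RICO-style view hierarchy (dicts with a "children" list) and
    return the widest fan-out and the deepest level in the tree."""
    children = node.get("children") or []
    widest, deepest = len(children), depth
    for child in children:
        child_widest, child_deepest = max_degree_and_depth(child, depth + 1)
        widest, deepest = max(widest, child_widest), max(deepest, child_deepest)
    return widest, deepest

def keep_example(root, max_branches=80, max_levels=3):
    """Drop outlier screens whose hierarchy is too wide or too deep."""
    widest, deepest = max_degree_and_depth(root)
    return widest <= max_branches and deepest <= max_levels
```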

In summary, filtering which examples are used and which nodes' information is included from each example can discard valuable semantic information that helps identify complex UI element groupings. However, retaining everything leads to significant memory issues during training, and we believe filtering is a necessary first step for fine-tuning a model like LayoutLM on the RICO dataset for the relevant downstream tasks.

Future Work

Layout understanding is a sub-field that is ripe for much more progress in the coming years, and we hope this work illustrates some of that potential. The recent addition of LayoutLM to the HuggingFace transformers library should also allow the research community to iterate faster; a minimal fine-tuning sketch follows the summary below. To summarize:

  • The hierarchical structure of user interfaces is a rich source of information that can be injected into transformer models using novel positional embeddings.
  • Handling UI-domain specific characteristics well will be key to successfully training models.
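
Since LayoutLM now ships with HuggingFace transformers, a token-classification fine-tuning setup can start from the pre-trained checkpoint in a few lines (a sketch only; the label count and the single shared bounding box below are placeholders):

```python
import torch
from transformers import LayoutLMTokenizer, LayoutLMForTokenClassification

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")
model = LayoutLMForTokenClassification.from_pretrained(
    "microsoft/layoutlm-base-uncased",
    num_labels=25,  # e.g. one label per RICO component class; adjust as needed
)

encoding = tokenizer("toolbar image text button", return_tensors="pt")
# One (x0, y0, x1, y1) box per token, scaled to [0, 1000] as LayoutLM expects;
# here we reuse a single dummy box purely for illustration.
bbox = torch.tensor([[[0, 0, 1000, 120]] * encoding.input_ids.shape[1]])

outputs = model(input_ids=encoding.input_ids,
                attention_mask=encoding.attention_mask,
                bbox=bbox)
logits = outputs.logits  # (batch, seq_len, num_labels)
```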

As mentioned in the related tasks section, layout understanding tasks can be combined in multi-task training for even better representation learning (akin to the T5 model, Raffel et al., 2020). Other future work includes experimenting with other forms of positional embeddings and building on the next-generation version of LayoutLM (LayoutLMv2, Xu et al., 2020).

Acknowledgements

Thank you to Javier Fuentes Alonso, Arturo Arranz, and Tony Beltramelli at Uizard for supporting this work for the past few months. To find out more about how Uizard is applying Machine Learning to design, check out Uizard’s research page here.

The authors of the article are also members of Duke Applied Machine Learning (DAML), a student group at Duke University working on applied ML projects and research – learn more about DAML here.

