GAN Loci

Can neural nets reveal the tacit properties that make a “space” a “place”?

Kyle Steinfeld
Towards Data Science


This project applies generative adversarial networks (or GANs) to produce synthetic images that capture the predominant visual properties of urban places. Imaging cities in this way represents the first computational attempt to document the Genius Loci of a city: those forms, spaces, and qualities of light that exemplify a particular location and set it apart from similar places.

Nine synthetic images of urban places.

Presented here are methods for the collection of urban image data, for the necessary processing and formatting of this data, and for the training of two established computational statistical models (StyleGAN and Pix2Pix) that identify visual patterns distinct to a given site and reproduce these patterns to generate new images. These methods have been applied to image nine distinct urban contexts across six cities in the US and Europe; the resulting images will be presented in our next post.

When a town pleases us because of its distinct character, it is usually because a majority of its buildings are related to the earth and the sky in the same way; they seem to express a common form of life, a common way of being on the earth. Thus they constitute a genius loci which allows for human identification.
- Christian Norberg-Schulz, Genius Loci: Towards a Phenomenology of Architecture, p 63

Motivation

While design intent is, for many architects and urban designers, a necessarily tacit concept, most CAD tools are made to support only the explicit articulation of their author’s intentions. In contrast, this project represents a small step toward a new approach to CAD tools, based on machine learning techniques, that better supports tacit or difficult-to-articulate design intent. To demonstrate the needs of design intent such as this, we consider here a quality of the built environment that is especially difficult to articulate explicitly: the phenomenon of “place”.

In his seminal work defining a phenomenological approach to architecture, “Genius Loci: Towards a Phenomenology of Architecture”, Christian Norberg-Schulz argues that the design of cities and buildings must center on the construction of “place”, which he defines as a “space with a unique character” (Norberg-Schulz 94). But how is this “unique character” defined, and how can it be captured using digital tools in a manner that affords the advantages of a computational medium? In anticipation of a future tool that better supports the development of tacit design intent, we seek to leverage methods in machine learning to attempt to capture “place” as described by Norberg-Schulz.

It is the ambition of this project to use a GAN as a tool for tacit design, and to apply the capacity of this technology for capturing the implicit yet salient visual properties of a set of images. We speculate that this capacity will prove useful in uncovering and encoding a phenomenological understanding of place. Presented in overview here are the steps required to train a generative adversarial network (GAN) to produce images that capture the predominant visual properties of an urban context. The work proceeds in three stages: data preparation, model training, and latent space exploration.

Data Preparation

In the data preparation stage, we first collect, clean, and curate a large number of images drawn from a selection of urban contexts that represent significantly different sorts of places, and compile these into distinct sets. Each set of images is then processed to serve as training data for one of two ML models.

To this end, a Python library has been developed that supports the collection, curation, and processing of panoramic images using Google’s StreetView API. This section details this process, which includes: the identification of a desired geographic location, the collection of a large and diverse set of images from this location, the curation of this set to define a relevant sub-set of valid images, and finally, the processing of these images such that they are appropriately formatted for training.

Coordinate locations of 560 requests for panoramic images in Cambridge, MA (left) and the actual geo-locations from which 470 panoramic images were taken (right)

The collection of data begins with the identification of a geographic location of interest. Since the locations at which panoramas were captured are unlikely to coincide with the requested geo-locations of interest, a number of failure scenarios must be accommodated. For example: not all locations of interest are related to a panorama, and not all panoramas depict the external urban environment (there are panoramas of interiors as well). For each of the nine urban contexts studied, approximately 500 panoramas are sampled.
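
As a rough sketch of this availability check (not the project’s actual library), the following queries the public Street View metadata endpoint for each candidate coordinate and keeps only those points that resolve to an actual panorama. The function name, the candidate grid, and the API key are placeholders.

```python
import requests

METADATA_URL = "https://maps.googleapis.com/maps/api/streetview/metadata"

def find_panoramas(points, api_key):
    """Return the unique panoramas that the candidate coordinates resolve to,
    discarding points with no nearby Street View coverage.
    `points` is an iterable of (lat, lng) tuples; `api_key` is a Street View
    Static API key (placeholder)."""
    panos = {}
    for lat, lng in points:
        meta = requests.get(METADATA_URL, params={
            "location": f"{lat},{lng}",
            "key": api_key,
        }).json()
        if meta.get("status") != "OK":   # e.g. "ZERO_RESULTS": no panorama nearby
            continue
        # several request points may snap to the same panorama; keep one record each
        panos[meta["pano_id"]] = (meta["location"]["lat"], meta["location"]["lng"])
    return panos
```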

Beyond the basic validation mentioned above, the structure of the data returned requires a number of auxiliary processing steps, even for successful calls to the API. Most notable among these is the collection of depth information related to each StreetView panorama. Although it is not apparent from the Google StreetView interface, many of the two-dimensional panoramas provided by this service also hold rudimentary three-dimensional data describing objects in the urban scene (typically building forms).

A greyscale depthmap image (left) and sceneographic image (right) of a scene in downtown Berkeley, CA.

In summary, the collection of data begins with the defining of a single geographic point of interest, and results in the compilation of several hundred samples that may be further processed for training.

Each sample includes:

  • An equirectangular panoramic image.
  • A set of three-dimensional planes, encoded as a base-64 string, that describe occluding objects in the scene (decoded as sketched following this list).
  • A set of related metadata.
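
The depth data is not documented by Google; the sketch below follows the decoding convention observed in community reverse-engineering efforts (URL-safe base-64 followed by zlib decompression) and leaves the parsing of the resulting header and plane records aside. Treat these encoding details as assumptions rather than a documented API.

```python
import base64
import zlib

def decode_depth_payload(encoded: str) -> bytes:
    """Decode the base-64 depth string attached to a panorama into raw bytes.
    Assumes the URL-safe base-64 + zlib scheme reported by community decoders;
    the returned bytes begin with a small header followed by plane records."""
    encoded += "=" * (-len(encoded) % 4)          # restore stripped padding
    compressed = base64.urlsafe_b64decode(encoded)
    return zlib.decompress(compressed)
```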

With a set of images related to a given urban place collected and curated, the task of the data processing step is to prepare this set of images for their role in training a GAN model. In summary, this training data is best described as pairs of related square-cropped raster images: one RGB image that represents a crop of a larger panorama image scene, and another greyscale image that represents the “depthmap” of this scene, with the value of each pixel representing the minimum distance from the camera to any occluding objects.

Pairs of corresponding depthmap (top) and sceneographic (bottom) images.

The production of the sceneographic images is largely straightforward, with just one process worthy of mention: the equirectangular projection of the panoramic images must be transformed to arrive at the cubic environment map that better approximates what we expect of the synthetic images we aim to produce. This is accomplished following previous work describing the relevant conversion (Bourke, 2006). It is noteworthy that the same panoramic image may be arbitrarily rotated about the z-axis (a transformation equivalent to a horizontal rotation of the cube) to produce slight variations of the same scene that may still be seamlessly tiled together. Expanding the breadth of a training set through slight transformations in this manner is a common practice in ML known as “data augmentation”.

Data augmentation by rotation: 0 degrees (left), 30 degrees (middle), 60 degrees (right)
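
A minimal numpy sketch of this projection and of the rotational augmentation follows. It produces a single vertical cube face with nearest-neighbour sampling; a full pipeline would also generate the remaining faces and interpolate.

```python
import numpy as np

def cube_face(equi, face_size=1024, yaw_deg=0.0):
    """Sample one vertical face of a cubemap from an equirectangular panorama.
    `equi` is an H x W x 3 array; `yaw_deg` rotates the panorama about the
    vertical axis, which is how the augmented variants are produced."""
    h, w = equi.shape[:2]
    # pixel grid on the face, in [-1, 1]
    u, v = np.meshgrid(np.linspace(-1, 1, face_size),
                       np.linspace(-1, 1, face_size))
    # direction vectors for a face looking down +x, with z up
    x = np.ones_like(u)
    y = u          # left-right on the face
    z = -v         # up-down on the face
    lon = np.arctan2(y, x) + np.radians(yaw_deg)   # longitude, with yaw applied
    lat = np.arctan2(z, np.sqrt(x**2 + y**2))      # latitude
    # map spherical coordinates to equirectangular pixel coordinates
    px = ((lon / (2 * np.pi) + 0.5) % 1.0) * (w - 1)
    py = (0.5 - lat / np.pi) * (h - 1)
    return equi[py.astype(int), px.astype(int)]
```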

In summary, the data preparation step for any given urban place begins with a curated collection of panoramic equirectangular images and related information (including a description of occluding planes), and results in two sets of cubemap projection images: one set of RGB images that describe an urban scene, and one set of greyscale images that describe the effective depth of objects in that scene.

Model Training

In the model training stage, we use the collected image sets to train GAN models capable of generating new images related to each selected urban context. To this end, two distinct GAN architectures are employed: StyleGAN and Pix2Pix, the particular implementations of which are discussed below. Once trained, each of these models proves valuable in its own way, as each offers a distinct interface for the production of synthetic urban images.

Pix2Pix

Pix2Pix (Isola et al., 2016) is an architecture for a particular kind of GAN: a conditional adversarial network that learns a mapping from a given input image to a desired output image. From the perspective of a user of a trained Pix2Pix model, we offer an input image that conforms to some mapping convention (such as a color-coded diagram of a facade, or an edge drawing of a cat) and receive in return an image that results from the transformation of this input into some desired output (such as a photographic representation of a facade, or of a cat).

The particulars that guide the training of a Pix2Pix model strongly depend upon the specifics of the implementation employed. This project relies upon a “high-definition” version of this architecture implemented in PyTorch (Wang, 2017). Some modifications of this implementation were required: in particular, to correct problems with unwanted artifacts forming in cases of low-contrast source images (as seen in the figure below). Following suggestions offered by the community of Pix2Pix users, zero padding was replaced with reflection padding, and the learning rate was temporarily adjusted to 0.0008.

Synthetic image artifacts encountered while training.
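
The sketch below illustrates the nature of these two adjustments in plain PyTorch; it is not the actual patch to the pix2pixHD codebase, and the block structure shown is only representative.

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
    """A representative generator block with explicit reflection padding
    in place of the default zero padding of the convolution."""
    return nn.Sequential(
        nn.ReflectionPad2d(1),                    # reflect edge pixels rather than pad with zeros
        nn.Conv2d(in_ch, out_ch, kernel_size=3),  # no implicit zero padding
        nn.InstanceNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

def set_learning_rate(optimizer, lr: float = 8e-4) -> None:
    """Temporarily override the learning rate on an existing optimizer,
    as was done to suppress the artifacts shown above."""
    for group in optimizer.param_groups:
        group["lr"] = lr
```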

Once trained, each model operates as implied by the nature of a conditional GAN and by the structure of the training data: given a greyscale depthmap image that describes a desired three-dimensional urban scene, a synthetic RGB sceneographic image is returned. Since these models are trained on subsets of data segregated by site, each model produces synthetic images specific to just one urban place: the Rotterdam model produces images that “feel” like Rotterdam, while the San Francisco model generates ones that appear more like San Francisco. This feature allows for direct comparisons to be drawn.

Results derived from the Pix2Pix model: synthetic images of San Francisco, CA (left) and Rotterdam, NL (right).

StyleGAN

In contrast with a traditional GAN architecture, StyleGAN (Karras et al., 2018) draws from “style transfer” techniques to offer an alternative design for the generator portion of the GAN, one that separates coarse image features (such as head pose, when trained on human faces) from fine or textural features (such as hair and freckles). Here, in comparison to the Pix2Pix model, the user experience is quite different: rather than mapping an input image to a desired output, users select a pair of images from within the latent space of a trained model and hybridize them. Rather than being simple interpolations between points in latent space, however, these hybrids combine the coarse features of one image with the fine features of the other.

Fake images drawn from all nine sites studied.

As above, the particulars that guide the training of a StyleGAN model strongly depend upon the specifics of the implementation. This project relies on the official TensorFlow implementation of StyleGAN (Karras, 2019), which was employed without modification to train a single model on a combination of RGB sceneographic data drawn from all nine urban places. Once trained, the model may be queried either by sampling locations in latent space, or by providing coarse-fine pairs of locations in latent space to more precisely control different aspects of the synthetic image.
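
Sampling the trained model follows the example scripts distributed with the official repository; the checkpoint filename below is a placeholder, and the call signatures are those of the published NVlabs code rather than anything specific to this project.

```python
import pickle
import numpy as np
import PIL.Image
import dnnlib.tflib as tflib   # from the official StyleGAN repository

tflib.init_tf()
with open("network-snapshot.pkl", "rb") as f:   # placeholder checkpoint name
    _G, _D, Gs = pickle.load(f)

# draw a single random location in latent space and synthesize an image
latents = np.random.RandomState(42).randn(1, Gs.input_shape[1])
fmt = dict(func=tflib.convert_images_to_uint8, nchw_to_nhwc=True)
images = Gs.run(latents, None, truncation_psi=0.7,
                randomize_noise=True, output_transform=fmt)
PIL.Image.fromarray(images[0], "RGB").save("synthetic_place.png")
```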

Building upon the former technique of taking samples in latent space, linear sequences of samples may be combined to produce animations such as the ones discussed below.

Image Generation

In the image generation stage, we develop methods for interfacing with the trained models in useful ways. This task is non-trivial, since each GAN model, once trained, is capable of producing a vast and overwhelming volume of synthetic images, organized in terms of a high-dimensional latent space. The StyleGAN model offers a unique means of guiding the generation of images: new images are produced as combinations of features drawn from other images selected from latent space. The Pix2Pix model offers quite a different interface, with new synthetic images generated as transformations of arbitrary given source images: in our case, depthmaps of urban spaces. We present here a brief overview of these methods, and leave a more complete unpacking and visual analysis of the resulting images to a future post.

Pix2Pix Image Generation

Here, greyscale depthmap images are produced by sampling a scene described in a 3D CAD model. These depthmaps of constructed scenes are then used as the source images from which the Pix2Pix model for each urban place produces a synthetic photographic scene. By providing precisely the same input to models trained on different urban places, direct comparisons may be made between the salient features picked up by each transformation model.
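
As a sketch of this step, assuming a place-specific generator network has already been loaded into `generator` (the exact loading code depends on the pix2pixHD checkpoint format, and the image size and scaling convention are assumptions):

```python
import numpy as np
import torch
from PIL import Image

def render_scene(generator: torch.nn.Module, depthmap_path: str, device: str = "cuda") -> Image.Image:
    """Feed a greyscale depthmap through a trained generator and return the
    synthetic RGB scene. Input and output are assumed to be scaled to [-1, 1],
    as is conventional for Pix2Pix-style models."""
    depth = Image.open(depthmap_path).convert("L").resize((512, 512))
    x = torch.from_numpy(np.asarray(depth, dtype=np.float32))
    x = (x / 127.5 - 1.0).unsqueeze(0).unsqueeze(0).to(device)    # shape: 1 x 1 x H x W
    with torch.no_grad():
        y = generator(x)                                          # shape: 1 x 3 x H x W
    rgb = ((y[0].permute(1, 2, 0).cpu().numpy() + 1.0) * 127.5).clip(0, 255).astype(np.uint8)
    return Image.fromarray(rgb)
```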

For example, while each of the synthetic images below was produced by sampling the same depthmap, we can clearly see the imagistic properties that characterize each of the urban places sampled. A large massing that appears in the depthmap is interpreted by the Rotterdam model as a large brick housing block, as is typical in the Dutch city, while the Pickwick Park model renders this massing in a manner typical of Northern Florida flora, suggesting the mass of a mossy Live Oak. A long and receding urban wall is broken up by the Alamo Square model into a series of small-scale forms, an interpretation that expresses the massing of the line of Edwardian townhouses that dominates this San Francisco neighborhood; the same urban form is understood as something resembling a red-brick industrial warehouse by the model trained on images from the Bushwick area of Brooklyn.

A depthmap (left) and corresponding images for Jacksonville, FL (middle) and Rotterdam, NL (right)

StyleGAN Image Generation

The other two image generation methods developed here rely on the StyleGAN models.

Synthetic urban places generated by StyleGAN. Vertical columns define coarse features, such as camera direction and orientation, while horizontal rows define fine features, such as the textures, colors, and lighting effects of each urban place.

Like the method discussed above, the first of these two offers opportunities for comparisons to be drawn between the urban places sampled. Using the StyleGAN interface as it was intended by its authors, it is possible to separately assert control over the fine and coarse aspects of generated images. The interface that results may be seen as the design of urban scenes “by example”: the user need only offer examples of images that contain the desired features of a place, without explicitly stating what these features are, where they came from, or how to construct them. In the context of this study, this again allows for comparisons between urban places. For example, the nearby figure demonstrates how coarse and fine features may be combined to form new scenes that hybridize aspects of existing ones.
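
Continuing the StyleGAN sampling sketch above, coarse and fine features are hybridized by mixing the intermediate (dlatent) representations of two samples. The split index used below is an assumption and would vary with model resolution; the call signatures follow the style-mixing example in the official repository.

```python
# latents for two scenes: one supplies coarse structure, the other fine appearance
z_coarse = np.random.RandomState(1).randn(1, Gs.input_shape[1])
z_fine = np.random.RandomState(2).randn(1, Gs.input_shape[1])

# map to the intermediate latent space, shape [1, num_layers, 512]
d_coarse = Gs.components.mapping.run(z_coarse, None)
d_fine = Gs.components.mapping.run(z_fine, None)

# keep the early (coarse) layers from one sample, the later (fine) layers from the other
mixed = d_coarse.copy()
mixed[:, 8:] = d_fine[:, 8:]          # split index 8 is an assumption

images = Gs.components.synthesis.run(mixed, randomize_noise=False, output_transform=fmt)
```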

Finally, we develop a method for generating animations by sampling linear sequences of points in the latent space of the StyleGAN model. While not directly supportive of a controlled comparative study of urban places, these animations do offer insight into the structure of the latent space, including which features and scenes are parameterized similarly, and which lie far from one another.
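
A linear walk through latent space is simple to construct; each row of the returned array would be passed to `Gs.run` as above to render one frame of such an animation. The function name and frame count are illustrative.

```python
def latent_walk(z_start: np.ndarray, z_end: np.ndarray, n_frames: int = 60) -> np.ndarray:
    """Linearly interpolated sequence of latent vectors between two endpoints."""
    t = np.linspace(0.0, 1.0, n_frames)[:, None]
    return (1.0 - t) * z_start + t * z_end
```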

Reflection

As described above, a GAN instrumentalizes the competition between two related neural networks. Since the effective result of this competition is the encoding of the tacit properties held in common by a given set of images, this project proposes that an interrogation of the synthetic images generated by the GAN will reveal properties useful in uncovering the nature of urban places. Recall Norberg-Schulz’s characterization of a “place”: a “space with a unique character”, including those forms, textures, colors, and qualities of light that exemplify a particular urban location. By this measure, the immediate aim has been met: the initial results documented above exhibit imagistic features unique to each of the sites studied. Much work remains, however, to realize the larger aim of this project: to develop tools that support tacit intent in architectural design. Future work includes the extension of the “design by example” paradigm from images of urban places, as demonstrated here, to more directly architectural representations, such as three-dimensional forms and spaces.

Bibliography

Aish, Robert, Ruairi Glynn, and Bob Sheil. “Foreword.” In Fabricate 2011: Making Digital Architecture, DGO-Digital original., 10–11. UCL Press, 2017.

Bechtel, William, and Adele Abrahamsen. Connectionism and the Mind: Parallel Processing, Dynamics, and Evolution in Networks. Wiley-Blackwell, 2002.

Cheng, Chili, and June-Hao Hou. “Biomimetic Robotic Construction Process: An Approach for Adapting Mass Irregular-Shaped Natural Materials.” In Proceedings of the 34th ECAADe Conference. Oulu, Finland, 2016.

Chung, Chia Chun, and Taysheng Jeng. “Information Extraction Methodology by Web Scraping for Smart Cities: Using Machine Learning to Train Air Quality Monitor for Smart Cities.” In CAADRIA 2018–23rd International Conference on Computer-Aided Architectural Design Research in Asia, edited by Suleiman Alhadidi, Tomohiro Fukuda, Weixin Huang, Patrick Janssen, and Kristof Crolla, 2:515–524. The Association for Computer-Aided Architectural Design Research in Asia (CAADRIA), 2018.

Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. “Generative Adversarial Nets.” In Advances in Neural Information Processing Systems 27, edited by Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, 2672–2680. Curran Associates, Inc., 2014. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf.

Isola, Phillip, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. “Image-to-Image Translation with Conditional Adversarial Networks.” CoRR abs/1611.07004 (2016). http://arxiv.org/abs/1611.07004.

Karras, Tero. StyleGAN. NVIDIA, 2019. https://github.com/NVlabs/stylegan.

Karras, Tero, Samuli Laine, and Timo Aila. “A Style-Based Generator Architecture for Generative Adversarial Networks.” CoRR abs/1812.04948 (2018). http://arxiv.org/abs/1812.04948.

Kilian, Axel. “Design Exploration and Steering of Design.” In Inside Smartgeometry, 122–29. John Wiley & Sons, Ltd, 2014. https://doi.org/10.1002/9781118653074.ch10.

Ledig, Christian, Lucas Theis, Ferenc Huszar, Jose Caballero, Andrew Cunningham, Alejandro Acosta, Andrew Aitken, et al. “Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network,” 105–14, 2017. https://doi.org/10.1109/CVPR.2017.19.

Ng, Andrew Y., and Michael I. Jordan. “On Discriminative vs. Generative Classifiers: A Comparison of Logistic Regression and Naive Bayes.” In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, 841–848. NIPS’01. Cambridge, MA, USA: MIT Press, 2001. http://dl.acm.org/citation.cfm?id=2980539.2980648.

Norberg-Schulz, Christian. Genius Loci: Towards a Phenomenology of Architecture. Academy Editions, 1980.

Peng, Wenzhe, Fan Zhang, and Takehiko Nagakura. “Machines’ Perception of Space.” In Proceedings of the 37th Annual Conference of the Association for Computer Aided Design in Architecture (ACADIA). Cambridge, MA: Association for Computer Aided Design in Architecture, 2017.

Sculley, D., Jasper Snoek, Alex Wiltschko, and Ali Rahimi. “Winner’s Curse? On Pace, Progress, and Empirical Rigor.” Vancouver, CA, 2018. https://openreview.net/forum?id=rJWF0Fywf.

Steinfeld, Kyle. “Dreams May Come.” In Proceedings of the 37th Annual Conference of the Association for Computer Aided Design in Architecture (ACADIA). Cambridge, MA: Association for Computer Aided Design in Architecture, 2017.

Wagener, Paul. Streetview Explorer, 2013. https://github.com/PaulWagener/Streetview-Explorer.

Wang, Jason, and Luis Perez. “The Effectiveness of Data Augmentation in Image Classification Using Deep Learning.” Convolutional Neural Networks Vis. Recognit, 2017.

Wang, Ting-Chun. Pix2PixHD: Synthesizing and Manipulating 2048x1024 Images with Conditional GANs. NVIDIA, 2017. https://github.com/NVIDIA/pix2pixHD.
