Image by Thomas Staub from Pixabay. Mona Lisa speaks! See #14 below.

The New Photography — What is it?

Ted Tuescher
Towards Data Science

--

This three-part series about photography and imaging examines many of the recent technical and social developments. In part 1, we looked at the 190-year history since photography’s invention, noting the rapid pace of change in the medium, the sudden transition from film to digital, and the rise of the smartphone.

In this installment, part 2, we’ll survey a number of recent technical developments in an effort to build a larger context for understanding new capabilities and consider what’s next.

The terms Computer Vision and Computational Photography are often used interchangeably. Computer vision, however, is the broader discipline, covering a range of digital capture and processing techniques. It refers to the ability of computers to understand imagery, often in much the way people do. Commonly this is achieved by characterizing an image’s content, color, tonal values, shapes or edge data, but it can also draw on other metadata such as embedded location or time stamps. When very large sets of images are analyzed, patterns emerge and insights are gained that can then be applied to organizing, verifying or even modifying imagery.
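To make that kind of low-level characterization concrete, here is a minimal sketch using the OpenCV library. The file name photo.jpg is a placeholder, and the edge and histogram features are simply examples of the content, tonal and edge data described above, not any particular product’s pipeline.

```python
import cv2

# Load a photo (placeholder path) and convert to grayscale for edge analysis.
image = cv2.imread("photo.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Edge data: Canny finds strong intensity transitions (outlines and shapes).
edges = cv2.Canny(gray, threshold1=100, threshold2=200)

# Tonal values: a brightness histogram summarizes exposure and contrast.
tone_histogram = cv2.calcHist([gray], [0], None, [256], [0, 256])

# Color: per-channel histograms give a coarse color "fingerprint" of the image.
color_histograms = [cv2.calcHist([image], [c], None, [32], [0, 256]) for c in range(3)]

print("edge pixels:", int((edges > 0).sum()))
print("darkest / brightest tone bins:", int(tone_histogram.argmin()), int(tone_histogram.argmax()))
```

Features like these, computed across millions of images, are the raw material from which the patterns and insights mentioned above are extracted.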

Computational photography is, more specifically, a discipline involving the calculation, analysis and manipulation of imagery using algorithms rather than optical methods. We won’t fret over the distinction between the two disciplines here, but rather consider the larger genre of computer vision.

Eyes Robot

This isn’t necessarily a new area. Early examples of computer vision have been with us for a while in the likes of:

  • Optical Character Recognition (OCR), which converts text-based printouts into machine-readable documents, along with related recognition techniques such as bar- and QR-code scanning.
  • High-Dynamic-Range (HDR) imaging, where multiple exposures are combined to depict a high-contrast scene, normally one that exceeds the range of a camera sensor or even the human eye. Recent updates have dramatically improved the quality, helping HDR shed the moody, surreal “Harry Potter” look of its early days. (A minimal merge sketch follows this list.)
  • Panoramic imagery where multiple images are aligned and stitched together with the seams between images automatically blended.
  • Contextual image replacement, branded Content-Aware Fill in Adobe’s software, where portions of an image are replaced using surrounding data. A common usage would be removing power lines from a photo. Adobe released this in 2010, and more recently the feature has seen dramatic improvements in quality and capability thanks to more sophisticated algorithms built on the company’s AI platform.
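As a rough illustration of the HDR bullet above, the following sketch uses OpenCV’s alignment and exposure-fusion tools to merge three bracketed shots. The file names are placeholders, and this is only one simple way to perform the merge, not any vendor’s actual pipeline.

```python
import cv2

# Three bracketed exposures of the same scene (placeholder file names).
exposures = [cv2.imread(name) for name in ("under.jpg", "normal.jpg", "over.jpg")]

# Roughly align the frames to compensate for small handheld movement.
align = cv2.createAlignMTB()
align.process(exposures, exposures)

# Mertens exposure fusion blends the best-exposed regions of each frame,
# producing an HDR-like result without needing the camera response curve.
merge = cv2.createMergeMertens()
fused = merge.process(exposures)  # float image with values roughly in [0, 1]

cv2.imwrite("fused.jpg", (fused * 255).clip(0, 255).astype("uint8"))
```

The same basic recipe, with far more sophisticated alignment and tone mapping, is what modern smartphone HDR modes run automatically on every shot.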

We’ve seen a raft of new developments in computer vision covering a range of uses in photography and video. Recent examples include:

  1. Ever-improving hardware and powerful software in the hands of serious hobbyists have enabled Andrew McCarthy to combine 50,000 images into a single high-definition image of the moon, using techniques that include stacking frames to reduce sensor noise and stitching the resulting tiles together.
  2. Similarly, Alan Friedman created incredible HD images of the Sun by combining thousands of images to average out distortions introduced by Earth’s atmosphere.
  3. Light’s L16 camera delivers impressive quality, file size and creative latitude in a form factor close to a smartphone’s. Using 16 sensors, the camera instantly evaluates a scene and captures 10+ images that are combined into a stunning final image equal to or better than that of many SLRs, at a lower price.
  4. Artists Pep Ventosa and Corinne Vionnet have each harvested online imagery and combined it into new representations of familiar landmarks. Vionnet notes how these landmarks are re-photographed by visitors from similar vantage points, reinforcing visual patterns while adding subtle variations to the existing collective index. The accessibility of beautiful imagery from around the world through social networks has also had dramatic unintended impacts, such as overcrowding and environmental strain at well-known tourist destinations.
  5. Remove.bg, available as a web service and a Photoshop plugin, removes the background from an image in a single click. Certainly this has been done manually for years, but when the speed and quality of this tool are combined with the relighting technology described next, it will be fast and easy to place people and objects into entirely new image contexts (as long as the perspective is similar).
  6. Lighting plays a key role in defining an image. A team at Google has developed a method to relight a portrait, using a single image to understand the facial terrain plus a separate source image from which to emulate the lighting style.
  7. Google has made an impressive update to ARCore, enabling Environmental HDR within its Lighting Estimation API. By applying machine learning to a single image, the shadows, highlights and reflections can be estimated so that virtual objects are lit to match the real scene, a useful technique for the development of AR.
  8. Another Google technology, labelled Global localization, will enable more accurate positioning and directions. It combines the panoramas in Google Street View with the company’s Visual Positioning Service, which compares imagery from your active smartphone camera in real time against images cataloged at known locations. Machine learning is used to filter out impermanent structures for a more precise understanding of your location.
  9. Airbnb has used machine learning to recognize and reorder listing imagery so that places to stay are presented better and hallway or bathroom photos don’t appear first.
  10. Recognizing the challenge of producing multiple versions of videos to fit different displays (laptops, TVs, a range of smartphones), Adobe will release AI-driven video cropping that optimizes the video for each aspect ratio so the important elements remain in frame.
  11. Adobe is also applying its AI platform to simplify a labor-intensive video-production technique: giving a static image a dimensional, parallax-like quality in which elements slide past one another as if the viewer’s vantage point were changing, much like video footage.
  12. Similarly, Adobe’s AI platform will soon enable automatic video masking, speeding up a task that is time consuming for a single image and extremely labor intensive for video at 24 frames per second.
  13. Riffing further on that theme will be content-aware fill for video in Adobe After Effects enabling the removal of selected items from the frame. Once an object is selected, its replacement will be tracked through successive video frames.
  14. Machine learning researchers at Samsung’s AI Center developed a technique to add lifelike motion to a person’s face by establishing facial landmarks. The idea isn’t entirely new, but this approach achieves remarkable results from a single image and improves with additional reference imagery.
  15. In a capability that is very easy for humans but hard for machines, Google is making good progress with computer-generated depth estimation for moving people, which will be a powerful enabling technology.
  16. Researchers at the University of Washington Allen School’s Graphics & Imaging Laboratory developed a technology to reshape the lip movements in a video to match a specified audio track. They analyzed President Obama’s mouth shapes and applied them to another video of him.
  17. Similarly, a UC Berkeley team has enabled the transfer of dance moves from an expert to a novice using pose detection, bringing hope for us all!
  18. Chinese company Kandao, Google and NVIDIA are all exploring AI for tweening in video, the process of interpolating between frames to smooth motion. This will improve the rendering of slow-motion footage that wasn’t shot for that purpose.
  19. Google’s PlaNet has referenced some 90 million geotagged images to power the early stages of an image-recognition service that determines where a photo was taken.
  20. Taking advantage of faster processors, Google’s Pixel camera features Night Sight, which enables better nighttime photography by taking the better of two results: (1) a steady handheld photo with a longer exposure to gather light and reduce noise, or (2) if the camera isn’t steady, a burst of several photos combined into an improved result. (A toy frame-stacking sketch follows this list.)
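Several of the items above (notably 1, 2 and 20) rest on the same underlying idea: aligning and averaging many noisy frames so that random sensor noise cancels out. Here is a toy sketch of the averaging step, assuming the frames are already aligned and using placeholder file names; real pipelines add careful alignment, weighting and tone mapping.

```python
import numpy as np
import cv2

# Load a burst of already-aligned frames (placeholder file names).
frames = [cv2.imread(f"frame_{i:03d}.jpg").astype(np.float32) for i in range(20)]

# Averaging N frames reduces random sensor noise by roughly sqrt(N),
# which is why stacking thousands of frames yields such clean results.
stacked = np.mean(frames, axis=0)

cv2.imwrite("stacked.jpg", stacked.clip(0, 255).astype(np.uint8))
```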

Mechanical Brains

The news has been rife with stories of fake imagery and video, often involving some of the techniques mentioned above, and the field is only beginning to mature. A basic understanding of the methodology helps in grasping its capabilities and where it may lead. While it can be used to generate new, synthesized or fake imagery, it can also be used to recognize, categorize, track or creatively modify imagery, as in the popular Prisma app, which applies a technique known as style transfer using a Convolutional Neural Network (CNN). These highly adaptive approaches are also a major focus in the effort to create self-driving vehicles.

Generally, good results come from neural networks patterned after biological systems, where stimuli roll up to higher levels, creating more meaningful impulses. At an elemental level, neural networks are optimization methods that train a computer model by finding associations in data: strong associations are given more weight, weak associations less. It’s a bit of a brute-force method, but computers, being fast and tireless, can crunch enormous amounts of data to reach surprisingly good results.
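As a rough illustration of “finding associations in data,” here is a toy single-neuron model trained by gradient descent in NumPy. The data is synthetic and the hyperparameters arbitrary; the point is only that inputs strongly associated with the outcome end up with larger weights.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake data: feature 0 is strongly associated with the label, feature 1 is noise.
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(float)

w = np.zeros(2)   # weights: the strength of each learned association
b = 0.0
lr = 0.5

for _ in range(500):
    # Forward pass: a single sigmoid "neuron".
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    # Gradient of the cross-entropy loss, averaged over the dataset.
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    # Update: strengthen useful associations, weaken useless ones.
    w -= lr * grad_w
    b -= lr * grad_b

print("learned weights:", w)  # the weight on feature 0 ends up far larger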

  • One approach pits two neural networks against each other in an optimization scheme known as a Generative Adversarial Network (GAN). One network generates an image based on what it has learned from a dataset; the other assesses the image to determine whether it looks realistic. Rejected images are refined until the discriminator can no longer tell that the image is fake. (A minimal training-loop sketch follows this list.)
  • Convolutional Neural Networks (CNNs) are commonly used to categorize images or otherwise find patterns. As data is analyzed, convolutional layers transform it and pass the result to the next layer for further analysis. Each layer applies a set of filters that detect features such as edges, corners and simple shapes, representing more complex information with each layer. As the data moves deeper into the network, later layers combine features from earlier ones to identify more complex objects like eyes, faces or cars.
  • Perceptual Loss Functions are also used for their speed in training a CNN. This method recognizes that two images can look the same to humans but be mathematically different to a computer, such as the same image shifted by a pixel or more. The more data analyzed, the better the results.
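To make the adversarial setup in the first bullet concrete, here is a heavily simplified GAN training loop in PyTorch. The network sizes, hyperparameters and the random stand-in for “real” images are all placeholders; the sketch only shows the generator/discriminator interplay, not any production system.

```python
import torch
import torch.nn as nn

# Tiny generator: maps a random latent vector to a fake, flattened 8x8 "image".
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64), nn.Tanh())

# Tiny discriminator: scores an image as real (1) or fake (0).
D = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
loss_fn = nn.BCEWithLogitsLoss()

for step in range(1000):
    # Stand-in for a batch of real images; a real project would load photographs here.
    real = torch.rand(32, 64) * 2 - 1

    # Train the discriminator: real images should score 1, generated fakes 0.
    fake = G(torch.randn(32, 16)).detach()
    d_loss = loss_fn(D(real), torch.ones(32, 1)) + loss_fn(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Train the generator: its fakes should fool the discriminator into scoring 1.
    g_loss = loss_fn(D(G(torch.randn(32, 16))), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

In practice both networks would be convolutional and trained on large collections of real photographs, but the push and pull between the two losses is exactly the dynamic described above.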

These explanations represent the very tip of the iceberg. Implementations are still rough around the edges, but they are improving rapidly. Even with this limited understanding, it’s not hard to see how neural networks can be used to generate impressive, animated models of real people, especially celebrities, as we’ve heard many times in the news. For example, high-definition video at 24 frames per second can be pulled from YouTube to train a network on how a specific person speaks and moves. These learnings can then be used to generate new or altered imagery, such as this example where Jon Snow apologizes for GoT season 8.

These methods are computationally very intensive. Faster processors and the availability of huge amounts of digital imagery for patterning have allowed more sophisticated, open-sourced algorithms to proliferate. Interestingly, despite the complexity of image data, ML/AI methodologies have progressed much further with imagery than they have with text, due largely to the more objective nature of imagery. Words and text, on the other hand, can have varying interpretations based on context, personality, culture and other factors like irony, which pose bigger challenges for machines to understand.

The examples covered above are far from comprehensive. Software and hardware companies continue their aggressive progress, while many universities have added the subject to their curricula and formed computer vision departments. It’s clear we’ll continue to see an increase in the volume and quality of manipulated imagery. Further characterization of large image datasets will naturally bring insights and learnings, along with some abuses.

In the final installment of this series, we’ll consider some of the social and ethical challenges with these technologies along with some thoughts on mitigation. We’ll also look at what’s on the horizon.

--

A muser upon business, art and technology. Performing in product + ecommerce. Pixel buccaneer, lover and doer of photography.