Computer Vision: Image formation and representation

Getting started with computer vision

Mayank Mishra
Towards Data Science



1. Introduction

As humans, we are able to perceive the three-dimensional world around us with such ease. Imagine looking at a flower vase. We can effortlessly perceive each petal's shape and translucency and can separate the flowers from the background.

Computer vision aims at giving computers the ability to understand the environment as we do. It focuses on looking at the world through multiple images or videos and reconstructing properties like the shape of objects, intensity, color distributions, etc.

Recent advancements in deep learning are enabling computer vision methods to understand and automate tasks that the human visual system can do. This article discusses two introductory topics of computer vision: image formation and image representation. The first half briefly covers how an image is formed, the factors it depends on, and the image sensing pipeline of a digital camera. The second half covers image representation: the various ways an image can be represented and some of the operations that can be performed on it.

2. Image Formation

In modeling any image formation process, geometric primitives and transformations are crucial for projecting 3-D geometric features onto a 2-D image plane. However, apart from geometry, image formation also depends on discrete color and intensity values, which in turn depend on the lighting of the environment, the camera optics, the sensor properties, etc. Therefore, when talking about image formation in computer vision, this article will focus on photometric image formation.

2.1 Photometric Image Formation

Fig. 1 gives a simple illustration of image formation. Light from a source is reflected off a surface, and a portion of that reflected light travels through the camera optics and reaches the image (sensor) plane, where the image is formed.

Fig: 1. Photometric Image Formation (Image credit: Szeliski, Computer Vision: Algorithms and Applications 2010)

Some factors that affect image formation are:

  • The strength and direction of the light emitted from the source.
  • The material and geometry of the surface, along with other nearby surfaces.
  • The capture properties of the sensor.

2.1.1 Reflection and Scattering

Images cannot exist without light. A light source can be a point source or an area source. When light hits a surface, three major reactions can occur:

  1. Some light is absorbed. The amount depends on a factor called the albedo (ρ); a surface with a low albedo absorbs more light.
  2. Some light is reflected diffusely, independently of the viewing direction. This follows Lambert’s cosine law: the amount of reflected light is proportional to cos(θ), where θ is the angle between the incoming light and the surface normal. E.g., cloth, brick.
  3. Some light is reflected specularly, which depends on the viewing direction. E.g., mirror.
Fig. 2: Models of reflection (Image credits: Derek Hoiem, University of Illinois)

Apart from the above models of reflection, the most common model of light scattering is the Bidirectional Reflectance Distribution Function (BRDF). It measures how much of the light arriving from one direction is scattered by a surface into another direction. The scattering of light reveals the topography of the surface: smooth surfaces reflect almost entirely in the specular direction, while with increasing roughness the light tends to scatter into all possible directions. In the limit, an object whose surface is perfectly diffuse (i.e., Lambertian) appears equally bright over the entire outgoing hemisphere. Owing to this, the BRDF can give valuable information about the nature of the target sample.
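
To make the diffuse and specular terms above concrete, here is a minimal Python sketch (not from the original text) that combines Lambert’s cosine law with a simple Phong-style specular term; the function name and parameter values are illustrative assumptions, not a standard API.

```python
import numpy as np

def shade(normal, light_dir, view_dir, albedo=0.7, k_specular=0.3, shininess=32):
    """Toy shading model: a Lambertian diffuse term plus a Phong-style specular term.
    All direction vectors are assumed to be unit length."""
    # Diffuse term follows Lambert's cosine law: proportional to cos(theta) = n . l
    diffuse = albedo * max(0.0, float(np.dot(normal, light_dir)))
    # Specular term depends on the viewing direction (mirror-like reflection)
    reflect = 2.0 * np.dot(normal, light_dir) * normal - light_dir
    specular = k_specular * max(0.0, float(np.dot(reflect, view_dir))) ** shininess
    return diffuse + specular

# A surface facing up, lit from directly above and viewed slightly off to the side
n = np.array([0.0, 0.0, 1.0])
l = np.array([0.0, 0.0, 1.0])
v = np.array([0.3, 0.0, 0.95]); v = v / np.linalg.norm(v)
print(shade(n, l, v))  # brightness contributed by the diffuse and specular terms
```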

Several other shading models and ray-tracing approaches are used in combination to evaluate the appearance of a scene and thereby understand the environment.

2.1.2 Color

From the viewpoint of color, visible light is only a small portion of the electromagnetic spectrum.

When colored light arrives at a sensor, two factors determine what is recorded:

  • The color of the light
  • The color of the surface

The Bayer grid (or Bayer filter) is an important development for capturing the color of light. In a camera, not every sensing element captures all three components (R, G, B) of the light. Inspired by the receptors of the human visual system, Bryce Bayer proposed a grid in which 50% of the sensors are green, 25% are red, and 25% are blue.

A demosaicing algorithm is then used to obtain a full-color image: the surrounding pixels are used to estimate the missing color values at each pixel.

Many other color filter arrays have been developed to sense color apart from the Bayer filter.

Fig. 3: (a) Bayer arrangement of filters on an image sensor. (b) The cross-section of the sensor. (Image credits: https://en.wikipedia.org/wiki/Bayer_filter)
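
As a rough illustration of the Bayer mosaic and demosaicing, the sketch below simulates an RGGB filter arrangement with NumPy/SciPy and fills in the missing values by simple normalized (bilinear-style) interpolation. Real camera pipelines use more sophisticated demosaicing; the function names here are purely illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

def bayer_mosaic(rgb):
    """Simulate an RGGB Bayer sensor: keep only one color component per pixel."""
    h, w, _ = rgb.shape
    mask = np.zeros((h, w, 3))
    mask[0::2, 0::2, 0] = 1   # red at even rows, even columns (25%)
    mask[0::2, 1::2, 1] = 1   # green (50% in total)
    mask[1::2, 0::2, 1] = 1
    mask[1::2, 1::2, 2] = 1   # blue at odd rows, odd columns (25%)
    return rgb * mask, mask

def demosaic_bilinear(mosaic, mask):
    """Estimate missing color values from the surrounding pixels (normalized convolution)."""
    kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float)
    out = np.zeros_like(mosaic)
    for c in range(3):
        num = convolve2d(mosaic[..., c], kernel, mode="same")
        den = convolve2d(mask[..., c], kernel, mode="same")
        out[..., c] = num / np.maximum(den, 1e-8)
    return out

rgb = np.random.rand(8, 8, 3)           # stand-in for a real image
mosaic, mask = bayer_mosaic(rgb)
reconstructed = demosaic_bilinear(mosaic, mask)
```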

2.2 Image Sensing Pipeline (The Digital Camera)

The light originates from multiple light sources, is reflected off multiple surfaces, and finally enters the camera, where the photons are converted into the (R, G, B) values that we see when looking at a digital image.

An image sensing pipeline in the camera follows the flowchart that is given in fig. 4.

In a camera, the light first passes through the lens (optics), followed by the aperture and shutter, which can be adjusted. The light then falls on the sensor, which can be a CCD or a CMOS chip (discussed below); its output, in analog or digital form, gives us the raw image.

Typically, cameras do not stop here. They apply the demosaicing algorithms mentioned above, sharpen the image if required, and run other processing algorithms. After this, white balancing and other digital signal processing tasks are performed, and the image is finally compressed to a suitable format and stored.

Fig. 4: Image sensing pipeline in a camera (Image credit: Szeliski, Computer Vision: Algorithms and Applications 2010)

2.2.1 CCD vs CMOS

The camera sensor can be a CCD or a CMOS chip. In a charge-coupled device (CCD), a charge is generated at each sensing element; this photogenerated charge is moved from pixel to pixel and converted into a voltage at the output node. An analog-to-digital converter (ADC) then converts the value of each pixel to a digital value.

Complementary metal-oxide-semiconductor (CMOS) sensors instead convert charge to voltage inside each element, as opposed to a CCD, which accumulates the charge and reads it out through a single output node. Since the analog-to-digital conversion typically happens on the chip itself, the sensor output is already digital. CMOS sensors are widely used in today's cameras.

Fig. 5: CCD vs CMOS (Image credit: D. Litwiller, CMOS vs. CCD: Maturing technologies, maturing markets)

2.2.2 Properties of Digital Image Sensor

Let us look at some properties that come into play when taking a picture with a camera.

  • Shutter speed: controls the amount of light reaching the sensor.
  • Sampling pitch: the physical spacing between adjacent sensor cells on the imaging chip.
  • Fill factor: the ratio of the active sensing area to the theoretically available sensing area (the product of the horizontal and vertical sampling pitches).
  • Chip size: the overall size of the sensor chip.
  • Sensor noise: noise added from various sources in the sensing process.
  • Resolution: how many bits are specified for each pixel value.
  • Post-processing: digital image enhancement methods applied before compression and storage.

3. Image Representation

Once an image has been obtained, it is important to devise ways to represent it. There are various ways to represent an image; let’s look at the most common ones.

3.1 Image as a matrix

The simplest way to represent an image is in the form of a matrix.

Fig. 6: Representing a part of the image as a matrix (Image credit: IIT, Madras, NPTEL Deep Learning for Computer Vision)

In fig. 6, we can see that a part of the image, i.e., the clock, has been represented as a matrix. A similar matrix will represent the rest of the image too.

Each pixel is commonly represented with one byte, so a value between 0 and 255 gives the intensity of each pixel, where 0 is black and 255 is white. One such matrix is generated for every color channel in the image. In practice, it is also common to normalize the values to lie between 0 and 1 (as done in the example in the figure above).
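
A minimal NumPy sketch of this matrix view, using a tiny hand-made grayscale image:

```python
import numpy as np

# A 3x3 grayscale "image" stored as a matrix of 8-bit intensities,
# where 0 is black and 255 is white.
img = np.array([[  0,  64, 128],
                [ 64, 128, 192],
                [128, 192, 255]], dtype=np.uint8)

print(img.shape)   # (rows, columns) -> (3, 3)
print(img[0, 2])   # intensity of the pixel in row 0, column 2 -> 128

# Normalizing the values to the range [0, 1], as in the figure above
img_normalized = img.astype(np.float32) / 255.0
print(img_normalized[2, 2])   # 1.0

# A color image simply stacks one such matrix per channel, e.g. shape (rows, columns, 3)
```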

3.2 Image as a function

An image can also be represented as a function. A grayscale image can be thought of as a function that takes a pixel coordinate and returns the intensity at that pixel.

It can be written as a function f: ℝ² → ℝ that outputs the intensity at any input point (x, y). The intensity value can lie between 0 and 255, or between 0 and 1 if the values are normalized.

Fig. 7: An image represented as a function (Image credits: Noah Snavely, Cornell University)
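
As a small illustration of this functional view, the sketch below wraps a NumPy array in a function f(x, y); the convention that x indexes the row and y the column is an assumption made here for simplicity.

```python
import numpy as np

img = np.array([[10, 20, 30],
                [40, 50, 60],
                [70, 80, 90]], dtype=np.uint8)

def f(x, y):
    """Return the intensity of the image at pixel coordinate (x, y)."""
    return int(img[x, y])

print(f(1, 2))   # intensity at row 1, column 2 -> 60
```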

3.2.1 Image Transformation

When images are viewed as functions, they can be transformed: a change in the function results in a change in the pixel values of the image. A few examples are given below.

Fig. 8 Image transformation — Lightening the image (Source used: Noah Snavely, Cornell University)

In fig. 8, we wish to make the image lighter. Therefore, for every pixel in the image, we increase the corresponding intensity value. Here we are assuming that the values lie between 0 and 255.

Fig. 9: Image Transformation — Flipping the image (Source used: Noah Snavely, Cornell University)

Similarly, fig. 9 shows the change in the function to flip the image around the vertical axis.
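
A minimal sketch of these two pixel-level transformations, assuming 8-bit intensities in [0, 255]; the function names are illustrative.

```python
import numpy as np

def lighten(img, amount=30):
    """Point-wise transformation: add a constant to every intensity value,
    clipping the result to the valid [0, 255] range."""
    return np.clip(img.astype(np.int16) + amount, 0, 255).astype(np.uint8)

def flip_vertical_axis(img):
    """Flip the image around the vertical axis, i.e. mirror it left-to-right."""
    return img[:, ::-1]

img = np.random.randint(0, 256, size=(4, 4), dtype=np.uint8)
print(lighten(img, 30))
print(flip_vertical_axis(img))
```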

In the above examples, the transformation takes place at the pixel level. There are other ways to transform an image as well.

3.2.2 Image Processing Operations

Essentially, there are three main classes of operations that can be performed on an image:

  • Point Operations
  • Local Operations
  • Global Operations

Given below is an explanation of each of these operations.

3.2.2.1 Point Operation

The image transformations shown above are point operations: the output value at a coordinate depends only on the input value at that same coordinate.

Fig. 10: Point Operation. Output coordinate in ‘b’ only depends on the corresponding input coordinate in ‘a’ (Image credit: Author)

A well-known point operation, often used when editing images, is reversing the contrast. In the simplest terms, it turns dark pixels into light pixels and vice versa.

Fig. 11: Reversing the contrast (Image credit: IIT, Madras, NPTEL Deep Learning for Computer Vision)

Fig. 11 displays the application of reversing the contrast. The point operation that helps us to achieve this is stated below.

Fig.12: Point operation to reverse contrast (Image credit: Author)

Here, I(x,y) stands for the intensity value at coordinate (x,y) of an image I. Iₘₐₓ and Iₘᵢₙ refer to the maximum and minimum intensity values of image I. For example, say that an image I has intensities between 0 and 255; then Iₘₐₓ and Iₘᵢₙ are 255 and 0, respectively. Suppose you wish to flip the intensity value at a coordinate (x, y) where the current intensity is 5. Using the above operation, you get 255 - 5 + 0 = 250, which becomes the new intensity value at (x, y).
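
A minimal NumPy sketch of this reverse-contrast point operation, with Iₘₐₓ and Iₘᵢₙ taken from the image itself:

```python
import numpy as np

def reverse_contrast(img):
    """Point operation from fig. 12: output(x, y) = I_max - I(x, y) + I_min."""
    i_max, i_min = int(img.max()), int(img.min())
    return (i_max - img.astype(np.int16) + i_min).astype(np.uint8)

img = np.array([[  5, 250],
                [100, 155]], dtype=np.uint8)
print(reverse_contrast(img))   # [[250, 5], [155, 100]] since I_max = 250 and I_min = 5
```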

Let’s say you capture a still scene with a camera. There can be noise in the image for many reasons, such as dust particles on the lens or a damaged sensor element. Noise reduction using point operations can be very tedious. One way is to take multiple shots of the still scene, average the value at every pixel, and hope that the noise averages out (a sketch of this is given below). But at times it is not possible to get multiple images of a scene, and the stillness of a scene cannot be guaranteed every time. To handle this, we need to move from point operations to local operations.
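
A small sketch of the frame-averaging idea, using simulated Gaussian noise in place of real camera captures:

```python
import numpy as np

def average_frames(frames):
    """Average several captures of the same still scene pixel by pixel.
    Independent zero-mean noise tends to cancel out in the average."""
    stack = np.stack([f.astype(np.float32) for f in frames], axis=0)
    return stack.mean(axis=0)

# The same scene "captured" 10 times with additive Gaussian noise
scene = np.full((64, 64), 128.0)
frames = [scene + np.random.normal(0, 20, scene.shape) for _ in range(10)]
denoised = average_frames(frames)
print(abs(denoised - scene).mean())   # noticeably smaller than the per-frame noise level
```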

3.2.2.2 Local Operation

In a local operation, as shown in fig. 13, the output value depends on the input value and its neighbors.

Fig. 13: Local Operation (Image credit: Author)

A simple example of a local operation is the moving average. Consider the image I shown in fig. 14. Looking at it, the image is clearly a white box on a dark background. However, there is noise in the picture: a couple of pixels seem to be misplaced (circled in the figure).

Fig. 14: Intensity values of corresponding pixels of an image (Source used: Steve Seitz, University of Washington)

How do you remove this noise from the image? Take a 3 × 3 window (any window size can be chosen), move it across the image, and replace each pixel with the average of all the pixels falling within the window. A demonstration of this can be seen below, and the final output of the operation can be seen in fig. 15.

Moving Average 2D (Source used to make the visual description: Steve Seitz, University of Washington)
Fig. 15: Final output of the moving average (Image credit: Steve Seitz, University of Washington)

The above operation is a local operation, as the output depends on the input pixel and its neighbors. As a result, the noisy pixels in the image are smoothed out in the output.
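
A minimal sketch of this 3 × 3 moving average, implemented as a box-filter convolution over a toy image that roughly mirrors the white-box example of fig. 14:

```python
import numpy as np
from scipy.signal import convolve2d

def moving_average(img, k=3):
    """Local operation: replace every pixel with the mean of the k x k window around it."""
    kernel = np.ones((k, k), dtype=float) / (k * k)
    return convolve2d(img.astype(float), kernel, mode="same", boundary="symm")

# A white box on a dark background with two noisy ("misplaced") pixels
img = np.zeros((10, 10))
img[3:7, 3:7] = 90
img[1, 8] = 90        # noise pixel in the background
img[5, 5] = 0         # noise pixel inside the box
smoothed = moving_average(img, k=3)
print(np.round(smoothed).astype(int))
```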

3.2.2.3 Global Operation

As the name suggests, in a global operation the value at an output pixel depends on the entire input image. An example of a global operation is the Fourier transform, which is shown in fig. 17.

Fig. 16: Global Operation (Image credit: Author)
Fig. 17: An example of Fourier transformation (Image credit: Mathworks MATLAB Toolbox)
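
A minimal sketch of such a global operation using NumPy's 2-D FFT; the log-magnitude spectrum shown here is one common way to visualize the result.

```python
import numpy as np

def fourier_magnitude(img):
    """Global operation: every output value depends on the entire input image.
    Returns the log-magnitude spectrum of the 2-D discrete Fourier transform."""
    spectrum = np.fft.fftshift(np.fft.fft2(img))   # move the zero frequency to the centre
    return np.log1p(np.abs(spectrum))

img = np.random.rand(64, 64)
print(fourier_magnitude(img).shape)   # (64, 64): same size, but each value mixes all input pixels
```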

If you wish to read more such stories on computer vision, feel free to follow and stay updated.

I also write about core topics related to Machine Learning/Artificial Intelligence!
