
AN INTRODUCTORY GUIDE
Contents
- Affine Transformations
- Homography
  - Homogeneous Coordinates
  - Pinhole Camera Model
  - Equation of Homography
- References
Transformations make up an important part of computer vision, and understanding how they work lays the foundation for much more advanced techniques.
Here, I will primarily cover the affine transform and homography.
Affine Transformations:
Affine transformations are the simplest form of transformation. They preserve the linear structure of space in the sense that they satisfy the following properties:
- Lines map to lines
- Points map to points
- Parallel lines stay parallel
Some familiar examples of affine transforms are translations, dilations, rotations, shearing, and reflections. Furthermore, any composition of these transformations (like a rotation after a dilation) is another affine transform.




Rather than splitting up affine transforms into separate cases, it is much more elegant to have one unified definition. Thus, we turn towards the use of matrices as linear transformations to define affine transforms. If you are unfamiliar with interpreting matrices as linear transformations of space, 3Blue1Brown has an excellent video on the topic.
In essence, we can think of affine transformations as first applying some linear transformation with a matrix, which is then composed with some translation.
In the 2D case, the equation for an affine transform is given by:

$$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} t_1 \\ t_2 \end{bmatrix}$$

Here, the matrix represents some linear transformation of the vector with entries _x_1 and _x_2, such as a reflection, shear, rotation, dilation, or any combination of these. It is important to note that, for the properties above to hold, the matrix must be invertible, so its determinant is non-zero. The final step of the transform is a translation by the vector [_t_1, _t_2], completing the transformation onto the vector [_y_1, _y_2].
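As a quick sanity check of the formula, here is a minimal sketch in Python (assuming NumPy is available); the rotation angle, translation, and point are arbitrary values chosen for illustration.

```python
import numpy as np

# Linear part: a 45-degree rotation (any invertible 2x2 matrix would do)
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

# Translation part [t_1, t_2]
t = np.array([3.0, -1.0])

# Input point [x_1, x_2]
x = np.array([1.0, 2.0])

# Affine transform: y = A x + t
y = A @ x + t

print("det(A) =", np.linalg.det(A))  # non-zero, so the linear part is invertible
print("y =", y)
```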

Affine transformations can also be generalized to n dimensions with the following equation:

$$\mathbf{y} = A\mathbf{x} + \mathbf{b}$$

This transformation maps the vector x onto the vector y by applying the linear transform A (where A is an invertible n×n matrix) and then applying a translation with the vector b (an n×1 vector).
In conclusion, affine transformations can be represented as linear transformations composed with some translation, and they are extremely effective at modifying images for computer vision. In fact, image pre-processing relies heavily on affine transforms for scaling, rotating, shifting, etc.
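For example, a typical pre-processing step might rotate, scale, and shift an image with a single affine warp. The sketch below assumes OpenCV and NumPy are installed and uses a placeholder filename ("image.jpg"); the angle, scale, and shift values are arbitrary.

```python
import cv2
import numpy as np

img = cv2.imread("image.jpg")          # placeholder path; use any image on disk
h, w = img.shape[:2]

# 2x3 affine matrix: rotate 30 degrees about the image center and scale by 0.8
M = cv2.getRotationMatrix2D((w / 2, h / 2), 30, 0.8)

# Add a shift (40 px right, 20 px down) by editing the translation column
M[:, 2] += np.array([40.0, 20.0])

# Apply the combined affine transform in one pass
warped = cv2.warpAffine(img, M, (w, h))
cv2.imwrite("warped.jpg", warped)
```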
Homography:
A homography is a type of projective transformation: it takes advantage of projections to relate two images. Homographies were originally introduced to study shifts in perspective, and they have helped people better understand how images change when viewed from a different perspective.

Homographies arise from the fact that cameras capture different images depending on their position and orientation. In essence, a homography is a transformation between two images of the same scene taken from different perspectives. There are only two cases in which a homography applies (both assume that the viewed world can be modeled by a plane):
- The images are captured by the same camera but at different angles (the world is treated as essentially a plane)
- Two cameras view the same plane from different locations


However, in order to use a homography, we must first devise tools to describe how an image would turn out on a camera. More specifically, if we are given the location/orientation of a camera and an image plane (which is where the image appears and is a property of the camera), we must find where a point in the world (called the world point) appears on the image plane.

Because light travels in a straight line, we know that a world point with Cartesian coordinates (X, Y, Z) appears on the image plane at the intersection of the image plane with the line going through the camera center and world point (this line is the path that light takes to the camera).
In order to properly define formulas for this, however, we must take a detour into a concept in projective geometry called homogeneous coordinates.
Homogeneous Coordinates:
Homogeneous coordinates (or projective coordinates) are another coordinate system, with the advantage that many formulas become much simpler in homogeneous coordinates than in Cartesian coordinates (points on the x-y plane). Whereas a Cartesian point on the plane has 2 coordinates, its homogeneous counterpart uses 3 coordinates (x', y', z') (in general, homogeneous coordinates use one more coordinate than their Cartesian counterparts). As a bonus, converting homogeneous coordinates to Cartesian coordinates is relatively simple:

$$(x', y', z') \mapsto \left(\frac{x'}{z'}, \frac{y'}{z'}\right)$$

Notice that when z'=1, the homogeneous coordinates map nicely to Cartesian coordinates. Furthermore, when z'=0, we interpret the corresponding Cartesian coordinate as a point at infinity. With this in mind, let us examine some applications of homogeneous coordinates.
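A tiny helper (Python/NumPy; the function name is my own) that performs this conversion:

```python
import numpy as np

def to_cartesian(p):
    """Convert a homogeneous point (x', y', z') to Cartesian (x'/z', y'/z')."""
    p = np.asarray(p, dtype=float)
    if p[2] == 0:
        raise ValueError("z' = 0 represents a point at infinity")
    return p[:2] / p[2]

print(to_cartesian([4.0, 6.0, 2.0]))  # -> [2. 3.]
print(to_cartesian([2.0, 3.0, 1.0]))  # z' = 1 maps directly to (2, 3)
```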
First of all, a line in homogeneous coordinates can be represented with only 3 numbers. In particular, you can represent a line l as (_l_1, _l_2, _l_3) in homogeneous coordinates. Then, the line is simply the set of all points p=(x, y, z) such that the dot product of l with p is 0:

$$l \cdot p = l_1 x + l_2 y + l_3 z = 0$$
The second property of note is that, given two points _p_1=(a, b, c) and _p_2=(d, e, f), the line l going through _p_1 and _p_2 is given by the cross-product (an operation on two vectors defined in elementary linear algebra) of the two points:

$$l = p_1 \times p_2$$
The final property of note is that the intersection of two lines _l_1 and _l_2 (in homogeneous coordinates) is the point p given by the cross-product of the two lines:

$$p = l_1 \times l_2$$
With these properties, we are now well-equipped to understand the formulas behind homographies.
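Before moving on, here is a quick numerical check of all three properties in Python (NumPy assumed; the example points and lines are made up):

```python
import numpy as np

# Two points in homogeneous coordinates (both with z' = 1, i.e. the Cartesian points (0,0) and (1,1))
p1 = np.array([0.0, 0.0, 1.0])
p2 = np.array([1.0, 1.0, 1.0])

# Property 2: the line through p1 and p2 is their cross product (here, the line y = x)
l = np.cross(p1, p2)
print("l . p1 =", l @ p1, " l . p2 =", l @ p2)   # property 1: both dot products are 0

# Property 3: the intersection of two lines is the cross product of the lines
m = np.array([1.0, 0.0, -2.0])                    # the line x = 2
p = np.cross(l, m)
print("intersection (Cartesian):", p[:2] / p[2])  # -> [2. 2.], which lies on both lines
```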
Pinhole Camera Model:
We now return to our problem of finding where a world point (X, Y, Z), given in Cartesian coordinates, lies on the image plane of a camera, given the location and orientation of said camera.
In particular, we find that the world point (X, Y, Z, 1) lies on the point (x', y', z') of the image plane, where both points are in homogeneous coordinates and are related by the following equation (f is a constant describing the focal length of the camera):

$$\begin{bmatrix} x' \\ y' \\ z' \end{bmatrix} = \begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$
From here, we can convert (x‘, y‘, z‘) into Cartesian coordinates (x, y) on the image plane as described earlier.
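In code, this projection is a single matrix multiplication followed by the homogeneous-to-Cartesian conversion described above (Python/NumPy sketch; the focal length and world point are made-up values):

```python
import numpy as np

f = 0.05  # focal length (arbitrary example value)

# 3x4 pinhole projection matrix from the equation above
P = np.array([[f, 0, 0, 0],
              [0, f, 0, 0],
              [0, 0, 1, 0]])

# World point (X, Y, Z) = (2, 1, 10) in homogeneous coordinates
Xw = np.array([2.0, 1.0, 10.0, 1.0])

xp = P @ Xw              # homogeneous image-plane point (x', y', z')
x, y = xp[:2] / xp[2]    # Cartesian image-plane point: (fX/Z, fY/Z)
print(x, y)              # -> 0.01 0.005
```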
It turns out that the matrix that transforms the world point into a point on the image plane can be factored nicely, giving us an intuitive way to understand what the matrix is doing: a projection onto the image plane followed by a scaling by the focal length.

$$\begin{bmatrix} f & 0 & 0 & 0 \\ 0 & f & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix} = \underbrace{\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{scale by } f} \underbrace{\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{bmatrix}}_{\text{project onto the image plane}}$$
However, since the coordinate system on images is different from Cartesian coordinates (the top left corner is (0,0) in pixel coordinates), we must perform one final transform to convert from homogeneous image plane coordinates (x’, y’, z’) into homogeneous pixel coordinates (u’, v’, w’).

Thus, we first scale our image plane coordinates (in order to convert them into pixels) by dividing by the size of a pixel, which is ρ_u in the u direction and ρ_v in the v direction. You can think of this as converting from a unit like meters into pixels. Next, we must translate the coordinates by some _u_0 and _v_0 so that the origin of the pixel coordinates ends up in the proper spot. Putting these two transformations together, we get the following formula:

$$\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix} = \begin{bmatrix} 1/\rho_u & 0 & u_0 \\ 0 & 1/\rho_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x' \\ y' \\ z' \end{bmatrix}$$
Finally, putting everything together, we can complete our camera model by converting the world point (X, Y, Z, 1) into pixel coordinates (u', v', w'), where both are given in homogeneous coordinates:

$$\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix} = \underbrace{\begin{bmatrix} 1/\rho_u & 0 & u_0 \\ 0 & 1/\rho_v & v_0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic parameters}} \underbrace{\left[\, R \mid t \,\right]}_{\text{extrinsic parameters}} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix} = C \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$
The extrinsic parameters give us information about the orientation (stored in the rotation matrix R) and the position (stored in the translation vector t) of the camera.
In practice, we often don't know all of these parameters, so we instead approximate the camera matrix C, which is just some 3×4 matrix (by the properties of matrix multiplication); this is accomplished by calibrating the camera.
Furthermore, one important property of the camera matrix is that scaling its entries gives a new camera matrix that describes the same camera. This is because if we multiply C by some constant λ, then all the entries of the homogeneous pixel coordinate are also scaled by λ. However, when we convert from homogeneous to Cartesian coordinates (as shown earlier), the λ's cancel out, leaving the same Cartesian coordinate as before. Thus, it has become convention to set the lower right entry of the camera matrix to 1, since the scale factor is arbitrary.
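Putting the pieces together in code (a NumPy sketch; every parameter value below is invented purely for illustration), we can build a camera matrix C = K[R | t], project a world point to pixel coordinates, and confirm that scaling C leaves the result unchanged:

```python
import numpy as np

# Intrinsic parameters: focal length, pixel sizes, principal point (all invented values)
f, rho_u, rho_v, u0, v0 = 0.008, 1e-5, 1e-5, 320.0, 240.0
K = np.array([[f / rho_u, 0.0,       u0],
              [0.0,       f / rho_v, v0],
              [0.0,       0.0,       1.0]])

# Extrinsic parameters: a small rotation about the Y axis and a translation along Z
a = np.deg2rad(10)
R = np.array([[ np.cos(a), 0.0, np.sin(a)],
              [ 0.0,       1.0, 0.0      ],
              [-np.sin(a), 0.0, np.cos(a)]])
t = np.array([[0.0], [0.0], [0.5]])

# The 3x4 camera matrix
C = K @ np.hstack([R, t])

# Project a world point (X, Y, Z, 1) and convert to Cartesian pixel coordinates
Pw = np.array([0.2, 0.1, 4.0, 1.0])
uvw = C @ Pw
print("pixel:", uvw[:2] / uvw[2])

# Scaling C by any non-zero constant describes the same camera
uvw2 = (3.7 * C) @ Pw
print("pixel (scaled C):", uvw2[:2] / uvw2[2])   # identical result
```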

With these equations defined, we are very close to attaining a formula for the homography.
Equation of Homography:
Once again, let us examine the pinhole camera setup. In this setup, our camera takes a picture of points P=(X, Y, Z) that lie on a plane, where a point P appears as (u', v', w') in homogeneous pixel coordinates on the image plane. Now, we can apply a clever trick with our choice of coordinate system: we let the X and Y axes lie in the plane so that the Z axis points out of the plane. Therefore, the Z coordinate of every point on the plane is 0, meaning that P is of the form P=(X, Y, 0).

Now, substituting this into our camera equation and using Z=0, we get the following simplification (the third column of C is multiplied by 0 and drops out, leaving a 3×3 matrix):

$$\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix} = C \begin{bmatrix} X \\ Y \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} c_{11} & c_{12} & c_{14} \\ c_{21} & c_{22} & c_{24} \\ c_{31} & c_{32} & c_{34} \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}$$
This new matrix is known as the homography matrix H and has 8 unknown entries (since the entry at the lower right is fixed to 1). Conventionally, the entries of the planar homography matrix are denoted with H rather than C, as shown below:

$$\begin{bmatrix} u' \\ v' \\ w' \end{bmatrix} = \begin{bmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{bmatrix} \begin{bmatrix} X \\ Y \\ 1 \end{bmatrix}$$
Our final task is to estimate the entries of H. Since there are 8 unknowns in the homography matrix, and each world point on the plane gives us 2 equations (one for each of its pixel coordinates), we need to calibrate the camera on at least 4 world points in order to estimate all the entries of H.
Recapping, a homography is a transformation that we can use to transform one image into another when both images are pictures of the same plane, but from different perspectives. Mathematically, this transformation is carried out by the homography matrix, a 3×3 matrix that has 8 unknowns and can be estimated by calibrating the images with 4 corresponding points (using more points gives a better approximation, but 4 is the minimum required).
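As a sketch of how this estimation looks in practice (Python with OpenCV and NumPy assumed; the four point pairs below are made-up coordinates), cv2.getPerspectiveTransform solves for H exactly from the minimum 4 correspondences, while cv2.findHomography accepts more points for a better fit:

```python
import cv2
import numpy as np

# Four corresponding points on the same plane, as seen in two different images (pixel coordinates)
src = np.float32([[10, 10], [300, 20], [290, 200], [15, 210]])
dst = np.float32([[40, 35], [320, 60], [280, 240], [30, 230]])

# Exact solution from exactly 4 correspondences
H = cv2.getPerspectiveTransform(src, dst)
print(H / H[2, 2])   # normalize so the lower-right entry is 1, as in the convention above

# With 4 or more correspondences, findHomography returns a least-squares (optionally robust) estimate
H2, mask = cv2.findHomography(src, dst)
```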

All in all, homographies are a powerful tool with applications in areas like augmented reality (projecting images onto the environment) and image stitching (combining multiple images to create a larger panorama).
References
https://en.wikipedia.org/wiki/Homogeneous_coordinates