Have you ever wondered what it would be like to live in Flatland? I don’t know about you, but I feel no inclination to downgrade the universe to two dimensions. I am satisfied with the possibilities offered by a 3D space, such as tying knots in my shoelaces and other topological gratifications.
When coding automated inspection systems, it is common to have to venture into Flatland. For example, you might consider an inspection system where parts are moving on a conveyor belt and a camera is grabbing images from the top. In this scenario, the interesting objects are constrained to the plane of the conveyor belt, so the scene is essentially two-dimensional. Another example would be a case where the object to be inspected has planar surfaces. It might be desirable to locate these surfaces in the image to check for the presence of defects or to validate that some features are present.
In the general case of a 3D scene grabbed by a single camera, there is not a one-to-one relationship between the 3D coordinates of points in the scene and the 2D coordinates in the image: multiple 3D points will be projected to a given image point. You can visualize this by imagining that you’re taking a picture through a series of transparent windows. If there is a fly on each window and every fly is aligned with your camera’s focal point, you will only see a single fly: the closest one will occlude the others, because all the 3D points occupied by the flies are projected to the same pixel. In other words, there is a many-to-one relationship between the 3D coordinates in the scene and the 2D coordinates in the image.

When the interesting features all lie in a single plane¹, the situation is different. You can write down a linear relationship between the plane of interest (which lies in the 3D world) and the image plane (where coordinates are in pixels). In this special case, there is a one-to-one correspondence between the 2D coordinates of the points in the plane of interest – dropping the Z coordinate, which we’ll arbitrarily set to 0 – and the 2D coordinates in the image: it is a homography. Let’s consider this mapping between a camera image plane and a plane of interest in the 3D scene.

A homography? That sounds great. Now, how to compute it?
The key is to have correspondences between points with known coordinates in the plane, typically in millimeters, and their coordinates in the image, in pixels. We know that, with a pinhole camera model, these corresponding points are linked by a linear relationship²:

[X, Y, 1]ᵀ = T [x, y, 1]ᵀ
T is the 3×3 transformation matrix from pixels to millimeters. (X, Y) are the 2D coordinates of a point constrained to a plane of interest – with the reference axes in the plane – and (x, y) are its corresponding coordinates in pixels.
Why would we add this third dummy coordinate that is always set to one?
The transformation is a composition of scaling, shearing, rotation, and translation. These operations can be represented by 2×2 matrices and an additive 2×1 vector for the translation. By introducing the dummy third coordinate (i.e. a homogeneous representation), we can encapsulate the compound transformation in a single 3×3 matrix.
It is a matter of compactness.
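To make this concrete, here is a minimal numpy sketch – with purely illustrative values, not taken from the setup below – showing how a 2×2 linear part and a 2×1 translation are folded into a single 3×3 matrix acting on homogeneous coordinates:

```python
import numpy as np

# Illustrative 2x2 linear part (scaling, shearing, rotation) and 2x1 translation.
linear_part = np.array([[1.2, 0.1],
                        [-0.1, 1.2]])
translation = np.array([5.0, -3.0])

# Encapsulate both in a single 3x3 matrix; the last row [0, 0, 1]
# keeps the dummy third coordinate equal to one.
T = np.eye(3)
T[:2, :2] = linear_part
T[:2, 2] = translation

# Applying the compound transformation is now a single matrix product.
x, y = 10.0, 20.0
X, Y, _ = T @ np.array([x, y, 1.0])
# Same result as: linear_part @ np.array([x, y]) + translation
```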
Our goal will be to compute the transformation matrix T, which will allow us to obtain the coordinates of a point in the scene from its corresponding coordinates in the image. Let’s write the transformation equation explicitly for a given correspondence (Xᵢ, Yᵢ, xᵢ, yᵢ), denoting by t₀₀ … t₁₂ the entries of the first two rows of T (the third row is fixed to [0, 0, 1]):

Xᵢ = t₀₀ xᵢ + t₀₁ yᵢ + t₀₂
Yᵢ = t₁₀ xᵢ + t₁₁ yᵢ + t₁₂
After reorganizing the equations in the form Az = b, with z containing our unknowns {t₀₀, …, t₁₂}:

A = [xᵢ  yᵢ  1  0   0   0]
    [0   0   0  xᵢ  yᵢ  1]

z = [t₀₀, t₀₁, t₀₂, t₁₀, t₁₁, t₁₂]ᵀ,  b = [Xᵢ, Yᵢ]ᵀ
We see that each correspondence supplies us with two linear equations in six unknowns. To solve this system of linear equations, we need at least three correspondences to get a minimum of six linearly independent equations. In practice, we’ll try to get more than three correspondences and solve an overdetermined system with a least-squares method.
The following code and the image can be found in this GitHub repository.
The image checkerboard.png shows a planar surface. The X and Y axes have been annotated with blue lines.

Measurements on the physical object allow us to map 2D coordinates in millimeters to their corresponding coordinates in pixels.

With four such correspondences (spread as far apart from each other as possible), the overdetermined system of equations can be solved with the numpy function numpy.linalg.lstsq().
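As a sketch of how this might look (the pixel and millimeter values below are placeholders for illustration; the actual correspondences measured on checkerboard.png are in the repository):

```python
import numpy as np

# Each correspondence pairs (x, y) in pixels with (X, Y) in millimeters.
# Placeholder values for illustration; the real measurements live in the repository.
correspondences = [
    ((120.0,  85.0), (  0.0,   0.0)),
    ((610.0, 110.0), (250.0,   0.0)),
    ((590.0, 470.0), (250.0, 180.0)),
    ((140.0, 450.0), (  0.0, 180.0)),
]

# Each correspondence contributes two rows to A and two entries to b.
A = []
b = []
for (x, y), (X, Y) in correspondences:
    A.append([x, y, 1.0, 0.0, 0.0, 0.0])
    A.append([0.0, 0.0, 0.0, x, y, 1.0])
    b.extend([X, Y])

# Least-squares solution of the overdetermined system A z = b.
z, residuals, rank, singular_values = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)

# Assemble the 3x3 transformation matrix from the six entries of z.
T = np.array([[z[0], z[1], z[2]],
              [z[3], z[4], z[5]],
              [0.0,  0.0,  1.0]])
```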
With the {t₀₀, …, t₁₂} entries (denoted by the vector z in the code snippet above), we have our transformation matrix T. With it, we can get the millimeter coordinates of a point from its pixel coordinates. For example, pre-multiplying the pixel coordinates (399, 500) by T:
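Continuing the sketch above (the numerical result quoted next comes from the correspondences measured on checkerboard.png, not from the placeholder values):

```python
# Homogeneous pixel coordinates, pre-multiplied by T, give millimeter coordinates.
X, Y, _ = T @ np.array([399.0, 500.0, 1.0])
print(X, Y)  # with the repository's measured correspondences: roughly 72.3, 74.8
```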

The prediction (72.3 mm, 74.8 mm) is within 4 mm of the measured coordinates (71.4 mm, 71.4 mm). Conversely, by inverting T, we can get the pixel coordinates of a point for which we have the millimeter coordinates. For example, the point (107.1 mm, 107.1 mm) is three squares to the right and three squares up from the origin.

The inverse transformation predicts that its pixel coordinates after rounding should be (496, 370):
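In the same sketch, the inverse mapping might look like this (again, the quoted pixel coordinates correspond to the repository’s measurements):

```python
# Inverting T maps millimeter coordinates back to homogeneous pixel coordinates.
T_inv = np.linalg.inv(T)
x, y, _ = T_inv @ np.array([107.1, 107.1, 1.0])
print(round(x), round(y))  # with the repository's measured correspondences: 496, 370
```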

As we can see, the predictions are close, but not exact. This is probably because we assumed a simple pinhole camera model. To go beyond the pinhole camera model, calibration of our imaging system would be required to take into account any non-linear distortions.
We’ve covered the steps to build the transformation matrix giving the linear relationship between coordinates in pixels and coordinates in millimeters when we are dealing with a planar scene. It is a basic computer vision operation that often needs to be performed in automated inspection. In practice, the part to be inspected may have a few fiducial markers or distinct features that are easy to locate. The known 2D locations of these features on the plane (in millimeters) and their detected image locations (in pixels) constitute the set of correspondences used to compute the transformation matrix T. Then, any object on the plane whose real-world coordinates are known can be located in the image, even if this object is difficult to detect, or even absent.
Please feel free to experiment with the code, and let me know about use cases where you think it could be applicable.
¹Assuming the plane of interest doesn’t go through the camera’s focal point.
²A pinhole camera model ignores lens distortion and out-of-focus blur. It is a first-order approximation to a real camera.