3-D Reconstruction with Vision

Venkatesh Tata
Towards Data Science
19 min read · May 9, 2020


Source: Continuous Ratio Optimization via Convex Relaxation with Applications to Multiview 3D Reconstruction, paper by Kalin Kolev and Daniel Cremers.

Exactly a year before I started writing this article, I watched Andrej Karpathy, the director of AI at Tesla, deliver a talk in which he showed the world a glimpse of how a Tesla car perceives depth using the cameras hooked to the car in order to reconstruct its surroundings in 3-D and make decisions in real time; everything (except the front radar, kept for safety) was being computed with vision alone. And that presentation blew my mind!

Of course, I knew 3-D reconstruction of an environment is possible with cameras, but my mindset was: why would anyone risk using an ordinary camera when we have highly accurate sensors like LiDAR and radar that can give an accurate representation of the environment in 3-D with far less computation? So I started studying (trying to understand) papers on depth perception and 3-D reconstruction from vision, and came to this conclusion: we humans have never had rays coming out of our heads to perceive depth and the environment around us. We are intelligent and aware of our surroundings with just the two eyes we have, whether driving a car or bike to and from the office, or driving a Formula 1 car at 230 mph on the world’s most dangerous tracks; we have never required lasers to make decisions in microseconds. The world around us was constructed by us, for us, beings with vision, and so, as Elon said, ‘these costly sensors would become pointless once we solve vision’.

There is a huge amount of research going on in this field of depth perception from vision; especially with the advancements in machine learning and deep learning, we are now able to compute depth from vision alone with high accuracy. So before we start learning the concepts and implementing these techniques, let us look at what stage this technology is currently at and what its applications are.

Robot Vision:

Environment Perception with ZED camera

Creating HD Maps for autonomous driving:

Depth Perception with Deep Learning

SfM (Structure from Motion) and SLAM (Simultaneous Localisation and Mapping) are among the major techniques that make use of the concepts I am going to introduce you to in this tutorial.

Demonstration of LSD-SLAM

Now that we’ve got enough inspiration to learn, I’ll start the tutorial. First I’m going to teach you the basic concepts required to understand what’s happening under the hood, and then we will apply them using the OpenCV library in C++. The question you might ask is why I am implementing these concepts in C++ when doing it in Python would be far easier, and there are reasons behind it. The first reason is that Python is not fast enough to implement these concepts in real time, and the second is that, unlike Python, C++ mandates an understanding of the concepts, without which one can’t implement them.

In this tutorial we are going to write two programs: one to get a depth map of a scene and another to obtain a point cloud of the scene, both using stereo vision.

Before we dive right into the coding part, it is important for us to understand the concepts of camera geometry, which I am going to teach you now.

The Camera Model

The process used to produce images has not changed since the beginning of photography. The light coming from an observed scene is captured by a camera through a frontal aperture (a lens), which projects the light onto an image plane located at the back of the camera lens. The process is illustrated in the figure below:

In the above figure, do is the distance from the lens to the observed object, di is the distance between the lens and the image plane, and f is the focal length of the lens. These quantities are related by the so-called “thin lens equation” shown below:
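Since the equation itself is simple to state, here it is for reference, writing d_o and d_i for the object and image distances described above:

1/f = 1/d_o + 1/d_i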

Now let us look at the process by which a 3-dimensional object from the real world is projected onto a 2-dimensional plane (a photograph). The best way for us to understand this is by taking a look at how a camera works.

A camera can be seen as a function that maps the 3-D world to a 2-D image. Let us take the simplest model of a camera, the pinhole camera model, one of the oldest photographic mechanisms in human history. Below is a working diagram of a pinhole camera:

From this diagram we can derive :
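With an object of height h_o at distance d_o from the pinhole, and the image plane at distance f behind it (my notation for the quantities in the diagram), similar triangles give the size h_i of the projected image, up to the inversion of the image:

h_i = (f / d_o) * h_o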

Here it is natural that the size hi of the image formed of the object will be inversely proportional to the distance do of the object from the camera. Also, a 3-D scene point located at position (X, Y, Z) will be projected onto the image plane at (x, y), where (x, y) = (fX/Z, fY/Z); the Z coordinate refers to the depth of the point, as shown in the previous figure. This entire camera configuration and notation can be described with a simple matrix using the homogeneous coordinate system.

When cameras generate a projected image of the world, projective geometry is used as an algebraic representation of the geometry of objects, rotations and transformations in the real world.

Homogeneous coordinates are a system of coordinates used in projective geometry. Even though we can represent the positions of objects (or any point in 3-D space) in the real world in Euclidean space, any transformation or rotation that has to be performed must be carried out in homogeneous coordinate space and then brought back. Let us look at the advantages of using homogeneous coordinates:

  • Formulas involving Homogeneous Coordinates are often simpler than in the Cartesian world.
  • Points at infinity can be represented using finite coordinates.
  • A single matrix can represent all the possible projective transformations that can occur between a camera and the world.

In homogeneous coordinate space, 2-D points are represented by 3-vectors, and 3-D points are represented by 4-vectors.
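For example, a 2-D point (x, y) becomes the 3-vector (x, y, 1) and a 3-D point (X, Y, Z) becomes the 4-vector (X, Y, Z, 1), with any non-zero scaling of these vectors referring to the same point. In this notation, the projection described above takes the standard form below (s is an arbitrary scale factor):

s \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} =
\begin{bmatrix} f & 0 & 0 \\ 0 & f & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix}
\begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}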

In the above equations, the first matrix with the f notation is called the intrinsic parameter matrix (commonly known as the intrinsic matrix). Here the intrinsic matrix contains just the focal length (f) for now; we will look at more parameters of this matrix later in this tutorial.

The second matrix with the r and t notations is called the extrinsic parameter matrix (commonly known as the extrinsic matrix). The elements within this matrix represent the rotation and translation parameters of the camera, that is, where and how the camera is placed in the real world.

Thus these intrinsic and extrinsic matrices together give us a relation between an (x, y) point in the image and an (X, Y, Z) point in the real world. This is how a 3-D scene point is projected onto a 2-D plane, depending on the given camera’s intrinsic and extrinsic parameters.
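As a tiny sanity check of this relation, here is a self-contained sketch of mine (not part of the tutorial’s programs) that projects a single 3-D point with OpenCV matrices, using an intrinsic matrix that contains only the focal length f and an identity extrinsic matrix:

#include <opencv2/opencv.hpp>
#include <iostream>

int main() {
    double f = 500.0;                                    // focal length, in pixels
    cv::Mat K = (cv::Mat_<double>(3, 3) << f, 0, 0,
                                           0, f, 0,
                                           0, 0, 1);     // intrinsic matrix
    cv::Mat Rt = cv::Mat::eye(3, 4, CV_64F);             // extrinsic [R|t]: no rotation, no translation
    cv::Mat X = (cv::Mat_<double>(4, 1) << 1.0, 2.0, 10.0, 1.0); // homogeneous scene point (X, Y, Z, 1)
    cv::Mat x = K * Rt * X;                              // projection in homogeneous coordinates
    // Divide by the last coordinate to recover pixel coordinates (fX/Z, fY/Z)
    std::cout << x.at<double>(0) / x.at<double>(2) << ", "
              << x.at<double>(1) / x.at<double>(2) << std::endl;  // prints 50, 100
    return 0;
}

The printed values match (fX/Z, fY/Z) = (500·1/10, 500·2/10), exactly the projection formula from the previous section.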

Now that we have acquired enough knowledge about projective geometry and the camera model, it’s time to introduce you to one of the most important elements of computer vision geometry, the fundamental matrix.

The Fundamental Matrix

Now that we know how a point in the 3-D world is projected onto the image plane of a camera, we will look into the projective relationship that exists between two images displaying the same scene. When two cameras are separated by a rigid baseline, we use the term stereo vision. Consider two pinhole cameras observing a given scene point and sharing the same baseline, as shown in the figure below:

In the above figure, the world point X has its image at position x on the image plane, but X could be located anywhere along the ray through x in 3-D space. This implies that if we want to find the same point in the other image, we need to search along the projection of this ray onto the second image.

This imaginary line drawn from x is known as the epipolar line of x. The epipolar line brings a fundamental constraint with it: the match of a given point must lie on this line in the other view. It means that if you want to find x from the first image in the second image, you have to look for it along the epipolar line of x in the second image. These epipolar lines characterise the geometry between the two views. An important thing to note here is that all the epipolar lines always pass through one single point. This point corresponds to the projection of one camera’s centre onto the other camera’s image plane, and it is called the epipole.

We can consider the fundamental matrix F as the one that maps a 2-D image point in one view to an epipolar line in the other image view. The fundamental matrix between an image pair can be estimated by solving a set of equations that involve a certain number of known matched points between the two images. The minimum number of such matches is seven, and an optimal number is eight. Then, for a point in one image, the fundamental matrix gives the equation of the line on which its corresponding point in the other view should be found.

If the corresponding point of a point (x, y) is (x’, y’), and the fundamental matrix between the two image planes is F, then we must have the following equation in homogeneous coordinates.
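With x = (x, y, 1)^T and x' = (x', y', 1)^T written as homogeneous column vectors, that equation is the standard one:

x'^T F x = 0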

This equation expresses the relationship between two corresponding points and is known as the epipolar constraint.

Matching Good Image Points using RANSAC

When two cameras observe the same scene, they see the same objects but from different viewpoints. Libraries like OpenCV, in both C++ and Python, provide feature detectors that find points, with descriptors, that they consider unique to an image and findable in another image of the same scene. However, in practice it is not possible to guarantee that a matching set obtained between two images by comparing the descriptors of the detected feature points (SIFT, ORB, etc.) will be exact and true. This is why a fundamental matrix estimation method based on the RANSAC (Random Sample Consensus) strategy has been introduced.

The idea behind RANSAC is to randomly select some data points from a given set of data points and perform the estimation with only those. The number of selected points should be the minimum number required to estimate the mathematical entity, which in our case of the fundamental matrix is eight matches. Once the fundamental matrix is estimated from these eight random matches, all the other matches in the match set are tested against the epipolar constraint we discussed. The matches that satisfy it form the support set of the computed fundamental matrix.

The larger the support set, the higher the probability that the computed matrix is the right one. Conversely, if one of the randomly selected matches is an incorrect match, then the computed fundamental matrix will also be incorrect, and its support set will be expected to be small. This process is repeated a number of times, and in the end the matrix with the largest support set is retained as the most probable one.
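To make that loop concrete, here is a minimal sketch of it in C++. This is my own illustration: the function names are mine, and cv::findFundamentalMat with cv::FM_RANSAC, which we use below, does all of this internally and far more carefully.

#include <opencv2/opencv.hpp>
#include <random>
#include <vector>
#include <cmath>

// Distance from point p2 to the epipolar line of p1 in the second image
static double epipolarDistance(const cv::Mat& F, const cv::Point2f& p1, const cv::Point2f& p2) {
    cv::Mat x1 = (cv::Mat_<double>(3, 1) << p1.x, p1.y, 1.0);
    cv::Mat l = F * x1;  // epipolar line a*x + b*y + c = 0 in image 2
    double a = l.at<double>(0), b = l.at<double>(1), c = l.at<double>(2);
    return std::abs(a * p2.x + b * p2.y + c) / std::sqrt(a * a + b * b);
}

cv::Mat ransacFundamental(const std::vector<cv::Point2f>& pts1,
                          const std::vector<cv::Point2f>& pts2,
                          int iterations = 1000, double threshold = 1.0) {
    if (pts1.size() < 8 || pts1.size() != pts2.size()) return cv::Mat();
    std::mt19937 rng{std::random_device{}()};
    std::uniform_int_distribution<size_t> pick(0, pts1.size() - 1);
    cv::Mat bestF;
    size_t bestSupport = 0;
    for (int it = 0; it < iterations; ++it) {
        // 1. Draw a random minimal sample of eight matches
        std::vector<cv::Point2f> s1, s2;
        for (int k = 0; k < 8; ++k) { size_t i = pick(rng); s1.push_back(pts1[i]); s2.push_back(pts2[i]); }
        // 2. Estimate F from the sample with the eight-point algorithm
        cv::Mat F = cv::findFundamentalMat(s1, s2, cv::FM_8POINT);
        if (F.empty()) continue;  // degenerate sample
        // 3. Count the matches that satisfy the epipolar constraint (the support set)
        size_t support = 0;
        for (size_t i = 0; i < pts1.size(); ++i)
            if (epipolarDistance(F, pts1[i], pts2[i]) < threshold) ++support;
        // 4. Keep the matrix with the largest support set
        if (support > bestSupport) { bestSupport = support; bestF = F.clone(); }
    }
    return bestF;
}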

Computing Depth Map from Stereo Images

One reason humans evolved to be a species with two eyes is that it lets us perceive depth. When we organise cameras in a similar way in a machine, it’s called stereo vision. A stereo-vision system is generally made of two side-by-side cameras looking at the same scene; the following figure shows the setup of a stereo rig in an ideal, perfectly aligned configuration.

Under such an ideal configuration of cameras, as shown in the above figure, the cameras are separated only by a horizontal translation, and therefore all the epipolar lines are horizontal. This means that corresponding points have the same y coordinates, and the search is reduced to a one-dimensional line. When cameras are separated by such a pure horizontal translation, the projective equation of the second camera becomes:

The equation makes more sense when looking at the diagram below, which shows the general case with digital cameras:

where the point (uo, vo) is the pixel position at which the line passing through the principal point of the lens pierces the image plane. Here we obtain the relation:
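Writing B for the baseline, i.e. the horizontal distance between the two camera centres (my notation here), similar triangles between the two views give:

x - x' = f B / Z, that is, Z = f B / (x - x')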

Here, the term (x-x’) is called the disparity and Z is, of course, the depth. In order to compute a depth map from a stereo pair, the disparity of each pixel must be computed.

But in the practical world, obtaining such an ideal configuration is very difficult. Even if we place the cameras accurately, they will unavoidably include some extra translational and rotational components.

Fortunately, it is possible to rectify these images to produce the required horizontal epipolar lines, by using a robust matching algorithm that makes use of the fundamental matrix to perform the rectification.

Now let us start by obtaining a fundamental matrix for the below stereo images :

Stereo image pair

You can download the above images from my GitHub repository by clicking here. Before you start writing the code in this tutorial, make sure you have opencv and opencv-contrib libraries built on your computer. If they are not built, I would suggest you visit this link to install them if you are on Ubuntu.

Let’s Code!

#include <opencv2/opencv.hpp>
#include "opencv2/xfeatures2d.hpp"
using namespace std;
using namespace cv;
int main(){
cv::Mat img1, img2;
img1 = cv::imread("imR.png",cv::IMREAD_GRAYSCALE);
img2 = cv::imread("imL.png",cv::IMREAD_GRAYSCALE);

The first thing we did was include the required headers from opencv and opencv-contrib, which I asked you to build before starting this section. In the main() function we have initialised two variables of the cv::Mat type, a class of the OpenCV library; a Mat can hold data of any size, such as images, by dynamically allocating memory. Then, using cv::imread(), we have imported both images into img1 and img2. The cv::IMREAD_GRAYSCALE flag imports the images as grayscale.
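One small, optional guard I would add at this point (assuming the file names above): cv::imread returns an empty Mat when a path is wrong or the file is unreadable, so it is safer to check before going further.

// Abort early if either image failed to load (requires <iostream>)
if (img1.empty() || img2.empty()) {
    std::cerr << "Could not load imR.png / imL.png" << std::endl;
    return -1;
}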

// Define keypoints vector
std::vector<cv::KeyPoint> keypoints1, keypoints2;
// Define feature detector
cv::Ptr<cv::Feature2D> ptrFeature2D = cv::xfeatures2d::SIFT::create(74);
// Keypoint detection
ptrFeature2D->detect(img1,keypoints1);
ptrFeature2D->detect(img2,keypoints2);
// Extract the descriptor
cv::Mat descriptors1;
cv::Mat descriptors2;
ptrFeature2D->compute(img1,keypoints1,descriptors1);
ptrFeature2D->compute(img2,keypoints2,descriptors2);

Here we are making use of OpenCV’s SIFT feature detector in order to extract the required feature points from the images. If you want to understand more about how these feature detectors work, visit this link. The descriptors we obtained above describe each extracted point; this description is used in order to find the point in the other image.

// Construction of the matcher
cv::BFMatcher matcher(cv::NORM_L2);
// Match the two image descriptors
std::vector<cv::DMatch> outputMatches;
matcher.match(descriptors1,descriptors2, outputMatches);

BFMatcher takes the descriptor of one feature in the first set and matches it against all the features in the second set using a distance calculation, and the closest one is returned. We store all the matches returned by BFMatcher in the outputMatches variable of type std::vector<cv::DMatch>.
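If you find that too many of these matches are wrong, one common variant is Lowe’s ratio test with knnMatch. This is a sketch of mine, not part of the original pipeline; the 3-D reconstruction code later in this tutorial instead turns on cross-checking in the BFMatcher constructor.

// Keep a match only if its best distance is clearly smaller than the second best
cv::BFMatcher knnMatcher(cv::NORM_L2);
std::vector<std::vector<cv::DMatch>> knnMatches;
knnMatcher.knnMatch(descriptors1, descriptors2, knnMatches, 2);
std::vector<cv::DMatch> goodMatches;
for (const auto& m : knnMatches)
    if (m.size() == 2 && m[0].distance < 0.75f * m[1].distance)
        goodMatches.push_back(m[0]);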

// Convert keypoints into Point2f
std::vector<cv::Point2f> points1, points2;
for (std::vector<cv::DMatch>::const_iterator it= outputMatches.begin(); it!= outputMatches.end(); ++it) {
// Get the position of left keypoints
points1.push_back(keypoints1[it->queryIdx].pt);
// Get the position of right keypoints
points2.push_back(keypoints2[it->trainIdx].pt);
}

The acquired keypoints first need to be converted into cv::Point2f in order to be used with cv::findFundamentalMat, the function we will make use of for computing the fundamental matrix from the feature points we extracted. The two resulting vectors, points1 and points2, contain the corresponding point coordinates in the two images.

std::vector<uchar> inliers(points1.size(),0);
cv::Mat fundamental= cv::findFundamentalMat(
points1,points2, // matching points
inliers, // match status (inlier or outlier)
cv::FM_RANSAC, // RANSAC method
1.0, // distance to epipolar line
0.98); // confidence probability
cout<<fundamental; //include this for seeing fundamental matrix

And finally, we have called cv::findFundamentalMat: we pass in the matched points, a mask that receives the inlier/outlier status of each match, the RANSAC method, the maximum distance (1.0 pixel) from a point to its epipolar line for the match to be treated as an inlier, and the desired confidence level (0.98).
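If you would like to see what this matrix encodes, a short optional sketch (my addition, reusing the variables above) draws the epipolar lines of the left-image points onto the right image with cv::computeCorrespondEpilines:

// For each point of image 1, compute its epipolar line a*x + b*y + c = 0 in image 2
std::vector<cv::Vec3f> lines2;
cv::computeCorrespondEpilines(points1, 1, fundamental, lines2);
cv::Mat img2Lines;
cv::cvtColor(img2, img2Lines, cv::COLOR_GRAY2BGR);
for (const cv::Vec3f& l : lines2) {
    // Draw the line across the full image width (assumes the line is not vertical)
    cv::line(img2Lines,
             cv::Point(0, cvRound(-l[2] / l[1])),
             cv::Point(img2.cols, cvRound(-(l[2] + l[0] * img2.cols) / l[1])),
             cv::Scalar(0, 255, 0));
}
cv::imwrite("epilines.jpg", img2Lines);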

// Compute homographic rectification
cv::Mat h1, h2;
cv::stereoRectifyUncalibrated(points1, points2, fundamental,
img1.size(), h1, h2);
// Rectify the images through warping
cv::Mat rectified1;
cv::warpPerspective(img1, rectified1, h1, img1.size());
cv::Mat rectified2;
cv::warpPerspective(img2, rectified2, h2, img1.size());

As I explained previously in the tutorial, obtaining an ideal configuration of cameras without any error is very difficult in the practical world; hence OpenCV offers a rectification function that applies a homographic transformation to project the image plane of each camera onto a perfectly aligned virtual plane. This transformation is computed from a set of matched points and the fundamental matrix.

// Compute disparity
cv::Mat disparity;
cv::Ptr<cv::StereoMatcher> pStereo = cv::StereoSGBM::create(0, 32,5);
pStereo->compute(rectified1, rectified2, disparity);
cv::imwrite("disparity.jpg", disparity);

And finally, we’ve computed the disparity map. In the image below, the darker pixels represent objects nearer to the camera, and the lighter pixels represent objects farther from the camera. The white pixel noise you see in the output disparity map can be removed with some filters, which I won’t be covering in this tutorial.
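One note on that output: StereoSGBM returns a 16-bit fixed-point disparity map (values scaled by 16), so the result is easier to inspect if you normalise it to 8-bit before writing it out. A small optional step such as the one below (my addition) does the trick:

// Scale the 16-bit disparity to the 0-255 range for visualisation
cv::Mat disparity8;
cv::normalize(disparity, disparity8, 0, 255, cv::NORM_MINMAX, CV_8U);
cv::imwrite("disparity.jpg", disparity8);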

Now that we have successfully obtained a depth map from a given stereo pair, let us now try to re-project the obtained 2-D image points into 3-D space by making use of OpenCV’s viz module (Viz3d), which will help us render a 3-D point cloud.

But this time, rather than estimating a fundamental matrix from the given image points, we will project the points using an essential matrix.

Essential Matrix

The essential matrix can be seen as a fundamental matrix, but for calibrated cameras. We can also call it a specialisation of the fundamental matrix, where the matrix is computed using calibrated cameras, which means we must first acquire knowledge about our camera and its place in the world.
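The two matrices are directly related through the intrinsic matrices of the two cameras (writing K and K' for them; with a single camera, K = K'):

E = K'^T F K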

Hence, in order for us to estimate the essential matrix, we first need the intrinsic matrix of the camera (a matrix that represents the optical centre and focal length of the given camera). Let us take a look at the equation below:

Here, in the first matrix, fx and fy represent the focal length of the camera and (uo, vo) is the principal point. This is the intrinsic matrix, and our goal is to estimate it.
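Written out (ignoring the skew term, which is usually negligible), the intrinsic matrix we are after has the form:

K = \begin{bmatrix} f_x & 0 & u_0 \\ 0 & f_y & v_0 \\ 0 & 0 & 1 \end{bmatrix}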

This process of finding the different camera parameters is known as camera calibration. We could obviously use the specifications provided by the camera manufacturer, but for tasks like the 3-D reconstruction we will be doing, these specifications are not accurate enough. Hence we are going to perform our own camera calibration.

The idea is to show the camera a set of scene points for which we know the actual 3-D positions in the real world, and then observe where these points are projected on the obtained image plane. With a sufficient number of 3-D points and associated 2-D image points, we can extract the exact camera parameters from a projective equation.

One way to do this is to take several images, from different viewpoints, of a set of 3-D points of the world with known 3-D positions. We are going to make use of OpenCV’s calibration methods, one of which takes images of chessboards as input and returns all the corners present. We can freely assume that the board is located at Z=0, with the X and Y axes well aligned with the grid. We are going to look at how these calibration functions of OpenCV work in the section below.

3-D Scene Reconstruction

Let us first create three functions, that we will use in the main function. These three functions will be

  • addChessBoardPoints() //returns corners from given chessboard images
  • calibrate() // returns the intrinsic matrix from the extracted points
  • triangulate() //returns the 3-D coordinate of re-constructed point
#include "CameraCalibrator.h"#include <opencv2/opencv.hpp>
#include "opencv2/xfeatures2d.hpp"
using namespace std;
using namespace cv;
std::vector<cv::Mat> rvecs, tvecs;

// Open chessboard images and extract corner points
int CameraCalibrator::addChessboardPoints(
const std::vector<std::string>& filelist,
cv::Size & boardSize) {
// the points on the chessboard
std::vector<cv::Point2f> imageCorners;
std::vector<cv::Point3f> objectCorners;
// 3D Scene Points:
// Initialize the chessboard corners
// in the chessboard reference frame
// The corners are at 3D location (X,Y,Z)= (i,j,0)
for (int i=0; i<boardSize.height; i++) {
for (int j=0; j<boardSize.width; j++) {
objectCorners.push_back(cv::Point3f(i, j, 0.0f));
}
}
// 2D Image points:
cv::Mat image; // to contain chessboard image
int successes = 0;
// for all viewpoints
for (int i=0; i<filelist.size(); i++) {
// Open the image
image = cv::imread(filelist[i],0);
// Get the chessboard corners
bool found = cv::findChessboardCorners(
image, boardSize, imageCorners);
// Get subpixel accuracy on the corners
cv::cornerSubPix(image, imageCorners,
cv::Size(5,5),
cv::Size(-1,-1),
cv::TermCriteria(cv::TermCriteria::MAX_ITER +
cv::TermCriteria::EPS,
30, // max number of iterations
0.1)); // min accuracy
// If we have a good board, add it to our data
if (imageCorners.size() == boardSize.area()) {
// Add image and scene points from one view
addPoints(imageCorners, objectCorners);
successes++;
}
//Draw the corners
cv::drawChessboardCorners(image, boardSize, imageCorners, found);
cv::imshow("Corners on Chessboard", image);
cv::waitKey(100);
}
return successes;
}

In the above code, you can observe that we have included a header file, “CameraCalibrator.h”, which contains all the function declarations and variable initialisations for this file. You can download the header along with all the other files in this tutorial from my GitHub by visiting this link.

Our function takes an array of image locations (the array must contain the location of each chessboard image) and the board size (you should enter the number of inner corners present in your board horizontally and vertically) as input parameters. For each image it calls OpenCV’s findChessboardCorners() function, which fills a vector with the detected corner locations and returns whether the full board was found.

double CameraCalibrator::calibrate(cv::Size &imageSize)
{
// undistorter must be reinitialized
mustInitUndistort= true;
// start calibration
return
calibrateCamera(objectPoints, // the 3D points
imagePoints, // the image points
imageSize, // image size
cameraMatrix, // output camera matrix
distCoeffs, // output distortion matrix
rvecs, tvecs, // Rs, Ts
flag); // set options
}

In this function, we used the calibrateCamera() function, which takes the 3-D points and image points we’ve obtained above and computes the intrinsic matrix, the distortion coefficients, the rotation vectors (which describe the rotation of the camera relative to the scene points) and the translation vectors (which describe the position of the camera relative to the scene points), returning the overall re-projection error.

cv::Vec3d CameraCalibrator::triangulate(const cv::Mat &p1, const cv::Mat &p2, const cv::Vec2d &u1, const cv::Vec2d &u2) {
// system of equations assuming image=[u,v] and X=[x,y,z,1]
// from u(p3.X)= p1.X and v(p3.X)=p2.X
cv::Matx43d A(u1(0)*p1.at<double>(2, 0) - p1.at<double>(0, 0),
u1(0)*p1.at<double>(2, 1) - p1.at<double>(0, 1),
u1(0)*p1.at<double>(2, 2) - p1.at<double>(0, 2),
u1(1)*p1.at<double>(2, 0) - p1.at<double>(1, 0),
u1(1)*p1.at<double>(2, 1) - p1.at<double>(1, 1),
u1(1)*p1.at<double>(2, 2) - p1.at<double>(1, 2),
u2(0)*p2.at<double>(2, 0) - p2.at<double>(0, 0),
u2(0)*p2.at<double>(2, 1) - p2.at<double>(0, 1),
u2(0)*p2.at<double>(2, 2) - p2.at<double>(0, 2),
u2(1)*p2.at<double>(2, 0) - p2.at<double>(1, 0),
u2(1)*p2.at<double>(2, 1) - p2.at<double>(1, 1),
u2(1)*p2.at<double>(2, 2) - p2.at<double>(1, 2));
cv::Matx41d B(p1.at<double>(0, 3) - u1(0)*p1.at<double>(2,3),
p1.at<double>(1, 3) - u1(1)*p1.at<double>(2,3),
p2.at<double>(0, 3) - u2(0)*p2.at<double>(2,3),
p2.at<double>(1, 3) - u2(1)*p2.at<double>(2,3));
// X contains the 3D coordinate of the reconstructed point
cv::Vec3d X;
// solve AX=B
cv::solve(A, B, X, cv::DECOMP_SVD);
return X;
}

This function takes the projection matrices and the normalised image points, which can be obtained using the intrinsic matrix from the previous function, and returns the 3-D coordinate of the reconstructed point.
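The linear system it builds comes straight from the projective equation: for an image point (u, v), a projection matrix P with rows p_1, p_2, p_3 and the homogeneous scene point X = (x, y, z, 1), we have

u (p_3 \cdot X) = p_1 \cdot X, \qquad v (p_3 \cdot X) = p_2 \cdot X

and stacking these two equations for both views gives the 4×3 system AX = B above, which cv::solve resolves in the least-squares sense with cv::DECOMP_SVD.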

Below is the complete code for 3-D reconstruction from a stereo pair. This code requires a minimum of 25 to 30 chessboard images taken with the same camera you shot your stereo pair images with. In order to run this code on your PC, clone my GitHub repo, replace the stereo pair with a stereo pair of your own, replace the chessboard image locations array with an array of your own, then build and compile. I am uploading an example chessboard image to my GitHub for your reference; you have to shoot around 30 such images and list them in the code.

int main(){
cout<<"compiled"<<endl;
const std::vector<std::string> files = {"boards/1.jpg"......};
cv::Size board_size(7,7);
CameraCalibrator cal;
cal.addChessboardPoints(files, board_size);
cv::Mat img = cv::imread("boards/1.jpg");
cv::Size img_size = img.size();
cal.calibrate(img_size);
cout<<cameraMatrix<<endl;
cv::Mat image1 = cv::imread("imR.png");
cv::Mat image2 = cv::imread("imL.png");
// vector of keypoints and descriptors
std::vector<cv::KeyPoint> keypoints1;
std::vector<cv::KeyPoint> keypoints2;
cv::Mat descriptors1, descriptors2;
// Construction of the SIFT feature detector
cv::Ptr<cv::Feature2D> ptrFeature2D = cv::xfeatures2d::SIFT::create(10000);
// Detection of the SIFT features and associated descriptors
ptrFeature2D->detectAndCompute(image1, cv::noArray(), keypoints1, descriptors1);
ptrFeature2D->detectAndCompute(image2, cv::noArray(), keypoints2, descriptors2);
// Match the two image descriptors
// Construction of the matcher with crosscheck
cv::BFMatcher matcher(cv::NORM_L2, true);
std::vector<cv::DMatch> matches;
matcher.match(descriptors1, descriptors2, matches);
cv::Mat matchImage;
cv::namedWindow("img1");
cv::drawMatches(image1, keypoints1, image2, keypoints2, matches, matchImage, Scalar::all(-1), Scalar::all(-1), vector<char>(), DrawMatchesFlags::NOT_DRAW_SINGLE_POINTS);
cv::imwrite("matches.jpg", matchImage);
// Convert keypoints into Point2f
std::vector<cv::Point2f> points1, points2;
for (std::vector<cv::DMatch>::const_iterator it = matches.begin(); it != matches.end(); ++it) {
// Get the position of left keypoints
float x = keypoints1[it->queryIdx].pt.x;
float y = keypoints1[it->queryIdx].pt.y;
points1.push_back(cv::Point2f(x, y));
// Get the position of right keypoints
x = keypoints2[it->trainIdx].pt.x;
y = keypoints2[it->trainIdx].pt.y;
points2.push_back(cv::Point2f(x, y));
}
// Find the essential between image 1 and image 2
cv::Mat inliers;
cv::Mat essential = cv::findEssentialMat(points1, points2, cameraMatrix, cv::RANSAC, 0.9, 1.0, inliers);
cout<<essential<<endl;
// recover relative camera pose from essential matrix
cv::Mat rotation, translation;
cv::recoverPose(essential, points1, points2, cameraMatrix, rotation, translation, inliers);
cout<<rotation<<endl;
cout<<translation<<endl;
// compose projection matrix from R,T
cv::Mat projection2(3, 4, CV_64F); // the 3x4 projection matrix
rotation.copyTo(projection2(cv::Rect(0, 0, 3, 3)));
translation.copyTo(projection2.colRange(3, 4));
// compose generic projection matrix
cv::Mat projection1(3, 4, CV_64F, 0.); // the 3x4 projection matrix
cv::Mat diag(cv::Mat::eye(3, 3, CV_64F));
diag.copyTo(projection1(cv::Rect(0, 0, 3, 3)));
// to contain the inliers
std::vector<cv::Vec2d> inlierPts1;
std::vector<cv::Vec2d> inlierPts2;
// create inliers input point vector for triangulation
int j(0);
for (int i = 0; i < inliers.rows; i++) {
if (inliers.at<uchar>(i)) {
inlierPts1.push_back(cv::Vec2d(points1[i].x, points1[i].y));
inlierPts2.push_back(cv::Vec2d(points2[i].x, points2[i].y));
}
}
// undistort and normalize the image points
std::vector<cv::Vec2d> points1u;
cv::undistortPoints(inlierPts1, points1u, cameraMatrix, distCoeffs);
std::vector<cv::Vec2d> points2u;
cv::undistortPoints(inlierPts2, points2u, cameraMatrix, distCoeffs);
// Triangulation
std::vector<cv::Vec3d> points3D;
cal.triangulate(projection1, projection2, points1u, points2u, points3D);
cout<<"3D points :"<<points3D.size()<<endl;viz::Viz3d window; //creating a Viz window//Displaying the Coordinate Origin (0,0,0)
window.showWidget("coordinate", viz::WCoordinateSystem());
window.setBackgroundColor(cv::viz::Color::black());//Displaying the 3D points in green
window.showWidget("points", viz::WCloud(points3D, viz::Color::green()));
window.spin();
}

I know that Medium’s code display is a mess, especially for C++, which is why even the code here looks messy, so I would suggest you go to my GitHub and go through the above code there.

The output for the given pair looks, for me, like the image below; it can be improved by tuning the feature detector and its parameters.

For anyone out there who is interested in learning these concepts in depth, I would suggest the book below, which I think is the bible of computer vision geometry and is also the reference book for this tutorial.

Multiple View Geometry in Computer Vision, 2nd Edition, by Richard Hartley and Andrew Zisserman.

If you have any questions, please let me know in the comments section.

Thank You.
