
"Life is like a landscape. You live in the midst of it but can describe it only from the vantage point of distance" – Charles Lindbergh
Distance metrics are essential to understanding many machine learning algorithms, and therefore to solving real-world problems. There are numerous distance metrics out there, and data scientists should understand most of them to make their models more meaningful.
For a geospatial data scientist, there is an added benefit to this exercise: creating features from longitude and latitude.
Longitude and latitude, while represented as floats, behave more like categorical or nominal data: increasing or decreasing their magnitude does not, by itself, give you or your model anything meaningful.
Location data therefore needs additional feature engineering to produce valuable insights. These newly engineered features are what we will use as inputs to machine learning models.
DISTANCE METRICS
The most important feature to derive from a set of geocodes (longitude and latitude) is distance. Many supervised and unsupervised machine learning models use distance metrics as inputs. Distance metrics measure the similarity between two or more objects.
Distance metrics play a crucial role in the development and resolution of real-world problems. For example, distance metrics are used in many computer vision tasks, in sentiment analysis, and even in clustering algorithms.
It goes without saying that any geospatial analyst should understand the different types of distance metrics and what types of problems they solve.
Choosing the correct distance metric can therefore be the difference between a successful and a failed model implementation.
In this article, let us discuss some of the most commonly used distance metrics and write some code to implement them in Python. There will be some mathematical discussion, but you can skip it and read the pros and cons instead. For each metric, we will cover the pros and cons, some mathematical intuition, where the metric is most appropriate, and the actual code.
Euclidean Distance
Although there are other possible choices, most instance-based learners use Euclidean distance. – p. 135, Data Mining: Practical Machine Learning Tools and Techniques (4th edition, 2016).
Euclidean distance is the easiest and most obvious way of representing the distance between two points.

Because it is an application of the Pythagorean theorem, it is also called the Pythagorean distance.
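For two points p = (x1, y1) and q = (x2, y2), it is simply the length of the straight line joining them:

$$d(p, q) = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$$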

Pros: Euclidean distance is relatively easy to implement and is already used by most clustering algorithms. Likewise, it is easy to explain and visualize. Finally, for small distances, it can be argued that the distance between two points is effectively the same whether they lie on a flat or a spherical surface.
Cons: It seldom approximates the true distance between two objects in the real world. For one, distance in a low-dimensional space, say 2D Euclidean space, becomes less meaningful in higher-dimensional spaces, where the nearest and farthest data points become almost uniformly distant from one another.
While most distance metrics suffer from this problem, it is more pronounced for the Euclidean distance. In addition, because it is the shortest straight-line distance between two points, it disregards real-world structures such as roads and terrain, which lowers the accuracy of this measurement for geospatial problems.
When is it most applicable: Because it ignores real-world structures, Euclidean distance is best for emergency cases where, say, a helicopter can fly in a straight line to a hospital. Another documented use case is trip planning, where you simply need to determine which landmarks are close to one another.
Code: As longitude and latitude are not really Cartesian coordinates, we need to convert the resulting degree distance into kilometres, taking into account the spherical shape of the Earth.
import numpy as np
import math

# Origin latitude, longitude
origin = [14.5545901, 120.9981703]       # Makati coordinates
destination = [14.1172947, 120.9339132]  # Tagaytay coordinates

def euclidean_distance(origin, destination):
    # Euclidean distance in degrees
    distance = np.sqrt((origin[0] - destination[0])**2 + (origin[1] - destination[1])**2)
    # Convert degrees to kilometres: multiply by earth's radius (6371 km) * pi/180
    return 6371 * (math.pi / 180) * distance
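As a quick sanity check, calling the function on the Makati and Tagaytay coordinates defined above should return a value of roughly 49 km under this flat-surface approximation:

print(euclidean_distance(origin, destination))  # roughly 49 km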


The Great Circle Distance
Unlike the Euclidean distance, the great circle distance considers the fact that two points lie on the surface of a sphere.



To get the great circle distance, we apply the Haversine formula.
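In its commonly used form, with φ for latitude, λ for longitude, and R for the Earth's radius (about 6,371 km), the formula is:

$$a = \sin^2\left(\frac{\Delta\varphi}{2}\right) + \cos\varphi_1 \cos\varphi_2 \sin^2\left(\frac{\Delta\lambda}{2}\right)$$

$$d = 2R \cdot \operatorname{atan2}\left(\sqrt{a}, \sqrt{1 - a}\right)$$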
Pros: The majority of geospatial analysts agree that this is the appropriate distance to use for distances on the Earth's surface, and it is argued to be more accurate than the Euclidean distance over longer distances. In addition, coding it is straightforward despite the apparent complexity of the formula. It is also faster to compute than alternatives such as Vincenty's formula.
Cons: It is slower than the spherical law of cosines formula. In addition, it does not necessarily produce the driving or walking distance, which may be the actual variable of interest.
When is it most applicable: Most geospatial analysts would argue that this should be the default for calculating distances between two geographic points.
Code: Haversine distance is the basic formula I used for my distance calculations. While there are packages that readily calculate it, let us try coding it from scratch:
def great_circle_distance(origin_lat, origin_lon, destination_lat, destination_lon):
    r = 6371  # earth radius in km
    phi1 = np.radians(origin_lat)
    phi2 = np.radians(destination_lat)
    delta_phi = np.radians(destination_lat - origin_lat)
    delta_lambda = np.radians(destination_lon - origin_lon)
    # Haversine formula
    a = np.sin(delta_phi / 2)**2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2)**2
    res = r * (2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a)))
    return np.round(res, 2)
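Calling this on the same Makati and Tagaytay coordinates should again give a value of roughly 49 km, very close to the Euclidean estimate since the two points are not far apart:

print(great_circle_distance(origin[0], origin[1], destination[0], destination[1]))  # roughly 49 km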
Manhattan Distance (Taxicab Distance)
The Manhattan distance is a measure of the distance between two points that takes into account the perpendicular (grid) layout of a map. It is called the Manhattan distance because Manhattan is known for its grid or block layout, where streets intersect at right angles.

The formula for the Manhattan distance is as follows:
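In two dimensions, for points p = (x1, y1) and q = (x2, y2), it is the sum of the absolute differences of their coordinates:

$$d(p, q) = |x_1 - x_2| + |y_1 - y_2|$$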


Pros: Because it takes into account the grid layout of many locations, it is closer to the distances that navigation (GPS) applications report in cities. This taxicab geometry is also what underlies the L1 penalty in LASSO regression.
Cons: Applying the formula to geospatial analysis is not as straightforward as it looks. Because a city's street grid is usually rotated relative to the north-south axis, a rotation correction has to be applied to produce accurate results (about 28.9 degrees in Manhattan's case, according to those who have applied the formula there).
When is it most applicable: If the variable of interest is driving distance, this is more appropriate than either the Euclidean or the great circle distance. This is why it is also called the taxicab distance: it approximates the path of a taxicab driving around a grid-like city.
Code: The code for the Manhattan distance requires us to rotate the coordinates to align with the street grid. After doing this, we apply the great circle distance (Haversine formula) we coded earlier to each leg of the path:
def manhattan_distance(origin_lat, origin_lon, destination_lat, destination_lon):
    # Origin coordinates
    p = np.stack(np.array([origin_lat, origin_lon]).reshape(-1, 1), axis=1)
    # Destination coordinates
    d = np.stack(np.array([destination_lat, destination_lon]).reshape(-1, 1), axis=1)
    theta1 = np.radians(-28.904)
    theta2 = np.radians(28.904)
    # Rotation matrices (into and out of the street-grid frame)
    R1 = np.array([[np.cos(theta1), np.sin(theta1)],
                   [-np.sin(theta1), np.cos(theta1)]])
    R2 = np.array([[np.cos(theta2), np.sin(theta2)],
                   [-np.sin(theta2), np.cos(theta2)]])
    # Rotate origin and destination coordinates by about -29 degrees
    pT = R1 @ p.T
    dT = R1 @ d.T
    # Coordinates of the hinge point in the rotated world
    vT = np.stack((pT[0, :], dT[1, :]))
    # Coordinates of the hinge point in the real world
    v = R2 @ vT
    # Manhattan distance = sum of the two great circle legs via the hinge point
    return (great_circle_distance(p.T[0], p.T[1], v[0], v[1]) +
            great_circle_distance(v[0], v[1], d.T[0], d.T[1]))
If we compute the Manhattan distance using our original origin and destination points:
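A minimal call, reusing the origin and destination lists defined earlier:

print(manhattan_distance(origin[0], origin[1], destination[0], destination[1]))

Because the path is made up of two great circle legs that meet at a hinge point, the result should come out larger than the roughly 49 km great circle figure.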

As noted, this gives a much closer approximation to the driving distance reported by Google Maps than the straight-line metrics do.

COSINE SIMILARITY
While not normally used for geospatial problems, some distance metrics are worth discussing because they can be quite useful for complementary problems.
As a beginner, I often confused this with the spherical law of cosines, which is just another formula for the great circle distance. Note, however, that these are two different things, and cosine similarity is best known for its applications in text analysis.
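For two vectors A and B (for example, the word-count vectors of two documents), cosine similarity is the cosine of the angle between them:

$$\text{cosine similarity}(A, B) = \frac{A \cdot B}{\|A\|\,\|B\|} = \frac{\sum_i A_i B_i}{\sqrt{\sum_i A_i^2}\,\sqrt{\sum_i B_i^2}}$$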

Cosine similarity returns -1 for the most dissimilar documents and +1 for the most similar ones.
Suppose you want to determine which documents are similar. Other distance metrics would probably rank documents according to the number of repeated words, which tends to classify longer documents as more similar even when they are not.
Pros: As it measures only the angle between vectors and not their magnitude, it can produce accurate results, especially when checking whether a smaller piece of text is similar to, or part of, a larger one.
Cons: There may be cases where researchers define "similarity" in terms of magnitude or size, and in those cases cosine similarity may not be as useful.
When is it most applicable: Text and image processing problems.
Code: For this, let us use the implementation available in scikit-learn and try it on some real-life examples.
The following excerpts are from documents published by the CFA Institute. The first is a short excerpt from a publication on fixed income. The third is likewise from a different publication, but on the same topic of fixed income. The second, however, is from a publication on alternative investments.
If we were to use Euclidean distance, we might find the second and third documents to be more similar simply because they share a larger number of repeated words.
#Define the documents
asset_classes = "Globally, fixed-income markets represent the largest asset class in financial markets, and most investors' portfolios include fixed-income investments."
alternative_investments = '''
Assets under management in vehicles classified as alternative investments have grown rapidly since the mid-1990s.
This growth has largely occurred because of interest in these investments by institutions, such as endowment and pension funds, as well as by high-net-worth individuals seeking diversification and return opportunities. Alternative investments are perceived to behave differently from traditional investments. Investors may seek either absolute return or relative return.
Some investors hope alternative investments will provide positive returns throughout the economic cycle; this goal is an absolute return objective.
Alternative investments are not free of risk, however, and their returns may be negative and/or correlated with other investments, including traditional investments, especially in periods of financial crisis. Some investors in alternative investments have a relative return objective. A relative return objective, which is often the objective of portfolios of traditional investment, seeks to achieve a return relative to an equity or a fixed-income benchmark.
'''
fixed_income = '''
Globally, the fixed-income market is a key source of financing for businesses and governments.
In fact, the total market value outstanding of corporate and government bonds is significantly larger than that of equity securities. Similarly, the fixed-income market, which is also called the debt market or bond market, represents a significant investing opportunity for institutions as well as individuals.
Pension funds, mutual funds, insurance companies, and sovereign wealth funds, among others, are major fixed-income investors. Retirees who desire a relatively stable income stream often hold fixed-income securities.
Clearly, understanding how to value fixed-income securities is important to investors, issuers, and financial analysts. This reading focuses on the valuation of traditional (option-free) fixed-rate bonds, although other debt securities, such as floating-rate notes and money market instruments, are also covered.
'''
documents = [asset_classes,alternative_investments, fixed_income ]
Importing scikit-learn for text analysis:
# Scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the document-term matrix (ignoring English stop words)
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

# Convert to a dataframe so we can view it
doc_term_matrix = sparse_matrix.todense()
df = pd.DataFrame(doc_term_matrix,
                  columns=count_vectorizer.get_feature_names_out(),
                  index=['asset_classes', 'alternative_investments', 'fixed_income'])
Turning this into a dataframe:

After this, let us calculate the cosine similarity between the three documents:
# Compute Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
print(cosine_similarity(df, df))
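To make the output easier to read, the similarity matrix can optionally be wrapped in a labelled dataframe (sim_df below is just an illustrative name):

# Optional: label the 3x3 similarity matrix with the document names
sim_df = pd.DataFrame(cosine_similarity(df, df), index=df.index, columns=df.index)
print(sim_df.round(3))

The diagonal is 1.0 (each document compared with itself), and the two fixed income documents should score higher against each other than against the alternative investments one.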

CONCLUSION
There are many more distance metrics, but for now let us focus on these four. Let me know what you think about them.
As we saw, there are many different distance metrics, each with its own strengths and each suited to different types of problems. For geospatial data scientists, it may be advantageous to try as many as possible and assess which ones are most relevant during the feature selection portion of the study.
Check out the code on my GitHub page.
References:
Four Types of Distance Metrics in Machine Learning
Why Manhattan Distance Formula Doesn’t Apply to Manhattan
Cosine Similarity – Understanding the math and how it works (with python codes)