Distance Metrics: Choosing the Right One for Your Machine Learning Model

Many Machine Learning (ML) approaches rely heavily on distance metrics to achieve visual recognition, data retrieval, and other applications. Technologies such as facial recognition and recommendation engines make wide use of ML to provide users with the data they need. For these algorithms to judge how similar or dissimilar two pieces of data are, they need a well-defined way of measuring the separation between data points. This measure of separation is referred to as a “distance metric”.

By accurately calculating the distance between data points, the right distance metric helps to reveal the structure of the input data, thus enabling more reliable conclusions. This enhances the effectiveness of both supervised and unsupervised machine learning algorithms when used to accomplish tasks such as classification, clustering, and information retrieval.

In this article, we will explore the various distance metrics available, examine the mathematics involved in their computation, and analyse their application to machine learning algorithms. This will enable readers to make informed decisions when selecting the best distance metric for their machine learning model.

The concept of Euclidean distance

The Euclidean distance is one of the most widely used distance measures in mathematics. It relies on the Pythagorean theorem to determine the length of the straight line connecting two points in a given coordinate system, which is the shortest possible path between them.

Many Machine Learning algorithms use Euclidean distance as their default metric to measure the proximity of two observed data points. It is particularly useful for measuring the separation between two rows (observations) of numerical data whose columns, such as weight, height or salary, are continuous, numeric variables. In short, Euclidean distance is an effective metric for evaluating the closeness of two data points when the observations contain numeric values.

If you were tasked with writing equivalent code in Python, it would read as follows:

import numpy as np

# Function to calculate the Euclidean Distance between two points
def euclidean(p, q) -> float:
    distance = 0
    for index, feature in enumerate(p):
        d = (feature - q[index]) ** 2
        distance = distance + d

    return np.sqrt(distance)
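As a quick, hypothetical sanity check (not part of the original article), the 3-4-5 right triangle gives the expected result:

p = [0, 0]
q = [3, 4]
print(euclidean(p, q))  # 5.0, the hypotenuse of a 3-4-5 triangle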

The Euclidean distance metric, which measures the straight-line distance between two real-valued vectors, is nicely illustrated by the “as the crow flies” measurement you see on a map such as Google Maps.

The Manhattan distance

Distances between points on a grid may be measured using the Manhattan distance, also known as the city block distance.

This measure excels when the dataset contains discrete or binary attributes, because the path it traces through the attribute values is intuitive.

To calculate this measure, use the following code snippet:

import numpy as np

# Function to calculate the Manhattan Distance between two points
def manhattan(a, b) -> int:
    distance = 0
    for index, feature in enumerate(a):
        d = np.abs(feature - b[index])
        distance = distance + d

    return distance
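For a quick, hypothetical check, the city block distance between (0, 0) and (3, 4) is the sum of the horizontal and vertical legs:

a = [0, 0]
b = [3, 4]
print(manhattan(a, b))  # 7, i.e. 3 blocks in one direction plus 4 in the other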

It is important to bear in mind that the Manhattan distance is less reliable when the dataset contains floating-point attributes. Although it can be used as an alternative to the Euclidean distance, it is not advisable for datasets with high-dimensional data.

The term “taxicab distance” is a colloquial name derived from the way a taxi travels between two locations in Manhattan: the driver cannot cut diagonally across blocks, so the distance covered is the sum of the horizontal and vertical segments of the street grid, and the shortest such route minimises the total length travelled.

The Minkowski distance

The Minkowski distance metric is a generalisation of both the Euclidean and Manhattan distances, expressing the spatial separation between two vectors in a single, parameterised form. It is frequently used in machine learning to measure how close two objects are to one another, and the Minkowski equation determines this distance, which can then be used to compare the relative positions of two objects.

The Minkowski distance is also referred to as the p-norm, where the parameter p expresses the order of the norm. Varying p allows different types of distances to be calculated from the same formula:

d(a, b) = ( Σᵢ |aᵢ − bᵢ|^p )^(1/p)

where

For p = 1, it becomes the Manhattan distance.
For p = 2, it becomes the Euclidean distance.
As p approaches infinity, it approaches the Chebyshev distance.

The Chebyshev distance is the largest absolute difference between two points along any single axis of the space.
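A minimal sketch of the Chebyshev distance, assuming two equal-length numeric sequences as inputs (a hypothetical helper, not part of the original code), could look like this:

# Hypothetical helper: Chebyshev distance as the largest per-axis absolute difference
def chebyshev(a, b) -> float:
    return max(abs(e1 - e2) for e1, e2 in zip(a, b))

print(chebyshev([0, 0], [3, 4]))  # 4, the larger of the two axis-wise differences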

The Python function to compute the Minkowski distance is as follows:

# Function to calculate the Minkowski Distance between two points
def minkowski_distance(a, b, p):
    return sum(abs(e1 - e2) ** p for e1, e2 in zip(a, b)) ** (1 / p)

# OR

from scipy.spatial import minkowski_distance

# row1 and row2 are the two numeric vectors being compared; p defaults to 2
dist = minkowski_distance(row1, row2)
print(dist)
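As a quick, hypothetical sanity check, setting p to 1 and 2 reproduces the Manhattan and Euclidean distances respectively:

row1 = [0, 0]
row2 = [3, 4]
print(minkowski_distance(row1, row2, 1))  # 7.0, matches the Manhattan distance
print(minkowski_distance(row1, row2, 2))  # 5.0, matches the Euclidean distance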

The Hamming distance

The Hamming distance compares two strings or vectors of the same length and is an exceptionally easy way of determining their similarity: it simply counts the number of positions at which the two values differ.

As an illustration:

A = [1, 2, 5, 8, 9, 0]
B = [1, 3, 5, 7, 9, 0]

Since the two vectors differ in exactly two positions, the Hamming distance in the example above is 2.

The formula for determining the Hamming distance is as follows:

d = min{ d(x, y) : x, y ∈ C, x ≠ y }

That is, for a set of codewords C, the minimum Hamming distance d is the smallest number of differing positions between any two distinct codewords x and y.

The Hamming distance can be an effective tool for detecting transmission errors between computers. To ensure meaningful results, only sequences of equal length should be compared. Furthermore, because it only records whether two values differ and not by how much, the Hamming distance may not be the most appropriate measure when the magnitude of the differences matters.

Python’s Hamming distance function is as follows:

# Function to calculate the Hamming Distance between two equal-length sequences
def hamming_distance(a, b):
    # Count the positions where the values differ, then divide by the length,
    # so the result is the fraction of differing positions (as SciPy reports it)
    return sum(e1 != e2 for e1, e2 in zip(a, b)) / len(a)

# OR

from scipy.spatial.distance import hamming

# row1 and row2 are the two equal-length sequences being compared
dist = hamming(row1, row2)
print(dist)
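Applying either version to the A and B vectors above (a hypothetical check) shows that they return the fraction of differing positions rather than a raw count:

A = [1, 2, 5, 8, 9, 0]
B = [1, 3, 5, 7, 9, 0]
print(hamming_distance(A, B))           # 0.333..., 2 of the 6 positions differ
print(hamming_distance(A, B) * len(A))  # 2.0, the raw Hamming distance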

When building a machine learning model, it is essential to understand the different distance measurements available and how to choose an appropriate metric for your dataset. Familiarising oneself with the characteristics of each distance metric, as well as the techniques for writing them in Python, provides the necessary insight required to select the most suitable option for the given model.
