Euclidean distance
drawback:
- does not fit for demonsion higher than 3d
- different features have different units, cannot be standarized.
from scipy.spatial import distance
distance.euclidean(vector_1, vector_2)
Manhattan distance
calculated based on the fact that a person can only move at rectangular angles, drawback:
- not as intuitive as Euclidean distance
- may be not shortest distance
from scipy.spatial import distance
distance.cityblock(vector_1, vector_2)
Chebyshev distance
the maximum distance in any dimension between two real-valued vectors
from scipy.spatial import distance
distance.chebyshev(vector_1, vector_2)
Minkowski distance
a generalized form of the all the above distance metrics
from scipy.spatial import distance
distance.minkowski(vector_1, vector_2, p)
Cosine similarity
drawback:
- calculated only angle difference not length differnce
from scipy.spatial import distance
distance.cosine(vector_1, vector_2)
Haversine distance
calculated the shortest distance between two points on a sphere
from sklearn.metrics.pairwise import haversine_distances
haversine_distances([vector_1, vector_2])
Hamming distance
Compare vectors element-wise and average the number of differences. If the two vectors are the same, the resulting distance is between 0, and if the two vectors are completely different, the resulting distance is 1. drawback:
- only compare vectors of the same length
- it cannot give the magnitude of the difference
from scipy.spatial import distance distance.hamming(vector_1, vector_2)
Jaccard Index
Jaccard index is used to determine the similarity between two sample sets. It reflects how many one-to-one matches exist compared to the entire dataset. The Jaccard index is often used to compare the predictions of deep learning models on binary data such as image recognition with labeled data, or to compare text patterns in documents based on word overlap. drawback:
- depend on the dataset size
from scipy.spatial import distance distance.jaccard(vector_1, vector_2)
Sorensen-Dice
calculate overlap percentage to measure dataset similarity, used for image segmentation and text similarity analysis. drawback: depend on dataset size
from scipy.spatial import distance
distance.dice(vector_1, vector_2)
Dynamic Time Warping
measuring the distance between two time series of different lengths. Can be used for all time series data use cases such as speech recognition or anomaly detection. using many-to-one or one-to-many mappings to minimize the total distance between two time series. This results in a more intuitive measure of similarity when searching for the best alignment. drawback:
- high computation cost
from scipy.spatial.distance import euclidean
from fastdtw import fastdtw
distance, path = fastdtw(timeseries_1, timeseries_2, dist=euclidean)