Many machine learning algorithms rely on distances between data points as their input, sometimes as the only input, especially so for clustering and ranking algorithms. The celebrated k-nearest neighbors (KNN) algorithm is our chief example, but distances are also frequently used as input in the natural language processing domain: “You shall know a word by the company it keeps” (Firth, J. R. 1957:11). For example, the word “jaguar” refers to the animal if words like “zoo” or “safari” are in the neighborhood, but refers to the make of car if words such as “parking” or “highway” are nearby. But (and it is a big but), “in the neighborhood” means one thing in a low-dimensional setting and quite another in a high-dimensional setting. This post emphasizes this important difference, another instance of the curse of dimensionality: measuring distance in high dimension.
You simply cannot interpret algorithms that take distances as input in high (or even moderate) dimension the way you would with low-dimensional data. At the bottom of the post Curse of dimensionality part 1 we saw that data rapidly becomes sparse as the dimension increases.
This means that even if a data point ranks “closest” to another data point, it can still be very, very far away. In fact, in the paper When Is “Nearest Neighbor” Meaningful? the authors argue that for many data distributions and distance functions, the ratio of the distances to the nearest and farthest neighbors is almost 1, so they are more or less equally far. Fascinatingly counter-intuitive. An illustration is next:
- Simulate a random matrix with 1000 rows and 500 columns, drawn from a Gaussian distribution.
- Compute the pairwise Euclidean distance between each data point and all other data points.
- For each data point, compute the ratio of the maximum to the minimum of those distances.
- Average those ratios across data points.
library(magrittr) # for the pipe operator

TT <- 1000 # number of data points (rows)
R <- 500   # dimension (columns)
set.seed(1) # set your seed
x <- do.call(cbind, lapply(rep(TT, R), rnorm))
dist_mat <- dist(x, method = "euclidean", diag = TRUE, upper = TRUE) %>% as.matrix
max_dist <- dist_mat %>% apply(2, max)
# get the minimum which is not zero (the zero is each point's distance to itself)
min_dist <- dist_mat %>% apply(2, function(d) min(d[d > 0]))
(max_dist / min_dist) %>% mean
So, on average, the farthest data point from any given point is only ~19% farther away than the closest data point. This means that the other 997 data points all fall somewhere within that narrow 19% band.
This means there is very little difference between what is far and what is not. For comparison, when you change the number of columns to 5, i.e. low-dimensional data, you get a result of around ~10 (seed-dependent), which indeed means that the farthest data point is 10 times farther away than the closest one. Seeing is believing: I have created a tiny app for checking how the ratio changes as a function of the dimension and the number of data points. See below, or follow the direct link*.
Even with 50 columns, a fairly moderate dimension by today’s standards, it is very hard to reach meaningful distance diversification. This exercise bodes ill for large-scale applications that rely on the go-to Euclidean distance measure.
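The loop underlying the app can be sketched in a few lines. This is my own minimal version, not the app’s actual code; the function name, the 200-point sample size, and the chosen dimensions are illustrative assumptions:

# average max/min distance ratio as a function of the dimension
avg_ratio <- function(n_rows, n_cols) {
  x <- matrix(rnorm(n_rows * n_cols), nrow = n_rows)
  dist_mat <- as.matrix(dist(x, method = "euclidean"))
  max_dist <- apply(dist_mat, 2, max)
  # minimum which is not zero (exclude each point's distance to itself)
  min_dist <- apply(dist_mat, 2, function(d) min(d[d > 0]))
  mean(max_dist / min_dist)
}

set.seed(1)
sapply(c(2, 5, 50, 500), function(p) avg_ratio(200, p))
# the ratio shrinks towards 1 as the dimension grows

Running it shows the same pattern as above: a large ratio in 2 or 5 dimensions, and a ratio creeping towards 1 by the time you reach hundreds of columns.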
Can we do better? Somewhat better, yes.
Briefly: lower the exponent of the distance function. The Manhattan distance (exponent = 1) is better than the Euclidean (exponent = 2), and in that paper the authors propose going lower still, to what they call fractional distance functions. They show that fractional distance functions (exponents in [0.3, 0.4] in their experiments) make better sense in high dimension, from both a theoretical and an empirical perspective.
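We can check this claim on simulated data using base R’s dist() with method = "minkowski", which accepts a fractional exponent via its p argument. The helper name and the 500-by-50 setup below are my own choices for illustration, not taken from the paper:

# max/min distance ratio under different exponents of the Minkowski distance
ratio_for_p <- function(x, p) {
  dist_mat <- as.matrix(dist(x, method = "minkowski", p = p))
  max_dist <- apply(dist_mat, 2, max)
  min_dist <- apply(dist_mat, 2, function(d) min(d[d > 0]))
  mean(max_dist / min_dist)
}

set.seed(1)
x <- matrix(rnorm(500 * 50), nrow = 500) # 500 points in 50 dimensions
sapply(c(2, 1, 0.5), function(p) ratio_for_p(x, p))
# Euclidean, Manhattan, then fractional: lower exponents give larger ratios,
# i.e. better contrast between near and far neighbors

Note that for p below 1 this is no longer a metric (the triangle inequality fails), but for ranking neighbors that does not matter.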
A powerful paper on a topic which deserves much more attention than it currently gets.
* Bear in mind that there are only 25 active hours in my subscription, so the app may or may not display depending on credit status.