Many machine learning algorithms rely on distances between data points as their input, sometimes the only input, especially so for clustering and ranking algorithms. The celebrated k-nearest neighbors (KNN) algorithm is our example chief, but distances are also frequently used as an input in the natural language processing domain; “You shall know a word by the company it keeps” (Firth, J. R. 1957:11); e.g. the word “jaguar” refers to the animal if words like “zoo” or “safari” are also in the neighborhood. But would refer to a mark of a car if words such as “parking” or “highway” are nearby. But (and a big one), ‘in the neighborhood’ means one thing in a low-dimension settings, and another thing in high-dimensional settings. This post emphasizes this important difference- another example of the curse of dimensionality; measuring distance in high dimension.

Contents

### Introduction

You simply cannot interpret algorithms which take distances as input in a high- or even moderate dimension as you would in a low-dimensional data. In the bottom of the post Curse of dimensionality part 1 we saw that the data rapidly becomes sparse with increased dimension.

So what?

So it means that even if a data point ranks “closest” to another data point, it can still be very, *very far*. In fact, in the paper When is Nearest Neighbors Meaningful? the authors argue that for many data distribution and distance functions, the ratio of distances between nearest and farthest neighbors is almost 1 (so more or less the same). Fascinatingly counter-intuitive. Illustration next.

### Illustration

- Simulate a random matrix of dimension 1000 rows by 500 columns, from a Gaussian distribution.
- Compute pairwise Euclidean distance between each data points and all other data points
- Compute the ratio of the maximum over all those distances to the minimum over all those distances, per data point.
- Compute the average of those ratios across data points

#### Code

1 2 3 4 5 6 7 8 9 10 11 12 |
library(magrittr) TT <- 1000 R <- 500 # set your seed x <- do.call(cbind, lapply(rep(TT, R), rnorm)) dist_mat <- dist(x, method = "euclidean", diag=T, upper= T) %>% as.matrix max_dist <- dist_mat %>% apply(2, max) min_dist <- (dist_mat %>% apply(2, sort))[2] # get the minimum which is not zero (max_dist/min_dist) %>% mean # 1.194053 |

So on average the farthest data point from each other point is ~19% farther than the closest data point. This means that the other 997 data points are somewhere within that 19% range.

So What?

**So it means that there is very little difference between what is far and what is not far**. For comparison, when you change the number of columns to 5, i.e. low-dimension data, you get a result of around ~10 (seed-dependent), which indeed means that the farthest data point is 10-times farther away than the closest data point. Seeing is believing. I have created a tiny app for checking how the ratio changes as a function of the dimension and the number of data point. See below or be directed*.

Even for 50 columns, fairly moderate dimension in today’s standards, it is very hard to reach a meaningful distance diversification. This exercise bodes ill for large scale applications which use the go-to Euclidean distance measure.

Can we do better? Somewhat better, yes.

### Recommendation

Directly from the paper On the Surprising Behavior of Distance Metrics in High Dimensional Space. Lecture Notes in Computer Science, vol 1973.

Briefly: lower the exponent for the distance function. Manhattan distance (exponent= 1) is better than Euclidean (exponent= 2), and in that paper the authors propose to go lower still- call it fractional distance function. They show that fractional distance function (in their exercises [0.3,0.4]) makes better sense in high dimension, both from a theoretical and empirical perspective.

Powerful paper on a topic which deserves much more attention than currently given.

* Bear in mind that there are only 25 active hours in my subscription, so the app may or may not display depending on credit status.

## 2 comments on “Curse of Dimensionality part 4: Distance Metrics”