A metric distance function corresponds to the shortest path through a data space. Simple metrics, however, neglect the inherent structure of the data: the Euclidean distance, for example, always follows a straight line.
We now want to take into account the structure formed by the data points themselves rather than ignoring it. To do this we use a result from the theory of metric embeddings: the distance we are looking for coincides locally with the Euclidean distance, and conversely, the more distorted the Euclidean distance is, the farther apart two data points are. This phenomenon is known as Bourgain's embedding theorem and corresponds in analysis to approximating a function by a straight line: near the point of tangency the estimate is still good, while farther away the linearization causes large deviations.
It follows that a data space with an arbitrary metric can always be embedded into a Euclidean space if only the local properties need to be preserved. In principle, these local distances can then be summed up continuously to obtain a global distance that approximates the distance we are looking for, much as a spline is composed of small straight segments or an arc length is computed from short chords.
Since we are looking for the shortest distance, we can determine it by first constructing a KNN graph with the Euclidean metric on the data points and then finding the shortest path through the graph with Dijkstra's algorithm. Along this shortest route, the local distances are summed up to the global distance.
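As a minimal sketch (the function name and the toy circle data are illustrative assumptions, not from the article), this can be done with standard tools:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import dijkstra

def graph_distances(X, k=10):
    """All-pairs graph distances: local Euclidean steps through a KNN graph,
    summed along Dijkstra shortest paths."""
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")  # sparse, weighted
    return dijkstra(G, directed=False)  # undirected: edges work both ways

# Toy check: points on a circle. The graph distance follows the arc (~pi),
# while the plain Euclidean distance cuts straight across (~2).
t = np.linspace(0, 2 * np.pi, 200, endpoint=False)
X = np.c_[np.cos(t), np.sin(t)] + 0.01 * np.random.randn(200, 2)
D = graph_distances(X, k=5)
print(D[0, 100], np.linalg.norm(X[0] - X[100]))
```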
However, we do not want to do this on the raw data, since Dijkstra's algorithm scales poorly and therefore severely limits the number of data points. Instead, we first fit a Kohonen net, i.e. a manifold, into the data space; in this way, data sets of virtually any size can be handled. The Dijkstra shortest path is then computed on the virtual Kohonen net instead of on the real data points.
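A sketch of this two-stage variant, assuming the third-party minisom package and an illustrative 20x20 grid (all names and parameters here are example choices):

```python
import numpy as np
from minisom import MiniSom
from scipy.sparse import lil_matrix
from scipy.sparse.csgraph import dijkstra

def som_graph_distances(X, grid=(20, 20), iters=5000):
    """Fit a Kohonen net into the data, then run Dijkstra on the net's
    grid edges instead of on the raw data points."""
    nx, ny = grid
    som = MiniSom(nx, ny, X.shape[1], sigma=1.5, learning_rate=0.5)
    som.train_random(X, iters)                  # fit the virtual net
    W = som.get_weights().reshape(nx * ny, -1)  # codebook vectors = graph nodes

    # Connect grid neighbors; edge weights are Euclidean lengths in data space,
    # so summing them along a path yields the global distance.
    A = lil_matrix((nx * ny, nx * ny))
    for i in range(nx):
        for j in range(ny):
            for di, dj in ((1, 0), (0, 1)):
                ii, jj = i + di, j + dj
                if ii < nx and jj < ny:
                    a, b = i * ny + j, ii * ny + jj
                    A[a, b] = np.linalg.norm(W[a] - W[b])
    return som, dijkstra(A.tocsr(), directed=False)
```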
For a global embedding, however, it still has to be defined how exactly the data structure should influence the distance.
In the sample plot shown below, the shortest path leads through the region of lowest density, as if it could move particularly fast there, or conversely, as if it were jammed wherever the density is high. This can quickly cause an empty region to be crossed that is actually subject to a restriction and therefore should not be traversed. In the limiting case, this simply yields the Euclidean distance again.
To make the procedure more robust, the data space can be artificially densified before the shortest-path algorithm is applied.
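The article does not specify the densification method; one simple sketch, under the assumption that inserting the midpoints of KNN edges as virtual points is acceptable, looks like this:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def densify(X, k=5):
    """Add the midpoint of every KNN edge as a virtual data point."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is the point itself
    mids = np.array([(X[i] + X[j]) / 2.0
                     for i in range(len(X)) for j in idx[i, 1:]])
    return np.vstack([X, np.unique(mids, axis=0)])
```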
Another possibility is a physically, or rather quantum-mechanically, motivated weighting. By analogy, we consider the path of a photon or electron, or rather its probability amplitude, from one location to another through a network of resistances or probabilities.
The particle takes essentially arbitrary paths, all of which initially contribute equally. The amplitudes of paths with similar transit times, however, reinforce each other and stabilize into the most effective, and thus shortest, path. Detours and roundabout paths, on the other hand, cancel out statistically in the probability amplitude because each has a symmetrical counterpart.
If a random walk from A to B is simulated infinitely often, paths through regions of low (optical) density occur less frequently than paths through regions of high density: high density makes a path more probable, low density makes it less probable.
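This can be illustrated with a toy simulation (the graph construction below is an invented example, not from the article): two routes lead from A to B, one with many parallel stepping stones, one with only a few.

```python
import random
from collections import Counter

A, B = "A", "B"
dense = [f"d{i}" for i in range(10)]   # route with many parallel stepping stones
sparse = [f"s{i}" for i in range(2)]   # route with only a few
nbrs = {A: dense + sparse, B: dense + sparse}
for v in dense + sparse:
    nbrs[v] = [A, B]

counts = Counter()
for _ in range(10_000):
    node = A
    while node != B:
        node = random.choice(nbrs[node])  # uniform step to a random neighbor
        if node in dense:
            counts["dense"] += 1
        elif node in sparse:
            counts["sparse"] += 1
print(counts)  # visits to the dense route outnumber the sparse one roughly 5:1
```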
This phenomenon can be reproduced (without the many random walks) by thinning the KNN graph or the virtual Kohonen mesh with a special algorithm, so that the distances through the mesh remain the same from all directions. For each connection that is removed, another one is adjusted, so that the probability is either split across two equal connections or carried by a single connection of twice the length (2 * 0.1 -> 1 * 0.2).
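One literal reading of the (2 * 0.1 -> 1 * 0.2) rule, sketched on a multigraph where parallel connections can exist (the article's actual thinning algorithm is not spelled out, so this is an assumption):

```python
import networkx as nx

def merge_parallel_edges(G: nx.MultiGraph, tol=1e-9):
    """Replace two parallel connections of (nearly) equal weight w
    by one connection of weight 2*w, preserving the overall passage."""
    for u, v in set(G.edges()):
        keys = list(G[u][v])
        while len(keys) >= 2:
            k1, k2 = keys[0], keys.pop()
            w1, w2 = G[u][v][k1]["weight"], G[u][v][k2]["weight"]
            if abs(w1 - w2) > tol:
                break
            G.remove_edge(u, v, key=k2)
            G[u][v][k1]["weight"] = w1 + w2   # 2 * 0.1  ->  1 * 0.2

G = nx.MultiGraph()
G.add_edge("a", "b", weight=0.1)
G.add_edge("a", "b", weight=0.1)
merge_parallel_edges(G)   # leaves a single a-b connection with weight 0.2
```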
What do the results mean in practice?
If there is no prior assumption about a metric, the Kohonen distance can be used directly in one of these variants. Computing a Kohonen distance model is straightforward, and so is computing distances with such a model.
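Querying such a model then reduces to a table lookup; a sketch building on the SOM example above (the helper name kohonen_distance is illustrative):

```python
def kohonen_distance(som, D, x, y, grid=(20, 20)):
    """Distance between two (possibly unseen) points: map each to its
    best-matching unit, then look up the precomputed Dijkstra distance."""
    ny = grid[1]
    (i1, j1), (i2, j2) = som.winner(x), som.winner(y)
    return D[i1 * ny + j1, i2 * ny + j2]
```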
The procedure can also be used to check numerically how distorted the assumption of a Euclidean metric is for a given data set. If the distortion is small, the Euclidean metric can be used just as well. The distortion of the examples shown above is 100%, 104%, 113% and 117%.
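One plausible way to compute such a number, under the assumption (not stated explicitly in the article) that the distortion is the mean ratio of graph distance to Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def distortion_estimate(X, k=10):
    """Mean ratio of graph distance to Euclidean distance over all point
    pairs; 1.0 (i.e. 100%) means the Euclidean metric is already exact.
    Assumes no duplicate points and uses graph_distances() from above."""
    D_graph = graph_distances(X, k=k)
    D_euc = squareform(pdist(X))
    mask = ~np.eye(len(X), dtype=bool) & np.isfinite(D_graph)
    return (D_graph[mask] / D_euc[mask]).mean()
```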
The following figures show examples of the distortion for the NRAIA data sets with different numbers of dimensions.
BOD2: estimated distortion 2.3%
Chloride: estimated distortion 15%
Leaves: estimated distortion 11%
Lipo: estimated distortion 6%
Rumford: estimated distortion 3%
Sacch2: estimated distortion 8%
PCB: estimated distortion 22%
Ethyl: estimated distortion 15%
Lubricant: estimated distortion 16%
Nitrite: estimated distortion 15%
Saccharin: estimated distortion 3%
Isom: estimated distortion 45%
O.xylene: estimated distortion 18%
Oilshale: estimated distortion 41%
Pinene: estimated distortion 8%
Coal: estimated distortion 31%
The calculated distortion is a quantitative value and helps when surveying large amounts of data, because it is a measure of the non-linearity of the data set. In the example data sets, 2D points were chosen for illustration; however, the method works for numerical data sets of any complexity, e.g. also for images. Like the estimate of intrinsic dimensionality (https://panda.technology/en/entropy), knowledge of the non-linearity helps in selecting a suitable feature encoder for dimensionality reduction. Both are scalar properties of a metric space; such scalar values can be used to compare data sets via the concept of metric spaces, or to detect changes within a space.
Such information-theoretically motivated aggregations correspond roughly to the concepts of entropy or kinetic energy (or temperature) in physics, which allow complex structures to be described simply as states.
And this is also the most obvious practical use case: condition monitoring on arbitrarily complex and unknown data sets.
If an image recording or a mechanical oscillation shows an increase in dimensionality or non-linearity over time, this is an important warning signal.
With this concept, an Auto-Condition Monitoring can therefore be realized.