First the bad news. There is neither a formal definition nor a recognized calculation method for identifying an anomaly in a data set; neither information theory nor statistics offers a scientific theory for it. In metrology and experimental physics, there is the concept of measurement error: an event from outside influences the observation, although the observation should be systematically isolated from the outside world. The assumption is that reproducibility was compromised by an unknown, non-isolated external influence - for example, a fault in the data-recording electronics itself. If there is any suspicion, the test setup is checked, the series of measurements is repeated, and the anomaly disappears in the process.
An anomaly is an indication that an unknown, misunderstood influence is acting on a series of observations, or that an observation has occurred that was not expected in the context of the previous observations. The anomaly lies outside the internal variation of the observations. This is why the term outlier is also used as a synonym for anomaly - as if the observation had simply broken away from the underlying structure and was free.
Even without a scientific definition, an anomaly is usually understood as a rarely occurring observation that differs significantly from the rest of the statistics - as if, for a moment, the principle of statistics could be overridden: the suspect observation is removed from the statistics, treated as unique, and compared with the very statistics from which it originated.
The following three figures show examples of different measurement series with anomalies colored red and clearly visible.
An anomaly is a warning signal in human sensory perception. Humans have an intuitive sense of whether an observation fits a pattern or not. Each observation is matched against something familiar; if this is not possible, all senses are sharpened - there is reason to believe that danger is imminent. For example, a person involuntarily perceives the sudden temperature jump in figure 2, as well as the small, constant fluctuations in the ECG in figure 3, as an abnormal observation.
In other words: when you develop an algorithm to detect anomalies, do you need to replicate this intuitive human understanding as closely as possible?
First of all, humans cannot do this for individual values, because they do not capture the world in one-dimensional data. Humans can do it for complex data such as visual, acoustic or haptic stimuli.
And this is an astounding insight: anomalies are easier to detect in complex data than in highly simplified data.
People intuitively differentiate between extreme observations and outliers. In simple data, however, anomalies lie on the same axis as extreme observations. In multidimensional data, an extreme event can be seen as an observation on the extension of a path, while an anomaly lies on no known path, or between known paths.
Accordingly, a procedure is needed that calculates whether an observation lies on a path - or, more generally, on a manifold - or not. This is where feature encoding comes into play, which first determines the paths in multidimensional data. Paths can also be understood as lines of free variation in the data, which are subject to mutual restrictions.
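The distinction can be illustrated with a toy example (a hypothetical sketch, not part of the METRIC framework): points sampled along a known path, one extreme observation on the path's extension, and one anomaly sitting amid the data but on no path. The distance to the path separates the two cases.

```python
import numpy as np

rng = np.random.default_rng(0)

# A known "path": noisy samples along the parabola y = x**2.
x = rng.uniform(-2, 2, 200)
data = np.column_stack([x, x**2 + rng.normal(0, 0.05, 200)])

extreme = np.array([3.0, 9.0])   # far out, but on the path's extension
anomaly = np.array([0.0, 2.0])   # amid the data cloud, but on no path

def distance_to_path(p, path_points):
    """Distance from p to the nearest sampled point of the path."""
    return np.min(np.linalg.norm(path_points - p, axis=1))

# Learn the path from the data (here: a quadratic fit) and
# sample it densely, including its extension beyond the data.
coef = np.polyfit(data[:, 0], data[:, 1], 2)
t = np.linspace(-4, 4, 4000)
path = np.column_stack([t, np.polyval(coef, t)])

print(distance_to_path(extreme, path))  # small: an extreme observation
print(distance_to_path(anomaly, path))  # large (~1.3): an anomaly
```

A distance-to-center criterion would get this exactly backwards: the extreme point is much further from the data's mean than the anomaly is.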
In the METRIC Framework we provide two algorithms out of the box that first determine the paths or manifolds behind the data and then check which observations are anomalous and not just extremely distant from them.
(1) Inverse Diffusion Mapping:
In a pseudo-Euclidean space, one can model a diffusion process in which the values in the different dimensions influence each other, so that with each iteration step a path fans out into a noisy path. If you mathematically invert this diffusion process, you force the noisy data space to collapse into individual paths, which - if you push it to the extreme - finally collapse into individual points. However far you push it, you can check which observations are furthest away from the calculated paths and points. In addition, a diffusion process implies a specific statistical distribution of the diffusion paths, and likewise of the inverted paths. This distribution can be used for further interpretation: measure the distance of the original noisy data points to the denoised path, and mark anything outside the expected statistics as an anomaly. The following three figures illustrate the process (the red dots represent anomalies).
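The idea can be imitated in a few lines. The following sketch is not the METRIC implementation of Inverse Diffusion Mapping; it stands in for the inversion with a single Gaussian-kernel averaging pass that pulls the noisy cloud onto a thin path, then flags points lying outside the residual statistics:

```python
import numpy as np

rng = np.random.default_rng(1)

# A noisy path (a sine curve) plus two injected off-path anomalies.
t = np.linspace(0, 2 * np.pi, 300)
X = np.column_stack([t, np.sin(t) + rng.normal(0, 0.1, t.size)])
X = np.vstack([X, [[1.5, 2.0], [4.0, -2.5]]])  # indices 300 and 301

def denoise(points, bandwidth=0.5):
    """One kernel-weighted averaging step: each point is replaced by the
    Gaussian-weighted mean of the cloud, collapsing noise onto the path."""
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * bandwidth**2))
    return (w @ points) / w.sum(1, keepdims=True)

D = denoise(X)

# Distance from each original point to the denoised path,
# ignoring the point's own denoised image.
diff = np.linalg.norm(X[:, None, :] - D[None, :, :], axis=-1)
np.fill_diagonal(diff, np.inf)
residual = diff.min(axis=1)

# Anything outside the expected residual statistics is an anomaly.
threshold = residual.mean() + 3 * residual.std()
print(np.where(residual > threshold)[0])  # includes 300 and 301
```

The bandwidth and the three-sigma rule are illustrative choices; the essential step is comparing each observation's residual against the distribution of all residuals.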
(2) Kohonen Outlier Clustering:
Another alternative is the approximation of a multidimensional data set by a local Euclidean graph, e.g. a 2D city-block grid. In this case, the optimized nodes of the Kohonen network are clustered, and each observation is checked for which cluster it belongs to. If an observation now lies outside the grid, the statistics of the distances within the respective cluster, or the density, are determined, and potential outliers are identified by triangulation as observations that do not statistically fit the defined density.
Kohonen Outlier Clustering then proceeds as follows:
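A rough sketch of the idea (a minimal self-organizing map in plain numpy; the grid size, decay schedules and three-sigma rule are illustrative choices, not the METRIC implementation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two dense clusters plus one far-off observation (index 200).
X = np.vstack([
    rng.normal([0.0, 0.0], 0.3, (100, 2)),
    rng.normal([4.0, 4.0], 0.3, (100, 2)),
    [[2.0, -3.0]],
])

# A 5x5 Kohonen grid; node weights start near the data mean.
grid = np.array([(i, j) for i in range(5) for j in range(5)], float)
W = X.mean(0) + rng.normal(0, 0.1, (25, 2))

epochs = 20
for e in range(epochs):
    lr = 0.5 * (1 - e / epochs) + 0.02     # decaying learning rate
    sigma = 2.0 * (1 - e / epochs) + 0.5   # decaying neighbourhood radius
    for x in X[rng.permutation(len(X))]:
        bmu = np.argmin(((W - x) ** 2).sum(1))          # best matching unit
        h = np.exp(-((grid - grid[bmu]) ** 2).sum(1) / (2 * sigma**2))
        W += lr * h[:, None] * (x - W)                  # pull grid toward x

# Distance of every observation to its nearest grid node,
# then a statistical check on those distances.
qe = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=-1).min(1)
threshold = qe.mean() + 3 * qe.std()
print(np.where(qe > threshold)[0])  # the far-off observation stands out
```

Because the neighbourhood coupling ties the nodes together, a single distant observation cannot drag a node out of the data's structure, so its distance to the grid remains conspicuously large.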
Figure 4 shows anomaly detection on daily curves of NYC taxi data, annotated with researched events.
In one method, the data is optimized toward a manifold; in the other, a manifold is optimized toward the data. The results are very similar. For both methods (Inverse Diffusion Mapping and Kohonen Outlier Clustering) a ranking of the data points furthest from the nearest path can be calculated.
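Such a ranking is just a sort on the residual distances. With hypothetical residuals (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical residuals: distance of each observation to the nearest path.
residual = np.array([0.11, 0.09, 1.42, 0.10, 0.73, 0.12])

# Rank observations by distance, most anomalous first.
ranking = np.argsort(residual)[::-1]
print(ranking[:3])  # -> [2 4 5]
```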
The following figures show the application of the algorithms to image data. The anomalies were determined automatically, without a human having to teach the algorithm what an anomaly is.
The answer to the initial question - whether anomaly detection needs special training to match the human perception of anomalies - is no. The methods are objective and purely information-theoretically motivated. Humans simply have an amazingly good intuitive feeling for anomalies.