Interpret Evaluation Metrics for Time Series Anomaly Detectors
An important factor in assessing the performance of an anomaly detector is the set of evaluation
metrics that the app and the command-line function timeSeriesAnomalyMetrics display or return. The information these metrics
provide depends on whether ground truth (the true anomaly state) is available when you test the
detector.
You can train and test an anomaly detector with no ground-truth
knowledge, that is, with no anomaly labels. This is called unsupervised
evaluation. If you do have labeled data, supervised evaluation gives you a more
granular and reliable interpretation of detector performance.
Supervised Evaluation Metrics
In supervised evaluation where the number of normal and anomalous points is roughly balanced, Accuracy is a useful metric because it gives an overall measure of correctness by equally weighting all entries in the confusion matrix.
However, in most real‑world anomaly‑detection tasks where anomalies are rare and the data are highly imbalanced, metrics such as Precision, Recall, and F1 Score are more informative. F1 Score is especially important when you need a single metric reflecting both the detector’s ability to find anomalies (Recall) and avoid false alarms (Precision).
If the application is highly sensitive to false positives—for example, when each alert triggers a costly action—False Positive Rate and Precision should be prioritized.
Conversely, in safety‑critical applications where missing an anomaly is unacceptable, Recall becomes more important. Per‑class accuracy and the Confusion Matrix provide more detailed, class‑level diagnostic information, allowing users to understand exactly where the detector succeeds or fails across normal and anomalous categories.
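All of these supervised metrics derive from the four entries of the binary confusion matrix. The following Python sketch uses the standard textbook formulas (it does not reproduce the app's API); the counts in the example are hypothetical and chosen to show how Accuracy can look strong on imbalanced data while Precision and F1 Score reveal a weaker detector:

```python
def supervised_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a binary anomaly detector.
    The positive class is "anomaly": tp = detected anomalies,
    fp = false alarms, fn = missed anomalies, tn = correct normals."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0           # false positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1, "FPR": fpr}

# Hypothetical imbalanced test set: 1000 points, only 20 true anomalies.
m = supervised_metrics(tp=15, fp=30, fn=5, tn=950)
# Accuracy is high (0.965) simply because normals dominate,
# while Precision (0.33) and F1 (~0.46) expose the many false alarms.
```

Here Recall is a healthy 0.75, so a safety-critical application might accept this detector, whereas an application sensitive to false positives would reject it on the basis of its low Precision.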
Unsupervised Evaluation Metrics
In unsupervised evaluation, ground‑truth anomaly labels are not available, so you must assess the detector using only its anomaly scores. The metrics provided in this mode describe how well the score distribution separates normal and anomalous regions, but you must interpret them with care. Metrics such as KL Divergence, Average Anomaly Separation, and Normal Scores Range are model‑dependent because different detectors often produce anomaly scores on very different scales or with different statistical profiles. This means these metrics are not intended for comparing different models against one another.
Instead, these metrics are most useful in the following two situations.
Hyperparameter or threshold tuning for a single model — Users can adjust parameters such as smoothing, threshold settings, or window sizes, and use these metrics to monitor whether the anomaly-normal separation improves.
Condition monitoring of a deployed model—For a fixed trained detector, these metrics help monitor stability of the score distribution. Sudden changes in separation or score ranges can indicate model drift, data drift, or degradation in detector performance.
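To make these score-distribution diagnostics concrete, the sketch below computes a histogram-based KL divergence, an average anomaly separation, and a normal-scores range in Python. These are conceptual stand-ins following the common definitions of these quantities; the exact formulations used by the app and by timeSeriesAnomalyMetrics may differ, and the group labels here come from the detector's own predictions, not from ground truth:

```python
import numpy as np

def score_separation(scores, predicted_anomaly, bins=50, eps=1e-12):
    """Sketch of unsupervised score-distribution diagnostics.
    scores: anomaly scores from the detector.
    predicted_anomaly: boolean mask of points the detector flagged."""
    scores = np.asarray(scores, dtype=float)
    predicted_anomaly = np.asarray(predicted_anomaly, dtype=bool)
    normal = scores[~predicted_anomaly]
    anom = scores[predicted_anomaly]

    # Histogram both groups on a shared support so the bins line up.
    edges = np.histogram_bin_edges(scores, bins=bins)
    p, _ = np.histogram(anom, bins=edges)
    q, _ = np.histogram(normal, bins=edges)
    p = p / p.sum() + eps   # small eps avoids log(0) in empty bins
    q = q / q.sum() + eps

    kl = float(np.sum(p * np.log(p / q)))            # KL(anomalous || normal)
    separation = float(anom.mean() - normal.mean())  # gap between mean scores
    normal_range = (float(normal.min()), float(normal.max()))
    return kl, separation, normal_range
```

In a monitoring setting, you would recompute these values on each new batch of scores from the same fixed detector; a sudden drop in separation or a widening normal-scores range is the kind of shift that can indicate drift.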
The Fraction of Anomalies metric provides a simple check on whether the model is predicting anomalies at a reasonable rate, helping you identify whether the detector is overly conservative or generating excessive false alarms.
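This check reduces to the share of points the detector flags, compared against the anomaly rate you expect from the process. A minimal Python sketch, assuming predictions are available as a 0/1 label vector:

```python
import numpy as np

def fraction_of_anomalies(predicted_labels):
    """Fraction of points the detector flags as anomalous.
    If this is far above the rate you expect for the process, suspect
    excessive false alarms; if far below, the detector may be too
    conservative."""
    return float(np.asarray(predicted_labels, dtype=bool).mean())

frac = fraction_of_anomalies([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])  # 2 of 10 flagged
```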