Interpret Evaluation Metrics for Time Series Anomaly Detectors
An important factor in assessing the performance of an anomaly detector is the set of evaluation
metrics that the app and the command-line function timeSeriesAnomalyMetrics display or return. The information these metrics
provide depends on whether ground truth (the true anomaly state) is available when you test the
detector.
You can train and test an anomaly detector with no ground-truth
knowledge, that is, with no anomaly labels. This is called unsupervised
evaluation. If you do have labeled data, supervised evaluation gives you a more
granular and reliable interpretation of detector performance.
Supervised Evaluation Metrics
In supervised evaluation where the number of normal and anomalous points is roughly balanced, Accuracy is a useful metric because it gives an overall measure of correctness by equally weighting all entries in the confusion matrix.
However, in most real‑world anomaly‑detection tasks where anomalies are rare and the data are highly imbalanced, metrics such as Precision, Recall, and F1 Score are more informative. F1 Score is especially important when you need a single metric reflecting both the detector’s ability to find anomalies (Recall) and avoid false alarms (Precision).
If the application is highly sensitive to false positives—for example, when each alert triggers a costly action—False Positive Rate and Precision should be prioritized.
Conversely, in safety‑critical applications where missing an anomaly is unacceptable, Recall becomes more important. Per‑class accuracy and the Confusion Matrix provide more detailed, class‑level diagnostic information, allowing users to understand exactly where the detector succeeds or fails across normal and anomalous categories.
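All of these supervised metrics derive from the four entries of the binary confusion matrix. The following Python sketch uses the standard textbook formulas (it does not reproduce the app's API); the counts in the example are hypothetical and chosen to show how Accuracy can look strong on imbalanced data while Precision and F1 Score reveal a weaker detector:

```python
def supervised_metrics(tp, fp, fn, tn):
    """Standard confusion-matrix metrics for a binary anomaly detector.
    The positive class is "anomaly": tp = detected anomalies,
    fp = false alarms, fn = missed anomalies, tn = correct normals."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    fpr = fp / (fp + tn) if (fp + tn) else 0.0           # false positive rate
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1, "FPR": fpr}

# Hypothetical imbalanced test set: 1000 points, only 20 true anomalies.
m = supervised_metrics(tp=15, fp=30, fn=5, tn=950)
# Accuracy is high (0.965) simply because normals dominate,
# while Precision (0.33) and F1 (~0.46) expose the many false alarms.
```

Here Recall is a healthy 0.75, so a safety-critical application might accept this detector, whereas an application sensitive to false positives would reject it on the basis of its low Precision.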
Unsupervised Evaluation Metrics
In unsupervised evaluation, ground‑truth anomaly labels are not available, so you must assess the detector using only its anomaly scores. The metrics provided in this mode describe how well the score distribution separates normal and anomalous regions, but you must interpret them with care. Metrics such as KL Divergence, Average Anomaly Separation, and Normal Scores Range are model‑dependent because different detectors often produce anomaly scores on very different scales or with different statistical profiles. This means these metrics are not intended for comparing different models against one another.
Instead, these metrics are most useful in the following two situations.
Hyperparameter or threshold tuning for a single model — Users can adjust parameters such as smoothing, threshold settings, or window sizes, and use these metrics to monitor whether the anomaly-normal separation improves.
Condition monitoring of a deployed model—For a fixed trained detector, these metrics help monitor stability of the score distribution. Sudden changes in separation or score ranges can indicate model drift, data drift, or degradation in detector performance.
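To make these score-distribution diagnostics concrete, the sketch below computes a histogram-based KL divergence, an average anomaly separation, and a normal-scores range in Python. These are conceptual stand-ins following the common definitions of these quantities; the exact formulations used by the app and by timeSeriesAnomalyMetrics may differ, and the group labels here come from the detector's own predictions, not from ground truth:

```python
import numpy as np

def score_separation(scores, predicted_anomaly, bins=50, eps=1e-12):
    """Sketch of unsupervised score-distribution diagnostics.
    scores: anomaly scores from the detector.
    predicted_anomaly: boolean mask of points the detector flagged."""
    scores = np.asarray(scores, dtype=float)
    predicted_anomaly = np.asarray(predicted_anomaly, dtype=bool)
    normal = scores[~predicted_anomaly]
    anom = scores[predicted_anomaly]

    # Histogram both groups on a shared support so the bins line up.
    edges = np.histogram_bin_edges(scores, bins=bins)
    p, _ = np.histogram(anom, bins=edges)
    q, _ = np.histogram(normal, bins=edges)
    p = p / p.sum() + eps   # small eps avoids log(0) in empty bins
    q = q / q.sum() + eps

    kl = float(np.sum(p * np.log(p / q)))            # KL(anomalous || normal)
    separation = float(anom.mean() - normal.mean())  # gap between mean scores
    normal_range = (float(normal.min()), float(normal.max()))
    return kl, separation, normal_range
```

In a monitoring setting, you would recompute these values on each new batch of scores from the same fixed detector; a sudden drop in separation or a widening normal-scores range is the kind of shift that can indicate drift.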
The Fraction of Anomalies metric provides a simple check on whether the model is predicting anomalies at a reasonable rate, helping you identify whether the detector is overly conservative or generating excessive false alarms.
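This check reduces to the share of points the detector flags, compared against the anomaly rate you expect from the process. A minimal Python sketch, assuming predictions are available as a 0/1 label vector:

```python
import numpy as np

def fraction_of_anomalies(predicted_labels):
    """Fraction of points the detector flags as anomalous.
    If this is far above the rate you expect for the process, suspect
    excessive false alarms; if far below, the detector may be too
    conservative."""
    return float(np.asarray(predicted_labels, dtype=bool).mean())

frac = fraction_of_anomalies([0, 0, 1, 0, 0, 0, 0, 1, 0, 0])  # 2 of 10 flagged
```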