Main Content

Loss Function and Model Quality Metrics

What is a Loss Function?

The System Identification Toolbox™ software estimates model parameters by minimizing the error between the model output and the measured response. This error, called loss function or cost function, is a positive function of prediction errors e(t). In general, this function is a weighted sum of squares of the errors. For a model with ny-outputs, the loss function V(θ) has the following general form:

V(θ)= 1Nt=1NeT(t,θ) W(θ) e(t,θ)

where:

  • N is the number of data samples.

  • e(t,θ) is ny-by-1 error vector at a given time t, parameterized by the parameter vector θ.

  • W(θ) is the weighting matrix, specified as a positive semidefinite matrix. If W is a diagonal matrix, you can think of it as a way to control the relative importance of outputs during multi-output estimations. When W is a fixed or known weight, it does not depend on θ.

The software determines the parameter values by minimizing V(θ) with respect to θ.

For notational convenience, V(θ) is expressed in its matrix form:

V(θ)=1Ntrace(ET(θ) E(θ) W(θ))

E(θ) is the error matrix of size N-by-ny. The i:th row of E(θ) represents the error value at time t = i.

The exact form of V(θ) depends on the following factors:

  • Model structure. For example, whether the model that you want to estimate is an ARX or a state-space model.

  • Estimator and estimation options. For example, whether you are using n4sid or ssest estimator and specifying options such as Focus and OutputWeight.

Options to Configure the Loss Function

You can configure the loss function for your application needs. The following estimation options, when available for the estimator, configure the loss function:

Estimation OptionDescriptionNotes

Focus

Focus option affects how e(t) in the loss function is computed:

  • When Focus is 'prediction', e(t) represents 1-step ahead prediction error:

    ep(t)=ymeasured(t)ypredicted(t)

  • When Focus is 'simulation', e(t) represents the simulation error:

    es(t)=ymeasured(t)ysimulated(t)

Note

For models whose noise component is trivial, (H(q) = 1), ep(t), and es(t) are equivalent.

The Focus option can also be interpreted as a weighting filter in the loss function. For more information, see Effect of Focus and WeightingFilter Options on the Loss Function.

  • Specify the Focus option in the estimation option sets.

  • The estimation option sets for oe and tfest do not have a Focus option because the noise-component for the estimated models is trivial, and so ep(t) and es(t) are equivalent.

WeightingFilter

When you specify a weighting filter, prefiltered prediction or simulation error is minimized:

ef(t)=(e(t))

where (.) is a linear filter. The WeightingFilter option can be interpreted as a custom weighting filter that is applied to the loss function. For more information, see Effect of Focus and WeightingFilter Options on the Loss Function.

  • Specify the WeightingFilter option in the estimation option sets. Not all options for WeightingFilter are available for all estimation commands.

EnforceStability

When EnforceStability is true, the minimization objective also contains a constraint that the estimated model must be stable.

  • Specify the EnforceStability option in the estimation option sets.

  • The estimation option sets for procest and ssregest commands do not have an EnforceStability option. These estimation commands always yield a stable model.

  • The estimation commands tfest and oe always yield a stable model when used with time-domain estimation data.

  • Identifying unstable plants requires data collection under a closed loop with a stabilizing feedback controller. A reliable estimation of the plant dynamics requires a sufficiently rich noise component in the model structure to separate out the plant dynamics from feedback effects. As a result, models that use a trivial noise component (H(q) = 1), such as models estimated by tfest and oe commands, do not estimate good results for unstable plants.

OutputWeight

OutputWeight option configures the weighting matrix W(θ) in the loss function and lets you control the relative importance of output channels during multi-output estimations.

  • When OutputWeight is 'noise', W(θ) equals the inverse of the estimated variance of error e(t):

    W(θ)=(1NET(θ) E(θ))1

    Because W depends on θ, the weighting is determined as a part of the estimation. Minimization of the loss function with this weight simplifies the loss function to:

    V(θ)=det(1NET(θ) E(θ))

    Using the inverse of the noise variance is the optimal weighting in the maximum likelihood sense.

  • When OutputWeight is an ny-by-ny positive semidefinite matrix, a constant weighting is used. This loss function then becomes a weighted sum of squared errors.

  • Specify the OutputWeight option in the estimation option sets. Not all options for OutputWeight are available for all estimation commands.

  • OutputWeight is not available for polynomial model estimation because such models are always estimated one output at a time.

  • OutputWeight cannot be 'noise' when SearchMethod is 'lsqnonlin'.

ErrorThreshold

ErrorThreshold option specifies the threshold for when to adjust the weight of large errors from quadratic to linear. Errors larger than ErrorThreshold times the estimated standard deviation have a linear weight in the loss function.

V(θ)= 1N(tIeT(t,θ) W(θ) e(t,θ)+tJvT(t,θ) W(θ) v(t,θ))

where:

  • I represents those time instants for which  |e(t)| <ρ*σ  , where ρ is the error threshold.

  • J represents the complement of I, that is, the time instants for which |e(t)| >=ρ*σ.

  • σ is the estimated standard deviation of the error.

The error v(t,θ) is defined as:

v(t,θ)=e(t,θ)*σρ|e(t,θ)|

  • Specify the ErrorThreshold option in the estimation option sets.

  • A typical value for the error threshold ρ = 1.6 minimizes the effect of data outliers on the estimation results.

Regularization

Regularization option modifies the loss function to add a penalty on the variance of the estimated parameters.

The loss function is set up with the goal of minimizing the prediction errors. It does not include specific constraints on the variance (a measure of reliability) of estimated parameters. This can sometimes lead to models with large uncertainty in estimated model parameters, especially when the model has many parameters.

Regularization introduces an additional term in the loss function that penalizes the model flexibility:

V(θ)= 1Nt=1NeT(t,θ) W(θ) e(t,θ) + 1Nλ (θθ*)TR (θθ*)

The second term is a weighted (R) and scaled (λ) variance of the estimated parameter set θ about its nominal value θ*.

  • Specify the Regularization option in the estimation option sets.

  • For linear-in-parameter models (FIR models) and ARX models, you can compute optimal values of the regularization variables R and λ using the arxRegul command.

Effect of Focus and WeightingFilter Options on the Loss Function

The Focus option can be interpreted as a weighting filter in the loss function. The WeightingFilter option is an additional custom weighting filter that is applied to the loss function.

To understand the effect of Focus and WeightingFilter, consider a linear single-input single-output model:

y(t)=G(q,θ) u(t)+H(q,θ) e(t)

Where G(q,θ) is the measured transfer function, H(q,θ) is the noise model, and e(t) represents the additive disturbances modeled as white Gaussian noise. q is the time-shift operator.

In frequency domain, the linear model can be represented as:

Y(ω)=G(ω,θ)U(ω)+H(ω,θ)E(ω)

where Y(ω), U(ω), and E(ω) are the Fourier transforms of the output, input, and output error, respectively. G(ω,θ) and H(ω,θ) represent the frequency response of the input-output and noise transfer functions, respectively.

The loss function to be minimized for the SISO model is given by:

V(θ)= 1Nt=1NeT(t,θ) e(t,θ)

Using Parseval’s Identity, the loss function in frequency-domain is:

V(θ,ω)= 1NE(ω)2

Substituting for E(ω) gives:

V(θ,ω)=1NY(ω)U(ω)G(θ,ω))2U(ω)2H(θ,ω)2

Thus, you can interpret minimizing the loss function V as fitting G(θ,ω) to the empirical transfer function Y(ω)/U(ω), using U(ω)2H(θ,ω)2 as a weighting filter. This corresponds to specifying Focus as 'prediction'. The estimation emphasizes frequencies where input has more power (U(ω)2 is greater) and de-emphasizes frequencies where noise is significant (H(θ,ω)2 is large).

When Focus is specified as 'simulation', the inverse weighting with H(θ,ω)2 is not used. That is, only the input spectrum is used to weigh the relative importance of the estimation fit in a specific frequency range.

When you specify a linear filter as WeightingFilter, it is used as an additional custom weighting in the loss function.

V(θ)=1N2Y(ω)U(ω)G(θ))2U(ω)2H(θ)2(ω)2

Here (ω) is the frequency response of the filter. Use (ω) to enhance the fit of the model response to observed data in certain frequencies, such as to emphasize the fit close to system resonant frequencies.

The estimated value of input-output transfer function G is the same as what you get if you instead first prefilter the estimation data with (.) using idfilt, and then estimate the model without specifying WeightingFilter. However, the effect of (.) on the estimated noise model H depends on the choice of Focus:

  • Focus is 'prediction' — The software minimizes the weighted prediction error ef(t)=(ep(t)), and the estimated model has the form:

    y(t)=G(q)u(t)+H1(q)e(t)

    Where H1(q)=H(q)/(q). Thus, the estimation with prediction focus creates a biased estimate of H. This is the same estimated noise model you get if you instead first prefilter the estimation data with (.) using idfilt, and then estimate the model.

    When H is parameterized independent of G, you can treat the filter (.) as a way of affecting the estimation bias distribution. That is, you can shape the trade-off between fitting G to the system frequency response and fitting H/ to the disturbance spectrum when minimizing the loss function. For more details see, section 14.4 in System Identification: Theory for the User, Second Edition, by Lennart Ljung, Prentice Hall PTR, 1999.

  • Focus is 'simulation' — The software first estimates G by minimizing the weighted simulation error ef(t)=(es(t)), where es(t)=ymeasured(t)G(q)umeasured(t). Once G is estimated, the software fixes it and computes H by minimizing pure prediction errors e(t) using unfiltered data. The estimated model has the form:

    y(t)=G(q)u(t)+He(t)

    If you prefilter the data first, and then estimate the model, you get the same estimate for G but get a biased noise model H/.

Thus, the WeightingFilter has the same effect as prefiltering the estimation data for estimation of G. For estimation of H, the effect of WeightingFilter depends upon the choice of Focus. A prediction focus estimates a biased version of the noise model H/, while a simulation focus estimates H. Prefiltering the estimation data, and then estimating the model always gives H/ as the noise model.

Model Quality Metrics

After you estimate a model, use model quality metrics to assess the quality of identified models, compare different models, and pick the best one. The Report.Fit property of an identified model stores various metrics such as FitPercent, LossFcn, FPE, MSE, AIC, nAIC, AICc, and BIC values.

  • FitPercent, LossFcn, and MSE are measures of the actual quantity that is minimized during the estimation. For example, if Focus is 'simulation', these quantities are computed for the simulation error es (t). Similarly, if you specify the WeightingFilter option, then LossFcn, FPE, and MSE are computed using filtered residuals ef (t).

  • FPE, AIC, nAIC, AICc, and BIC measures are computed as properties of the output disturbance according to the relationship:

    y(t)=G(q)u(t)+H(q)e(t)

    G(q) and H(q) represent the measured and noise components of the estimated model.

    Regardless of how the loss function is configured, the error vector e(t) is computed as 1-step ahead prediction error using a given model and a given dataset. This implies that even when the model is obtained by minimizing the simulation error es (t), the FPE and various AIC values are still computed using the prediction error ep (t). The actual value of ep (t) is determined using the pe command with prediction horizon of 1 and using the initial conditions specified for the estimation.

    These metrics contain two terms — one for describing the model accuracy and another to describe its complexity. For example, in FPE, det(1NET E) describes the model accuracy and 1+npN1npN describes the model complexity.

    By comparing models using these criteria, you can pick a model that gives the best (smallest criterion value) trade-off between accuracy and complexity.

Quality MetricDescription

FitPercent

Normalized Root Mean Squared Error (NRMSE) expressed as a percentage, defined as:

FitPercent= 100(1ymeasuredymodelymeasuredymeasured¯)

where:

  • ymeasured is the measured output data.

  • ymeasured¯ is its (channel-wise) mean.

  • ymodel is the simulated or predicted response of the model, governed by the Focus.

  • ||.|| indicates the 2-norm of a vector.

For input or output data, FitPercent is an ny-by-nexp matrix, where ny is the number of outputs and nexp is the number of experiments. For FRD data, FitPercent is an ny-by-nu matrix, where nu is the number of inputs (for FRD data, nexp is always one).

FitPercent varies between -Inf (bad fit) to 100 (perfect fit). If the value is equal to zero, then the model is no better at fitting the measured data than a straight line equal to the mean of the data.

LossFcn

Value of the loss function when the estimation completes. It contains effects of error thresholds, output weight, and regularization used for estimation.

MSE

Mean Squared Error measure, defined as:

MSE= 1Nt=1NeT(t) e(t)

where:

  • e(t) is the signal whose norm is minimized for estimation.

  • N is the number of data samples in the estimation dataset.

FPE

Akaike’s Final Prediction Error (FPE), defined as:

FPE=det(1NET E)(1+npN1npN)

where:

  • np is the number of free parameters in the model. np includes the number of estimated initial states.

  • N is the number of samples in the estimation dataset.

  • E is the N-by-ny matrix of prediction errors, where ny is the number of output channels.

AIC

A raw measure of Akaike's Information Criterion, defined as:

AIC=Nlog(det(1NET E))+2np+N(nylog(2π)+1)

AICc

Small sample-size corrected Akaike's Information Criterion, defined as:

AICc=AIC+2np(np+1)(Nnp1)

This metric is often more reliable for picking a model of optimal complexity from a list of candidate models when the data size N is small.

nAIC

Normalized measure of Akaike's Information Criterion, defined as:

nAIC=log(det(1NET E))+2npN

BIC

Bayesian Information Criterion, defined as:

BIC=Nlog(det(1NET E))+N(nylog(2π)+1)+nplog(N)

See Also

| | | | | |

Related Topics