Main Content

*Regularization* is the technique for specifying constraints on the
flexibility of a model, thereby reducing uncertainty in the estimated parameter values.

Model parameters are obtained by fitting measured data to the predicted model response,
such as a transfer function with three poles or a second-order state-space model. The model
order is a measure of its flexibility — higher the order, the greater the flexibility.
For example, a model with three poles is more flexible than one with two poles. Increasing the
order causes the model to fit the observed data with increasing accuracy. However, the increased
flexibility comes with the price of higher uncertainty in the estimates, measured by a higher
value of random or *variance* error. On the other hand, choosing a model
with too low an order leads to larger systematic errors. Such errors cannot be attributed to
measurement noise and are also known as *bias* error.

Ideally, the parameters of a good model should minimize the *mean square error
(MSE)*, given by a sum of systematic error (bias) and random error
(variance):

MSE = |Bias|^{2} + Variance

The minimization is thus a tradeoff in constraining the model. A flexible (high order) model gives small bias and large variance, whereas a simpler (low order) model results in larger bias and smaller variance errors. Typically, you can investigate this tradeoff between bias and variance errors by cross-validation tests on a set of models of increasing flexibility. However, such tests do not always give full control in managing the parameter estimation behavior. For example:

You cannot use the known (

*a priori*) information about the model to influence the quality of the fits.In grey-box and other structured models, the order is fixed by the underlying ODEs and cannot be changed. If the data is not rich enough to capture the full range of dynamic behavior, this typically leads to high uncertainty in the estimated values.

Varying the model order does not let you explicitly shape the variance of the underlying parameters.

Regularization gives you a better control over the bias versus variance tradeoff by
introducing an additional term in the minimization criterion that penalizes the model
flexibility. Without regularization, for a model with one output and no weighting, the parameter
estimates are obtained by minimizing a weighted quadratic norm of the prediction errors
*ε*(*t*,*θ*):

$$\begin{array}{l}{V}_{N}\left(\theta \right)=\frac{1}{N}{\displaystyle \sum _{t=1}^{N}{\epsilon}^{2}(t,\theta )}\\ \end{array}$$

where *t* is the time variable, *N* is the number of data
samples, and *ε*(*t,θ*) is the predicted error computed as the
difference between the observed output and the predicted output of the model.

Regularization modifies the cost function by adding a term proportional to the square of
the norm of the parameter vector *θ*, so that the parameters
*θ* are obtained by minimizing:

${\widehat{V}}_{N}\left(\theta \right)=\frac{1}{N}{\displaystyle \sum _{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\frac{1}{N}\lambda {\Vert \theta \Vert}^{2}}$

where *λ* is a positive constant that has the effect of trading variance
error in V_{N}(θ) for bias error — the larger the value of λ, the
higher the bias and lower the variance of *θ*. The added term penalizes the
parameter values with the effect of keeping their values small during estimation. In statistics,
this type of regularization is called *ridge regression*. For more
information, see Ridge Regression (Statistics and Machine Learning Toolbox).

**Note**

Another choice for the norm of *θ* vector is the
L_{1}-norm, known as *lasso regularization*. However,
System Identification Toolbox™ supports only the 2-norm based penalty, known as L_{2}
regularization, as shown in the previous equation.

The penalty term is made more effective by using a positive definite matrix
*R*, which allows weighting and/or rotation of the parameter vector:

$${\widehat{V}}_{N}\left(\theta \right)=\frac{1}{N}{\displaystyle \sum _{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\text{}}\frac{1}{N}\lambda {\theta}^{T}R\theta $$

The square matrix *R* gives additional freedom for:

Shaping the penalty term to meet the required constraints, such as keeping the model stable

Adding known information about the model parameters, such as reliability of the individual parameters in the

*θ*vector

For structured models such as grey-box models, you may want to keep the estimated parameters close to their guess values to maintain the physical validity of the estimated model. This can be achieved by generalizing the penalty term to $$\lambda {\left(\theta -{\theta}^{*}\right)}^{T}R\left(\theta -{\theta}^{*}\right)$$, such that the cost function becomes:

$${\widehat{V}}_{N}\left(\theta \right)=\frac{1}{N}{\displaystyle \sum _{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\frac{1}{N}\lambda {\left(\theta -{\theta}^{*}\right)}^{T}R\left(\theta -{\theta}^{*}\right)}$$

Minimizing this cost function has the effect of estimating *θ* such that
their values remain close to initial guesses *θ**.

In regularization:

*θ**represents prior knowledge about the unknown parameters.*λ***R*represents the confidence in the prior knowledge of the unknown parameters. This implies that the larger the value, the higher the confidence.

A formal interpretation in a Bayesian setting is that θ has a prior distribution that is
Gaussian with mean θ* and covariance matrix $${\sigma}^{2}/\lambda {R}^{-1}$$, where σ^{2} is the variance of ε(t). The use of
regularization can therefore be linked to some prior information about the system. This could be
quite soft, such as the system is stable.

You can use the regularization variables *λ* and *R* as
tools to find a good model that balances complexity and provides the best tradeoff between bias
and variance. You can obtain regularized estimates of parameters for transfer function,
state-space, polynomial, grey-box, process, and nonlinear black-box models. The three terms
defining the penalty term, *λ*, *R* and *θ**,
are represented by regularization options `Lambda`

, `R`

, and
`Nominal`

, respectively in the toolbox. You can specify their values in the
estimation option sets for both linear and nonlinear models. In the System Identification app,
click **Regularization** in the linear model estimation dialog box or
**Estimation Options** in the Nonlinear Models dialog box.

Use regularization for:

Identifying overparameterized models.

Imposing

*a priori*knowledge of model parameters in structured models.Incorporating knowledge of system behavior in ARX and FIR models.

Over-parameterized models are rich in parameters. Their estimation typically yields
parameter values with a high level of uncertainty. Over-parameterization is common for
nonlinear ARX (`idnlarx`

) models and can also be for linear
state-space models using free parameterization.

In such cases, regularization improves the numerical conditioning of the estimation. You
can explore the bias-vs.-variance tradeoff using various values of the regularization constant
`Lambda`

. Typically, the `Nominal`

option is its default
value of `0`

, and `R`

is an identity matrix such that the
following cost function is minimized:

${\widehat{V}}_{N}\left(\theta \right)=\frac{1}{N}{\displaystyle \sum _{t=1}^{N}{\epsilon}^{2}\left(t,\theta \right)+\frac{1}{N}\lambda {\Vert \theta \Vert}^{2}}$

In the following example, a nonlinear ARX model estimation using a large number of neurons leads to an ill-conditioned estimation problem.

% Load estimation data. load regularizationExampleData.mat nldata % Estimate model without regularization. Orders = [1 2 1]; NL = sigmoidnet('NumberOfUnits',30); sys = nlarx(nldata,Orders,NL); compare(nldata,sys)

Applying even a small regularizing penalty produces a good fit for the model to the data.

```
% Estimate model using regularization constant λ = 1e-8.
opt = nlarxOptions;
opt.Regularization.Lambda = 1e-8;
sysr = nlarx(nldata,Orders,NL,opt);
compare(nldata,sysr)
```

In models derived from differential equations, the parameters have physical significance. You may have a good guess for typical values of those parameters even if the reliability of the guess may be different for each parameter. Because the model structure is fixed in such cases, you cannot simplify the structure to reduce variance errors.

Using the regularization constant `Nominal`

, you can keep the estimated
values close to their initial guesses. You can also design `R`

to reflect the
confidence in the initial guesses of the parameters. For example, if θ is a 2-element vector
and you can guess the value of the first element with more confidence than the second one, set
`R`

to be a diagonal matrix of size 2-by-2 such that R(1,1) >>
R(2,2).

In the following example, a model of a DC motor is parameterized by static gain
`G`

and time constant τ. From prior knowledge, suppose you know that
`G`

is about 4 and τ is about 1. Also, assume that you have more confidence
in the value of τ than `G`

and would like to guide the estimation to remain
close to the initial guess.

% Load estimation data. load regularizationExampleData.mat motorData % Create idgrey model for DC motor dynamics. mi = idgrey(@DCMotorODE,{'G',4;'Tau',1},'cd',{}, 0); mi = setpar(mi,'label','default'); % Configure Regularization options. opt = greyestOptions; opt.Regularization.Lambda = 100; % Specify that the second parameter better known than the first. opt.Regularization.R = [1, 1000]; % Specify initial guess as Nominal. opt.Regularization.Nominal = 'model'; % Estimate model. sys = greyest(motorData,mi,opt) getpar(sys)

In many situations, you may know the shape of the system impulse response from impact
tests. For example, it is quite common for stable systems to have an impulse response that is
smooth and exponentially decaying. You can use such prior knowledge of system behavior to
derive good values of regularization constants for linear-in-parameter models such as ARX and
FIR structure models using the `arxRegul`

command.

For black-box models of arbitrary structure, it is often difficult to determine the
optimal values of `Lambda`

and `R`

that yield the best
bias-vs.-variance tradeoff. Therefore, it is recommended that you start by obtaining the
regularized estimate of an ARX or FIR structure model. Then, convert the model to a
state-space, transfer function or polynomial model using the `idtf`

, `idss`

, or `idpoly`

commands, followed by order reduction if required.

In the following example, direct estimation of a 15th order continuous-time transfer function model fails due to numerical ill-conditioning.

% Load estimation data. load dryer2 Dryer = iddata(y2,u2,0.08); Dryerd = detrend(Dryer,0); Dryerde = Dryerd(1:500); xe = Dryerd(1:500); ze = Dryerd(1:500); zv = Dryerd(501:end); % Estimate model without regularization. sys1 = tfest(ze,15);

Therefore, use regularized ARX estimation and then convert the model to transfer function structure.

% Specify regularization constants. [L, R] = arxRegul(ze,[15 15 1]); optARX = arxOptions; optARX.Regularization.Lambda = L; optARX.Regularization.R = R; % Estimate ARX model. sysARX = arx(ze,[15 15 1],optARX); % Convert model to continuous time. sysc = d2c(sysARX); % Convert model to transfer function. sys2 = idtf(sysc); % Validate the models sys1 and sys2. compare(zv,sys1,sys2)

A guideline for selecting the regularization constants λ and `R`

is in the
Bayesian interpretation. The added penalty term is an assumption that the parameter vector θ is
a Gaussian random vector with mean θ* and covariance matrix $${\sigma}^{2}/\lambda {R}^{-1}$$.

You can relate naturally to such an assumption for a grey-box model, where the parameters
are of known physical interpretation. In other cases, this may be more difficult. Then, you have
to use ridge regression (`R`

= 1; θ* = 0) and tune λ by trial and error.

Use the following techniques for determining λ and `R`

values:

Incorporate prior information using tunable kernels.

Perform cross-validation tests.

Tuning the regularization constants for ARX models in `arxRegul`

is based on simple assumptions about the properties of the true impulse
responses.

In the case of an FIR model, the parameter vector contains the impulse response
coefficients b_{k} for the system. From prior knowledge of the system, it
is often known that the impulse response is smooth and exponentially decaying:

$$E{\left[{b}_{k}\right]}^{2}=C{\mu}^{k},\text{}corr\left\{{b}_{k}{b}_{k-1}\right\}=\rho $$

where *corr* means correlation. The equation is a parameterization of
the regularization constants in terms of coefficients C, μ, and ρ and the chosen shape
(decaying polynomial) is called a *kernel*. The kernel thus contains
information about parameterization of the prior covariance of the impulse response
coefficients.

You can estimate the parameters of the kernel by adjusting them to the measured data
using the `RegularizationKernel`

input of the `arxRegul`

command. For example, the `DC`

kernel estimates all
three parameters while the `TC`

kernel links $$\rho =\sqrt{\mu}$$. This technique of tuning kernels applies to all linear-in-parameter models
such as ARX and FIR models.

A general way to test and evaluate any regularization parameters is to estimate a model
based on certain parameters on an estimation data set, and evaluate the model fit for another
validation data set. This is known as *cross-validation*.

Cross-validation is entirely analogous to the method for selecting model order:

Generate a list of candidate λ and

`R`

values to be tested.Estimate a model for each candidate regularization constant set.

Compare the model fit to the validation data.

Use the constants that give the best fit to the validation data.

For example:

% Create estimation and validation data sets. ze = z(1:N/2); zv = z(N/2:end); % Specify regularization options and estimate models. opt = ssestOptions; for tests = 1:M opt.Regularization.Lambda = Lvalue(test); opt.Regularization.R = Rvalue(test); m{test} = ssest(ze,order,opt); end % Compare models with validation data for model fit. [~,fit] = compare(zv,m{:))

[1] L. Ljung. “Some Classical and Some New Ideas for
Identification of Linear Systems.” *Journal of Control, Automation and
Electrical Systems.* April 2013, Volume 24, Issue 1-2, pp 3-10.

[2] L. Ljung, and T. Chen. “What can regularization offer for estimation of
dynamical systems?” *In Proceedings of IFAC International Workshop on
Adaptation and Learning in Control and Signal Processing*, ALCOSP13, Caen, France,
July 2013.

[3] L. Ljung, and T. Chen. “Convexity issues in system identification.”
*In Proceedings of the 10th IEEE International Conference on Control &
Automation*, ICCA 2013, Hangzhou, China, June 2013.

- Estimate Regularized ARX Model Using System Identification App
- Regularized Identification of Dynamic Systems