How to do discontinuous piecewise linear model fitting?

Question

Richard 2019 年 8 月 15 日

0
リンク

この質問への直接リンク

https://jp.mathworks.com/matlabcentral/answers/476239-how-to-do-discontinuous-piecewise-linear-model-fitting

回答済み: Mathieu NOE 2021 年 2 月 1 日

Hi all,

I am interested in identifying the linear or linear piecewise model that best fits my data (cell activity versus spatial location, see figure). I have tried linear and continuous piecewise models with John D'Errico's SLM. However, it appears that a discontinuous fit (with 2 line segments, hand-drawn red lines) may be more appropriate than a continuous fit (with 3 line segments, black lines below). I did not find an option to implement discontinuous piecewise linear fits with SLM.

My questions are:

1) Is there a simple way I can do discontinuous linear piecewise fits? and

2) Compare model fit of a) linear vs b) continuous linear piecewise vs c) discontinuous linear piecewise models (e.g. with AIC)? Thanks!

Data and code:

x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
slm = slmengine(x,y,'degree','linear',...
'interiorknots','free','knots',[-8 1.345 4.84 18],'plot','on','minval',0,'maxval',100);
ylim([0 5]);ylabel('Cell activity');
xlim([-7 18]);ylabel('Location');

4 件のコメント
2 件の古いコメントを表示2 件の古いコメントを非表示

Bruno Luong 2019 年 8 月 15 日

編集済み: Bruno Luong 2019 年 8 月 15 日

IMO the "problem" is not due the fact that you impose the model to be continuous, in fact with free knot spline, when the 2 adjadcent knots get closer, and no point is falling stricly in the interval, it just makes a dummy transition, but the left and right model will fit the data independently as if you do not impose the continuity.

The reason that you don't like the fit is that the score (internally compute by SLM) is based on l^2 distance and as you data obviously do not follow a straight line you prefer the red line (but without telling us why), and the black line looks kind of not doing the job.

You should ask John if he can change SLM to other metric than 2-norm to fit your data (it is not clear to be what would be the score function to get the red line).

But again to my eye your data cannot be fitted by 2 lines.

Richard 2019 年 8 月 16 日

編集済み: Richard 2019 年 8 月 16 日

Hi Bruno,

"The reason that you don't like the fit is that ... you prefer the red line (but without telling us why), and the black line looks kind of not doing the job."

This is true. The data is actually cell activity across a section of brain tissue, and the dotted lines demarcate the known anatomical boundaries of brain regions (3 regions in total). We have an intuition that these regions are functionally distinct, and therefore we are looking to test whether the activity follows a linear gradient or a piecewise linear model. We do this by testing for interior knots at these boundaries, but we also allow the interior knots to fluctuate freely (as specified in SLM).

In this case, it appears to be a discontinuous linear model, and thus my question above. Do you have any additional insights or comments on this approach on fitting models?

サインインしてコメントする。

サインインしてこの質問に回答する。

Answer 1

John D'Errico 2019 年 8 月 15 日

2
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/476239-how-to-do-discontinuous-piecewise-linear-model-fitting#answer_387707

MATLAB Online で開く

SLM is not designed to solve the discontinuous linear case, nor would I ever decide to modify the code to allow that. (Sorry, but not gonna happen.) Nor, does it allow for some other noise model, though someone one time did modify the code to use a robust regression, but their modification was not sufficiently general for me to release. (In an older precursor code to SLM, I had allowed the user to indicate a 1-norm fit, but the option never seemed to be used, so I did not put that capability into SLM.)

If the break location is known, then the question is trivial. Just split the data between breaks. Fit a straight line within each break, ignoring any data that lies outside of the current interval. Polyfit will suffice for that.

If the break location is not known, then the problem is not simply amenable to use of an optimization tool to choose the break locations, because that problem will become a piecewise constant optimization. What do I mean by that? Here is your sample data:

x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
plot(x,y,'o')
grid on

If I look at your data,

the main break appears to lie somewhere between x=2.5 and x=3. There are no data points between those two points however. Suppose I chose a break at 2.6? 2.75? 2.9? If I did, see that the resulting fits would be IDENTICAL, since we just use only the data that lies blow or above that break to fit a line. The point is, when you do piecewise discontinuous fits, the objective to choose the breaks is logically the norm of the residuals (or some other simple measure of the residuals.) But the value of that objective does not change for any choice here of the break between x==2.5 and x==3. So the problem now becomes optimization of a piecewise constant function.

What can you do? The answer is simple enough, at least in 1-d. Just walk along the x axis. Put a break after the first two points, now fit the data as a pair of discontinuous line segments. Again, polyfit is sufficient, but you could do it easily enough using a simple matrix solve using one call to backslash. Then move the break to lie after the third data point, then after the 4th data point, etc. Choose the best fit as the one that has minimum overall norm of the residuals. You could even use robustfit here, if you prefer.

You ask if there is a simple way to compare the models? OF COURSE! Compute the residuals, then compute the norm of them. The one with the lowest norm is the "best" by some measure. Feel free to choose whatever norm you wish. A 1-norm will tend to decrease the impact of large residual points on the fit. (Sorry, but no, I won't write a complete code that compares all possible models, certainly not since anything I did write would immediately be subject to change, because your goals here are very loose and subjective.

I've give you a simple example though of how you might build a simple fit for a given break location.

x = x(:); % easier as column vectors
y = y(:);
xb1 = 2.75;
minx = min(x);
maxx = max(x);
edges = [minx,xb1,maxx];
bins = discretize(x,edges);
M = [x.*(bins == 1),bins == 1,x.*(bins == 2),bins == 2];

Now, we can choose a variety of tools for the fit here.

[rcoeffs,rstats] = robustfit(M,y,[],[],'off')
rcoeffs =
         0.255468461914307
          1.73482758648713
       -0.0176408629185562
          1.00411602902401
          
rstats = 
  struct with fields:
        ols_s: 0.337737083788132
     robust_s: 0.299596005270023
        mad_s: 0.260947668710901
            s: 0.314001985471304
        resid: [28×1 double]
        rstud: [28×1 double]
           se: [4×1 double]
         covb: [4×4 double]
    coeffcorr: [4×4 double]
            t: [4×1 double]
            p: [4×1 double]
            w: [28×1 double]
           Qy: [4×1 double]
            R: [4×4 double]
          dfe: 24
            h: [28×1 double]
         Rtol: 1.97073061912377e-13
          
coeffs = M\y
coeffs =
         0.276642857142857
          1.75864285714286
       -0.0156761776126963
          1.00657352692892

In both cases, the coefficients are in the form [slope1,intercept1,slope2,intercept2], for the two lines estimated.

You can trivially plot the models and the data. And computing the norm of the residuals is as simple as:

preds = M*rcoeffs;
norm(preds - y)
ans =
           1.6655356120512

Now you can simply compare the result of any fit to any other. If you want, vary the break point to lie between each pair of data points, as long as there are at least two data points in each interval.

plot(x,y,'o',x,M*rcoeffs,'r-',x,M*coeffs,'b-')

Not a lot of difference here between the OLS regression and the robustfit case. The connecting line across the break is not real, just an artifact of my being too lazy to do a better job in the plot.

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

Richard 2019 年 8 月 16 日

編集済み: Richard 2019 年 8 月 16 日

MATLAB Online で開く

Hi John, thank you very much for your detailed solution. I have managed to implement this, with a random walk down the x-axis.

I have three clarifications on model comparisons that I hope you can shed light on.

1) Is a 2-knot slm linear fit identical to the fitlm() OLS regression model? i.e.

slm = slmengine(x,y,'degree','linear','knots',2,...
    'minval',0,'maxval',100,'plot','off');
mdl = fitlm(x,y);

2) If so, can the norm residuals of fitlm and slmstats be directly compared? I note that there is a scaling factor in slmstats.

function slmstats = modelstatistics(slmstats,y,coef,YScale,Mdes)
% generate model statistics, stuffing them into slmstats
 
% residuals, as yhat - y
resids = (Mdes*coef - y)./YScale;

3) Finally, is the norm residuals suitable for comparing models with different numbers of model coefficients, or is a penalty required for adding coefficients? I eventually need to compare linear piecewise models with 1-3 segments (red, black), versus the above discontinuous model (blue). I'm currently using the AIC formula below, which penalizes for the number of model coefficients by increasing AIC value. Across models, the model having the smallest AIC is better.

Example plot

AIC formula

Thanks!

サインインしてコメントする。

Answer 2

Bruno Luong 2019 年 8 月 16 日

1
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/476239-how-to-do-discontinuous-piecewise-linear-model-fitting#answer_387809

編集済み: Bruno Luong 2019 年 8 月 16 日

MATLAB Online で開く

"Do you have any additional insights or comments on this approach on fitting models?"

Well, I actually have problem to fit your data with just 2 lines. To me there are at least 4 regions with different linear models.

I can't suggest you an automatic way, but at least you can do semi-automatic by chosing the subintervals after running a free-knot splines.

I use here BSFK (very similar to SLM but I know it better)

This graph is obtained from this code

close all
x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
% https://fr.mathworks.com/matlabcentral/fileexchange/25872-free-knot-spline-approximation
pp = BSFK(x,y,4,20,[],struct('annimation',1,'sigma',0.001));
subintervals = pp.breaks([2 3;
                         5 6;
                         9 13;
                         13 15]); % selct manually the break points
subintervals = subintervals + [0.1 -0.1]; % reduce the intervals by some margin
hold on
for k=1:size(subintervals,1)
    % Fit by a line for each interval
    b = x >= subintervals(k,1) & x <= subintervals(k,2);
    P = polyfit(x(b),y(b),1);
    xlines = subintervals(k,:);
    ylines = polyval(P,xlines);
    plot(xlines,ylines,'k', 'Linewidth', 2);
end

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

Bruno Luong 2019 年 8 月 16 日

編集済み: Bruno Luong 2019 年 8 月 16 日

1) How can I deduce the breakpoints from the plot above? I do see some trends from the red line plot, but I'm not sure which suggested breakpoints (in blue dotted lines) are to be selected and which to be ignored.

I use my eye as criteria. Sorry.

2) If my final discontinuous model is linear (4 black lines), is there any benefit to use a 3rd-order 20-knot spline to identify breakpoints (as you did), or should I stick with a lower order spline with less breakpoints?

The discontinuity break points are detected as those with large derivative, meaning the free knots are close to each other. As you data are clearly not a straight lines, I think it is better to use hight order to reduce unwanted artifact due to fitting non-linear data with linear one.

3) Furthermore, would it be right to specify knot positions only for a prior-driven model (tissue boundaries), and to use free knots for identifying the true best fit (ignoring known tissue boundaries)?

Yes, In my BSFK function you can impose fix knots and in the same time free knots in between.

Richard 2019 年 8 月 18 日

Thanks for your explanations and an alternative method of identifying breakpoints. I will likely use a combination of both answers.

サインインしてコメントする。

Answer 3

Mathieu NOE 2021 年 2 月 1 日

0
リンク

この回答への直接リンク

https://jp.mathworks.com/matlabcentral/answers/476239-how-to-do-discontinuous-piecewise-linear-model-fitting#answer_611423

MATLAB Online で開く

hello

I combined : Piecewise linear least square fit - File Exchange - MATLAB Central (mathworks.com)

and fminsearch to create that code - seems to work finding the optimal values for 2 inner break points

% load x and y data
x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
xdata = x(:);
ydata = y(:);
% Init of fminsearch 
a = (max(xdata)-min(xdata))/8;
b = a*2;
global xdata ydata
x0 = [a,b];
x = fminsearch(@obj_fun, x0);
a_sol = x(1);
b_sol = x(2);
XI = [min(xdata),a_sol,b_sol,max(xdata)]; % vector of 1-D look-up table "x" points
YI = lsq_lut_piecewise( xdata, ydata, XI ); % obtain vector of 1-D look-up table "y" points
% plot fit
plot(xdata,ydata,'.',XI,YI,'+-')
legend('experimental data (x,y(x))','LUT points (XI,YI)')
title('Piecewise 1-D look-up table least square estimation')
function err = obj_fun(x)
global xdata ydata
    XI = [min(xdata),x(1),x(2),max(xdata)];  % vector of 1-D look-up table "x" points
    YI = lsq_lut_piecewise( xdata, ydata, XI ); % obtain vector of 1-D look-up table "y" points
    Yint = interp1(XI,YI,xdata);
    err = sum((Yint-ydata).^2);
end

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

サインインしてコメントする。

How to do discontinuous piecewise linear model fitting?

4 件のコメント
2 件の古いコメントを表示2 件の古いコメントを非表示

採用された回答

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

その他の回答 (2 件)

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

How to do discontinuous piecewise linear model fitting?

4 件のコメント 2 件の古いコメントを表示2 件の古いコメントを非表示

採用された回答

1 件のコメント -1 件の古いコメントを表示-1 件の古いコメントを非表示

その他の回答 (2 件)

3 件のコメント 1 件の古いコメントを表示1 件の古いコメントを非表示

0 件のコメント -2 件の古いコメントを表示-2 件の古いコメントを非表示

参考

カテゴリ

タグ

製品

リリース

Community Treasure Hunt

4 件のコメント
2 件の古いコメントを表示2 件の古いコメントを非表示

1 件のコメント
-1 件の古いコメントを表示-1 件の古いコメントを非表示

3 件のコメント
1 件の古いコメントを表示1 件の古いコメントを非表示

0 件のコメント
-2 件の古いコメントを表示-2 件の古いコメントを非表示