How to do discontinuous piecewise linear model fitting?

47 ビュー (過去 30 日間)
Richard
Richard 2019 年 8 月 15 日
回答済み: Mathieu NOE 2021 年 2 月 1 日
Hi all,
I am interested in identifying the linear or linear piecewise model that best fits my data (cell activity versus spatial location, see figure). I have tried linear and continuous piecewise models with John D'Errico's SLM. However, it appears that a discontinuous fit (with 2 line segments, hand-drawn red lines) may be more appropriate than a continuous fit (with 3 line segments, black lines below). I did not find an option to implement discontinuous piecewise linear fits with SLM.
figure.png
My questions are:
1) Is there a simple way I can do discontinuous linear piecewise fits? and
2) Compare model fit of a) linear vs b) continuous linear piecewise vs c) discontinuous linear piecewise models (e.g. with AIC)? Thanks!
Data and code:
x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
slm = slmengine(x,y,'degree','linear',...
'interiorknots','free','knots',[-8 1.345 4.84 18],'plot','on','minval',0,'maxval',100);
ylim([0 5]);ylabel('Cell activity');
xlim([-7 18]);ylabel('Location');
  4 件のコメント
Bruno Luong
Bruno Luong 2019 年 8 月 15 日
編集済み: Bruno Luong 2019 年 8 月 15 日
IMO the "problem" is not due the fact that you impose the model to be continuous, in fact with free knot spline, when the 2 adjadcent knots get closer, and no point is falling stricly in the interval, it just makes a dummy transition, but the left and right model will fit the data independently as if you do not impose the continuity.
The reason that you don't like the fit is that the score (internally compute by SLM) is based on l^2 distance and as you data obviously do not follow a straight line you prefer the red line (but without telling us why), and the black line looks kind of not doing the job.
You should ask John if he can change SLM to other metric than 2-norm to fit your data (it is not clear to be what would be the score function to get the red line).
But again to my eye your data cannot be fitted by 2 lines.
Richard
Richard 2019 年 8 月 16 日
編集済み: Richard 2019 年 8 月 16 日
Hi Bruno,
"The reason that you don't like the fit is that ... you prefer the red line (but without telling us why), and the black line looks kind of not doing the job."
This is true. The data is actually cell activity across a section of brain tissue, and the dotted lines demarcate the known anatomical boundaries of brain regions (3 regions in total). We have an intuition that these regions are functionally distinct, and therefore we are looking to test whether the activity follows a linear gradient or a piecewise linear model. We do this by testing for interior knots at these boundaries, but we also allow the interior knots to fluctuate freely (as specified in SLM).
In this case, it appears to be a discontinuous linear model, and thus my question above. Do you have any additional insights or comments on this approach on fitting models?

サインインしてコメントする。

採用された回答

John D'Errico
John D'Errico 2019 年 8 月 15 日
SLM is not designed to solve the discontinuous linear case, nor would I ever decide to modify the code to allow that. (Sorry, but not gonna happen.) Nor, does it allow for some other noise model, though someone one time did modify the code to use a robust regression, but their modification was not sufficiently general for me to release. (In an older precursor code to SLM, I had allowed the user to indicate a 1-norm fit, but the option never seemed to be used, so I did not put that capability into SLM.)
If the break location is known, then the question is trivial. Just split the data between breaks. Fit a straight line within each break, ignoring any data that lies outside of the current interval. Polyfit will suffice for that.
If the break location is not known, then the problem is not simply amenable to use of an optimization tool to choose the break locations, because that problem will become a piecewise constant optimization. What do I mean by that? Here is your sample data:
x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
plot(x,y,'o')
grid on
If I look at your data,
untitled.jpg
the main break appears to lie somewhere between x=2.5 and x=3. There are no data points between those two points however. Suppose I chose a break at 2.6? 2.75? 2.9? If I did, see that the resulting fits would be IDENTICAL, since we just use only the data that lies blow or above that break to fit a line. The point is, when you do piecewise discontinuous fits, the objective to choose the breaks is logically the norm of the residuals (or some other simple measure of the residuals.) But the value of that objective does not change for any choice here of the break between x==2.5 and x==3. So the problem now becomes optimization of a piecewise constant function.
What can you do? The answer is simple enough, at least in 1-d. Just walk along the x axis. Put a break after the first two points, now fit the data as a pair of discontinuous line segments. Again, polyfit is sufficient, but you could do it easily enough using a simple matrix solve using one call to backslash. Then move the break to lie after the third data point, then after the 4th data point, etc. Choose the best fit as the one that has minimum overall norm of the residuals. You could even use robustfit here, if you prefer.
You ask if there is a simple way to compare the models? OF COURSE! Compute the residuals, then compute the norm of them. The one with the lowest norm is the "best" by some measure. Feel free to choose whatever norm you wish. A 1-norm will tend to decrease the impact of large residual points on the fit. (Sorry, but no, I won't write a complete code that compares all possible models, certainly not since anything I did write would immediately be subject to change, because your goals here are very loose and subjective.
I've give you a simple example though of how you might build a simple fit for a given break location.
x = x(:); % easier as column vectors
y = y(:);
xb1 = 2.75;
minx = min(x);
maxx = max(x);
edges = [minx,xb1,maxx];
bins = discretize(x,edges);
M = [x.*(bins == 1),bins == 1,x.*(bins == 2),bins == 2];
Now, we can choose a variety of tools for the fit here.
[rcoeffs,rstats] = robustfit(M,y,[],[],'off')
rcoeffs =
0.255468461914307
1.73482758648713
-0.0176408629185562
1.00411602902401
rstats =
struct with fields:
ols_s: 0.337737083788132
robust_s: 0.299596005270023
mad_s: 0.260947668710901
s: 0.314001985471304
resid: [28×1 double]
rstud: [28×1 double]
se: [4×1 double]
covb: [4×4 double]
coeffcorr: [4×4 double]
t: [4×1 double]
p: [4×1 double]
w: [28×1 double]
Qy: [4×1 double]
R: [4×4 double]
dfe: 24
h: [28×1 double]
Rtol: 1.97073061912377e-13
coeffs = M\y
coeffs =
0.276642857142857
1.75864285714286
-0.0156761776126963
1.00657352692892
In both cases, the coefficients are in the form [slope1,intercept1,slope2,intercept2], for the two lines estimated.
You can trivially plot the models and the data. And computing the norm of the residuals is as simple as:
preds = M*rcoeffs;
norm(preds - y)
ans =
1.6655356120512
Now you can simply compare the result of any fit to any other. If you want, vary the break point to lie between each pair of data points, as long as there are at least two data points in each interval.
plot(x,y,'o',x,M*rcoeffs,'r-',x,M*coeffs,'b-')
untitled.jpg
Not a lot of difference here between the OLS regression and the robustfit case. The connecting line across the break is not real, just an artifact of my being too lazy to do a better job in the plot.
  1 件のコメント
Richard
Richard 2019 年 8 月 16 日
編集済み: Richard 2019 年 8 月 16 日
Hi John, thank you very much for your detailed solution. I have managed to implement this, with a random walk down the x-axis.
I have three clarifications on model comparisons that I hope you can shed light on.
1) Is a 2-knot slm linear fit identical to the fitlm() OLS regression model? i.e.
slm = slmengine(x,y,'degree','linear','knots',2,...
'minval',0,'maxval',100,'plot','off');
mdl = fitlm(x,y);
2) If so, can the norm residuals of fitlm and slmstats be directly compared? I note that there is a scaling factor in slmstats.
function slmstats = modelstatistics(slmstats,y,coef,YScale,Mdes)
% generate model statistics, stuffing them into slmstats
% residuals, as yhat - y
resids = (Mdes*coef - y)./YScale;
3) Finally, is the norm residuals suitable for comparing models with different numbers of model coefficients, or is a penalty required for adding coefficients? I eventually need to compare linear piecewise models with 1-3 segments (red, black), versus the above discontinuous model (blue). I'm currently using the AIC formula below, which penalizes for the number of model coefficients by increasing AIC value. Across models, the model having the smallest AIC is better.
Example plot
AIC formula
Thanks!

サインインしてコメントする。

その他の回答 (2 件)

Bruno Luong
Bruno Luong 2019 年 8 月 16 日
編集済み: Bruno Luong 2019 年 8 月 16 日
"Do you have any additional insights or comments on this approach on fitting models?"
Well, I actually have problem to fit your data with just 2 lines. To me there are at least 4 regions with different linear models.
I can't suggest you an automatic way, but at least you can do semi-automatic by chosing the subintervals after running a free-knot splines.
I use here BSFK (very similar to SLM but I know it better)
This graph is obtained from this code
close all
x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
% https://fr.mathworks.com/matlabcentral/fileexchange/25872-free-knot-spline-approximation
pp = BSFK(x,y,4,20,[],struct('annimation',1,'sigma',0.001));
subintervals = pp.breaks([2 3;
5 6;
9 13;
13 15]); % selct manually the break points
subintervals = subintervals + [0.1 -0.1]; % reduce the intervals by some margin
hold on
for k=1:size(subintervals,1)
% Fit by a line for each interval
b = x >= subintervals(k,1) & x <= subintervals(k,2);
P = polyfit(x(b),y(b),1);
xlines = subintervals(k,:);
ylines = polyval(P,xlines);
plot(xlines,ylines,'k', 'Linewidth', 2);
end
  3 件のコメント
Bruno Luong
Bruno Luong 2019 年 8 月 16 日
編集済み: Bruno Luong 2019 年 8 月 16 日
1) How can I deduce the breakpoints from the plot above? I do see some trends from the red line plot, but I'm not sure which suggested breakpoints (in blue dotted lines) are to be selected and which to be ignored.
I use my eye as criteria. Sorry.
2) If my final discontinuous model is linear (4 black lines), is there any benefit to use a 3rd-order 20-knot spline to identify breakpoints (as you did), or should I stick with a lower order spline with less breakpoints?
The discontinuity break points are detected as those with large derivative, meaning the free knots are close to each other. As you data are clearly not a straight lines, I think it is better to use hight order to reduce unwanted artifact due to fitting non-linear data with linear one.
3) Furthermore, would it be right to specify knot positions only for a prior-driven model (tissue boundaries), and to use free knots for identifying the true best fit (ignoring known tissue boundaries)?
Yes, In my BSFK function you can impose fix knots and in the same time free knots in between.
Richard
Richard 2019 年 8 月 18 日
Thanks for your explanations and an alternative method of identifying breakpoints. I will likely use a combination of both answers.

サインインしてコメントする。


Mathieu NOE
Mathieu NOE 2021 年 2 月 1 日
hello
and fminsearch to create that code - seems to work finding the optimal values for 2 inner break points
% load x and y data
x=[-4.5,-4.0,-3.5,-3.0,-2.5,-2.0,-1.5,-1.0,-0.5,0,0.5,1.0,1.5,2.0,2.5,3.0,3.5,4.0,4.5,5.0,5.5,7.5,8.0,8.5,9.0,9.5,16.0,16.5];
y=[0.66,0.65,1.39,1.27,1.05,0.90,0.48,1.11,1.42,1.69,1.95,2.00,2.27,3.02,2.37,1.10,0.97,0.95,1.06,1.10,0.74,0.68,0.64,0.53,0.71,1.45,0.77,0.81];
xdata = x(:);
ydata = y(:);
% Init of fminsearch
a = (max(xdata)-min(xdata))/8;
b = a*2;
global xdata ydata
x0 = [a,b];
x = fminsearch(@obj_fun, x0);
a_sol = x(1);
b_sol = x(2);
XI = [min(xdata),a_sol,b_sol,max(xdata)]; % vector of 1-D look-up table "x" points
YI = lsq_lut_piecewise( xdata, ydata, XI ); % obtain vector of 1-D look-up table "y" points
% plot fit
plot(xdata,ydata,'.',XI,YI,'+-')
legend('experimental data (x,y(x))','LUT points (XI,YI)')
title('Piecewise 1-D look-up table least square estimation')
function err = obj_fun(x)
global xdata ydata
XI = [min(xdata),x(1),x(2),max(xdata)]; % vector of 1-D look-up table "x" points
YI = lsq_lut_piecewise( xdata, ydata, XI ); % obtain vector of 1-D look-up table "y" points
Yint = interp1(XI,YI,xdata);
err = sum((Yint-ydata).^2);
end

カテゴリ

Help Center および File ExchangeLinear and Nonlinear Regression についてさらに検索

製品


リリース

R2016b

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by