# Fit a curve on scatter data (main behaviour)

37 views (last 30 days)
Poria Divshali on 9 Jul 2020 at 11:45
Commented: Poria Divshali on 31 Jul 2020 at 7:10
I need to fit a line to some data, plotted as the weighted scatter diagram below; larger points have bigger weights. I used the 'fit' function, and the result is shown by the blue line in the figure. However, I want to find the green line, which I believe is the expected behaviour. The code and data are attached. I would appreciate any idea that helps me fit the green line.

#### 1 comment

Poria Divshali on 10 Jul 2020 at 6:20
Maybe one possible solution would be some kind of clustering algorithm. Visually, I can see two lines with different slopes (the green one and one with a lower slope). However, I could not find a good clustering method to separate these two groups.

### Answers (2)

Image Analyst on 9 Jul 2020 at 12:15
I'd first of all filter out known bad data, data that you know should not be included in the fit. I'd make a tentative fit for the green line, say passing through the points (20, 3000) and (60, 11000). Then for any data where the y value is more than 4000 above or below that line, I'd throw those points out. Then I'd fit a line through what remains. Something like this untested code:
```matlab
% Tentative green line through (20, 3000) and (60, 11000).
coefficients = polyfit([20, 60], [3000, 11000], 1);
keepers = true(1, length(x));
for k = 1 : length(x)
    % Get the point on the green line for this particular x value.
    yFit = polyval(coefficients, x(k));
    % If it's farther away than 4000, mark it for deletion.
    if abs(yFit - y(k)) > 4000
        keepers(k) = false;
    end
end
% Extract only the good data.
goodx = x(keepers);
goody = y(keepers);
% Now find a fit for only the good data.
goodCoefficients = polyfit(goodx, goody, 1);
yFit = polyval(goodCoefficients, x);
% Plot it.
plot(x, yFit, 'g-', 'LineWidth', 3);
```
It would be even better to use RANSAC. If you have the Computer Vision Toolbox, you can use fitPolynomialRANSAC(). This type of fit will ignore the other major clump of data between x=40 to 60 and y = 5000 to 6000 and give you a fit for just the larger, longer elliptical cluster that you want.
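A minimal sketch of the RANSAC suggestion, assuming `x` and `y` hold the data vectors and reusing the same 4000 inlier distance as above (that threshold would need tuning for other data sets):

```matlab
% Sketch (untested): degree-1 RANSAC fit with the Computer Vision Toolbox.
xyPoints = [x(:), y(:)];                              % M-by-2 point list
[P, inlierIdx] = fitPolynomialRANSAC(xyPoints, 1, 4000);
yFit = polyval(P, x);
hold on
plot(x, yFit, 'g-', 'LineWidth', 3);                  % RANSAC line
plot(x(~inlierIdx), y(~inlierIdx), 'rx');             % rejected outliers
```

RANSAC picks the line supported by the largest consensus set, so the secondary clump should simply end up among the outliers rather than dragging the slope down.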

#### 1 comment

Poria Divshali on 10 Jul 2020 at 6:17
Thanks for your suggestion. I need a more automatic way to remove the bad data, since the data will change between runs.
Thanks also for the suggestion to use fitPolynomialRANSAC(). I will study its algorithm to find out how to select an appropriate distance for this problem. It may work well.
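One way to make the fit-then-trim idea automatic, not taken from the answers above but a standard iterative-trimming sketch: refit repeatedly, dropping points whose residual exceeds a multiple of the residual spread, so the threshold adapts to each run's data instead of being hard-coded.

```matlab
% Sketch (untested): iterative outlier trimming, assuming x and y
% hold the data. The factor 2 (residuals beyond 2 standard
% deviations are dropped) is an assumption to tune.
keepers = true(size(x));
for iter = 1 : 10
    c = polyfit(x(keepers), y(keepers), 1);   % fit current inliers
    r = y - polyval(c, x);                    % residuals of ALL points
    newKeepers = abs(r) < 2 * std(r(keepers));
    if isequal(newKeepers, keepers)
        break;                                % membership stable: done
    end
    keepers = newKeepers;
end
plot(x(keepers), polyval(c, x(keepers)), 'g-', 'LineWidth', 3);
```

This converges quickly when one cluster dominates, but like any least-squares-based trimming it can still be pulled toward a secondary clump, which is why RANSAC is the more robust option.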

John D'Errico on 9 Jul 2020 at 13:40

You "want" the green line. I want the Buffalo Bills to win the Super Bowl. Probably not gonna happen in either case. ;-)
You want to solve a weighted linear least squares problem, but you don't really understand linear least squares. And that is the fundamental problem.
```matlab
plot(WX, WY, '.')
opts = fitoptions('Method', 'LinearLeastSquares');
opts.Weights = W;
f = fit(WX, WY, 'poly1', opts);
hold on
plot(f)
```
Linear least squares looks ONLY at errors in the y variable. Large positive or negative residuals, thus points that fall far above or below the line, will have more importance in the fit. They will drag the curve around. In this case, it is a straight line.
Now consider the green line that you "want" to see produced. Do you see a problem in this context? Down near x == 0, ALL of the errors will be above the line, at least if the green curve were the one we expected. Near x == 100, ALL of the errors will be BELOW the line.
In both cases, the line will be pulled into a position where the slope is reduced. While you see data that makes you WANT something, this is something the data tells me cannot happen. What you WANT is not relevant, because the tool you are using looks only at errors in the y variable. Sadly, I am pretty sure the tool cannot read your mind, nor does it really care about what you want to see happen. Computers do what they are programmed to do.
Worse, we have another problem that is just as serious. Your two variables have wildly different variation.
```matlab
stdx = std(WX)
% stdx =
%     15.2824457612656
stdy = std(WY)
% stdy =
%     1649.76961766046
```
We can also see that in the axis scaling. We can see the vast difference by forcing the two axes to have the same scaling.
```matlab
axis equal
```
Can the problem be solved to do what you want to see? For that, you are probably thinking in terms of what is called the total least squares problem. And of course, you have weights, so that will force me to actually write code. And since the two variables have hugely different variations, we need to deal with that too. Bah, humbug. I hate writing code. It makes me think.
The trick for total least squares is to use the SVD to do the work; this is sometimes called orthogonal regression, and other people call it principal component regression. As you can see here, I compute weighted means, subtract them off the data variables, then weight that matrix using W and rescale the variables to have unit variances. SVD does the hard work though.
```matlab
mux = sum(WX.*W)./sum(W);
muy = sum(WY.*W)./sum(W);
A = [(WX - mux).*W/stdx, (WY - muy).*W/stdy];
[U, S, V] = svd(A, 0);
S
% S =
%     240.408960729392                  0
%                    0   167.877291251367
V
% V =
%     0.246669782511878  -0.969099591577431
%     0.969099591577431   0.246669782511878
```
We choose the singular vector with the SMALLER singular value. That the two singular values are so close in magnitude tells me this regression is poorly posed, in the sense that the data ended up as a vaguely circular point cloud. Seriously, any line is arguably almost as good as any other in this case.
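A rough way to quantify "so close in magnitude" (my own addition, not from the answer): the ratio of the two singular values measures how elongated the weighted, rescaled point cloud is; a ratio near 1 means a nearly circular cloud and a barely determined regression direction.

```matlab
% Anisotropy of the rescaled cloud, using S from the svd above.
% 240.409/167.877 is only about 1.43, i.e. nearly circular.
anisotropy = S(1,1) / S(2,2)
```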
The weighted orthogonal regression line becomes
V(1,2)*(x - mux)/stdx + V(2,2)*(y - muy)/stdy = 0
I'll use MATLAB to display the equation in a standard form, mainly because I am feeling too lazy to do basic algebra. Just too hot out today.
```matlab
syms x y
y = vpa(solve(V(1,2)*(x - mux)/stdx + V(2,2)*(y - muy)/stdy, y), 5)
% y =
%     424.11*x - 11482.0
```
Now replot things, and see what happens.
```matlab
plot(f)
hold on
plot(WX, WY, 'b.')
H = fplot(matlabFunction(y), [35, 55]);
H.Color = 'g';
H.LineWidth = 3;
```
So the total least squares regression, using weights and a rescaling of the variables to have unit variances, works. Again, it was very close: I could almost have chosen the regression line orthogonal to the one we got. Your data is NOT well posed for regression, being an almost circular point cloud.
The Rolling Stones said it for me. "You can't always get what you want." Of course, you can freely decide the answer is exactly what you want to see. It is a scheme that works well for many politicians these days. :( But today, you got lucky. Que sera, sera... (I know somebody said that, but who? One "day" I'll remember.)

#### 9 comments

Image Analyst on 10 Jul 2020 at 13:12
Yes, but you still have not said what you're going to use it for, since the predictive ability of that line is virtually worthless. Why do you think it will mean anything or be helpful in the slightest?
John D'Errico on 11 Jul 2020 at 0:42
As Image Analyst points out, the resulting line is of moderately little value in terms of information content. When the point cloud is so roughly circular, the slope of the resulting line is virtually a random number.
Poria Divshali on 31 Jul 2020 at 7:10
You helped me a lot, thanks. Maybe the slope looks like a random number at first glance, but it is not. It is what I would expect from the theory behind the data. It is electricity market data: y is the price and x is the total production. I clustered the data and found at least two nice slopes. It is still not very accurate, but maybe good enough for a rough estimate of the system behaviour.
For better clustering, I may need to add some other features, which makes the work more complicated :)