# Fast method to find average pairwise distance of a very large matrix?

19 ビュー (過去 30 日間)
Anish Potnis 2020 年 1 月 10 日
コメント済み: Anish Potnis 2020 年 1 月 13 日
So I have a matrix that is 330,000 observations = rows x 160 variables = columns. I'd like to compute the average distance between each observation in my matrix, but pdist() fails here, because apparently it would take 405.6 GB to store the distance matrix...
Is there a reasonable vectorized (or fast) way to do something like this?
Any thoughts appreciated.
Thanks.

#### 0 件のコメント

サインイン to comment.

### 採用された回答

John D'Errico 2020 年 1 月 10 日
You cannot compute the 100 billion pairwise distances, because the matrix won't fit in memeory. (Ok, symmetry reduces it to be only 50 billion or so distances.)
But do you seriously, really, desperately need to compute ALL such pairwise distances? You want to know the mean distance. Do you desperately need to know the EXACT value of that mean distance?
The point is, just use a Monte Carlo sampling to get a very good estimate of the mean pairwise distance.
For example, I'll do this using a smaller matrix, just so we can get some result out in a reasonable time.
X = randn(10000,160);
mean(pdist(X))
ans =
17.868
Now, compute a simple random sample. Then compute just a very small subset of the total distances you would need for an exact solution.
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.864
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.868
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.856
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.872
mean(vecnorm(X - X(randperm(size(X,1)),:),2,2))
ans =
17.874
So 5 simple random permutations gave 5 different estimates. All were correct to within about 0.1% error. For a matrix with 330000 rows, the error would be even smaller. And if even that much error bothers you, then just take the average or median of several such samples.
I would point out that even for an array of size 330000x160, the computation I describe takes about a half second on my computer, almost as much time as it took to generate the array in the first place.
X = randn(330000,160);
tic,mean(vecnorm(X - X(randperm(size(X,1)),:),2,2));toc
Elapsed time is 0.585973 seconds.

#### 1 件のコメント

Anish Potnis 2020 年 1 月 13 日
I came to the same conclusion, thanks for your help!

サインイン to comment.

### その他の回答 (1 件)

Matt J 2020 年 1 月 10 日

I don't know about the mean distance, but the mean squared or root mean squared distance is pretty easy. Compare:
A=rand(330000/25,160); %The data
%% Without pdist2
tic;
D=mean(vecnorm(A,2,2).^2);
M=norm(mean(A,1))^2;
avgDist=2*(D-M);
toc %Elapsed time is 0.017454 seconds.
%% With pdist2
tic;
P=pdist2(A,A).^2;
avgPdist2=mean(P(:));
toc %Elapsed time is 1.716697 seconds.
Check:
>> avgDist,
avgDist =
26.6471
>> avgPdist2,
avgPdist2 =
26.6471

#### 0 件のコメント

サインイン to comment.

サインイン してこの質問に回答します。

R2018a