In your original code you didn't define n, so I may have got its definition wrong here, but since it's just a scalar it won't affect the timing at all. It seems that using a loop is faster here, but the amount of speed-up depends quite a lot on the data size. Using your original numbers, the loop is slightly faster:
ECDF = squeeze(sum(all(permute(bsxfun(@le, data, permute(u, [3,2,1])), [2,1,3])))) / n;
ECDF1 = squeeze(sum(all(permute(data <= permute(u, [3,2,1]), [2,1,3])))) / n;
validPoints = (data(:,1) <= u(i,1)) & (data(:,2) <= u(i,2));
count = sum(validPoints);
fprintf("Your original takes %f seconds\n",original)
Your original takes 0.028602 seconds
fprintf("Roberson's version takes %f seconds\n",Roberson);
Roberson's version takes 0.026836 seconds
fprintf("Loopy version takes %f seconds\n",loopy)
Loopy version takes 0.017854 seconds
fprintf("Loopy version is %f times faster than the original\n",original/loopy)
Loopy version is 1.601994 times faster than the original
fprintf("Do loopy version and original version give same result?\n")
Do loopy version and original version give same result?
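The fragments above leave out the setup and the loop itself. Here is a minimal sketch of how the comparison might be put together. The sizes, the random test data, the choice of n as the number of rows in data, and the name ECDF2 are all assumptions for illustration, not your actual inputs:

```matlab
% Sketch only: sizes, random data and the name ECDF2 are assumptions.
rng(0);                 % fixed seed so repeated runs use the same data
n = 1000;               % number of data points (assumed meaning of n)
m = 10000;              % number of evaluation points
data = rand(n, 2);      % n-by-2 sample
u = rand(m, 2);         % m-by-2 points at which the ECDF is evaluated

% Vectorised version: builds an n-by-2-by-m logical array in memory.
tic
ECDF = squeeze(sum(all(permute(bsxfun(@le, data, permute(u, [3,2,1])), [2,1,3])))) / n;
original = toc;

% Loop version: one pass over the data per evaluation point,
% so only n-by-1 temporaries are ever created.
ECDF2 = zeros(m, 1);
tic
for i = 1:m
    validPoints = (data(:,1) <= u(i,1)) & (data(:,2) <= u(i,2));
    count = sum(validPoints);
    ECDF2(i) = count / n;
end
loopy = toc;

fprintf("Loopy version is %f times faster than the original\n", original/loopy)
fprintf("Same result: %d\n", isequal(ECDF, ECDF2))
```

The likely reason the loop wins is memory: the vectorised form has to allocate and traverse the full n-by-2-by-m array, while the loop only touches small temporaries. Single tic/toc timings are noisy, so it is worth running the script a few times before comparing.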
Increasing the size of both n and m by 10x gives a much better speed-up for the loopy version:
ECDF = squeeze(sum(all(permute(bsxfun(@le, data, permute(u, [3,2,1])), [2,1,3])))) / n;
ECDF1 = squeeze(sum(all(permute(data <= permute(u, [3,2,1]), [2,1,3])))) / n;
validPoints = (data(:,1) <= u(i,1)) & (data(:,2) <= u(i,2));
count = sum(validPoints);
fprintf("Your original takes %f seconds\n",original)
Your original takes 2.133713 seconds
fprintf("Roberson's version takes %f seconds\n",Roberson);
Roberson's version takes 2.179682 seconds
fprintf("Loopy version takes %f seconds\n",loopy)
Loopy version takes 0.494672 seconds
fprintf("Loopy version is %f times faster than the original\n",original/loopy)
Loopy version is 4.313389 times faster than the original
fprintf("Do loopy version and original version give same result?\n")
Do loopy version and original version give same result?
parfor can help for large enough m and n.
On my 4-year-old, 8-core machine I found that for large m and n I can get even more speed-up by using parfor on the loop i=1:m together with a thread pool, i.e. parpool('threads'). Using m=100000 and n=10000 as above, for example, I get the following times without parfor:
Your original takes 1.213787 seconds
Roberson's version takes 1.332038 seconds
Loopy version takes 0.368781 seconds
Loopy version is 3.291346 times faster than the original
and with parfor:
Your original takes 1.184120 seconds
Roberson's version takes 1.402003 seconds
Loopy version takes 0.224152 seconds
Loopy version is 5.282667 times faster than the original
For smaller values of m and n, parfor may be slower than a plain for loop, and where the crossover happens will be machine dependent.
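A minimal sketch of the parfor variant, assuming the Parallel Computing Toolbox is installed and your release supports thread-based pools. The sizes, the random data and the names ECDF3 and ui are assumptions for illustration:

```matlab
% Sketch only: requires Parallel Computing Toolbox; sizes, data and
% the names ECDF3 / ui are assumptions.
rng(0);
n = 1000;               % number of data points (assumed meaning of n)
m = 10000;              % number of evaluation points
data = rand(n, 2);
u = rand(m, 2);

% Start a thread-based pool only if no pool is already running.
if isempty(gcp('nocreate'))
    parpool('threads');
end

ECDF3 = zeros(m, 1);
tic
parfor i = 1:m
    ui = u(i, :);       % take the whole row once so u is sliced, not broadcast
    validPoints = (data(:,1) <= ui(1)) & (data(:,2) <= ui(2));
    ECDF3(i) = sum(validPoints) / n;
end
loopyPar = toc;

fprintf("parfor loopy version takes %f seconds\n", loopyPar)
```

Time the loop a second time before drawing conclusions, since the first parfor run on a fresh pool can include start-up overhead.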
Hope this helps