When does distributing data improve computation time over overhead expense?
古いコメントを表示
I am trying to understand how to use distributed arrays or spmd code in order to optimize my computation. Specifically, I am trying to implement this in the context of an iterative solver. Essentially, within my algorithm code, I have just converted all of my input and starting vectors to distributed arrays. For example, the right hand side vector b and input matrix A become
A = distributed(A);
b = distributed(b);
The problem is that each iteration of the solver is now taking on the order of 100 times longer per iteration (and thus per solve) compared to when I did not distribute the data.
I am wondering if there is some threshhold at which the distributed memory wins out over not distributing, perhaps if the problem size gets large enough. My initial tests were on relatively small sparse matrices, around 3000x3000. But I have plenty more matrices of larger size to test on, up to around 10E6x10E6.
I also did some rudimentary tests just computing dot products of distributed random vectors. The time to compute a simple dot product where the vector was stored across distributed memory was consistently on the order of 100 times longer to compute compared to not distributing the vector.
rng('default');
v = rand(1,2000000); % Test vector across different sizes
tic
vv_true = v*v' % 1) compute dot product without distributing v
toc
v = distributed(v);
spmd
v % Verify the dot product distributed across workers
end
tic
vv_dist = v*v' % 2) compute dot product after distributing v
toc
This simple test is important to me because the iterative solver I am working on uses a substantially smaller number of dot products compared to traditional iterative methods like BiCGStab. In my tests, I really would like to show the speedup from the algorithm structure when the problem is done in the context of distributed memory, as I expect it would be in practice if the solver were used to solve really large data sets.
So, it is in one sense reassuring to see that computing dot products does take longer in a distributed memory context compared to a non-distributed (is there another term?) context. But, in another sense it is incredibly impractical to solve these intensive and time consuming problems if it is actually going to take 100 times longer. I simply don't even have the compute time on the cluster I work with to turn a 6-hour problem into a 600-hour one.
My question is three-fold:
- Am I using distributed memory correctly and in the most efficient way in the context of an iterative solver?
- At what point does the overhead cost of transferring the results between workers become smaller than the benefit of having multiple workers do the work?
- In the simple dot product code above, should it really take ~100 times longer to compute the dot product on a distributed data framework compared to a non-distributed one?
Thank you!
回答 (3 件)
Let’s break your three questions down clearly.
1) Yes, syntactically and conceptually, your usage is correct. However, correctness is not the same as efficiency. The issue is where the work is happening and what communication is required.
2) Distributed memory helps when compute cost >> communication cost. That is the key. For example:
a) Dot product: ca. 2 flops per element has very low compute cost, but global communication is very expensive relative to compute cost. Terrible for distributed.
b) Dense matrix-matrix multiply: High compute intensity. Excellent scaling
Rule of thumb: distributed MATLAB starts to make sense when the data does not fit in memory on one node OR per-worker workload is large enough (seconds, not milliseconds). At your test sizes:
- 3000×3000 sparse matrix is tiny in HPC terms
- Even 2e6 vector is still modest
Communication cost dominates, so distributed is slower.
3) Yes
A dot product is quite low computational burden as compared to memory fetch.
Unless the sparsity is so low as to make the memory requirements for the storage of the elements exceed physical memory, you'll invariably do better single-threaded.
The fastest times will be by using the GPU and gpuArray. If you can stand it, using single instead of double precision cuts down the memory by 2X and will likely also show a speed improvement thereby.
2 件のコメント
Benjamin
約2時間 前
dpb
19分 前
To win with distributed memory the Arithmetic Intensity has to be greater than the hardware limitation of compute throughput over memory bandwidth.
It's pretty simple to illustrate by example (borrowed from an NVIDIA source some time back)--
If you had a hypothetical device with 100 Gb/s of memory bandwidth, then you have at most 25 floats/s of memory bandwidth. If the device had 625 FLOP/s of peak single precision arithmetic throughput, then any code which achieved peak arithmetic and memory throughput would have to be performing 25 FLOP per memory transaction. If the code performs less than that, it will be memory bandwidth bound.
One caveat if went to GPU is whether the data already reside on the GPU itself or not -- the overhead to get it there and retrieve results has to be counted in the overall expense as well -- if the data were already there, then a ratio less than minimum might still outperform the base CPU overall because of the higher processing rate of the GPU plus data transfer is could be less than straight CPU compute without the offload memory.
You might want to see <Linear Systems Direct Methods Distributed> and the corollary <Linear Systems Iterative Methods Distributed>
Are your parallel workers all on a single computer, or are they distributed across a cloud? If all of this is on a single computer, it is unlikely that you will be able to use SPMD or other CPU-based parallelization functions to accelerate basic matrix operations. Remember, standard Matlab execution already uses internal parallelization routines to distribute matrix operations across local CPU cores. Typically, it does so in a much more optimized way than SPMD et al. can achieve. If you want to accelerate things at the level of individual matrix operations, you would probably want to use gpuArrays.
カテゴリ
ヘルプ センター および File Exchange で Parallel Computing Fundamentals についてさらに検索
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!