Matrix Multiplication on GPU quite slow?
24 views (past 30 days)
Hi, I just started using the GPU in MATLAB and hoped for considerable performance gains in matrix multiplication. I ran some performance tests and read up on the topic in various places, but my results are quite frustrating, and I found no good explanation online for these mixed results.
First, some hardware info: i5-4590 quad-core 3.30 GHz, 64-bit (Win 7, MATLAB R2016a); GeForce GT 640, 384 CUDA cores, ~1 GHz.
When running the tests, I saw some gains when multiplying two 1024x1024 matrices. But when looping over multiplications of 200x200 or 500x500 matrices, the GPU loses by roughly the clock-speed difference, while looping over comparable matrix additions speeds up as much as I had hoped.
I also get different timing results depending on whether I use tic/toc or timeit/gputimeit.
So here are my timing results, which mostly speak for themselves. The minimal example producing this output is attached.
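For reference, the core of the benchmark looks roughly like this (a minimal sketch, since the attached MinExample is not shown inline; variable names are illustrative):

```matlab
% Single 1024x1024 multiplication, timed on CPU and GPU.
n = 1024;
A = rand(n);          % CPU matrix (double precision)
G = gpuArray(A);      % copy of the same data in GPU memory

% CPU timing
tic; B = A*A; toc
tCpu = timeit(@() A*A);

% GPU timing: tic/toc without wait() only measures the asynchronous
% kernel launch, so force completion before stopping the clock.
tic; C = G*G; wait(gpuDevice); toc
tGpu = gputimeit(@() G*G);   % gputimeit synchronizes automatically
```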
-------------------------------------
Single Matrix Operation on 1024x1024
-------------------------------------
Standard CPU:
tic/toc
Elapsed time is 0.030685 seconds.
timeit
Elapsed time is 0.035352 seconds.
Let's check GPU:
tic/toc
Elapsed time is 0.000323 seconds.
Elapsed time is 0.000173 seconds.
timeit
Elapsed time is 0.061935 seconds.
Elapsed time is 0.061718 seconds.
-------------------------------------
Now starting some loops:
-------------------------------------
-------------------------------------
Matrix Addition n=10000:
-------------------------------------
-------------------------------------
Matrix is 600x600
-------------------------------------
Standard CPU:
Elapsed time is 1.675066 seconds.
Let's check GPU:
Elapsed time is 0.123021 seconds.
-------------------------------------
Matrix is 1000x1000
-------------------------------------
Standard CPU:
Elapsed time is 20.782437 seconds.
Let's check GPU:
Elapsed time is 0.119888 seconds.
-------------------------------------
Matrix Multiplication n=1000:
-------------------------------------
-------------------------------------
Matrix is 200x200
-------------------------------------
Standard CPU:
Elapsed time is 0.190912 seconds.
Let's check GPU:
Elapsed time is 0.751289 seconds.
-------------------------------------
Matrix is 500x500
-------------------------------------
Standard CPU:
Elapsed time is 2.620033 seconds.
Let's check GPU:
Elapsed time is 7.402474 seconds.
To summarize: a single 1024x1024 multiplication takes around 0.031s on the CPU. For the GPU, tic/toc reports only about 0.0003s, but timeit reports about 0.06s. First confusion here: does the choice of timing function really matter that much? Does the GPU really give a speedup?
Next, 1,000 multiplications of 500x500 matrices take 2.62s on the CPU and 7.40s on the GPU, losing by roughly the clock-speed difference.
For the 10,000 additions of 1000x1000 matrices, the GPU speeds things up dramatically, from 20.78s to 0.12s.
So is there a consistent way to get GPU speedups for matrix multiplication? Does the exact implementation matter a lot? What slows down the multiplication loop?
Thanks in advance,
Best, Sven
2 comments
2 Answers
Edric Ellis
8 Dec 2017
Note that your GPU (GT 640) is primarily a display card; high performance is usually achieved by dedicated "compute" cards such as the Tesla or Quadro families. Display cards typically have much worse performance in double precision than in single precision.
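One quick way to check how much this matters (an illustrative sketch, not part of the original answer) is to repeat the multiplication benchmark in single precision, where display cards like the GT 640 have far more throughput:

```matlab
% Compare double- vs single-precision matrix multiply on the GPU.
n = 1024;
Gd = gpuArray.rand(n);             % double precision on the GPU
Gs = single(Gd);                   % same data in single precision
tDouble = gputimeit(@() Gd*Gd);
tSingle = gputimeit(@() Gs*Gs);
fprintf('double: %.4f s, single: %.4f s\n', tDouble, tSingle);
```

If the single-precision time is dramatically lower, the double-precision throughput of the card is the bottleneck.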
Sven
11 Dec 2017
Edited: Sven, 12 Dec 2017
2 comments
Edric Ellis
2 Jan 2018
According to the Wikipedia description of NVIDIA Tesla cards, the K40 has a peak rate of ~1500 GFLOPS in double precision.
Therefore, the first graph you posted shows the gpuArray performance approaching ~80% of peak. So it's not clear why you consider this not to represent "serious parallelisation".
Also, please note that a CUDA core is really not comparable with an x86 CPU core: CUDA cores are much less powerful, and they are not capable of fully independent operation.
According to Wikipedia's list of other NVIDIA cards, the GT 640 has a peak double-precision throughput of ~30 GFLOPS. So, in your case, MATLAB is clearly taking full advantage of your GPU device.
Finally, it's worth bearing in mind that the CPU implementation of MTIMES is highly optimized and multi-threaded across the cores of your CPU.
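As a sanity check, the achieved FLOP rate can be estimated directly from the timings (a rough sketch; an n-by-n multiply costs about 2*n^3 floating-point operations):

```matlab
% Estimate achieved GFLOPS for an n-by-n GPU matrix multiply.
n = 1024;
G = gpuArray.rand(n);              % double-precision test matrix
t = gputimeit(@() G*G);            % synchronized median timing
gflops = 2*n^3 / t / 1e9;          % 2*n^3 flops per multiply
fprintf('Achieved: %.1f GFLOPS\n', gflops);
```

Comparing this number against the device's peak double-precision GFLOPS shows how close the hardware is to its limit.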