Upgrading PC: tips for finding bottlenecks?

Hello,
We're currently running a simulation (of a grid of resistors). The main portion of it is solving a large (~1 million × 1 million or bigger) sparse symmetric positive definite (SPD) system with a preconditioned conjugate gradient (PCG) method, many times (once per time step).
Is there a way to check whether memory bandwidth is an issue? We're currently looking at an i7 6700K, but that would restrict us to dual-channel RAM. I think this is probably fine, but it would be nice to confirm before we buy any hardware.
Also, MATLAB's sparse library requires double precision. However, GPUs seem to be optimized for single precision. When MATLAB does GPU calculations, does it use pseudo double precision (which would incur a slowdown of roughly 1/32)?
In addition, is there a way to check whether the PCG method would work in single precision? It would introduce error, but we'd like to know whether the amount is acceptable. If the error is too large, we'd have to stick to CPU calculations instead of the GPU. From what I've found online, most sparse libraries use double precision, and forcing single precision is at your own risk.
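For reference, here is a sketch of the kind of experiment we have in mind: keep the sparse solve in double (since MATLAB requires it) but round every matrix-vector product through single via a function handle, then compare against the full double-precision solve. The matrix here is a small SPD stand-in, not our real resistor-grid matrix:

```matlab
% Emulate a single-precision matvec inside pcg by rounding through single
% (MATLAB sparse matrices must stay double), then compare with the
% reference double-precision solve.  Small SPD stand-in matrix only.
n = 2000;
A = gallery('tridiag', n, -1, 2, -1);   % sparse SPD test matrix
b = ones(n, 1);
tol = 1e-6;  maxit = 500;
M = diag(diag(A));                      % Jacobi preconditioner

[xd, fl1] = pcg(A, b, tol, maxit, M);   % reference double-precision run

afun = @(x) double(single(A * double(single(x))));  % matvec rounded to single
[xs, fl2] = pcg(afun, b, tol, maxit, M);

fprintf('relative difference: %g\n', norm(xd - xs) / norm(xd));
```

This only rounds the matvec rather than running every operation in single, so it gives a rough lower bound on the error, not a guarantee.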

1 Answer

Walter Roberson on 28 Nov 2016

1 vote

"When MATLAB is doing GPU calculations, does it use pseudo double precision?"
No. The Parallel Computing Toolbox requires GPUs with enough compute capability to handle double themselves. The precision it uses for a GPU operation is whatever precision is associated with the data to be processed. If your GPU only does double precision slowly then the result will be slow. MATLAB makes no attempt to emulate double precision with single precision.
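For example (illustrative only; requires Parallel Computing Toolbox and a supported CUDA device):

```matlab
% gpuArray arithmetic runs in whatever precision the data carries:
Ad = gpuArray(rand(1000));            % double-precision data on the GPU
As = gpuArray(single(rand(1000)));    % single-precision data on the GPU
classUnderlying(Ad * Ad)              % 'double' -- computed in double
classUnderlying(As * As)              % 'single' -- computed in single
```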
"However, when looking at GPUs, it seems that they're optimized for single precision."
As you are looking at getting a new system with a GPU, you should consider one of the new Pascal-architecture cards, which offer much higher double-precision performance than previous architectures did. The new series is available to some supercomputer and academic sites in the USA, and is expected to be released to the public in January.
Caution: there are some operations that the Pascal architecture library does not handle properly yet due to bugs on the NVIDIA side. One of the major affected algorithms is Convolutional Neural Networks. I have no information about whether PCG works on Pascal.
Looking at https://www.mathworks.com/matlabcentral/answers/285134-preconditioning-algorithm-on-gpu-for-solution-of-sparse-matrices and following the links to http://docs.nvidia.com/cuda/cusparse/#cusparse-lt-t-gt-csrsmsolve , I see that NVIDIA implements cuSPARSE for single precision as well as double precision.

5 Comments

Walter Roberson on 28 Nov 2016
Joshua, which OS are you using?
For Linux I see a memory bandwidth tool at https://panthema.net/2013/pmbw/ . I tried to compile it on OS X but ran into a series of problems.
I also see http://zsmith.co/bandwidth.html which lists multiple operating systems. However, the project page for it seems to be missing.
Joshua Lauzier on 29 Nov 2016
Hi Walter, thanks for the help.
We're only building the PC for in-group use, and we don't have access to a proper supercomputing facility. Budget-wise, we're stuck in the ~$1-1.5k range. The only options in that range (if we need double precision) seem to be the older NVIDIA Titan Black/Titan Z, which seem to be impossible to find new (I'm a bit hesitant to buy one secondhand off Amazon).
We may just end up getting one of the new NVIDIA Titan X cards with the Pascal architecture, but we didn't want to take the double-precision hit, since NVIDIA has dropped the option the Black/Z had to mitigate the performance loss (it seems they're trying to push professionals to the Tesla line, but even the lowliest Tesla is well outside our budget).
Joshua Lauzier on 29 Nov 2016 (edited)
We're currently using Windows, but I can probably switch to Linux without too much trouble. There's no big preference; I assumed (possibly wrongly) that Windows support/optimization for MATLAB would be better simply due to the larger userbase.
Walter Roberson on 29 Nov 2016
Working notes; specs put together from various sources:
  • Titan X - Maxwell. 200 Gflop FP64. GM200 GPU
  • Titan Z - Kepler. 2700 Gflop FP64
  • Titan Black - Kepler. 1707 Gflop FP64
  • Titan X (Pascal) - Pascal. GP102 GPU. FP32 = 11 teraflop, FP64 = 1/32 FP32, so should be about 330 - 343 Gflop
  • Tesla P100 - Pascal. FP64 = 4.7 teraflop
"While the Titan X [Pascal] has 40 percent more cores, 50 percent more memory and significantly more memory bandwidth than a GTX 1080, its clock speed is lower and it is more limited by power and thermals, all of which eats into its advantage. "
Walter Roberson
Walter Roberson 2016 年 11 月 29 日
So, uh, yes, the Titan Z (Kepler) or Titan Black (Kepler) would beat the Titan X Pascal handily for FP64.
Looking at these specs and the prices, it almost looks as if the most cost-effective route would be dual slower cards, like two of the older Titan X Maxwell GM200 cards. The disadvantage is the need to partition the work between the two GPUs. Hmmm, possibly your particular application is not suited for that. If you could distribute the work, then a tower of $200 graphics cards might be the most FP64 for the dollar. (Only one tower; after that you would need MDCS licenses.)
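A rough, untested sketch of that partitioning, assuming the right-hand sides (or time steps) can be solved independently; `Agpu`, `rhs`, `numSystems`, `tol`, and `maxit` are hypothetical placeholders for your data (requires Parallel Computing Toolbox):

```matlab
% One pool worker per GPU; each worker solves its round-robin share of
% independent systems on its own device.
parpool(gpuDeviceCount);
spmd
    gpuDevice(labindex);                  % worker k computes on GPU k
    for k = labindex:numlabs:numSystems   % round-robin split of the solves
        sol{k} = pcg(Agpu, rhs{k}, tol, maxit);
    end
end
```

After the `spmd` block, `sol` is a Composite and the pieces would need to be gathered on the client. If successive time steps depend on each other, this split does not apply directly.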


