PTX kernel time to run

Question

Gaszton, 16 May 2011

Direct link to this question: https://jp.mathworks.com/matlabcentral/answers/7511-ptx-kernel-time-to-run

Hello, I am using R2010b and CUDA toolkit 3.1 with a GeForce GT 425M. While optimizing my CUDA code, I observed that calling the kernel with feval in MATLAB takes a constant ~2 ms, measured with

    tic; feval(k,...); toc

the kernel code:

    #define C_WIDTH 1024
    #define C_HEIGHT 768

    // One thread per element; threads past the end of the buffer exit early.
    __global__ void timetest1(float* holo) {
        int mindex = blockIdx.x * blockDim.x + threadIdx.x;
        int size = C_WIDTH * C_HEIGHT;
        if (mindex >= size)
            return;
        holo[mindex] = mindex * mindex;
    }

Even if I take out the write to global memory (commenting it out as //holo[mindex]=mindex*mindex;), the ~2 ms delay remains.

Does anybody know the origin of this lag? It would be great to somehow eliminate it.

Thanks,

Gaszton

PS: my MATLAB code for the kernel:

    clear
    import parallel.gpu.GPUArray

    xsize=1024; ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=1024;
    k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];

    dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');

    tic
    [dholo]=feval(k,dholo);
    time=toc;

    ['ms time= ' num2str(time*1000)]
    clear
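As a quick sanity check on the launch configuration above, the grid size can be worked out without a GPU. This is only a sketch of the arithmetic: 1024 threads per block is the value from the question, while the other block sizes are illustrative assumptions. It shows that 1024x768 elements divide evenly, so the `mindex>=size` guard in the kernel never fires for this problem size.

```matlab
xsize = 1024; ysize = 768;
vectorsize = xsize * ysize;            % 786432 elements, as in the question

% 1024 is the block size from the question; 256 and 512 are illustrative.
blockSizes = [256 512 1024];
nblocks    = ceil(vectorsize ./ blockSizes);      % blocks needed per size
idle       = nblocks .* blockSizes - vectorsize;  % threads hitting the guard

for ii = 1:numel(blockSizes)
    fprintf('%4d threads/block -> %4d blocks, %d idle threads\n', ...
        blockSizes(ii), nblocks(ii), idle(ii));
end
```

With 1024 threads per block this gives 768 blocks, matching what `ceil(vectorsize/threadpblock)` produces in the script.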


Answer 1

Edric Ellis, 16 May 2011

Direct link to this answer: https://jp.mathworks.com/matlabcentral/answers/7511-ptx-kernel-time-to-run#answer_10341

Firstly, can I suggest that, if possible, you upgrade to R2011a, as we made quite a few performance improvements in that release.

Secondly, I think the main bottleneck in your code as written is that, outside a function, an important optimisation called "in-place optimisation" cannot take place. If you place your code inside a function, then "dholo" will not be copied when it is both passed to and returned from feval. For reference, I made a function like this:

    function tmp
    import parallel.gpu.GPUArray
    xsize=1024; ysize=768;
    vectorsize=xsize*ysize;
    threadpblock=512; % I have a C1060
    k=parallel.gpu.CUDAKernel('TimeTest.ptx', 'TimeTest.cu');
    k.ThreadBlockSize=[threadpblock,1,1];
    k.GridSize=[ceil(vectorsize/threadpblock),1];
    dholo=parallel.gpu.GPUArray.zeros(vectorsize,1,'single');
    % Time 1000 launches with a single tic/toc pair.
    tic
    for ii = 1:1000
        dholo=feval(k,dholo); % same variable in and out, so no copy is needed
    end
    time=toc;
    % Total seconds over 1000 calls is numerically the time per call in ms.
    disp(['ms time= ' num2str(time)])

With this, the per-call overhead on my C1060 was down to 0.05 ms.
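To make the measurement technique explicit, here is a minimal sketch that contrasts a single-shot tic/toc with a loop-averaged one. The `work` handle is a hypothetical CPU stand-in so the snippet runs without a GPU. In a real measurement the `feval(k,dholo)` call would sit directly in the loop inside a function, as above, since it is not clear that in-place optimisation survives being wrapped in an anonymous function.

```matlab
% Hypothetical stand-in for the kernel launch (assumption, CPU only).
work = @() sum(rand(1, 1000));

nCalls = 1000;

% Single-shot measurement: one call between tic and toc.
tic; work(); tSingle = toc;

% Amortised measurement: one tic/toc pair around many calls.
tic
for ii = 1:nCalls
    work();
end
tTotal = toc;

% Seconds in total -> milliseconds per call.
msPerCall = tTotal / nCalls * 1000;

fprintf('single call: %.3f ms, averaged: %.3f ms per call\n', ...
    tSingle * 1000, msPerCall);
```

Because `nCalls` is 1000, `msPerCall` is numerically equal to `tTotal`, which is why the function above can print the raw `toc` value and label it as milliseconds.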

Comment

Gaszton, 16 May 2011

Thank you for your help!

I am a PhD student in Hungary, at the Biological Research Centre of the Hungarian Academy of Sciences. We have a network licence (with a limited number of MATLAB instances that can run in parallel). We usually buy a MATLAB update every 1-2 years, but I don't really have any say in that.

Thank you again,

Gaszton
