How to efficiently generate a random integer within a range from an arbitrary probability distribution

I need to generate a random integer within a range from an arbitrary probability distribution, within a loop of 100000 iterations. My implementation works, but I am not sure it is mathematically clean, and it takes forever:
pdf = [ 0.9 0.3 0.003 0.1 0.07 0.0005 0.003 0.15 0.009 0.08 ]; % discrete prob distrib function
cdf = cumsum(pdf); % cumulative distribution function
cdf = cdf / max(cdf); cdf(1) = 0; % normalization
index = ceil(interp1(cdf, [1:numel(pdf)], rand(1)))
Notice that the pdf above is just an example: my actual case is a vector of about 500 numbers.
Here is a different solution, which seems mathematically cleaner, but does not work for my overall problem, and is just as slow as above:
pdf = [ 0.9 0.3 0.003 0.1 0.07 0.0005 0.003 0.15 0.009 0.08 ]; % discrete prob distrib function
cdf = cumsum(pdf); % cumulative distribution function
cdf = cdf - min(cdf); cdf = cdf / max(cdf); % normalization
index = round(interp1(cdf, [1:numel(pdf)], rand(1)))
Is there a more efficient/correct way to do this?

 採用された回答

the cyclist
the cyclist 2017 年 1 月 8 日
編集済み: the cyclist 2017 年 1 月 8 日
I think that
index = sum(rand()>cdf)+1;
will be much faster than using interp1 as you do, and will give the same result.

4 件のコメント

Also, I don't think your method of calculating the cdf is correct, because it will never select index=1, even though the pdf has support there.
To correct this, I think you want to use
cdf = [0 cdf];
in place of
cdf(1) = 0;
and modify my solution to just
index = sum(rand()>cdf);
Paolo Binetti
Paolo Binetti 2017 年 1 月 8 日
編集済み: Paolo Binetti 2017 年 1 月 8 日
Your solution seems to work perfectly for my problem. It is about 7 times faster, which is a major improvement.
I have tried "randsample" from the Octave statistics package (should be more or less equivalent to the Matlab function) with the following syntax:
index = randsample(numel(pdf),1,true,pdf);
but it is even slower than my initial solution. Is this normal?
And you are totally right about the wrong way I used to to start cdf from zero. I had in mind your way, but did not use it because I thought it would imply never selecting the end index ... but that's not a problem with your implementation, right?
I double-checked by generating 100,000 samples, and the results seem right.
Regarding speed ...
I get roughly equivalent results between my solution and randsample. Note that both of these solutions can be vectorized:
index = sum(rand(N,1)>cdf,2)';
and
index = randsample(1:numel(pdf),N,true,pdf);
These will both generate N = 100,000 samples in a few milliseconds.
I did not try to vectorize your solution, but as it currently stands, it took 5 seconds to generate that many.

サインインしてコメントする。

その他の回答 (1 件)

the cyclist
the cyclist 2017 年 1 月 8 日
Do you have the Statistics and Machine Learning Toolbox? If so, you can do this with the randsample command.

1 件のコメント

Thank you. No, I do not have this toolbox but I'll see if I can get it.

サインインしてコメントする。

カテゴリ

ヘルプ センター および File ExchangeRandom Number Generation についてさらに検索

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by