Does MATLAB have a Birthday Problem?

The Birthday Paradox or problem asks for the probability that in a room of n people, 2 or more have the same birthday (not date), assuming all years have N = 365 days. It is called a paradox because most people are surprised by the answer when there are (say) 30 people in the room.
Many applications require long sequences (or large vectors) of random numbers. In MATLAB these are supplied by the simple statement r = rand(n,1). This statement fills the vector r(n,1) with double precision random numbers uniformly distributed on (0,1).
Here is my question:
What is the probability that duplicates occur in r = rand(n,1), as n gets large?
I have an answer to this question but I would like to see what others come up with so that I can check the validity of my answer.

5 件のコメント

Walter Roberson
Walter Roberson 2011 年 2 月 5 日
You will have to specify exactly which random number generator is being used.
Jiro Doke
Jiro Doke 2011 年 2 月 5 日
This is more of a mathematics question, rather than a MATLAB question. You should extend the question, state what you have discovered (in regards to "rand"), and ask specific MATLAB programming questions to help you get the final answer.
Oleg Komarov
Oleg Komarov 2011 年 2 月 5 日
I vote this question for best Title :D
Malcolm Lidierth
Malcolm Lidierth 2011 年 2 月 8 日
In the UK, when the BBC first moved its football highlights show ('Match of the Day') from its traditional Saturday night slot, the birth rate tripled 9 months later. The birthday paradox, makes a false assumption: birthdays are not random - they vary seasonally and are influenced by all sorts of things.
Derek O'Connor
Derek O'Connor 2011 年 2 月 9 日
Malcolm,
Thank you for sharing that with us. However, it seems to have escaped your notice that we are not talking about actual births or birthdays here.
Derek

サインインしてコメントする。

 採用された回答

Derek O'Connor
Derek O'Connor 2011 年 2 月 5 日

1 投票

Reply from Derek O'Connor:
I must apologize for the omission of the RNG used. It was in my original draft but somehow got left out. I'm using the default Matlab rand(n,1) Here is what the docs (R2008a) say:
"Use the Mersenne Twister algorithm by Nishimura and Matsumoto (the default in MATLAB® Versions 7.4 and later). This method generates double-precision values in the closed interval [2^(-53), 1-2^(-53)], with a period of (2^19937-1)/2."
My answer is here, which is based on the RNG information above: http://www.scribd.com/doc/48243415/The-Birthday-Paradox-and-Random-Number-Generation
A summary of my analysis is that the probability of duplicates rises significantly after n = 10^8 and has converged to 1 at n = 10^9.
Here are the results of testing, with 5 runs for each value of n.
n = 1*10^8, Prob = 0.426..., Duplicates = (1, 0, 1, 0, 1). Avg = 0.6
n = 5*10^8, Prob = 0.999..., Duplicates = (18, 18, 18, 16, 20). Avg = 16.0
n = 1*10^9, Prob = 1.000..., Duplicates = (66, 64,43, 48, 62). Avg = 56.6
I am not an expert on the mathematics or implementation of RNGs and tend to accept what the docs tell me until I'm told otherwise.
This is not "just a mathematics" question. I say this at the end of my notes above which starts with the real question behind my original question:
What are the implications of getting duplicates in long vectors? This question cannot be answered in general because the answer depends on how these random vectors are used. If they are being used to test a new sorting algorithm, for example, then it is difficult to see how a few duplicates would matter.
I would not be so sanguine about the use of such vectors in medical research. A lot of DNA research seems to involve large amounts of computation which may use large vectors.\footnote{I don't really know what is done, so this is purely speculative.} This is particularly troubling, given the appalling record that bio-medical researchers have in misusing and abusing probability, statistics, and software.
I would certainly be uneasy if I found out that the Boeing or Airbus I was flying in used long random vectors as part of their avionics software. Hence my often-repeated question to students:
Would you fly in a plane whose software was written by you? "
Regards,
Derek O'Connor.
------------------------
I'm replying to comments here because I can't use that tiny comment window.
|Dear Bruno, Walter, et al.
I realize that I have not been clear in my answers to your questions and that my statement that rand picks "integer multiples of 2^(-53)" is just plain wrong. Before I get into the floating point aspects of the problem (ugh!), let me state clearly what probability model the analysis is based on:
I assumed that r = rand(n,1) picks r1, r2,...,rn from a bag of N = 2^(53)-1 'objects', WITH replacement, independently, and with equal probability Pr(ri) = 1/N. In other words its the standard "random sample of size n from a population of size N, with replacement" model. The standard textbook analysis and results follow, giving 1-P(N,n) = Pr(1 or more duplicates).
I think these are reasonable assumptions to make about any good random number generator, in any programming language. Hence the analysis and probability calculations are not specific to Matlab, and carry over to any programming language. Of course, for each situation we need to know N, but that is the only parameter needed.
But what about the 'objects' in the bag? In this case they are double precision floating point numbers. But who cares? The Matlab program that counts the number of duplicates merely tests for the equality of two objects. These objects could be integers, singles, doubles, or even matrices.
But you sorted these objects before you tested for equality, and this implies a '<' on the set of objects. Yes, the '<' was used to reduce the search for duplicates to O(nlogn). If '<' does not apply to the bag of objects then we must use an O(n^2) search algorithm that uses '=' only.
And now to the floating point aspects of the problem, where I'll meet my Waterloo, more than likely.
An IEEE 64-bit double precision floating point number is represented as
fl(x) = +/- s x 2^e where s (significand) is a 52-bit integer and e (exponent) is an 11-bit integer and the sign is 1 bit. The precision is p = 53 bits, and machine epsilon emach = 2^(1-p) = 2^(-52) is the distance between fl(1)=1 and the next higher fpn. = 1+emach = 1+2^(-52). A floating point number is normalized if the first bit of s is 1. In base 2 this can be a hidden or implied bit which is not actually stored.
Distribution Properties:
  1. There are 2^(p-1) = 2^(52) normalized dpfpns in [1, 1-2^(-52)]
  2. These dpfpns are uniformly distributed with spacing emach = 2^(-52).
  3. These dpfpns have the form x = 1 + k*emach, k = 0,1,2,...,2^(52)-1.
These numbers have the following bit patterns, with fixed exponent 2^0:
k = 0 1.000...000 x 2^0
k = 1 1.000...001 x 2^0
...
k = 2^(52)-3 1.111...101 x 2^0
k = 2^(52)-2 1.111...110 x 2^0
k = 2^(52)-1 1.111...111 x 2^0
If we wanted uniform numbers in [1, 1-2^(-52)] we could randomly pick an integer k between 0 and 2^(52)-1 and return r = 1+k*emach. Alternatively, we may view a random number as one of the 2^(52) possible bit patterns (000...000) ... (111...111).
Matlab's rand however, picks dpfpns from [2^(-53), 1-2^(-53)]. If we assume that all significand bit patterns are generated then using 2 and 3 above we get
x = 2^(-53) + k*2^(-53)*emach = 2^(-53) + k*2^(-105), k = 0,1, 2^(52)-1.
But this does not give the correct result. To get numbers in the range [2^(-53), 1-2^(-53)] we need to change the exponent. As Walter pointed out 53 different exponents are used. And as Bruno pointed out this means non-uniformity.
Right now I'm stuck. I'm missing something obvious but I can't see what it is. Floating point Aghhhh!
Does this invalidate the Birthday problem analysis, calculations and Matlab test results? No, because I'm not assuming anything about the values of the random numbers generated. I test only for equality. However, I'm not sure: should N = 2^(52) or 2^(53)?
Regards,
Derek.
|

10 件のコメント

Walter Roberson
Walter Roberson 2011 年 2 月 5 日
10^8 agrees with Mark's calculation.
What are the implications of getting duplicates in long vectors? The implications of *not* getting duplicates are far worse. You cannot have a true uniform random distribution if you are drawing without replacement. Linear Congruential generators that cannot repeat numbers within the period can be shown to be biased in surprising ways, leading to periodic tiling of some N dimensional space; for the standard LCG that was in use for years, N was 2 (that is, if you generated pairs of points, you could see that they fell into distinct lines in a square rather than being evenly distributed.) MT is robust that way up to something like 993 dimensions.
I can't speak about other programs, but the major one I work on does have a statistical bias in one particular seldom-used algorithm, with the bias occurring only if two *adjacent* random numbers are exact duplicates. The particular code segment is only invoked to break a tie arrived at through other computations; it should in theory continuing generating pairs until they do not duplicate each other. On the other hand, at some point I put a debugging message in there and over a fair number of runs, never once did the tie-breaking code get invoked at all. I probably should have just fixed the code anyhow, but the probability was so low it didn't seem worth it.
Walter Roberson
Walter Roberson 2011 年 2 月 6 日
Derek, why would you be uneasy if the airplane used long random vectors in their software? Turbulent fluid flow cannot be solved deterministically, but it can be modeled statistically. You advance the dynamic air-flow model a step, you get feedback from the sensors, you advance the model another step. A minor random bias would self-correct due to the sensor feedback. Using PRNGs that _cannot_ duplicate values within the period is, however, not a minor random bias!
Bruno Luong
Bruno Luong 2011 年 2 月 6 日
Derek, I read quickly your note report. There is one thing that bugs me a little bit. The spacing of floating points in (0,1) is not equal due to mantissa coding. The problem birthday as original stated does not have this complication. Does the formula is straight forward applicable?
Walter Roberson
Walter Roberson 2011 年 2 月 6 日
Bruno, Derek's formula is applicable if and only if the random numbers that can be generated are the integer multiples of 2^(-53). At the moment, we don't know that.
Bruno Luong
Bruno Luong 2011 年 2 月 6 日
It is reasonably to assume rand() returns integer multiple of eps(1/2), i.e., 2^-53, for example if the rand just truncate the integer output of MT to 53 bits. In that case Derek's conclusion must be correct.
But again, this is reverse engineering because it is not based on a documented feature of RAND. If a code has requirement/precondition on this aspect then they have to use another library to make sure it meets whatever the code requirement.
Though Derek's question is a valid one and merit to be read not only by Matlab users but also by TMW people.
Bruno
Derek O'Connor
Derek O'Connor 2011 年 2 月 6 日
Dear Walter,
Thanks for reassuring me about flying. Although anytime I asked my friends at Boeing, "Would you fly...etc.", they got a bit shifty.
Bruno and Walter:
It is very misleading for books and software manuals to say that their (dp floating point) random number generator gives numbers uniformly from (0,1). As you point out Bruno, floating point numbers are, by design, not uniformly distributed. This is why we need the eps(x) function. However, for a fixed exponent e the mantissas or significands (IEEE-speak) or fractional parts are uniformly distributed. This is what the standard and good books
say. A dp significand can be one of the 53-bit strings (00...00) up to (11...11).
I assumed in my analysis that rand/Mersenne Twister picks uniformly from this set of bit strings any string except (00...00), or in Walter's terms, integer multiples of 2^(-53). I'm not sure what else I could assume.
Walter, why do you say that at the moment we don't know if rand picks integer multiples of 2^(-53)? Am I missing something here? Do you think or know that there is something wrong with rand/mt?
Matlab got it wrong before so I wouldn't be surprised if it happened again.
My analysis depends on the stream of random numbers {u1, u2, ..., ui, ...} being iid (independent, identically (uniformly)
distributed. If this assumption does not hold then the analysis does not hold. Again, I'm not sure what other assumption I could have made.
Nonetheless, the analysis does carry over to any programming language that has a good random number generator. The only thing
that changes for different precisions is the population size, N = 2^(53) - 1.
Derek.
Bruno Luong
Bruno Luong 2011 年 2 月 6 日
Derek, I tend not to agree. The underlying assumption you made is - as Walter pointed out - rand() return integer-multiple of 2^(-53), since you seems to compute the upper number of values rand() can generate is 2^53-1, which is not true *if* Matlab RAND can return other value such as 1.5*2^-53.
If RAND() could fill the gaps between two successive multiple-integers of 2^(-53) then the probability of repeat occurrences would be smaller then what you have estimated. It's not clear to me how to relate the birthday problem with floating repeating, since the probability of all floating occurrences is non-uniform but decrease almost linearly with the value.
But in any case it will be no greater than the birthday probability of repeating elements for 64-bit (2^64) numbers
Mark Shore
Mark Shore 2011 年 2 月 6 日
Rearranging a formula in the Wikipedia reference noted by Walter gives, for n draws from a possible 2^53 distinct values, the probability p of at least one duplicated value of
p = 1 - exp(-n^2/(2*2^53)).
This is similar to your formula 1.4, if rearranged.
Derek O'Connor
Derek O'Connor 2011 年 2 月 6 日
Mark,
Yes. It is a further simplification (n-1) ~ n for large n.
Remember, all I have done is to apply the standard Birthday Problem analysis to Matlab's rand(n,1), assuming N = 2^(53)-1.
Derek O'Connor
Walter Roberson
Walter Roberson 2011 年 2 月 6 日
Derek:
>> num2hex(2^-53)
ans =
3ca0000000000000
>> num2hex(1-2^(-53))
ans =
3fefffffffffffff
That is 53 different exponents that are used over the range of random numbers you have documented.
The only integral multiple of 2^(-53) that has an exponent of 3ca is 2^(-53) itself, with an all-0 mantissa. 2*2^(-53) has an exponent of 3cb and an all-0 mantissa. 3*2^(-53) has an exponent of 3cb and a mantissa whose first bit is set. 4*2^(-53) has an exponent of 3cc and a mantissa of 0.
If we follow this further than it becomes clear from the pigeon-hole principle that not all of the values starting with 3cf can be represented.
Example:
>> num2hex(hex2num('3fe0ffffffffffff') / 2^(-53))
ans =
4330ffffffffffff
It follows that 3fe0ffffffffffff is not an exact integral multiple of 2^(-53) and thus will not be randomly generated. (it is the representable number just below 0.53125 exactly)
Any argument made on the basis of equi-probable bit patterns after a constant exponent are suspect. One _can_, however, make arguments about taking integer multiples of 2^(-53) and then normalizing the numbers.
IEEE 754 does not actually store 53 bits of mantissa: it uses 52 bits of mantissa plus the "hidden bit" implied by the exponent.

サインインしてコメントする。

その他の回答 (4 件)

Richard Willey
Richard Willey 2011 年 2 月 7 日

1 投票

Hi Derek
Your question about the birthday problem is best viewed as a special case of a more general topic: How should measure the statistical properties of a random number generator?
Rather than using this formulation of the birthday problem to reinvent the wheel, I recommend looking at established literature on this topic. A lot of very qualified experts have spent a lot of time and effort developing criteria to measure the quality of random number generators. The “Diehard” test suite is one such test (Diehard uses a reformulated version of the birthday paradox as one part of the test suite). Alternatively, there is another test suite that uses a combination of
  1. A monobit test (testing the distribution of ones and zeros in a sequence)
  2. A “Poker” test (a modified chi-square test)
  3. A Longruns test (checking whether there are runs of length 34 or greater within a 20,000 bit sequence)
  4. An autocorrelation test
Most of MATLAB’s code can be inspected by users. If you have questions about the specific implementation of a given algorithm – for example, what version of Twister does MathWorks use – you typically have the option to inspect the function and see how the code was written.
Regards,
Richard

4 件のコメント

Mark Shore
Mark Shore 2011 年 2 月 7 日
The test suite information is useful, but regardless of the number of "very qualified experts", bugs or flawed algorithms do manage to find their way into code, and the rand function in MATLAB is not clear text code that can be inspected by users.
Walter Roberson
Walter Roberson 2011 年 2 月 7 日
The @Randstream.list documentation does indicate which MT algorithm is used and gives a link to a site that has implementation.
However, the description of the code indicate only 32 bit precision for the (0,1) algorithm, and we have demonstrated at least 53 bit precision, which is documented as occurring for [0,1) . Possibly Mathworks wrapped the 53 bit version with a rejection of an exact 0 result; such an implementation is quite plausible, but we don't *know* since the documentation that is available to us does not specify.
The release notes introducing RandStream might have documented the precision and range of MT, but I am unable to dig the information out of the current documentation. I think dropping that table of producible values was a mis-service to the users.
Derek O'Connor
Derek O'Connor 2011 年 2 月 7 日
Dear Richard,
You have completely misconstrued my question, which is a simple one:
What is the probability that rand(n,1) has 1 or more duplicates as n gets large.
This is a general question that applies to any programming language that has the equivalent of rand(n,1).
I must state most emphatically that I did not set out to test the statistical properties of your random number generator. Indeed, I assume in the analysis that rand is perfect, in the sense that it returns each of the 53-bit strings with equal probability. That is, it randomly samples, with replacement, from a bag of N = 2^(53)-1
objects.
Perhaps the use of the word "Problem" in the title is the problem. The word is used in the sense "Homework Problem". Perhaps you
inferred, mistakenly, that I thought Matlab's rand had a problem. My assumption that rand(n,1) is perfect refutes that inference. In fact I state in the Conclusion to my note
<http://www.scribd.com/doc/48243415/The-Birthday-Paradox-and-Random-Number-Generation>
"This is not a deficiency in Matlab's implementation of the Mersenne Twister random number generator, but the inevitable
result of using any 53-bit generator to generate long vectors (and has nothing to do with the period)."
I have noticed that you are the second MathWorks-er that has misconstrued my question. Jiro Doke, in missing the point of my question, and dismissing it as "mathematical", seemed to think that
I was complaining about Matlab's rand:
"This is more of a mathematics question, rather than a MATLAB question. You should extend the question, state what you have
discovered (in regards to "rand"), and ask specific MATLAB programming questions to help you get the final answer."
Regards,
Derek O'Connor
Richard Willey
Richard Willey 2011 年 2 月 9 日
Hi Derek
Sorry that I misconstrued your original question. Here’s one last quick comment:
Random number generators are designed to sample from specific distributions.
A random number generator that generates normally distributed random numbers is not identical to a random number generator that generates normally distributed random numbers (but excludes duplicate values)
Suppose you were using an RNG to generate random integers from a Poisson distribution with lambda = 10. Would you be happy if the RNG was “gimmicked” to exclude duplicate values? I’d argue that this would be a very flawed RNG because the samples being generated aren’t actually coming from a Poisson distribution.
Let’s move back to the example using “rand”. We’re dramatically changing the granularity of the problem, but the basic principle remains the same…
At the end of the day, I don’t think it’s reasonable evaluate an RNG based on a design criteria which is (arguably) contrary to the way in one would expect an RNG to behave (in this case, deliberately excluding duplicate values)
If you have an application that needs RNGs that are generated from “some distribution that looks very much like a normal distribution but is guaranteed never to produce a duplicate value” then you should probably find an RNG designed to do precisely that.
From my own perspective, I’d worry about implementations that were this persnicetky…
regards,
Richard

サインインしてコメントする。

Walter Roberson
Walter Roberson 2011 年 2 月 5 日

0 投票

Upper bound: 1/2+(1/2)*sqrt(1+8*S*ln(2)) where S is the number of distinct random numbers that can be generated.
Unfortunately Mathworks does not appear to document which Twister version they are using. Traditionally their random number generators were (0,1); mt19937ar over (0,1) is 32 bits precision, but over [0,1) is 53 bits precision.
If we speculate that Mathworks used the (0,1) version with 32 bits precision, then according to Wikipedia the 50% probability is at 7.7 × 10^4 numbers.

1 件のコメント

Peter Perkins
Peter Perkins 2011 年 2 月 8 日
Walter, the User Guide includes this: "mt19937ar: The Mersenne Twister, as described in [9], with Mersenne prime 2^19937-1. This is the generator documented at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html. It has a period of 2^19937-1. Each U(0,1) value is created using two 32-bit integers from the generator; the possible values are all multiples of 2^-53 strictly within the interval (0,1)."

サインインしてコメントする。

Mark Shore
Mark Shore 2011 年 2 月 5 日

0 投票

Using the formula Walter gives leads to the following numbers for a 50% probability:
32-bit 7.716e4
53-bit 1.112e8
57-bit 4.470e8
64-bit 5.056e9
80-bit 1.294e12
A brute-force check of 1e10 random numbers
for n=1:1000 mindiff(n) = min(diff(sort(rand(1e7,1))))/eps(1); end
found 11 identical values, or a ~50% probability for 4.44e8 numbers, suggesting that the equivalent of a 57-bit precision RNG is in use.

14 件のコメント

Jan
Jan 2011 年 2 月 5 日
Isn't this a strange result, if we can store 53 bits for the mantissa in DOUBLEs only?
Mark Shore
Mark Shore 2011 年 2 月 5 日
Agreed, but I didn't try to work out either the details of the distribution calculation or look at the Mersenne twister RNG algorithm to see if it uses some form of extended precision calculations.
Jan
Jan 2011 年 2 月 5 日
Even if MT use extended precision, the results are stored in DOUBLEs and all bits beyond the 53rd one would be cut away. Anyway, diff(sort(rand)) finds a difference of 1/2^53 => RAND replies 53 bit, as expected.
Walter Roberson
Walter Roberson 2011 年 2 月 5 日
I recall that sometime in the fall, I posted that MT generates only multiples of 2^32 (or perhaps it was 2^53), but that the response was that every floating point number in (0,1) could be generated (with appropriately weighted probability.) The numbers below 1/2 gain additional bits from the exponent, so in theory about 62 bits worth should be available (1 sign bit for the number as a whole, and roughly half the 11 bits worth of exponent represent negative exponents, each of which represents a generatible number less than 1.
Each individual uniform random number over (0,1) is not identical probability, as it is necessary to generate [1/2,1) with the same total probability that one generates (0,1/2) even though there are many more representable numbers in (0,1/2). This does, of course, contradict the assumption of the birthday problem that each of the elements is equal probability.
Walter Roberson
Walter Roberson 2011 年 2 月 5 日
Jan, you can't be sure -- you just might not have happened to have generated a value smaller than 1/2^53.
Mark Shore
Mark Shore 2011 年 2 月 5 日
I should also point out a possible flaw in my assumption that 1000 trials of 10^7 numbers is the same as a single trial using 10^10 numbers - but the difference should be small. I'll run a few trials with different parameters; my spare computer never complains about working overtime on Saturday...
Jan
Jan 2011 年 2 月 5 日
@Walter: I was too sloppy. I meant: diff(sort(rand)) finds a difference of 1/2^53 => RAND replies _at least_ 53 bit precision, assuming that the values are equally distributed.
Walter Roberson
Walter Roberson 2011 年 2 月 6 日
I am having difficulty locating the discussion of whether MT generated multiples of 2^(-53) or generated every possible number in (0,1) .
Bruno Luong
Bruno Luong 2011 年 2 月 6 日
According to my test, I have not found RAND generate anything but multiple integer of 2^(-53). So it seems Walter is right.
Mark Shore
Mark Shore 2011 年 2 月 6 日
Some more thought, and test results. First, Jan is correct and whatever internal precision the MT algorithm may use, it will still be converted to MATLAB doubles with a granularity of 2^-53 and 2^53 -1 distinct values over (0,1).
Second, 1000 trials of 1e7 values is NOT the equivalent to 1 trial of 1e10, as Derek's figures (and more work) suggests. 27000 runs of 1e7 gave a probability p of a pair of duplicate values of 0.006 +/- 0.002.
Rearranging a formula in the Wikipedia reference noted by Walter gives, for n draws from a possible 2^53 distinct values, the probability p of at least one duplicated value of
p = 1 - exp(-n^2/(2*2^53))
For n = 1e7, this returns a p of 0.0055, consistent with my test value.
Jan
Jan 2011 年 2 月 6 日
@Bruno: Which theory of Walter do you mean? As far as I understand we all agree that RAND replies more than 32 bits, but at least 53 bits (that's all I wanted to say). And if the numbers should be equally distributed in (0,1), it cannot not be more than 53 bits. Correct until here? Using the additional bits for numbers < 0.5 would be very unkind for a RNG and MEAN(RAND(1e8,1)) would reveal this fast.
Do we agree, that Mark's 57bit theory is not correct?
@Walter: Derek asks for repeated values. I do not see the connection to "every possible number in (0,1)".
Mark Shore
Mark Shore 2011 年 2 月 6 日
@ Jan, to avoid any misunderstanding, my 57-bit hypothesis was based on faulty analysis and is incorrect.
Bruno Luong
Bruno Luong 2011 年 2 月 6 日
@Jan: Walter is right by stating the fact that Matlab rand() command returns only integer-multiple of 2^(-53).
Walter Roberson
Walter Roberson 2011 年 2 月 6 日
Jan, the testing I did was unable to find a non-zero difference smaller than 2^(-53) as well. But it could just be quite rare. I don't think I have time to generate 2^53-ish values to check!

サインインしてコメントする。

Walter Roberson
Walter Roberson 2011 年 2 月 8 日

0 投票

The Matlab content of this question is:
  • How many distinct random numbers can be generated in Matlab. (We still don't know for sure, but 2^53-1 appears likely.)
  • Are the numbers generated with equal probability. (The algorithm would have to be examined in detail to be sure. If there is a bias, it is minute.)
  • Is the period of the generator greater than the number of different possible values, or is it equal to the number of possible values?
Anything beyond that is a straight-forward application of the Birthday Paradox and is thus a mathematical question rather than a Matlab question. This forum exists to serve Matlab questions, not to serve mathematical questions and especially not to discuss the risks of fly-by-wire or to discuss deliberate fraud by Matlab users.
One only has to look at the period of MT and compare that to the largest number directly representable in Matlab in order to see that rand() will eventually generate a duplicate. Software that expects otherwise is broken.
It would be fair game to ask Mathworks to create a random number generator with no repetitions for those people who need one; it would even be fair game to ask Mathworks to rename rand() to make it clear that duplicates were possible. Those would be Matlab issues. Considering the amount of legacy code that uses rand(), I think Mathworks would want some strong evidence that "rand with replacement" was widely enough misunderstood to be worth renaming; I personally don't think speculation about possible trouble in genome software would qualify.

10 件のコメント

Bruno Luong
Bruno Luong 2011 年 2 月 8 日
I don't agree with Walter. Asking question with relation to Birthday Paradox make it becomes clear why some properties of RAND() are important to be known. It *is* a Matlab related question. The boundary of side topics discussion are possibly up to individual user to judge. Mileages will vary. With all due respect I don't see the reason why we have to stick with your point of view of what has to be discussed here.
The whole notion of "rand with replacement" does not make much sense to me. "RANDI" is probably a better candidate, but that's covered by many existing FEX tools.
The second question is not well formulated. Richard has explained why.
The third question seems to be out of the blue. Period has no relationship with the set cardinal. It can be found on MT papers.
Only the first question make some sense to me.
Peter Perkins
Peter Perkins 2011 年 2 月 8 日
Since R2008b, MATLAB (at start-up) has used the "standard" mt19937ar Mersenne Twister. Mark and Derek are right that this code is not visible to the user in the way that much of MATLAB's code is, because for speed reasons it has to be implemented in compiled C code. The idea behind expecting you to use that cryptic "mt19937ar" in your code when creating a random stream was to hopefully help make clear _exactly_ what algorithm is used, and the documentation does point to the original code authored by Nishimura and Matsumoto (similarly for "mrg32k3a", either can be googled). And Bruno is right, the doc does not specifically say how we get from 32bit integers (which is what the MT "natively" creates) to d.p. floating point. If you look at N&M's original code, you'll find that they have a piece of utility code that takes two 32bit integers and shifts and divides to map into (0,1), and that's what MATLAB does (I guess that needs to be added to the doc as well, it's hard to know when to stop). So, mt19937ar in MATLAB produces integer multiples of 2^-53 strictly between 0 and 1, bu since the period is so long you are guaranteed to repeat individual values long before you wrap around.
The other generators are similar in that respect, except for "swb2712". It actually produces all possible floating point values within (0,1), and works "natively" in d.p. Cleve Moler's book Numerical Computing with MATLAB went into great detail on this algorithm when it was the default, you can find it on line.
Bruno Luong
Bruno Luong 2011 年 2 月 8 日
Thanks Peter for the info. Now it's clear.
Derek O'Connor
Derek O'Connor 2011 年 2 月 8 日
Peter,
I presume this is what you are referring to:
/* generates a random number on [0,1) with 53-bit resolution*/
double genrand_res53(void)
{
unsigned long a=genrand_int32()>>5, b=genrand_int32()>>6;
return(a*67108864.0+b)*(1.0/9007199254740992.0);
}
/* These real versions are due to Isaku Wada, 2002/01/09 added */
Derek
Walter Roberson
Walter Roberson 2011 年 2 月 8 日
Can't be as simple as that, Derek, that's for [0,1) whereas Mathwork's version is (0,1) .
Derek O'Connor
Derek O'Connor 2011 年 2 月 9 日
Walter, I'm not sure what you mean.
Anyway, I've downloaded and mexed Peter's 'twister' which is a mex wrapper around the original C++ code for the Mersenne Twister.
http://www.mathworks.com/matlabcentral/fileexchange/6614
The comments in Peter's code say:
This method generates double-precision values in the closed interval [0,1-2^(-53)], with a period of (2^19937-1).
So, instead of speculating about how rand works, you can look at the actual code, which is headed:
"A C-program for MT19937, with initialization improved 2002/1/26.
Coded by Takuji Nishimura and Makoto Matsumoto"
As you will see, the double precision rn is generated from two 32-bit random integers.
My mexed version runs about 10% slower than rand, and gives the same test results that I got using rand.
Following up on Peter's comment: "Each U(0,1) value is created using two 32-bit integers from the generator; the possible values are all multiples of 2^-53 strictly within the interval (0,1).", this implies that all the random numbers generated are given by
u = k*2^(-53), k = 1,2,...,2^(53)-1
and we get any u by generating a random k in [1,2^(53)-1] and returning k*2^(-53). Is this the random number 'model' we should use when discussing the output of rand or MT19937 (k=0,1...)?
Also following another suggestion of Peter, Chapter 9 of Cleve Moler's NCM is worth a re-read, although he says very little about
Twister.
Derek
Walter Roberson
Walter Roberson 2011 年 2 月 9 日
If "a" and "b" both come out as 0, then Isaku's code would return 0.0, which is not a permitted output from Matlab's rand. Notice the reference to the "closed interval" and the lower endpoint is 0. But http://www.mathworks.com/help/techdoc/ref/rand.html is clear that the output is the open interval -- i.e., excluding 0 from the possibilities.
Bruno Luong
Bruno Luong 2011 年 2 月 9 日
Personally I'll believe on the doc and also what Peter wrote earlier:
[ So, mt19937ar in MATLAB produces integer multiples of 2^-53 strictly between 0 and 1,]
There is no indication of the twister on FEX submission is the exact code of Matlab RAND().
Peter Perkins
Peter Perkins 2011 年 2 月 10 日
While the code on the FEX is indeed a wrapper around N&M's original mt19937ar code, the wrapper differs slightly from the implementation that the RAND function has used for the MT since that generator was officially supported circa R2007b. You have noticed two things: the range, and the speed. There is also the treatment of a zero seed. I can't recall if there are other differences, but in any case, RAND is RAND, and the FEX submission is the FEX submission.
The doc for RAND and my earlier comments are correct, and in particular RAND (internally) rejects an exact (double precision) zero in the very infrequent event that it happens to generate one.
Hoping this clears it all up. I almost hesitate to even mention it, but as described in the RandStream doc, you can also set the FullPrecision property of an mt19937ar stream to false, and get values that are multiples of 2^-32.
Bruno Luong
Bruno Luong 2011 年 2 月 10 日
Thanks again Peter. That settles the question.

サインインしてコメントする。

カテゴリ

ヘルプ センター および File ExchangeBirthdays についてさらに検索

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by