Birthday problem
From Wikipedia, the free encyclopedia
Categories: Probability theory paradoxes | Probability theory | Probability and statistics | Birthdays
|
In probability theory, the birthday problem, or birthday paradox,[1] pertains to the probability that in a set of randomly chosen people some pair of them will have the same birthday. In a group of 23 (or more) randomly chosen people, there is more than 50% probability that some pair of them will have the same birthday. For 57 or more people, the probability is more than 99%, tending toward 100% as the pool of people grows.[2] The mathematics behind this problem leads to a well-known cryptographic attack called the birthday attack.
Understanding the problemThe birthday problem asks whether any of the 23 people have a matching birthday with any of the others — not one in particular. (See "Same birthday as you" below for an analysis of this much less surprising alternative problem.) In a list of 23 persons, if you compare the birthday of the first person on the list to the others, you have 22 chances of success, but if you compare each to the others, you have 253 chances. This is because in a group of 23 people there are 23*22/2=253 pairs, which is more than half of the number of days in the year. So the chance that one of these pairs has a matching birthday is not small. For this reason, along with many others we have the Birthday Paradox. It is easier to figure the probability that the birthdays will be different, such as: with one person they have 365 opportunities to have a different birthday. The second person only has 364 possibilities to have a different birthday than the first person. The third person has 363 days, and so on. Calculating the probabilityTo compute the approximate probability that in a room of n people, at least two have the same birthday, we disregard variations in the distribution, such as leap years, twins, seasonal or weekday variations, and assume that the 365 possible birthdays are equally likely. Real-life birthday distributions are not uniform since not all dates are equally likely.[3] It is easier to first calculate the probability p(n) that all n birthdays are different. If n > 365, by the pigeonhole principle this probability is 0. On the other hand, if n ≤ 365, it is given by
The event of at least two of the n persons having the same birthday is complementary to all n birthdays being different. Therefore, its probability p(n) is
* excluding leap years. ApproximationsThe Taylor series expansion of the exponential function
provides a first-order approximation for Failed to parse (Missing texvc executable; please see math/README to configure.): e^x
A simple exponentiationVery basically, the probability of any two people not having the same birthday is 364/365. In a room of people of size N, there are C(N, 2) pairs of people, i.e. C(N, 2) events. We can approximate the probability of no two people sharing the same birthday by assuming that these events are independent and hence by multiplying their probability together. In short we multiply 364/365 by itself C(N, 2) times, which gives us
Poisson approximationUsing the Poisson approximation for the binomial,
Approximation of number of peopleWe can also approximate this using the following formula for the number of people necessary to have at least a 50% chance of matching:
This is a result of the good approximation that an event with 1 in k probability will have a 50% chance of occurring at least once if it is repeated k ln 2 times. An upper bound and a different perspectiveThe argument below is adapted from an argument of Paul Halmos.[4] As stated above, the probability that no two birthdays coincide is
Replacing 1 − k/365, as above, with e−k/365, and using the inequality 1 − x < e−x, we have
This derivation only shows that at most 23 people are needed to ensure a birthday match with even chance; it leaves open the possibility that, say, n = 22 could also work. GeneralizationsCast as a collision problemThe birthday problem can be generalised as follows: given n random integers drawn from a discrete uniform distribution with range [1,d], what is the probability p(n;d) that at least two numbers are the same? The generic results can be derived using the same arguments given above.
The theory behind the birthday problem was used by Zoe Schnabel[5] under the name of capture-recapture statistics to estimate the size of fish population in lakes. Generalization to multiple typesThe basic problem considers all trials to be of one "type". Michael Wendl[6] generalized the birthday problem to consider an arbitrary number of types. In the simplest extension there are just two types, say m "men" and n "women", and the problem becomes characterizing the probability of a shared birthday between at least one man and one woman. (Shared birthdays between, say two women do not count.) The probability of no (i.e. zero) shared birthdays here is
where we set Failed to parse (Missing texvc executable; please see math/README to configure.): d = 365 and where Failed to parse (Missing texvc executable; please see math/README to configure.): S_2 are Stirling numbers of the second kind. Consequently, the desired probability is Failed to parse (Missing texvc executable; please see math/README to configure.): 1 - p_0 . This variation of the birthday problem is interesting because there is not a unique solution for the total number of people Failed to parse (Missing texvc executable; please see math/README to configure.): m + n . For example, the usual 0.5 probability value is realized for both a 32-member group of 16 men and 16 women and a 49-member group of 43 women and 6 men. Other birthday problemsReverse problemFor a fixed probability p:
An approximation to this can be derived by inverting the 'coarser' approximation above:
Sample calculations
Note: some values falling outside the bounds have been coloured to show that the approximation is not always exact. Same birthday as youImage:Birthday paradox.svg
Comparing p(n) = probability of a birthday match with q(n) = probability of matching your birthday
Note that in the birthday problem, neither of the two people is chosen in advance. By way of contrast, the probability q(n) that someone in a room of n other people has the same birthday as a particular person (for example, you), is given by
Near matchesAnother generalization is to ask how many people are needed in order to have a better than 50% chance that two people have a birthday within one day of each other, or within two, three, etc., days of each other. This is a more difficult problem and requires use of the inclusion-exclusion principle. The number of people required so that the probability that some pair will have a birthday separated by fewer than Failed to parse (Missing texvc executable; please see math/README to configure.): k days will be higher than 50% is:
Thus in a group of just seven random people, it is more likely than not that two of them will have a birthday within a week of each other.[7] Collision countingThe probability that the kth integer randomly chosen from [1, d] will repeat at least one previous choice equals Failed to parse (Missing texvc executable; please see math/README to configure.): q(k-1;d) above. The expected total number of times a selection will repeat a previous selection as n such integers are chosen equals
Average number of peopleIn an alternative formulation of the birthday problem, one asks the average number of people required to find a pair with the same birthday. The problem is relevant to several hashing algorithms analyzed by Donald Knuth in his monumental book The Art of Computer Programming. It may be shown [8], [9] that if one samples uniformly, with replacement, from a population of size Failed to parse (Missing texvc executable; please see math/README to configure.): M , the number of trials required for the first repeated sampling of some individual has expectation value Failed to parse (Missing texvc executable; please see math/README to configure.): \overline{n}=1+Q(M) , where Failed to parse (Missing texvc executable; please see math/README to configure.): Q(M)=\sum_{k=1}^{M} \frac{M!}{(M-k)! M^k} . The function Failed to parse (Missing texvc executable; please see math/README to configure.): Q(M)= 1 + \frac{M-1}{M} + \frac{(M-1)(M-2)}{M^2} + \cdots + \frac{(M-1)(M-2) \cdots 1}{M^{M-1}}
Failed to parse (Missing texvc executable; please see math/README to configure.): Q(M)\sim\sqrt{\frac{\pi M}{2}}-\frac{1}{3}+\frac{1}{12}\sqrt{\frac{\pi}{2n}}-\frac{4}{135n}+\cdots . With Failed to parse (Missing texvc executable; please see math/README to configure.): M=365
days in a year, the average number of people required to find a pair with the same birthday is Failed to parse (Missing texvc executable; please see math/README to configure.): \overline{n}=1+Q(M)\approx24.61658
, slightly more than the number required for a 50% chance. In the best case, two people will suffice; at worst, the maximum possible number of Failed to parse (Missing texvc executable; please see math/README to configure.): M+1=366 people is needed; but on average, only 25 people are required. Partition problemA related problem is the partition problem, a variant of the knapsack problem from computer science. Some weights are put on a balance; each weight is an integer number of grams randomly chosen between one gram and one million grams (one metric ton). The question is whether you can transfer the weights between the left and right arms to balance the scale. If there are only two or three weights, the answer is very clearly no. If there are very many weights, the answer is clearly yes. The question is, how many are just sufficient? Some people's intuition is that the answer is above 100,000. Most people's intuition is that it is in the thousands or tens of thousands, while others feel it should at least be in the hundreds. The correct answer is approximately 23. The reason is that the correct comparison is to the number of partitions of the weights into left and right. There are 2N−1 different partitions for N weights, and the left sum minus the right sum can be thought of as a new random quantity for each partition. The distribution of the sum of weights is approximately Gaussian, with a peak at 1,000,000 N and width Failed to parse (Missing texvc executable; please see math/README to configure.): \scriptstyle 1,000,000\sqrt{N} , so that when 2N−1 is approximately equal to Failed to parse (Missing texvc executable; please see math/README to configure.): \scriptstyle 1,000,000\sqrt{N} the transition occurs. 223−1 is about 4 million, while the width of the distribution is only 5 million[10]. Notes
References
External links
|


