Abraham De Moivre, His Famous Theorem, and the Birth of the Normal Curve | by Sachin Date | Feb, 2024


It would be several years into his new life in England before a middle-aged De Moivre took a real, abiding interest in Jacob Bernoulli’s work on the Law of Large Numbers. To see what his interest led to, let’s visit Bernoulli’s theorem and the thought experiment that led Bernoulli to its discovery.

In Ars Conjectandi, Bernoulli had imagined a large urn containing r black tickets and s white tickets. Both r and s are unknown to you, and so is the true fraction p = r/(r+s) of black tickets in the urn. Now suppose you draw n tickets from the urn randomly with replacement and your random sample contains X_bar_n black tickets. Here, X_bar_n is the sum of n i.i.d. Bernoulli(p) random variables. Thus, X_bar_n/n is the fraction of black tickets that you observe. In essence, X_bar_n/n is your estimate of the true value of p.

The number of black tickets X_bar_n found in a random sample of black and white tickets has the familiar binomial distribution. That is:

X_bar_n ~ Binomial(n, p)

Where n is the sample size, and p=r/(r+s) is the true probability of a single ticket being a black ticket. Of course, p is unknown to you since in Bernoulli’s experiment, the number of black tickets (r) and white tickets (s) are unknown to you.

Since X_bar_n is binomially distributed, its expected value is E(X_bar_n) = np and its variance is Var(X_bar_n) = np(1 − p). Again, since p is unknown, both the mean and variance of X_bar_n are also unknown.
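Before going further, it may help to see the urn in action. Below is a minimal simulation sketch in Python; since the true p is unknown in Bernoulli’s setup, the `p_true` value here is purely an assumption for illustration:

```python
import random

random.seed(42)
p_true, n = 0.6, 5000   # assumed true fraction of black tickets (illustrative)

# Draw n tickets with replacement: 1 = black ticket, 0 = white ticket
sample = [1 if random.random() < p_true else 0 for _ in range(n)]
x_bar_n = sum(sample)   # number of black tickets: Binomial(n, p_true)

estimate = x_bar_n / n  # X_bar_n / n, the observed fraction estimating p
print(estimate)         # close to p_true for a sample this large
```

Re-running with larger n drives the estimate closer to `p_true`, which is exactly the behavior the Law of Large Numbers formalizes.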

Also unknown to you is the absolute difference between your estimate of p and the true value of p. This difference is the estimation error |X_bar_n/n − p|.

Bernoulli’s great discovery was to show that as the sample size n becomes very large, the odds of the error |X_bar_n/n − p| being smaller than any arbitrarily small positive number ϵ of your choosing become incredibly large. As an equation, his discovery can be stated as follows:

P(|X_bar_n/n − p| ≤ ϵ) > c · P(|X_bar_n/n − p| > ϵ)
Bernoulli’s theorem (Image by Author)

The above equation is the Weak Law of Large Numbers. In the above equation:

P(|X_bar_n/n − p| ≤ ϵ) is the probability of the estimation error being at most ϵ.
P(|X_bar_n/n − p| > ϵ) is the probability of the estimation error being greater than ϵ.
c is a very large positive number.

The WLLN can be stated in three other forms highlighted in the blue boxes below. These alternate forms result from doing some simple algebraic gymnastics as follows:

Alternate forms of the Weak Law of Large Numbers (Image by Author)

Now notice the probability in the third blue colored box:
P(μ − δ ≤ X_bar_n ≤ μ + δ) = (1 − α)

Or plugging back μ = np:
P(np − δ ≤ X_bar_n ≤ np + δ) = (1 − α)

Since X_bar_n ~ Binomial(n,p), it is straightforward to express this probability as a difference of two binomial probabilities as follows:

P(np − δ ≤ X_bar_n ≤ np + δ) where X_bar_n ~ Binomial(n,p) (Image by Author)

But it is at this point that things stop being straightforward. For large n, the factorials inside the two summations become enormous and all but impossible to calculate by hand. Imagine having to calculate 20!, let alone 100! or 1000!. What is needed is a good approximation technique for n!. In Ars Conjectandi, Jacob Bernoulli made a few weak attempts at approximating these probabilities, but the quality of his approximations left a lot to be desired.
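To appreciate the problem, here is a small Python sketch contrasting exact factorials with the approximation n! ≈ √(2πn)(n/e)^n, the formula that grew out of De Moivre’s work (Stirling later supplied the constant √(2π)):

```python
from math import factorial, pi, e, sqrt

# Stirling's approximation: n! ≈ sqrt(2*pi*n) * (n/e)^n
def stirling(n):
    return sqrt(2 * pi * n) * (n / e) ** n

for n in (10, 20, 50):
    ratio = stirling(n) / factorial(n)
    print(n, ratio)   # the ratio climbs toward 1 as n grows
```

Even at n = 10 the approximation is within about one percent of the exact value, while avoiding the explosive growth of the factorial itself.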

Abraham De Moivre’s big idea

In the early 1700s, when De Moivre first began looking at Bernoulli’s work, he immediately sensed the need for a fast, high quality approximation technique for the factorial terms in the two summations. Without an approximation technique, Bernoulli’s great accomplishment was like a big, beautiful kite without a string. A law of great beauty but of little practical use.

De Moivre recast the problem as an approximation for the sum of successive terms in the expansion of (a + b) raised to the nth power. This expansion, known as the binomial formula, goes as follows:

(a + b)^n = Σ_{k=0}^{n} [n! / (k!(n−k)!)] a^k b^(n−k)
The formula for (a+b) raised to the nth power (Image by Author)

De Moivre’s reasons for recasting the probabilities in the WLLN in terms of the binomial formula were arrestingly simple. It was known that if the sample sum X_bar_n has a binomial distribution, the probability of X_bar_n being less than or equal to some value n can be expressed as a sum of (n+1) probabilities as follows:

The formula for P(X_bar_n ≤ n) (Image by Author)

If you compare the coefficients of the terms on the R.H.S. of the above equation with the coefficients in the expansion of (a+b) raised to n, you’ll find them to be remarkably similar. And so De Moivre theorized that if you find a way to approximate the factorial terms in the expansion of (a+b) raised to n, you have paved the way for approximating P(X_bar_n ≤ n), and thus also the probability lying at the heart of the Weak Law of Large Numbers, namely:

P(np − δ ≤ X_bar_n ≤ np + δ) = (1 − α)
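The correspondence De Moivre exploited is easy to verify numerically. Here is a minimal sketch (the parameter values are illustrative) that builds the binomial CDF term by term, using exactly the coefficients of the (a+b)^n expansion:

```python
from math import comb

# P(X <= k) for X ~ Binomial(n, p): a sum of k+1 terms whose coefficients
# comb(n, j) are the binomial coefficients from the expansion of (a + b)^n
def binom_cdf(k, n, p):
    q = 1 - p
    return sum(comb(n, j) * p**j * q**(n - j) for j in range(k + 1))

# Summing all n+1 terms reproduces (p + q)^n = 1
total = binom_cdf(10, 10, 0.3)
print(total)   # 1.0, up to floating-point rounding
```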

For over 10 years, De Moivre toiled on the approximation problem, creating increasingly accurate approximations of the factorial terms. By 1733, he had largely concluded his work when he published what came to be called De Moivre’s theorem (or, less accurately, the De Moivre–Laplace theorem).

At this point, I could simply state De Moivre’s theorem, but that would spoil half the fun. Instead, let’s follow De Moivre’s train of thought and work through the calculations leading up to the formulation of his great theorem.

Our requirement is for a fast, high-accuracy approximation technique for the probability that lies at the heart of Bernoulli’s theorem, namely:

P(|X_bar_n/n − p| ≤ ϵ)

Or equivalently its transformed version:
P(np − δ ≤ X_bar_n ≤ np + δ)

Or in the most general form, the following probability:
P(x_1 ≤ X ≤ x_2)

In this final form, we have assumed that X is a discrete random variable that has a binomial distribution. Specifically, X ~ Binomial(n,p).

The probability P(x_1 ≤ X ≤ x_2) can be expressed as follows:

Formula for probability P(x_1 ≤ X ≤ x_2) (Image by Author)

Now let p, q be two real numbers such that:
0 ≤ p ≤ 1, 0 ≤ q ≤ 1, and q = (1 − p).

Since X ~ Binomial(n,p), E(X) = μ = np, and Var(X) = σ² = npq.

Let’s create a new random variable Z as follows:

Z = (X − μ)/σ = (X − np)/√(npq)
A few variable definitions (Image by Author)

Z is clearly the standardized version of X. As we’ll see, for large n, Z behaves like a standard normal random variable. Thus,

If X ~ Binomial(n,p), then Z ~ N(0, 1) asymptotically, as n → ∞

Keep this in mind, for we’ll revisit this fact in just a minute.

With the above framework in place, De Moivre showed that for very large values of n, the probability:

P(x1 ≤ X ≤ x2)

can be approximated by evaluating the following specific type of integral:

P(x1 <= X <= x2) asymptotically converges to the area under the curve exp(-z²/2) from z1 to z2. (Image by Author)

The ≃ sign means the L.H.S. asymptotically equals the R.H.S. In other words, as the sample size n grows to ∞, the L.H.S. converges to the R.H.S.

Did you notice something familiar about the integral on the R.H.S.? It’s the formula for the area under a standard normal variable’s probability density curve from z_1 to z_2.

Area under the PDF of N(0,1) from z_1=-1 to z_2=+1 (Image by Author)

And the formula inside the integral is the Probability Density Function of the standard normal random variable Z:

PDF of the standard normal random variable Z (Image by Author)

Let’s split apart the integral on the R.H.S. as a difference of two integrals as follows:

P(z1 ≤ Z ≤ z2) = P(Z ≤ z2) − P(Z ≤ z1) (Image by Author)

The two new integrals on the R.H.S. are, respectively, the cumulative probabilities P(Z ≤ z_2) and P(Z ≤ z_1).

The Cumulative Distribution Function P(Z ≤ z) of a standard normal random variable is represented using the standard notation:

𝛟(z)

Therefore, the integral on the L.H.S. of the above equation is equal to:

𝛟(z_2) − 𝛟(z_1).

Bringing it all together, we can see that the probability:

P(x1 ≤ X ≤ x2)

asymptotically converges to 𝛟(z_2) − 𝛟(z_1):

Image by Author
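Here is a quick numerical check of that convergence, assuming illustrative values for n, p, x_1, and x_2:

```python
from math import comb, erf, sqrt

# Standard normal CDF, phi(z), via the error function
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

n, p = 400, 0.5                       # illustrative values
q = 1 - p
mu, sigma = n * p, sqrt(n * p * q)

x1, x2 = 190, 210
# Exact: sum the binomial probabilities over [x1, x2]
exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(x1, x2 + 1))
# De Moivre: phi(z2) - phi(z1), with z = (x - mu) / sigma
approx = phi((x2 - mu) / sigma) - phi((x1 - mu) / sigma)

print(exact, approx)   # the gap between the two shrinks as n grows
```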

Now recall how we defined Z as the standardized version of X:

The standardized X (Image by Author)

And thus we also have the following:

(Image by Author)

While formulating his theorem, De Moivre defined the limits x_1 and x_2 as follows:

(Image by Author)

Substituting these values of x_1 and x_2 in the previous set of equations, we get:

(Image by Author)

And therefore, De Moivre showed that for very large n:

De Moivre’s Theorem (Image by Author)

Remember, what De Moivre really wanted was to approximate the probability on the L.H.S. of Bernoulli’s theorem:

Bernoulli’s Theorem (Image by Author)

Which he succeeded in doing by making the following simple substitutions:

Image by Author

Which produces the following asymptotic equality:

De Moivre’s approximation for Bernoulli’s theorem (Image by Author)

In a single elegant stroke, De Moivre showed how to approximate the probability in Bernoulli’s theorem for large sample sizes. And large sample sizes are what Bernoulli’s theorem is all about. There is, however, some subtext to De Moivre’s achievement. The integral on the R.H.S. does not have a closed form, and De Moivre approximated it using an infinite series.
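That series-based evaluation can be sketched as follows. Expanding exp(−z²/2) as a power series and integrating term by term, as De Moivre did, gives a rapidly converging sum; the erf-based value below is a modern closed-form check that was not available to De Moivre:

```python
from math import erf, factorial, pi, sqrt

# Integral of exp(-z^2/2) from 0 to k by integrating the power series
# term by term: sum of (-1)^m * k^(2m+1) / (2^m * m! * (2m+1))
def series_integral(k, terms=20):
    return sum((-1) ** m * k ** (2 * m + 1) / (2 ** m * factorial(m) * (2 * m + 1))
               for m in range(terms))

k = 1.0
via_series = series_integral(k)
via_erf = sqrt(pi / 2) * erf(k / sqrt(2))   # same area in modern notation
print(via_series, via_erf)                  # both ≈ 0.8556
```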

An illustration of De Moivre’s Theorem

Suppose there are exactly three times as many black tickets as white tickets in the urn, so the true fraction of black tickets is p = 3/4. Suppose also that you draw a random sample of 1000 tickets with replacement. Since p = 0.75, the expected number of black tickets is np = 750. Suppose the number of black tickets you observe in the sample is 789, a deviation of δ = 39 from the expected value. What is the probability that the observed count lies within 39 tickets of the expected value?

Let’s set out the facts:

n = 1000, p = 0.75, q = 0.25, np = 750, X_bar_n = 789, δ = 789 − 750 = 39 (Image by Author)

We wish to find out:

P(750 − 39 ≤ X_bar_n ≤ 750 + 39) = P(711 ≤ X_bar_n ≤ 789)

We’ll use De Moivre’s Theorem to find this probability. As we know, the theorem can be stated as follows:

De Moivre’s approximation for Bernoulli’s theorem (Image by Author)

We know that n=1000, p=0.75, X_bar_n=789, and δ=39. We can find k as follows:

k = δ/σ = δ/√(npq) = 39/√(1000 × 0.75 × 0.25) ≈ 39/13.693 ≈ 2.848 (Image by Author)

Plugging in all the values:

Application of De Moivre’s theorem (Image by Author)

In approximately 99.56% of random samples of size 1000 tickets each, the number of black tickets will lie between 711 and 789.
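As a sanity check, the figure above is easy to reproduce with a short computation that sets De Moivre’s approximation beside the exact binomial sum:

```python
from math import comb, erf, sqrt

n, p, delta = 1000, 0.75, 39
q = 1 - p
mu, sigma = n * p, sqrt(n * p * q)   # mu = 750, sigma ≈ 13.693

# Standard normal CDF via the error function
def phi(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

k = delta / sigma                    # ≈ 2.848
approx = phi(k) - phi(-k)            # De Moivre's approximation

# Exact P(711 <= X_bar_n <= 789) as a binomial sum
exact = sum(comb(n, x) * p**x * q**(n - x) for x in range(711, 790))

print(round(approx, 4))              # ≈ 0.9956
print(round(exact, 4))               # very close to the approximation
```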


