Café Math : Monte Carlo Integration

# 1. Foreword

Hi, today I want to talk about the basic aspects of Monte Carlo integration. It is a stochastic algorithm for numerically computing an integral, $$\int_I f(x) \, dx$$ It has a host of good properties. First, it is surprisingly simple; second, the exact same method can be applied whatever the function $f$ is, in whatever dimension. It doesn't get more complicated as $f$ gets more complicated or as the dimensionality of the problem changes. Its main disadvantage over other methods, however, is that it is spectacularly slow.

# 2. Principle of the Method

The principle of the method is that the integral to compute is related to the expectation of a random variable, $$\lambda \, \mathrm{E}(f(X)) = \int_I f(x) \, dx$$ where $\lambda$ is the length of the domain of integration $I$ and $X$ is a random variable uniformly distributed on $I$. By estimating the mean of the random variable $Y = \lambda f(X)$, one is actually estimating the integral of $f$ over $I$. The problem of numerical integration thus reduces to that of estimating the mean of a random variable. This is the Monte Carlo integration principle.

The algorithm is quite simple. First draw a bunch of independent random values, uniformly distributed over $I$, $$X_1, \dots , X_N \sim \mathcal{U}(I)$$ Then compute the value, $$\hat{\mu} = \frac{\lambda}{N} \sum_{k = 1}^{N} f(X_k)$$ which is an estimator of the expectation of the random variable $Y = \lambda f(X)$. The random variable $\hat{\mu}$ is just the empirical mean of the sample $Y_1, \dots, Y_N$; it is not the expectation of $Y$ itself.
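In code, the whole algorithm fits in a few lines. Here is a minimal Python sketch (the function name `mc_integrate` and the choice of Python are mine; the original Java program mentioned below is not reproduced in this excerpt):

```python
import random

def mc_integrate(f, a, b, n, rng=random):
    """Estimate the integral of f over [a, b] from n uniform draws."""
    lam = b - a                                       # length of the domain I
    total = sum(f(rng.uniform(a, b)) for _ in range(n))
    return lam * total / n                            # the estimator mu-hat
```

For instance, `mc_integrate(lambda x: x * x, 0.0, 1.0, 100_000)` should land close to $1/3$.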

# 3. Small Digression on the Hypothesis

What are the characteristics of that estimator? Well, it is the empirical mean of the $N$ independent identically distributed random variables $Y_1, \dots, Y_N$. One immediately thinks of the central limit theorem, the most important theorem of statistics, but for it to apply, its hypotheses must be fulfilled. What are they? Well, there are four of them:

1. The random variables $Y_k$ must be independent.
2. They must be identically distributed.
3. They must have a mean $\mu$.
4. They must have a variance $\sigma^2$.

In almost any textbook, the last two hypotheses are overlooked... Doesn't every random variable have a mean and a variance? Well, no, not even close... Take the Cauchy distribution for example; its density is, $$\frac{dx}{\pi(1+x^2)}$$ Both of the following integrals fail to converge, \begin{align*} \mu_1' &= \frac{1}{\pi} \int_{-\infty}^{+\infty} \frac{x}{1+x^2}\,dx \\ \mu_2' &= \frac{1}{\pi} \int_{-\infty}^{+\infty} \frac{x^2}{1+x^2}\,dx \end{align*} So a random variable with a Cauchy distribution has neither a mean nor a variance.
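This failure is easy to see numerically. Here is a small Python sketch (my own illustration, not from the article): the running mean of Cauchy draws never settles down, while that of normal draws does.

```python
import math
import random

def cauchy_draw(rng):
    # Inverse-CDF sampling: if U ~ U(0,1) then tan(pi*(U - 1/2)) is standard Cauchy.
    return math.tan(math.pi * (rng.random() - 0.5))

def running_means(draws):
    """Partial means m_n = (x_1 + ... + x_n) / n of a sequence of draws."""
    means, total = [], 0.0
    for n, x in enumerate(draws, start=1):
        total += x
        means.append(total / n)
    return means
```

Plotting `running_means(cauchy_draw(rng) for _ in range(100_000))` shows sudden jumps caused by occasional enormous draws, whereas the running mean of `rng.gauss(0, 1)` draws drifts quietly toward $0$.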

Ok, let's check the hypotheses of the central limit theorem:

1. The $X_k$ are independent, so the $Y_k = \lambda f(X_k)$ are independent too.
2. The $X_k$ are identically distributed, so the $Y_k$ are too.
3. If the function $f$ is in $L^1(I)$, then in particular the integral, $$\mu_1' = \frac{1}{\lambda}\int_I f(x) \, dx$$ is convergent.
4. If the function $f$ is in $L^2(I)$, then in particular the integral, $$\mu_2' = \frac{1}{\lambda^2}\int_I (f(x))^2 \, dx$$ is convergent.

Therefore, as long as $f \in L^1(I) \cap L^2(I)$, we can safely apply the central limit theorem. For the record, the conclusion of the theorem says that the random variable, $$\hat{\mu} = \frac{\lambda}{N} \sum_{k = 1}^{N} f(X_k)$$ tends in law to a normal distribution with mean $\mu$ and variance $\sigma^2 / N$, where \begin{align*} \mu &= \mathrm{E}(Y) = \lambda \, \mathrm{E}(f(X)) \\ \sigma^2 &= \mathrm{V}(Y) = \lambda^2\, \mathrm{V}(f(X)) \end{align*} are the mean and variance of $Y = \lambda f(X)$.
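A quick numerical sanity check of this conclusion (a Python sketch of mine): for $f(x) = x^2$ on $[0,1]$ we have $\sigma^2 = \int_0^1 x^4\,dx - (\int_0^1 x^2\,dx)^2 = 4/45$, so the standard deviation of $\hat{\mu}$ should be about $\sqrt{4/45}/\sqrt{N}$.

```python
import math
import random

def mu_hat(n, rng):
    """One realisation of the estimator for the integral of x^2 on [0, 1]."""
    return sum(rng.uniform(0.0, 1.0) ** 2 for _ in range(n)) / n

rng = random.Random(4)
n, reps = 100, 2000
estimates = [mu_hat(n, rng) for _ in range(reps)]
mean = sum(estimates) / reps
sd = math.sqrt(sum((e - mean) ** 2 for e in estimates) / (reps - 1))
# The CLT predicts sd close to sqrt(4/45) / sqrt(100), i.e. about 0.0298.
```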

# 4. Confidence Interval

Now suppose that the function to integrate, $f$, is both $L^1(I)$ and $L^2(I)$. From that, we know for sure that for large $N$ the quantity, $$\hat{\mu} = \frac{\lambda}{N} \sum_{k = 1}^{N} f(X_k)$$ is fairly close to being normally distributed. For large $N$ a good confidence interval for the quantity, $$\mu = \mathrm{E}(Y) = \int_I f(x)\,dx$$ is therefore given by, $$P\left[\,|\,\mu - \hat{\mu}\,| \le \frac{s}{\sqrt{N}} \left|\,z_{\alpha/2}\,\right|\,\right] \ge 1 - \alpha$$ where, \begin{align*} s^2 &= \frac{1}{N-1} \sum_{k=1}^N (Y_k - \hat{\mu})^2 \\ &= \frac{1}{N-1} \left(\sum_{k=1}^N Y_k^2 - N\hat{\mu}^2 \right) \end{align*} and $z_{\alpha/2}$ is the standard normal quantile of order $\alpha/2$. In other words, there is a probability of $1-\alpha$ that the unknown quantity $\mu$ lies in the confidence interval, $$\mu \in \left[\,\hat{\mu} - \frac{s}{\sqrt{N}} \left|\,z_{\alpha/2}\,\right|,\, \hat{\mu} + \frac{s}{\sqrt{N}} \left|\,z_{\alpha/2}\,\right|\,\right]$$
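In practice one returns the estimate together with the half-width $\frac{s}{\sqrt{N}}|z_{\alpha/2}|$ of the interval. A Python sketch (function name and interface are my own choices):

```python
import math
import random

def mc_with_interval(f, a, b, n, rng, z=2.5758):
    """Monte Carlo estimate of the integral of f over [a, b], together with
    the half-width of the confidence interval.  The default z is the
    standard normal quantile for alpha = 0.01, i.e. 99% confidence."""
    lam = b - a
    ys = [lam * f(rng.uniform(a, b)) for _ in range(n)]
    mu_hat = sum(ys) / n
    s2 = sum((y - mu_hat) ** 2 for y in ys) / (n - 1)   # sample variance s^2
    return mu_hat, z * math.sqrt(s2 / n)
```

The half-width is exactly the quantity used as a stopping criterion later in the comments.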

A note for the purists. There are two approximations made here. The first is that the random variable $\hat{\mu}$ is not perfectly normally distributed; the second is that even if it were, the studentized statistic $(\hat{\mu} - \mu)/(s/\sqrt{N})$ built from $N$ samples of a normally distributed random variable follows a Student-$t$ distribution with $N-1$ degrees of freedom, not a normal one. For large $N$, however, this distribution is fairly close to the normal distribution.

# 5. Rate of Convergence Estimation

Ok, now we've got the method. How many terms are needed to compute an integral using it? Say we want to evaluate the integral, $$\int_0^1 x^2 \, dx = \frac{1}{3}$$ with an error less than $\varepsilon = 10^{-3}$ with probability $p = 0.99$. As $\alpha = 1 - p$ we have $\alpha/2 = 0.005$. Looking in a table of normal quantiles we get, $$\left|\,z_{\alpha/2}\,\right| = 2.58...$$ The inequality for the confidence interval, $$\frac{s}{\sqrt{N}} \left|\,z_{\alpha/2}\,\right| \le \varepsilon$$ becomes, $$N \ge \frac{s^2\, z_{\alpha/2}^2}{\varepsilon^2}$$ Mmmh, bad news... The small quantity $\varepsilon$ appears squared in the denominator, while the large values $s$ and $\left|\,z_{\alpha/2}\,\right|$ appear squared in the numerator. It couldn't be worse, really...

Using the computation, \begin{align*} s^2 &\simeq E(f(X)^2) - E(f(X))^2 \\ &= \int_0^1 x^4 \, dx - \left(\int_0^1 x^2\, dx \right)^2 = \frac{4}{45} = 0.08888... \end{align*} we finally get, $$N \ge 5.9168... \times 10^5$$ And indeed, running the Java program provided at the top of the page, we get the following runs:

```
epsilon = 0.001
N = 590082
The result is: 0.333035766138238 +/- 1.0E-3
N = 592527
The result is: 0.333510296419398 +/- 1.0E-3
N = 591419
The result is: 0.333070996685725 +/- 1.0E-3
N = 590761
The result is: 0.332972761984602 +/- 1.0E-3
```
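Since the Java program itself is not reproduced in this excerpt, here is a hedged Python sketch of the same sample-size computation (the function name is mine):

```python
import math

def required_samples(s2, z, eps):
    """Smallest N such that sqrt(s2 / N) * z <= eps."""
    return math.ceil(s2 * z * z / (eps * eps))
```

With $s^2 = 4/45$, $z = 2.58$ and $\varepsilon = 10^{-3}$ this gives $N \approx 5.9168 \times 10^5$, matching the bound above; the actual values of $N$ in the runs differ slightly because the program estimates $s$ from the sample.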

Cyrus  posted 2012-04-10 11:31:39

Excellent work! Perhaps providing an example or two might make it more educational for students. For example,

Question. Let $U_1, U_2, U_3, ..., U_n$ be independent and identically distributed random variables where each $U_i$ is uniformly distributed on the interval $0 \le x \le 1$. Suppose that $f(x)$ is an integrable function over the closed interval $0 \le x \le 1$ and $f(x) \ge 0$ on $0 \le x \le 1$. Show that, $$\lim_{n \rightarrow\infty} \frac{1}{n} \sum_{i = 1}^{n} f(U_i) = \int_0^1 f(x)\, dx$$ in probability.

Answer. Let $X_i = f(U_i)$. The density function of a uniform distribution over the interval $0 \le x \le 1$ is equal to $1$ on the interval $0 \le x \le 1$. Thus $$\mathrm{E}\,|\,X_i\,| = \mathrm{E}\,|\,f(U_i)\,| = \int_0^1|\,f(x)\,|\, dx = \int_0^1 f(x) \, dx < \infty.$$ And $$\mathrm{E}\,X_i = \mathrm{E}\,f(U_i) = \int_0^1 f(x) \, dx$$ where the $X_i$'s are independent and identically distributed. By applying the law of large numbers, we can conclude that $$\lim_{n \rightarrow\infty} \frac{1}{n} \sum_{i = 1}^{n} f(U_i) = \lim_{n \rightarrow\infty} \frac{1}{n} \sum_{i = 1}^{n} X_i = \int_0^1 f(x)\, dx$$ in probability.

Samuel Vidal  posted 2012-04-11 23:08:28

Hi Cyrus. Thank you very much for your friendly comment. The presentation you suggest has the charm and elegance of simplicity. It would be perfect as an introductory example at the beginning of this article.

If I chose a slightly more general setting, it's because the main point of the discussion was to provide the reader with a fully exploitable version of the math, ready to be coded into a computer program. This is, for example, why I chose to talk about the confidence interval: it provides a stopping criterion for the algorithm.

Anonymous  posted 2012-05-19 12:28:03

By Jove, this is a very beautiful and elegant example.

kangu  posted 2012-06-11 09:47:55

Thanks, it was helpful! How about some sampling algorithms? Sobol for example :) Thanks again and keep up the good work!

Samuel VIDAL  posted 2012-06-28 21:03:03

Dear Anonymous, you are right, Cyrus is such a good teacher ;-).

Dear Kangu, thank you for your friendly comment. Happy to hear you find this post useful. You are right, I was planning to continue and talk about quasi-Monte Carlo methods, but lately I fell in love with a subject I don't know much about: Analytic Number Theory ;-) And I am pretty busy trying to catch up.

Richard Randa  posted 2013-03-16 14:21:07

Would you please explain stratified sampling in relation to Monte Carlo?

Samuel VIDAL  posted 2013-03-22 01:48:03

Ok, nice question. I didn't know about the subject and Wikipedia only gives the general idea. So I tried to figure it out myself with an example. Here is what I got.

Ok, suppose you want to run a survey to find out what proportion of the population watches football. You know beforehand that this proportion is very different between men and women.

Say $p_1 = 0.80$ of men and $p_2 = 0.50$ of women watch football, and say the proportion of men is $p_m = 0.45$ and of women is $p_f = 0.55$.

If you sample uniformly, each time you sample someone the probability of finding a person who watches football is $p = p_1 p_m + p_2 p_f$. Of course you don't know this number; it is the quantity you want to estimate with precision.

So if you sample, say, $N = 1000$ persons, the count you get follows a binomial distribution of parameter $p$. Say $K$ is the number of persons who answer yes to your question. The quantity $K$ is a random variable with a binomial distribution: for $0 \le k \le N$, $$\mathrm{P}(K = k) = \binom{N}{k} p^k(1-p)^{N-k}.$$ The variance of this variable is, $$\mathrm{Var}(K) = N p (1-p)$$ Your estimator for $p$ will be $\hat{p} = K/N$. This estimator is unbiased because its expectation is, $$\mathrm{E}(\hat{p}) = \mathrm{E}(K)/N = p$$ and the quality of this estimator is given by its variance, $$\mathrm{Var}(\hat{p}) = \mathrm{Var}(K)/N^2 = \frac{1}{N} p (1-p)$$
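A quick simulation of this naive survey (a Python sketch of mine; the parameters are those of the example):

```python
import random

def naive_survey(p1, p2, pm, n, rng):
    """Sample n persons; each is a man with probability pm, and answers yes
    with probability p1 (men) or p2 (women).  Returns p-hat = K / n."""
    yes = 0
    for _ in range(n):
        watches = p1 if rng.random() < pm else p2
        yes += rng.random() < watches
    return yes / n
```

Repeating `naive_survey(0.8, 0.5, 0.45, 1000, rng)` many times, the estimates average out near $p = 0.635$ with variance close to $p(1-p)/N = 0.231775/N$.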

Ok in our case $p = 0.80 \cdot 0.45 + 0.50 \cdot 0.55 = 0.635$ so we get, $$\mathrm{Var}(K/N) = \frac{1}{N} 0.635 (1 - 0.635) = 0.231775 \frac{1}{N}.$$

If we restrict our survey to men, the variance we get is, $$\frac{1}{N} p_1 (1-p_1) = 0.16 \frac{1}{N}$$ For women, it is, $$\frac{1}{N} p_2 (1-p_2) = 0.25 \frac{1}{N}$$

The variance being lower for men than for women reflects the fact that there is a stronger consensus among men about watching football.

All is well with this, the naive Monte Carlo version of the survey. But can we use the fact that the proportions are really different between men and women to improve the quality of our estimator?

The idea is as follows: sample $n_1$ men and $n_2$ women with $n_1 + n_2 = N$, instead of relying on chance alone, and determine the best values for those two numbers so as to minimise the variance of our estimator.

Say now that among the $n_1$ men and $n_2$ women, $k_1$ men and $k_2$ women watch football. These are again random variables with binomial distributions, $k_1 \sim \mathrm{B}(n_1, p_1)$ and $k_2 \sim \mathrm{B}(n_2, p_2)$.

The estimator you choose is now, $$\hat{p'} = \frac{k_1}{n_1} p_m + \frac{k_2}{n_2} p_f.$$ This estimator is unbiased because, $$\mathrm{E}(\hat{p'}) = p_1 p_m + p_2 p_f = p.$$ Now take a look at its variance, $$\mathrm{Var}(\hat{p'}) = \frac{p_m^2}{n_1^2} \mathrm{Var}(k_1) + \frac{p_f^2}{n_2^2} \mathrm{Var}(k_2) = \frac{p_m^2}{n_1} p_1 (1-p_1) + \frac{p_f^2}{n_2} p_2 (1-p_2)$$ To optimise this quantity, remember $n_2 = N - n_1$ and differentiate with respect to $n_1$, $$\frac{\partial}{\partial n_1} \left( \frac{p_m^2}{n_1} p_1 (1-p_1) + \frac{p_f^2}{N-n_1} p_2 (1-p_2) \right) = -\frac{p_m^2 p_1 (1-p_1)}{n_1^2}+\frac{p_f^2 p_2 (1-p_2)}{(N-n_1)^2} = 0$$

With the values of our problem this becomes, $$-\frac{0.0324}{n_1^2}+\frac{0.075625}{(N-n_1)^2} = 0$$ Clearing denominators gives a second-degree equation in $n_1$, and I get approximately $n_1 = 0.396\,N$ as its positive solution.

Substituting this value back into the formula for the variance of our estimator, I get, $$\mathrm{Var}(\hat{p'}) = 0.207 \frac{1}{N},$$ which is better than $$\mathrm{Var}(\hat{p}) = 0.231775 \frac{1}{N}.$$
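The allocation derived above is the classical Neyman allocation, $n_h \propto w_h \sigma_h$ for stratum weights $w_h$ and within-stratum standard deviations $\sigma_h$. A small Python sketch checking the numbers (function names are mine):

```python
import math

def neyman_allocation(weights, variances, N=1.0):
    """Stratum sizes n_h proportional to w_h * sigma_h, summing to N."""
    prods = [w * math.sqrt(v) for w, v in zip(weights, variances)]
    total = sum(prods)
    return [N * p / total for p in prods]

def stratified_variance(weights, variances, ns):
    """Variance sum_h w_h^2 * v_h / n_h of the stratified estimator."""
    return sum(w * w * v / n for w, v, n in zip(weights, variances, ns))
```

With weights $(0.45, 0.55)$ and within-stratum variances $p_1(1-p_1) = 0.16$ and $p_2(1-p_2) = 0.25$, this recovers $n_1 \approx 0.396\,N$ and the variance $0.207/N$ computed above.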

Hope this is useful. In any case thank you for your question, I've learned something.
