In machine learning, we often choose the model $\delta$ that minimizes the expected risk. The risk is the expected loss of the model over the data:

$$R(\delta) = \mathbb{E}_{(\mathbf{x}, y) \sim p_{*}}\left[L(y, \delta(\mathbf{x}))\right]$$

However, we usually don't know the true data-generating distribution $p_{*}$, so we must use a Monte Carlo approximation of this expectation (i.e. integral).

First, we approximate the distribution of $L(y, \delta(\mathbf{x}))$ with the empirical distribution $\{L(y_i, \delta(\mathbf{x}_i))\}_{i=1}^N$. We simply draw samples, then compute the arithmetic mean of the loss applied to the samples:

$$R_{emp}(\delta) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \delta(\mathbf{x}_i))$$
This is standard empirical risk minimization (ERM), but I want to point out that the empirical risk $R_{emp}$ is a Monte Carlo approximation of an integral.
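To make this concrete, here is a minimal sketch in NumPy. The setup is hypothetical (a constant predictor under squared loss, with $y$ drawn from a standard normal), chosen so the true risk has a closed form we can compare against:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a constant predictor delta(x) = c under squared loss,
# with y ~ Normal(0, 1). The true risk E[(y - c)^2] = Var[y] + c^2 = 1 + c^2.
c = 0.5
N = 100_000

y = rng.normal(loc=0.0, scale=1.0, size=N)  # samples from p_*
losses = (y - c) ** 2                        # L(y_i, delta(x_i))
R_emp = losses.mean()                        # empirical risk = Monte Carlo estimate

print(R_emp)  # close to the true risk 1 + c^2 = 1.25
```

As $N$ grows, $R_{emp}$ converges to the true risk, which is exactly the Monte Carlo argument.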

Recall that in Monte Carlo approximation, we approximate the expectation (or other integral) of some function using finite samples:

$$\mathbb{E}[f(\mathbf{x})] = \int f(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \approx \frac{1}{N} \sum_{i=1}^{N} f(\mathbf{x}_i), \qquad \mathbf{x}_i \sim p$$
Monte Carlo approximation has an advantage over numerical integration (which is based on evaluating the function at a fixed grid of points): the function is only evaluated in places where there is non-negligible probability (Murphy, p. 53). This is why Monte Carlo, rather than grid-based numerical integration, is typically used to approximate such integrals.
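The two approaches can be compared directly on a toy integral. The example below is my own (a quadratic $f$ under a Gaussian, where the expectation has a closed form of 10); note how the grid method spends many evaluations in regions where $p(\mathbf{x})$ is essentially zero, while Monte Carlo samples only where the probability mass is:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical example: estimate E[f(x)] for x ~ Normal(3, 1), f(x) = x^2.
# Closed form: E[x^2] = Var[x] + E[x]^2 = 1 + 9 = 10.
def f(x):
    return x ** 2

mu, sigma = 3.0, 1.0

# Monte Carlo: draw samples where p has mass, then average f over them.
samples = rng.normal(mu, sigma, size=100_000)
mc_estimate = f(samples).mean()

# Grid-based numerical integration: evaluate f(x) p(x) on a fixed grid,
# including regions where p(x) is negligible (here, most of [-10, 10]).
grid = np.linspace(-10.0, 10.0, 10_001)
dx = grid[1] - grid[0]
p = np.exp(-0.5 * ((grid - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
grid_estimate = (f(grid) * p).sum() * dx  # Riemann sum of f(x) p(x)

print(mc_estimate, grid_estimate)  # both near 10
```

In one dimension the grid method is perfectly workable; the sampling advantage becomes decisive in higher dimensions, where a fixed grid grows exponentially.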

## References

• Machine Learning: A Probabilistic Perspective (Murphy), pp. 204-205
• Natural Language Understanding with Distributed Representations (Cho), pp. 8-9