When Math Fails: Intractable Posteriors and MCMC

In the coin toss example, the math was elegant. We could easily calculate the exact integral because we used a simple Beta prior and a simple Binomial likelihood.
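To make the "elegant math" concrete, here is a minimal sketch of the conjugate Beta-Binomial update. The specific numbers (a Beta(2, 2) prior and 7 heads in 10 tosses) are illustrative assumptions, not values from the text:

```python
# Conjugate Beta-Binomial update (illustrative numbers):
# prior Beta(a, b), observe h heads in n tosses -> posterior Beta(a + h, b + n - h).
a, b = 2.0, 2.0   # assumed prior pseudo-counts
h, n = 7, 10      # assumed data: 7 heads in 10 tosses

a_post, b_post = a + h, b + (n - h)   # posterior is Beta(9, 5)

# Because of conjugacy, the posterior predictive P(next toss = heads)
# has a closed form: it is simply the posterior mean of z.
p_heads = a_post / (a_post + b_post)  # 9/14, approximately 0.643
```

No integral ever has to be evaluated numerically here; the conjugate pair collapses both the posterior and the predictive into simple arithmetic on the pseudo-counts.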

However, in real-world machine learning models, $z$ isn't a single coin-toss probability. A model might have hundreds, thousands, or even millions of parameters (as in neural networks). Furthermore, we essentially never get to use perfectly matching "conjugate" prior-likelihood pairs.

In these complex, high-dimensional scenarios, the integral $\int p(y'|z) p(z|y) dz$ becomes mathematically intractable. We hit a computational brick wall: the exact analytical expected value simply cannot be computed.

This is where Markov Chain Monte Carlo (MCMC) comes in. When the posterior distribution is too complex to integrate directly, MCMC algorithms allow us to draw representative random samples directly from that complex posterior space.

Averaging thousands of MCMC samples yields a highly accurate approximation of the intractable integral, letting us retain the full power of Bayesian uncertainty even when the exact math is impossible to solve.
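The idea can be sketched with a random-walk Metropolis-Hastings sampler, one of the simplest MCMC algorithms. For illustration, assume a toy coin-toss posterior proportional to $z^8(1-z)^4$ (a Beta(9, 5) shape), so the exact posterior mean $9/14$ is known and the sample average can be checked against it:

```python
import math
import random

def log_post(z):
    # Log of the UNNORMALIZED posterior density z^8 * (1 - z)^4.
    # MCMC only ever needs the density up to a constant -- the intractable
    # normalizer is never computed.
    if z <= 0.0 or z >= 1.0:
        return -math.inf
    return 8.0 * math.log(z) + 4.0 * math.log(1.0 - z)

def mh_samples(n_samples, step=0.1, seed=0):
    rng = random.Random(seed)
    z = 0.5  # start somewhere inside (0, 1)
    samples = []
    for _ in range(n_samples):
        proposal = z + rng.gauss(0.0, step)
        # Accept with probability min(1, p(proposal) / p(z)), in log space.
        if math.log(rng.random()) < log_post(proposal) - log_post(z):
            z = proposal
        samples.append(z)
    return samples

samples = mh_samples(50_000)
estimate = sum(samples) / len(samples)  # approximates E[z | y] = 9/14
```

The average of the samples stands in for the integral: no closed-form solution is used anywhere, yet the estimate lands close to the exact answer of about 0.643.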

A Note to Aid Understanding

A critical trap when studying Bayesian machine learning is assuming there is only one intractable integral to worry about, when in reality, there are two distinct mathematical roadblocks.

  1. The first is finding the posterior distribution $p(\theta|x, y)$: computing its denominator (the marginal likelihood) requires integrating over the entire, often high-dimensional, parameter space, which is analytically intractable for complex models.
  2. The second, frequently overlooked hurdle is the posterior predictive distribution $p(y'|x', x, y)$. Even with a perfect closed-form posterior in hand, predicting a new data point requires a second integration of that posterior against the likelihood, which once again has no closed-form solution.

Fortunately, modern probabilistic machine learning provides a "two birds, one stone" solution to this problem: sampling-based approximation.

By shifting our approach from pure calculus to discrete sampling, we neatly sidestep both computational bottlenecks at once.
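A minimal sketch of how one set of posterior samples clears both roadblocks at once. For illustration, the samples are drawn directly from an assumed Beta(9, 5) toy posterior (standing in for MCMC output), and the predictive query is a made-up example:

```python
import random

rng = random.Random(0)
# Posterior samples -- drawn exactly here as a stand-in for MCMC output.
samples = sorted(rng.betavariate(9, 5) for _ in range(100_000))

# Roadblock 1: summarize the posterior straight from the samples.
# The marginal likelihood (the intractable denominator) is never needed.
post_mean = sum(samples) / len(samples)
ci_lo, ci_hi = samples[2_500], samples[97_500]  # ~95% credible interval

# Roadblock 2: the posterior predictive. Replace the integral
# of p(y'|z) * p(z|y) with an average of the likelihood over the samples.
# Example query: P(at least 2 heads in the next 3 tosses).
def lik_at_least_2_of_3(z):
    return 3 * z**2 * (1 - z) + z**3

p_pred = sum(lik_at_least_2_of_3(z) for z in samples) / len(samples)
```

The same bag of samples answers both questions: sample statistics replace the posterior's normalizing integral, and averaging the likelihood over the samples replaces the predictive integral.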