In the coin toss example, the math was elegant. We could easily calculate the exact integral because we used a simple Beta prior and a simple Binomial likelihood.
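To make the "elegant math" concrete, here is a minimal sketch of Beta-Binomial conjugacy. The specific numbers (a Beta(2, 2) prior, 7 heads in 10 tosses) are hypothetical placeholders, not the figures from the original example:

```python
# Hypothetical numbers: Beta(2, 2) prior, 7 heads and 3 tails observed.
a_prior, b_prior = 2.0, 2.0
heads, tails = 7, 3

# Conjugacy: a Beta(a, b) prior with a Binomial likelihood yields a
# Beta(a + heads, b + tails) posterior in closed form -- no integration needed.
a_post = a_prior + heads
b_post = b_prior + tails

# The posterior predictive P(next toss = heads) is the posterior mean,
# which for a Beta(a, b) distribution is simply a / (a + b).
p_next_heads = a_post / (a_post + b_post)
print(round(p_next_heads, 3))  # -> 0.643
```

The entire "integral" collapses to updating two counts and taking a ratio, which is exactly the elegance that breaks down once we leave the conjugate setting.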
However, in real-world machine learning models, $z$ isn't a single coin-toss probability. A model might have hundreds, thousands, or even millions of parameters (like the weights of a neural network). Furthermore, we rarely, if ever, get to use perfectly matching "conjugate" distributions.
In these complex, high-dimensional scenarios, the integral $\int p(y'|z) p(z|y) dz$ becomes mathematically intractable. We hit a computational brick wall where calculating the exact analytical expected value is impossible.
This is where Markov Chain Monte Carlo (MCMC) comes in. When the posterior distribution is too complex to integrate directly, MCMC algorithms allow us to draw representative random samples directly from that complex posterior space.
Given enough samples, averaging thousands of MCMC draws serves as a highly accurate approximation of the intractable integral, allowing us to retain the full power of Bayesian uncertainty quantification even when the exact math is impossible to solve.
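The averaging idea can be sketched on the coin example, where we know the exact answer and can check against it. This assumes the hypothetical Beta(9, 5) posterior (Beta(2, 2) prior plus 7 heads, 3 tails); here we cheat and draw samples directly from the known posterior, purely to show that a sample average approximates the integral:

```python
import random

random.seed(0)

# Approximate E[z | y] = ∫ z p(z|y) dz by averaging draws from the posterior.
# The posterior here is Beta(9, 5), so the exact answer is 9/14 ≈ 0.643.
samples = [random.betavariate(9, 5) for _ in range(50_000)]
mc_estimate = sum(samples) / len(samples)

print(round(mc_estimate, 3))  # close to the exact value 9/14 ≈ 0.643
```

In a real problem we cannot sample the posterior directly like this; that is precisely the job MCMC algorithms do for us.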
A critical trap when studying Bayesian machine learning is assuming there is only one intractable integral to worry about, when in reality there are two distinct mathematical roadblocks: the evidence (marginal likelihood) integral $p(y) = \int p(y|z) p(z) dz$ needed to normalize the posterior, and the posterior predictive integral $\int p(y'|z) p(z|y) dz$ needed to make predictions.
Fortunately, modern probabilistic machine learning provides a "one stone, two birds" solution to this problem: sampling-based approximation.
By shifting our approach from pure calculus to discrete sampling, we neatly sidestep both computational bottlenecks at once.