Optimization plays a vital role not only in machine learning but also in everyday life, operating behind the scenes in numerous familiar scenarios. It determines the quickest routes in navigation apps like Google Maps, compares prices in online shopping, and improves energy efficiency through smart thermostats. Airlines optimize flight and crew schedules, while fitness trackers customize workout and diet plans. Streaming services like Netflix adjust video quality to internet speed, social media platforms curate personalized feeds, and manufacturers align production and inventory with demand.
Operations Research: Imagine a bakery that produces two types of goods: artisan bread (variable $x_1$ ) and gourmet pastries (variable $x_2$). Each product line uses limited resources—flour, oven capacity, and labor hours—so the bakery must optimize production to maximize profit without exceeding these constraints. Suppose each loaf of bread yields €5 in profit and each pastry yields €4. The bakery has up to 20 kilograms of flour and 16 labor hours available daily. A simple model might look like:
$$ \begin{align*} \argmax_{x_1, x_2} \quad &5x_1 + 4x_2 &(\text{total profit})\\ \text{s.t.} \quad & 1x_1 + 0.5x_2 \leq 20 &(\text{flour constraint})\\ & 0.25x_1 + 0.5x_2 \leq 16 & (\text{labor constraint})\\ & x_1, x_2 \geq 0 \end{align*} $$
This is a linear programming problem. Its objective function $5x_1 + 4x_2$ quantifies the total profit from producing bread and pastries. The constraints, represented by the inequalities, ensure production stays within the bakery's resource limits. Specifically, $1x_1 + 0.5x_2 \leq 20$ stipulates that the combined flour usage for both products does not exceed 20 kilograms. Meanwhile, $0.25x_1 + 0.5x_2 \leq 16$ mandates that the total labor hours do not surpass 16 hours, assuming each of two workers is available for 8 hours.
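To make this concrete, the problem can be handed to an off-the-shelf linear programming solver. The sketch below uses SciPy's `linprog` (one possible choice, not prescribed by the model itself); since `linprog` minimizes, the profit coefficients are negated.

```python
from scipy.optimize import linprog

# Maximize 5*x1 + 4*x2 by minimizing its negation (linprog only minimizes).
c = [-5, -4]

# Inequality constraints A_ub @ x <= b_ub.
A_ub = [[1.0, 0.5],   # flour:  1.00*x1 + 0.50*x2 <= 20
        [0.25, 0.5]]  # labor:  0.25*x1 + 0.50*x2 <= 16
b_ub = [20, 16]

bounds = [(0, None), (0, None)]  # x1 >= 0, x2 >= 0

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("production plan:", res.x, "profit:", -res.fun)
```

Note that the optimal plan may be fractional; if the bakery can only produce whole loaves and pastries, the model would need integer constraints, turning it into an integer programming problem.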
Inverse Problem (Image Classification): Optimization plays a pivotal role in solving inverse problems, where inputs must be inferred from known outputs given a well-defined forward function. This approach is essential in various domains, including engineering, physics, and economics, where directly measuring inputs can be impractical. In deep learning, a classic example is reversing the classification process to discover which input maximizes the model’s confidence for a specific class. Formally, one might write:
$$ \argmax_{\hat{\mathbf{x}}_{\text{cat}}} \mathcal{L}(f_{\text{classify}}(\hat{\mathbf{x}}_{\text{cat}}), \text{cat}) $$
Here, $f_{\text{classify}}(\cdot)$ is a trained classifier (with its parameters fixed), $\mathcal{L}(f_{\text{classify}}(\hat{\mathbf{x}}_{\text{cat}}), \text{cat})$ is the confidence score for the "cat" class, and $\hat{\mathbf{x}}_{\text{cat}}$ is the generated image. By optimizing this objective, the resulting image $\hat{\mathbf{x}}_{\text{cat}}$ highlights the features that the model associates most strongly with the class cat, such as specific textures, shapes, or colors. In other words, we are asking the model to draw a cat according to its own understanding.
A well-trained model aligned with human understanding should have a visually coherent representation of a cat. Analyzing these generated images provides valuable insights into the model’s decision-making process, revealing whether it relies on meaningful attributes or potentially spurious correlations. This method not only aids in understanding and validating the model’s behavior but also helps in identifying and mitigating biases, thereby enhancing the overall transparency and reliability of the machine learning system.
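A minimal sketch of this input-optimization loop is shown below, assuming a pretrained PyTorch classifier `f_classify` and a target class index `cat_idx` (both placeholders); in practice, regularizers such as smoothness or image priors are usually added to keep the result natural-looking.

```python
import torch

def maximize_class_confidence(f_classify, cat_idx, steps=200, lr=0.05):
    """Gradient-ascent sketch: find an input that maximizes one class score."""
    for p in f_classify.parameters():
        p.requires_grad_(False)                    # keep the classifier frozen

    x_hat = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
    optimizer = torch.optim.Adam([x_hat], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        logits = f_classify(x_hat)
        loss = -logits[0, cat_idx]                 # negate: optimizers minimize
        loss.backward()
        optimizer.step()
    return x_hat.detach()
```

The returned tensor can then be visualized (after any denormalization) to inspect what the model "thinks" a cat looks like.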
Parametric Estimation: In statistics, machine learning, and system identification, parametric estimation involves identifying the unknown parameters within a model. This approach is applied to inverse problems like image restoration by modeling the restored image $x_{\text{restored}}$ as a function of the degraded image $x_{\text{low}}$ and a set of parameters $\theta$, expressed as $x_{\text{restored}} = f_{\text{gen}}(\theta, x_{\text{low}})$. Here $f_{\text{gen}}$ represents a generator function, such as a neural network or diffusion-based model, that generates the restored image from the degraded input. The objective is to determine the optimal parameters $\theta$ that best facilitate this restoration process.
$$ \argmin_{\theta}\quad \mathcal{L}_{\text{MSE}}(\overbrace{f_{\text{gen}}(\theta, x_{\text{low}})}^{x_{\text{restored}}}, x_{\text{high}}) $$
This formulation seeks the parameter set $\theta$ that minimizes the mean squared error between the generated image and the high-quality reference image.
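A minimal sketch of this parametric estimation is shown below, assuming PyTorch; a tiny convolutional network stands in for $f_{\text{gen}}$ and random tensors stand in for the degraded/high-quality image pairs.

```python
import torch
import torch.nn as nn

# Placeholder generator f_gen(theta, x_low): a tiny convolutional network.
f_gen = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1),
)
optimizer = torch.optim.Adam(f_gen.parameters(), lr=1e-3)
mse = nn.MSELoss()

# Assumed toy data: random tensors as stand-ins for (degraded, high-quality) pairs.
x_low = torch.rand(8, 3, 64, 64)
x_high = torch.rand(8, 3, 64, 64)

for step in range(100):
    optimizer.zero_grad()
    x_restored = f_gen(x_low)        # x_restored = f_gen(theta, x_low)
    loss = mse(x_restored, x_high)   # L_MSE(x_restored, x_high)
    loss.backward()                  # gradients with respect to theta
    optimizer.step()                 # update theta to reduce the error
```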
Further: In broader machine learning applications, optimization is used across many fields to fine-tune parameters, determine inputs, and set goals. Approaches such as Reinforcement Learning (RL), Diffusion Models, and Generative Adversarial Networks (GANs) all minimize errors and optimize performance toward desired outcomes. However, the optimization machinery differs: RL typically employs approximate dynamic programming, Diffusion Models rely on reversing stochastic differential equations, and GANs are grounded in game theory.
In some fortunate scenarios, particularly when the model is purely linear, the optimization problem can be solved efficiently in a single step. This means the best-fitting parameters can be computed directly through straightforward mathematical techniques, without iterative procedures. Such a direct approach is known as a closed-form solution; it is not only simpler but usually faster.
Here, we'll focus on simple univariate linear regression, which uses a straight line to model relationships between variables.
Objective: Adjust $w$ and $b$ to find the best straight line that fits the observed x-y data points $(1, 1)$, $(2, 3)$, and $(3, 3)$, using the Mean Squared Error (MSE) to measure the difference between the predicted outputs $wx+b$ and the observed outputs $y$.
Steps to Solve Using Mean Squared Error (MSE):
Define the Error Function: The error measure we will use is the Mean Squared Error (MSE), which calculates the average of the squared differences between the actual data values and the values predicted by our model. The MSE formula is:
$$ \mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^N \left(y^{(i)} - (wx^{(i)} + b)\right)^2 $$
Here, $y^{(i)}$ are the actual data values, and $wx^{(i)} + b$ are the values predicted by the line for each input $x^{(i)}$.
Calculate the Partial Derivatives: To minimize the MSE, we need to adjust $w$ and $b$ so that this function reaches its minimum. To keep things simple, we compute the derivative contribution of each sample and then sum them.
The partial derivatives of the MSE with respect to $w$ and $b$ are:
$$ \frac{\partial \mathcal{L}_{\text{MSE}}}{\partial w} = \frac{-2}{N} \sum_{i=1}^N x^{(i)}\left(y^{(i)} - (wx^{(i)} + b)\right)\\ \frac{\partial \mathcal{L}_{\text{MSE}}}{\partial b} = \frac{-2}{N} \sum_{i=1}^N \left(y^{(i)} - (wx^{(i)} + b)\right) $$
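These expressions translate directly into code. The following sketch (assuming NumPy) evaluates both partial derivatives for an arbitrary $(w, b)$ on our three data points, which is handy for checking the algebra that follows.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 3.0])

def mse_gradients(w, b):
    """Return (dL/dw, dL/db) for the mean squared error of the line w*x + b."""
    residual = y - (w * x + b)
    dw = -2.0 / len(x) * np.sum(x * residual)
    db = -2.0 / len(x) * np.sum(residual)
    return dw, db

print(mse_gradients(0.0, 0.0))  # gradients at an arbitrary starting point
```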
Let's compute these for our data points $(1,1)$, $(2,3)$, and $(3,3)$, first setting them up algebraically to solve the equations.
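As a cross-check of the algebra, the same system can be set up and solved symbolically, here with SymPy (an illustrative choice, not required by the method): take the two partial derivatives, set them to zero, and solve for $w$ and $b$.

```python
import sympy as sp

w, b = sp.symbols('w b')
data = [(1, 1), (2, 3), (3, 3)]
N = len(data)

# Mean squared error of the line w*x + b on the three data points.
mse = sp.Rational(1, N) * sum((yi - (w * xi + b))**2 for xi, yi in data)

# Closed-form solution: set both partial derivatives to zero and solve.
solution = sp.solve([sp.diff(mse, w), sp.diff(mse, b)], [w, b])
print(solution)  # -> w = 1, b = 1/3
```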