The essence of building a neural network lies in its training process, which adjusts the model parameters against a training dataset so that the output meets certain criteria.
- In supervised learning, the goal is to reduce the disparity between the model's outputs and the given labels.
- In unsupervised learning, the focus is on making outputs adhere to a specific probability distribution (as in clustering) or reducing information loss relative to the inputs (as in dimensionality reduction).
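In the supervised case, "disparity between the model's outputs and the given labels" is made concrete by a loss function. A minimal sketch (an assumed example, using mean squared error):

```python
def mse(predictions, labels):
    """Mean squared error: average squared gap between predictions and labels."""
    n = len(predictions)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / n

# A model whose predictions sit closer to the labels yields a smaller loss.
predictions = [2.5, 0.0, 2.1]
labels = [3.0, -0.5, 2.0]
loss = mse(predictions, labels)  # roughly 0.17 for these values
```

Training then amounts to driving this number down by adjusting the model's parameters.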
The Three Pillars of Machine Learning
The primary aim of training is to find the optimal model $f$ so that the prediction $\hat{y}$ is as close to the desired output as possible. This explanation focuses on the training aspect of supervised learning. The three core elements of machine learning are:
- Machine Learning Models: The architecture of machine learning models, denoted by $f$ and characterized by parameters $\theta$, varies based on the application. These models take input data $x$ and produce predictions $\hat{y}$.
- Loss Functions: The loss function $L$ computes the loss value $J$, which quantifies the difference between the predicted output $\hat{y}$ and the true label $y$. A lower value of $J$ indicates higher accuracy for the model $f$.
- Optimization Algorithm: This algorithm, symbolized by $\nabla$, adjusts the parameters of the model to minimize the loss value $J$, thereby enhancing the performance of the model.
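The three pillars can be sketched for a one-dimensional linear model (an assumed toy example; the parameter names and learning rate are arbitrary choices):

```python
# Pillar 1: the model f, characterized by parameters theta = (w, b).
def f(x, theta):
    w, b = theta
    return w * x + b

# Pillar 2: the loss function L, here squared error between prediction and label.
def L(y_hat, y):
    return (y_hat - y) ** 2

# Pillar 3: the optimization algorithm, here one gradient-descent step on theta.
def gradient_step(theta, x, y, eta=0.1):
    w, b = theta
    y_hat = f(x, theta)
    # By the chain rule: dJ/dw = 2*(y_hat - y)*x, dJ/db = 2*(y_hat - y).
    grad_w = 2 * (y_hat - y) * x
    grad_b = 2 * (y_hat - y)
    return (w - eta * grad_w, b - eta * grad_b)

theta = (0.0, 0.0)
x, y = 1.0, 2.0
loss_before = L(f(x, theta), y)   # 4.0 at the initial parameters
theta = gradient_step(theta, x, y)
loss_after = L(f(x, theta), y)    # smaller after one update
```

Each pillar is independent: the same optimizer works with a different model or loss, which is why the three are usually treated as separate components.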
Training Process in Machine Learning
The training process in machine learning is a systematic approach to refining the model $f$ so that its predictions $\hat{y}$ closely match the actual labels $y$. This process can be broken down into several key steps:
- Initialize Parameters: Start with initial guesses for the parameters of the model $f$. These parameters might be set randomly or according to a specific initialization rule.
- Forward Propagation: For each training data point, input $x$, compute the predicted output $\hat{y} = f(x)$. This step involves passing the data forward through the model (e.g., through the layers of a neural network).
- Calculate Loss: Compute the loss $J = L(\hat{y}, y)$, which measures the discrepancy between the predicted output $\hat{y}$ and the actual label $y$. The loss function $L$ quantifies how well the model is performing; the lower the loss, the better the model's predictions.
- Backward Propagation: Calculate the gradient of the loss function with respect to each parameter of the model. This involves applying the chain rule to find $\nabla_{\theta} J$, where $\theta$ represents the parameters of $f$.
- Update Parameters: Adjust the parameters $\theta$ of the model using the gradients computed in the previous step. This is done using an optimization algorithm, such as gradient descent, where $\theta \leftarrow \theta - \eta \nabla_{\theta} J$. Here, $\eta$ is the learning rate, a small positive scalar determining the step size.
- Iterate: Repeat steps 2 through 5 until a set number of iterations is reached or the change in loss falls below a predetermined threshold. Each complete pass through all training data is called an epoch.
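The steps above can be sketched as a minimal training loop for a one-dimensional linear model (an assumed example; the dataset, learning rate, and epoch count are arbitrary choices):

```python
# Tiny dataset generated from y = 2x + 1; the loop should recover w=2, b=1.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]

# Step 1: initialize parameters (here simply to zero).
w, b = 0.0, 0.0
eta = 0.05  # learning rate

for epoch in range(200):  # Step 6: iterate for a fixed number of epochs
    for x, y in data:
        # Step 2: forward propagation.
        y_hat = w * x + b
        # Step 3: calculate the loss J = (y_hat - y)^2 (squared error).
        J = (y_hat - y) ** 2
        # Step 4: backward propagation via the chain rule.
        grad_w = 2 * (y_hat - y) * x
        grad_b = 2 * (y_hat - y)
        # Step 5: update parameters: theta <- theta - eta * grad.
        w -= eta * grad_w
        b -= eta * grad_b

# After training, (w, b) is close to the generating values (2, 1).
```

Real frameworks automate steps 2–5 (automatic differentiation, batching, optimizers), but the control flow is the same loop shown here.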
Overall, the model $f$ is trained to minimize the loss function $J$, leading to improved accuracy in predicting the output $\hat{y}$ for given inputs $x$.
Note: During training, the model structure, hyperparameters, loss function, and optimization algorithm are typically set beforehand and remain constant throughout the training phase.