The purpose of a machine learning model is to deliver practical and effective solutions: once trained on data, the model must fulfill the client's needs. This stage is commonly known as deployment, where runtime performance is evaluated.
Note: At this stage, since the solution is already operational, formal mathematical metrics may not be available, but we typically receive feedback from clients.
We know that training aims to minimize the loss by adjusting model parameters, but training alone doesn't guarantee the most useful solution. This is because a model has many configuration choices, commonly known as hyperparameters, that also need to be set, yet cannot be set during training.
Hyperparameters: These are the parameters whose values are set before the learning process begins and that influence the model's training and performance. Examples include the number of layers, the number of hidden units per layer, and the learning rate. The reason hyperparameters cannot be adjusted during training is that the loss is not differentiable with respect to them, so gradient-based optimization cannot tune them. In algorithmic terms, it is like a LeetCode problem that can only be solved by brute force; you cannot use greedy or dynamic programming to solve it.
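To make the distinction concrete, here is a minimal, self-contained sketch using plain NumPy and toy linear-regression data. The learning rate and epoch count are hyperparameters fixed up front, while the weight w is the ordinary parameter that gradient descent adjusts; the specific values are assumptions made just for this example.

```python
import numpy as np

# Hyperparameters: chosen before training starts. The gradient updates below
# never touch them, because the loss is not differentiable with respect to them.
LEARNING_RATE = 0.1
NUM_EPOCHS = 200

# Toy data: y is roughly 2x plus noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 2.0 * x + 0.1 * rng.normal(size=100)

# Parameter: learned during training by minimizing the differentiable MSE loss.
w = 0.0
for _ in range(NUM_EPOCHS):
    grad = np.mean(2.0 * (w * x - y) * x)  # d/dw of mean((w*x - y)^2)
    w -= LEARNING_RATE * grad              # only w is updated

print(f"learned w = {w:.3f}")  # ends up close to 2.0
```

Changing LEARNING_RATE or NUM_EPOCHS changes the outcome, but only by rerunning the whole training loop with different fixed values, which is exactly the role validation plays.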
These hyperparameters affect not only the loss but also the model's performance on the task (which is related to, but not the same as, the loss). Candidate configurations of these hyperparameters are enumerated and evaluated during the validation stage.
Testing is another stage, which involves assessing model performance on fixed test data before deployment, typically without iteration, and it evaluates both the model's loss and its task performance.
The training, validation, and testing stages correspond to the use of the training dataset, validation dataset, and testing dataset, respectively.
In practice, there are typically two configurations for organizing these datasets:
1. A three-way split, where the training, validation, and testing datasets are fixed and kept separate.
2. A two-way split into training and testing datasets, where the validation data is carved out of the training dataset (for example, via cross-validation).
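As a concrete illustration of the first configuration, here is a minimal sketch of a three-way holdout split using scikit-learn's train_test_split on placeholder arrays; the 60/20/20 ratio is just an assumption for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1000 samples with 10 features each.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# First hold out the test set, then carve a validation set out of the rest.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```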
The validation stage is flexible and aims to identify the best hyperparameters (including but not limited to model structures, loss functions, and optimization algorithms). It acts as a super-loop over the training loop, with each validation iteration corresponding to a full training cycle. Usually, the goal is to improve the task-specific metric, denoted as M below, by optimizing these components.
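A hedged sketch of this super-loop, using scikit-learn's Ridge as a stand-in model, R² as a stand-in for the metric M, and a small hypothetical grid of hyperparameters: each candidate configuration triggers a full training run, and the configuration with the best validation metric is kept.

```python
from itertools import product
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Toy data and a simple train/validation split.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=500)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# Validation "super-loop": every candidate configuration gets a full training
# run; we keep the one with the best task metric M (here R², as a stand-in)
# measured on the validation set.
search_space = {"alpha": [0.01, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}
best_metric, best_config = -np.inf, None
for alpha, fit_intercept in product(search_space["alpha"], search_space["fit_intercept"]):
    model = Ridge(alpha=alpha, fit_intercept=fit_intercept).fit(X_train, y_train)  # inner training loop
    metric = r2_score(y_val, model.predict(X_val))                                 # M on validation data
    if metric > best_metric:
        best_metric, best_config = metric, {"alpha": alpha, "fit_intercept": fit_intercept}

print(best_config, round(best_metric, 4))
```

In a real project the inner call would be a full training pipeline and the metric would be whatever M the task demands; the grid-search structure stays the same.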
The need for validation arises from two limitations. First, loss functions are usually kept simple and differentiable to facilitate smooth optimization, so they may not accurately capture the complexities of the target task, potentially limiting model performance. Second, good training performance does not guarantee effectiveness on unseen data, underscoring the importance of validation.
Example: In image restoration tasks, training often uses mean squared error as the pixel-level loss function. Validation then assesses the model with perceptual metrics such as PSNR, SSIM, and MOVIE. These metrics, which are typically too complex to use as training losses, evaluate the perceptual quality of the model for a specific configuration of model structure, hyperparameters, and optimization algorithm.
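To show how such a metric relates to the training loss, here is a small self-contained sketch that computes PSNR directly from the pixel-level MSE, assuming 8-bit images with a peak value of 255 (SSIM and MOVIE are more involved and omitted here).

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB, derived from the same pixel-level MSE
    used as the training loss; higher is better."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy example: a clean 8-bit image versus a noisy "restoration".
rng = np.random.default_rng(0)
clean = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
noisy = np.clip(clean + rng.normal(0, 5, size=clean.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(clean, noisy):.2f} dB")
```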
You might wonder whether validation, by brute-forcing hyperparameter changes, is itself a form of optimization. It is, and it can lead to overfitting to, and bias toward, the validation dataset. While we can't completely eliminate this bias, we can check for its presence to some extent. Therefore, during the testing phase, we use entirely new data (called the testing dataset) that the model has never encountered, to verify that its performance is stable.
During testing, both loss functions and metrics can be used to evaluate model performance. The sketch below uses a metric function.
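This is a minimal, hedged sketch of such a testing pass, assuming a placeholder scikit-learn model and mean absolute error as the metric; the actual model, data, and metric will differ in a real project.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Placeholder data and a held-out test set the model never sees during
# training or validation.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=500)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# The model is trained (and, in practice, validated) beforehand; at test time
# we only run inference and report the metric. No parameters or
# hyperparameters are changed here.
model = Ridge(alpha=1.0).fit(X_train, y_train)
test_metric = mean_absolute_error(y_test, model.predict(X_test))
print(f"test MAE: {test_metric:.4f}")
```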
Note: During the testing stage, we do not alter any parameters or configurations. Therefore, testing typically focuses solely on assessing model performance without further optimization.