Validation and Testing and Deployment

Although, training minimizes loss by adjusting model parameters $\theta$, training doesn't ensure the best solutions because hyperparameters, which significantly influence the model, must be set separately and cannot be adjusted during training.

Hyperparameters: such as the number of layers, the number of hidden units per layer, the learning rate, and the number of training iterations, differ fundamentally from trainable parameters in that they are not subject to optimization during training, as they are not differentiable. Instead, hyperparameter tuning must be performed through external optimization techniques, such as grid search or Bayesian optimization, to identify configurations that enhance model performance.

Note: Hyperparameters influence not only the loss but also overall task-specific performance (usually known as metrics).

Validation

Validation takes place during training, typically after each epoch or a few iterations, to evaluate model performance and inform hyperparameter tuning. Hyperparameter optimization, which involves techniques like enumeration, operates as a super-loop over the training process, with each iteration representing a complete training cycle.

The goal is to optimize task-specific metrics by selecting the best model structures, loss functions, and optimization algorithms.

Untitled

Validation is essential due to (1) the limitations of loss functions and (2) the risk of overfitting.

Loss functions, designed to be simple and differentiable for optimization, often fail to capture the complexities of the target task, potentially hindering model performance. Moreover, strong training results do not ensure effectiveness on unseen data, highlighting the critical role of validation. Therefore, during the validation stage, we prioritize task-specific metrics when available, while also monitoring the relationship between training loss and validation loss to detect overfitting.

Example: In image restoration, training uses mean square error as the pixel-level loss, while validation employs perceptual metrics like PSNR and SSIM to assess model quality based on structure, hyperparameters, and optimization algorithms.
Overfitting occurs when a model learns patterns that are specific to the training data but fail to generalize to new, unseen data. This typically manifests as a significant gap between training and validation performance, where the training loss continues to decrease while validation loss stabilizes or increases. To mitigate overfitting, techniques such as early stopping (via validation), dropout, regularization, and data augmentation are commonly employed.

Example: In face detection tasks, a transformer model initially shows decreasing training and validation losses, indicating effective learning. However, when training loss keeps dropping while validation loss rises, overfitting occurs. Early stopping is used to monitor validation loss and halt training when no improvement is seen, preserving the model before overfitting.

Testing

Testing-stage evaluation involves assessing model performance on fixed test data before deployment, typically without iteration, to evaluate both model loss and task performance. To avoid overfitting and bias from hyperparameter tuning during validation, the testing phase uses entirely new data (the testing dataset) that the model has never encountered, ensuring stable performance.

embed (37).svg

During testing, metrics or loss functions evaluate model performance, with this example using a metric function.

Note: Testing does not modify parameters or configurations and focuses solely on assessing performance without further optimization.

Deployment or Runtime

Deployment, or the runtime phase, is where a machine learning model is applied to real-world tasks to deliver practical solutions. Unlike the controlled environments of training and testing, deployment involves dynamic and often unpredictable conditions. Tensorflow Serving, & TorchServe.

Deploying deep learning models comes with unique challenges:

Real-World Data Variability: Models face data distributions that may differ significantly from training and testing datasets, leading to potential performance degradation. Concept (Distribution) Drift.
Resource Constraints: Efficient execution within limits of memory, processing power, and latency is critical, especially in edge or mobile deployments. TinyML, and ONNX.