In the previous courses, we systematically studied the basic principles and applications of loss functions in regression and classification problems. In supervised learning, where every input sample comes with an explicit label, these loss functions play a crucial role. In real applications, however, we often encounter situations where the label information is incomplete (i.e., weak labels), meaning the association between input samples and their corresponding outputs is not explicit. For example, we might acquire a dataset whose inputs are a large number of male voice clips and whose outputs are female voice clips. Such datasets often come from the internet, where a direct, one-to-one correspondence between inputs and outputs is lacking.

To address these issues, different technical methods are required. A common approach is to leverage existing speech recognition technology to extract phrases and phonemes from the speech and attempt a partial alignment. While this method is feasible to an extent, the limitations of sentence-level matching mean the aligned dataset can be relatively small. Moreover, because pronunciation varies across speakers in both timing and voice characteristics, the correspondence at the phoneme level is often unstable, which can lead to clarity problems in the audio produced by the final trained model.

In contrast to the traditional per-sample error computation, we prefer a broader approach: assess how successfully the voice has been transformed from male to female as a whole, and feed that assessment back to the neural network. This is known as the distribution comparison approach. Suppose our input data is $\mathbf{x}$ and the output data is $\mathbf{y}$. We aim to construct a function $f_\theta(\mathbf{x})$ and directly measure the difference between the distribution of its outputs and the distribution of $\mathbf{y}$, then use this difference to update the neural network's parameters $\theta$.
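One compact way to write this goal down (the divergence symbol $\mathcal{D}$ below is generic notation introduced here, not something defined earlier in these notes) is:

$$
\theta^{*} \;=\; \arg\min_{\theta}\; \mathcal{D}\!\big(P_{f_\theta(\mathbf{x})},\, P_{\mathbf{y}}\big),
$$

where $P_{f_\theta(\mathbf{x})}$ denotes the distribution of the transformed inputs and $P_{\mathbf{y}}$ the distribution of the target outputs. The rest of this section discusses concrete choices of $\mathcal{D}$.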

Maximum Mean Discrepancy

In this field, the Maximum Mean Discrepancy (MMD) algorithm is an important tool. MMD is an effective method for comparing probability distributions, and it is particularly suitable for complex distributions that do not follow a standard parametric form. It measures the difference between two distributions by comparing their means in a Reproducing Kernel Hilbert Space (RKHS). Specifically, the working principle of the MMD algorithm can be summarized in the following steps:

  1. First, select an appropriate kernel function to map the data points of the two distributions A and B to a high-dimensional feature space.
  2. Then, compute the mean of the mapped data points for each distribution in that high-dimensional space.
  3. Finally, MMD uses the difference (distance) between these two means as the measure of dissimilarity between the two distributions: the smaller the MMD value, the more similar the distributions; the larger the value, the greater their difference. A concrete estimator and a minimal code sketch follow this list.
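As a concrete reference, with a kernel $k$ and samples $\{a_i\}_{i=1}^m$ from A and $\{b_j\}_{j=1}^n$ from B, the standard (biased) empirical estimate of the squared MMD is

$$
\widehat{\mathrm{MMD}}^2(A,B) \;=\; \frac{1}{m^2}\sum_{i,i'} k(a_i, a_{i'}) \;+\; \frac{1}{n^2}\sum_{j,j'} k(b_j, b_{j'}) \;-\; \frac{2}{mn}\sum_{i,j} k(a_i, b_j).
$$

Below is a minimal PyTorch-style sketch of this estimator; the Gaussian (RBF) kernel and the fixed bandwidth are illustrative assumptions rather than choices made in the text above.

```python
import torch

def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between two batches of samples.

    x: (m, d) tensor, y: (n, d) tensor -> (m, n) kernel matrix.
    """
    sq_dists = torch.cdist(x, y) ** 2              # pairwise squared distances
    return torch.exp(-sq_dists / (2.0 * bandwidth ** 2))

def mmd_loss(x, y, bandwidth=1.0):
    """Biased empirical estimate of the squared MMD between samples x and y."""
    k_xx = rbf_kernel(x, x, bandwidth).mean()      # mean similarity within x
    k_yy = rbf_kernel(y, y, bandwidth).mean()      # mean similarity within y
    k_xy = rbf_kernel(x, y, bandwidth).mean()      # mean cross-similarity
    return k_xx + k_yy - 2.0 * k_xy

# Quick check: samples from two clearly different 1-D distributions should yield
# a noticeably larger value than two independent samples from the same one.
a = torch.randn(256, 1)
b = torch.cat([torch.randn(128, 1) - 3.0, torch.randn(128, 1) + 3.0])
print(mmd_loss(a, b).item(), mmd_loss(a, torch.randn(256, 1)).item())
```

Because the estimate is differentiable, it can be plugged directly into a training loop as a loss, which is exactly how it is used in the example below.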

Example: take two data distributions A and B, where A is a unimodal distribution and B is a multimodal distribution. The histograms of the two distributions are depicted below. We can use a neural network $f_\theta$ to transform distribution A so that $f_\theta(A)$ approaches distribution B, with MMD measuring the difference between $f_\theta(A)$ and B. If $f_\theta(A)$ differs significantly from distribution B, the loss produced by MMD is fed back to the parameters $\theta$, thereby adjusting $f_\theta(A)$ to better fit the target distribution.

[Figure: histograms of the unimodal distribution A and the multimodal distribution B]
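As a rough illustration of how the MMD loss drives $f_\theta$, here is a sketch under assumptions: the toy 1-D distributions, network size, optimizer settings, and batch size are all made up for illustration, and `mmd_loss` refers to the estimator sketched earlier.

```python
import torch
import torch.nn as nn

def sample_A(n):
    """Toy unimodal input distribution (stand-in for distribution A)."""
    return torch.randn(n, 1)

def sample_B(n):
    """Toy trimodal target distribution (stand-in for distribution B)."""
    centers = torch.tensor([-4.0, 0.0, 4.0])
    idx = torch.randint(0, 3, (n,))
    return (centers[idx] + 0.5 * torch.randn(n)).unsqueeze(1)

# f_theta: a small network that transforms samples of A toward B.
f_theta = nn.Sequential(nn.Linear(1, 64), nn.ReLU(),
                        nn.Linear(64, 64), nn.ReLU(),
                        nn.Linear(64, 1))
optimizer = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

for step in range(2000):
    a, b = sample_A(256), sample_B(256)
    loss = mmd_loss(f_theta(a), b)   # distribution-level comparison; no paired samples needed
    optimizer.zero_grad()
    loss.backward()                  # the MMD discrepancy is fed back into theta
    optimizer.step()
```

Note that nothing in this loop requires a sample from A to be paired with a specific sample from B; only minibatches drawn from each distribution are needed.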

In preliminary experiments on this example dataset, the results indicate that, driven by the feedback from the MMD loss, the neural network gradually fits the multimodal distribution. Although the final fit is not exact, the three-peak pattern clearly becomes more pronounced.

[Figures: evolution of $f_\theta(A)$ toward the trimodal target distribution over the course of training]

Analysis of the Reasons for Insufficient MMD Fitting

When applying the MMD loss, we observed that the algorithm fails to perfectly fit the target trimodal distribution in some cases. MMD focuses on the difference between the means of the two distributions after mapping them into the Reproducing Kernel Hilbert Space (RKHS); in our example, matching these means may roughly capture the locations of the three main peaks. However, closeness of the means in the RKHS does not necessarily imply that the distributions have similar shapes (e.g., the shapes of the peaks) in the original space. For example, even if the means are close in the high-dimensional feature space, one distribution might be more concentrated and the other more dispersed in the original space. Hence, while MMD can reduce the discrepancy between two distributions to a certain extent, it might not fully reproduce all characteristics of the target distribution.
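A small illustration of this point, using the simplest possible kernel purely for the sake of the argument: with a linear kernel $k(\mathbf{u}, \mathbf{v}) = \mathbf{u}^\top \mathbf{v}$, the RKHS mean embedding reduces to the ordinary mean, so

$$
\mathrm{MMD}_{\text{linear}}(P, Q) \;=\; \big\lVert\, \mathbb{E}_{P}[\mathbf{u}] - \mathbb{E}_{Q}[\mathbf{u}] \,\big\rVert,
$$

which is exactly zero for, say, $\mathcal{N}(0, 1)$ versus $\mathcal{N}(0, 4)$, even though the two shapes are very different. A richer kernel such as the Gaussian kernel does detect this difference in principle, but how strongly it penalizes shape mismatches in practice depends on the kernel and bandwidth choice.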

Therefore, when dealing with more complex structures or distributions with richer features, we may need more advanced or better-suited methods to complement or replace MMD, in order to capture and reflect the subtle differences between distributions more accurately.

One important method in this context is adversarial learning, in which a neural network known as a discriminator judges whether the mapped data looks as if it were drawn from the desired distribution.
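In its most common (GAN-style) form, the discriminator $D_\phi$ and the mapping $f_\theta$ are trained against each other; the notation here is introduced for illustration and is not taken from the text above:

$$
\min_{\theta}\,\max_{\phi}\;\; \mathbb{E}_{\mathbf{y}}\big[\log D_\phi(\mathbf{y})\big] \;+\; \mathbb{E}_{\mathbf{x}}\big[\log\big(1 - D_\phi(f_\theta(\mathbf{x}))\big)\big],
$$

so the discriminator learns to tell real target samples $\mathbf{y}$ apart from transformed samples $f_\theta(\mathbf{x})$, while $f_\theta$ learns to make the two indistinguishable.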