In cluster analysis, and in particular in efforts to improve the K-means algorithm, the concept of a soft classification WCSS (Within-Cluster Sum of Squares) loss has been proposed. Unlike the hard classification of the traditional K-means algorithm (where each point strictly belongs to one cluster), soft classification allows a data point to be assigned to multiple clusters at once, most commonly through distance-based weights.

Definition of soft classification WCSS loss: Under soft classification, the WCSS loss takes into account the distance of each data point to every cluster centroid. A data point's contribution to each cluster is weighted according to its distance from that cluster's centroid. The loss function can be expressed as:

$$ L_{\text{WCSS}}(\mu_1, \cdots, \mu_K) = \sum_{k=1}^{K} \sum_{x^{(i)}} w_{k}(x^{(i)}) \| x^{(i)} - \mu_k \|^2 $$

Here, $w_{k}(x^{(i)})$ is the weight of data point $x^{(i)}$ for cluster $k$. A common choice is a normalized Gaussian kernel of the distance, i.e. a softmax over the negative squared distances:

$$ w_{k}(x^{(i)}) = \frac{e^{-\frac{\| x^{(i)} - \mu_k \|^2}{2\sigma^2}}}{\sum_{j=1}^{K} e^{-\frac{\| x^{(i)} - \mu_j \|^2}{2\sigma^2}}} $$

Here, $\mu_k$ is the centroid of cluster $k$, and $\sigma$ is the standard deviation of the Gaussian function (assumed constant), which determines how quickly the weight decays with distance. Centroids closer to a data point therefore receive a larger share of its weight, while more distant ones receive less.
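As a minimal sketch of these two formulas (not code from the original article), the JAX snippet below computes the Gaussian softmax weights and the resulting soft WCSS loss; the function name `soft_wcss_loss`, the array shapes, and the fixed `sigma` are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def soft_wcss_loss(centroids, X, sigma=1.0):
    """Soft classification WCSS loss with Gaussian softmax weights.

    centroids: (K, D) array of cluster centroids mu_1..mu_K
    X:         (N, D) array of data points x^(i)
    sigma:     fixed standard deviation of the Gaussian weighting
    """
    # Squared Euclidean distances ||x^(i) - mu_k||^2, shape (N, K)
    sq_dist = jnp.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=-1)

    # Gaussian weights w_k(x^(i)), normalized over the K clusters (a softmax)
    weights = jax.nn.softmax(-sq_dist / (2.0 * sigma**2), axis=1)

    # Weighted within-cluster sum of squares
    return jnp.sum(weights * sq_dist)
```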

Gradient descent solution: When optimizing the soft classification WCSS loss with gradient descent, we focus on the gradient of the loss function with respect to each cluster centroid $\mu_k$ and update the centroid's position accordingly. Because the Gaussian-based weights themselves depend on the centroids, the gradient calculation becomes more involved: each data point's contribution to each centroid's gradient must account for the normalization of the weights across clusters.

Specifically, we first define the loss function $L_{\text{WCSS}}(\mu_1, \mu_2, \mu_3)$: compute the Euclidean distance of each data point to each centroid and convert these distances into Gaussian weights. These weights change dynamically with the distances, so centroids closer to a data point receive a larger share of its weight and more distant ones a smaller share. We then compute the gradient of the loss with respect to each cluster centroid; this gradient indicates how moving the centroid changes the overall WCSS loss.
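Continuing the sketch above, one convenient way to obtain this gradient, since the weights themselves depend on the centroids, is automatic differentiation; this is only an illustration and not necessarily how the original experiment was implemented:

```python
import jax

# dL/d(mu_k) for all k at once; the result has the same (K, D) shape
# as the `centroids` argument of soft_wcss_loss defined above.
wcss_grad = jax.grad(soft_wcss_loss, argnums=0)
```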

In each iteration, we update the position of cluster centroid $\mu_k$ using the computed gradient, $\mu_k \leftarrow \mu_k - \eta \, \partial L / \partial \mu_k$, where $\eta$ is the learning rate. By gradually adjusting each centroid in this way, the overall WCSS loss of the dataset, $\min_\mu L_{\text{WCSS}}(\mu)$, is effectively reduced.
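A minimal descent loop built on the same sketch might look as follows; the learning rate, iteration count, and random initialization are arbitrary choices for illustration, not values from the article:

```python
import jax
import jax.numpy as jnp

def fit_soft_wcss(X, K=3, sigma=1.0, lr=0.05, n_iters=200, seed=0):
    """Minimize the soft WCSS loss by plain gradient descent on the centroids."""
    X = jnp.asarray(X)
    key = jax.random.PRNGKey(seed)

    # Initialize the K centroids at randomly chosen data points.
    init_idx = jax.random.choice(key, X.shape[0], shape=(K,), replace=False)
    centroids = X[init_idx]

    for _ in range(n_iters):
        # mu_k <- mu_k - lr * dL/d(mu_k), using the gradient from wcss_grad above
        centroids = centroids - lr * wcss_grad(centroids, X, sigma)
    return centroids
```

Because the gradient is taken through the softmax weights, each step moves every centroid according to all data points, weighted by their soft assignments, rather than only the points hard-assigned to it as in K-means.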

By directly optimizing the loss $L_{\text{WCSS}}$ through gradient descent, we obtain the result shown below: gradient descent moves the centroids toward their respective clusters, achieving a clustering of the three clusters.

(Figure: gradient descent moves the three centroids toward their respective clusters.)

In cluster analysis, although optimizing the soft classification WCSS loss with Gaussian weights offers a new perspective, its purpose is not to replace traditional clustering algorithms such as K-means. In fact, repeated tests found that the K-means algorithm often outperforms the Gaussian-weighted soft classification WCSS optimization, and does so at a lower computational cost.

The significance of this method lies in demonstrating the possibilities and diversity of neural network-style optimization, beyond any specific application scenario. The main challenges of such optimization are twofold: first, constructing an end-to-end, optimizable loss function so that parameters can be adjusted effectively throughout optimization to reduce the loss; second, creating an effective nonlinear mapping so that the model can capture and learn complex patterns and relationships in the data.

Through the example of the Gaussian-weighted soft classification WCSS loss, we can see that even for problems where traditional algorithms already perform well, introducing new optimization methods and loss functions can still broaden our understanding of the problem and open up richer solutions. This spirit of exploration and this methodology are crucial in neural network and machine learning research, driving continuous progress and innovation in algorithms and models.