Exploration of linkage functions
In this task, you will load the data, compute the distances, run hierarchical clustering and observe the resulting clusters in the scatterplot. (Hint: these are four widgets. Put them in a chain, one after another.)
In Distances, use Euclidean distance with normalization enabled.
Also connect a second Scatter Plot widget directly to the File widget so that you can observe the data before it is clustered.
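For those who like to see the same chain outside the widgets, here is a minimal sketch of the four steps in Python with NumPy and SciPy. It is not how the widgets are implemented. The data are hypothetical synthetic blobs standing in for a loaded file, and standardizing each column is an assumption about what "normalization" does; the widget may scale the data differently.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# Hypothetical stand-in for a loaded data file: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
])

# Distances step: rescale each column, then compute Euclidean distances.
# (Standardization is an assumption; the widget's normalization may differ.)
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
distances = pdist(X_norm, metric="euclidean")

# Hierarchical clustering step: build the tree with a chosen linkage.
# Other options for method: "single", "complete", "ward".
tree = linkage(distances, method="average")

# "Top N" step: cut the tree into (at most) two flat clusters.
labels = fcluster(tree, t=2, criterion="maxclust")

# Cluster sizes; in the widgets you would colour the scatter plot instead.
print(np.bincount(labels)[1:])
```

Changing the `method` argument plays the role of switching the linkage in the Hierarchical Clustering widget.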
Clust 1
- Load "clust1.tab". Observe it in the scatter plot. How many groups of points do you see?
- Now think about the linkage functions: single, complete, average and Ward. Only one of them should be able to correctly split the data into these two clusters. Which?
- Open Hierarchical clustering and Scatter plot, preferably so that you see both at once. Zoom the hierarchical clustering all the way out. Set the color of Scatter plot to Cluster.
- Try different linkage functions. In the dendrogram of the clustering, set a threshold that creates two clusters. Which linkage function gives you the expected clusters? Did you guess correctly? (Hint: instead of manually setting the threshold, you can choose "Top N:" and set the number of clusters to 2.)
- For the linkage functions that don't work correctly, perhaps you can split the data into 3, 4 or even 5 clusters that represent parts of the two actual clusters. Does any of them create clusters such that none contains points from both curves?
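The last question above can also be checked mechanically. The sketch below assumes you have, for every point, both its cluster label and the curve it really belongs to; the data file may not contain such a column, so the `curves` list here is purely hypothetical, as is the helper name `clusters_are_pure`.

```python
import numpy as np

def clusters_are_pure(cluster_labels, true_groups):
    """Return True if no cluster contains points from more than one group."""
    cluster_labels = np.asarray(cluster_labels)
    true_groups = np.asarray(true_groups)
    for c in np.unique(cluster_labels):
        groups_in_cluster = np.unique(true_groups[cluster_labels == c])
        if len(groups_in_cluster) > 1:
            return False
    return True

# Hypothetical example: six points lying on two curves (0 and 1).
curves = [0, 0, 0, 1, 1, 1]

# Three clusters, each inside a single curve.
print(clusters_are_pure([1, 1, 2, 3, 3, 3], curves))  # True

# Three clusters, but cluster 2 straddles both curves.
print(clusters_are_pure([1, 1, 2, 2, 3, 3], curves))  # False
```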
Clust 2
- Load "clust2.tab". Observe it in the scatter plot. The number of clusters is obvious.
- There is one linkage function that won't find the correct clusters here. Can you guess which one it is?
- Go on and try - split into two clusters.
- For the linkage that failed to identify two clusters, try with three clusters. It works. Why?
Clust 3
- Load "clust3.tab". Observe it in the scatter plot. Why is this data more difficult than clust2?
- Set the number of clusters to 2. Why does single linkage fail so amazingly?
- Which linkage actually finds the correct two clusters?
On this data, one could assume that single and complete linkage will fail, but the behaviour of the others is more difficult to predict.
Clust 4
- Load "clust4.tab". Observe it in the scatter plot. How many clusters do you see?
- The answer to the above could be three. Which linkage actually finds these three clusters?
As with clust3.tab, this behaviour is not necessarily obvious, but can be somewhat expected. This linkage tends to find dense clusters. Also, its behaviour is similar to k-means (they optimize the same criterion, squared distances from centroids), but Ward is more likely to correctly find clusters of different sizes.
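To make the shared criterion concrete, here is a small sketch that computes the sum of squared distances from centroids and checks that merging two clusters increases it by |A||B| / (|A| + |B|) times the squared distance between their centroids, which is the quantity Ward linkage minimizes at each merge. The function names and the tiny data set are made up for illustration.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        members = X[labels == c]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def ward_merge_cost(A, B):
    """Increase in the criterion caused by merging clusters A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    gap = A.mean(axis=0) - B.mean(axis=0)
    return len(A) * len(B) / (len(A) + len(B)) * (gap ** 2).sum()

# Hypothetical clusters: two points on the left, three on the right.
A = np.array([[0.0, 0.0], [2.0, 0.0]])
B = np.array([[10.0, 0.0], [12.0, 0.0], [14.0, 0.0]])
X = np.vstack([A, B])

before = within_cluster_ss(X, [0, 0, 1, 1, 1])  # A and B kept separate
after = within_cluster_ss(X, [0, 0, 0, 0, 0])   # A and B merged

# The two printed numbers agree (both are 145.2, up to rounding).
print(after - before, ward_merge_cost(A, B))
```

k-means tries to minimize the same sum for a fixed number of clusters by moving centroids, while Ward builds clusters bottom-up by always choosing the cheapest merge.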
If you have time...
Observe the performance of k-Means on these data sets, both in terms of the suggested number of clusters and in how it groups the points.
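If you would like to experiment outside the widget, the sketch below runs a bare-bones k-means for several values of k and picks the k with the highest average silhouette score. Treat everything here as an assumption: the k-Means widget may use different initialization, restarts and scoring, the farthest-point initialization is just a simple deterministic choice, and the three blobs are synthetic stand-ins for the course data.

```python
import numpy as np

def farthest_point_init(X, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far (deterministic heuristic)."""
    chosen = [0]
    for _ in range(k - 1):
        d = ((X[:, None, :] - X[chosen][None, :, :]) ** 2).sum(axis=2).min(axis=1)
        chosen.append(int(d.argmax()))
    return X[chosen].copy()

def kmeans(X, k, n_iter=100):
    """Plain Lloyd iterations; returns a cluster index for every point."""
    centroids = farthest_point_init(X, k)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                updated[j] = X[labels == j].mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return labels

def silhouette(X, labels):
    """Average silhouette: (b - a) / max(a, b) per point, where a is the mean
    distance to the point's own cluster and b to the nearest other cluster."""
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        if own.sum() == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        a = D[i, own].sum() / (own.sum() - 1)
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Hypothetical data: three compact, well-separated blobs.
rng = np.random.default_rng(1)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])

scores = {k: silhouette(X, kmeans(X, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

On data shaped like the earlier exercises, compare the suggested k and the grouping with what the linkage functions gave you.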