Clustering versus Classification
Load the Heart Disease data and run k-Means clustering. Let the method decide the number of clusters between 2 and 8. What is the most appropriate number of clusters, according to the method?
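To see how a method can "decide" the number of clusters, here is a minimal sketch in plain Python. It assumes the choice is made by comparing mean silhouette scores across candidate values of k, which, as far as we know, is how the k-Means widget ranks them. The toy points, the deterministic farthest-first initialization and all function names are illustrative assumptions; this is not the Heart Disease data and not the widget's actual implementation.

```python
import math

def farthest_first_centers(points, k):
    """Deterministic start: take the first point, then repeatedly add the
    point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers

def k_means(points, k, max_iter=100):
    """Plain Lloyd iterations; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters score 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Toy data (an assumption for illustration): three tight, well-separated groups.
offsets = [(-0.5, -0.5), (0.5, -0.5), (-0.5, 0.5), (0.5, 0.5), (0.0, 0.0)]
points = [(cx + dx, cy + dy) for cx, cy in [(0, 0), (10, 0), (0, 10)] for dx, dy in offsets]

# Score every k between 2 and 8 and keep the one with the highest silhouette.
scores = {k: silhouette(points, k_means(points, k)) for k in range(2, 9)}
best_k = max(scores, key=scores.get)
print(best_k)
```

On this toy data the highest score is reached at k = 3, the number of groups that was built in. On real data the scores are usually much closer together, so it is worth looking at all of them and not only at the winner.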
The k-Means widget outputs the data with an additional column named Cluster, so each data instance is labeled with the cluster it belongs to. We will now investigate whether the clusters found correspond to the two classes. We shall check this in two widgets:
- Sieve diagram, and
- Distributions widget.
Try both, set them up as appropriate and interpret what you see.
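Both widgets essentially present a cross-tabulation of cluster against class. The sketch below builds such a table by hand and compares the observed counts with the counts we would expect if cluster and class were independent, which is the comparison the Sieve diagram visualizes. The (cluster, class) pairs are made up for illustration and are not the Heart Disease results; the labels C1, C2, 0 and 1 are assumptions.

```python
from collections import Counter

# Hypothetical (cluster, class) pairs standing in for the k-Means output.
pairs = ([("C1", "0")] * 40 + [("C1", "1")] * 10 +
         [("C2", "0")] * 15 + [("C2", "1")] * 35)

observed = Counter(pairs)
cluster_totals = Counter(c for c, _ in pairs)
class_totals = Counter(y for _, y in pairs)
n = len(pairs)

# Expected count of each cell if cluster and class were independent.
expected = {(c, y): cluster_totals[c] * class_totals[y] / n
            for c in cluster_totals for y in class_totals}

# Chi-square statistic: a large value means clusters and classes are related.
chi2 = sum((observed[cell] - e) ** 2 / e for cell, e in expected.items())

for cell in sorted(expected):
    print(cell, "observed:", observed[cell], "expected:", expected[cell])
print("chi-square:", round(chi2, 2))
```

Cells where the observed count is well above the expected one indicate a cluster in which one class is over-represented.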
You can do something similar by checking whether good predictions can be made from the cluster alone. Connect k-Means to Select Columns. Set diameter narrowing as the target and Cluster as the only feature, and move all other variables (you can select them all at once!) to the ignored list. Then test the performance of the naive Bayesian classifier on this data.
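Note that this result could, in principle, be read from the Distributions widget: with Cluster as the only feature, the classifier predicts the most frequent class within each cluster, so its accuracy is roughly the share of instances that belong to the majority class of their cluster.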
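The following sketch shows the arithmetic behind reading the accuracy off a distribution of classes within clusters. The counts are hypothetical, not the Heart Disease results, and the function name predict is an assumption. The sketch also ignores cross-validation and any probability smoothing a real naive Bayesian implementation may apply, so expect the tested score to differ slightly.

```python
from collections import Counter

# Hypothetical class counts within each cluster, as one might read them
# from a distribution plot split by class.
counts = Counter({("C1", "0"): 40, ("C1", "1"): 10,
                  ("C2", "0"): 15, ("C2", "1"): 35})
n = sum(counts.values())
clusters = sorted({c for c, _ in counts})
classes = sorted({y for _, y in counts})

def predict(cluster):
    # With a single feature, P(y | cluster) is proportional to
    # P(y) * P(cluster | y) = count(cluster, y) / n,
    # so the most probable class is simply the majority class of the cluster.
    return max(classes, key=lambda y: counts[(cluster, y)])

# Accuracy: instances whose class matches their cluster's majority class.
correct = sum(counts[(c, predict(c))] for c in clusters)
accuracy = correct / n

# Baseline: always predict the overall majority class, ignoring the cluster.
baseline = max(sum(counts[(c, y)] for c in clusters) for y in classes) / n

print("accuracy from cluster alone:", accuracy)
print("majority-class baseline:", baseline)
```

Comparing the two numbers tells you how much the cluster label actually adds over always guessing the more frequent class.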