Hierarchical clustering is a form of unsupervised learning. What this means is that the data points lack any form of label and the purpose of the analysis is to generate labels for our data points. IN other words, we have no Y values in our data.
Hierarchical clustering is an agglomerative technique. This means that each data point starts as their own individual clusters and are merged over iterations. This is great for small datasets but is difficult to scale. In addition, you need to set the linkage which is used to place observations in different clusters. There are several choices (ward, complete, single, etc.) and the best choice depends on context.
In this post, we will make a hierarchical clustering analysis of the “MedExp” data from the “Ecdat” package. We are trying to identify distinct subgroups in the sample. The actual hierarchical cluster creates what is a called a dendrogram. Below is some initial code.
library(cluster);library(compareGroups);library(NbClust);library(HDclassif);library(sparcl);library(Ecdat)
data("MedExp")
str(MedExp)
## 'data.frame': 5574 obs. of 15 variables:
## $ med : num 62.1 0 27.8 290.6 0 ...
## $ lc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ idp : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
## $ lpi : num 6.91 6.91 6.91 6.91 6.11 ...
## $ fmde : num 0 0 0 0 0 0 0 0 0 0 ...
## $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ ndisease: num 13.7 13.7 13.7 13.7 13.7 ...
## $ health : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
## $ linc : num 9.53 9.53 9.53 9.53 8.54 ...
## $ lfam : num 1.39 1.39 1.39 1.39 1.1 ...
## $ educdec : num 12 12 12 12 12 12 12 12 9 9 ...
## $ age : num 43.9 17.6 15.5 44.1 14.5 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
## $ child : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
## $ black : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
Currently, for the purposes of this post. The dataset is too big. IF we try to do the analysis with over 5500 observations it will take a long time. Therefore, we will only use the first 1000 observations. In addition, We need to remove factor variables as hierarchical clustering cannot analyze factor variables. Below is the code.
MedExp_small<-MedExp[1:1000,]
MedExp_small$sex<-NULL
MedExp_small$idp<-NULL
MedExp_small$child<-NULL
MedExp_small$black<-NULL
MedExp_small$physlim<-NULL
MedExp_small$health<-NULL
We now need to scale are data. This is important because different scales will cause different variables to have more or less influence on the results. Below is the code
MedExp_small_df<-as.data.frame(scale(MedExp_small))
We now need to determine how many clusters to create. There is no rule on this but we can use statistical analysis to help us. The “NbClust” package will conduct several different analysis to provide a suggested number of clusters to create. You have to set the distance, min/max number of clusters, the method, and the index. The graphs can be understood by looking for the bend or elbow in them. At this point is the best number of clusters.
numComplete<-NbClust(MedExp_small_df,distance = 'euclidean',min.nc = 2,max.nc = 8,method = 'ward.D2',index = c('all'))
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 7 proposed 2 as the best number of clusters
## * 9 proposed 3 as the best number of clusters
## * 6 proposed 6 as the best number of clusters
## * 1 proposed 8 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
numComplete$Best.nc
## KL CH Hartigan CCC Scott Marriot
## Number_clusters 2.0000 2.0000 6.0000 8.0000 3.000 3.000000e+00
## Value_Index 2.9814 292.0974 56.9262 28.4817 1800.873 4.127267e+24
## TrCovW TraceW Friedman Rubin Cindex DB
## Number_clusters 6.0 6.0000 3.0000 6.0000 2.000 3.0000
## Value_Index 166569.3 265.6967 5.3929 -0.0913 0.112 1.0987
## Silhouette Duda PseudoT2 Beale Ratkowsky Ball
## Number_clusters 2.0000 2.0000 2.0000 2.0000 6.0000 3.000
## Value_Index 0.2809 0.9567 16.1209 0.2712 0.2707 1435.833
## PtBiserial Frey McClain Dunn Hubert SDindex Dindex
## Number_clusters 6.0000 1 3.000 3.0000 0 3.0000 0
## Value_Index 0.4102 NA 0.622 0.1779 0 1.9507 0
## SDbw
## Number_clusters 3.0000
## Value_Index 0.5195
Simple majority indicates that three clusters is most appropriate. However, four clusters are probably just as good. Every time you do the analysis you will get slightly different results unless you set the seed.
To make our actual clusters we need to calculate the distances between clusters using the “dist” function while also specifying the way to calculate it. We will calculate distance using the “Euclidean” method. Then we will take the distance’s information and make the actual clustering using the ‘hclust’ function. Below is the code.
distance<-dist(MedExp_small_df,method = 'euclidean')
hiclust<-hclust(distance,method = 'ward.D2')
We can now plot the results. We will plot “hiclust” and set hang to -1 so this will place the observations at the bottom of the plot. Next, we use the “cutree” function to identify 4 clusters and store this in the “comp” variable. Lastly, we use the “ColorDendrogram” function to highlight are actual clusters.
plot(hiclust,hang=-1, labels=F)
comp<-cutree(hiclust,4) ColorDendrogram(hiclust,y=comp,branchlength = 100)
We can also create some descriptive stats such as the number of observations per cluster.
table(comp)
## comp
## 1 2 3 4
## 439 203 357 1
We can also make a table that looks at the descriptive stats by cluster by using the “aggregate” function.
aggregate(MedExp_small_df,list(comp),mean)
## Group.1 med lc lpi fmde ndisease
## 1 1 0.01355537 -0.7644175 0.2721403 -0.7498859 0.048977122
## 2 2 -0.06470294 -0.5358340 -1.7100649 -0.6703288 -0.105004408
## 3 3 -0.06018129 1.2405612 0.6362697 1.3001820 -0.002099968
## 4 4 28.66860936 1.4732183 0.5252898 1.1117244 0.564626907
## linc lfam educdec age
## 1 0.12531718 -0.08861109 0.1149516 0.12754008
## 2 -0.44435225 0.22404456 -0.3767211 -0.22681535
## 3 0.09804031 -0.01182114 0.0700381 -0.02765987
## 4 0.18887531 -2.36063161 1.0070155 -0.07200553
Cluster 1 is the most educated (‘educdec’). Cluster 2 stands out as having higher medical cost (‘med’), chronic disease (‘ndisease’) and age. Cluster 3 had the lowest annual incentive payment (‘lpi’). Cluster 4 had the highest coinsurance rate (‘lc’). You can make boxplots of each of the stats above. Below is just an example of age by cluster.
MedExp_small_df$cluster<-comp
boxplot(age~cluster,MedExp_small_df)
Conclusion
Hierarchical clustering is one way in which to provide labels for data that does not have labels. The main challenge is determining how many clusters to create. However, this can be dealt with through using recommendations that come from various functions in R.