Hierarchical Clustering in R

Hierarchical clustering is a form of unsupervised learning. What this means is that the data points lack any form of label and the purpose of the analysis is to generate labels for our data points. IN other words, we have no Y values in our data.

Hierarchical clustering is an agglomerative technique. This means that each data point starts as their own individual clusters and are merged over iterations. This is great for small datasets but is difficult to scale. In addition, you need to set the linkage which is used to place observations in different clusters. There are several choices (ward, complete, single, etc.) and the best choice depends on context.

In this post, we will make a hierarchical clustering analysis of the “MedExp” data from the “Ecdat” package. We are trying to identify distinct subgroups in the sample. The actual hierarchical cluster creates what is a called a dendrogram. Below is some initial code.

library(cluster);library(compareGroups);library(NbClust);library(HDclassif);library(sparcl);library(Ecdat)
data("MedExp")
str(MedExp)
## 'data.frame':    5574 obs. of  15 variables:
##  $ med     : num  62.1 0 27.8 290.6 0 ...
##  $ lc      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idp     : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
##  $ lpi     : num  6.91 6.91 6.91 6.91 6.11 ...
##  $ fmde    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ndisease: num  13.7 13.7 13.7 13.7 13.7 ...
##  $ health  : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
##  $ linc    : num  9.53 9.53 9.53 9.53 8.54 ...
##  $ lfam    : num  1.39 1.39 1.39 1.39 1.1 ...
##  $ educdec : num  12 12 12 12 12 12 12 12 9 9 ...
##  $ age     : num  43.9 17.6 15.5 44.1 14.5 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
##  $ child   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
##  $ black   : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

Currently, for the purposes of this post. The dataset is too big. IF we try to do the analysis with over 5500 observations it will take a long time. Therefore, we will only use the first 1000 observations. In addition, We need to remove factor variables as hierarchical clustering cannot analyze factor variables. Below is the code.

MedExp_small<-MedExp[1:1000,]
MedExp_small$sex<-NULL
MedExp_small$idp<-NULL
MedExp_small$child<-NULL
MedExp_small$black<-NULL
MedExp_small$physlim<-NULL
MedExp_small$health<-NULL

We now need to scale are data. This is important because different scales will cause different variables to have more or less influence on the results. Below is the code

MedExp_small_df<-as.data.frame(scale(MedExp_small))

We now need to determine how many clusters to create. There is no rule on this but we can use statistical analysis to help us. The “NbClust” package will conduct several different analysis to provide a suggested number of clusters to create. You have to set the distance, min/max number of clusters, the method, and the index. The graphs can be understood by looking for the bend or elbow in them. At this point is the best number of clusters.

numComplete<-NbClust(MedExp_small_df,distance = 'euclidean',min.nc = 2,max.nc = 8,method = 'ward.D2',index = c('all'))

1.png

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

1.png

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 6 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
numComplete$Best.nc
##                     KL       CH Hartigan     CCC    Scott      Marriot
## Number_clusters 2.0000   2.0000   6.0000  8.0000    3.000 3.000000e+00
## Value_Index     2.9814 292.0974  56.9262 28.4817 1800.873 4.127267e+24
##                   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters      6.0   6.0000   3.0000  6.0000  2.000 3.0000
## Value_Index     166569.3 265.6967   5.3929 -0.0913  0.112 1.0987
##                 Silhouette   Duda PseudoT2  Beale Ratkowsky     Ball
## Number_clusters     2.0000 2.0000   2.0000 2.0000    6.0000    3.000
## Value_Index         0.2809 0.9567  16.1209 0.2712    0.2707 1435.833
##                 PtBiserial Frey McClain   Dunn Hubert SDindex Dindex
## Number_clusters     6.0000    1   3.000 3.0000      0  3.0000      0
## Value_Index         0.4102   NA   0.622 0.1779      0  1.9507      0
##                   SDbw
## Number_clusters 3.0000
## Value_Index     0.5195

Simple majority indicates that three clusters is most appropriate. However, four clusters are probably just as good. Every time you do the analysis you will get slightly different results unless you set the seed.

To make our actual clusters we need to calculate the distances between clusters using the “dist” function while also specifying the way to calculate it. We will calculate distance using the “Euclidean” method. Then we will take the distance’s information and make the actual clustering using the ‘hclust’ function. Below is the code.

distance<-dist(MedExp_small_df,method = 'euclidean')
hiclust<-hclust(distance,method = 'ward.D2')

We can now plot the results. We will plot “hiclust” and set hang to -1 so this will place the observations at the bottom of the plot. Next, we use the “cutree” function to identify 4 clusters and store this in the “comp” variable. Lastly, we use the “ColorDendrogram” function to highlight are actual clusters.

plot(hiclust,hang=-1, labels=F)
comp<-cutree(hiclust,4) ColorDendrogram(hiclust,y=comp,branchlength = 100)

1.jpeg

We can also create some descriptive stats such as the number of observations per cluster.

table(comp)
## comp
##   1   2   3   4 
## 439 203 357   1

We can also make a table that looks at the descriptive stats by cluster by using the “aggregate” function.

aggregate(MedExp_small_df,list(comp),mean)
##   Group.1         med         lc        lpi       fmde     ndisease
## 1       1  0.01355537 -0.7644175  0.2721403 -0.7498859  0.048977122
## 2       2 -0.06470294 -0.5358340 -1.7100649 -0.6703288 -0.105004408
## 3       3 -0.06018129  1.2405612  0.6362697  1.3001820 -0.002099968
## 4       4 28.66860936  1.4732183  0.5252898  1.1117244  0.564626907
##          linc        lfam    educdec         age
## 1  0.12531718 -0.08861109  0.1149516  0.12754008
## 2 -0.44435225  0.22404456 -0.3767211 -0.22681535
## 3  0.09804031 -0.01182114  0.0700381 -0.02765987
## 4  0.18887531 -2.36063161  1.0070155 -0.07200553

Cluster 1 is the most educated (‘educdec’). Cluster 2 stands out as having higher medical cost (‘med’), chronic disease (‘ndisease’) and age. Cluster 3 had the lowest annual incentive payment (‘lpi’). Cluster 4 had the highest coinsurance rate (‘lc’). You can make boxplots of each of the stats above. Below is just an example of age by cluster.

MedExp_small_df$cluster<-comp
boxplot(age~cluster,MedExp_small_df)

1.jpeg

Conclusion

Hierarchical clustering is one way in which to provide labels for data that does not have labels. The main challenge is determining how many clusters to create. However, this can be dealt with through using recommendations that come from various functions in R.

Leave a Reply