One of the major problems with hierarchical and k-means clustering is that they cannot handle nominal data. In reality, most data is mixed, a combination of interval/ratio and nominal/ordinal variables.
One of many ways to deal with this problem is the Gower coefficient. This coefficient compares the cases in the data set pairwise and calculates a dissimilarity between them, defined as the weighted mean of the per-variable contributions for that pair.
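As a toy illustration (the variable names and values here are invented, not from MedExp), the Gower dissimilarity between two cases can be computed by hand: a numeric variable contributes the absolute difference divided by that variable's range, a nominal variable contributes 0 for a match and 1 for a mismatch, and the contributions are averaged.

```r
# Two hypothetical cases: one numeric variable and one nominal variable
age <- c(30, 50)              # numeric
sex <- c("male", "female")    # nominal
age_range <- 40               # assumed range of age in the full data set

# Numeric contribution: absolute difference scaled by the range
d_age <- abs(age[1] - age[2]) / age_range   # 20/40 = 0.5
# Nominal contribution: 0 if the categories match, 1 otherwise
d_sex <- as.numeric(sex[1] != sex[2])       # 1

# Gower dissimilarity: the (equally weighted) mean of the contributions
gower <- mean(c(d_age, d_sex))              # 0.75
gower
```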
Once the dissimilarity calculations are completed using the Gower coefficient (there are naturally other choices), you can then use regular k-means clustering (there are also other choices) to find the traits of the various clusters. In this post, we will use the “MedExp” dataset from the “Ecdat” package. Our goal will be to cluster the mixed data into four clusters. Below is some initial code.
library(cluster)
library(Ecdat)
library(compareGroups)
data("MedExp")
str(MedExp)
## 'data.frame': 5574 obs. of 15 variables:
## $ med : num 62.1 0 27.8 290.6 0 ...
## $ lc : num 0 0 0 0 0 0 0 0 0 0 ...
## $ idp : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
## $ lpi : num 6.91 6.91 6.91 6.91 6.11 ...
## $ fmde : num 0 0 0 0 0 0 0 0 0 0 ...
## $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ ndisease: num 13.7 13.7 13.7 13.7 13.7 ...
## $ health : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
## $ linc : num 9.53 9.53 9.53 9.53 8.54 ...
## $ lfam : num 1.39 1.39 1.39 1.39 1.1 ...
## $ educdec : num 12 12 12 12 12 12 12 12 9 9 ...
## $ age : num 43.9 17.6 15.5 44.1 14.5 ...
## $ sex : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
## $ child : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
## $ black : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
You can clearly see that our data is mixed, with both numerical and factor variables. Therefore, the first thing we must do is calculate the Gower coefficient for the dataset. This is done with the “daisy” function from the “cluster” package.
disMat<-daisy(MedExp,metric = "gower")
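The object returned by “daisy” is a “dissimilarity” object. If you want to inspect it, you can summarize it or convert it to an ordinary matrix; a minimal sketch, reusing the MedExp data and disMat from above:

```r
library(cluster)
library(Ecdat)
data("MedExp")
disMat <- daisy(MedExp, metric = "gower")

summary(disMat)              # distribution of the pairwise dissimilarities
as.matrix(disMat)[1:3, 1:3]  # dissimilarities among the first three cases
```

Gower dissimilarities are always scaled between 0 (identical cases) and 1 (maximally different cases).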
Now we can use “kmeans” to make our clusters. This works because “daisy” has turned the mixed variables into a matrix of numerical pairwise dissimilarities, and k-means clusters the rows of that matrix. We will set the number of clusters to 4. Below is the code.
set.seed(123)
mixedClusters<-kmeans(disMat, centers=4)
We can now look at a table of the clusters.
table(mixedClusters$cluster)
##
## 1 2 3 4
## 1960 1342 1356 916
The groups seem reasonably balanced. We now need to add the results of the k-means to the original dataset. Below is the code.
MedExp$cluster<-mixedClusters$cluster
We can now build a descriptive table that will give us the proportions of each variable in each cluster. To do this we use the “compareGroups” function. We then take the output of “compareGroups” and pass it to the “createTable” function to get our actual descriptive stats.
group<-compareGroups(cluster~.,data=MedExp)
clustab<-createTable(group)
clustab
##
## --------Summary descriptives table by 'cluster'---------
##
## __________________________________________________________________________
## 1 2 3 4 p.overall
## N=1960 N=1342 N=1356 N=916
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
## med 211 (1119) 68.2 (333) 269 (820) 83.8 (210) <0.001
## lc 4.07 (0.60) 4.05 (0.60) 0.04 (0.39) 0.03 (0.34) 0.000
## idp: <0.001
## no 1289 (65.8%) 922 (68.7%) 1123 (82.8%) 781 (85.3%)
## yes 671 (34.2%) 420 (31.3%) 233 (17.2%) 135 (14.7%)
## lpi 5.72 (1.94) 5.90 (1.73) 3.27 (2.91) 3.05 (2.96) <0.001
## fmde 6.82 (0.99) 6.93 (0.90) 0.00 (0.12) 0.00 (0.00) 0.000
## physlim: <0.001
## no 1609 (82.1%) 1163 (86.7%) 1096 (80.8%) 789 (86.1%)
## yes 351 (17.9%) 179 (13.3%) 260 (19.2%) 127 (13.9%)
## ndisease 11.5 (8.26) 10.2 (2.97) 12.2 (8.50) 10.6 (3.35) <0.001
## health: <0.001
## excellent 910 (46.4%) 880 (65.6%) 615 (45.4%) 612 (66.8%)
## good 828 (42.2%) 382 (28.5%) 563 (41.5%) 261 (28.5%)
## fair 183 (9.34%) 74 (5.51%) 137 (10.1%) 42 (4.59%)
## poor 39 (1.99%) 6 (0.45%) 41 (3.02%) 1 (0.11%)
## linc 8.68 (1.22) 8.61 (1.37) 8.75 (1.17) 8.78 (1.06) 0.005
## lfam 1.05 (0.57) 1.49 (0.34) 1.08 (0.58) 1.52 (0.35) <0.001
## educdec 12.1 (2.87) 11.8 (2.58) 12.0 (3.08) 11.8 (2.73) 0.005
## age 36.5 (12.0) 9.26 (5.01) 37.0 (12.5) 9.29 (5.11) 0.000
## sex: <0.001
## male 893 (45.6%) 686 (51.1%) 623 (45.9%) 482 (52.6%)
## female 1067 (54.4%) 656 (48.9%) 733 (54.1%) 434 (47.4%)
## child: 0.000
## no 1960 (100%) 0 (0.00%) 1356 (100%) 0 (0.00%)
## yes 0 (0.00%) 1342 (100%) 0 (0.00%) 916 (100%)
## black: <0.001
## yes 1623 (82.8%) 986 (73.5%) 1148 (84.7%) 730 (79.7%)
## no 337 (17.2%) 356 (26.5%) 208 (15.3%) 186 (20.3%)
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
The table speaks for itself. Results for factor variables are shown as counts with proportions. For example, in cluster 1, 1289 people, or 65.8%, responded “no” to having an individual deductible plan (idp). Numerical variables are shown as the mean with the standard deviation in parentheses. For example, in cluster 1 the mean of lfam (log family size) was 1.05 with a standard deviation of 0.57.
Conclusion
Mixed data can be partitioned into clusters with the help of the Gower or another coefficient. In addition, k-means is not the only way to cluster the data. There are other choices, such as partitioning around medoids. The example provided here simply serves as a basic introduction.
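For instance, partitioning around medoids works on a precomputed dissimilarity directly; a sketch using the “pam” function from the “cluster” package, reusing disMat from above:

```r
library(cluster)
library(Ecdat)
data("MedExp")
disMat <- daisy(MedExp, metric = "gower")

# pam() clusters the dissimilarities themselves, so each cluster is
# represented by an actual case (a medoid) rather than a computed centroid
pamClusters <- pam(disMat, k = 4, diss = TRUE)
table(pamClusters$clustering)
```

Because the medoids are real rows of MedExp, they can be read off as concrete "typical" cases for each cluster, which is often easier to interpret than a centroid in dissimilarity space.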