Clustering Mixed Data in R

Advertisements

One of the major problems with hierarchical and k-means clustering is that they cannot handle nominal data. The reality is that most data is mixed or a combination of both interval/ratio data and nominal/ordinal data.

One of many ways to deal with this problem is by using the Gower coefficient. This coefficient compares the pairwise cases in the data set and calculates a dissimilarity between. By dissimilar we mean the weighted mean of the variables in that row.

Once the dissimilarity calculations are completed using the gower coefficient (there are naturally other choices), you can then use regular kmeans clustering (there are also other choices) to find the traits of the various clusters. In this post, we will use the “MedExp” dataset from the “Ecdat” package. Our goal will be to cluster the mixed data into four clusters. Below is some initial code.

library(cluster);library(Ecdat);library(compareGroups)
data("MedExp")
str(MedExp)
## 'data.frame':    5574 obs. of  15 variables:
##  $ med     : num  62.1 0 27.8 290.6 0 ...
##  $ lc      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idp     : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
##  $ lpi     : num  6.91 6.91 6.91 6.91 6.11 ...
##  $ fmde    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ndisease: num  13.7 13.7 13.7 13.7 13.7 ...
##  $ health  : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
##  $ linc    : num  9.53 9.53 9.53 9.53 8.54 ...
##  $ lfam    : num  1.39 1.39 1.39 1.39 1.1 ...
##  $ educdec : num  12 12 12 12 12 12 12 12 9 9 ...
##  $ age     : num  43.9 17.6 15.5 44.1 14.5 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
##  $ child   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
##  $ black   : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

You can clearly see that our data is mixed with both numerical and factor variables. Therefore, the first thing we must do is calculate the gower coefficient for the dataset. This is done with the “daisy” function from the “cluster” package.

disMat<-daisy(MedExp,metric = "gower")

Now we can use the “kmeans” to make are clusters. This is possible because all the factor variables have been converted to a numerical value. We will set the number of clusters to 4. Below is the code.

set.seed(123)
mixedClusters<-kmeans(disMat, centers=4)

We can now look at a table of the clusters

table(mixedClusters$cluster)
## 
##    1    2    3    4 
## 1960 1342 1356  916

The groups seem reasonably balanced. We now need to add the results of the kmeans to the original dataset. Below is the code

MedExp$cluster<-mixedClusters$cluster

We now can built a descriptive table that will give us the proportions of each variable in each cluster. To do this we need to use the “compareGroups” function. We will then take the output of the “compareGroups” function and use it in the “createTable” function to get are actual descriptive stats.

group<-compareGroups(cluster~.,data=MedExp)
clustab<-createTable(group)
clustab
## 
## --------Summary descriptives table by 'cluster'---------
## 
## __________________________________________________________________________ 
##                    1            2            3            4      p.overall 
##                  N=1960       N=1342       N=1356       N=916              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## med            211 (1119)   68.2 (333)   269 (820)   83.8 (210)   <0.001   
## lc            4.07 (0.60)  4.05 (0.60)  0.04 (0.39)  0.03 (0.34)   0.000   
## idp:                                                              <0.001   
##     no        1289 (65.8%) 922 (68.7%)  1123 (82.8%) 781 (85.3%)           
##     yes       671 (34.2%)  420 (31.3%)  233 (17.2%)  135 (14.7%)           
## lpi           5.72 (1.94)  5.90 (1.73)  3.27 (2.91)  3.05 (2.96)  <0.001   
## fmde          6.82 (0.99)  6.93 (0.90)  0.00 (0.12)  0.00 (0.00)   0.000   
## physlim:                                                          <0.001   
##     no        1609 (82.1%) 1163 (86.7%) 1096 (80.8%) 789 (86.1%)           
##     yes       351 (17.9%)  179 (13.3%)  260 (19.2%)  127 (13.9%)           
## ndisease      11.5 (8.26)  10.2 (2.97)  12.2 (8.50)  10.6 (3.35)  <0.001   
## health:                                                           <0.001   
##     excellent 910 (46.4%)  880 (65.6%)  615 (45.4%)  612 (66.8%)           
##     good      828 (42.2%)  382 (28.5%)  563 (41.5%)  261 (28.5%)           
##     fair      183 (9.34%)   74 (5.51%)  137 (10.1%)  42 (4.59%)            
##     poor       39 (1.99%)   6 (0.45%)    41 (3.02%)   1 (0.11%)            
## linc          8.68 (1.22)  8.61 (1.37)  8.75 (1.17)  8.78 (1.06)   0.005   
## lfam          1.05 (0.57)  1.49 (0.34)  1.08 (0.58)  1.52 (0.35)  <0.001   
## educdec       12.1 (2.87)  11.8 (2.58)  12.0 (3.08)  11.8 (2.73)   0.005   
## age           36.5 (12.0)  9.26 (5.01)  37.0 (12.5)  9.29 (5.11)   0.000   
## sex:                                                              <0.001   
##     male      893 (45.6%)  686 (51.1%)  623 (45.9%)  482 (52.6%)           
##     female    1067 (54.4%) 656 (48.9%)  733 (54.1%)  434 (47.4%)           
## child:                                                             0.000   
##     no        1960 (100%)   0 (0.00%)   1356 (100%)   0 (0.00%)            
##     yes        0 (0.00%)   1342 (100%)   0 (0.00%)   916 (100%)            
## black:                                                            <0.001   
##     yes       1623 (82.8%) 986 (73.5%)  1148 (84.7%) 730 (79.7%)           
##     no        337 (17.2%)  356 (26.5%)  208 (15.3%)  186 (20.3%)           
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

The table speaks for itself. Results that utilize factor variables have proportions to them. For example, in cluster 1, 1289 people or 65.8% responded “no” that the have an individual deductible plan (idp). Numerical variables have the mean with the standard deviation in parentheses. For example, in cluster 1 the average family size was 1 with a standard deviation of 1.05 (lfam).

Conclusion

Mixed data can be partition into clusters with the help of the gower or another coefficient. In addition, kmeans is not the only way to cluster the data. There are other choices such as the partitioning around medoids. The example provided here simply serves as a basic introduction to this.