In this post, we will look at how to visualize multivariate clustered data. We will use the “Hitters” dataset from the “ISLR” package, with the numeric features of the various baseball players serving as the dimensions for the clustering. Below is the initial code.
library(ISLR);library(cluster)
data("Hitters")
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
Data Preparation
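We need to remove all of the factor variables (“League”, “Division”, and “NewLeague”) because the kmeans algorithm only works with numeric variables. In addition, we need to remove the “Salary” variable because it has missing values, which kmeans cannot handle. Lastly, we need to scale the data: kmeans relies on distances, so variables measured on larger scales (such as career at-bats) would otherwise dominate the results of the clustering. The code for all of this is below.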
hittersScaled<-scale(Hitters[,c(-14,-15,-19,-20)])
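Dropping columns by position works, but it silently breaks if the column order ever changes. As a minimal sketch of an alternative, the same preparation can be done by selecting columns on their type and completeness. The small data frame `toy` below is a made-up stand-in with the same kinds of columns as Hitters, used only so the example runs without extra packages.

```r
# Hypothetical stand-in for Hitters: two numeric columns, one factor,
# and one numeric column with a missing value
toy <- data.frame(
  AtBat  = c(293, 315, 479, 496),
  Hits   = c(66, 81, 130, 141),
  League = factor(c("A", "N", "A", "N")),
  Salary = c(NA, 475, 480, 500)
)

# Keep columns that are numeric AND have no missing values
isNumeric <- vapply(toy, is.numeric, logical(1))
isComplete <- colSums(is.na(toy)) == 0
keep <- isNumeric & isComplete

# Center each column to mean 0 and rescale to standard deviation 1
toyScaled <- scale(toy[, keep, drop = FALSE])
colnames(toyScaled)
```

Applied to Hitters, this selection should keep the same 16 columns as the positional indexing above, since the three factors and “Salary” are the only columns it would drop.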
Data Analysis
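We will set the k for kmeans to 3. This can be set to any number, and it often requires domain knowledge to determine what is most appropriate. Note that kmeans starts from randomly chosen centers, so the cluster numbering (and sometimes the clusters themselves) can change from run to run unless a seed is set. Below is the code.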
kHitters<-kmeans(hittersScaled,3)
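When domain knowledge does not suggest a value for k, one common heuristic is the "elbow" plot: run kmeans for several values of k and look for the point where the total within-cluster sum of squares stops dropping sharply. The sketch below is not part of the original analysis. It uses simulated data (`sim`) with three obvious groups, and the names `sim`, `simScaled`, and `wss` are made up for illustration. It also sets a seed and uses `nstart` so that the results are repeatable and less dependent on one random start.

```r
set.seed(42)  # make the random starts repeatable

# Simulated data: three well-separated groups in two dimensions
sim <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)
simScaled <- scale(sim)

# Total within-cluster sum of squares for k = 1 to 6,
# keeping the best of 25 random starts for each k
wss <- sapply(1:6, function(k) {
  kmeans(simScaled, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where the curve flattens out
plot(1:6, wss, type = "b",
     xlab = "Number of clusters (k)",
     ylab = "Total within-cluster sum of squares")
```

The same loop could be run on `hittersScaled`, although real data rarely shows as clean an elbow as simulated data does.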
We now look at some descriptive stats. First, we will see how many examples are in each cluster.
table(kHitters$cluster)
##
## 1 2 3
## 116 144 62
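The groups are uneven in size, with the third cluster about half as large as the other two, but none of them is very small. Next, we will look at the mean of each feature by cluster. This will be done with the “aggregate” function. We will use the original (unscaled) data and group it by the three cluster assignments.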
round(aggregate(Hitters[,c(-14,-15,-19,-20)],FUN=mean,by=list(kHitters$cluster)),1)
## Group.1 AtBat Hits HmRun Runs RBI Walks Years CAtBat CHits CHmRun
## 1 1 522.4 143.4 15.1 73.8 66.0 51.7 5.7 2179.1 597.2 51.3
## 2 2 256.6 64.5 5.5 30.9 28.6 24.3 5.6 1377.1 355.6 24.7
## 3 3 404.9 106.7 14.8 54.6 59.4 48.1 15.1 6480.7 1783.4 207.5
## CRuns CRBI CWalks PutOuts Assists Errors
## 1 299.2 256.1 199.7 380.2 181.8 11.7
## 2 170.1 143.6 122.2 209.0 62.4 5.8
## 3 908.5 901.8 694.0 303.7 70.3 6.4
Now we can see some differences. Group 1 appears to be young starters (5.7 years of experience on average), judging by the high number of at-bats they get (522.4). Group 2 is young players (5.6 years) who may not get to start, given the lower number of at-bats they receive (256.6). Group 3 is older players (15.1 years) who receive significant playing time and have put together impressive career statistics.
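The per-cluster means from “aggregate” are the kmeans cluster centers expressed in the original units. As a small sketch of that relationship, the code below back-transforms the scaled centers using the centering and scaling values that `scale` stores as attributes, and compares them with the “aggregate” result. The data frame `toy` is simulated and all object names here are made up for illustration.

```r
set.seed(1)

# Simulated data with two clear groups on very different scales
toy <- data.frame(
  x = c(rnorm(20, mean = 0),            rnorm(20, mean = 10)),
  y = c(rnorm(20, mean = 100, sd = 5),  rnorm(20, mean = 200, sd = 5))
)
toyScaled <- scale(toy)
km <- kmeans(toyScaled, centers = 2, nstart = 10)

# Approach 1: average the original data within each cluster
byAggregate <- aggregate(toy, by = list(cluster = km$cluster), FUN = mean)

# Approach 2: undo the scaling on the kmeans centers
# (multiply by the stored standard deviations, then add the stored means)
centers <- sweep(km$centers, 2, attr(toyScaled, "scaled:scale"), "*")
centers <- sweep(centers, 2, attr(toyScaled, "scaled:center"), "+")

byAggregate
centers
```

Both approaches should give the same numbers, so “aggregate” is mainly a convenient way to read the centers without undoing the scaling by hand.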
Now we will create our visual of the three clusters. For this, we use the “clusplot” function from the “cluster” package.
clusplot(hittersScaled, kHitters$cluster, color = TRUE, shade = TRUE, labels = 4)
In general, there is little overlap between the clusters. The overlap that does appear between the two younger groups may be due to their similar amount of experience (about 5.6 years on average in both).
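It helps to remember that “clusplot” draws the clusters in two dimensions by projecting the data onto principal components, so some apparent overlap can come from the projection rather than from the clusters themselves. The sketch below is a rough, hand-built version of that idea using `prcomp` on simulated data. It is not how “clusplot” is implemented internally, and the names `sim`, `pca`, and `varExplained` are made up for illustration.

```r
set.seed(7)

# Simulated data: two groups in three dimensions, then scaled
sim <- scale(rbind(
  matrix(rnorm(150, mean = 0), ncol = 3),
  matrix(rnorm(150, mean = 4), ncol = 3)
))
km <- kmeans(sim, centers = 2, nstart = 10)

# Principal components of the scaled data
pca <- prcomp(sim)

# Share of the total variance captured by each component
varExplained <- pca$sdev^2 / sum(pca$sdev^2)
round(varExplained, 3)

# Plot the first two components, colored by cluster
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     xlab = "Component 1", ylab = "Component 2")
```

If the first two components explain only a modest share of the variance, clusters that look overlapping in the plot may still be well separated in the full set of dimensions.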
Conclusion
Visualizing the clusters can help with developing insights into the groups found during the analysis. This post provided one example of this.
One of the main prerequisites is to check whether the variables used for the cluster analysis are correlated. If they are, it is necessary to run a factor analysis beforehand and use the factor scores of the detected dimensions for the clustering.
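A quick way to run this check is to compute the correlation matrix and list the pairs of variables above some cutoff. The sketch below uses simulated data, and the 0.8 cutoff is an arbitrary assumption rather than a rule. The names `toy`, `corMat`, and `high` are made up for illustration.

```r
set.seed(3)

# Simulated data: b is nearly a multiple of a, c is unrelated
a <- rnorm(50)
toy <- data.frame(
  a = a,
  b = 2 * a + rnorm(50, sd = 0.1),
  c = rnorm(50)
)

corMat <- cor(toy)

# Look only above the diagonal so each pair is reported once
high <- which(abs(corMat) > 0.8 & upper.tri(corMat), arr.ind = TRUE)

# Table of highly correlated pairs and their correlations
data.frame(
  var1 = rownames(corMat)[high[, "row"]],
  var2 = colnames(corMat)[high[, "col"]],
  r    = corMat[high]
)
```

For the Hitters data, the same check could be run with `cor(hittersScaled)`. The season and career counting statistics would be natural candidates to inspect first.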