Kmeans Analysis in R

In this post, we will conduct a kmeans analysis on some data on student alcohol consumption. The data is available at the UC Irvine Machine Learning Repository and is available at https://archive.ics.uci.edu/ml/datasets/STUDENT+ALCOHOL+CONSUMPTION

We want to see what segments are within the sample of students who participated in this study on several factors in addition to alcohol consumption. Understanding the characteristics that groups of students have in common could be useful in reaching out to them for various purposes.

We will begin by loading the “stats” package for the kmeans analysis. Then we will combine the data at it is in two different files and we will explore the data using tables and histograms. I will not print the results of the exploration of the data here as there are too many variables. Lastly, we need to set the seed in order to get the same results each time. The code is still found below.

library(stats)
student.mat <- read.csv("~/Documents/R working directory/student-mat.csv", sep=";")
student.por <- read.csv("~/Documents/R working directory/student-por.csv", sep=";")
student_alcohol <- rbind(student.mat, student.por)
set.seed(123)
options(digits = 2)
  • str(student_alcohol)
  • hist(student_alcoholage)
  • table(studentalcoholage)
  • table(studentalcoholaddress)
  • table(student_alcoholfamsize)
  • table(studentalcoholfamsize)
  • table(studentalcoholPstatus)
  • hist(student_alcoholMedu)
  • hist(studentalcoholMedu)
  • hist(studentalcoholFedu)
  • hist(student_alcoholtraveltime)
  • hist(studentalcoholtraveltime)
  • hist(studentalcoholstudytime)
  • hist(student_alcoholfailures)
  • table(studentalcoholfailures)
  • table(studentalcoholschoolsup)
  • table(student_alcoholfamsup)
  • table(studentalcoholfamsup)
  • table(studentalcoholpaid)
  • table(student_alcoholactivities)
  • table(studentalcoholactivities)
  • table(studentalcoholnursery)
  • table(student_alcoholhigher)
  • table(studentalcoholhigher)
  • table(studentalcoholinternet)
  • hist(student_alcoholfamrel)
  • hist(studentalcoholfamrel)
  • hist(studentalcoholfreetime)
  • hist(student_alcoholgoout)
  • hist(studentalcoholgoout)
  • hist(studentalcoholDalc)
  • hist(student_alcoholWalc)
  • hist(studentalcoholWalc)
  • hist(studentalcoholhealth)
  • hist(student_alcohol$absences)

The details about the variables can be found at the website link in the first paragraph of this post. The study look at students alcohol use and other factors related to school and family life.

Before we do the actual kmeans clustering we need to normalize the variables. This is because are variables are measured using different scales. Some are Likert with 5 steps, while others are numeric going from 0 to over 300. The different ranges have an influence on the results. To deal with this problem we will use the “scale” for the variables that will be included in the analysis. Below is the code

student_alcohol_clean<-as.data.frame(student_alcohol)
student_alcohol_clean$age<-scale(student_alcohol$age)
student_alcohol_clean$address<-scale(as.numeric(student_alcohol$address))
student_alcohol_clean$famsize<-scale(as.numeric(student_alcohol$famsize))
student_alcohol_clean$Pstatus<-scale(as.numeric(student_alcohol$Pstatus))
student_alcohol_clean$Medu<-scale(student_alcohol$Medu)
student_alcohol_clean$Fedu<-scale(student_alcohol$Fedu)
student_alcohol_clean$traveltime<-scale(student_alcohol$traveltime)
student_alcohol_clean$studytime<-scale(student_alcohol$studytime)
student_alcohol_clean$failures<-scale(student_alcohol$failures)
student_alcohol_clean$schoolsup<-scale(as.numeric(student_alcohol$schoolsup))
student_alcohol_clean$famsup<-scale(as.numeric(student_alcohol$famsup))
student_alcohol_clean$paid<-scale(as.numeric(student_alcohol$paid))
student_alcohol_clean$activities<-scale(as.numeric(student_alcohol$activities))
student_alcohol_clean$internet<-scale(as.numeric(student_alcohol$internet))
student_alcohol_clean$famrel<-scale(student_alcohol$famrel)
student_alcohol_clean$freetime<-scale(student_alcohol$freetime)
student_alcohol_clean$goout<-scale(student_alcohol$goout)
student_alcohol_clean$Dalc<-scale(student_alcohol$Dalc)
student_alcohol_clean$Walc<-scale(student_alcohol$Walc)
student_alcohol_clean$health<-scale(student_alcohol$health)
student_alcohol_clean$absences<-scale(student_alcohol$absences)
student_alcohol_clean$G1<-scale(student_alcohol$G1)
student_alcohol_clean$G2<-scale(student_alcohol$G2)
student_alcohol_clean$G3<-scale(student_alcohol$G3)

We also need to create a matrix in order to deal with the factor variables. All factor variables need to be converted so that they have dummy variables for the analysis. To do this we use the “matrix” function as shown in the code below.

student_alcohol_clean_matrix<-(model.matrix(~.+0, data=student_alcohol_clean))

We are now ready to conduct our kmeans cluster analysis using the “kmeans” function. We have to determine how many clusters to develop before the analysis. There are statistical ways to do this but another method is domain knowledge. Since we are dealing with teenagers, it is probably that there will be about four distinct groups because of how high school is structured. Therefore, we will use four segments for our analysis. Our code is below.

alcohol_cluster<-kmeans(student_alcohol_clean_matrix, 4)

To view the results we need to view two variables in are “alcohol_cluster” list. The “size” variable will tell us how many people are in each cluster and the “centers” variable describes a clusters characteristics on a particular variable. Below is the code

alcohol_cluster$size # size of clusters
## [1] 191 381 325 147
alcohol_cluster$centers #center of clusters
##   schoolGP schoolMS sexM   age address famsize Pstatus  Medu  Fedu
## 1     0.74     0.26 0.70  0.06  0.0017   0.265  -0.030  0.25  0.29
## 2     0.69     0.31 0.27 -0.15 -0.1059  -0.056  -0.031 -0.53 -0.45
## 3     0.88     0.12 0.45 -0.13  0.3363   0.005   0.016  0.73  0.58
## 4     0.56     0.44 0.48  0.59 -0.4712  -0.210   0.086 -0.55 -0.51
##   Mjobhealth Mjobother Mjobservices Mjobteacher Fjobhealth Fjobother
## 1      0.079      0.30         0.27       0.199     0.0471      0.54
## 2      0.031      0.50         0.17       0.024     0.0210      0.62
## 3      0.154      0.26         0.28       0.237     0.0708      0.48
## 4      0.034      0.46         0.22       0.041     0.0068      0.60
##   Fjobservices Fjobteacher reasonhome reasonother reasonreputation
## 1         0.33       0.042       0.27       0.120             0.20
## 2         0.25       0.031       0.27       0.113             0.20
## 3         0.28       0.123       0.24       0.077             0.34
## 4         0.29       0.034       0.17       0.116             0.15
##   guardianmother guardianother traveltime studytime failures schoolsup
## 1           0.69         0.079       0.17     -0.32    -0.12    -0.079
## 2           0.70         0.052       0.10      0.10    -0.26     0.269
## 3           0.73         0.040      -0.37      0.29    -0.35    -0.213
## 4           0.65         0.170       0.33     -0.49     1.60    -0.123
##   famsup   paid activities nurseryyes higheryes internet romanticyes
## 1 -0.033  0.253     0.1319       0.79      0.92    0.228        0.34
## 2 -0.095 -0.098    -0.2587       0.76      0.93   -0.387        0.35
## 3  0.156  0.079     0.2237       0.88      1.00    0.360        0.31
## 4 -0.057 -0.250     0.0047       0.73      0.70   -0.091        0.49
##   famrel freetime goout  Dalc  Walc health absences    G1    G2     G3
## 1 -0.184     0.43  0.76  1.34  1.29  0.273    0.429 -0.23 -0.17 -0.129
## 2 -0.038    -0.31 -0.35 -0.37 -0.40 -0.123   -0.042 -0.17 -0.12 -0.053
## 3  0.178    -0.01 -0.14 -0.41 -0.35 -0.055   -0.184  0.90  0.87  0.825
## 4 -0.055     0.25  0.22  0.11  0.14  0.087   -0.043 -1.24 -1.40 -1.518

The size of the each cluster is about the same which indicates reasonable segmentation of the sample. The output for “centers” tells us how much above or below the mean a particular cluster is. For example, for the variable “age” we see the following

age 
1    0.06
2   -0.15   
3   -0.13 
4    0.59

What this means is that people in cluster one have an average age 0.06 standard deviations above the mean, cluster two is -0.14 standard deviations below the mean etc. To give our clusters meaning we have to look at the variables and see which one the clusters are extremely above or below the mean. Below is my interpretation of the clusters. The words in parenthesis is the variable from which I made my conclusion

Cluster 1 doesn’t study much (studytime), lives in the biggest families (famsize), requires litle school support (schoolsup), has a lot of free time (freetime), and consumes the most alcohol (Dalc, Walc), lives in an urban area (address), loves to go out the most (goout), and has the most absences. This is the underachieving party alcoholics of the sample

Cluster 2 have parents that are much less educated (Medu, Fedu), requires the most school support (schoolsup), while receiving the less family support (famsup), have the least involvement in extra-curricular activities (activities), has the least internet access at home (internet), socialize the least (goout), and lowest alcohol consumption. (Dalc, Walc) This cluster is the unsupported non-alcoholic loners of the sample

Cluster 3 has the most educated parents (Medu, Fedu), live in an urban setting (address) choose their school based on reputation (reasonreputation), have the lowest travel time to school (traveltime), study the most (studytime), rarely fail a course (failures), have the lowest support from the school while having the highest family support and family relationship (schoolsup, famsup, famrel), most involve in extra-curricular activities (activities), best internet acess at home (internet), least amount of free time (freetime) low alcohol consumption (Dalc, Walc). This cluster represents the affluent non-alcoholic high achievers.

CLuster 4 is the oldest (age), live in rural setting (address), has the smallest families (famsize), the least educated parents (Medu, Fedu), spends the most time traveling to school (traveltime), doesnt study much (studytime), has the highest failure rate (failures), never pays for extra classes (paid), most likely to be in a relationship (romanticyes), consumes alcohol moderately (Dalc, Walc), does poorly in school (G3). These students are the students in the greatest academic danger.

To get better insights, we can add the cluster results to our original dataset that was not normalize we can then identify what cluster each student belongs to ad calculate unstandardized means if we wanted.

student_alcohol$cluster<-alcohol_cluster$cluster # add clusters back to original data normalize does not mean much
View(student_alcohol)
aggregate(data=student_alcohol, G3~cluster, mean)
##   cluster   G3
## 1       1 10.8
## 2       2 11.1
## 3       3 14.5
## 4       4  5.5

The “aggregate function” tells us the average for each cluster on this question. We could do this for all of our variables to learn more about our clusters.

Kmeans provides a researcher with an understanding of the homogeneous characteristics of individuals within a sample. This information can be used to develop intervention plans in education or marketing plans in business. As such, kmeans is another powerful tool in machine learning

1 thought on “Kmeans Analysis in R

  1. Pingback: Kmeans Analysis in R | Education and Research |...

Leave a Reply