Deep learning is a machine learning approach in which new features are created from the input variables. These new features are then used for tasks such as classifying labeled data. This is done mostly with artificial neural networks that are multiple layers deep and can involve regularization.
If understanding is not important but you want the most accurate classification possible, deep learning is a useful tool. It is nearly impossible to explain to the typical stakeholder and is best for just getting the job done.
One of the most accessible packages for using deep learning is the “h2o” package. This package connects R to an H2O cluster, a separate Java-based engine that analyzes your data and sends the results back to R. The cluster can run on your own machine, as it does in this post, or on a larger remote server, which allows a researcher to do analytics on a much larger scale than their own computer can handle. In this post, we will use deep learning to predict the gender of the head of household in the “VietNamH” dataset from the “Ecdat” package. Below is some initial code.
Data Preparation
library(h2o);library(Ecdat);library(corrplot)
data("VietNamH")
str(VietNamH)
## 'data.frame': 5999 obs. of 11 variables:
## $ sex : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
## $ age : int 68 57 42 72 73 66 73 46 50 45 ...
## $ educyr : num 4 8 14 9 1 13 2 9 12 12 ...
## $ farm : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
## $ urban : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ hhsize : int 6 6 6 6 8 7 9 4 5 4 ...
## $ lntotal : num 10.1 10.3 10.9 10.3 10.5 ...
## $ lnmed : num 11.23 8.51 8.71 9.29 7.56 ...
## $ lnrlfood: num 8.64 9.35 10.23 9.26 9.59 ...
## $ lnexp12m: num 11.23 8.51 8.71 9.29 7.56 ...
## $ commune : Factor w/ 194 levels "1","10","100",..: 1 1 1 1 1 1 1 1 1 1 ...
corrplot(cor(na.omit(VietNamH[,c(-1,-4,-5,-11)])),method = 'number')
We need to remove the “commune”, “lnexp12m”, and “lntotal” variables. The “commune” variable should be removed because it is an identifier with 194 levels and doesn’t provide much information. The “lntotal” variable should be removed because it is the household’s total expenditure, which is already represented by other variables such as food expenditure (“lnrlfood”), with which “lntotal” is highly correlated. The “lnexp12m” variable should be removed because it is perfectly correlated with “lnmed”. Below is the code.
VietNamH$commune<-NULL
VietNamH$lnexp12m<-NULL
VietNamH$lntotal<-NULL
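The redundancy check described above can also be done without reading a plot. The sketch below is only an illustration on made-up data: the column names mimic the post, the values are synthetic, and “high_cor_pairs” is a hypothetical helper, not part of any package.

```r
# Synthetic stand-in data; column names mimic the post but the values are made up.
set.seed(1)
n <- 200
lnmed <- rnorm(n)
toy <- data.frame(
  age      = rnorm(n, mean = 48, sd = 14),
  lnmed    = lnmed,
  lnexp12m = lnmed,                      # exact duplicate, so the correlation is 1
  lnrlfood = rnorm(n)
)

# Hypothetical helper: list variable pairs whose |correlation| meets a cutoff.
high_cor_pairs <- function(df, cutoff = 0.9) {
  cm <- cor(df, use = "complete.obs")
  cm[lower.tri(cm, diag = TRUE)] <- NA   # keep each pair once, drop the diagonal
  idx <- which(abs(cm) >= cutoff, arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

pairs_found <- high_cor_pairs(toy)
print(pairs_found)
```

Any pair that shows up here is a candidate for dropping one of its two variables. The cutoff of 0.9 is an arbitrary choice for the example.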
Save as CSV file
We now need to save our modified dataset as a CSV file that we can send to H2O. The code is as follows.
write.csv(VietNamH, file="viet.csv",row.names = FALSE)
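One side effect of going through a CSV file is that only text is stored, so the original order of factor levels is lost. The sketch below shows this on a toy data frame in base R; the temporary file path and the toy values are assumptions for illustration. This is consistent with the level order of “sex” differing between the original data frame and the uploaded frame later in this post.

```r
# Toy data frame with a factor whose levels are in a non-alphabetical order.
toy <- data.frame(
  sex = factor(c("female", "male", "female"), levels = c("male", "female")),
  age = c(68L, 42L, 57L)
)

# Write to a temporary file without row names.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, file = csv_path, row.names = FALSE)

# Read it back; strings become factors with alphabetically sorted levels.
back <- read.csv(csv_path, stringsAsFactors = TRUE)

levels(toy$sex)   # "male" "female"
levels(back$sex)  # "female" "male"
names(back)       # "sex" "age" (no extra row-name column)
```

The data itself is unchanged; only the internal integer codes behind each label differ.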
Connect to H2O
Now we can connect to H2O and start what is called an instance.
localH2O<-h2o.init()
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 50 minutes 18 seconds
## H2O cluster version: 3.10.4.6
## H2O cluster version age: 27 days
## H2O cluster name: H2O_started_from_R_darrin_hsl318
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.44 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 2
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## R Version: R version 3.4.0 (2017-04-21)
The output indicates that we are connected. The next step is more involved. We need to upload our data to H2O as an H2OFrame, which is different from a regular data frame. We also need to indicate the location of the CSV file on our computer. All of this is done in the code below.
viet.hex<-h2o.uploadFile(path="/home/darrin/Documents/R working directory/blog/blog/viet.csv",destination_frame = "viet.hex")
In the code above we create an object called “viet.hex” by using the “h2o.uploadFile” function to send our CSV file to H2O. We can check if everything worked by using the “class” function and the “str” function on “viet.hex”.
class(viet.hex)
## [1] "H2OFrame"
str(viet.hex)
## Class 'H2OFrame'
## - attr(*, "op")= chr "Parse"
## - attr(*, "id")= chr "viet.hex"
## - attr(*, "eval")= logi FALSE
## - attr(*, "nrow")= int 5999
## - attr(*, "ncol")= int 8
## - attr(*, "types")=List of 8
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "real"
## ..$ : chr "enum"
## ..$ : chr "enum"
## ..$ : chr "int"
## ..$ : chr "real"
## ..$ : chr "real"
## - attr(*, "data")='data.frame': 10 obs. of 8 variables:
## ..$ sex : Factor w/ 2 levels "female","male": 1 1 2 1 1 1 1 2 2 2
## ..$ age : num 68 57 42 72 73 66 73 46 50 45
## ..$ educyr : num 4 8 14 9 1 13 2 9 12 12
## ..$ farm : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1
## ..$ urban : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2
## ..$ hhsize : num 6 6 6 6 8 7 9 4 5 4
## ..$ lnmed : num 11.23 8.51 8.71 9.29 7.56 ...
## ..$ lnrlfood: num 8.64 9.35 10.23 9.26 9.59 ...
The “summary” function also provides insight into the data.
summary(viet.hex)
## sex age educyr farm urban
## male :4375 Min. :16.00 Min. : 0.000 yes:3438 no :4269
## female:1624 1st Qu.:37.00 1st Qu.: 3.982 no :2561 yes:1730
## Median :46.00 Median : 6.996
## Mean :48.01 Mean : 7.094
## 3rd Qu.:58.00 3rd Qu.: 9.988
## Max. :95.00 Max. :22.000
## hhsize lnmed lnrlfood
## Min. : 1.000 Min. : 0.000 Min. : 6.356
## 1st Qu.: 4.000 1st Qu.: 4.166 1st Qu.: 8.372
## Median : 5.000 Median : 5.959 Median : 8.689
## Mean : 4.752 Mean : 5.266 Mean : 8.680
## 3rd Qu.: 6.000 3rd Qu.: 7.171 3rd Qu.: 9.001
## Max. :19.000 Max. :12.363 Max. :11.384
Create Training and Testing Sets
We now need to create our train and test sets. The syntax for this is slightly different with h2o. The code below creates a roughly 70/30 split of the data.
rand<-h2o.runif(viet.hex,seed = 123)
train<-viet.hex[rand<=.7,]
train<-h2o.assign(train, key = "train")
test<-viet.hex[rand>.7,]
test<-h2o.assign(test, key = "test")
Here is what we did.
- We created an object called “rand” that holds one uniform random number between 0 and 1 for each row of our “viet.hex” dataset.
- All rows with a random number less than or equal to .7 were assigned to “train”.
- The “train” object was given the key name “train” in order to use it in the h2o framework.
- All rows with a random number greater than .7 were assigned to “test”, and “test” was given a key name as well.
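The same splitting logic can be tried in base R without an H2O connection. This is a minimal sketch on a made-up data frame; “runif” stands in for “h2o.runif”, and the column names are assumptions for illustration.

```r
set.seed(123)
toy <- data.frame(id = 1:1000, x = rnorm(1000))

rand  <- runif(nrow(toy))       # one uniform random number per row
train <- toy[rand <= 0.7, ]     # rows whose number is at most .7
test  <- toy[rand >  0.7, ]     # all remaining rows

nrow(train) + nrow(test) == nrow(toy)   # every row lands in exactly one set
nrow(train) / nrow(toy)                 # close to 0.7, but not exactly 0.7
length(intersect(train$id, test$id))    # no row appears in both sets
```

Because each row is assigned independently, the split is only approximately 70/30, which is why the counts in the train and test sets do not divide the data exactly.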
You can check the distribution of “sex” in the train and test sets using the “h2o.table” function.
h2o.table(train$sex)
## sex Count
## 1 female 1146
## 2 male 3058
##
## [2 rows x 2 columns]
h2o.table(test$sex)
## sex Count
## 1 female 478
## 2 male 1317
##
## [2 rows x 2 columns]
Model Development
We can now create our model.
vietdlmodel<-h2o.deeplearning(x=2:8,y=1,training_frame = train,validation_frame = test,seed=123,variable_importances = T)
Here is what the code above means.
- We created an object called “vietdlmodel”
- We used the “h2o.deeplearning” function.
- x = 2:8 selects all the independent variables in the dataframe by column position, and y = 1 is the first column, “sex”, which is the response.
- We set the training and validation frame to “train” and “test” and set the seed.
- Finally, we indicated that we want to know the variable importance.
We can check the performance of the model with the code below.
vietdlmodel
## Model Details:
## training
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## female male Error Rate
## female 435 711 0.620419 =711/1146
## male 162 2896 0.052976 =162/3058
## Totals 597 3607 0.207659 =873/4204
## testing
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## female male Error Rate
## female 151 327 0.684100 =327/478
## male 60 1257 0.045558 =60/1317
## Totals 211 1584 0.215599 =387/1795
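The rates in the testing confusion matrix above can be recomputed by hand, which also makes it easy to compare the model with a naive baseline. The sketch below copies the four counts from the output; the baseline of always predicting the majority class is an addition for comparison, not something H2O reports here.

```r
# Counts copied from the testing confusion matrix above (rows = actual).
test_cm <- matrix(c(151,  327,
                     60, 1257),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(actual    = c("female", "male"),
                                  predicted = c("female", "male")))

# Overall error: off-diagonal counts divided by all cases.
overall_error <- 1 - sum(diag(test_cm)) / sum(test_cm)

# Per-class error: misclassified cases in each row divided by the row total.
class_error <- 1 - diag(test_cm) / rowSums(test_cm)

# Baseline: always predict the majority class ("male").
baseline_error <- 1 - max(rowSums(test_cm)) / sum(test_cm)

round(overall_error, 4)   # 0.2156
round(class_error, 4)     # female 0.6841, male 0.0456
round(baseline_error, 4)  # 0.2663
```

On these counts the model's overall error is about five percentage points lower than the baseline, while most of its mistakes fall on the smaller “female” class.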
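There is a lot of output here. For simplicity, we will focus on the confusion matrices for the training and testing sets. The error rate for the training set is 20.8% and for the testing set it is 21.6%. The errors are uneven, however: in the testing set about 68% of female-headed households are misclassified, compared with under 5% of male-headed households. Below we can see which variables were most useful.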
vietdlmodel@model$variable_importances
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 urban.no 1.000000 1.000000 0.189129
## 2 urban.yes 0.875128 0.875128 0.165512
## 3 farm.no 0.807208 0.807208 0.152666
## 4 farm.yes 0.719517 0.719517 0.136081
## 5 age 0.451581 0.451581 0.085407
## 6 hhsize 0.410472 0.410472 0.077632
## 7 lnrlfood 0.386189 0.386189 0.073039
## 8 educyr 0.380398 0.380398 0.071944
## 9 lnmed 0.256911 0.256911 0.048589
## 10 farm.missing(NA) 0.000000 0.000000 0.000000
## 11 urban.missing(NA) 0.000000 0.000000 0.000000
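Because the categorical variables are split into one row per level, it can help to add the shares back together per original variable. The sketch below does this in base R using the percentages copied from the table above; the “source” column and the rule of cutting the name at the first dot are assumptions that only work because none of these variable names contain a dot themselves.

```r
# Percentages copied from the variable importance table above.
imp <- data.frame(
  variable = c("urban.no", "urban.yes", "farm.no", "farm.yes", "age",
               "hhsize", "lnrlfood", "educyr", "lnmed",
               "farm.missing(NA)", "urban.missing(NA)"),
  percentage = c(0.189129, 0.165512, 0.152666, 0.136081, 0.085407,
                 0.077632, 0.073039, 0.071944, 0.048589, 0, 0),
  stringsAsFactors = FALSE
)

# Drop everything from the first dot onward to recover the source variable.
imp$source <- sub("\\..*$", "", imp$variable)

# Sum the shares per source variable and sort from most to least important.
by_source <- aggregate(percentage ~ source, data = imp, FUN = sum)
by_source <- by_source[order(-by_source$percentage), ]
print(by_source, row.names = FALSE)
```

Summed this way, “urban” accounts for roughly 35% of the total importance and “farm” for roughly 29%.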
The numbers speak for themselves. “urban” and “farm” are the most important variables for predicting sex. Below is the code for obtaining the predicted results and placing them into a dataframe. This is useful if you need to submit final results to a data science competition such as those found at Kaggle.
vietdlPredict<-h2o.predict(vietdlmodel,newdata = test)
vietdlPredict
## predict female male
## 1 male 0.06045560 0.9395444
## 2 male 0.10957121 0.8904288
## 3 male 0.27459108 0.7254089
## 4 male 0.14721353 0.8527865
## 5 male 0.05493486 0.9450651
## 6 male 0.10598351 0.8940165
##
## [1795 rows x 3 columns]
vietdlPred<-as.data.frame(vietdlPredict)
head(vietdlPred)
## predict female male
## 1 male 0.06045560 0.9395444
## 2 male 0.10957121 0.8904288
## 3 male 0.27459108 0.7254089
## 4 male 0.14721353 0.8527865
## 5 male 0.05493486 0.9450651
## 6 male 0.10598351 0.8940165
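Once the predictions are in a regular data frame, ordinary R tools apply. The sketch below rebuilds the six rows shown above, applies a different probability cutoff, and writes a small file to disk. The cutoff of 0.25, the “id” column, and the output layout are all assumptions for illustration; a real competition would specify its own format.

```r
# Probabilities copied from the first six predictions shown above.
pred <- data.frame(
  female = c(0.06045560, 0.10957121, 0.27459108, 0.14721353, 0.05493486, 0.10598351),
  male   = c(0.9395444, 0.8904288, 0.7254089, 0.8527865, 0.9450651, 0.8940165)
)

# Illustrative cutoff (an assumption, not the threshold H2O chose).
cutoff <- 0.25
pred$label <- ifelse(pred$female >= cutoff, "female", "male")

# Hypothetical submission layout: a row id plus the predicted label.
submission <- data.frame(id = seq_len(nrow(pred)), sex = pred$label)
out_path <- tempfile(fileext = ".csv")
write.csv(submission, file = out_path, row.names = FALSE)

table(pred$label)
```

With this cutoff the third household is labeled “female” instead of “male”, which shows how sensitive the labels are to the threshold when one class is much smaller than the other.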
Conclusion
This was a complicated experience. However, we learned how to upload data to H2O and download results from it.