Using H2o Deep Learning in R

Deep learning is a complex machine learning concept in which new features are created new features from the variables that were inputted. These new features are used for classifying labeled data. This all done mostly with artificial neural networks that are multiple layers deep and can involve regularization.

If understanding is not important but you are in search of the most accurate classification possible deep learning is a useful tool. It is nearly impossible to explain to the typical stakeholder and is best for just getting the job done.

One of the most accessible packages for using deep learning is the “h2o” package.This package allows you to access the H2O website which will analyze your data and send it back to you. This allows a researcher to do analytics on a much larger scale than their own computer can handle. In this post, we will use deep learning to predict the gender of the head of household in the “VietnamH” dataset from the “Ecdat” package. Below is some initial code.

Data Preparation

library(h2o);library(Ecdat);library(corrplot)
data("VietNamH")
str(VietNamH)
## 'data.frame':    5999 obs. of  11 variables:
##  $ sex     : Factor w/ 2 levels "male","female": 2 2 1 2 2 2 2 1 1 1 ...
##  $ age     : int  68 57 42 72 73 66 73 46 50 45 ...
##  $ educyr  : num  4 8 14 9 1 13 2 9 12 12 ...
##  $ farm    : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ urban   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hhsize  : int  6 6 6 6 8 7 9 4 5 4 ...
##  $ lntotal : num  10.1 10.3 10.9 10.3 10.5 ...
##  $ lnmed   : num  11.23 8.51 8.71 9.29 7.56 ...
##  $ lnrlfood: num  8.64 9.35 10.23 9.26 9.59 ...
##  $ lnexp12m: num  11.23 8.51 8.71 9.29 7.56 ...
##  $ commune : Factor w/ 194 levels "1","10","100",..: 1 1 1 1 1 1 1 1 1 1 ...
corrplot(cor(na.omit(VietNamH[,c(-1,-4,-5,-11)])),method = 'number')

1.png

We need to remove the “commune” variable “lnexp12m” and the “lntotal” variable. The “commune” variable should be removed because it doesn’t provide much information. The “lntotal” variable should be removed because it is the total expenditures that the family spends. This is represented by other variables such as food “lnrlfood” which “lntotal” highly correlates with. the “lnexp12m” should be removed because it has a perfect correlation with “lnmed”. Below is the code

VietNamH$commune<-NULL
VietNamH$lnexp12m<-NULL
VietNamH$lntotal<-NULL

Save as CSV file

We now need to save our modified dataset as a csv file that we can send to h2o. The code is as follows.

write.csv(VietNamH, file="viet.csv",row.names = F)

Connect to H2O

Now we can connect to H2o and start what is called an instance.

localH2O<-h2o.init()
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         50 minutes 18 seconds 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    27 days  
##     H2O cluster name:           H2O_started_from_R_darrin_hsl318 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.44 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  2 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.4.0 (2017-04-21)

The output indicates that we are connected. The next step is where it really gets complicated. We need to upload our data to h2o as an h2o dataframe, which is different from a regular data frame. We also need to indicate the location of the csv file on your computer that needs to be converted. All of this is done in the code below.

viet.hex<-h2o.uploadFile(path="/home/darrin/Documents/R working directory/blog/blog/viet.csv",destination_frame = "viet.hex")

In the code above we create an object called “viet.hex”. This object uses the “h2o.uploadFile” function to send our csv to h2o. We can check if everything worked by using the “class” function and the “str” function on “viet.hex”.

class(viet.hex)
## [1] "H2OFrame"
str(viet.hex)
## Class 'H2OFrame'  
##  - attr(*, "op")= chr "Parse"
##  - attr(*, "id")= chr "viet.hex"
##  - attr(*, "eval")= logi FALSE
##  - attr(*, "nrow")= int 5999
##  - attr(*, "ncol")= int 8
##  - attr(*, "types")=List of 8
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "real"
##   ..$ : chr "enum"
##   ..$ : chr "enum"
##   ..$ : chr "int"
##   ..$ : chr "real"
##   ..$ : chr "real"
##  - attr(*, "data")='data.frame': 10 obs. of  8 variables:
##   ..$ sex     : Factor w/ 2 levels "female","male": 1 1 2 1 1 1 1 2 2 2
##   ..$ age     : num  68 57 42 72 73 66 73 46 50 45
##   ..$ educyr  : num  4 8 14 9 1 13 2 9 12 12
##   ..$ farm    : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1
##   ..$ urban   : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2
##   ..$ hhsize  : num  6 6 6 6 8 7 9 4 5 4
##   ..$ lnmed   : num  11.23 8.51 8.71 9.29 7.56 ...
##   ..$ lnrlfood: num  8.64 9.35 10.23 9.26 9.59 ...

The “summary” function also provides insight into the data.

summary(viet.hex)
##  sex          age             educyr           farm      urban    
##  male  :4375  Min.   :16.00   Min.   : 0.000   yes:3438  no :4269 
##  female:1624  1st Qu.:37.00   1st Qu.: 3.982   no :2561  yes:1730 
##               Median :46.00   Median : 6.996                      
##               Mean   :48.01   Mean   : 7.094                      
##               3rd Qu.:58.00   3rd Qu.: 9.988                      
##               Max.   :95.00   Max.   :22.000                      
##  hhsize           lnmed            lnrlfood        
##  Min.   : 1.000   Min.   : 0.000   Min.   : 6.356  
##  1st Qu.: 4.000   1st Qu.: 4.166   1st Qu.: 8.372  
##  Median : 5.000   Median : 5.959   Median : 8.689  
##  Mean   : 4.752   Mean   : 5.266   Mean   : 8.680  
##  3rd Qu.: 6.000   3rd Qu.: 7.171   3rd Qu.: 9.001  
##  Max.   :19.000   Max.   :12.363   Max.   :11.384

Create Training and Testing Sets

We now need to create our train and test sets. We need to use slightly different syntax to do this with h2o. The code below is how it is done to create a 70/30 split in the data.

rand<-h2o.runif(viet.hex,seed = 123)
train<-viet.hex[rand<=.7,]
train<-h2o.assign(train, key = "train")
test<-viet.hex[rand>.7,]
test<-h2o.assign(test, key = "test")

Here is what we did

  1. We created an object called “rand” that created random numbers for or “viet.hex” dataset.
  2. All values less than .7 were assigned to the “train” variable
  3. The train variable was given the key name “train” in order to use it in the h2o framework
  4. All values greater than .7 were assigned to test and test was given a key name

You can check the proportions of the train and test sets using the “h2o.table” function.

h2o.table(train$sex)
##      sex Count
## 1 female  1146
## 2   male  3058
## 
## [2 rows x 2 columns]
h2o.table(test$sex)
##      sex Count
## 1 female   478
## 2   male  1317
## 
## [2 rows x 2 columns]

Model Development

We can now create our model.

vietdlmodel<-h2o.deeplearning(x=2:8,y=1,training_frame = train,validation_frame = test,seed=123,variable_importances = T)

Here is what the code above means.

  1. We created an object called “vietdlmodel”
  2. We used the “h2o.deeplearning” function.
  3.  x = 2:8 is all the independent variables in the dataframe and y=1 is the first variable “sex”
  4. We set the training and validation frame to “train” and “test” and set the seed.
  5. Finally, we indicated that we want to know the variable importance.

We can check the performance of the model with the code below.

vietdlmodel
## Model Details:
## training
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        female male    Error       Rate
## female    435  711 0.620419  =711/1146
## male      162 2896 0.052976  =162/3058
## Totals    597 3607 0.207659  =873/4204

## testing
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        female male    Error       Rate
## female    151  327 0.684100   =327/478
## male       60 1257 0.045558   =60/1317
## Totals    211 1584 0.215599  =387/1795

There is a lot of output here. For simplicity, we will focus on the confusion matrices for the training and testing sets.The error rate for the training set is 19.8% and for the testing set, it is 21.2%. Below we can see which variable were most useful

vietdlmodel@model$variable_importances
## Variable Importances: 
##             variable relative_importance scaled_importance percentage
## 1           urban.no            1.000000          1.000000   0.189129
## 2          urban.yes            0.875128          0.875128   0.165512
## 3            farm.no            0.807208          0.807208   0.152666
## 4           farm.yes            0.719517          0.719517   0.136081
## 5                age            0.451581          0.451581   0.085407
## 6             hhsize            0.410472          0.410472   0.077632
## 7           lnrlfood            0.386189          0.386189   0.073039
## 8             educyr            0.380398          0.380398   0.071944
## 9              lnmed            0.256911          0.256911   0.048589
## 10  farm.missing(NA)            0.000000          0.000000   0.000000
## 11 urban.missing(NA)            0.000000          0.000000   0.000000

The numbers speak for themselves. “Urban” and “farm” are both the most important variables for predicting sex. Below is the code for obtaining the predicted results and placing them into a dataframe. This is useful if you need to send in final results to a data science competition such as those found at kaggle.

vietdlPredict<-h2o.predict(vietdlmodel,newdata = test)
vietdlPredict
##   predict     female      male
## 1    male 0.06045560 0.9395444
## 2    male 0.10957121 0.8904288
## 3    male 0.27459108 0.7254089
## 4    male 0.14721353 0.8527865
## 5    male 0.05493486 0.9450651
## 6    male 0.10598351 0.8940165
## 
## [1795 rows x 3 columns]
vietdlPred<-as.data.frame(vietdlPredict)
head(vietdlPred)
##   predict     female      male
## 1    male 0.06045560 0.9395444
## 2    male 0.10957121 0.8904288
## 3    male 0.27459108 0.7254089
## 4    male 0.14721353 0.8527865
## 5    male 0.05493486 0.9450651
## 6    male 0.10598351 0.8940165

Conclusion

This was a complicated experience. However, we learned how to upload and download results from h2.

Leave a Reply