# Gradient Boosting Of Regression Trees in R

Gradient boosting is a machine learning technique for “boosting,” or improving, model performance. It works by first developing an initial model, called the base learner, using an algorithm of your choice (linear, tree, etc.).

Gradient boosting then looks at the error and develops a second model using what is called a loss function. The loss function measures the difference between the current predictions and the desired values, whether that is misclassification for classification or residual error for regression. Each additional model is fit to the errors left by the previous ones, and this process continues until the desired level of accuracy is reached.

Gradient boosting is also stochastic. This means that it randomly draws from the sample as it iterates over the data, which helps to improve accuracy and reduce error.
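The idea can be sketched by hand. Below is a minimal illustration of boosting for regression with squared-error loss, using the “rpart” package for the base learner and made-up toy data; this is only a sketch of the concept, not the internals of the “gbm” package.

```r
# A minimal sketch of gradient boosting for regression with squared-error loss.
# Each round fits a small tree to the current residuals on a random subsample
# (the "stochastic" part) and adds a shrunken copy of its predictions.
library(rpart)

set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)           # toy data
dat <- data.frame(x = x, y = y)

shrinkage <- 0.1
n_trees   <- 100
pred <- rep(mean(dat$y), n)                # base learner: the mean

for (m in seq_len(n_trees)) {
  idx <- sample(n, size = 0.7 * n)         # stochastic subsample
  dat$resid <- dat$y - pred                # residuals = gradient of squared loss
  fit <- rpart(resid ~ x, data = dat[idx, ],
               control = rpart.control(maxdepth = 2))
  pred <- pred + shrinkage * predict(fit, dat)
}

mean((dat$y - pred)^2)                     # training MSE falls as trees are added
```

Each tree only has to explain what the ensemble so far got wrong, and the shrinkage keeps any single tree from dominating the final model.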

In this post, we will use gradient boosting for regression trees. In particular, we will use the “Sacramento” dataset from the “caret” package. Our goal is to predict a house’s price based on the available variables. Below is some initial code.

```r
library(caret); library(gbm); library(corrplot)
data("Sacramento")
str(Sacramento)
```

```
## 'data.frame':    932 obs. of  9 variables:
##  $ city     : Factor w/ 37 levels "ANTELOPE","AUBURN",..: 34 34 34 34 34 34 34 34 29 31 ...
##  $ zip      : Factor w/ 68 levels "z95603","z95608",..: 64 52 44 44 53 65 66 49 24 25 ...
##  $ beds     : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths    : num  1 1 1 1 1 1 2 1 2 2 ...
##  $ sqft     : int  836 1167 796 852 797 1122 1104 1177 941 1146 ...
##  $ type     : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price    : int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
##  $ latitude : num  38.6 38.5 38.6 38.6 38.5 ...
##  $ longitude: num  -121 -121 -121 -121 -121 ...
```

## Data Preparation

There are already some actions that need to be taken. We need to remove the variables “city” and “zip” because they both have a large number of factor levels. Next, we need to remove “latitude” and “longitude” because these values are hard to interpret in a housing price model. Let’s run the correlations before removing this information.

```r
corrplot(cor(Sacramento[,c(-1,-2,-6)]), method = 'number')
```

There also appears to be a high correlation between “sqft” and both “beds” and “baths”. As such, we will remove “sqft” from the model. Below is the code for the revised set of variables remaining in the model.

```r
sacto.clean <- Sacramento
sacto.clean[,c(1,2,5)] <- NULL
sacto.clean[,c(5,6)] <- NULL
str(sacto.clean)
```

```
## 'data.frame':    932 obs. of  4 variables:
##  $ beds : int  2 3 2 2 2 3 3 3 2 3 ...
##  $ baths: num  1 1 1 1 1 1 2 1 2 2 ...
##  $ type : Factor w/ 3 levels "Condo","Multi_Family",..: 3 3 3 3 3 1 3 3 1 3 ...
##  $ price: int  59222 68212 68880 69307 81900 89921 90895 91002 94905 98937 ...
```

We will now develop our training and testing sets.

```r
set.seed(502)
ind <- sample(2, nrow(sacto.clean), replace = TRUE, prob = c(.7, .3))
train <- sacto.clean[ind == 1, ]
test <- sacto.clean[ind == 2, ]
```
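As an aside, caret also provides its own splitting function, “createDataPartition”, which produces a stratified split on the outcome rather than a purely random one. A quick sketch on made-up data (the data frame here is hypothetical, not the Sacramento data):

```r
library(caret)

# Hypothetical data frame for illustration
set.seed(502)
df <- data.frame(y = rnorm(100), x = runif(100))

# Indices for a roughly 70/30 split, stratified on the outcome y
idx <- createDataPartition(df$y, p = 0.7, list = FALSE)
train_df <- df[idx, ]
test_df  <- df[-idx, ]
nrow(train_df); nrow(test_df)
```

Either approach works; the sample-based split above is simply a common base-R idiom.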

We need to create a grid in order to develop the many different potential models available. We have to tune three different parameters for gradient boosting: the number of trees, the interaction depth, and the shrinkage. The number of trees is how many trees gradient boosting will make, the interaction depth is the number of splits in each tree, and the shrinkage controls the contribution of each tree to the final model. We also have to determine the type of cross-validation using the “trainControl” function. Below is the code for the grid.

```r
grid <- expand.grid(.n.trees = seq(100, 500, by = 200),
                    .interaction.depth = seq(1, 4, by = 1),
                    .shrinkage = c(.001, .01, .1),
                    .n.minobsinnode = 10)
control <- trainControl(method = "CV")
```
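Since “expand.grid” produces every combination of the tuning values, it is worth checking how many candidate models the grid implies before training. With 3 tree counts, 4 depths, 3 shrinkage rates, and a fixed minimum node size, that is 36 models, each evaluated by cross-validation:

```r
# Every combination of the tuning values: 3 * 4 * 3 * 1 = 36 candidate models
grid <- expand.grid(.n.trees = seq(100, 500, by = 200),
                    .interaction.depth = seq(1, 4, by = 1),
                    .shrinkage = c(.001, .01, .1),
                    .n.minobsinnode = 10)
nrow(grid)  # 36
```

Keeping an eye on this count matters because training time grows with the grid size multiplied by the number of cross-validation folds.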

## Model Training

We can now train our model.

```r
gbm.train <- train(price ~ ., data = train, method = 'gbm',
                   trControl = control, tuneGrid = grid)
gbm.train
```

```
Stochastic Gradient Boosting

685 samples
4 predictors

No pre-processing
Resampling: Cross-Validated (25 fold)
Summary of sample sizes: 659, 657, 658, 657, 657, 657, ...
Resampling results across tuning parameters:

shrinkage  interaction.depth  n.trees  RMSE       Rsquared
0.001      1                  100      128372.32  0.4850879
0.001      1                  300      120272.16  0.4965552
0.001      1                  500      113986.08  0.5064680
0.001      2                  100      127197.20  0.5463527
0.001      2                  300      117228.42  0.5524074
0.001      2                  500      109634.39  0.5566431
0.001      3                  100      126633.35  0.5646994
0.001      3                  300      115873.67  0.5707619
0.001      3                  500      107850.02  0.5732942
0.001      4                  100      126361.05  0.5740655
0.001      4                  300      115269.63  0.5767396
0.001      4                  500      107109.99  0.5799836
0.010      1                  100      103554.11  0.5286663
0.010      1                  300       90114.05  0.5728993
0.010      1                  500       88327.15  0.5838981
0.010      2                  100       97876.10  0.5675862
0.010      2                  300       88260.16  0.5864650
0.010      2                  500       86773.49  0.6007150
0.010      3                  100       96138.06  0.5778062
0.010      3                  300       87213.34  0.5975438
0.010      3                  500       86309.87  0.6072987
0.010      4                  100       95260.93  0.5861798
0.010      4                  300       86962.20  0.6011429
0.010      4                  500       86380.39  0.6082593
0.100      1                  100       86808.91  0.6022690
0.100      1                  300       86081.65  0.6100963
0.100      1                  500       86197.52  0.6081493
0.100      2                  100       86810.97  0.6036919
0.100      2                  300       87251.66  0.6042293
0.100      2                  500       88396.21  0.5945206
0.100      3                  100       86649.14  0.6088309
0.100      3                  300       88565.35  0.5942948
0.100      3                  500       89971.44  0.5849622
0.100      4                  100       86922.22  0.6037571
0.100      4                  300       88629.92  0.5894188
0.100      4                  500       91008.39  0.5718534

Tuning parameter 'n.minobsinnode' was held constant at a value of 10
RMSE was used to select the optimal model using the smallest value.
The final values used for the model were n.trees = 300, interaction.depth = 1, shrinkage = 0.1 and n.minobsinnode = 10.
```

The printout shows the RMSE and R-squared values for each candidate model. At the bottom of the printout are the recommended parameters, which we use to create our final model for the test data.

```r
gbm.price <- gbm(price ~ ., data = train, n.trees = 300,
                 interaction.depth = 1, shrinkage = .1,
                 distribution = 'gaussian')
```
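One useful feature of gbm models is that “summary” reports the relative influence of each predictor, which helps interpret what the ensemble relies on. A small self-contained sketch on made-up data (the data frame and variable names here are hypothetical):

```r
library(gbm)

# Hypothetical data where x1 drives the outcome and x2 is noise
set.seed(1)
df <- data.frame(x1 = runif(100), x2 = runif(100))
df$y <- 3 * df$x1 + rnorm(100, sd = 0.1)

fit <- gbm(y ~ ., data = df, distribution = "gaussian",
           n.trees = 100, interaction.depth = 1, shrinkage = 0.1)

# Relative influence of each predictor, sorted in decreasing order
s <- summary(fit, plotit = FALSE)
s  # x1 should dominate
```

Running the same call on the fitted price model would show how much “beds”, “baths”, and “type” each contribute.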

## Test Model

Now we use the test data. Below we make predictions, calculate the error, and create a plot.

```r
gbm.test <- predict(gbm.price, newdata = test, n.trees = 300)
gbm.resid <- gbm.test - test$price
mean(gbm.resid^2)
```

```
##  8721772767
```
```r
plot(gbm.test, test$price)
```

The actual value of the mean squared error is relative and means little by itself; it is only useful when comparing one model to another. The plot, however, looks good and indicates that our model may be doing well.
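One way to give the mean squared error some context is to compare it against a naive baseline, such as always predicting the mean price. A tiny sketch with hypothetical numbers (the prices and predictions below are made up for illustration):

```r
# Mean squared error helper, used to compare any model against a baseline
mse <- function(pred, actual) mean((pred - actual)^2)

actual   <- c(200000, 250000, 300000)       # hypothetical test prices
model    <- c(210000, 240000, 310000)       # hypothetical model predictions
baseline <- rep(mean(actual), 3)            # naive model: always predict the mean

mse(model, actual)                          # 1e8
mse(baseline, actual)                       # about 1.67e9
mse(model, actual) < mse(baseline, actual)  # TRUE: the model beats the baseline
```

Applying the same comparison to the boosted model, with the training-set mean of “price” as the baseline, would show whether the 8721772767 figure above actually represents an improvement.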