In this data exploration post, we will analyze a dataset from Kaggle using R . Below are the packages we are going to use.
library(ggplot2);library(readr);library(magrittr);library(dplyr);library(tidyverse);library(data.table);library(DT);library(GGally);library(gridExtra);library(ggExtra);library(fastDummies);library(caret);library(glmnet)
Let’s look at our data for a second.
train <- read_csv("~/Downloads/train.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## Id = col_integer(),
## MSSubClass = col_integer(),
## LotFrontage = col_integer(),
## LotArea = col_integer(),
## OverallQual = col_integer(),
## OverallCond = col_integer(),
## YearBuilt = col_integer(),
## YearRemodAdd = col_integer(),
## MasVnrArea = col_integer(),
## BsmtFinSF1 = col_integer(),
## BsmtFinSF2 = col_integer(),
## BsmtUnfSF = col_integer(),
## TotalBsmtSF = col_integer(),
## `1stFlrSF` = col_integer(),
## `2ndFlrSF` = col_integer(),
## LowQualFinSF = col_integer(),
## GrLivArea = col_integer(),
## BsmtFullBath = col_integer(),
## BsmtHalfBath = col_integer(),
## FullBath = col_integer()
## # ... with 18 more columns
## )
## See spec(...) for full column specifications.
Data Visualization
Lets take a look at our target variable first
p1<-train%>%
ggplot(aes(SalePrice))+geom_histogram(bins=10,fill='red')+labs(x="Type")+ggtitle("Global")
p1
Here is the frequency of certain values for the target variable
p2<-train%>%
mutate(tar=as.character(SalePrice))%>%
group_by(tar)%>%
count()%>%
arrange(desc(n))%>%
head(10)%>%
ggplot(aes(reorder(tar,n,FUN=min),n))+geom_col(fill='blue')+coord_flip()+labs(x='Target',y='freq')+ggtitle('Freuency')
p2
Let’s examine the correlations. First we need to find out which variables are numeric. Then we can use ggcorr to see if there are any interesting associations. The code is as follows.
nums <- unlist(lapply(train, is.numeric)) train[ , nums]%>%
select(-Id) %>%
ggcorr(method =c('pairwise','spearman'),label = FALSE,angle=-0,hjust=.2)+coord_flip()
There are some strong associations in the data set. Below we see what the top 10 correlations.
n1 <- 20
m1 <- abs(cor(train[ , nums],method='spearman'))
out <- as.table(m1) %>%
as_data_frame %>%
transmute(Var1N = pmin(Var1, Var2), Var2N = pmax(Var1, Var2), n) %>%
distinct %>%
filter(Var1N != Var2N) %>%
arrange(desc(n)) %>%
group_by(grp = as.integer(gl(n(), n1, n())))
out
## # A tibble: 703 x 4
## # Groups: grp [36]
## Var1N Var2N n grp
##
## 1 GarageArea GarageCars 0.853 1
## 2 1stFlrSF TotalBsmtSF 0.829 1
## 3 GrLivArea TotRmsAbvGrd 0.828 1
## 4 OverallQual SalePrice 0.810 1
## 5 GrLivArea SalePrice 0.731 1
## 6 GarageCars SalePrice 0.691 1
## 7 YearBuilt YearRemodAdd 0.684 1
## 8 BsmtFinSF1 BsmtFullBath 0.674 1
## 9 BedroomAbvGr TotRmsAbvGrd 0.668 1
## 10 FullBath GrLivArea 0.658 1
## # ... with 693 more rows
There are about 4 correlations that are perhaps too strong.
Descriptive Statistics
Below are some basic descriptive statistics of our variables.
train_mean<-na.omit(train[ , nums]) %>%
select(-Id,-SalePrice) %>%
summarise_all(funs(mean)) %>%
gather(everything(),key='feature',value='mean')
train_sd<-na.omit(train[ , nums]) %>%
select(-Id,-SalePrice) %>%
summarise_all(funs(sd)) %>%
gather(everything(),key='feature',value='sd')
train_median<-na.omit(train[ , nums]) %>%
select(-Id,-SalePrice) %>%
summarise_all(funs(median)) %>%
gather(everything(),key='feature',value='median')
stat<-na.omit(train[ , nums]) %>%
select(-Id,-SalePrice) %>%
summarise_all(funs(sum(.<0.001))) %>%
gather(everything(),key='feature',value='zeros')%>%
left_join(train_mean,by='feature')%>%
left_join(train_median,by='feature')%>%
left_join(train_sd,by='feature')
stat$zeropercent<-(stat$zeros/(nrow(train))*100)
stat[order(stat$zeropercent,decreasing=T),]
## # A tibble: 36 x 6
## feature zeros mean median sd zeropercent
##
## 1 PoolArea 1115 2.93 0 40.2 76.4
## 2 LowQualFinSF 1104 4.57 0 41.6 75.6
## 3 3SsnPorch 1103 3.35 0 29.8 75.5
## 4 MiscVal 1087 23.4 0 166. 74.5
## 5 BsmtHalfBath 1060 0.0553 0 0.233 72.6
## 6 ScreenPorch 1026 16.1 0 57.8 70.3
## 7 BsmtFinSF2 998 44.6 0 158. 68.4
## 8 EnclosedPorch 963 21.8 0 61.3 66.0
## 9 HalfBath 700 0.382 0 0.499 47.9
## 10 BsmtFullBath 668 0.414 0 0.512 45.8
## # ... with 26 more rows
We have a lot of information stored in the code above. We have the means, median and the sd in one place for all of the features. Below are visuals of this information. We add 1 to the mean and sd to preserve features that may have a mean of 0.
p1<-stat %>%
ggplot(aes(mean+1))+geom_histogram(bins = 20,fill='red')+scale_x_log10()+labs(x="means + 1")+ggtitle("Feature means")
p2<-stat %>%
ggplot(aes(sd+1))+geom_histogram(bins = 30,fill='red')+scale_x_log10()+labs(x="sd + 1")+ggtitle("Feature sd")
p3<-stat %>%
ggplot(aes(median+1))+geom_histogram(bins = 20,fill='red')+labs(x="median + 1")+ggtitle("Feature median")
p4<-stat %>%
mutate(zeros=zeros/nrow(train)*100) %>%
ggplot(aes(zeros))+geom_histogram(bins = 20,fill='red')+labs(x="Percent of Zeros")+ggtitle("Zeros")
p5<-stat %>%
ggplot(aes(mean+1,sd+1))+geom_point()+scale_x_log10()+scale_y_log10()+labs(x="mean + 1",y='sd + 1')+ggtitle("Feature mean & sd")
grid.arrange(p1,p2,p3,p4,p5,layout_matrix=rbind(c(1,2,3),c(4,5)))
## Warning in rbind(c(1, 2, 3), c(4, 5)): number of columns of result is not a
## multiple of vector length (arg 2)
Below we check for variables with zero variance. Such variables would cause problems if included in any model development
stat%>%
mutate(zeros = zeros/nrow(train)*100)%>%
filter(mean == 0 | sd == 0 | zeros==100)%>%
DT::datatable()
There are no zero-variance features in this dataset that may need to be remove.
Correlations
Let’s look at correlation with the SalePrice variable. The plot is a histogram of all the correlations with the target variable.
sp_cor<-train[, nums] %>% select(-Id,-SalePrice) %>% cor(train$SalePrice,method="spearman") %>% as.tibble() %>% rename(cor_p=V1) stat<-stat%>% #filter(sd>0) bind_cols(sp_cor) stat%>% ggplot(aes(cor))+geom_histogram()+labs(x="Correlations")+ggtitle("Cors with SalePrice")
We have several high correlations but we already knew this previously. Below we have some code that provides visuals of the correlations
top<-stat%>% arrange(desc(cor_p))%>% head(10)%>% .$feature p1<-train%>% select(SalePrice,one_of(top))%>% ggcorr(method=c("pairwise","pearson"),label=T,angle=-0,hjust=.2)+coord_flip()+ggtitle("Strongest Correlations") p2<-train%>% select(SalePrice, OverallQual)%>% ggplot(aes(SalePrice,OverallQual))+geom_point()+labs(y="OverallQual")+ggtitle("Strongest Correlation") p3<-train%>% select(SalePrice, OverallQual)%>% ggplot(aes(SalePrice,OverallQual))+geom_point()+geom_smooth(method= 'lm')+labs(y="OverallQual")+ggtitle("Strongest Correlation") ggMarginal(p3,type = 'histogram') p3 grid.arrange(p1,p2,layout_matrix=rbind(c(1,2)))
The first plot show us the top correlations. Plot 1 show us the relationship between the strongest predictor and our target variable. Plot 2 shows us the trend-line and the histograms for the strongest predictor with our target variable.
The code below is for the categorical variables. Our primary goal is to see the protections inside each variable. If a categorical variable lacks variance in terms of frequencies in each category it may need to be removed for model developing purposes. Below is the code
ig_zero<-train[, nums]%>% na_if(0)%>% select(-Id,-SalePrice)%>% cor(train$SalePrice,use="pairwise",method="spearman")%>% as.tibble()%>% rename(cor_s0=V1) stat<-stat%>% bind_cols(ig_zero)%>% mutate(non_zero=nrow(train)-zeros) char <- unlist(lapply(train, is.character)) me<-names(train[,char]) List=list() for (var in train[,char]){ wow= print(prop.table(table(var))) List[[length(List)+1]] = wow } names(List)<-me List
This list is not printed here in order to save space
# $MSZoning
## var
## C (all) FV RH RL RM
## 0.006849315 0.044520548 0.010958904 0.788356164 0.149315068
##
## $Street
## var
## Grvl Pave
## 0.004109589 0.995890411
##
## $Alley
## var
## Grvl Pave
## 0.5494505 0.4505495
##
## $LotShape
## var
## IR1 IR2 IR3 Reg
## 0.331506849 0.028082192 0.006849315 0.633561644
##
## $LandContour
## var
## Bnk HLS Low Lvl
## 0.04315068 0.03424658 0.02465753 0.89794521
##
## $Utilities
## var
## AllPub NoSeWa
## 0.9993150685 0.0006849315
##
## $LotConfig
## var
## Corner CulDSac FR2 FR3 Inside
## 0.180136986 0.064383562 0.032191781 0.002739726 0.720547945
##
## $LandSlope
## var
## Gtl Mod Sev
## 0.94657534 0.04452055 0.00890411
##
## $Neighborhood
## var
## Blmngtn Blueste BrDale BrkSide ClearCr CollgCr
## 0.011643836 0.001369863 0.010958904 0.039726027 0.019178082 0.102739726
## Crawfor Edwards Gilbert IDOTRR MeadowV Mitchel
## 0.034931507 0.068493151 0.054109589 0.025342466 0.011643836 0.033561644
## NAmes NoRidge NPkVill NridgHt NWAmes OldTown
## 0.154109589 0.028082192 0.006164384 0.052739726 0.050000000 0.077397260
## Sawyer SawyerW Somerst StoneBr SWISU Timber
## 0.050684932 0.040410959 0.058904110 0.017123288 0.017123288 0.026027397
## Veenker
## 0.007534247
##
## $Condition1
## var
## Artery Feedr Norm PosA PosN RRAe
## 0.032876712 0.055479452 0.863013699 0.005479452 0.013013699 0.007534247
## RRAn RRNe RRNn
## 0.017808219 0.001369863 0.003424658
##
## $Condition2
## var
## Artery Feedr Norm PosA PosN
## 0.0013698630 0.0041095890 0.9897260274 0.0006849315 0.0013698630
## RRAe RRAn RRNn
## 0.0006849315 0.0006849315 0.0013698630
##
## $BldgType
## var
## 1Fam 2fmCon Duplex Twnhs TwnhsE
## 0.83561644 0.02123288 0.03561644 0.02945205 0.07808219
##
## $HouseStyle
## var
## 1.5Fin 1.5Unf 1Story 2.5Fin 2.5Unf 2Story
## 0.105479452 0.009589041 0.497260274 0.005479452 0.007534247 0.304794521
## SFoyer SLvl
## 0.025342466 0.044520548
##
## $RoofStyle
## var
## Flat Gable Gambrel Hip Mansard Shed
## 0.008904110 0.781506849 0.007534247 0.195890411 0.004794521 0.001369863
##
## $RoofMatl
## var
## ClyTile CompShg Membran Metal Roll
## 0.0006849315 0.9821917808 0.0006849315 0.0006849315 0.0006849315
## Tar&Grv WdShake WdShngl
## 0.0075342466 0.0034246575 0.0041095890
##
## $Exterior1st
## var
## AsbShng AsphShn BrkComm BrkFace CBlock
## 0.0136986301 0.0006849315 0.0013698630 0.0342465753 0.0006849315
## CemntBd HdBoard ImStucc MetalSd Plywood
## 0.0417808219 0.1520547945 0.0006849315 0.1506849315 0.0739726027
## Stone Stucco VinylSd Wd Sdng WdShing
## 0.0013698630 0.0171232877 0.3527397260 0.1410958904 0.0178082192
##
## $Exterior2nd
## var
## AsbShng AsphShn Brk Cmn BrkFace CBlock
## 0.0136986301 0.0020547945 0.0047945205 0.0171232877 0.0006849315
## CmentBd HdBoard ImStucc MetalSd Other
## 0.0410958904 0.1417808219 0.0068493151 0.1465753425 0.0006849315
## Plywood Stone Stucco VinylSd Wd Sdng
## 0.0972602740 0.0034246575 0.0178082192 0.3452054795 0.1349315068
## Wd Shng
## 0.0260273973
##
## $MasVnrType
## var
## BrkCmn BrkFace None Stone
## 0.01033058 0.30647383 0.59504132 0.08815427
##
## $ExterQual
## var
## Ex Fa Gd TA
## 0.035616438 0.009589041 0.334246575 0.620547945
##
## $ExterCond
## var
## Ex Fa Gd Po TA
## 0.0020547945 0.0191780822 0.1000000000 0.0006849315 0.8780821918
##
## $Foundation
## var
## BrkTil CBlock PConc Slab Stone Wood
## 0.100000000 0.434246575 0.443150685 0.016438356 0.004109589 0.002054795
##
## $BsmtQual
## var
## Ex Fa Gd TA
## 0.08503162 0.02459592 0.43429375 0.45607871
##
## $BsmtCond
## var
## Fa Gd Po TA
## 0.031623331 0.045678145 0.001405481 0.921293043
##
## $BsmtExposure
## var
## Av Gd Mn No
## 0.15541491 0.09423347 0.08016878 0.67018284
##
## $BsmtFinType1
## var
## ALQ BLQ GLQ LwQ Rec Unf
## 0.15460295 0.10400562 0.29374561 0.05200281 0.09346451 0.30217850
##
## $BsmtFinType2
## var
## ALQ BLQ GLQ LwQ Rec Unf
## 0.013361463 0.023206751 0.009845288 0.032348805 0.037974684 0.883263010
##
## $Heating
## var
## Floor GasA GasW Grav OthW
## 0.0006849315 0.9780821918 0.0123287671 0.0047945205 0.0013698630
## Wall
## 0.0027397260
##
## $HeatingQC
## var
## Ex Fa Gd Po TA
## 0.5075342466 0.0335616438 0.1650684932 0.0006849315 0.2931506849
##
## $CentralAir
## var
## N Y
## 0.06506849 0.93493151
##
## $Electrical
## var
## FuseA FuseF FuseP Mix SBrkr
## 0.064427690 0.018505826 0.002056203 0.000685401 0.914324880
##
## $KitchenQual
## var
## Ex Fa Gd TA
## 0.06849315 0.02671233 0.40136986 0.50342466
##
## $Functional
## var
## Maj1 Maj2 Min1 Min2 Mod
## 0.0095890411 0.0034246575 0.0212328767 0.0232876712 0.0102739726
## Sev Typ
## 0.0006849315 0.9315068493
##
## $FireplaceQu
## var
## Ex Fa Gd Po TA
## 0.03116883 0.04285714 0.49350649 0.02597403 0.40649351
##
## $GarageType
## var
## 2Types Attchd Basment BuiltIn CarPort Detchd
## 0.004350979 0.630891951 0.013778100 0.063814358 0.006526468 0.280638144
##
## $GarageFinish
## var
## Fin RFn Unf
## 0.2552574 0.3060189 0.4387237
##
## $GarageQual
## var
## Ex Fa Gd Po TA
## 0.002175489 0.034807832 0.010152284 0.002175489 0.950688905
##
## $GarageCond
## var
## Ex Fa Gd Po TA
## 0.001450326 0.025380711 0.006526468 0.005076142 0.961566352
##
## $PavedDrive
## var
## N P Y
## 0.06164384 0.02054795 0.91780822
##
## $PoolQC
## var
## Ex Fa Gd
## 0.2857143 0.2857143 0.4285714
##
## $Fence
## var
## GdPrv GdWo MnPrv MnWw
## 0.20996441 0.19217082 0.55871886 0.03914591
##
## $MiscFeature
## var
## Gar2 Othr Shed TenC
## 0.03703704 0.03703704 0.90740741 0.01851852
##
## $SaleType
## var
## COD Con ConLD ConLI ConLw CWD
## 0.029452055 0.001369863 0.006164384 0.003424658 0.003424658 0.002739726
## New Oth WD
## 0.083561644 0.002054795 0.867808219
##
## $SaleCondition
## var
## Abnorml AdjLand Alloca Family Normal Partial
## 0.069178082 0.002739726 0.008219178 0.013698630 0.820547945 0.085616438
You can judge for yourself which of these variables are appropriate or not.
Conclusion
This post provided an example of data exploration. Through this analysis we have a beter understanding of the characteristics of the dataset. This information can be used for further analyst and or model development.