In this data exploration post, we will analyze a dataset from Kaggle using R . Below are the packages we are going to use.

library(ggplot2);library(readr);library(magrittr);library(dplyr);library(tidyverse);library(data.table);library(DT);library(GGally);library(gridExtra);library(ggExtra);library(fastDummies);library(caret);library(glmnet)

Let’s look at our data for a second.

train <- read_csv("~/Downloads/train.csv")

## Parsed with column specification:
## cols(
##   .default = col_character(),
##   Id = col_integer(),
##   MSSubClass = col_integer(),
##   LotFrontage = col_integer(),
##   LotArea = col_integer(),
##   OverallQual = col_integer(),
##   OverallCond = col_integer(),
##   YearBuilt = col_integer(),
##   YearRemodAdd = col_integer(),
##   MasVnrArea = col_integer(),
##   BsmtFinSF1 = col_integer(),
##   BsmtFinSF2 = col_integer(),
##   BsmtUnfSF = col_integer(),
##   TotalBsmtSF = col_integer(),
##   `1stFlrSF` = col_integer(),
##   `2ndFlrSF` = col_integer(),
##   LowQualFinSF = col_integer(),
##   GrLivArea = col_integer(),
##   BsmtFullBath = col_integer(),
##   BsmtHalfBath = col_integer(),
##   FullBath = col_integer()
##   # ... with 18 more columns
## )

## See spec(...) for full column specifications.

Data Visualization

Lets take a look at our target variable first

p1<-train%>%
        ggplot(aes(SalePrice))+geom_histogram(bins=10,fill='red')+labs(x="Type")+ggtitle("Global")
p1

Here is the frequency of certain values for the target variable

p2<-train%>%
        mutate(tar=as.character(SalePrice))%>%
        group_by(tar)%>%
        count()%>%
        arrange(desc(n))%>%
        head(10)%>%
        ggplot(aes(reorder(tar,n,FUN=min),n))+geom_col(fill='blue')+coord_flip()+labs(x='Target',y='freq')+ggtitle('Freuency')
p2

Let’s examine the correlations. First we need to find out which variables are numeric. Then we can use ggcorr to see if there are any interesting associations. The code is as follows.

nums <- unlist(lapply(train, is.numeric))   train[ , nums]%>%
        select(-Id) %>%
        ggcorr(method =c('pairwise','spearman'),label = FALSE,angle=-0,hjust=.2)+coord_flip()

There are some strong associations in the data set. Below we see what the top 10 correlations.

n1 <- 20 
m1 <- abs(cor(train[ , nums],method='spearman'))
out <- as.table(m1) %>%
        as_data_frame %>% 
        transmute(Var1N = pmin(Var1, Var2), Var2N = pmax(Var1, Var2), n) %>% 
        distinct %>% 
        filter(Var1N != Var2N) %>% 
        arrange(desc(n)) %>%
        group_by(grp = as.integer(gl(n(), n1, n())))
out

## # A tibble: 703 x 4
## # Groups:   grp [36]
##    Var1N        Var2N            n   grp
##                     
##  1 GarageArea   GarageCars   0.853     1
##  2 1stFlrSF     TotalBsmtSF  0.829     1
##  3 GrLivArea    TotRmsAbvGrd 0.828     1
##  4 OverallQual  SalePrice    0.810     1
##  5 GrLivArea    SalePrice    0.731     1
##  6 GarageCars   SalePrice    0.691     1
##  7 YearBuilt    YearRemodAdd 0.684     1
##  8 BsmtFinSF1   BsmtFullBath 0.674     1
##  9 BedroomAbvGr TotRmsAbvGrd 0.668     1
## 10 FullBath     GrLivArea    0.658     1
## # ... with 693 more rows

There are about 4 correlations that are perhaps too strong.

Descriptive Statistics

Below are some basic descriptive statistics of our variables.

train_mean<-na.omit(train[ , nums]) %>% 
        select(-Id,-SalePrice) %>%
        summarise_all(funs(mean)) %>%
        gather(everything(),key='feature',value='mean')
train_sd<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sd)) %>%
        gather(everything(),key='feature',value='sd')
train_median<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(median)) %>%
        gather(everything(),key='feature',value='median')
stat<-na.omit(train[ , nums]) %>%
        select(-Id,-SalePrice) %>%
        summarise_all(funs(sum(.<0.001))) %>%
        gather(everything(),key='feature',value='zeros')%>%
        left_join(train_mean,by='feature')%>%
        left_join(train_median,by='feature')%>%
        left_join(train_sd,by='feature')
stat$zeropercent<-(stat$zeros/(nrow(train))*100)
stat[order(stat$zeropercent,decreasing=T),]

## # A tibble: 36 x 6
##    feature       zeros    mean median      sd zeropercent
##                            
##  1 PoolArea       1115  2.93        0  40.2          76.4
##  2 LowQualFinSF   1104  4.57        0  41.6          75.6
##  3 3SsnPorch      1103  3.35        0  29.8          75.5
##  4 MiscVal        1087 23.4         0 166.           74.5
##  5 BsmtHalfBath   1060  0.0553      0   0.233        72.6
##  6 ScreenPorch    1026 16.1         0  57.8          70.3
##  7 BsmtFinSF2      998 44.6         0 158.           68.4
##  8 EnclosedPorch   963 21.8         0  61.3          66.0
##  9 HalfBath        700  0.382       0   0.499        47.9
## 10 BsmtFullBath    668  0.414       0   0.512        45.8
## # ... with 26 more rows

We have a lot of information stored in the code above. We have the means, median and the sd in one place for all of the features. Below are visuals of this information. We add 1 to the mean and sd to preserve features that may have a mean of 0.

p1<-stat %>%
        ggplot(aes(mean+1))+geom_histogram(bins = 20,fill='red')+scale_x_log10()+labs(x="means + 1")+ggtitle("Feature means")

p2<-stat %>%
        ggplot(aes(sd+1))+geom_histogram(bins = 30,fill='red')+scale_x_log10()+labs(x="sd + 1")+ggtitle("Feature sd")

p3<-stat %>%
        ggplot(aes(median+1))+geom_histogram(bins = 20,fill='red')+labs(x="median + 1")+ggtitle("Feature median")

p4<-stat %>%
        mutate(zeros=zeros/nrow(train)*100) %>%
        ggplot(aes(zeros))+geom_histogram(bins = 20,fill='red')+labs(x="Percent of Zeros")+ggtitle("Zeros")

p5<-stat %>%
        ggplot(aes(mean+1,sd+1))+geom_point()+scale_x_log10()+scale_y_log10()+labs(x="mean + 1",y='sd + 1')+ggtitle("Feature mean & sd")
grid.arrange(p1,p2,p3,p4,p5,layout_matrix=rbind(c(1,2,3),c(4,5)))

## Warning in rbind(c(1, 2, 3), c(4, 5)): number of columns of result is not a
## multiple of vector length (arg 2)

Below we check for variables with zero variance. Such variables would cause problems if included in any model development

stat%>%
        mutate(zeros = zeros/nrow(train)*100)%>%
        filter(mean == 0 | sd == 0 | zeros==100)%>%
        DT::datatable()

	feature	zeros	mean	median	sd	zeropercent
No data available in table

There are no zero-variance features in this dataset that may need to be remove.

Correlations

Let’s look at correlation with the SalePrice variable. The plot is a histogram of all the correlations with the target variable.

sp_cor<-train[, nums] %>% 
select(-Id,-SalePrice) %>%
cor(train$SalePrice,method="spearman") %>%
as.tibble() %>%
rename(cor_p=V1)

stat<-stat%>%
#filter(sd>0)
bind_cols(sp_cor)

stat%>%
ggplot(aes(cor))+geom_histogram()+labs(x="Correlations")+ggtitle("Cors with SalePrice")

We have several high correlations but we already knew this previously. Below we have some code that provides visuals of the correlations

top<-stat%>%
        arrange(desc(cor_p))%>%
        head(10)%>%
        .$feature
p1<-train%>%
        select(SalePrice,one_of(top))%>%
        ggcorr(method=c("pairwise","pearson"),label=T,angle=-0,hjust=.2)+coord_flip()+ggtitle("Strongest Correlations")
p2<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+labs(y="OverallQual")+ggtitle("Strongest Correlation")
p3<-train%>%
        select(SalePrice, OverallQual)%>%
        ggplot(aes(SalePrice,OverallQual))+geom_point()+geom_smooth(method= 'lm')+labs(y="OverallQual")+ggtitle("Strongest Correlation")
ggMarginal(p3,type = 'histogram')
p3
grid.arrange(p1,p2,layout_matrix=rbind(c(1,2)))

The first plot show us the top correlations. Plot 1 show us the relationship between the strongest predictor and our target variable. Plot 2 shows us the trend-line and the histograms for the strongest predictor with our target variable.

The code below is for the categorical variables. Our primary goal is to see the protections inside each variable. If a categorical variable lacks variance in terms of frequencies in each category it may need to be removed for model developing purposes. Below is the code

ig_zero<-train[, nums]%>%
        na_if(0)%>%
        select(-Id,-SalePrice)%>%
        cor(train$SalePrice,use="pairwise",method="spearman")%>%
        as.tibble()%>%
        rename(cor_s0=V1)
stat<-stat%>%
        bind_cols(ig_zero)%>%
        mutate(non_zero=nrow(train)-zeros)

char <- unlist(lapply(train, is.character))  
me<-names(train[,char])

List=list()
    for (var in train[,char]){
        wow= print(prop.table(table(var)))
        List[[length(List)+1]] = wow
    }
names(List)<-me
List

This list is not printed here in order to save space

# $MSZoning
## var
##     C (all)          FV          RH          RL          RM 
## 0.006849315 0.044520548 0.010958904 0.788356164 0.149315068 
## 
## $Street
## var
##        Grvl        Pave 
## 0.004109589 0.995890411 
## 
## $Alley
## var
##      Grvl      Pave 
## 0.5494505 0.4505495 
## 
## $LotShape
## var
##         IR1         IR2         IR3         Reg 
## 0.331506849 0.028082192 0.006849315 0.633561644 
## 
## $LandContour
## var
##        Bnk        HLS        Low        Lvl 
## 0.04315068 0.03424658 0.02465753 0.89794521 
## 
## $Utilities
## var
##       AllPub       NoSeWa 
## 0.9993150685 0.0006849315 
## 
## $LotConfig
## var
##      Corner     CulDSac         FR2         FR3      Inside 
## 0.180136986 0.064383562 0.032191781 0.002739726 0.720547945 
## 
## $LandSlope
## var
##        Gtl        Mod        Sev 
## 0.94657534 0.04452055 0.00890411 
## 
## $Neighborhood
## var
##     Blmngtn     Blueste      BrDale     BrkSide     ClearCr     CollgCr 
## 0.011643836 0.001369863 0.010958904 0.039726027 0.019178082 0.102739726 
##     Crawfor     Edwards     Gilbert      IDOTRR     MeadowV     Mitchel 
## 0.034931507 0.068493151 0.054109589 0.025342466 0.011643836 0.033561644 
##       NAmes     NoRidge     NPkVill     NridgHt      NWAmes     OldTown 
## 0.154109589 0.028082192 0.006164384 0.052739726 0.050000000 0.077397260 
##      Sawyer     SawyerW     Somerst     StoneBr       SWISU      Timber 
## 0.050684932 0.040410959 0.058904110 0.017123288 0.017123288 0.026027397 
##     Veenker 
## 0.007534247 
## 
## $Condition1
## var
##      Artery       Feedr        Norm        PosA        PosN        RRAe 
## 0.032876712 0.055479452 0.863013699 0.005479452 0.013013699 0.007534247 
##        RRAn        RRNe        RRNn 
## 0.017808219 0.001369863 0.003424658 
## 
## $Condition2
## var
##       Artery        Feedr         Norm         PosA         PosN 
## 0.0013698630 0.0041095890 0.9897260274 0.0006849315 0.0013698630 
##         RRAe         RRAn         RRNn 
## 0.0006849315 0.0006849315 0.0013698630 
## 
## $BldgType
## var
##       1Fam     2fmCon     Duplex      Twnhs     TwnhsE 
## 0.83561644 0.02123288 0.03561644 0.02945205 0.07808219 
## 
## $HouseStyle
## var
##      1.5Fin      1.5Unf      1Story      2.5Fin      2.5Unf      2Story 
## 0.105479452 0.009589041 0.497260274 0.005479452 0.007534247 0.304794521 
##      SFoyer        SLvl 
## 0.025342466 0.044520548 
## 
## $RoofStyle
## var
##        Flat       Gable     Gambrel         Hip     Mansard        Shed 
## 0.008904110 0.781506849 0.007534247 0.195890411 0.004794521 0.001369863 
## 
## $RoofMatl
## var
##      ClyTile      CompShg      Membran        Metal         Roll 
## 0.0006849315 0.9821917808 0.0006849315 0.0006849315 0.0006849315 
##      Tar&Grv      WdShake      WdShngl 
## 0.0075342466 0.0034246575 0.0041095890 
## 
## $Exterior1st
## var
##      AsbShng      AsphShn      BrkComm      BrkFace       CBlock 
## 0.0136986301 0.0006849315 0.0013698630 0.0342465753 0.0006849315 
##      CemntBd      HdBoard      ImStucc      MetalSd      Plywood 
## 0.0417808219 0.1520547945 0.0006849315 0.1506849315 0.0739726027 
##        Stone       Stucco      VinylSd      Wd Sdng      WdShing 
## 0.0013698630 0.0171232877 0.3527397260 0.1410958904 0.0178082192 
## 
## $Exterior2nd
## var
##      AsbShng      AsphShn      Brk Cmn      BrkFace       CBlock 
## 0.0136986301 0.0020547945 0.0047945205 0.0171232877 0.0006849315 
##      CmentBd      HdBoard      ImStucc      MetalSd        Other 
## 0.0410958904 0.1417808219 0.0068493151 0.1465753425 0.0006849315 
##      Plywood        Stone       Stucco      VinylSd      Wd Sdng 
## 0.0972602740 0.0034246575 0.0178082192 0.3452054795 0.1349315068 
##      Wd Shng 
## 0.0260273973 
## 
## $MasVnrType
## var
##     BrkCmn    BrkFace       None      Stone 
## 0.01033058 0.30647383 0.59504132 0.08815427 
## 
## $ExterQual
## var
##          Ex          Fa          Gd          TA 
## 0.035616438 0.009589041 0.334246575 0.620547945 
## 
## $ExterCond
## var
##           Ex           Fa           Gd           Po           TA 
## 0.0020547945 0.0191780822 0.1000000000 0.0006849315 0.8780821918 
## 
## $Foundation
## var
##      BrkTil      CBlock       PConc        Slab       Stone        Wood 
## 0.100000000 0.434246575 0.443150685 0.016438356 0.004109589 0.002054795 
## 
## $BsmtQual
## var
##         Ex         Fa         Gd         TA 
## 0.08503162 0.02459592 0.43429375 0.45607871 
## 
## $BsmtCond
## var
##          Fa          Gd          Po          TA 
## 0.031623331 0.045678145 0.001405481 0.921293043 
## 
## $BsmtExposure
## var
##         Av         Gd         Mn         No 
## 0.15541491 0.09423347 0.08016878 0.67018284 
## 
## $BsmtFinType1
## var
##        ALQ        BLQ        GLQ        LwQ        Rec        Unf 
## 0.15460295 0.10400562 0.29374561 0.05200281 0.09346451 0.30217850 
## 
## $BsmtFinType2
## var
##         ALQ         BLQ         GLQ         LwQ         Rec         Unf 
## 0.013361463 0.023206751 0.009845288 0.032348805 0.037974684 0.883263010 
## 
## $Heating
## var
##        Floor         GasA         GasW         Grav         OthW 
## 0.0006849315 0.9780821918 0.0123287671 0.0047945205 0.0013698630 
##         Wall 
## 0.0027397260 
## 
## $HeatingQC
## var
##           Ex           Fa           Gd           Po           TA 
## 0.5075342466 0.0335616438 0.1650684932 0.0006849315 0.2931506849 
## 
## $CentralAir
## var
##          N          Y 
## 0.06506849 0.93493151 
## 
## $Electrical
## var
##       FuseA       FuseF       FuseP         Mix       SBrkr 
## 0.064427690 0.018505826 0.002056203 0.000685401 0.914324880 
## 
## $KitchenQual
## var
##         Ex         Fa         Gd         TA 
## 0.06849315 0.02671233 0.40136986 0.50342466 
## 
## $Functional
## var
##         Maj1         Maj2         Min1         Min2          Mod 
## 0.0095890411 0.0034246575 0.0212328767 0.0232876712 0.0102739726 
##          Sev          Typ 
## 0.0006849315 0.9315068493 
## 
## $FireplaceQu
## var
##         Ex         Fa         Gd         Po         TA 
## 0.03116883 0.04285714 0.49350649 0.02597403 0.40649351 
## 
## $GarageType
## var
##      2Types      Attchd     Basment     BuiltIn     CarPort      Detchd 
## 0.004350979 0.630891951 0.013778100 0.063814358 0.006526468 0.280638144 
## 
## $GarageFinish
## var
##       Fin       RFn       Unf 
## 0.2552574 0.3060189 0.4387237 
## 
## $GarageQual
## var
##          Ex          Fa          Gd          Po          TA 
## 0.002175489 0.034807832 0.010152284 0.002175489 0.950688905 
## 
## $GarageCond
## var
##          Ex          Fa          Gd          Po          TA 
## 0.001450326 0.025380711 0.006526468 0.005076142 0.961566352 
## 
## $PavedDrive
## var
##          N          P          Y 
## 0.06164384 0.02054795 0.91780822 
## 
## $PoolQC
## var
##        Ex        Fa        Gd 
## 0.2857143 0.2857143 0.4285714 
## 
## $Fence
## var
##      GdPrv       GdWo      MnPrv       MnWw 
## 0.20996441 0.19217082 0.55871886 0.03914591 
## 
## $MiscFeature
## var
##       Gar2       Othr       Shed       TenC 
## 0.03703704 0.03703704 0.90740741 0.01851852 
## 
## $SaleType
## var
##         COD         Con       ConLD       ConLI       ConLw         CWD 
## 0.029452055 0.001369863 0.006164384 0.003424658 0.003424658 0.002739726 
##         New         Oth          WD 
## 0.083561644 0.002054795 0.867808219 
## 
## $SaleCondition
## var
##     Abnorml     AdjLand      Alloca      Family      Normal     Partial 
## 0.069178082 0.002739726 0.008219178 0.013698630 0.820547945 0.085616438

You can judge for yourself which of these variables are appropriate or not.

Conclusion

This post provided an example of data exploration. Through this analysis we have a beter understanding of the characteristics of the dataset. This information can be used for further analyst and or model development.

educational research techniques

Research techniques and education

Data Exploration with R: Housing Prices

Data Visualization

Like this:

Related

Leave a ReplyCancel reply

Data Visualization

Share this:

Like this:

Related

Leave a ReplyCancel reply

Discover more from educational research techniques