Principal Component Regression in R

This post will explain and provide an example of principal component regression (PCR). Principal component regression has the model construct components that are linear combinations of the independent variables, exactly as in principal component analysis, and then regresses the dependent variable on those components. Because the components are built to capture as much of the variance in the predictors as possible rather than to explain the dependent variable, PCR often lets you use far fewer variables in your model and can improve predictive performance as well.

Since PCR is based on principal component analysis it is an unsupervised method, which means the dependent variable has no influence on the development of the components. As such, there are times when the components that are developed may not be beneficial for explaining the dependent variable.
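
To make the idea concrete, here is a minimal sketch of PCR done by hand with base R, using the built-in "mtcars" data rather than the dataset in this post: run PCA on the predictors, keep a few component scores, and regress the outcome on those scores. This is only an illustration of the logic; the "pcr" function used later handles all of this (plus cross-validation) for you.

pc<-prcomp(mtcars[,c("cyl","disp","hp","wt")],scale.=TRUE) #PCA on the predictors only
scores<-as.data.frame(pc$x[,1:2]) #keep the first two component scores
scores$mpg<-mtcars$mpg #attach the outcome
fit<-lm(mpg~PC1+PC2,data=scores) #regress the outcome on the component scores
summary(fit)$r.squared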

Our example will use the “Mroz” dataset from the “Ecdat” package. Our goal will be to predict “income” based on the other variables in the dataset. Below is the initial code.

library(pls);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

Our first step is to divide our dataset into a train and test set. We will do a simple 50/50 split for this demonstration.

train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split
test<-(!train)

In the code above we use the “sample” function to create a “train” index based on the number of rows in the “Mroz” dataset. In other words, R builds a vector that randomly marks each row of “Mroz” as TRUE or FALSE. Next, we create the test index by negating the “train” vector with the exclamation mark, so every row that is not in the training set ends up in the test set.

We are now ready to develop our model. Below is the code.

set.seed(777)
pcr.fit<-pcr(income~.,data=Mroz,subset=train,scale=T,validation="CV")

To make our model we use the “pcr” function from the “pls” package. The “subset” argument tells R to use the “train” vector to select examples from the “Mroz” dataset. The “scale” argument standardizes the predictors so they are all measured on the same scale. This matters for any component-based method because variables on different scales would otherwise have different amounts of influence on the components. Lastly, the “validation” argument enables cross-validation, which will help us determine the number of components to use for prediction. Below are the results of the model using the “summary” function.

summary(pcr.fit)
## Data:    X dimension: 381 17 
##  Y dimension: 381 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           12102    11533    11017     9863     9884     9524     9563
## adjCV        12102    11534    11011     9855     9878     9502     9596
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        9149     9133     8811      8527      7265      7234      7120
## adjCV     9126     9123     8798      8877      7199      7172      7100
##        14 comps  15 comps  16 comps  17 comps
## CV         7118      7141      6972      6992
## adjCV      7100      7123      6951      6969
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        21.359    38.71    51.99    59.67    65.66    71.20    76.28
## income    9.927    19.50    35.41    35.63    41.28    41.28    46.75
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         80.70    84.39     87.32     90.15     92.65     95.02     96.95
## income    47.08    50.98     51.73     68.17     68.29     68.31     68.34
##         15 comps  16 comps  17 comps
## X          98.47     99.38    100.00
## income     68.48     70.29     70.39

There is a lot of information here. The VALIDATION: RMSEP section gives the cross-validated root mean squared error of prediction broken down by number of components. The TRAINING section is similar to the printout of any PCA: it shows the cumulative percentage of predictor variance captured by the components, as well as the variance explained in the dependent variable “income.” In this model, we are able to explain up to about 70% of the variance in “income” if we use all 17 components.
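
If you want these numbers without scanning the full summary, the “pls” package also has helper functions for pulling them out directly. A quick sketch (output omitted here):

RMSEP(pcr.fit) #cross-validated RMSEP for each number of components
cumsum(explvar(pcr.fit)) #cumulative % of predictor variance explained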

We can graph the MSE using the “validationplot” function with the argument “val.type” set to “MSEP”. The code is below.

validationplot(pcr.fit,val.type = "MSEP")

[Plot: cross-validated MSEP by number of components]

How many components to pick is subjective; however, there is almost no improvement beyond 13 components, so we will use 13 in our prediction model and calculate the mean squared error.
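
If you would rather not eyeball the plot, the “pls” package also provides the “selectNcomp” function, which suggests a number of components based on the cross-validation results. A quick sketch; note that its suggestion may differ from the 13 components chosen here.

selectNcomp(pcr.fit,method="onesigma",plot=TRUE) #smallest model within one standard error of the best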

set.seed(777)
pcr.pred<-predict(pcr.fit,Mroz[test,],ncomp=13)
mean((pcr.pred-Mroz$income[test])^2)
## [1] 48958982

The MSE is what you would use to compare this model to other models that you develop. Below is the performance of a least squares regression model.

set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
## [1] 47794472

If you compare the MSE values, the least squares model performs slightly better than the PCR model. However, there are a lot of non-significant features in the model, as shown below.

summary(lm.fit)
## 
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27646  -3337  -1387   1860  48371 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.215e+04  3.987e+03  -5.556 5.35e-08 ***
## workno      -3.828e+03  1.316e+03  -2.909  0.00385 ** 
## hoursw       3.955e+00  7.085e-01   5.582 4.65e-08 ***
## child6       5.370e+02  8.241e+02   0.652  0.51512    
## child618     4.250e+02  2.850e+02   1.491  0.13673    
## agew         1.962e+02  9.849e+01   1.992  0.04709 *  
## educw        1.097e+02  2.276e+02   0.482  0.63013    
## hearnw       9.835e+02  2.303e+02   4.270 2.50e-05 ***
## wagew        2.292e+02  2.423e+02   0.946  0.34484    
## hoursh       6.386e+00  6.144e-01  10.394  < 2e-16 ***
## ageh        -1.284e+01  9.762e+01  -0.132  0.89542    
## educh        1.460e+02  1.592e+02   0.917  0.35982    
## wageh        2.083e+03  9.930e+01  20.978  < 2e-16 ***
## educwm       1.354e+02  1.335e+02   1.014  0.31115    
## educwf       1.653e+02  1.257e+02   1.315  0.18920    
## unemprate   -1.213e+02  1.148e+02  -1.057  0.29140    
## cityyes     -2.064e+02  7.905e+02  -0.261  0.79421    
## experience  -1.165e+02  5.393e+01  -2.159  0.03147 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6729 on 363 degrees of freedom
## Multiple R-squared:  0.7039, Adjusted R-squared:   0.69 
## F-statistic: 50.76 on 17 and 363 DF,  p-value: < 2.2e-16

After removing these, the MSE is almost the same for the PCR and least squares models.

set.seed(777)
lm.fit2<-lm(income~work+hoursw+hearnw+hoursh+wageh,data=Mroz,subset=train)
lm.pred2<-predict(lm.fit2,Mroz[test,])
mean((lm.pred2-Mroz$income[test])^2)
## [1] 47968996

Conclusion

Since the least squares model is simpler, it is probably the superior model here. PCR is strongest when there are a lot of variables involved and when there are issues with multicollinearity.


Accommodation Theory

Accommodation theory attempts to explain how people adjust the way they talk depending on who the audience is. Generally, there are two ways in which a person can adjust their speech. The two ways are convergence and divergence. In this post, we will look at these two ways of accommodating.

Speech Convergence

Converging is when you change the way you talk to sound more like the person you are talking to. This is seen as polite in many cultures and signals that you accept the person you are speaking with.

There are many different ways in which convergence can take place. The speaker may begin to use similar vocabulary or imitate the pronunciation of the person they are talking to. Another common way is to translate technical jargon into simpler English.

Speech Divergence

Speech divergence is often seen as the opposite of speech convergence. It involves deliberately selecting a style of language different from that of the other speaker, which often communicates dissatisfaction with the person you are speaking with. For example, most teenagers deliberately speak differently from their parents. This helps them identify with their peers and distance themselves from their parents.

However, a slight divergence is expected of non-native speakers. Many people enjoy the accents of foreign athletes and actresses, and having perfect control of two languages is at times even viewed negatively in some parts of the world.

A famous example of speech divergence is the speaking style of former Federal Reserve Chairman Alan Greenspan, known as ‘Fedspeak.’ Fedspeak was used whenever Greenspan appeared before Congress or made announcements about changing the Federal Reserve interest rate. The goal of this form of communication was to sound as divergent and incoherent as possible. Below is an example.

The members of the Board of Governors and the Reserve Bank presidents foresee an implicit strengthening of activity after the current rebalancing is over, although the central tendency of their individual forecasts for real GDP still shows a substantial slowdown, on balance, for the year as a whole.

This makes little sense unless you have an MBA in finance. It sounds like he sees no change in the growth of the economy.

The reason behind this mysterious form of communication was that people placed a strong emphasis on whatever the Federal Reserve and Alan Greenspan said, which led to swings in the stock market. To prevent this, Greenspan diverged his language to make it as confusing as possible and avoid causing massive changes in the stock market.

Conclusion 

When communicating, we can choose to adapt ourselves or deliberately diverge. Which choice we make depends a great deal on the context in which we find ourselves.

Ways Language Change Spreads

All languages change. If there is any doubt, just pick up a book that is over 100 years old; regardless of the language, it will at a minimum sound slightly different from current usage, if not radically different.

In this post, we will look at three common ways in which language change spreads. These three ways are…

  • From one group to another
  • From one style to another
  • Lexical diffusion

Changes from Group to Group

This view of language change is that changes in a language move from one group to another. A group can be any sort of social or work circle. Examples include family, colleagues, church affiliation, etc.

The change of language in a group is often facilitated by “gatekeepers.” Gatekeepers are people who are members of different groups. Most people are members of many different groups at the same time.

What happens is that a person picks up language in one group and shares this style of communication in another.  An example would be a child learning slang at school and using it at home. Naturally, language change moves at different speeds in different groups depending on the acceptability of the change.

Changes from Style to Style

A style is a way of communicating. In simple terms, a person’s style can be formal or informal, with varying shades of grey in between. These two extremes can also be viewed as prestigious vs. not prestigious.

Normally, formal/prestigious styles of language move down into informal styles of language. For example, a movie star or some other celebrity speaks a certain way and this style is transferred downward among those who are not so famous.

There are times when informal, non-prestigious language change spreads upward. Normally, this happens much more slowly than change moving downward. In addition, it usually involves words and styles that are so old that what really happened is the young people who used them became part of the “establishment” in middle age and kept using the style. For example, the word “cool” used to be slang but is now commonly used among some of the most elite leaders of the world. Therefore, it wasn’t so much the language that changed as the people who used it, with the passing of one generation to another.

Lexical Diffusion

Lexical diffusion is the change of how a word is pronounced. This is often an exceedingly slow process and can take centuries. The English language is full of words that have strange pronunciations when considering the spelling. This is due to English being thoroughly influenced by other languages such as French and Latin.

Conclusion

These three theories are just some of the ways language change can spread. In addition, it may not be practical to think of them as each happening independently of the others. Rather, these three processes can often be seen as working at the same time to slowly change a language over time.

Example of Best Subset Regression in R

This post will provide an example of best subset regression. This is a topic that has been covered before in this blog. However, in the current post, we will approach it using slightly different code and a different dataset. We will be using the “HI” dataset from the “Ecdat” package. Our goal will be to predict the number of hours a woman works based on the other variables in the dataset. Below is some initial code.

library(leaps);library(Ecdat)
data(HI)
str(HI)
## 'data.frame':    22272 obs. of  13 variables:
##  $ whrswk    : int  0 50 40 40 0 40 40 25 45 30 ...
##  $ hhi       : Factor w/ 2 levels "no","yes": 1 1 2 1 2 2 2 1 1 1 ...
##  $ whi       : Factor w/ 2 levels "no","yes": 1 2 1 2 1 2 1 1 2 1 ...
##  $ hhi2      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 2 1 1 2 ...
##  $ education : Ord.factor w/ 6 levels "<9years"<"9-11years"<..: 4 4 3 4 2 3 5 3 5 4 ...
##  $ race      : Factor w/ 3 levels "white","black",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ hispanic  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ experience: num  13 24 43 17 44.5 32 14 1 4 7 ...
##  $ kidslt6   : int  2 0 0 0 0 0 0 1 0 1 ...
##  $ kids618   : int  1 1 0 1 0 0 0 0 0 0 ...
##  $ husby     : num  12 1.2 31.3 9 0 ...
##  $ region    : Factor w/ 4 levels "other","northcentral",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ wght      : int  214986 210119 219955 210317 219955 208148 213615 181960 214874 214874 ...

To develop a model we use the “regsubsets” function from the “leaps” package. Most of the coding is the same as for linear regression. The only difference is the “nvmax” argument, which is set to 13. The default setting for “nvmax” is 8, which is fine if you only have 8 variables. However, the results from the “str” function indicate that we have 13 variables. Therefore, we need to set the “nvmax” argument to 13 instead of the default value of 8 in order to be sure all variables are considered. Below is the code.

regfit.full<-regsubsets(whrswk~.,HI, nvmax = 13)

We can look at the results with the “summary” function. For space reasons, the code is shown but the results will not be shown here.

summary(regfit.full)

If you run the code above on your computer, you will see a column for each of the variables. A star in a column means that the variable is included in that model. To the left are the numbers 1-13: one means the model has one variable, two means two variables, and so on.
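
The “leaps” package can also display this table graphically through the plot method for “regsubsets” objects. The sketch below shades the variables included in each model, ranked by BIC or adjusted r-square.

plot(regfit.full,scale="bic") #one row per model; shaded cells are included variables
plot(regfit.full,scale="adjr2") #same idea, ranked by adjusted r-square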

Our next step is to determine which of these models is the best. First, we need to decide what our criteria for inclusion will be. Below is a list of available fit indices.

names(summary(regfit.full))
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"

For our purposes, we will use “rsq” (r-square) and “bic” (Bayesian Information Criterion). In the code below we save the values of these two fit indices in their own objects.

rsq<-summary(regfit.full)$rsq
bic<-summary(regfit.full)$bic

Now let’s plot them

plot(rsq,type='l',main="R-Square",xlab="Number of Variables")

[Plot: r-square by number of variables]

plot(bic,type='l',main="BIC",xlab="Number of Variables")

[Plot: BIC by number of variables]

You can see that the r-square values increase and the BIC values decrease as variables are added. We will now make both of these plots again, but this time we will have R tell us the optimal number of variables for each fit index. For this we use the “which.max” and “which.min” functions to find the maximum r-square and the minimum BIC.

which.max(rsq)
## [1] 13
which.min(bic)
## [1] 12

The model with the best r-square is the one with 13 variables. This makes sense, as r-square always improves as you add variables. Since this is a demonstration we will not correct for this. For BIC the lowest value was for 12 variables. We will now plot this information and highlight the best model in each plot using the “points” function, which allows you to emphasize a single point on a graph.

plot(rsq,type='l',main="R-Square with Best Model Highlighted",xlab="Number of Variables")
points(13,(rsq[13]),col="blue",cex=7,pch=20)

[Plot: r-square with the best model highlighted]

plot(bic,type='l',main="BIC with Best Model Highlighted",xlab="Number of Variables")
points(12,(bic[12]),col="blue",cex=7,pch=20)

[Plot: BIC with the best model highlighted]

Since BIC calls for only 12 variables, it is simpler than the r-square recommendation of 13. Therefore, we will report the coefficients of the final model using the BIC recommendation of 12 variables. Below is the code.

coef(regfit.full,12)
##        (Intercept)             hhiyes             whiyes 
##        30.31321796         1.16940604        18.25380263 
##        education.L        education^4        education^5 
##         6.63847641         1.54324869        -0.77783663 
##          raceblack        hispanicyes         experience 
##         3.06580207        -1.33731802        -0.41883100 
##            kidslt6            kids618              husby 
##        -6.02251640        -0.82955827        -0.02129349 
## regionnorthcentral 
##         0.94042820

So here is our final model. This is what we would use for our test set.
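
For completeness, here is one way such a test-set prediction could be sketched. Keep in mind that “regsubsets” has no built-in predict method and that the “HI.test” data frame below is hypothetical, since this post never actually splits the “HI” data.

test.mat<-model.matrix(whrswk~.,data=HI.test) #HI.test is an assumed holdout set with the same columns as HI
best.coef<-coef(regfit.full,12) #coefficients of the 12-variable model
pred<-test.mat[,names(best.coef)]%*%best.coef #manual prediction
mean((HI.test$whrswk-pred)^2) #test-set MSE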

Conclusion

Best subset regression provides the researcher with insights into every possible model as well as clues as to which model is at least statistically superior. This knowledge can be used for developing models for data science applications.

Koines

There are many different ways in which languages, or varieties of a language, can interact with each other. One way is how different dialects of a language interact. A koine is a dialect that is a blend of other dialects that have been in direct contact with each other.

In this post, we will discuss the following:

  • Koine vs pidgins and creoles
  • Development of koines

Koine vs Pidgins and Creoles

Koine is a lesser-known term to the general public in comparison to pidgin and creole. The word “koine” comes from the Greek word that means “common.” In other words, a koine is what two or more dialects have in common. Those familiar with biblical languages will know that “Koine” Greek is the language of the New Testament, which means the New Testament was written in common Greek.

As mentioned in the introduction, a koine is the product of the interaction of two or more dialects to create a new common dialect. In contrast, a pidgin is the interaction of two languages to make a new third language. A pidgin can eventually mature into a creole, which is a pidgin that is spoken as the native language of a community. There seems to be no corresponding term for a koine that is spoken natively.

Developing a Koine

The process of developing a koine is known as koineization. There are both linguistic and social factors that contribute to the development of a koine. The linguistic factors include levelling and simplification, while the social factors involve accommodation and prestige.

Levelling is the elimination from the koine of distinctive sounds from the colliding dialects. An example of this is the loss of the post-vocalic [r] in parts of England.

Simplification is the process by which the simplest characteristics of the dialects are included in the new koine. The dialect with fewer rules and exceptions almost always has more influence on the koine.

The social factor of accommodation means that people will copy something they believe is prestigious or “cool.” If a certain dialect is considered unacceptable, it will not contribute to a koine because of people’s dislike of it. As such, accommodation and prestige are interrelated concepts.

Conclusion

Koine and koineization are two words that help to explain where dialects come from. As such, for those who have an interest in linguistics, these are terms to be familiar with.

High Dimensionality Regression

There are times when least squares regression is not able to provide accurate predictions or explanations. One situation in which least squares regression struggles is a small sample size, specifically when the total number of variables is greater than the sample size. Another term for this is high dimensions, which means there are more variables than examples in the dataset.

This post will explain the consequences of high dimensionality and how to address the problem.

Inaccurate measurements

One problem with high dimensions in regression is that the results for the various metrics are overfitted to the data. Below is an example using the “attitude” dataset. There are 2 variables and 3 examples for developing a model. This is not strictly high dimensions but it is an example of a small sample size.

data("attitude")
reg1 <- lm(complaints[1:3]~rating[1:3],data=attitude[1:3]) 
summary(reg1)
## 
## Call:
## lm(formula = complaints[1:3] ~ rating[1:3], data = attitude[1:3])
## 
## Residuals:
##       1       2       3 
##  0.1026 -0.3590  0.2564 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) 21.95513    1.33598   16.43   0.0387 *
## rating[1:3]  0.67308    0.02221   30.31   0.0210 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4529 on 1 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9978 
## F-statistic: 918.7 on 1 and 1 DF,  p-value: 0.021

With only 3 data points the fit is nearly perfect. You can also examine the mean squared error of the model. Below is a small function for this, followed by the results.

mse <- function(sm){ 
        mean(sm$residuals^2)}
mse(reg1)
## [1] 0.06837607

Almost no error. Lastly, let’s look at a visual of the model

with(attitude[1:3],plot(complaints[1:3]~ rating[1:3]))
title(main = "Sample Size 3")
abline(lm(complaints[1:3]~rating[1:3],data = attitude))

[Plot: complaints vs. rating with fitted line, sample of 3]

You can see that the regression line passes almost perfectly through each data point. If we tried to use this model on a test set in a real data science problem, the predictions would be very poor because the model is badly overfit. Now we will rerun the analysis, this time with the full sample.

reg2<- lm(complaints~rating,data=attitude) 
summary(reg2)
## 
## Call:
## lm(formula = complaints ~ rating, data = attitude)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.3880  -6.4553  -0.2997   6.1462  13.3603 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.2445     7.6706   1.075    0.292    
## rating        0.9029     0.1167   7.737 1.99e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.65 on 28 degrees of freedom
## Multiple R-squared:  0.6813, Adjusted R-squared:  0.6699 
## F-statistic: 59.86 on 1 and 28 DF,  p-value: 1.988e-08

You can clearly see a huge reduction in the r-square, from .99 to .68. Next is the mean squared error.

mse(reg2)
## [1] 54.61425

The error has increased a great deal. Lastly, we plot the regression line.

with(attitude,plot(complaints~ rating))
title(main = "Full Sample Size")
abline(lm(complaints~rating,data = attitude))

[Plot: complaints vs. rating with fitted line, full sample]

Naturally, the second model is more likely to perform better with a test set. The problem is that least squares regression is too flexible when the number of features is greater than or equal to the number of examples in a dataset.

What to Do?

If least squares regression must be used, one solution to overcoming high dimensionality is to use some form of regularized regression such as ridge, lasso, or elastic net. Any of these regularization approaches will help to reduce the number of variables, or at least their influence, in the final model through the use of shrinkage.
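
As a sketch of what this might look like, the code below uses the “glmnet” package (not used elsewhere in this post) on simulated data in which the number of predictors exceeds the sample size, a setting where ordinary “lm” cannot produce stable estimates.

library(glmnet)
set.seed(123)
n<-50;p<-100 #more predictors than observations
x<-matrix(rnorm(n*p),nrow=n)
y<-x[,1]-2*x[,2]+rnorm(n) #only the first two predictors actually matter
cv.ridge<-cv.glmnet(x,y,alpha=0) #alpha=0 is ridge; alpha=1 would be lasso
coef(cv.ridge,s="lambda.min")[1:5,] #shrunken coefficients at the cross-validated lambda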

However, keep in mind that no matter what you do, the training r-square will keep increasing as the number of dimensions increases, even if the added variables are useless. This is one aspect of the curse of dimensionality. Again, regularization can help with this.

Remember that with a large number of dimensions there are normally several equally acceptable models. To determine which is most useful depends on understanding the problem and context of the study.

Conclusion

The ability to collect huge amounts of data has led to the growing problem of high dimensionality. Once there are more features than examples, it can lead to statistical errors. However, regularization is one tool for dealing with this problem.

Regression with Shrinkage Methods

One problem with least squares regression is determining what variables to keep in a model. One solution to this problem is the use of shrinkage methods. Shrinkage regression involves constraining or regularizing the coefficient estimates towards zero. The benefit of this is that it is an efficient way to either remove variables from a model or significantly reduce the influence of less important variables.

In this post, we will look at two common forms of regularization: ridge regression and the lasso.

Ridge

Ridge regression includes a tuning parameter called lambda that can be used to shrink weak coefficients toward zero. This shrinkage penalty helps with the bias-variance trade-off. Lambda can be set to any value from 0 to infinity: a lambda of 0 is the same as least squares regression, while a lambda approaching infinity produces a null model. The penalty ridge regression uses is based on the l2 norm of the coefficients.

Finding the right value of lambda is the primary goal when using this algorithm. Finding it involves running models with several values of lambda and seeing which returns the best results on predetermined metrics.

The primary problem with ridge regression is that it does not actually remove any variables from the model. As such, the predictions might be excellent, but explanatory power is not improved if there are a large number of variables.

Lasso

Lasso regression has the same characteristics as ridge with one exception: the lasso can actually remove variables from the model by setting their coefficients exactly to zero. This means that lasso models are usually easier to interpret and explain. The penalty the lasso uses is based on the l1 norm of the coefficients.
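
To make the contrast concrete, below is a hedged sketch using the “glmnet” package on the built-in “mtcars” data. With an l1 penalty (alpha = 1) some coefficients come out exactly zero, while the l2 penalty (alpha = 0) only shrinks them.

library(glmnet)
x<-as.matrix(mtcars[,-1]) #predictors
y<-mtcars$mpg #outcome
set.seed(42)
ridge<-cv.glmnet(x,y,alpha=0) #l2 penalty: shrinks but never zeroes
lasso<-cv.glmnet(x,y,alpha=1) #l1 penalty: can zero out coefficients
coef(lasso,s="lambda.min") #several coefficients are exactly zero
coef(ridge,s="lambda.min") #all coefficients remain, just smaller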

It is not always clear whether lasso or ridge is superior. Normally, if the goal is explanation, the lasso is often stronger. However, if the goal is prediction, ridge may be an improvement, but not always.

Conclusion

Shrinkage methods are not limited to regression. Many other forms of analysis can employ shrinkage, such as artificial neural networks. Most machine learning models can accommodate shrinkage.

Generally, ridge and lasso regression are employed when you have a huge number of predictors as well as a larger dataset. The primary goal is the simplification of an overly complex model. Therefore, the shrinkage methods mentioned here are additional ways to use statistical models in regression.

Subset Selection Regression

There are many different ways in which the variables of a regression model can be selected. In this post, we look at several common ways in which to select variables or features for a regression model. In particular, we look at the following.

  • Best subset regression
  • Stepwise selection

Best Subset Regression

Best subset regression fits a regression model for every possible combination of variables. The “best” model can be selected based on such criteria as the adjusted r-square, BIC (Bayesian Information Criterion), etc.

The primary drawback to best subset regression is that it becomes impossible to compute the results when you have a large number of variables. Generally, when the number of variables exceeds 40 best subset regression becomes too difficult to calculate.

Stepwise Selection

Stepwise selection involves adding or taking away one variable at a time from a regression model. There are two forms of stepwise selection and they are forward and backward selection.

In forward selection, the computer starts with a null model (a model that only estimates the mean) and adds one variable at a time. The variable chosen at each step is the one that provides the best improvement in model fit. This greatly reduces the number of models that need to be fitted in comparison to best subset regression.

Backward selection starts with the full model and removes one variable at a time, dropping the variable whose removal hurts the model fit the least. The main problem with either forward or backward selection is that the best model may not always be found by this process. In addition, backward selection requires a sample size that is larger than the number of variables.
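
As a hedged sketch of how this might be coded, both approaches are available through the same “regsubsets” function from the “leaps” package used elsewhere on this blog, by changing the “method” argument. The “HI” data from the “Ecdat” package is reused here purely for illustration.

library(leaps);library(Ecdat)
data(HI)
fwd<-regsubsets(whrswk~.,data=HI,nvmax=13,method="forward") #forward selection
bwd<-regsubsets(whrswk~.,data=HI,nvmax=13,method="backward") #backward selection
which.min(summary(fwd)$bic) #size of the best forward model by BIC
which.min(summary(bwd)$bic) #size of the best backward model by BIC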

Deciding Which to Choose

Best subset regression is perhaps most appropriate when you have a small number of variables to develop a model with, such as fewer than 40. When the number of variables grows, forward or backward selection becomes more appropriate. If the sample size is smaller than the number of variables, forward selection may be the better choice. However, if the sample size is large, meaning the number of examples is greater than the number of variables, backward selection also becomes possible.

Conclusion

The examples here are some of the most basic ways to develop a regression model. However, these are not the only ways in which this can be done. What these examples provide is an introduction to regression model development. In addition, these models provide some sort of criteria for the addition or removal of a variable based on statistics rather than intuition.

Leave One Out Cross Validation in R

Leave-one-out cross-validation (LOOCV) is a variation of the validation-set approach: instead of splitting the dataset in half, LOOCV uses a single example as the validation set and all the rest as the training set, repeating this for every example. This helps to reduce the bias and randomness of the results but, unfortunately, can increase variance. Remember that the goal is always to reduce the error rate, which is often calculated as the mean squared error.

In this post, we will use the “Hedonic” dataset from the “Ecdat” package to assess several different models that predict the taxes on homes. In order to do this, we will also need the “boot” package. Below is the code.

library(Ecdat);library(boot)
data(Hedonic)
str(Hedonic)
## 'data.frame':    506 obs. of  15 variables:
##  $ mv     : num  10.09 9.98 10.45 10.42 10.5 ...
##  $ crim   : num  0.00632 0.02731 0.0273 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 ...
##  $ chas   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  28.9 22 22 21 21 ...
##  $ rm     : num  43.2 41.2 51.6 49 51.1 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 ...
##  $ dis    : num  1.41 1.6 1.6 1.8 1.8 ...
##  $ rad    : num  0 0.693 0.693 1.099 1.099 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 ...
##  $ blacks : num  0.397 0.397 0.393 0.395 0.397 ...
##  $ lstat  : num  -3 -2.39 -3.21 -3.53 -2.93 ...
##  $ townid : int  1 2 2 3 3 3 4 4 4 4 ...

First, we need to develop our basic least squares regression model. We will do this with the “glm” function. This is because the “cv.glm” function (more on this later) only works when models are developed with the “glm” function. Below is the code.

tax.glm<-glm(tax ~ mv+crim+zn+indus+chas+nox+rm+age+dis+rad+ptratio+blacks+lstat, data = Hedonic)

We now need to calculate the MSE. To do this we will use the “cv.glm” function. Below is the code.

cv.error<-cv.glm(Hedonic,tax.glm)
cv.error$delta
## [1] 4536.345 4536.075

cv.error$delta contains two numbers. The first is the raw cross-validation estimate of the prediction error, and the second is a bias-adjusted version of the same estimate. As you can see, the two numbers are almost identical.
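
Since LOOCV refits the model once per observation, it can be slow on larger datasets. The same “cv.glm” function also supports k-fold cross-validation through the “K” argument, which might look like this.

set.seed(7) #the folds are chosen at random
cv.error.k<-cv.glm(Hedonic,tax.glm,K=10) #10-fold instead of leave-one-out
cv.error.k$delta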

We will now repeat this process but with the inclusion of different polynomial models. The code for this is a little more complicated and is below.

cv.error=rep(0,5)
for (i in 1:5){
        tax.loocv<-glm(tax ~ mv+poly(crim,i)+zn+indus+chas+nox+rm+poly(age,i)+dis+rad+ptratio+blacks+lstat, data = Hedonic)
        cv.error[i]=cv.glm(Hedonic,tax.loocv)$delta[1]
}
cv.error
## [1] 4536.345 4515.464 4710.878 7047.097 9814.748

Here is what happened:

  1. First, we created an object called “cv.error” containing five zeros, which we will use to store the results later.
  2. Next, we created a for loop that repeats 5 times
  3. Inside the for loop, we create the same regression model, except we wrap the “age” and “crim” variables in the “poly” function. These are the variables for which we want to try polynomials of degree 1 through 5 to see whether this reduces the error.
  4. The results of each polynomial model are stored in the “cv.error” object; we specifically request the first element of “delta”. Finally, we print “cv.error” to the console.

From the results, you can see that the error decreases with a second-order polynomial but then increases after that. This means that higher-order polynomials are generally not beneficial here.
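
A quick plot of the stored errors makes this pattern easier to see; something like the following would do.

plot(1:5,cv.error,type="b",xlab="Polynomial degree",ylab="LOOCV estimate of the MSE")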

Conclusion

LOOCV is another option for assessing different models and determining which is most appropriate. As such, it is a tool used by many data scientists.

Validation Set for Regression in R

Estimating error and looking for ways to reduce it is a key component of machine learning. In this post, we will look at a simple way of addressing this problem through the use of the validation set method.

The validation set method is a standard approach in model development. To put it simply, you divide your dataset into a training and a hold-out set. The model is developed on the training set and then the hold-out set is used for prediction purposes. The error rate of the hold-out set is assumed to be reflective of the test error rate.

In the example below, we will use the “Carseats” dataset from the “ISLR” package. Our goal is to predict the competitors’ price for a carseat based on the other available variables. Below is some initial code

library(ISLR)
data("Carseats")
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

We need to divide our dataset into two parts. One will be the training set and the other the hold-out set. Below is the code.

set.seed(7)
train<-sample(x=400,size=200)

Now, those who are familiar with R will notice that we haven’t actually made our training set yet. We are going to use the “train” object to index rows of the “Carseats” dataset. What we did was set the seed so that the results can be replicated. Then we used the “sample” function with two arguments, “x” and “size”. “x” represents the number of examples in the “Carseats” dataset, and “size” represents how big we want the sample to be. In other words, we want 200 of the 400 examples to be in the training set, so R randomly selects 200 numbers from 1 to 400.

We will now fit our initial model

car.lm<-lm(CompPrice ~ Income+Sales+Advertising+Population+Price+ShelveLoc+Age+Education+Urban, data = Carseats,subset = train)

The code above should not be new. However, one unique twist is the use of the “subset” argument. What this argument does is tell R to only use rows that are in the “train” index. Next, we calculate the mean squared error.

mean((Carseats$CompPrice-predict(car.lm,Carseats))[-train]^2)
## [1] 77.13932

Here is what the code above means; a small helper that wraps this calculation is sketched after the list.

  1. We took the actual “CompPrice” values and subtracted from them the predictions made by the “car.lm” model we developed.
  2. We kept only the test-set observations, identified by “-train”; the minus sign means everything that is not in the “train” index.
  3. The differences were squared and then averaged with the “mean” function.
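
Because we will compute this same quantity for each of the models below, it could also be wrapped in a small helper function. The name “val.mse” is just a convenience invented for this sketch; the remaining models repeat the explicit calculation so every step stays visible.

val.mse<-function(model){
        mean((Carseats$CompPrice-predict(model,Carseats))[-train]^2) #hold-out MSE
}
val.mse(car.lm) #returns the same number as above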

The results here are the baseline comparison. We will now make two more models, each adding a polynomial term for the “Income” variable. First, we will square “Income”.

car.lm2<-lm(CompPrice ~ Income+Sales+Advertising+Population+I(Income^2)+Price+ShelveLoc+Age+Education+Urban, data = Carseats,subset = train)
mean((Carseats$CompPrice-predict(car.lm2,Carseats))[-train]^2)
## [1] 75.68999

You can see that there is a small decrease in the MSE. Also, notice the use of the “I” function, which allows us to square “Income” inside the formula. Now, let’s try a cubic model.

car.lm3<-lm(CompPrice ~ Income+Sales+Advertising+Population+I(Income^3)+Price+ShelveLoc+Age+Education+Urban, data = Carseats,subset = train)
mean((Carseats$CompPrice-predict(car.lm3,Carseats))[-train]^2)
## [1] 75.84575

This time there was an increase when compared to the second model. As such, higher order polynomials will probably not improve the model.

Conclusion

This post provided a simple example of assessing several different models using the validation set approach. In practice, this approach is not used as frequently now because there are many other resampling methods available. Yet it is still good to be familiar with a standard approach such as this.

Teaching Small Children to Write

Teaching a child to write is an interesting experience. In this post, I will share some basic ideas on one way this can be done.

To Read or not to Read

Often writing is taught after the child has learned to read. A major exception to this is the Montessori method of reading. For Montessori, a child should learn to write before reading. This is probably because writing is a more tactile experience when compared to reading and Montessori was a huge proponent of experiential learning. In addition, if you can write you can definitely read under this assumption.

Generally, I teach young children how to read first. This is because I want the child to know the letters before trying to write them.

The Beginning

If the child is already familiar with the basics of reading, writing is probably more about hand-eye coordination than anything else. The first few letters are quite the experience. This is affected by age as well: smaller children will have much more difficulty with writing than older children.

A common strategy to motivate a child to write is to have them first learn to spell their name. This can work depending on how hard the child’s name is to spell. A kid named “Dan” will master writing his name quickly. However, a kid with a longer name or a transliterated name from another language is going to have a tough time. I knew one student who misspelled their name for almost a year and a half because it was so hard to write in English.

A common way to teach the actual writing is to allow the child to trace the letters on dot paper. By doing this, they develop the muscle memory for writing. Once this is successful, the child will then attempt to write the letters with the tracing paper. This process can easily take a year.

Sentences and Paragraphs

After they learn to write letters and words, it is time to begin writing sentences. A six-year-old with good penmanship will probably not be able to write a sentence without support. Writing and spelling are different skills initially, and it is the adult’s job to provide support for the spelling aspect as the child explains what they want to write about.

With help, children can create short little stories that may be one to two paragraphs in length. Yet they will still need a lot of support to do this.

By eight years of age, a child can probably write a paragraph on their own about simple concepts or stories. This is when the teaching and learning can really get interesting as the child can now write to learn instead of focusing on learning to write.

Conclusion

Writing is a skill that is hard to find these days. With so many other forms of communication, writing is not a skill that children want to focus on. Nevertheless, learning to write beyond basic literacy is an excellent way to develop communication skills and interact with people in situations where face-to-face contact is not possible.

Homeschooling Concerns

Parents frequently have questions about homeschooling. In this post, we look at three common questions related to homeschooling.

  1. How do you know if your child has learned
  2. What do you do about socializing
  3. What about college

How do You know if they Learned

One definition of learning is a change in observable behavior. In other words, one way a parent can know that their child is learning is by watching for changes in behavior. For example, if you are teaching addition and the child begins to do addition on their own, that is evidence that they have learned something. There is no need for standardized testing to indicate this.

A lot of the more advanced forms of assessment, including standardized tests, were created in order to assess the progress of huge numbers of students. In the context of homeschooling, with only a few students, such rigorous measures are unnecessary. Governments need sophisticated measures of achievement because of the huge populations they serve; such measures would be inappropriate when dealing with one or two elementary students.

Another way to know what your child has learned is to look at what they are studying right now. For example, if my child is reading, I know that they have probably mastered the alphabet. Otherwise, how could they read? I also know that they have probably mastered most of the phonics. In other words, current struggles are an indication of what was mastered before.

What about Socializing

The answer to this question really depends on your position on socializing. Many parents want their child to act like other children. For example, if my child is 7 I want him to act like other 7-year-olds.

Other parents want their child to learn how to act like an adult. They want their 7-year-old child to imitate their behavior (the parents’) rather than the behavior of other 7-year-olds. A child will only rise to the expectations of those around them. Being around children encourages childish behavior because that is the example being set. Again, for many parents this is what they want; however, others see this differently.

The reality is that until middle-age most of the people we interact with are older than us. As such, it is beneficial for a child to spend a large amount of time around people who are older than them and understand the importance of setting an example that can be imitated.

All socializing is not the same. Adult-to-child socializing provides a child with an example of how to be an adult rather than how to be a child. Besides, most small children would love to be around their parents all day. They only grow to love friends so much because those are the people who give them the most attention.

What about College

This question is the hardest to answer as it depends on context a great deal. Concerns with college can be alleviated by having the child take the GED in the US or local college entrance examinations in other countries.

It is also important to keep careful records of what the child studies during high school. Most colleges do not care about K-8 learning but really want to know what happens during grades 9-12. Keep records of the courses the child took as well as the grades. In most cases it will also be necessary to take the SAT or ACT.

Conclusion

Homeschooling is an option for people who want to spend the maximum amount of time possible with their children. Concerns about learning, socializing, and college are unnecessary if the parents are willing to dedicate themselves thoroughly and provide a learning environment that develops their children holistically.

What it Takes to Homeschool

Some may be wondering what it takes to homeschool. Below are some characteristics needed to homeschool successfully.

Time management

Being able to adhere to a schedule is a prerequisite for homeschooling. It is tempting to just do things whenever you feel like it when you have this kind of freedom. However, in order to be successful, you have to hold yourself accountable the way a boss would. This is difficult for most people who are not used to autonomy.

This is not to say there should be no flexibility. Rather, the schedule should not be cheated because of laziness. There must be a set schedule for studying for the sake of behavior management of the children. If the child doesn’t know what to expect they may challenge you when you flippantly decide they need to study. Consistency is a foundational principle of homeschooling.

Discipline

Discipline means being able to do something even when you do not feel like doing it. In homeschooling, you have to teach whether you want to or not. Remember, sometimes we had to work at our jobs when we didn’t feel like it and the same with teaching in the home. If you’re tired you still have to teach, if you’re a little sick you still have to teach, if you’re angry you still have to teach.

The child is relying on you to provide them with the academic skills needed to compete in the world. This cannot be neglected for trivial reasons. Lesson plans are key: either buy them or make them. Keep track of completed assignments and note the student’s progress.

Toughness

As a homeschooling parent, you are the only authority in the child’s life. This means all discipline falls under your jurisdiction. One reason parents enjoy sending their kids to school is to hand their own child’s poor behavior over to the public school teachers. “Let the school deal with him” is a common comment I heard when I was a K-12 teacher. However, when you teach as a homeschool parent, you alone have the pleasure of disciplining your child.

Discipline is not only about taking away privileges and causing general suffering for unacceptable behavior. Discipline also includes communicating clearly with your child to prevent poor behavior, having clear rules that are always enforced, and providing a stable environment in which to study.

Patience

Homeschooling also requires patience. For example, suppose you are teaching a basic first-grade math concept that takes your child several weeks to learn. Naturally, you start to get frustrated with the child and with yourself over the lack of progress. You may even begin to question whether you have what it takes to do this. However, after waiting for what seems like an eternity, the child finally gets it.

This is the reality of homeschooling. No matter how bad you think you are, the child will eventually get it when they are ready. This requires patience from the parent and some confidence in their own ability to help their child grow.

Conclusion

There are many more ideas I could share. However, this is sufficient for now. In general, I would not recommend homeschooling for the typical family as the above traits are usually missing in the parents. Many parents want to homeschool for emotional reasons. The problem with this is that when they feel bad they will not want to continue the experience. Homeschooling can involve love but it must transcend emotions in order to endure for several years.

Teaching Math in the Homeschool

Teaching a child to count and do simple math is much more challenging than many would believe. Below is a simple process that I developed, somewhat by accident, from working with a kindergarten homeschool student for two years. Keep in mind that these steps often overlapped.

  1. Number recognition
  2. Counting
  3. Counting with manipulatives
  4. Flashcards with larger numbers
  5. Writing numbers
  6. Adding with manipulatives
  7. Subtraction with manipulatives
  8. Visual math

1.  Number Recognition

Number recognition simply involved the use of flashcards with the child. I would hold up a number and tell the child what the number was. Memorizing is perhaps one of the easiest things the young mind can do, as critical thinking comes much later. This initial process took about six months for the four-year-old to learn numbers 1-20.

2. Counting

With the numbers memorized, the next step was to actually learn to count. I did this by holding up the same flashcards. After the child identified the number, I would flip the flashcard over and have them count the number of objects on the card. My goal was to have them make a connection between the abstract number and the actual amount that could be seen and counted.

Again, it took about six months for the four-and-a-half-year-old student to master this for numbers 1-20. It was a really stressful six months.

3. Counting with Manipulatives

The next few steps happened concurrently for the most part. I started to have the student count with manipulatives. I would show or say a number and expect the student to count out the correct amount using the manipulatives. This was done with numbers 1-20 only.

4. Flashcards with Larger Numbers 

At the same time, I worked with the student to learn numbers beyond 20. This was strictly for memorization purposes. This continued from age 4.5 to 6. Eventually, the child could identify numbers 1-999. However, they never discovered the pattern of counting. By pattern, I mean how the 0-9 cycle repeats in the tens, how the cycle repeats again when moving to the hundreds, etc. The child only knew the numbers through brute memorization.

5. Writing Numbers

Writing numbers was used as preparation for doing addition. It was as simple as giving the student some numbers to trace on paper. It took about 8 months for the student to write numbers with any kind of consistency.

6. Adding with Manipulatives

This involved me writing a math problem and having the student solve it using manipulatives. For example, 2 + 2 would be solved by having the student count two manipulatives, then count two more, and then count the total.

My biggest concern was having the child understand the + and = signs. The plus sign was easy, but the equal sign was mysterious for a long time. However, the learning rate was picking up, and the kid learned this in about three months.

7. Subtraction with Manipulatives

This was the same as above but took only one month to learn.

8. Visual Math

At this stage, the child was doing worksheets on their own. Manipulatives were allowed as a crutch to get through the problems. However, the child was now being encouraged to use their fingers for counting purposes. This was a disaster for several weeks as they lacked the coordination to open and close their fingers independently of each other.

Conclusion

This entire process took two years to complete from ages 4-6 working with the child one-on-one. By the age of six, the child could add and subtract anything from 1-30 and was ready for 1st grade.

I would recommend waiting longer to start math with a child. Being 4 was probably too young for this particular child. It is better to wait until 5 or 6 to learn numbers and counting. There is more danger in starting early than there is in starting late.

Confusing Words for Small Children

In this post, we will look at some commonly used words that can bring a great deal of frustration to adults when communicating with small children. The terms are presented in the following categories

  • Deictic terms
  • Interrogatives
  • Locational terms
  • Temporal terms

Deictic Terms

Deictic terms fall under the umbrella of pragmatic development, or understanding the context in which words are used. Examples of deictic terms include words such as this, that, these, those, here, there, etc. What makes these words confusing for young children, and even ESL speakers, is that their meaning depends on the context. Below is a clear way to communicate, followed by an unclear way that uses a deictic term.

Clear communication: Take the book
Unclear communication: Take that

The first sentence makes it clear what to take, which in this example is the book. However, for a child or an ESL speaker, the second sentence can be mysterious. What does “that” mean? It takes pragmatic, or contextual, knowledge to determine what “that” refers to in the sentence. Children usually cannot figure this out, while an ESL speaker will watch the body language (nonlinguistic cues) of the speaker to work it out.

Interrogatives

A unique challenge for children is understanding interrogatives. These are words such as who, what, where, when, and why. The challenge with these questions is that they involve explaining causes, times, and/or reasons. Many parents have asked the following question without receiving an adequate answer:

Why did you take the book?

The typical 3-year-old is going to wonder what the word “why” means. Of course, you can combine a deictic term with an interrogative and completely lose a child:

Why did you do that?

Locational Terms

Locational terms are prepositions such as in, under, above, behind, etc. These words can be challenging for young children because the child has to understand the perspective of the person speaking. Below is an example.

Put the book under the table.

Naturally, the child is trying to understand what “under” means. We can also completely confuse a child by using terms from all the categories we have discussed so far.

Why did you put that under the table?

This sentence would probably be unclear to many native speakers. The ambiguity is high especially with the term “that” included.

Temporal Terms

Temporal terms are about time. Commonly used words include before, after, while, etc. These terms are difficult for children because young children do not quite grasp the concept of time. Below is an example of a sentence with a temporal term.

Before dinner, grab the book.

The child is probably wondering when they are supposed to get the book. Naturally, we can combine all of our terms to make a truly nightmarish sentence.

Why did you put that under the table after dinner?

Conclusion

The different terms mentioned here can cause frustration when trying to communicate with small children. To alleviate these problems, parents and teachers should avoid these terms when possible by using nouns instead. In addition, using body language to indicate position, or pointing to whatever you are talking about, can help young children to infer the meaning.

Additive Assumption and Multiple Regression

In regression, one of the assumptions is the additive assumption. This assumption states that the influence of a predictor variable on the dependent variable is independent of any other influence. However, in practice, it is common that this assumption does not hold.

In this post, we will look at how to address violations of the additive assumption through the use of interactions in a regression model.

An interaction effect occurs when the effect of one predictor variable on the dependent variable depends on the value of another predictor. As such, their effects must be considered simultaneously rather than separately. This is done through the use of an interaction term, which is the product of the two predictor variables.
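
In R’s formula syntax there are two equivalent ways to write such a product term. A minimal sketch using the built-in “mtcars” data (not the Carseats data used below):

m1<-lm(mpg~wt*hp,data=mtcars) #wt*hp is shorthand for wt+hp+wt:hp
m2<-lm(mpg~wt+hp+wt:hp,data=mtcars) #explicit main effects plus interaction term
all.equal(coef(m1),coef(m2)) #TRUE: the two fits are identical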

Let’s begin by making a regular regression model with an interaction. To do this we will use the “Carseats” data from the “ISLR” package to predict “Sales”. Below is the code.

library(ISLR);library(ggplot2)
data(Carseats)
saleslm<-lm(Sales~.,Carseats)
summary(saleslm)
## 
## Call:
## lm(formula = Sales ~ ., data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8692 -0.6908  0.0211  0.6636  3.4115 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      5.6606231  0.6034487   9.380  < 2e-16 ***
## CompPrice        0.0928153  0.0041477  22.378  < 2e-16 ***
## Income           0.0158028  0.0018451   8.565 2.58e-16 ***
## Advertising      0.1230951  0.0111237  11.066  < 2e-16 ***
## Population       0.0002079  0.0003705   0.561    0.575    
## Price           -0.0953579  0.0026711 -35.700  < 2e-16 ***
## ShelveLocGood    4.8501827  0.1531100  31.678  < 2e-16 ***
## ShelveLocMedium  1.9567148  0.1261056  15.516  < 2e-16 ***
## Age             -0.0460452  0.0031817 -14.472  < 2e-16 ***
## Education       -0.0211018  0.0197205  -1.070    0.285    
## UrbanYes         0.1228864  0.1129761   1.088    0.277    
## USYes           -0.1840928  0.1498423  -1.229    0.220    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.019 on 388 degrees of freedom
## Multiple R-squared:  0.8734, Adjusted R-squared:  0.8698 
## F-statistic: 243.4 on 11 and 388 DF,  p-value: < 2.2e-16

The results are rather excellent for the social sciences. The model explains 87.3% of the variance in “Sales”. The current results that we have are known as main effects. These are effects that directly influence the dependent variable. Most regression models only include main effects.

We will now examine an interaction effect between two continuous variables. Let’s see if there is an interaction between “Population” and “Income”.

saleslm1<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
                     Urban+ShelveLoc+Population*Income, Carseats)
summary(saleslm1)
## 
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population + 
##     Price + Age + Education + US + Urban + ShelveLoc + Population * 
##     Income, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8699 -0.7624  0.0139  0.6763  3.4344 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.195e+00  6.436e-01   9.625   <2e-16 ***
## CompPrice          9.262e-02  4.126e-03  22.449   <2e-16 ***
## Income             7.973e-03  3.869e-03   2.061   0.0400 *  
## Advertising        1.237e-01  1.107e-02  11.181   <2e-16 ***
## Population        -1.811e-03  9.524e-04  -1.901   0.0580 .  
## Price             -9.511e-02  2.659e-03 -35.773   <2e-16 ***
## Age               -4.566e-02  3.169e-03 -14.409   <2e-16 ***
## Education         -2.157e-02  1.961e-02  -1.100   0.2722    
## USYes             -2.160e-01  1.497e-01  -1.443   0.1498    
## UrbanYes           1.330e-01  1.124e-01   1.183   0.2375    
## ShelveLocGood      4.859e+00  1.523e-01  31.901   <2e-16 ***
## ShelveLocMedium    1.964e+00  1.255e-01  15.654   <2e-16 ***
## Income:Population  2.879e-05  1.253e-05   2.298   0.0221 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.013 on 387 degrees of freedom
## Multiple R-squared:  0.8751, Adjusted R-squared:  0.8712 
## F-statistic:   226 on 12 and 387 DF,  p-value: < 2.2e-16

The new contribution is at the bottom of the coefficient table: the “Income:Population” coefficient. This term captures how the effect of “Income” on “Sales” changes as “Population” changes (and vice versa). In other words, the “Income:Population” coefficient looks at their combined, simultaneous effect on “Sales” rather than just their separate effects.

This makes practical sense as well. The larger the population, the more available income, and vice versa. However, for our current model, the improvement in the r-squared is relatively small, and the actual effect is a small increase in sales. Below is a graph of income and population by sales. Notice how the lines cross; this is a visual sign of an interaction. The lines are not parallel by any means.

ggplot(data=Carseats, aes(x=Income, y=Sales, group=1)) +geom_smooth(method=lm,se=F)+ 
        geom_smooth(aes(Population,Sales), method=lm, se=F,color="black")+xlab("Income and Population")+labs(
                title="Income in Blue Population in Black")

We will now repeat this process but this time using a categorical variable and a continuous variable. We will look at the interaction between “US” location (categorical) and “Advertising” (continuous).

saleslm2<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
                     Urban+ShelveLoc+US*Advertising, Carseats)
summary(saleslm2)
## 
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population + 
##     Price + Age + Education + US + Urban + ShelveLoc + US * Advertising, 
##     data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8531 -0.7140  0.0266  0.6735  3.3773 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        5.6995305  0.6023074   9.463  < 2e-16 ***
## CompPrice          0.0926214  0.0041384  22.381  < 2e-16 ***
## Income             0.0159111  0.0018414   8.641  < 2e-16 ***
## Advertising        0.2130932  0.0530297   4.018 7.04e-05 ***
## Population         0.0001540  0.0003708   0.415   0.6782    
## Price             -0.0954623  0.0026649 -35.823  < 2e-16 ***
## Age               -0.0463674  0.0031789 -14.586  < 2e-16 ***
## Education         -0.0233500  0.0197122  -1.185   0.2369    
## USYes             -0.1057320  0.1561265  -0.677   0.4987    
## UrbanYes           0.1191653  0.1127047   1.057   0.2910    
## ShelveLocGood      4.8726025  0.1532599  31.793  < 2e-16 ***
## ShelveLocMedium    1.9665296  0.1259070  15.619  < 2e-16 ***
## Advertising:USYes -0.0933384  0.0537807  -1.736   0.0834 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.016 on 387 degrees of freedom
## Multiple R-squared:  0.8744, Adjusted R-squared:  0.8705 
## F-statistic: 224.5 on 12 and 387 DF,  p-value: < 2.2e-16

Again, you can see that when the store is in the US you also have to consider the advertising budget. When these two variables are considered together, there is a slight decline in sales. In practice, this means that advertising in the US is not as beneficial as advertising outside the US.
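One way to see this in numbers is to work out the slope of “Advertising” for stores inside and outside the US from the coefficients above. Below is a small sketch of that arithmetic.

cf <- coef(saleslm2)
cf["Advertising"]                            # slope of Advertising for stores outside the US (about 0.21)
cf["Advertising"] + cf["Advertising:USYes"]  # slope inside the US (about 0.21 - 0.09 = 0.12)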

Below you can again see a visual of the interaction effect when the lines for US yes and no cross each other in the plot below.

ggplot(data=Carseats, aes(x=Advertising, y=Sales, group = US, colour = US)) +
        geom_smooth(method=lm,se=F)+scale_x_continuous(limits = c(0, 25))+scale_y_continuous(limits = c(0, 25))

Lastly, we will look at an interaction effect for two categorical variables.

saleslm3<-lm(Sales~CompPrice+Income+Advertising+Population+Price+Age+Education+US+
                     Urban+ShelveLoc+ShelveLoc*US, Carseats)
summary(saleslm3)
## 
## Call:
## lm(formula = Sales ~ CompPrice + Income + Advertising + Population + 
##     Price + Age + Education + US + Urban + ShelveLoc + ShelveLoc * 
##     US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8271 -0.6839  0.0213  0.6407  3.4537 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.8120748  0.6089695   9.544   <2e-16 ***
## CompPrice              0.0929370  0.0041283  22.512   <2e-16 ***
## Income                 0.0158793  0.0018378   8.640   <2e-16 ***
## Advertising            0.1223281  0.0111143  11.006   <2e-16 ***
## Population             0.0001899  0.0003721   0.510   0.6100    
## Price                 -0.0952439  0.0026585 -35.826   <2e-16 ***
## Age                   -0.0459380  0.0031830 -14.433   <2e-16 ***
## Education             -0.0267021  0.0197807  -1.350   0.1778    
## USYes                 -0.3683074  0.2379400  -1.548   0.1225    
## UrbanYes               0.1438775  0.1128171   1.275   0.2030    
## ShelveLocGood          4.3491643  0.2734344  15.906   <2e-16 ***
## ShelveLocMedium        1.8967193  0.2084496   9.099   <2e-16 ***
## USYes:ShelveLocGood    0.7184116  0.3320759   2.163   0.0311 *  
## USYes:ShelveLocMedium  0.0907743  0.2631490   0.345   0.7303    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 386 degrees of freedom
## Multiple R-squared:  0.8753, Adjusted R-squared:  0.8711 
## F-statistic: 208.4 on 13 and 386 DF,  p-value: < 2.2e-16

In this case, we can see that when the store is in the US and the shelf location is good, there is an additional effect on Sales compared to a bad location. The plot below is a visual of this, although it is harder to see because the x-axis has only two categories.

ggplot(data=Carseats, aes(x=US, y=Sales, group = ShelveLoc, colour = ShelveLoc)) +
        geom_smooth(method=lm,se=F)
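To make the size of this interaction concrete, the sketch below adds the interaction coefficient to the main effect of a good shelf location.

cf <- coef(saleslm3)
cf["ShelveLocGood"]                             # effect of a good shelf location outside the US (about 4.35)
cf["ShelveLocGood"] + cf["USYes:ShelveLocGood"] # effect of a good shelf location inside the US (about 4.35 + 0.72 = 5.07)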

Conclusion

Interaction effects are a great way to fine-tune a model, especially for explanatory purposes. Often, the change in r-squared is not large enough to matter for prediction, but it can provide a more nuanced understanding of the relationships among the variables.

Concepts to Consider for Model Development

When deciding how to conduct quantitative data analysis, there are several factors to consider. In this post, we look at common either-or choices in data analysis. The concepts are as follows

  • Parametric vs Non-Parametric
  • Bias vs Variance
  • Accuracy vs Interpretability
  • Supervised vs Unsupervised
  • Regression vs Classification

Parametric vs Non-Parametric

Parametric models make assumptions about the shape of a function. Often, the assumption is that the function is linear in nature, as in linear regression. Making this assumption makes it much easier to estimate the actual function of a model.

Non-parametric models do not make any assumptions about the shape of the function. This allows the function to take any shape possible. Examples of non-parametric models include decision trees, support vector machines, and artificial neural networks.

The main concern with non-parametric models is that they require a much larger dataset than parametric models do.
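As a rough sketch of the contrast (assuming the “rpart” package is installed), the code below fits a parametric linear model and a non-parametric decision tree to the same built-in data; the tree makes no assumption that the underlying function is linear.

library(rpart)
data(mtcars)
para.fit <- lm(mpg ~ wt + hp, data = mtcars)       # parametric: assumes a linear function
nonpara.fit <- rpart(mpg ~ wt + hp, data = mtcars) # non-parametric: shape is learned from the data
summary(para.fit)
nonpara.fit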

Bias vs Variance Tradeoff

Related to parametric and non-parametric models is the bias-variance tradeoff. A highly biased model is one that pays little attention to the training data because of the rigid assumptions it makes about the shape of the function; it behaves almost as if it had not been trained on the data at all. Parametric models tend to suffer from high bias.

Variance is how much the estimated function would change if it were estimated using new data. Variance is usually much higher in non-parametric models because they are more sensitive to the unique nature of each dataset.
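A small simulation can make this tradeoff concrete. The sketch below is purely illustrative and uses simulated data: it fits a straight line (more biased, less variance) and a 10th-degree polynomial (less biased, more variance) to repeated samples and compares how much their predictions at a single point bounce around.

set.seed(123)
pred.simple <- pred.flex <- numeric(100)
newx <- data.frame(x = 5)
for (i in 1:100) {
        x <- runif(50, 0, 10)
        y <- sin(x) + rnorm(50, sd = 0.5)                  # the true function is non-linear
        pred.simple[i] <- predict(lm(y ~ x), newx)         # rigid model: high bias, low variance
        pred.flex[i] <- predict(lm(y ~ poly(x, 10)), newx) # flexible model: low bias, high variance
}
var(pred.simple)
var(pred.flex)   # usually noticeably larger than var(pred.simple)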

Accuracy vs Interpretability 

It is also important to determine what you want to know. If your goal is accuracy, then a complicated model may be more appropriate. However, if you want to make inferences and draw conclusions from your model, then a simpler model that can be explained is preferred.

A model becomes more complex through the inclusion of more features or the use of a more flexible algorithm, and simpler by doing the opposite. For example, regression is much easier to interpret than an artificial neural network.

Supervised vs Unsupervised

Supervised learning is learning that involves a dependent variable. Examples of supervised learning include regression, support vector machines, K nearest neighbor, and random forest.

Unsupervised learning involves a dataset that does not have a dependent variable. In this situation, you are looking for patterns in the data. Examples of this include k-means clustering, principal component analysis, and hierarchical cluster analysis.

Regression vs Classification

Lastly, a problem can call for regression or classification. A regression problem involves a numeric dependent variable. A classification problem has a categorical dependent variable. Almost all models that are used for supervised machine learning can address regression or classification.

For example, linear regression handles numeric prediction while logistic regression handles classification. K nearest neighbor can do both, as can random forest, support vector machines, and artificial neural networks.
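To illustrate (assuming the “randomForest” package is installed), the same algorithm below performs regression when the dependent variable is numeric and classification when it is a factor.

library(randomForest)
data(iris)
set.seed(123)
rf.regress <- randomForest(Sepal.Length ~ ., data = iris) # numeric outcome -> regression
rf.classify <- randomForest(Species ~ ., data = iris)     # factor outcome -> classification
rf.regress
rf.classify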

Ways to Comprehend Academic Texts

In this post, we will look at some practical ways to better understand an academic text. The tips are broken down into three sections which are, what to do before, during, and after reading.

Before Reading

Read the Preface. The preface lays out the entire scope of the book. It provides the framework in which you can place the details of the chapters. This is critical in order to put the pieces together to make use of them. Almost all students skip this as it is normally not assigned reading. This step is only done before reading the first chapter of a text.

Read the Chapter Titles. The chapter titles give you an idea of what each chapter is about. Again, this helps you to zoom down one level and understand the subject of the book from one aspect of it. Most students fly past this even though the chapter title provides clear clues as to what to expect in the text.

Read the Objectives. The objectives tell you what you are going to visit in the chapter. They serve as signposts of what to expect and provide a framework for placing the details of the text.

Read the Chapter Headings. An academic text is broken down into chapters, which are broken down into headings. Examining the headings provides more information about the chapter and the book. I should also mention that often the objectives and the headings of a book are the same with slight rewording. This seems lazy but is actually much clearer than when the two are not similar.

Look at the visuals (tables, graphs, pictures). Visuals summarize critical information. It is easy for anybody to become overwhelmed when reading a text. Therefore, visuals are created to summarize the most important information.

If you do these things you now know what to expect when you read. You are also beginning to develop an idea of what you did not know about the given subject. This leads to the next major point.

Ask Questions. After this inspection of the text, you should do the following.

  • Decide what you already know about this topic
  • Decide what you want to know about this topic and make questions

This two-step process prepares you to connect your current level of understanding with the new knowledge in the text. You know what is review for you, and you can focus on finding answers to the concepts and ideas that are new to you.

During Reading

After all of this preparatory work, it is now time to read. Having done all this, you already know the following

  • Title of the chapter
  • Major headings/objectives of the chapter
  • What you already know about this subject
  • What you do not know about this subject

Now you read the text and answer your questions. You can also highlight key ideas and write in the margins of the text. Highlighting should generally be limited to main ideas in order to avoid the clutter of highlighting everything. Writing in the margins allows you to make quick notes to yourself about key points or to summarize a dense concept. Doing either of these is a way to wrestle with a text in an active manner, which is important for comprehension.

After Reading

After you have read and answered your questions in a text, there are several things left to do.

Determine what you learned. Briefly write a few notes to yourself about what exactly you learned. This is for you and helps to make sense of all the details in your mind at the moment.

Look at the resources at the back of the chapter. Many textbooks have several study tools at the end of each chapter. Examples include an outline of the chapter (which is a great summary), discussion questions that help in developing critical thinking skills, and often vocabulary words as well. When preparing for an exam this is an excellent resource.

Conclusion

This process is not as much work as it seems. With practice, it can become natural. In addition, you need to modify this so that you can be successful as a reader. The ideas here provide a framework in which you can develop your own style.

Insights into Reading Academic Text

In my experience teaching at university for several years, I have noticed how students consistently struggle with reading an academic text. It seemed as though they were able to “read” the words but lacked the ability to understand what the text was about. I have thought about this challenge and have been led to the following conclusions.

  • Many students believe reading and understanding are the same thing
  • Many students believe they have no responsibility to think about what they have read
  • Many students believe there is no reason to connect what they are reading to anything they currently know
  • Many students see no point in determining how to use or apply what they have read
  • Many students do not understand how academic writers structure their writing

None of these points apply to everybody. However, it is common for me to ask my students if they read something, and they usually say that yes, they did read it. However, as I begin to ask questions and explore the text with them, it quickly becomes clear they did not understand anything they read. This is partially because students read passively even though reading is an active process. The student never thinks of the relevance of the reading to their own life or future career.

In other words, reading is not the problem; rather, it is what to do with what they have read. The purpose of studying is to use what you have learned. Few of us have the time to simply learn for fun. Often, we learn to do something for monetary reasons. In other words, some sort of immediate application is critical to reading success.

Another important aspect of reading comprehension is understanding the structure of academic writing. Textbooks have different subjects but they all have a surprisingly similar structure which often starts with the big picture and zooms down to the details. If students can see the structure it can greatly improve their ability to understand what they are reading.

The Tour Guide Analogy

The analogy that I like to use is that of a tour guide. A tour guide’s job is to show you around a particular place. It could be an entire city or a single tourist attraction; it all depends on the level of detail that he or she wants to provide. Often, at the beginning of a tour, the tour guide will explain the itinerary of the tour. This provides the big-picture purpose of the tour group as well as what to expect during the journey.

If the trip is especially detailed, you may visit several different places, and at each place there will be several sights that the tour guide will mention. For example, if I go to Thailand for vacation and visit Bangkok, there will be several locations within Bangkok that I would visit, such as malls and maybe a museum. It is the tour guide’s job to guide me in the learning experience.

The author of an academic text is like a tour guide. Their job is to show you around the subject they are an expert in. The tour guide has an itinerary while the academic author has a preface/introduction. In the preface, the author explains the purpose of the book, as well as the major themes or “places” they will show you on the tour. The preface also explains who the book is for.

Each chapter in an academic text is one specific place the author wants to show you on the tour. Just as a tour guide may show you a museum in Bangkok so an author will show you one aspect of a subject in a chapter. Furthermore, every chapter has several headings within it. This is the same as me seeing the dinosaur exhibit at the museum or the ancient Thai instruments exhibit. These are the places within the place that you visited.

Tour guide                                  | Writer
Expert in their area                        | Expert in their area
Shows you around the tourist attraction     | Shows you around a subject area
Explains what you will see today            | Explains what they will share in a book/chapter
Provides details about the different sights | Provides details for the main ideas

We could break this down further with subheadings and more, but I think the point is clear. The layout of an academic text is not mysterious but rather highly consistent. With this in mind, the critical thing to remember when you read is to look for this structure.

Statistical Learning

Statistical learning is a discipline that focuses on understanding data. Understanding data can happen through classification or numeric prediction, which is called supervised learning, or through finding patterns in data, which is called unsupervised learning.

In this post, we will examine the following

  • History of statistical learning
  • The purpose of statistical learning
  • Statistical learning vs Machine learning

History Of Statistical Learning

The early pioneers of statistical learning focused exclusively on supervised learning. Linear regression was developed in the 19th century by Legendre and Gauss. In the 1930s, Fisher created linear discriminant analysis. Logistic regression was created in the 1940s as an alternative to linear discriminant analysis.

The developments of the late 19th century to the mid 20th century were limited due to the lack of computational power. However, by the 1970s things began to change and new algorithms emerged, specifically ones that could handle non-linear relationships.

In the 1980s, Breiman, Friedman, Olshen, and Stone developed classification and regression trees. The term generalized additive models was first used by Hastie and Tibshirani for non-linear generalized models.

Purpose of Statistical Learning

The primary goal of statistical learning is to develop a model of data you currently have to make decisions about the future. In terms of supervised learning with a numeric dependent variable, a teacher may have data on their students and want to predict future academic performance. For a categorical variable, a doctor may use data he has to predict whether someone has cancer or not. In both situations, the goal is to use what one knows to predict what one does not know.

A unique characteristic of supervised learning is that the purpose can be to predict future values or to explain the relationship between the dependent variable and the independent variable(s). Generally, data science is much more focused on prediction while the social sciences seem more concerned with explanation.

For unsupervised learning, there is no dependent variable. In terms of a practical example, a company may want to use the data they have to determine several unique categories of customers they have. Understanding large groups of customer behavior can allow the company to adjust their marketing strategy to cater to the different needs of their vast clientele.

Statistical Learning vs Machine Learning

The difference between statistical learning and machine learning is so small that for the average person it makes little difference. Generally, although some may disagree, the two terms mean essentially the same thing. Often statisticians speak of statistical learning while computer scientists speak of machine learning.

Machine learning is the more popular term, as it is easier to conceive of a machine learning than of statistics learning.

Conclusion

Statistical or machine learning is a major force in the world today. With so much data and so much computing power, the possibilities are endless in terms of what kind of beneficial information can be gleaned. However, all of this began with people creating a simple linear model in the 19th century.

Conversational Analysis: Request and Response

Within conversational analysis (CA), it is common to analyze people’s requests as well as their responses to requests in the context of a conversation. In this post, we will look at the categories that these requests and responses commonly fall into.

Request

Requests are a specific type of question in conversational analysis. Requests almost always involve some sort of action: either the speaker wants to do something or the speaker wants the listener to do something. As such, there are only two categories in which a request can be classified, and they are…

  • Action request
  • Permission request

Action Request

An action request is made when the speaker wants the listener to do something.

A: Can you turn off the light?

Permission Request

A permission request is made when the speaker wants to perform an action and is seeking approval from the listener.

A: You mind if I turn off the light?

Response to Request

The response to a request can be positive or negative. However, when a response is negative it is often indirect.  As such, there are three categories in which a response to a request can be placed.

  • Accept
  • Reject
  • Evade

Accept

Accepting is to grant permission either for the speaker to do something or that the listener will perform the request.

A: Could you turn the light off?
B: No problem

Reject

Rejecting means that a person states directly that they cannot do something.

A: Can you turn the light off?
B: No, I can’t

Evading

Evading is the art of saying “no” indirectly to a request. This is done by giving a reason why something cannot be done rather than directly responding.

A: Can you turn the light off?
B: I’m busy with the baby

In the example above, person B never says no. Rather, they provide an excuse for not completing the task.

Conclusion

We have all used these various ways of requesting and responding to requests. The benefit of CA is being able to break down these conversational pairs and understand what is happening beyond the surface level.

Conversational Analysis: Questions & Responses

Conversational analysis (CA) is the study of social interactions in everyday life. In this post, we will look at how questions and responses are categorized in CA.

Questions

In CA, there are generally three types of questions and they are as follows…

  • Identification question
  • Polarity question
  • Confirmation question

Identification Question

Identification questions are questions that employ one of the five W’s (who, what, where, when, why). The response can be open- or closed-ended. An example is below

Where are the keys?

Polarity Question

A polarity question is a question that calls for a yes/no response.

Can you come to work tomorrow?

Confirmation Question

Similar to the polarity question, a confirmation question seeks to gather support for something the speaker already said.

Didn’t Sam go to the store already?

This question is seeking an affirmative yes.

Responses

There are also several ways in which people respond to a question. Below is a list of common ways.

  • Comply
  • Supply
  • Imply
  • Evade
  • Disclaim

Comply

Complying means giving a clear, direct answer to a question. Below is an example

A: What time is it?
B: 6:30pm

Supply

Supplying is the act of giving a partial response that is often irrelevant and fails to answer the question.

A: Is this your dog?
B: Well…I do feed it once in a while

In the example above, person A asks a clear question. However, person B states what they do for the dog (feed it) rather than indicate if the dog belongs to them. Feeding the dog is irrelevant to ownership.

Imply

Implying is providing information indirectly to answer a question.

A: What time do you want to leave?
B: Not too late

The response from person B does not indicate any sort of specific time to leave. This leaves it up to person A to determine what is meant by “too late.”

Disclaim

Disclaiming is when the person states that they do not know the answer.

A: Where are the keys?
B: I don’t know

Evade

Evading is the act of answering without really answering the question.

A: Where is the car?
B: David needed to go shopping

In the example above, person B never states where the car is. Rather, they share what someone is doing with the car. By doing this, the speaker never shares where the car is.

Conclusions

The interaction of a question and response can be interesting if it is examined more closely from a sociolinguistic perspective. The categories provided here can support the deeper analysis of conversation.

Conversational Analysis

Conversational analysis is a tool used by sociolinguists to examine dialogue between two or more people. The analysis can include such aspects as social factors, social dimensions, and other characteristics.

One unique tool in conversational analysis is identifying adjacency pairs. Adjacency pairs are two-part utterances in which the second speaker is replying to something the first speaker said. In this post, we will look at the following examples of adjacency pairs.

  • Request-agreement
  • Question-Answer
  • Assessment-Agreement
  • Greeting-Greeting
  • Compliment-Acceptance
  • Conversational Concluder
  • Complaint-Apology
  • Blame-Denial
  • Threat-Counterthreat
  • Warning-Acknowledgement
  • Offer-Acceptance

Request-Agreement

A request involves asking someone to do something, and agreement indicates that the person will do it. Below is an example

A: Could you open the window?
B: No problem

Question-Answer

One person requests information from another. This is different from request-agreement because there is no need to agree. Below is an example

A: Where are you from?
B: I am from Laos

Assessment-Agreement

An assessment seeks an opinion from someone, and agreement is a positive position on the subject. An example is below

A: Do you like the food?
B: Yeah, it tastes great!

Greeting-Greeting

Two people say hello to one another.

A: Hello
B: Hello

Compliment-Acceptance

One person commends something about the other who shows appreciation for the comment.

A: I really like your shoes
B: Thank you

Conversational Concluder

This is a comment that signals the end of a conversation.

A: Goodbye
B: See you later

Complaint-Apology

One person indicates they are not happy with something and the other person expresses regret over this.

A: The food is too spicy
B: We’re so sorry

Blame-Denial

One person accuses another who tries to defend himself.

A: You lost the phone?
B: No I didn’t!

Threat-Counterthreat

Two people mutually resist each other.

A: Sit down or I will call your parents!
B: Make me

Warning-Acknowledgement

One person warns of a danger and the other indicates they understand.

A: Look both ways before crossing the street
B: No problem

Offer-Acceptance

One person gives something and the other person shows appreciation

A: Here’s the money
B: Thank you so much

Conclusion

These kinds of conversational pairs appear whenever people talk. For the average person, this is not important. However, when trying to examine the context of a conversation to understand what is affecting the way people speak, identifying adjacency pairs can be useful.

The Structure of Academic Writing

“The book is boring.” This is a common complaint many lecturers receive from students about the assigned reading in a class. Although this is discouraging to hear, it is usually a cry for help. What the student is really saying is that they cannot understand what they are reading. Yes, they read it, but they didn’t get it.

The missing ingredient for students to appreciate academic reading is understanding the structure of academic writing. Lecturers forget that students are not scholars and thus do not quite understand how scholars organize their writing. If students knew this, they would no longer be students. Therefore, lecturers need to help students understand not only the ideas of a book but also the actual structure in which those ideas are framed.

This post will try to explain the structure of academic writing in a general sense.

How it Works

Below is a brief outline of a common structure for an academic textbook.

  • Preface
    • Purpose of the book
    • Big themes of the book (chapters)
  • Chapter
    • Objectives/headings provide themes of the chapter
  • Headings
    • Provides theme of a section of a chapter

Here is what I consider to be a major secret of writing: the structure is highly redundant but at different levels of abstraction. The preface, chapters, and headings of a book all serve the same purpose but at different levels of scope. The preface is the biggest picture you can get of the text; it is similar to the map of a country. The chapter zooms in somewhat and is similar to the map of a city. Lastly, the headings within a chapter are similar to a neighborhood map of a city.

The point is that academic writing is highly structured and organized. Students often think a text is boring. However, when they see the structure, they may not fall in love with academics but at least they will understand what the author is trying to say. A student must see the structure in order to appreciate the details.

Another way to look at this is as follows.

  • The paragraphs of a heading support the heading
  • The headings of a chapter support the chapter
  • The chapters of a book support the title of the book

A book is like a body: you have cells, you have tissues, and you have organs. Each is an abstraction at a higher level. Cells combine to make tissues, tissues combine to make organs, etc. This is how academic writing is structured.

The goal of academic writing is not to be entertaining. That role is normally set aside for fiction. Since most students enjoy entertainment, they expect academic writing to use the same formula of fun. However, few authors list fun as one of the purposes in their preface. This is yet another disconnect between students and textbooks.

Conclusion

Academic writing is repetitive in terms of its structure. Each sub-section supports a higher section of the book. This repetitive structure is probably one aspect of academic writing students find so boring. However, this repetitive nature makes the writing highly efficient for understanding, provided the reader is aware of it.

Understanding the Preface of a Textbook

A major problem students have in school is understanding what they read. However, the problem often is not reading itself. By this I mean the student can read the words but does not know what they mean. In other words, they will read the text but cannot explain what the text was about.

There are several practical things a student can do to overcome this problem without having to make significant changes to their study habits. Some of the strategies that they can use involve looking at the structure of how the writing is developed. Examples of this include the following.

  • Reading the preface
  • Reading the chapter titles
  • Reading the chapter objectives
  • Reading the headings in the chapters
  • Make some questions
  • Now read & answer the questions

In this post, we will look at the benefits of reading the preface to a book.

Reading the Preface

When students are assigned reading they often skip straight to page one and start reading. This means they have no idea what the text is about or even what the chapter will be about. This is the same as jumping in your car to drive somewhere without directions. You might get there eventually but often you just end up lost.

One of the first things a student should do is read the preface of a book. The preface gives you some of the following information

  • Information about the author
  • The purpose of the book
  • The audience of the book
  • The major themes of the text
  • Assumptions

Knowing the purpose of the text is beneficial to understanding the author’s viewpoint. This is often more important in graduate studies than in undergrad.

Knowing the main themes of the book helps from a psychological perspective as well. These themes serve as mental hooks in your mind in which you can hang the details of the chapters that you will read. It is critical to see the overview and big picture of the text so that you have a framework in which to place the ideas of the chapters you will read.

Many books do not have a preface. Instead what they often do is make chapter one the “introduction” and include all the aspects of the preface in the first chapter. Both strategies are fine. However, it is common for teachers to skip the introduction chapter in order to get straight to the “content.” This is fast but can inhibit understanding of the text.

There is also usually an explanation of assumptions. The assumptions tell the reader what they should already know as well as the biases of the author. This is useful because it communicates the position the author takes from the outset rather than leaving readers to infer it.

Conclusion

The preface serves the purpose of introducing the reader to the text. One of the goals of the preface is to convince the reader why they should read the book. It provides the big picture of the text, shares about the author, indicates who the book is for, and shares the author’s viewpoint.

Understanding Academic Text

Understanding academic text is possible through making some minor adjustments to one’s reading style. In this post, we will look at the following ideas for improving academic reading comprehension.

  • Reading the chapter titles
  • Reading the chapter objectives
  • Reading the headings in the chapters
  • Examine the Visuals
  • Make some questions
  • Now read & answer the questions

Read the Chapter Titles

You read the chapter title for the same reason as the preface. It gives you the big picture from which you develop a framework for placing the ideas of the author. I am always amazed how many times I ask my students what the title of the chapter is and they have no clue. This is because they were so determined to read that they never set things in place to understand.

For ESL readers, it is critical that they know the meaning of every word in the title. Again, this has to do with the importance of the title for shaping the direction of the reading. If the student gets lost in the details, that is what teaching support is for. However, if they have no idea what the chapter is about, there is little even the best teacher can do.

Read Chapter Objectives

The objectives of a chapter are a promise of what the author will write about. The student needs to know what the promises are so they know what to expect. This is similar to driving somewhere and expecting to see certain landmarks along the way. When you see these landmarks you know you are getting close to the destination.

The objectives provide the big picture of the chapter in the same way that the preface provides the big picture of the entire book. Again, it is common for students to skip this aspect of reading comprehension.

Read the Chapter Headings

By now you probably know why to read the chapter headings. If not, it is because the chapter headings tell the student what to expect in a particular section of the chapter. They serve as a local landmark or a localized purpose.

For an extremely efficient (or perhaps lazy) writer, the objectives and the headings of a chapter will be exactly the same with perhaps slight rewording. This is extremely beneficial for readers because not only do they see the objectives at the beginning, but they see them stated again as headings in the chapter.

Examine the Visuals

Visuals are used to illustrate ideas in the text. For now, the student simply wants to glance at them. Being familiar with the visuals now will be useful when the student wants to understand them when reading.

When looking at a visual, here are some things to look for

  • Title
  • Author
  • Date
  • What is being measured
  • Scale (units of measurement)

For an initial, superficial glance, this is more than enough.

Make Questions, Read, and Answer 

After examining the text, the student should have questions about what the text is about. Now they should write down what they want to know after examining the various characteristics of the chapter, and then begin to read so they can answer their questions.

Examine End of the Chapter Tools

After reading the chapter, many authors provide some sort of study tools at the end. I find it most useful to read the chapter before looking too closely at this information. The reason for this is that the summary and questions at the end indicate what the author thinks is important about the chapter. It’s hard to appreciate this if you did not read the chapter yet.

Knowing what is at the end of the chapter helps in reinforcing what you read. You can quiz yourself on the information and use it to prepare for any exams.

Conclusion

Previewing a chapter is a strategy for understanding a chapter. The ideas a student reads about must have a framework in which the pieces can fit. This framework can be developed through examining the chapter before reading it in detail.

Linear Regression vs Bayesian Regression

In this post, we are going to look at Bayesian regression. In particular, we will compare the results of ordinary least squares regression with Bayesian regression.

Bayesian Statistics

Bayesian statistics involves the use of probabilities rather than frequencies when addressing uncertainty. This allows you to determine the distribution of the model parameters and not just point values. This is done by averaging over the model parameters, that is, by marginalizing the joint probability distribution.

Linear Regression

We will now develop our two models. The first model will be a normal regression and the second a Bayesian model. We will be looking at factors that affect the tax rate of homes in the “Hedonic” dataset in the “Ecdat” package. We will load our packages and partition our data. Below is some initial code

library(ISLR);library(caret);library(arm);library(Ecdat);library(gridExtra)
data("Hedonic")
inTrain<-createDataPartition(y=Hedonic$tax,p=0.7, list=FALSE)
trainingset <- Hedonic[inTrain, ]
testingset <- Hedonic[-inTrain, ]
str(Hedonic)
## 'data.frame':    506 obs. of  15 variables:
##  $ mv     : num  10.09 9.98 10.45 10.42 10.5 ...
##  $ crim   : num  0.00632 0.02731 0.0273 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 ...
##  $ chas   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  28.9 22 22 21 21 ...
##  $ rm     : num  43.2 41.2 51.6 49 51.1 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 ...
##  $ dis    : num  1.41 1.6 1.6 1.8 1.8 ...
##  $ rad    : num  0 0.693 0.693 1.099 1.099 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 ...
##  $ blacks : num  0.397 0.397 0.393 0.395 0.397 ...
##  $ lstat  : num  -3 -2.39 -3.21 -3.53 -2.93 ...
##  $ townid : int  1 2 2 3 3 3 4 4 4 4 ...

We will now create our regression model

ols.reg<-lm(tax~.,trainingset)
summary(ols.reg)
## 
## Call:
## lm(formula = tax ~ ., data = trainingset)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -180.898  -35.276    2.731   33.574  200.308 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 305.1928   192.3024   1.587  0.11343    
## mv          -41.8746    18.8490  -2.222  0.02697 *  
## crim          0.3068     0.6068   0.506  0.61339    
## zn            1.3278     0.2006   6.618 1.42e-10 ***
## indus         7.0685     0.8786   8.045 1.44e-14 ***
## chasyes     -17.0506    15.1883  -1.123  0.26239    
## nox           0.7005     0.4797   1.460  0.14518    
## rm           -0.1840     0.5875  -0.313  0.75431    
## age           0.3054     0.2265   1.349  0.17831    
## dis          -7.4484    14.4654  -0.515  0.60695    
## rad          98.9580     6.0964  16.232  < 2e-16 ***
## ptratio       6.8961     2.1657   3.184  0.00158 ** 
## blacks      -29.6389    45.0043  -0.659  0.51061    
## lstat       -18.6637    12.4674  -1.497  0.13532    
## townid        1.1142     0.1649   6.758 6.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 63.72 on 341 degrees of freedom
## Multiple R-squared:  0.8653, Adjusted R-squared:  0.8597 
## F-statistic: 156.4 on 14 and 341 DF,  p-value: < 2.2e-16

The model does a reasonable job. Next, we will do our prediction and compare the results with the test set using correlation, summary statistics, and the mean absolute error. In the code below, we use the “predict.lm” function and include the arguments “interval” for the prediction as well as “se.fit” for the standard error.

ols.regTest<-predict.lm(ols.reg,testingset,interval = 'prediction',se.fit = T)

Below is the code for the correlation, summary stats, and mean absolute error. For MAE, we need to create a function.

cor(testingset$tax,ols.regTest$fit[,1])
## [1] 0.9313795
summary(ols.regTest$fit[,1])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   144.7   288.3   347.6   399.4   518.4   684.1
summary(trainingset$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   188.0   279.0   330.0   410.4   666.0   711.0
MAE<-function(actual, predicted){
        mean(abs(actual-predicted))
}
MAE(ols.regTest$fit[,1], testingset$tax)
## [1] 41.07212

The correlation is excellent. The summary stats are similar and the error is not unreasonable. Below we will plot the actual and predicted values.

We now need to combine some data into one dataframe. In particular, we need the following

  • the actual dependent variable results
  • the predicted dependent variable results
  • the upper confidence value of the prediction
  • the lower confidence value of the prediction

The code is below

yout.ols <- as.data.frame(cbind(testingset$tax,ols.regTest$fit))
ols.upr <- yout.ols$upr
ols.lwr <- yout.ols$lwr

We can now plot this

p.ols <- ggplot(data = yout.ols, aes(x = testingset$tax, y = ols.regTest$fit[,1])) + geom_point() + ggtitle("Ordinary Regression") + labs(x = "Actual", y = "Predicted")
p.ols + geom_errorbar(ymin = ols.lwr, ymax = ols.upr)


You can see the strong linear relationship. However, the confidence intervals are rather wide. Let’s see how Bayes does.

Bayes Regression

Bayes regression uses the “bayesglm” function from the “arm” package. We need to set the family to “gaussian” and the link to “identity”. In addition, we have to set the “prior.df” (prior degrees of freedom) argument to infinity, as this indicates we want a normal prior distribution.

bayes.reg<-bayesglm(tax~.,family=gaussian(link=identity),trainingset,prior.df = Inf)
bayes.regTest<-predict.glm(bayes.reg,newdata = testingset,se.fit = T)
cor(testingset$tax,bayes.regTest$fit)
## [1] 0.9313793
summary(bayes.regTest$fit)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   144.7   288.3   347.5   399.4   518.4   684.1
summary(trainingset$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   188.0   279.0   330.0   410.4   666.0   711.0
MAE(bayes.regTest$fit, testingset$tax)
## [1] 41.07352

The numbers are essentially the same. This leads to the question of what is the benefit of Bayesian regression? The answer is in the confidence intervals. Next, we will calculate the confidence intervals for the Bayesian model.

yout.bayes <- as.data.frame(cbind(testingset$tax,bayes.regTest$fit))
names(yout.bayes) <- c("tax", "fit")
critval <- 1.96 #approx for 95% CI
bayes.upr <- bayes.regTest$fit + critval * bayes.regTest$se.fit
bayes.lwr <- bayes.regTest$fit - critval * bayes.regTest$se.fit

We now create our Bayesian regression plot.

p.bayes <- ggplot(data = yout.bayes, aes(x = yout.bayes$tax, y = yout.bayes$fit)) + geom_point() + ggtitle("Bayesian Regression Prediction") + labs(x = "Actual", y = "Predicted")

Lastly, we display both plots as a comparison.

ols.plot <-  p.ols + geom_errorbar(ymin = ols.lwr, ymax = ols.upr)
bayes.plot <-  p.bayes + geom_errorbar(ymin = bayes.lwr, ymax = bayes.upr)
grid.arrange(ols.plot,bayes.plot,ncol=2)


As you can see, the Bayesian approach gives much more compact confidence intervals. This is because in the Bayesian approach a distribution of parameters is calculated from the posterior distribution. These values are then averaged to get the final prediction that appears on the plot. This reduces the variance and strengthens the confidence we can have in each individual example.
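If you want to look at the parameter distributions directly, the “arm” package (already loaded above) includes a “sim” function that draws simulations from the approximate posterior of the coefficients. Below is a minimal sketch of how this might be done; treat the exact accessor as an assumption about arm’s interface.

bayes.sims <- sim(bayes.reg, n.sims = 1000)  # 1,000 draws from the posterior of the coefficients
posterior.coefs <- coef(bayes.sims)          # assumed accessor: a matrix with one row per draw
apply(posterior.coefs, 2, quantile, probs = c(0.025, 0.5, 0.975))  # approximate 95% credible intervals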

Review of “Usborne World of Animals”

The Usborne World of Animals was written by Susanna Davidson and Mike Unwin (pp. 128).

The Summary

This book is about animals and how they live in the world. The book has ten sections. The first section covers topics about how animals live in general. Some of the topics in this section include how animals move, eat, smell, taste, touch, hide, etc.

The next eight sections cover different animals in different regions of the world. Examples include toucans in South America, bears in North America, gorillas in Africa, otters in Europe, panda bears in Asia, kangaroos in Australia, and even elephant seals in Antarctica.

The Good

This book is full of rich photographs and even illustrations that provide additional learning. The photos depict animals in daily life such as a tiger running, polar bears playing, anteaters searching for food, bats sleeping, monkeys jumping, etc. Children will enjoy the pictures tremendously.

The text is fairly readable. The font is normally large, with smaller text being of less importance. There is even a little geography mixed in, as the book organizes the animals based on the region they are from. At the beginning of each section is a map showing where on the continent the animals are from.

The Bad

There is little to criticize about this book. One minor problem is the maps are drawn way out of scale. Asia, in particular, looks really strange. Of course, this is not a geography book but it is distracting somewhat in the learning experience.

Another small complaint could be the superficial nature of the text. There are more animals than there is time to really go deeply into. Again, for an expert this may be troublesome, but it may not be much of a problem for the typical child.

The Recommendation

This text is 5/5 stars. As a teacher, you can use it for reading to your students or add it to your library for personal reading. The photos and colors will provide a vivid learning experience for students for years to come.

Common Task in Machine Learning

Machine learning is used for a variety of tasks today, with a multitude of algorithms that can each do one or more of these tasks well. In this post, we will look at some of the most common tasks that machine learning algorithms perform. In particular, we will look at the following tasks.

  1. Regression
  2. Classification
  3. Forecasting
  4. Clustering
  5. Association rules
  6. Dimension reduction

Numbers 1-3 are examples of supervised learning, which is learning that involves a dependent variable. Numbers 4-6 are unsupervised which is learning that does not involve a clearly labeled dependent variable.

Regression

Regression involves understanding the relationship between a continuous dependent variable and categorical and continuous independent variables. Understanding this relationship allows for numeric prediction of the dependent continuous variable.

Example algorithms for regression include linear regression and random forest for numeric prediction, as well as support vector machines and artificial neural networks.

Classification

Classification involves the use of a categorical dependent variable with continuous and or categorical independent variables. The purpose is to classify examples into the groups in the dependent variable.

Examples of this are logistic regression as well as all the algorithms mentioned for regression. Many algorithms can do both regression and classification.

Forecasting

Forecasting is similar to regression. However, the difference is that the data is a time series. The goal remains the same of predicting future outcomes based on current available data. As such, a slightly different approach is needed because of the type of data involved.

Common algorithms for forecasting include ARIMA and even artificial neural networks.
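As a quick sketch with base R, the classic “airline” ARIMA model can be fit to the built-in “AirPassengers” series and used to forecast the next year.

data(AirPassengers)
air.fit <- arima(log(AirPassengers), order = c(0, 1, 1),
                 seasonal = list(order = c(0, 1, 1), period = 12))  # seasonal ARIMA on the log scale
predict(air.fit, n.ahead = 12)$pred                                 # forecast of the next 12 months (log scale)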

Clustering

Clustering involves grouping together items that are similar in a dataset. This is done by detecting patterns in the data. The problem is that the number of clusters needed is usually not known in advance, which leads to a trial-and-error approach if there is no other theoretical support.

Common clustering algorithms include k-means and hierarchical clustering. Latent Dirichlet allocation is used often in text mining applications.
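Below is a minimal k-means sketch on the numeric columns of the built-in “iris” data; in practice, the number of centers would be chosen by trial and error or theory as mentioned above.

data(iris)
set.seed(123)
iris.km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)  # 3 clusters, 25 random starts
table(iris.km$cluster, iris$Species)                      # compare the clusters to the actual species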

Association Rules

Association rules find items that occur together in a dataset. A common application of association rules is market basket analysis.

Common algorithms include Apriori and frequent pattern (FP-growth) algorithms.
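A minimal sketch with the “arules” package (assuming it is installed) is below, using its built-in “Groceries” transaction data.

library(arules)
data(Groceries)
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))  # mine association rules
inspect(head(sort(rules, by = "lift"), 3))                                       # the three strongest rules by lift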

Dimension Reduction

Dimension reduction involves combining several redundant features into one or more components that capture the majority of the variance. Reducing the number of features can increase the speed of the computation as well as reduce the risk of overfitting.

In machine learning, principal component analysis is often used for dimension reduction. However, factor analysis is sometimes used as well.
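Below is a short sketch of principal component analysis with base R’s “prcomp” on the numeric columns of “iris”.

iris.pca <- prcomp(iris[, 1:4], scale. = TRUE)  # scale the variables before extracting components
summary(iris.pca)                               # proportion of variance captured by each component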

Conclusion

In machine learning, there is always an appropriate tool for the job. This post provided insight into the main tasks of machine learning as well as the algorithms suited to each situation.

Terms Related to Language

This post will examine different uses of the word language. There are several different ways that this word can be defined. We will look at the following terms for language.

  • Vernacular
  • Standard
  • National
  • Official
  • Lingua Franca

Vernacular Language

The term vernacular language can mean many different things. It can mean a language that is not standardized or a language that is not the standard language of a nation. Generally, a vernacular language is a language that lacks official status in a country.

Standard Language

A standard language is a language that has been codified. By this, it is meant that the language has dictionaries and other grammatical sources that describe and even prescribe the use of the language.

Most languages have experienced codification. However, codification is just one part of being a standard language. A language must also be perceived as prestigious and serve a high function.

By prestigious it is meant that the language has influence in a community. For example, Japanese is a prestigious language in Japan. By high function, it is meant that the language is used in official settings such as government, business, etc., which Japanese is used for.

National Language

A national language is a language used for political and cultural reasons to unite a people. Many countries that have a huge number of languages and ethnic groups will select one language as a way to forge an identity. For example, in the Philippines, the national language is Tagalog even though hundreds of other languages are spoken.

In Myanmar, Burmese is the national language even though dozens of other languages are spoken. The selection of the language is politically motivated, with the dominant group imposing their language on others.

Official Language

An official language is the language of government business. Many formerly colonized nations still use an official language that comes from the people who colonized them. This is especially true in African countries such as Ivory Coast and Chad, which use French as their official language despite having other indigenous languages available.

Lingua Franca

A lingua franca is a language that serves as a vehicle of communication between two language groups whose mother tongues are different. For example, English is often the de facto lingua franca of people who do not speak the same language.

Multiple Categories

A language can fit into more than one of the definitions above. For example, English is a vernacular language in many countries such as Thailand and Malaysia. However, English is not considered a vernacular language in the United States.

To make things more confusing, English is the dominant language of the United States, but it is neither the national nor the official language, as this has never been legislated. Yet English is a standard language, as it has been codified and meets the other criteria for standardization.

Currently, English is viewed by many as an international Lingua Franca with a strong influence on the world today.

Lastly, a language can be in more than one category. Thai is the official, national, and standard language of Thailand.

Conclusion

Language is a term that can have many meanings. In this post, we looked at the different ways to understand this word.