# Understanding Classification Trees Using R

Classification trees are similar to regression trees except that the measure of fit is not the residual sum of squares but rather the error rate. The strange thing about classification trees is that you can continue to gain information by splitting the tree without necessarily improving the misclassification rate. This information gain is measured with a purity statistic called the Gini coefficient.

The Gini coefficient for a node is calculated from the class proportions: it is one minus the sum of the squared proportions. For example, if we have a model that is 80% accurate with a 20% error rate, the Gini coefficient for a single node is calculated as follows

``````n0gini<- 1 - (((8/10)^2) + ((2/10)^2))
n0gini``````
``## [1] 0.32``

Now suppose we split the ten examples into two nodes: six in the first (five classified correctly, one not) and four in the second (three correct, one not). The overall accuracy is still 80%, but notice the change in the Gini coefficient

``````n1gini<-1-(((5/6)^2)+((1/6)^2))
n2gini<-1-(((3/4)^2)+((1/4)^2))
newgini<-(.6*n1gini) + (.4*n2gini)
newgini``````
``## [1] 0.3166667``

The lower the Gini coefficient the better, as it measures purity. In the example, the accuracy does not improve, yet the Gini coefficient does. Therefore, classification trees optimize purity rather than the residual sum of squares.
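The calculation can be wrapped in a small helper (a sketch; `gini` is a name chosen here, not a function from any package):

```r
# Gini coefficient (impurity) for a vector of class proportions
gini <- function(p) 1 - sum(p^2)

gini(c(0.5, 0.5)) # 0.5, the maximum impurity for two classes
gini(c(1, 0))     # 0, a perfectly pure node
```

A split is then judged by the weighted average of `gini` across the child nodes, weighted by the share of examples each node receives.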

In this post, we will make a classification tree to predict if someone is participating in the labor market. We will do this using the “Participation” dataset from the “Ecdat” package. Below is some initial code to get started.

``library(Ecdat);library(rpart);library(partykit)``
``````data(Participation)
str(Participation)``````
``````## 'data.frame':    872 obs. of  7 variables:
##  \$ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  \$ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  \$ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  \$ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  \$ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  \$ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  \$ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...``````

The ‘age’ feature needs to be transformed, since it is doubtful that the survey was conducted among 4- and 5-year-olds; we need to multiply this variable by ten. In addition, the “lnnlinc” feature is the log of income, and since we want actual income we will exponentiate it. Below is the code for these two steps.

``````Participation\$age<-10*Participation\$age #normal age
Participation\$lnnlinc<-exp(Participation\$lnnlinc) #actual income not log``````

We will now create our training and testing datasets with the code below.

``````set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]``````

We can now create our classification tree and take a look at the output

``````tree.pros<-rpart(lfp~.,data = train)
tree.pros``````
``````## n= 636
##
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
##
##   1) root 636 295 no (0.5361635 0.4638365)
##     2) foreign=no 471 182 no (0.6135881 0.3864119)
##       4) nyc>=0.5 99  21 no (0.7878788 0.2121212) *
##       5) nyc< 0.5 372 161 no (0.5672043 0.4327957)
##        10) age>=49.5 110  25 no (0.7727273 0.2272727) *
##        11) age< 49.5 262 126 yes (0.4809160 0.5190840)
##          22) lnnlinc>=46230.43 131  50 no (0.6183206 0.3816794)
##            44) noc>=0.5 102  34 no (0.6666667 0.3333333) *
##            45) noc< 0.5 29  13 yes (0.4482759 0.5517241)
##              90) lnnlinc>=47910.86 22  10 no (0.5454545 0.4545455)
##               180) lnnlinc< 65210.78 12   3 no (0.7500000 0.2500000) *
##               181) lnnlinc>=65210.78 10   3 yes (0.3000000 0.7000000) *
##              91) lnnlinc< 47910.86 7   1 yes (0.1428571 0.8571429) *
##          23) lnnlinc< 46230.43 131  45 yes (0.3435115 0.6564885) *
##     3) foreign=yes 165  52 yes (0.3151515 0.6848485)
##       6) lnnlinc>=56365.39 16   5 no (0.6875000 0.3125000) *
##       7) lnnlinc< 56365.39 149  41 yes (0.2751678 0.7248322) *``````

In the text above, the first split is made on the feature “foreign”, which is a yes or no value: 471 respondents were not foreigners while 165 were. The purity at this point is not great: 61% of non-foreigners fall in the majority class “no” and 68% of foreigners fall in the majority class “yes”. For the 165 classified as foreigners, the next split is by income, and so on. This text output is hard to read, so below is a diagram of the same tree.

``plot(as.party(tree.pros))``

We now need to determine whether pruning the tree is beneficial. We do this by looking at the cost complexity table. Below is the code.

``tree.pros\$cptable``
``````##           CP nsplit rel error    xerror       xstd
## 1 0.20677966      0 1.0000000 1.0000000 0.04263219
## 2 0.04632768      1 0.7932203 0.7932203 0.04122592
## 3 0.02033898      4 0.6542373 0.6677966 0.03952891
## 4 0.01016949      5 0.6338983 0.6881356 0.03985120
## 5 0.01000000      8 0.6033898 0.6915254 0.03990308``````

The “rel error” column indicates that our model is weak no matter how many splits: even with eight splits, the relative error is still 60%. Below is a plot of the table above.

``plotcp(tree.pros)``

Based on the table and the plot above, we will prune the tree to five splits (row 4 of the table). Note that the lowest cross-validated error (“xerror”, not “xstd”) actually occurs at four splits (row 3), but the difference is small. Below is the code for pruning the tree followed by a plot of the modified tree.

``````cp<-tree.pros\$cptable[4,"CP"] #the cp value from row 4 of the table
pruned.tree.pros<-prune(tree.pros,cp=cp)
plot(as.party(pruned.tree.pros))``````

If you compare the two trees we have developed, one of the main differences is that the pruned tree is missing the “noc” (number of older children) variable. There are also fewer splits on the income variable (lnnlinc). We can now use the pruned tree with the test dataset.

``````party.pros.test<-predict(pruned.tree.pros,newdata=test,type="class")
table(party.pros.test,test\$lfp)``````
``````##
## party.pros.test no yes
##             no  90  41
##             yes 40  65``````

Now for the accuracy

``(90+65) / (90+41+40+65)``
``## [1] 0.6567797``

This is surprisingly high compared to the results for the training set, but 65% is not great. However, this is fine for a demonstration.
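Equivalently, the accuracy can be computed from the confusion matrix itself rather than typing the counts by hand. A small sketch rebuilding the matrix from the printed counts:

```r
# the confusion matrix from the table above
tab <- matrix(c(90, 40, 41, 65), nrow = 2,
              dimnames = list(predicted = c("no", "yes"),
                              actual = c("no", "yes")))
# correct predictions sit on the diagonal
sum(diag(tab)) / sum(tab) # 0.6567797
```

With the objects from the code above, the same result comes from `sum(diag(tab)) / sum(tab)` where `tab <- table(party.pros.test, test$lfp)`.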

Conclusion

Classification trees are one of many useful tools available for data analysis. When developing classification trees, one of the key ideas to keep in mind is pruning, as this affects the complexity of the model.

# Numeric Prediction with Support Vector Machines in R

In this post, we will look at support vector machines for numeric prediction. SVM is used for both classification and numeric prediction. The advantage of SVM for numeric prediction is that it automatically creates higher-dimensional transformations of the features and summarizes this in the output. In other words, unlike in regression, where you have to decide for yourself how to modify your features, SVM does this automatically using different kernels.

Different kernels transform the features in different ways, and the cost function determines the penalty for an example being on the wrong side of the margin developed by the kernel. Remember that SVM draws lines and separators to divide the examples; examples on the wrong side are penalized to a degree determined by the researcher.

Just like with regression, the model with the least error is generally the best. As such, the purpose of this post is to use SVM to predict income in the “Mroz” dataset from the “Ecdat” package. We will use several different kernels that transform the features in different ways and calculate the mean squared error to determine the most appropriate model. Below is some initial code.

``library(caret);library(e1071);library(corrplot);library(Ecdat)``
``````data(Mroz)
str(Mroz)``````
``````## 'data.frame':    753 obs. of  18 variables:
##  \$ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  \$ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  \$ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  \$ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  \$ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  \$ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  \$ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  \$ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  \$ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  \$ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  \$ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  \$ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  \$ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  \$ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  \$ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  \$ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  \$ experience: int  14 5 15 6 7 33 11 35 24 21 ...``````

We need to place the factor variables next to each other, as this makes them easier to remove when we scale the data. We must scale the data because SVM is based on distance when making calculations; if the features are on different scales, the larger scale will have more influence on the results. Below is the code.

``````mroz.scale<-Mroz[,c(17,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,18)]
mroz.scale<-as.data.frame(scale(mroz.scale[,c(-1,-2)])) #remove factor variables for scaling
mroz.scale\$city<-Mroz\$city # add factor variable back into the dataset
mroz.scale\$work<-Mroz\$work # add factor variable back into the dataset
#mroz.cor<-cor(mroz.scale[,-17:-18])
#corrplot(mroz.cor,method='number', col='black')``````

Below is the code for creating the train and test datasets.

``````set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]``````

Linear Kernel

Our first kernel is the linear kernel. Below is the code. We use the “tune.svm” function from the “e1071” package. We set the kernel to “linear” and pick our own values for the cost function; the numbers for the cost function can be whatever you want. Also, keep in mind that R will produce six different models because we have six different values in the “cost” argument.

The process we are using to develop the models is as follows

1. Set the seed
2. Develop the initial model by setting the formula, dataset, kernel, cost function, and other needed information.
3. Select the best model for the test set
4. Predict with the best model
5. Plot the predicted and actual results
6. Calculate the mean squared error

The first time we will go through this process step-by-step. However, all future models will just have the code followed by an interpretation.

``````linear.tune<-tune.svm(income~.,data=train,kernel="linear",cost = c(.001,.01,.1,1,5,10))
summary(linear.tune)``````
``````##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  cost
##    10
##
## - best performance: 0.3492453
##
## - Detailed performance results:
##    cost     error dispersion
## 1 1e-03 0.6793025  0.2285748
## 2 1e-02 0.3769298  0.1800839
## 3 1e-01 0.3500734  0.1626964
## 4 1e+00 0.3494828  0.1618478
## 5 5e+00 0.3493379  0.1611353
## 6 1e+01 0.3492453  0.1609774``````

The best model had a cost = 10 with a performance of .35. We will select the best model and use this on our test data. Below is the code.

``````best.linear<-linear.tune\$best.model
tune.test<-predict(best.linear,newdata=test)``````

Now we will create a plot so we can see how well our model predicts. In addition, we will calculate the mean squared error to have an actual number for our model’s performance.

``plot(tune.test,test\$income)``

``````tune.test.resid<-tune.test-test\$income
mean(tune.test.resid^2)``````
``## [1] 0.215056``

The model looks good in the plot. However, we cannot tell whether the error number is decent until it is compared to other models.

Polynomial Kernel

The next kernel we will use is the polynomial kernel. It requires two parameters: the degree of the polynomial (3, 4, 5, etc.) and the kernel coefficient. Below is the code.

``````set.seed(123)
poly.tune<-tune.svm(income~.,data = train,kernel="polynomial",degree = c(3,4,5),coef0 = c(.1,.5,1,2,3,4))
best.poly<-poly.tune\$best.model
poly.test<-predict(best.poly,newdata=test)
plot(poly.test,test\$income)``````

``````poly.test.resid<-poly.test-test\$income
mean(poly.test.resid^2)``````
``## [1] 0.2453022``

The polynomial kernel performs slightly worse, adding a small amount of error.

Radial Kernel

Next, we will use the radial kernel. One thing that is new here is the need for a parameter called gamma. Below is the code.

``````set.seed(123)
rbf.tune<-tune.svm(income~.,data=train,kernel="radial",gamma=c(.1,.5,1,2,3,4))
summary(rbf.tune)``````
``````##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma
##    0.1
##
## - best performance: 0.5225952
##
## - Detailed performance results:
##   gamma     error dispersion
## 1   0.1 0.5225952  0.4183170
## 2   0.5 0.9743062  0.5293211
## 3   1.0 1.0475714  0.5304482
## 4   2.0 1.0582550  0.5286129
## 5   3.0 1.0590367  0.5283465
## 6   4.0 1.0591208  0.5283059``````
``````best.rbf<-rbf.tune\$best.model
rbf.test<-predict(best.rbf,newdata=test)
plot(rbf.test,test\$income)``````

``````rbf.test.resid<-rbf.test-test\$income
mean(rbf.test.resid^2)``````
``## [1] 0.3138517``

The radial kernel is worse than the linear and polynomial kernels. However, there is not much difference in the performance of the models so far.

Sigmoid Kernel

Next, we will try the sigmoid kernel. The sigmoid kernel relies on a “gamma” parameter and a kernel coefficient (“coef0”). Below is the code.

``````set.seed(123)
sigmoid.tune<-tune.svm(income~., data=train,kernel="sigmoid",gamma = c(.1,.5,1,2,3,4),coef0 = c(.1,.5,1,2,3,4))
summary(sigmoid.tune)``````
``````##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
##  gamma coef0
##    0.1     3
##
## - best performance: 0.8759507
##
## - Detailed performance results:
##    gamma coef0        error  dispersion
## 1    0.1   0.1   27.0808221   6.2866615
## 2    0.5   0.1  746.9235624 129.0224096
## 3    1.0   0.1 1090.9660708 198.2993895
## 4    2.0   0.1 1317.4497885 214.7997608
## 5    3.0   0.1 1339.8455047 180.3195491
## 6    4.0   0.1 1299.7469190 201.6901577
## 7    0.1   0.5  151.6070833  38.8450961
## 8    0.5   0.5 1221.2396575 335.4320445
## 9    1.0   0.5 1225.7731007 190.7718103
## 10   2.0   0.5 1290.1784238 216.9249899
## 11   3.0   0.5 1338.1069460 223.3126800
## 12   4.0   0.5 1261.8861304 300.0001079
## 13   0.1   1.0  162.6041229  45.3216740
## 14   0.5   1.0 2276.4330973 330.1739559
## 15   1.0   1.0 2036.4791854 335.8051736
## 16   2.0   1.0 1626.4347749 290.6445164
## 17   3.0   1.0 1333.0626614 244.4424896
## 18   4.0   1.0 1343.7617925 194.2220729
## 19   0.1   2.0   19.2061993   9.6767496
## 20   0.5   2.0 2504.9271757 583.8943008
## 21   1.0   2.0 3296.8519140 542.7903751
## 22   2.0   2.0 2376.8169815 398.1458855
## 23   3.0   2.0 1949.9232179 319.6548059
## 24   4.0   2.0 1758.7879267 313.2581011
## 25   0.1   3.0    0.8759507   0.3812578
## 26   0.5   3.0 1405.9712578 389.0822797
## 27   1.0   3.0 3559.4804854 843.1905348
## 28   2.0   3.0 3159.9549029 492.6072149
## 29   3.0   3.0 2428.1144437 412.2854724
## 30   4.0   3.0 1997.4596435 372.1962595
## 31   0.1   4.0    0.9543167   0.5170661
## 32   0.5   4.0  746.4566494 201.4341061
## 33   1.0   4.0 3277.4331302 527.6037421
## 34   2.0   4.0 3643.6413379 604.2778089
## 35   3.0   4.0 2998.5102806 471.7848740
## 36   4.0   4.0 2459.7133632 439.3389369``````
``````best.sigmoid<-sigmoid.tune\$best.model
sigmoid.test<-predict(best.sigmoid,newdata=test)
plot(sigmoid.test,test\$income)``````

``````sigmoid.test.resid<-sigmoid.test-test\$income
mean(sigmoid.test.resid^2)``````
``## [1] 0.8004045``

The sigmoid kernel performed much worse than the other models based on the error metric. You can further see the problems with this model in the plot above.

Conclusion

The final results are as follows

• Linear kernel .21
• Polynomial kernel .24
• Radial kernel .31
• Sigmoid kernel .80

Which model to select depends on the goals of the study. However, it definitely looks as though you would be picking from among the first three models. The power of SVM is the ability to use different kernels to uncover different results without having to really modify the features yourself.

# How to Enroll Users in a Moodle Course VIDEO

Enrolling users in a Moodle course video

# The Beginnings of English

What we now know as English today has a long and complex history. With any subject that is complex, it is necessary to pick a starting point and work from there. For this post, we will date the origins of English from the early 5th century.

Early History

English was not born in England. Rather, it came to England through the invasion of Germanic warriors. These “barbarian” hordes pushed the indigenous Celts and Britons almost into the ocean.

However, it was not only war and conquest that brought English; its roots also arrived through the immigration of farmers. Either way, English slowly grew to be one of the prominent languages of England.

In the late sixth century, the Roman Catholic Church came to England. This left a mark on English in the various words taken from Latin and Greek. Such words as “angels”, “pope”, and “minister” all arrived through the Catholic Church.

Vikings and Alfred the Great

By the 8th and 9th centuries, the Vikings were invading lands all over Europe. It was the Danes in particular who almost wiped out the inhabitants of England. However, thanks to the craftiness of Alfred the Great, the Danes were defeated, and their leader Guthrum was so shocked at Alfred’s comeback victory that he was baptized and became a Christian.

Alfred set to work using the English language to unite the people. He supported education in the English language and the use of the language in general. Furthermore, to try to prevent future conflicts with the Danes, Alfred gave them some territory, called the “Danelaw”, where they could live. Naturally, staying in the area meant that the Danish language had an effect on English as well.

Alfred also supported religion. Thanks to the Viking invasions, there were almost no priests left in the entire country; Alfred could barely find a priest who could read Latin. Without religious scholarship, there could be no passing on of religious teachings. This led Alfred to encourage the translation of books in other languages (like Latin) into English.

Conclusion

The story of English is not one continuous rise to prominence. The language experienced several ups and downs in England; for example, there was a time when the French language almost overran the country. But that is a story for another day.

# Regression Tree Development in R

In this post, we will take a look at regression trees. Regression trees use a concept called recursive partitioning. Recursive partitioning involves splitting features in a way that reduces the error the most.

The splitting is also greedy, which means that the algorithm partitions the data at one point without considering how it will affect future partitions. Ignoring how a current split affects future splits can lead to unnecessary branches with high variance and low bias.

One of the main strengths of regression trees is their ability to deal with nonlinear relationships. However, predictive performance can be hurt when a particular example is assigned the mean of a node. This forced assignment loses information, much like turning a continuous variable into a categorical one.
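This loss of granularity can be seen in a short sketch (toy data, unrelated to the dataset used below): a regression tree can only output as many distinct predictions as it has terminal nodes.

```r
library(rpart)

set.seed(1)
toy <- data.frame(x = runif(200), y = rnorm(200))
fit <- rpart(y ~ x, data = toy)

# every observation is assigned its node's mean, so the number of
# distinct predictions equals the number of terminal nodes
length(unique(predict(fit))) # far fewer than the 200 observations
```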

In this post, we will use the “Participation” dataset from the “Ecdat” package to predict income based on the other variables in the dataset. Below is some initial code.

``library(rpart);library(partykit);library(caret);library(Ecdat)``
``````data("Participation")
str(Participation)``````
``````## 'data.frame':    872 obs. of  7 variables:
##  \$ lfp    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 2 1 2 1 1 ...
##  \$ lnnlinc: num  10.8 10.5 11 11.1 11.1 ...
##  \$ age    : num  3 4.5 4.6 3.1 4.4 4.2 5.1 3.2 3.9 4.3 ...
##  \$ educ   : num  8 8 9 11 12 12 8 8 12 11 ...
##  \$ nyc    : num  1 0 0 2 0 0 0 0 0 0 ...
##  \$ noc    : num  1 1 0 0 2 1 0 2 0 2 ...
##  \$ foreign: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...``````

There are several things we need to do to make the results easier to interpret. The “age” variable needs to be multiplied by ten, as it currently shows results such as 4.5, 3, etc. Common sense indicates that three- and four-year-olds are not earning an income.

In addition, we need to convert our income variable (lnnlinc) from the log of income to regular income. This will also help in understanding the results. Below is the code.

``````Participation\$age<-10*Participation\$age #normal age
Participation\$lnnlinc<-exp(Participation\$lnnlinc) #actual income not log``````

The next step is to create our training and testing data sets. Below is the code.

``````set.seed(502)
ind=sample(2,nrow(Participation),replace=T,prob=c(.7,.3))
train<-Participation[ind==1,]
test<-Participation[ind==2,]``````

We can now develop our model. We will also use the ‘print’ command

``reg.tree<-rpart(lnnlinc~.,data = train)``

Below is a printout of the current tree

``reg.tree``
``````## n= 636
##
## node), split, n, deviance, yval
##       * denotes terminal node
##
##   1) root 636 390503700000  48405.08
##     2) educ< 11.5 473 127460900000  43446.69
##       4) educ< 9.5 335  70269440000  40758.25
##         8) foreign=yes 129  10617380000  36016.12 *
##         9) foreign=no 206  54934520000  43727.84 *
##       5) educ>=9.5 138  48892370000  49972.98 *
##     3) educ>=11.5 163 217668400000  62793.52
##       6) age< 34.5 79  34015680000  51323.86
##        12) age< 25.5 12    984764800  34332.97 *
##        13) age>=25.5 67  28946170000  54367.01 *
##       7) age>=34.5 84 163486000000  73580.46
##        14) lfp=yes 36  23888410000  58916.66 *
##        15) lfp=no 48 126050900000  84578.31
##          30) educ< 12.5 29  86940400000  74425.51
##            60) age< 38.5 8    763764600  57390.34 *
##            61) age>=38.5 21  82970650000  80915.10
##             122) age>=44 14  34091840000  68474.57 *
##             123) age< 44 7  42378600000 105796.20 *
##          31) educ>=12.5 19  31558550000 100074.70 *``````

I will not interpret all of this, but here is a brief description using nodes 2, 4, and 8: if a person has less than 11.5 years of education (473 qualify), less than 9.5 years of education (335 of the 473), and is a foreigner (129 of the 335), then their average salary is 36,016.12 dollars.

Perhaps now you can see how some data is lost: the average salary for people in this node is 36,016.12 dollars, but probably nobody earns exactly this amount.

If what I said does not make sense, here is an actual plot of the current regression tree.

``plot(as.party(reg.tree))``

The little boxes at the bottom are boxplots of that node.

Tree modification

We now will make modifications to the tree. We will begin by examining the cptable. Below is the code

``reg.tree\$cptable``
``````##           CP nsplit rel error    xerror      xstd
## 1 0.11619458      0 1.0000000 1.0026623 0.1666662
## 2 0.05164297      1 0.8838054 0.9139383 0.1434768
## 3 0.03469034      2 0.8321625 0.9403669 0.1443843
## 4 0.02125215      3 0.7974721 0.9387060 0.1433101
## 5 0.01933892      4 0.7762200 0.9260030 0.1442329
## 6 0.01242779      5 0.7568810 0.9097011 0.1434606
## 7 0.01208066      7 0.7320255 0.9166627 0.1433779
## 8 0.01046022      8 0.7199448 0.9100704 0.1432901
## 9 0.01000000      9 0.7094846 0.9107869 0.1427025``````

The cptable shares a lot of information. “CP” stands for cost complexity parameter and is the column furthest to the left; this number decreases as the tree becomes more complex. “nsplit” indicates the number of splits in the tree. “rel error” is the training error relative to the root node (based on the residual sum of squares). “xerror” and “xstd” are the cross-validated average error and the standard deviation of that error, respectively.

One thing we can see from the cptable is that nine splits has the lowest training error (“rel error”), while the lowest cross-validated error (“xerror”) occurs at five splits (row 6).
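R can find the minimum-xerror row for us. A sketch, using the xerror column exactly as printed in the table above:

```r
# the xerror column from the cptable printed above
xerror <- c(1.0026623, 0.9139383, 0.9403669, 0.9387060, 0.9260030,
            0.9097011, 0.9166627, 0.9100704, 0.9107869)
which.min(xerror) # row 6, i.e., the five-split tree
```

In practice this would be `which.min(reg.tree$cptable[, "xerror"])`; visual inspection of the complexity plot may still suggest a different, simpler tree.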

We will now make a plot of the complexity parameter to determine at what point to prune the tree. Pruning helps in removing unnecessary splits that do not improve the model much. Below is the code. The information in the plot is a visual of the “cptable”

``plotcp(reg.tree)``

It appears from the plot that a tree of size 2 is adequate, but this is boring. The next low dip is at a tree of size 8. Therefore, we will prune our tree to eight splits. First, we extract the cp value we want; then we use the “prune” function to make the actual modified tree.

``````cp<-reg.tree\$cptable[8,"CP"] #the cp value from row 8 of the table
pruned.reg.tree<-prune(reg.tree,cp=cp)``````

We will now plot our modified tree.

``plot(as.party(pruned.reg.tree))``

The only difference is the loss of the age node splitting at 25.5.

Model Test

We can now test our tree to see how well it performs.

``````reg.tree.test<-predict(pruned.reg.tree,newdata=test)
reg.tree.resid<-reg.tree.test-test\$lnnlinc
mean(reg.tree.resid^2)``````
``## [1] 431928030``

The number we calculated is the mean squared error. This number must be compared to models developed differently in order to assess the current model; by itself, it means little.
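One way to give a mean squared error some context is to compare it to a naive baseline that always predicts the mean. A sketch with hypothetical numbers (not the Participation data):

```r
actual    <- c(30000, 45000, 52000, 61000) # hypothetical test incomes
predicted <- c(32000, 47000, 50000, 58000) # hypothetical model output

model.mse    <- mean((predicted - actual)^2)
baseline.mse <- mean((mean(actual) - actual)^2) # always guess the mean
model.mse < baseline.mse # TRUE: the model beats the naive baseline
```

For the tree above, the analogous baseline would predict `mean(train$lnnlinc)` for every test example.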

Conclusion

This post exposed you to regression trees. This type of tree can be used to make numeric predictions for nonlinear data. However, this approach comes with a loss of information, as the uniqueness of each example is lost when it is placed in a node.

# Attendance Module Options in Moodle VIDEO

A look at the various attendance options in Moodle

# Statistical Models

In research, the term ‘model’ is employed frequently. Normally, a model is some sort of description or explanation of a real-world phenomenon. In data science, we employ statistical models, which use numbers to help us understand something that happens in the real world.


In quantitative research, we rely on numeric observations of some phenomenon, behavior, or perception. For example, let’s say we have the quiz results of 20 students, as shown below.

`32 60 95 15 43 22 45 14 48 98 79 97 49 63 50 11 26 52 39 97`

This is great information, but what if we want to go beyond how these students did and try to understand how students in the population would do on the quizzes? Doing this requires the development of a model.

A model simply tries to describe how the data is generated, in terms of whatever we are measuring, while allowing for randomness. It helps in summarizing a large collection of numbers while also providing structure to them.

One commonly used model is the normal model. This is the famous bell curve that most of us are familiar with. To fit this model, we calculate the mean and standard deviation of the data, which determine the curve’s center and spread.

Now, this model is not completely perfect. For example, a student cannot normally get a score above 100 or below 0 on a quiz. Despite this weakness, the normal distribution gives us an indication of what the population looks like.

With this, we can also calculate the probability of getting a specific score on the quiz. For example, if we want the probability that a student scores 70 or higher, a simple calculation shows that it is about 26%.
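The 26% figure can be reproduced in R using the twenty quiz scores listed above:

```r
scores <- c(32, 60, 95, 15, 43, 22, 45, 14, 48, 98,
            79, 97, 49, 63, 50, 11, 26, 52, 39, 97)

m <- mean(scores) # 51.75
s <- sd(scores)   # about 28.8

# probability of scoring 70 or higher under the fitted normal model
1 - pnorm(70, mean = m, sd = s) # about 0.26
```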

Other Options

The normal model is not the only model; there are many different models to match different types of data, such as the gamma, Student’s t, binomial, and chi-square distributions. Determining which model to use requires examining the distribution of your data and matching it to an appropriate model.

Another option is to transform the data. This is normally done to make the data conform to a normal distribution. Which transformation to employ depends on how the data looks when plotted.
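As a quick sketch of one common transformation, taking the log of right-skewed data often pulls it toward a symmetric, more normal shape:

```r
set.seed(42)
skewed <- rexp(1000)  # exponential draws: strongly right-skewed
logged <- log(skewed)

# a simple sample-skewness helper; values near 0 indicate symmetry
skewness <- function(x) mean(((x - mean(x)) / sd(x))^3)
c(before = skewness(skewed), after = skewness(logged))
```

The skewness shrinks in magnitude after the transformation, which is the point of transforming before fitting a normal model.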

Conclusion

Modeling helps to bring order to data that has been collected for analysis. By using a model such as the normal distribution, you can begin to make inferences about what the population is like. This allows you to take a handful of data to better understand the world.

# K Nearest Neighbor in R

K-nearest neighbor is one of many nonlinear algorithms that can be used in machine learning. By nonlinear, I mean that a linear combination of the features or variables is not needed in order to develop decision boundaries. This allows for the analysis of data that does not naturally meet the assumptions of linearity.

KNN is also known as a “lazy learner”. This means that there are no coefficients or parameter estimates to learn; when doing regression, we always had coefficient outputs regardless of the type of regression (ridge, lasso, elastic net, etc.). What KNN does instead is use the K nearest neighbors to give a label to an unlabeled example. Our job when using KNN is to determine the number of neighbors K that is most accurate based on the different criteria for assessing the models.
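The lazy-learning idea can be seen with the `knn` function from the “class” package (a toy sketch with made-up clusters, not the Mroz data used below):

```r
library(class)

set.seed(1)
# ten labeled points around (0, 0) and ten around (10, 10)
train.x <- rbind(matrix(rnorm(20, mean = 0),  ncol = 2),
                 matrix(rnorm(20, mean = 10), ncol = 2))
train.y <- factor(rep(c("a", "b"), each = 10))
test.x  <- rbind(c(0, 0), c(10, 10))

# nothing is fit in advance; the 3 nearest neighbors vote at prediction time
knn(train = train.x, test = test.x, cl = train.y, k = 3)
```

Changing `k` changes how many neighbors vote, which is exactly the tuning problem we face below.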

In this post, we will develop a KNN model using the “Mroz” dataset from the “Ecdat” package. Our goal is to predict if someone lives in the city based on the other predictor variables. Below is some initial code.

``library(class);library(kknn);library(caret);library(corrplot)``
`library(reshape2);library(ggplot2);library(pROC);library(Ecdat)`
``````data(Mroz)
str(Mroz)``````
``````## 'data.frame':    753 obs. of  18 variables:
##  \$ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  \$ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  \$ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  \$ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  \$ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  \$ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  \$ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  \$ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  \$ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  \$ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  \$ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  \$ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  \$ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  \$ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  \$ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  \$ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  \$ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  \$ experience: int  14 5 15 6 7 33 11 35 24 21 ...``````

We need to remove the factor variable “work” as KNN cannot use factor variables. After this, we will use the “melt” function from the “reshape2” package to look at the variables when divided by whether the example was from the city or not.

``````Mroz\$work<-NULL
mroz.melt<-melt(Mroz,id.var='city')
Mroz_plots<-ggplot(mroz.melt,aes(x=city,y=value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
Mroz_plots``````

From the plots, it appears there are no differences in how the variables act whether someone is from the city or not. This may be a flag that classification will be difficult.

We now need to scale our data otherwise the results will be inaccurate. Scaling might also help our box-plots because everything will be on the same scale rather than spread all over the place. To do this we will have to temporarily remove our outcome variable from the data set because it’s a factor and then reinsert it into the data set. Below is the code.

``````mroz.scale<-as.data.frame(scale(Mroz[,-16]))
mroz.scale\$city<-Mroz\$city``````

We will now look at our box-plots a second time but this time with scaled data.

``````mroz.scale.melt<-melt(mroz.scale,id.var="city")
mroz_plot2<-ggplot(mroz.scale.melt,aes(city,value))+geom_boxplot()+facet_wrap(~variable, ncol = 4)
mroz_plot2``````

This second plot is easier to read but there is still little indication of difference.

We can now move to checking the correlations among the variables. Below is the code.

``````mroz.cor<-cor(mroz.scale[,-17])
corrplot(mroz.cor,method = 'number')``````

There is a high correlation between husband’s age (ageh) and wife’s age (agew). Since this algorithm is non-linear this should not be a major problem.

We will now divide our dataset into training and testing sets.

``````set.seed(502)
ind=sample(2,nrow(mroz.scale),replace=T,prob=c(.7,.3))
train<-mroz.scale[ind==1,]
test<-mroz.scale[ind==2,]``````

Before creating a model we need to create a grid. We do not know the best value of k yet, so we have to run multiple models with different values of k in order to determine this. As such, we need to create a ‘grid’ using the ‘expand.grid’ function. We will also use cross-validation, through the “trainControl” function, to get a better estimate of k. The code is below.

``````grid<-expand.grid(.k=seq(2,20,by=1))
control<-trainControl(method="cv")``````

Now we make our model.

``````knn.train<-train(city~.,train,method="knn",trControl=control,tuneGrid=grid)
knn.train``````
``````## k-Nearest Neighbors
##
## 540 samples
##  16 predictors
##   2 classes: 'no', 'yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 487, 486, 486, 486, 486, 486, ...
## Resampling results across tuning parameters:
##
##   k   Accuracy   Kappa
##    2  0.6000095  0.1213920
##    3  0.6368757  0.1542968
##    4  0.6424325  0.1546494
##    5  0.6386252  0.1275248
##    6  0.6329998  0.1164253
##    7  0.6589619  0.1616377
##    8  0.6663344  0.1774391
##    9  0.6663681  0.1733197
##   10  0.6609510  0.1566064
##   11  0.6664018  0.1575868
##   12  0.6682199  0.1669053
##   13  0.6572111  0.1397222
##   14  0.6719586  0.1694953
##   15  0.6571425  0.1263937
##   16  0.6664367  0.1551023
##   17  0.6719573  0.1588789
##   18  0.6608811  0.1260452
##   19  0.6590979  0.1165734
##   20  0.6609510  0.1219624
##
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was k = 14.``````

The printed output shows that R selected k = 14 based on a combination of accuracy and the kappa statistic (we will use k = 16 below, which performed almost identically). The kappa statistic is a measurement of the accuracy of a model while taking into account chance. Note that with the “knn” function we are about to use there is no formula with the ~ sign like we have with regression. Instead, we supply a training set, a test set, a factor of training labels, and a number for k. This will make more sense when you see the code. Finally, we will use this information on our test dataset. We will then look at the table and the accuracy of the model.

``````knn.test<-knn(train[,-17],test[,-17],train[,17],k=16) #-17 removes the dependent variable 'city'
table(knn.test,test\$city)``````
``````##
## knn.test  no yes
##      no   19   8
##      yes  61 125``````
``````prob.agree<-(19+125)/213 #correctly classified cases from the table
prob.agree``````
``## [1] 0.6760563``

Accuracy is about 67%, which is consistent with what we found when determining the k. We can also calculate the kappa. This is done by calculating the probability of chance agreement and then doing some subtraction and division. We already stored the observed agreement in the variable “prob.agree”; we now need the probability that this agreement occurred by chance. Lastly, we calculate the kappa.

``````prob.chance<-((15+4)/213)*((15+65)/213)
kap<-(prob.agree-prob.chance)/(1-prob.chance)
kap``````
``## [1] 0.664827``

A kappa of .66 is actually good.

The example we just did was with unweighted k neighbors. There are times when weighted neighbors can improve accuracy. We will look at three different weighting methods. “Rectangular” is unweighted and is the one that we used. The other two are “triangular” and “epanechnikov”. How these calculate the weights is beyond the scope of this post. In the code below the argument “distance” can be set to 1 for absolute (Manhattan) distance and 2 for Euclidean distance.
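For intuition, the usual textbook forms of these kernels can be sketched in a few lines (this is an illustrative sketch, not code from the “kknn” package; see its documentation for the exact definitions). Here d is a neighbor’s distance, rescaled so that anything at distance 1 or beyond gets zero weight:

```r
# Weight functions for the three kernels mentioned above
rectangular  <- function(d) ifelse(abs(d) <= 1, 0.5, 0)              # unweighted
triangular   <- function(d) ifelse(abs(d) <= 1, 1 - abs(d), 0)       # linear decay
epanechnikov <- function(d) ifelse(abs(d) <= 1, 0.75 * (1 - d^2), 0) # quadratic decay

# A close neighbor (d = .2) gets far more weight than a distant one (d = .9)
triangular(c(.2, .9))
```

With the rectangular kernel every neighbor inside the cutoff votes equally; the other two kernels let closer neighbors count for more.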

``````kknn.train<-train.kknn(city~.,train,kmax = 25,distance = 2,kernel = c("rectangular","triangular",
"epanechnikov"))
plot(kknn.train)``````

``kknn.train``
``````##
## Call:
## train.kknn(formula = city ~ ., data = train, kmax = 25, distance = 2,     kernel = c("rectangular", "triangular", "epanechnikov"))
##
## Type of response variable: nominal
## Minimal misclassification: 0.3277778
## Best kernel: rectangular
## Best k: 14``````

If you look at the plot you can see which value of k is best by finding the lowest point on the graph, which is just before 15. The legend indicates that this point belongs to the “rectangular” estimate, which is the same as unweighted. This means that the best classification is unweighted with a k of 14. Although this is a different value of k than the one we used on the test set, the misclassification rate is about the same.

Conclusion

In this post, we explored both weighted and unweighted KNN. This algorithm allows you to deal with data that does not meet the assumptions of regression by avoiding the need for parameters. However, because few numbers are attached to the results beyond accuracy, it can be difficult to explain to people what is happening inside the model. As such, perhaps the biggest drawback of KNN is communicating the results.

# Common Speech Functions

Functions of speech are different ways of communicating. The differences among the speech functions have to do with the intention of the communication; a different intention or goal leads to the use of a different function of speech. There are many different functions of speech, but we will look at the six that are listed below.

• Referential
• Directive
• Expressive
• Phatic
• Poetic
• Metalinguistic

Referential

Referential speech provides information. For example, a person might share the time with someone (“It’s five o’clock” ). Referential speech can often provide information to a question (“what time is it?”).

Directive

Directives are commands that try to get someone to do something. Examples include “turn left” or “sit down”. The context of a directive is one in which something needs to or should be done. As such, one person tries to make one or more other people do something. Even children direct directives at their parents (“give me the ball”).

Expressive

Expressive speech shares a person’s feelings. An example would be “I feel happy today!”. Expressive communication can at times provide clear evidence of how someone is doing.

Phatic

Phatic speech is closely related to expressive speech. However, the main difference is that phatic speech is focused on the well-being of others while expressive speech focuses on the feelings of the person speaking.

An example of phatic speech is saying “how are you?”. This is clearly a question but it is focusing on how the other person is doing. Another phrase might be “I hope you get well soon.” Again the focus is on the welfare of someone else.

Poetic

Poetic speech is speech that is highly aesthetic. Songs and poetry are examples of language that is poetic in nature. An example would be the famous nursery rhyme “Roses are red, violets are blue…”. Poetic speech often has a powerful emotional effect as well.

Metalinguistic

Metalinguistic speech is communication about language. For example, this entire blog post would be considered by many to be metalinguistic because I am talking about language and not really using language as described in the other functions of speech.

Exceptions

There are many more categories than the ones presented. In addition, the categories presented are not mutually exclusive. Many phrases can be correctly classified into several different categories. For example, if someone says “I love you” you could argue that it’s expressive, poetic, or even phatic. What is missing is the context in which such a statement is made.

Conclusion

The ways in which we communicate have been briefly explained here. Understanding how people communicate will help us to better understand those around us and improve our own style of communicating.

# Attendance Module in Moodle VIDEO

How to setup the attendance module in Moodle

# Elastic Net Regression in R

Elastic net is a combination of ridge and lasso regression. What is most unusual about elastic net is that it has two tuning parameters (alpha and lambda) while lasso and ridge regression each have only one.
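To see how the two parameters interact, the penalty that “glmnet” adds to the model’s error can be sketched as a small function (a sketch based on the penalty form documented for the package; the function name is my own):

```r
# Elastic net penalty: alpha mixes the ridge (squared) and lasso (absolute)
# penalties, while lambda controls the overall strength of the shrinkage.
enet_penalty <- function(beta, lambda, alpha) {
  lambda * ((1 - alpha) / 2 * sum(beta^2) + alpha * sum(abs(beta)))
}

enet_penalty(c(1, -2), lambda = 0.2, alpha = 0)    # pure ridge
enet_penalty(c(1, -2), lambda = 0.2, alpha = 1)    # pure lasso
enet_penalty(c(1, -2), lambda = 0.2, alpha = 0.5)  # elastic net mixture
```

Setting alpha to 0 gives ridge regression, alpha of 1 gives lasso, and anything in between is an elastic net.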

In this post, we will go through an example of the use of elastic net using the “VietNamI” dataset from the “Ecdat” package. Our goal is to predict how many days a person is ill based on the other variables in the dataset. Below is some initial code for our analysis.

``library(Ecdat);library(corrplot);library(caret);library(glmnet)``
``````data("VietNamI")
str(VietNamI)``````
``````## 'data.frame':    27765 obs. of  12 variables:
##  \$ pharvis  : num  0 0 0 1 1 0 0 0 2 3 ...
##  \$ lnhhexp  : num  2.73 2.74 2.27 2.39 3.11 ...
##  \$ age      : num  3.76 2.94 2.56 3.64 3.3 ...
##  \$ sex      : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 2 ...
##  \$ married  : num  1 0 0 1 1 1 1 0 1 1 ...
##  \$ educ     : num  2 0 4 3 3 9 2 5 2 0 ...
##  \$ illness  : num  1 1 0 1 1 0 0 0 2 1 ...
##  \$ injury   : num  0 0 0 0 0 0 0 0 0 0 ...
##  \$ illdays  : num  7 4 0 3 10 0 0 0 4 7 ...
##  \$ actdays  : num  0 0 0 0 0 0 0 0 0 0 ...
##  \$ insurance: num  0 0 1 1 0 1 1 1 0 0 ...
##  \$ commune  : num  192 167 76 123 148 20 40 57 49 170 ...
##  - attr(*, "na.action")=Class 'omit'  Named int 27734
##   .. ..- attr(*, "names")= chr "27734"``````

We need to check the correlations among the variables. We need to exclude the “sex” variable as it is categorical. Code is below.

``````p.cor<-cor(VietNamI[,-4])
corrplot.mixed(p.cor)``````

No major problems with correlations. Next, we set up our training and testing datasets. We need to remove the variable “commune” because it adds no value to our results. In addition, to reduce the computational time we will only use the first 1000 rows from the data set.

``````VietNamI\$commune<-NULL
VietNamI_reduced<-VietNamI[1:1000,]
ind<-sample(2,nrow(VietNamI_reduced),replace=T,prob = c(0.7,0.3))
train<-VietNamI_reduced[ind==1,]
test<-VietNamI_reduced[ind==2,]``````

We need to create a grid that will allow us to investigate different models with different combinations of alpha and lambda. This is done using the “expand.grid” function in combination with the “seq” function. Below is the code.

``grid<-expand.grid(.alpha=seq(0,1,by=.5),.lambda=seq(0,0.2,by=.1))``
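As a quick sanity check, “expand.grid” crosses every alpha with every lambda, so three values of each yield nine candidate models, which is exactly how many rows appear in the resampling results further below:

```r
# Recreate the grid and confirm the number of candidate models
grid <- expand.grid(.alpha = seq(0, 1, by = .5), .lambda = seq(0, 0.2, by = .1))
nrow(grid)  # 9 alpha/lambda combinations
grid
```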

We also need to set the resampling method, which allows us to assess the validity of our model. This is done using the “trainControl” function from the “caret” package. In the code below “LOOCV” stands for “leave one out cross-validation”.

``control<-trainControl(method = "LOOCV")``
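To make the idea concrete, here is a minimal hand-rolled sketch of LOOCV using “lm” and the built-in “mtcars” data (caret automates this for any model; the function name here is my own):

```r
# Leave-one-out cross-validation: fit n models, each leaving out one
# observation, and average the squared prediction errors
loocv_mse <- function(df, formula, outcome) {
  errs <- sapply(seq_len(nrow(df)), function(i) {
    fit <- lm(formula, data = df[-i, ])      # fit without observation i
    pred <- predict(fit, newdata = df[i, ])  # predict the held-out observation
    (df[[outcome]][i] - pred)^2
  })
  mean(errs)
}

loocv_mse(mtcars, mpg ~ wt, "mpg")
```

LOOCV is thorough but expensive, since it refits the model once per observation.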

We are now ready to develop our model. The code is mostly self-explanatory. This initial model will help us to determine the appropriate values for the alpha and lambda parameters.

``````enet.train<-train(illdays~.,train,method="glmnet",trControl=control,tuneGrid=grid)
enet.train``````
``````## glmnet
##
## 694 samples
##  10 predictors
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 693, 693, 693, 693, 693, 693, ...
## Resampling results across tuning parameters:
##
##   alpha  lambda  RMSE      Rsquared
##   0.0    0.0     5.229759  0.2968354
##   0.0    0.1     5.229759  0.2968354
##   0.0    0.2     5.229759  0.2968354
##   0.5    0.0     5.243919  0.2954226
##   0.5    0.1     5.225067  0.2985989
##   0.5    0.2     5.200415  0.3038821
##   1.0    0.0     5.244020  0.2954519
##   1.0    0.1     5.203973  0.3033173
##   1.0    0.2     5.182120  0.3083819
##
## RMSE was used to select the optimal model using  the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.2.``````

The output lists all the alpha and lambda combinations that we set in the “grid” variable and tells us which combination was best: alpha = 1 and lambda = 0.2, which is actually a pure lasso. For our purposes, we will use an alpha of .5 and a lambda of .2, which performed nearly as well while remaining a true elastic net mixture. The r-square is also included.

We will now set our model and run it on the test set. We have to convert the “sex” variable to a dummy variable for the “glmnet” function. We next have to make two matrices, one for the predictor variables and one for our outcome variable “illdays”.
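To see what this conversion produces, here is the “model.matrix” trick on a toy factor:

```r
# model.matrix(~ sex - 1) produces one 0/1 indicator column per factor level
# (the -1 drops the intercept, so every level gets its own column)
sex <- factor(c("male", "female", "male"))
model.matrix(~ sex - 1)
#   sexfemale sexmale
# 1         0       1
# 2         1       0
# 3         0       1
```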

``````train\$sex<-model.matrix( ~ sex - 1, data=train ) #convert to dummy variable
test\$sex<-model.matrix( ~ sex - 1, data=test )
predictor_variables<-as.matrix(train[,-9])
days_ill<-as.matrix(train\$illdays)
enet<-glmnet(predictor_variables,days_ill,family = "gaussian",alpha = 0.5,lambda = .2)``````

We can now look at the specific coefficients by using the “coef” function.

``````enet.coef<-coef(enet,lambda=.2,alpha=.5,exact=T)
enet.coef``````
``````## 12 x 1 sparse Matrix of class "dgCMatrix"
##                         s0
## (Intercept)   -1.304263895
## pharvis        0.532353361
## lnhhexp       -0.064754000
## age            0.760864404
## sex.sexfemale  0.029612290
## sex.sexmale   -0.002617404
## married        0.318639271
## educ           .
## illness        3.103047473
## injury         .
## actdays        0.314851347
## insurance      .``````

You can see for yourself that several variables were removed from the model. Education (educ), injury, and insurance do not play a role in the number of days ill for an individual in Vietnam.

With our model developed, we can now test it using the “predict” function. However, we first need to convert our test dataframe into a matrix and remove the outcome variable from it.

``````test.matrix<-as.matrix(test[,-9])
enet.y<-predict(enet, newx = test.matrix, type = "response", lambda=.2,alpha=.5)``````

Let’s plot our results

``plot(enet.y)``

This does not look good. Let’s check the mean squared error.

``````enet.resid<-enet.y-test\$illdays
mean(enet.resid^2)``````
``## [1] 20.18134``

We will now do a cross-validation of our model. We need to set the seed and then use the “cv.glmnet” to develop the cross-validated model. We can see the model by plotting it.

``````set.seed(317)
enet.cv<-cv.glmnet(predictor_variables,days_ill,alpha=.5)
plot(enet.cv)``````

You can see that as the number of features is reduced (see the numbers on the top of the plot) the MSE increases (y-axis). In addition, as lambda increases, there is also an increase in the error but only when the number of variables is reduced as well.

The dotted vertical lines in the plot represent the minimum MSE for a set lambda (on the left) and the one standard error from the minimum (on the right). You can extract these two lambda values using the code below.

``enet.cv\$lambda.min``
``## [1] 0.3082347``
``enet.cv\$lambda.1se``
``## [1] 2.874607``

We can see the coefficients for a lambda that is one standard error away by using the code below. This will give us an alternative idea for what to set the model parameters to when we want to predict.

``coef(enet.cv,s="lambda.1se")``
``````## 12 x 1 sparse Matrix of class "dgCMatrix"
##                      1
## (Intercept)   2.34116947
## pharvis       0.003710399
## lnhhexp       .
## age           .
## sex.sexfemale .
## sex.sexmale   .
## married       .
## educ          .
## illness       1.817479480
## injury        .
## actdays       .
## insurance     .``````

Using the one standard error lambda we lose most of our features. We can now see if the model improves by rerunning it with this information.

``````enet.y.cv<-predict(enet.cv,newx = test.matrix,type='response',s="lambda.1se")
enet.cv.resid<-enet.y.cv-test\$illdays
mean(enet.cv.resid^2)``````
``## [1] 25.47966``

The error is actually slightly worse (25.48 vs. 20.18). Our model is a mess, but this post served as an example of how to conduct an analysis using elastic net regression.

# Exploratory Data Analysis

In data science, exploratory data analysis serves the purpose of assessing whether the data set that you have is suitable for answering the research questions of the project. As such, there are several steps that can be taken to make this process more efficient.

Therefore, the purpose of this post is to explain one process that can be used for exploratory data analysis. The steps include the following.

• Make sure the research questions are answerable
• Check the structure of the dataset
• Use visuals

Research Questions

Research questions give a project a sense of direction. They help you to know what you want to know. In addition, research questions help you to determine what type of analysis to conduct as well.

During the data exploration stage, the purpose of a research question is not analysis but rather to determine whether your data can actually provide answers to the questions. For example, if you want to know what the average height of men in America is and your data tells you the salary of office workers, there is a problem. Your question (average height) cannot be answered with the current data that you have (office workers’ salaries).

As such, the research questions need to be answerable and specific before moving forward. By answerable, we mean that the data can provide the solution. By specific, we mean a question that moves away from generalities and deals with a clearly defined phenomenon. For example, “what is the average height of males age 20-30 in the United States?” This question clearly identifies what we want to know (average height) and among whom (20-30 year old male Americans).

Not only can you confirm whether your questions are answerable, you can also decide if you need to be more or less specific with your questions. Returning to our average height question, we may find that we can be more specific and check average height by state if we want. Or, we might learn that we can only determine the average height for a region. All this depends on the type of data we have.

Check the Structure

Checking the structure involves determining how many rows and columns are in the dataset (the sample size), as well as looking for missing data and erroneous data. Data sets in data science almost always need some sort of cleaning or data wrangling before analysis, and checking the structure helps to determine what needs to be done.

You should have a priori expectations for the structure of the data set. If the stakeholders tell you that there should be several million rows in the data set and you check and there are only several thousand you know there is a problem. This concept also applies to the number of features you expect as well.
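In R, a few one-liners cover most of these structural checks. The sketch below uses the built-in “airquality” data as a stand-in for your own dataset:

```r
# Quick structural checks on a data frame
dim(airquality)             # number of rows (sample size) and columns
str(airquality)             # type of each variable
colSums(is.na(airquality))  # missing values per column
summary(airquality)         # ranges, which helps spot erroneous values
```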

Make Visuals

Visuals, which can be plots or tables, help you further develop your expectations as well as look for deviations or outliers. Tables are an excellent source for summarizing data. Plots, on the other hand, allow you to see deviations from your expectations in the data.

What kind of tables and plots to make depends heavily on the type of data as well as the type of questions that you have. For example, for descriptive questions, tables of summary statistics with bar plots might be sufficient. For comparison questions, summary stats and boxplots may be enough. For relationship questions, summary stat tables with a scatterplot may be enough. Please keep in mind that in practice it is much more complicated than this.
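As a base-R sketch of these options (again using built-in data as a stand-in for your own):

```r
# One visual per question type from the paragraph above
summary(airquality$Temp)                  # summary statistics (descriptive)
boxplot(Temp ~ Month, data = airquality)  # comparison across groups
plot(airquality$Temp, airquality$Ozone)   # relationship between two variables
```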

Conclusion

Before questions can be answered the data needs to be explored. This will help to make sure that the potential answers that are developed are appropriate.

# Accommodation Theory in Language Communication

Often when people communicate, they will make a subconscious or even a conscious decision to adjust their speech so that it is more or less like the speech of the people they are talking to. This is known as accommodation.

In this post, we will look at the following concepts related to accommodation

• Speech convergence
• Speech divergence

Speech Convergence

Speech convergence is when people’s speech starts to sound similar. Often, this is a sign that the speakers are being polite to each other, like each other, or that one speaker has an interest in pleasing the other.

Speech convergence is not only for social reasons. Another reason that a person will modify their speech is for the sake of removing technical jargon when dealing with people who are not familiar with it. For example, when a mechanic speaks to a doctor about what is wrong with their car or when a medical doctor speaks to a patient about the patient’s health. The modification happens so that the other person can understand.

Speech convergence can be overdone in terms of the perceptions of the hearers. For example, if a foreigner sounds too much like a native it can raise suspicion. Furthermore, over-convergence can be perceived as insulting or as making fun of others. As such, maintaining some difference is probably wise.

Speech Divergence

Speech divergence happens when people deliberately choose not to mirror each other’s speaking styles. The message that is sent is that the people communicating do not want to accommodate, do not wish to seem polite, or perhaps do not like the people they are communicating with.

Examples of this often involve minority groups who desire to maintain their own cultural identity. Such a group will use their language judiciously, especially around the local dominant culture, as a sign of independence.

Accent divergence is also possible. For example, two people from the same country but different socioeconomic standings may deliberately choose to maintain their specific style of communication to indicate the differences between them.

Conclusion

Convergence and divergence in communication can send many different messages. It is difficult to predict how people will respond when a speaker converges toward or diverges from their speaking style. However, the main motivation for accommodation appears to be how such behavior benefits the communicator.

# Forum Options in Moodle VIDEO

Options for forums in Moodle

# Data Science Research Questions

Developing research questions is an absolute necessity in completing any research project. The questions you ask help to shape the type of analysis that you need to conduct.

The type of questions you ask in the context of analytics and data science are similar to those found in traditional quantitative research. Yet data science, like any other field, has its own distinct traits.

In this post, we will look at six different types of questions that are used frequently in the context of the field of data science (exploratory and inferential questions are closely linked, so they share an entry). The six questions are…

1. Descriptive
2. Exploratory/Inferential
3. Predictive
4. Causal
5. Mechanistic

Understanding the types of question that can be asked will help anyone involved in data science to determine what exactly it is that they want to know.

Descriptive

A descriptive question seeks to describe a characteristic of the dataset. For example, if I collect the GPA of 100 university students I may want to know the average GPA of the students. Seeking the average is one example of a descriptive question.

With descriptive questions, there is no need for a hypothesis as you are not trying to infer, establish a relationship, or generalize to a broader context. You simply want to know a trait of the dataset.

Exploratory/Inferential

Exploratory questions seek to identify things that may be “interesting” in the dataset. Examples of things that may be interesting include trends, patterns, and or relationships among variables.

Exploratory questions generate hypotheses. This means that they lead to something that may be more formally questioned and tested. For example, if you have GPA and hours of sleep for university students, you may explore the potential that there is a relationship between these two variables.

Inferential questions are an extension of exploratory questions. What this means is that the exploratory question is formally tested by developing an inferential question. Often, the difference between an exploratory and inferential question is the following

1. Exploratory questions are usually developed first
2. Exploratory questions generate inferential questions
3. Inferential questions are tested often on a different dataset from exploratory questions

In our example, if we find a relationship between GPA and sleep in our dataset, we may test this relationship in a different, perhaps larger dataset. If the relationship holds, we can then generalize it to the population of the study.

Predictive

Predictive questions ask whether a set of variables can be used to predict the value of an outcome for new observations. Unlike inferential questions, the focus is on how accurately the outcome is predicted rather than on explaining the relationship. Continuing the example, we might use hours of sleep along with other features to predict a student’s GPA without claiming that any one feature explains it.

Causal

Causal questions address whether a change in one variable directly affects another. In analytics, A/B testing is one form of data collection that can be used to answer causal questions. For example, we may develop two versions of a website and see which one generates more sales.

In this example, the type of website is the independent variable and sales is the dependent variable. By controlling the type of website people see we can see if this affects sales.

Mechanistic

Mechanistic questions deal with how one variable affects another. This is different from causal questions, which focus on whether one variable affects another. Continuing with the website example, we may take a closer look at the two different websites and see what it was about them that made one more successful in generating sales. It may be that one had more banners than another or fewer pictures. Perhaps there were different products offered on the home page.

All of these different features, of course, require data that helps to explain what is happening. This leads to an important point: the questions that can be asked are limited by the available data. You can’t answer a question when you lack the data that would answer it.

Conclusion

Answering questions is essentially what research is about. In order to do this, you have to know what your questions are. This information will help you to decide on the analysis you wish to conduct. Familiarity with the types of research questions that are common in data science can help you to approach and complete an analysis much faster than when this is unclear.

# Lasso Regression in R

In this post, we will conduct an analysis using lasso regression. Remember that lasso regression can actually eliminate variables by shrinking their coefficients all the way to zero through the shrinkage penalty.
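The mechanism behind this elimination can be illustrated with the soft-thresholding operator, which shifts every coefficient toward zero and cuts small ones off exactly at zero (an illustrative sketch of the idea underlying the lasso penalty; glmnet’s coordinate descent builds on this operation):

```r
# Soft-thresholding: shrink each value toward zero by lambda, and set
# anything that would cross zero exactly to zero
soft_threshold <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)

soft_threshold(c(-3, -0.5, 0.2, 2), lambda = 1)  # -2, 0, 0, 1
```

Notice that the two small coefficients are eliminated entirely rather than merely shrunk, which is what distinguishes lasso from ridge regression.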

We will use the dataset “nlschools” from the “MASS” package to conduct our analysis. We want to see if we can predict language test scores “lang” with the other available variables. Below is some initial code to begin the analysis.

``library(MASS);library(corrplot);library(glmnet)``
``````data("nlschools")
str(nlschools)``````
``````## 'data.frame':    2287 obs. of  6 variables:
##  \$ lang : int  46 45 33 46 20 30 30 57 36 36 ...
##  \$ IQ   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
##  \$ class: Factor w/ 133 levels "180","280","1082",..: 1 1 1 1 1 1 1 1 1 1 ...
##  \$ GS   : int  29 29 29 29 29 29 29 29 29 29 ...
##  \$ SES  : int  23 10 15 23 10 10 23 10 13 15 ...
##  \$ COMB : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...``````

We need to remove the “class” variable as it is used as an identifier and provides no useful data. After this, we can check the correlations among the variables. Below is the code for this.

``````nlschools\$class<-NULL
p.cor<-cor(nlschools[,-5])
corrplot.mixed(p.cor)``````

No problems with collinearity. We will now set up our training and testing sets.

``````ind<-sample(2,nrow(nlschools),replace=T,prob = c(0.7,0.3))
train<-nlschools[ind==1,]
test<-nlschools[ind==2,]``````

Remember that the ‘glmnet’ function does not like factor variables, so we need to convert our “COMB” variable to a dummy variable. In addition, the “glmnet” function does not like data frames, so we need to make two matrices. The first will include the predictor variables and the second will include only the outcome variable. Below is the code.

``````train\$COMB<-model.matrix( ~ COMB - 1, data=train ) #convert to dummy variable
test\$COMB<-model.matrix( ~ COMB - 1, data=test )
predictor_variables<-as.matrix(train[,2:4])
language_score<-as.matrix(train\$lang)``````

We can now run our model. We place both matrices inside the “glmnet” function. The family is set to “gaussian” because our outcome variable is continuous. The “alpha” is set to 1 as this indicates that we are using lasso regression.

``lasso<-glmnet(predictor_variables,language_score,family="gaussian",alpha=1)``

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

• Df = number of variables included in the model (this is always the same number in a ridge model)
• %Dev = Percent of deviance explained. The higher the better
• Lambda = The lambda used to obtain the %Dev

When you use the “print” function for a lasso model it will print up to 100 different models. Fewer models are printed if the percent of deviance stops improving; 100 is the default stopping point. In the code below I use the “print” function but show only the first 5 and last 5 models in order to reduce the size of the printout. Fortunately, it only took 60 models to converge.

``print(lasso)``
``````##
## Call:  glmnet(x = predictor_variables, y = language_score, family = "gaussian",      alpha = 1)
##
##       Df    %Dev  Lambda
##  [1,]  0 0.00000 5.47100
##  [2,]  1 0.06194 4.98500
##  [3,]  1 0.11340 4.54200
##  [4,]  1 0.15610 4.13900
##  [5,]  1 0.19150 3.77100
............................
## [55,]  3 0.39890 0.03599
## [56,]  3 0.39900 0.03280
## [57,]  3 0.39900 0.02988
## [58,]  3 0.39900 0.02723
## [59,]  3 0.39900 0.02481
## [60,]  3 0.39900 0.02261``````

The results from the “print” function will allow us to set the lambda for the “test” dataset. Based on the results we can set the lambda at 0.02 because this explains the highest amount of deviance at .399.

The plot below shows lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The numbers next to the coefficient lines identify particular variables as their coefficients change across different lambda values. Each number corresponds to a variable’s column position, going from left to right, in the predictor matrix. For example, 1 in the plot refers to “IQ”, 2 refers to “GS”, etc.

``plot(lasso,xvar="lambda",label=T)``

As you can see, as lambda increases the coefficients decrease in value. This is how regularized regression works. However, unlike ridge regression, which never reduces a coefficient to zero, lasso regression does. For example, coefficient 3 (the SES variable) and coefficient 2 (the GS variable) are reduced to zero when lambda is near 1.

You can also look at the coefficient values at a specific lambda value. The values are unstandardized and are used to determine the final model selection. In the code below the lambda is set to .02 and we use the “coef” function to see the results.

``````lasso.coef<-coef(lasso,s=.02,exact = T)
lasso.coef``````
``````## 4 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept)  9.35736325
## IQ           2.34973922
## GS          -0.02766978
## SES          0.16150542``````

Results indicate that for a 1 unit increase in IQ there is a 2.35 point increase in language score. When GS (class size) goes up 1 unit there is a .03 point decrease in language score. Finally, when SES (socioeconomic status) increases 1 unit, language score improves .16 points.

The second plot shows the deviance explained on the x-axis and the coefficients of the predictor variables on the y-axis. Below is the code.

``plot(lasso,xvar='dev',label=T)``

If you look carefully, you can see that the two plots are mirror images of each other. Increasing lambda causes a decrease in the coefficients, while increasing the fraction of deviance explained leads to an increase in the coefficients. You may remember seeing this when we used the “print” function: as lambda became smaller there was an increase in the deviance explained.

Now, we will assess our model using the test data. We need to convert the test dataset to a matrix. Then we will use the “predict” function while setting our lambda to .02. Lastly, we will plot the results. Below is the code.

``````test.matrix<-as.matrix(test[,2:4])
lasso.y<-predict(lasso,newx = test.matrix,type = 'response',s=.02)
plot(lasso.y,test\$lang)``````

The visual looks promising. The last thing we need to do is calculate the mean squared error. By itself this number does not mean much. However, it provides a benchmark for comparing our current model with any other models that we may develop. Below is the code

``````lasso.resid<-lasso.y-test\$lang
mean(lasso.resid^2)``````
``## [1] 46.74314``

Knowing this number, we can, if we wanted, develop other models using other methods of analysis to try to reduce it. Generally, the lower the error the better while keeping in mind the complexity of the model.

# Theories on Language Change in Groups

As people interact with each other, changes naturally arise in how communication takes place. Fortunately, there are several views that attempt to explain these changes in a systematic way. In general, there are at least 3 viewpoints on how language changes. These viewpoints are

• Group to group
• Style to style
• Word to word

In this post, we will look at each of these viewpoints on language change.

Group to Group

The group to group hypothesis sees language change like a wave in a lake. The change originates from one or more groups and slowly spreads to other groups. This happens because different groups interact with each other. Furthermore, many people are members of more than one group and bring the language they use in one group to another.

Style to Style

The style to style hypothesis suggests that language changes as there are shifts between language styles, for example, from a formal way of speaking to a colloquial way of speaking and vice versa.

A language change that is seen as prestigious usually comes from a higher, more affluent section of society. Of course, the opposite is also true, and un-prestigious language change comes from the least fortunate.

The style of a speaker also changes over time. The younger the person is the more they use vernacular and slang in general.

Word to Word

There are times in which individual words will change within a language and this change will spread to other languages. This is known as lexical diffusion.

Such a change can take decades and even centuries to take place. It is also common when two languages interact and mutually change each other’s pronunciation, such as happened through the role of French in England for several centuries.

Conclusion

It is not so much that any of the examples discussed here are exclusively responsible for change. Rather, all of these examples play varying roles in influencing changes in a language.

# Q&A Forums in Moodle VIDEO

Creating Q&A forums in Moodle

# Ridge Regression in R

In this post, we will conduct an analysis using ridge regression. Ridge regression is a type of regularized regression. By applying a shrinkage penalty, we are able to reduce the coefficients of many variables almost to zero while still retaining them in the model. This allows us to develop models that have many more variables in them compared to models using best subset or stepwise regression.

In the example used in this post, we will use the “SAheart” dataset from the “ElemStatLearn” package. We want to predict systolic blood pressure (sbp) using all of the other variables available as predictors. Below is some initial code that we need to begin.

``````library(ElemStatLearn);library(car);library(corrplot)
library(leaps);library(glmnet);library(caret)``````
``````data(SAheart)
str(SAheart)``````
``````## 'data.frame':    462 obs. of  10 variables:
##  \$ sbp      : int  160 144 118 170 134 132 142 114 114 132 ...
##  \$ tobacco  : num  12 0.01 0.08 7.5 13.6 6.2 4.05 4.08 0 0 ...
##  \$ ldl      : num  5.73 4.41 3.48 6.41 3.5 6.47 3.38 4.59 3.83 5.8 ...
##  \$ adiposity: num  23.1 28.6 32.3 38 27.8 ...
##  \$ famhist  : Factor w/ 2 levels "Absent","Present": 2 1 2 2 2 2 1 2 2 2 ...
##  \$ typea    : int  49 55 52 51 60 62 59 62 49 69 ...
##  \$ obesity  : num  25.3 28.9 29.1 32 26 ...
##  \$ alcohol  : num  97.2 2.06 3.81 24.26 57.34 ...
##  \$ age      : int  52 63 46 58 49 45 38 58 29 53 ...
##  \$ chd      : int  1 1 0 1 1 0 0 1 0 1 ...``````

A look at the object using the “str” function indicates that one variable, “famhist”, is a factor variable. The “glmnet” function that does the ridge regression analysis cannot handle factors, so we need to convert this to a dummy variable. However, there are two things we need to do before this. First, we need to check the correlations to make sure there are no major issues with multicollinearity. Second, we need to create our training and testing data sets. Below is the code for the correlation plot.

``````p.cor<-cor(SAheart[,-5])
corrplot.mixed(p.cor)``````

First we created a variable called “p.cor”. The -5 in brackets means we removed the 5th column from the “SAheart” data set, which is the factor variable “famhist”. The correlation plot indicates that there is one strong relationship, between adiposity and obesity. However, one common cut-off for collinearity is 0.8, and this value is 0.72, which is not a problem.

We will now create our training and testing sets and convert “famhist” to a dummy variable.

``````ind<-sample(2,nrow(SAheart),replace=T,prob = c(0.7,0.3))
train<-SAheart[ind==1,]
test<-SAheart[ind==2,]
train\$famhist<-model.matrix( ~ famhist - 1, data=train ) #convert to dummy variable
test\$famhist<-model.matrix( ~ famhist - 1, data=test )``````

We are still not done preparing our data yet. “glmnet” cannot use data frames, instead, it can only use matrices. Therefore, we now need to convert our data frames to matrices. We have to create two matrices, one with all of the predictor variables and a second with the outcome variable of blood pressure. Below is the code

``````predictor_variables<-as.matrix(train[,2:10])
blood_pressure<-as.matrix(train\$sbp)``````

We are now ready to create our model. We use the “glmnet” function and insert our two matrices. The family is set to Gaussian because “blood pressure” is a continuous variable. Alpha is set to 0 as this indicates ridge regression. Below is the code

``ridge<-glmnet(predictor_variables,blood_pressure,family = 'gaussian',alpha = 0)``

Now we need to look at the results using the “print” function. This function prints a lot of information as explained below.

•  Df = number of variables included in the model (this is always the same number in a ridge model)
•  %Dev = percent of deviance explained. The higher the better
• Lambda = the lambda used to attain that %Dev

When you use the “print” function for a ridge model it will print up to 100 different models. Fewer models are possible if the percent of deviance stops improving. 100 is the default stopping point. In the code below we have the “print” function. However, I have only printed the first 5 and last 5 models in order to save space.

``print(ridge)``
``````##
## Call:  glmnet(x = predictor_variables, y = blood_pressure, family = "gaussian",      alpha = 0)
##
##        Df      %Dev    Lambda
##   [1,] 10 7.622e-37 7716.0000
##   [2,] 10 2.135e-03 7030.0000
##   [3,] 10 2.341e-03 6406.0000
##   [4,] 10 2.566e-03 5837.0000
##   [5,] 10 2.812e-03 5318.0000
................................
##  [95,] 10 1.690e-01    1.2290
##  [96,] 10 1.691e-01    1.1190
##  [97,] 10 1.692e-01    1.0200
##  [98,] 10 1.693e-01    0.9293
##  [99,] 10 1.693e-01    0.8468
## [100,] 10 1.694e-01    0.7716``````

The results from the “print” function are useful for setting the lambda for the “test” dataset. Based on the results we can set the lambda at 0.83, since by that point the percent of deviance explained has leveled off at about .17.
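
An alternative to scanning the “print” output by hand is to let cross-validation pick lambda via the “cv.glmnet” function. The sketch below is an illustration and not part of the original analysis; it uses made-up matrices standing in for “predictor_variables” and “blood_pressure”.

```r
# Hedged sketch: choosing lambda by 10-fold cross-validation with cv.glmnet.
# The toy x and y below stand in for the predictor and outcome matrices.
library(glmnet)
set.seed(123)
x <- matrix(rnorm(100 * 5), nrow = 100)      # 100 rows, 5 toy predictors
y <- x %*% c(2, -1, 0, 0.5, 0) + rnorm(100)  # toy outcome
cv.ridge <- cv.glmnet(x, y, alpha = 0)       # alpha = 0 keeps it a ridge model
cv.ridge$lambda.min                          # lambda with the lowest CV error
```

The resulting “lambda.min” (or the more conservative “lambda.1se”) can then be passed as the “s” argument to “coef” and “predict” instead of a hand-picked value.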

The plot below shows us lambda on the x-axis and the coefficients of the predictor variables on the y-axis. The number next to each coefficient line identifies the variable it belongs to, following the left-to-right order of the columns in the data-frame/matrix (which you can check with the “View” function). For example, 1 in the plot refers to “tobacco”, 2 refers to “ldl”, etc. Across the top of the plot is the number of variables used in the model. Remember, this number never changes when doing ridge regression.

``plot(ridge,xvar="lambda",label=T)``

As you can see, as lambda increases the coefficients decrease in value. This is how ridge regression works, yet no coefficient ever reaches absolute 0.

You can also look at the coefficient values at a specific lambda value. The values are unstandardized, but they provide useful insight when determining final model selection. In the code below the lambda is set to .83 and we use the “coef” function to see the coefficients

``````ridge.coef<-coef(ridge,s=.83,exact = T)
ridge.coef``````
``````## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                   1
## (Intercept)            105.69379942
## tobacco                 -0.25990747
## ldl                     -0.13075557
## famhist.famhistAbsent    0.42532887
## famhist.famhistPresent  -0.40000846
## typea                   -0.01799031
## obesity                  0.29899976
## alcohol                  0.03648850
## age                      0.43555450
## chd                     -0.26539180``````

The second plot shows us the deviance explained on the x-axis and the coefficients of the predictor variables on the y-axis. Below is the code

``plot(ridge,xvar='dev',label=T)``

The two plots are mirror images of each other. Increasing lambda causes a decrease in the coefficients, while increasing the fraction of deviance explained leads to an increase in the coefficients. You can also see this in the output of the “print” function. As lambda became smaller there was an increase in the deviance explained.

We can now begin testing our model on the test data set. We need to convert the test dataset to a matrix and then use the “predict” function while setting our lambda to .83 (remember, a lambda of .83 explained most of the deviance). Lastly, we will plot the results. Below is the code.

``````test.matrix<-as.matrix(test[,2:10])
ridge.y<-predict(ridge,newx = test.matrix,type = 'response',s=.83)
plot(ridge.y,test\$sbp)``````

The last thing we need to do is calculate the mean squared error. By itself this number is not very informative. However, it provides a benchmark for comparing the current model with any other models you may develop. Below is the code

``````ridge.resid<-ridge.y-test\$sbp
mean(ridge.resid^2)``````
``## [1] 372.4431``

Knowing this number, we can develop other models using other methods of analysis to try to reduce it as much as possible.

# Regularized Linear Regression

Traditional linear regression has been a tried and true model for making predictions for decades. However, with the growth of Big Data and datasets with hundreds of variables, problems have begun to arise. For example, using the stepwise or best subset method with regression could take hours if not days to converge on even some of the best computers.

To deal with this problem, regularized regression has been developed to help to determine which features or variables to keep when developing models from large datasets with a huge number of variables. In this post, we will look at the following concepts

• Definition of regularized regression
• Ridge regression
• Lasso regression
• Elastic net regression

Regularization

Regularization involves the use of a shrinkage penalty that is added to the residual sum of squares (RSS) in order to shrink the coefficients of the model. This is done by selecting a value for a tuning parameter called “lambda”. Tuning parameters are used in machine learning algorithms to control the behavior of the models that are developed.

The lambda is multiplied by the penalty on the normalized coefficients of the model and added to the RSS, so the quantity being minimized is RSS + λ × (penalty on the normalized coefficients).

The benefits of regularization are at least three-fold. First, regularization is highly computationally efficient. Instead of evaluating the enormous number of candidate models that best subset selection requires (with 50 variables there are 2^50 possible subsets!), with regularization only one model is developed for each value of lambda you specify.

Second, regularization helps to deal with the bias-variance headache of model development. When small changes are made to data, such as switching from the training to testing data, there can be wild changes in the estimates. Regularization can often smooth this problem out substantially.

Finally, regularization can help to reduce or eliminate any multicollinearity in a model. As such, the benefits of regularization make it clear that it should be considered when working with larger data sets.
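
To make the objective concrete, here is a small base R sketch using made-up numbers (they do not come from any dataset in this post). It computes RSS plus a ridge-style shrinkage penalty for two values of lambda, showing how a larger lambda inflates the quantity being minimized and therefore pushes the fit toward smaller coefficients.

```r
# Toy illustration of the regularized objective: RSS + lambda * penalty.
# The residuals and coefficients below are made up for demonstration.
residuals <- c(1.2, -0.5, 0.8, -1.1)
betas     <- c(2.0, -3.0, 0.5)

rss        <- sum(residuals^2)  # 3.54
l2.penalty <- sum(betas^2)      # 13.25 (ridge-style squared coefficients)

objective <- function(lambda) rss + lambda * l2.penalty
objective(0)    # 3.54   -- plain least squares, no penalty
objective(0.5)  # 10.165 -- the penalty term now outweighs the RSS
```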

Ridge Regression

Ridge regression uses a penalty based on the sum of the squared normalized coefficients, λ × Σ(normalized coefficients)².

This is also referred to as the L2-norm. As lambda increases in value, the coefficients in the model are shrunk towards 0 but never reach 0. The higher the lambda, the smaller the coefficients become as they are shrunk more and more.

The benefit is that predictive accuracy is often increased. However, interpreting and communicating your results can become difficult because no variables are removed from the model. Instead the variables are reduced near to zero. This can be especially tough if you have dozens of variables remaining in your model to try to explain.

Lasso

Lasso is short for “Least Absolute Shrinkage and Selection Operator”. This approach uses the L1-norm, which is the sum of the absolute values of the coefficients, λ × Σ|normalized coefficients|.

This shrinkage penalty can reduce a coefficient all the way to 0, which is another way of saying that variables can be removed from the model. One problem is that highly correlated variables that need to be in your model may be removed when lasso shrinks coefficients. This is one reason why ridge regression is still used.

Elastic Net

Elastic net aims to combine the strengths of ridge and lasso without the weaknesses of either. It can remove variables from the model, as lasso does and ridge does not, while also handling groups of correlated variables, as ridge does and lasso does not.

This is done by including a second tuning parameter called “alpha”. If alpha is set to 0 it is the same as ridge regression, and if alpha is set to 1 it is the same as lasso regression. Below is the formula used for elastic net regression

(RSS + λ[(1 − α) × Σ(normalized coefficients)²/2 + α × Σ|normalized coefficients|]) / N

As such when working with elastic net you have to set two different tuning parameters (alpha and lambda) in order to develop a model.
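
The alpha mixing can be sketched in a few lines of base R with made-up coefficients (the RSS and 1/N terms are left out for clarity): alpha = 0 recovers the ridge penalty and alpha = 1 recovers the lasso penalty.

```r
# Hedged sketch of the elastic net penalty term with made-up coefficients.
betas <- c(2.0, -3.0, 0.5)

enet.penalty <- function(alpha, betas) {
  (1 - alpha) * sum(betas^2) / 2 + alpha * sum(abs(betas))
}

enet.penalty(0, betas)    # 6.625  -- pure ridge: sum(betas^2) / 2
enet.penalty(1, betas)    # 5.5    -- pure lasso: sum(abs(betas))
enet.penalty(0.5, betas)  # 6.0625 -- an even blend of the two
```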

Conclusion

Regularized regression was developed as an answer to the growth in the size and number of variables in datasets today. Ridge, lasso and elastic net all provide solutions for converging over large datasets and selecting features.

# Social Networks and Language Habits

In this post, we will look at how relationships that people have can play a role in how they communicate with those around them. Understanding this can help people to comprehend differences in communication style.

In sociolinguistics, social networks refer to the pattern of informal relationships that people have and experience on a consistent basis. There are two dimensions that can be used to describe a person’s social network. These two terms are density and plexity.

Density

The density of a social network refers to how well people in your network know each other. In other words, density is how well your friends know each other. We all have friends; we have friends who know each other, and we have friends who do not know each other.

If many of your friends know each other then the density is high. If your friends do not know each other the density is low. An example of a high density network would be the typical family. Everybody knows each other. An example of a low density network would be employees at a large company. In such a situation it would not be hard to find a friend of a friend that you do not know.

Plexity

Plexity is a measure of the various types of interactions that you are involved in with other people. Plexity can be uniplex, which involves one type of interaction with a person, or multiplex, which involves many types of interactions with a person.

An example of a uniplex interaction may be a worker with their boss; they only interact at work. A multiplex interaction would again be with members of one’s family. When dealing with family, interactions could include school, work, recreation, shopping, etc. In all these examples it is the same people interacting in a multitude of settings.

Language Use in Social Networks

A person’s speech almost always reflects the network that they belong to. If the group is homogeneous, we will almost always speak the way everyone else does, assuming we want to be a part of the group. For example, a group of local construction workers will more than likely use similar language patterns due to the homogeneous nature of the group, while a group of ESL bankers would not, as they come from many different countries.

When a person belongs to more than one social network they will almost always unconsciously change the way they communicate based on the context. For example, anybody who has moved away from home communicates differently where they live now than when they communicate with family and friends back home. This is true even when moving from one place to another in the same province or state of your country.

Conclusion

The language that people employ is affected by the dynamics of their social networks. We naturally adjust our communication to accommodate who we are talking to.

# Each Person Post One Discussion Forum in Moodle VIDEO

Using the Moodle forum option  of each person posts one discussion

# Primary Tasks in Data Analysis

Performing a data analysis in the realm of data science is a difficult task due to the huge number of decisions that need to be made. For some people, plotting the course to conduct an analysis is easy. However, for most of us, beginning a project leads to a sense of paralysis as we struggle to determine what to do.

In light of this challenge, there are at least 5 core tasks that you need to consider when preparing to analyze data. These five tasks are

1. Developing your research question(s)
2. Data exploration
3. Developing a statistical model
4. Interpreting the results
5. Sharing the results

Developing Your Research Question(s)

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

Data Exploration

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploration of the data allows you to determine if there are any problems with the data set such as missing data, strange variables, and if necessary to develop a data dictionary so you know the characteristics of the variables.

Data exploration also allows you to determine what kind of data wrangling needs to be done. This involves the preparation of the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist’s time and is not easy at all. Mastery of this in many ways means being a master of data science.

Develop a Statistical Model

Your research questions and the data exploration process help you to determine what kind of model to develop. The factors that can affect this are whether your problem is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the most fun part of data analysis and is much easier than having to wrangle with the data. Your goal is to determine if the model helps to answer your question(s)

Interpreting the Results

Once a model is developed it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models need to normally be explainable to non-technical stakeholders.

With interpretation you are trying to determine “what does this answer mean to the stakeholders?”  For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

Communication of Results

Now is the time to actually share the answer(s) to your question(s). How this is done varies but it can be written, verbal or both. Whatever the mode of communication it is necessary to consider the following

• The audience or stakeholders
• The actual answers to the questions
• The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from how you speak to academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations, etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

Conclusion

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.

# Linear VS Quadratic Discriminant Analysis in R

In this post we will look at linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). Discriminant analysis is used when the dependent variable is categorical. Another commonly used option is logistic regression but there are differences between logistic regression and discriminant analysis. Both LDA and QDA are used in situations in which there is a clear separation between the classes you want to predict. If the categories are fuzzier logistic regression is often the better choice.

For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, and the number of math, physics, and chemistry classes a respondent took. Below is some initial code to start our analysis.

``````library(MASS);library(Ecdat)
data("Mathlevel")``````

The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that contains the “Mathlevel” dataset with the “NA”s removed using the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data using the “sample” function with a 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code

``````math<-na.omit(Mathlevel)
set.seed(123)
math.ind<-sample(2,nrow(math),replace=T,prob = c(0.7,0.3))
math.train<-math[math.ind==1,]
math.test<-math[math.ind==2,]``````

Now we will make our model, which is called “lda.math”, and it will include all available variables in the “math.train” dataset. Next we will check the results by calling the model. Finally, we will examine the plot to see how our model is doing. Below is the code.

``````lda.math<-lda(sex~.,math.train)
lda.math``````
``````## Call:
## lda(sex ~ ., data = math.train)
##
## Prior probabilities of groups:
##      male    female
## 0.5986079 0.4013921
##
## Group means:
##        mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male   -0.10767593  0.01141838 -0.05854724   0.2070778  0.05032544
## female -0.05571153  0.05360844 -0.08967303   0.2030860 -0.01072169
##        mathlevel^6      sat languageyes  majoreco  majoross   majorns
## male    -0.2214849 632.9457  0.07751938 0.3914729 0.1472868 0.1782946
## female  -0.2226767 613.6416  0.19653179 0.2601156 0.1907514 0.2485549
##          majorhum mathcourse physiccourse chemistcourse
## male   0.05426357   1.441860    0.7441860      1.046512
## female 0.07514451   1.421965    0.6531792      1.040462
##
## Coefficients of linear discriminants:
##                       LD1
## mathlevel.L    1.38456344
## mathlevel.Q    0.24285832
## mathlevel.C   -0.53326543
## mathlevel^4    0.11292817
## mathlevel^5   -1.24162715
## mathlevel^6   -0.06374548
## sat           -0.01043648
## languageyes    1.50558721
## majoreco      -0.54528930
## majoross       0.61129797
## majorns        0.41574298
## majorhum       0.33469586
## mathcourse    -0.07973960
## physiccourse  -0.53174168
## chemistcourse  0.16124610``````
``plot(lda.math,type='both')``

Calling “lda.math” gives us the details of our model. It starts by indicating the prior probabilities of someone being male or female. Next are the means for each variable by sex. The last part is the coefficients of the linear discriminants. Each of these values is used to determine the probability that a particular example is male or female. This is similar to a regression equation.

The plot provides us with the densities of the discriminant scores for males and then for females. The output indicates a problem: there is a great deal of overlap between males and females in the model. This indicates that there is a lot of misclassification going on, as the two groups are not clearly separated. Furthermore, this means that logistic regression is probably a better choice for distinguishing between males and females. However, since this is for demonstration purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with the actual classification. Below is the code.

``````math.lda.predict<-predict(lda.math)
math.train\$lda<-math.lda.predict\$class
table(math.train\$lda,math.train\$sex)``````
``````##
##          male female
##   male    219    100
##   female   39     73``````
``mean(math.train\$lda==math.train\$sex)``
``## [1] 0.6774942``

As you can see, we have a lot of misclassification happening, with a large number of females being classified as male. The overall accuracy is only 68%, which is not much better than simply guessing the majority class.

We will now conduct the same analysis on the test data set. Below is the code.

``````lda.math.test<-predict(lda.math,math.test)
math.test\$lda<-lda.math.test\$class
table(math.test\$lda,math.test\$sex)``````
``````##
##          male female
##   male     92     43
##   female   23     20``````
``mean(math.test\$lda==math.test\$sex)``
``## [1] 0.6292135``

As you can see, the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females, as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better

QDA allows each class in the dependent variable to have its own covariance matrix rather than the shared covariance of LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.

``````math.qda.fit<-qda(sex~.,math.train)
math.qda.predict<-predict(math.qda.fit)
math.train\$qda<-math.qda.predict\$class
table(math.train\$qda,math.train\$sex)``````
``````##
##          male female
##   male    215     84
##   female   43     89``````
``mean(math.train\$qda==math.train\$sex)``
``## [1] 0.7053364``

You can see there is only a slight improvement. Below is the code for the test data.

``````math.qda.test<-predict(math.qda.fit,math.test)
math.test\$qda<-math.qda.test\$class
table(math.test\$qda,math.test\$sex)``````
``````##
##          male female
##   male     91     43
##   female   24     20``````
``mean(math.test\$qda==math.test\$sex)``
``## [1] 0.6235955``

Still disappointing. However, in this post we reviewed linear discriminant analysis as well as learned about the use of quadratic discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes a shared covariance across the dependent variable categories while QDA allows each category in the dependent variable to have its own covariance.

# Simple Discussion Forum in Moodle VIDEO

How to create a simple discussion forum in Moodle

# Sociolinguistic Insights into Female Communication

In general, women tend to prefer to use the most standard or prestige form of a language regardless of cultural background or geography. Linguists have proposed several potential reasons for this. This post will share some of the most common ideas on why women often use the standard form of their language.

Social Status

There is a belief among many linguists that women use the most prestigious forms of their language because they are more status-conscious than men. By using the standard version of their language a woman is able to claim a higher status.

The implication of this is that women have a lower status in society and try to elevate themselves through their use of language. However, this conclusion has been refuted, as women who work outside the home use more of the standard form of their language than women who work in their home.

If the social status hypothesis were correct, women who work at home, and thus have the lowest status, should use more of the standard form than women who work outside the home. Currently, this is not the case.

Women as Protector of Society’s Values

The women-as-protector-of-values view sees social pressure as a constraint on how women communicate. Simply, women use more standard forms of their language than men because women are expected to behave better. It is thrust upon women to serve as an example for their community and especially for their children.

This answer is considered correct but depends highly on context. For example, this idea falls apart most frequently when women communicate with their children. The informal and intimate setting often leads to most women using the vernacular aspects of their language.

Women as Subordinate Group

A third suggestion is that women, who are often a subordinate group, use the more standard version of their language to show deference to those over them. In other words, women use the most polite forms of their language to avoid offending men.

However, this suggestion also fails because it equates politeness with the standard form of a language. People can be polite using vernacular and they can be rude using the most prestigious form of their language possible.

Vernacular as Masculine

A final common hypothesis on women’s use of standard forms is the perception that the use of the vernacular is masculine and tough. Women choose the standard form as a way of demonstrating behaviors traditionally associated with gender in their culture. Men on the other hand, use vernacular forms to show traits that are traditionally associated with male behaviors.

The problem with this belief is the informal settings. As mentioned previously, women and men use more vernacular forms of their language in informal settings. As such, it seems that context is one of the strongest factors in how language is used and not necessarily gender.

# Using Standard General Forums in MOODLE VIDEO

Explanation on using general forums in Moodle

# Discrete-Point and Integrative Language Testing Methods

Within language testing, at least two major viewpoints on assessment have arisen over time. Originally, the view was that assessing language should look at specific elements of a language; in other words, language assessment should examine discrete aspects of the language.

A reaction to these discrete methods came about with the idea that language is holistic, so testing should be integrative, addressing many aspects of language simultaneously. In this post, we will take a closer look at discrete and integrative language testing methods by providing examples of each along with a comparison.

Discrete-Point Testing

Discrete-point testing works on the assumption that language can be reduced to several discrete component “points” and that these “points” can be assessed. Examples of discrete-point test items in language testing include multiple choice, true/false, fill in the blank, and spelling.

What all of these example items have in common is that they usually isolate an aspect of the language from the broader context. For example, a simple spelling test is highly focused on the orthographic characteristics of the language. True/false items can be used to assess knowledge of various grammar rules, and so on.

The primary criticism of discrete-point testing was its discreteness. Many believe that language is holistic and that in the real world students will never have to deal with language in such an isolated way. This led to the development of integrative language testing methods.

Integrative Language Testing Methods

Integrative language testing is based on the unitary trait hypothesis, which states that language is indivisible. This is in complete contrast to discrete-point methods, which support dividing language into specific components. Two common integrative language assessments are the cloze test and dictation.

A cloze test involves taking an authentic reading passage and removing words from it. Which words are removed depends on the test creator. Normally, it is every 6th or 7th word, but it could be more or fewer, or only the removal of key vocabulary. In addition, sometimes potential words are given to the student to select from, and sometimes no list of words is given.

The student’s job is to look at the context of the entire passage to determine which words to write in the blank spaces. This is an integrative experience, as the students have to consider grammar, vocabulary, context, etc. to complete the assessment.
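Teachers who prepare their own cloze passages can automate the blanking step. Below is a minimal sketch in R (the language used in this blog’s tutorial posts); the function name `make_cloze` and the every-nth-word interval are illustrative choices, not a standard tool:

```r
# A minimal sketch of cloze generation: blank every nth word of a passage.
make_cloze <- function(text, n = 7) {
  words <- strsplit(text, " ")[[1]]        # split the passage into words
  blanks <- seq(n, length(words), by = n)  # every nth position
  words[blanks] <- "_____"                 # replace each with a blank
  paste(words, collapse = " ")
}

make_cloze("The quick brown fox jumps over the lazy dog near the old barn", n = 4)
# "The quick brown _____ jumps over the _____ dog near the _____ barn"
```

A real cloze generator would also handle punctuation and perhaps target key vocabulary, but this shows the basic idea.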

Dictation is simply writing down what was heard. This also requires the use of several language skills simultaneously in a realistic context.

Integrative language testing has also faced criticism. For example, discrete-point testing has consistently shown that people score differently in different language skills, and this fact has been replicated in many studies. As such, the exclusive use of integrative language approaches is not supported by most TESOL scholars.

Conclusion

As with many other concepts in education, the best choice between discrete-point and integrative testing is a combination of both. The exclusive use of either will not allow students to demonstrate mastery of the language.

# Validating a Logistic Model in R

In this post, we are going to continue our analysis of the logistic regression model from the post on logistic regression in R. We need to rerun all of the code from the last post to be ready to continue. As such, the code from the last post is below.

``````library(MASS);library(bestglm);library(reshape2);library(corrplot);
library(ggplot2);library(ROCR)``````
``````data(survey)
survey\$Clap<-NULL
survey\$W.Hnd<-NULL
survey\$Fold<-NULL
survey\$Exer<-NULL
survey\$Smoke<-NULL
survey\$M.I<-NULL
survey<-na.omit(survey)
pm<-melt(survey, id.var="Sex")
ggplot(pm,aes(Sex,value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)``````

``````pc<-cor(survey[,2:5])
corrplot.mixed(pc)``````

``````set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]
fit<-glm(Sex~.,binomial,train)
exp(coef(fit))``````

``````train\$probs<-predict(fit, type = 'response')
train\$predict<-rep('Female',123)
train\$predict[train\$probs>0.5]<-"Male"
table(train\$predict,train\$Sex)``````
``mean(train\$predict==train\$Sex)``
``````test\$prob<-predict(fit,newdata = test, type = 'response')
test\$predict<-rep('Female',46)
test\$predict[test\$prob>0.5]<-"Male"
table(test\$predict,test\$Sex)``````
``mean(test\$predict==test\$Sex)``

Model Validation

We will now do a K-fold cross-validation to further see how our model is doing. We cannot use the factor variable “Sex” with the K-fold code, so we need to create a dummy variable. First, we create a variable called “y” that has 123 spaces, the same size as the “train” dataset. Second, we fill “y” with a 1 in every example that is coded “Male” in the “Sex” variable.

In addition, we also need to create a new dataset and remove some variables from our prior analysis; otherwise we will confuse the functions that we are going to use. We will remove “predict”, “Sex”, and “probs”.

``````train\$y<-rep(0,123)
train\$y[train\$Sex=="Male"]=1
my.cv<-train[,-8]
my.cv\$Sex<-NULL
my.cv\$probs<-NULL``````

We now can do our K-fold analysis. The code is dense, so work through it slowly and double-check it on your own.

``bestglm(Xy=my.cv,IC="CV",CVArgs = list(Method="HTF",K=10,REP=1),family = binomial)``
``## Morgan-Tatar search since family is non-gaussian.``
``````## CV(K = 10, REP = 1)
## BICq equivalent for q in (6.66133814775094e-16, 0.0328567092272112)
## Best Model:
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -45.2329733 7.80146036 -5.798014 6.710501e-09
## Height        0.2615027 0.04534919  5.766425 8.097067e-09``````

The results confirm what we already knew: only the “Height” variable is valuable in predicting Sex. We will now create our new model using only the recommendation of the k-fold validation analysis. Then we check the new model against the train dataset and the test dataset. The code below is a repeat of prior code but based on the cross-validation.

``````reduce.fit<-glm(Sex~Height, family=binomial,train)
train\$cv.probs<-predict(reduce.fit,type='response')
train\$cv.predict<-rep('Female',123)
train\$cv.predict[train\$cv.probs>0.5]='Male'
table(train\$cv.predict,train\$Sex)``````
``````##
##          Female Male
##   Female     61   11
##   Male        7   44``````
``mean(train\$cv.predict==train\$Sex)``
``## [1] 0.8536585``
``````test\$cv.probs<-predict(reduce.fit,test,type = 'response')
test\$cv.predict<-rep('Female',46)
test\$cv.predict[test\$cv.probs>0.5]='Male'
table(test\$cv.predict,test\$Sex)``````
``````##
##          Female Male
##   Female     16    7
##   Male        1   22``````
``mean(test\$cv.predict==test\$Sex)``
``## [1] 0.826087``

The results are consistent for both the train and test datasets. We are now going to create the ROC curve. This will provide a visual and the AUC number to further help us assess our model. However, a model is only good when it is compared to another model. Therefore, we will create a really bad model in order to compare it to the original model and the cross-validated model. We will first make the bad model and store its probabilities in the “test” dataset. The bad model will use “Age” to predict “Sex”, which doesn’t make any sense at all. Below is the code, followed by the ROC curve of the bad model.

``````bad.fit<-glm(Sex~Age,family = binomial,test)
test\$bad.probs<-predict(bad.fit,type = 'response')
pred.bad<-prediction(test\$bad.probs,test\$Sex)
perf.bad<-performance(pred.bad,'tpr','fpr')
plot(perf.bad,col=1)``````

The closer the line is to the diagonal, the worse it is. As we can see, the bad model is really bad.

What we just did with the bad model we will now repeat for the full model and the cross-validated model. As before, we need to store the predictions in a way that the ROCR package can use. We will create a variable called “pred.full” to begin the process of graphing the original full model from the last blog post, using the “prediction” function. Next, we will create the “perf.full” variable to store the performance of the model. Notice the arguments ‘tpr’ and ‘fpr’ for true positive rate and false positive rate. Lastly, we plot the results.

``````pred.full<-prediction(test\$prob,test\$Sex)
perf.full<-performance(pred.full,'tpr','fpr')
plot(perf.full, col=2)``````

We repeat this process for the cross-validated model

``````pred.cv<-prediction(test\$cv.probs,test\$Sex)
perf.cv<-performance(pred.cv,'tpr','fpr')
plot(perf.cv,col=3)``````

Now let’s put all the different models on one plot

``````plot(perf.bad,col=1)
plot(perf.full,col=2,add=TRUE)
plot(perf.cv,col=3,add=TRUE)``````

Finally, we can calculate the AUC for each model

``````auc.bad<-performance(pred.bad,'auc')
auc.bad@y.values``````
``````## [[1]]
## [1] 0.4766734``````
``````auc.full<-performance(pred.full,"auc")
auc.full@y.values``````
``````## [[1]]
## [1] 0.959432``````
``````auc.cv<-performance(pred.cv,'auc')
auc.cv@y.values``````
``````## [[1]]
## [1] 0.9107505``````

The higher the AUC, the better. As such, the full model with all variables is superior to the cross-validated and bad models. This is despite the fact that there are many high correlations in the full model. Another point to consider is that the cross-validated model is simpler, so this may be a reason to pick it over the full model. As such, the statistics provide support for choosing a model, but they do not trump the ability of the researcher to choose based on factors beyond just numbers.

# Logistic Regression in R

In this post, we will conduct a logistic regression analysis. Logistic regression is used when you want to predict a categorical dependent variable using continuous or categorical independent variables. In our example, we want to predict Sex (male or female) using several continuous variables from the “survey” dataset in the “MASS” package.

``library(MASS);library(bestglm);library(reshape2);library(corrplot)``
``````data(survey)
?MASS::survey #explains the variables in the study``````

The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove any examples that are missing data, so we use the “na.omit” function for this. Below is the code.

``````survey\$Clap<-NULL
survey\$W.Hnd<-NULL
survey\$Fold<-NULL
survey\$Exer<-NULL
survey\$Smoke<-NULL
survey\$M.I<-NULL
survey<-na.omit(survey)``````

We now need to check for collinearity using the “corrplot.mixed” function from the “corrplot” package.

``````pc<-cor(survey[,2:5])
corrplot.mixed(pc)``````

We have an extreme correlation between “Wr.Hnd” and “NW.Hnd”. This makes sense because people’s hands are normally the same size. Since this blog post is a demonstration of logistic regression, we will not worry about this too much.

We now need to divide our dataset into a train and a test set. We set the seed for reproducibility. First, we make a variable called “ind” that randomly assigns 70% of the rows of survey a 1 and 30% a 2. We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable, and we create the “test” dataset from all the rows that line up with 2 in the “ind” variable. This means our data split is 70% train and 30% test. Below is the code.

``````set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]``````

We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios.

``````fit<-glm(Sex~.,family=binomial,train)
summary(fit)``````
``````##
## Call:
## glm(formula = Sex ~ ., family = binomial, data = train)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.9875  -0.5466  -0.1395   0.3834   3.4443
##
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.42175    8.74961  -5.306 1.12e-07 ***
## Wr.Hnd       -0.43499    0.66357  -0.656    0.512
## NW.Hnd        1.05633    0.70034   1.508    0.131
## Pulse        -0.02406    0.02356  -1.021    0.307
## Height        0.21062    0.05208   4.044 5.26e-05 ***
## Age           0.00894    0.05368   0.167    0.868
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 169.14  on 122  degrees of freedom
## Residual deviance:  81.15  on 117  degrees of freedom
## AIC: 93.15
##
## Number of Fisher Scoring iterations: 6``````
``exp(coef(fit))``
``````##  (Intercept)       Wr.Hnd       NW.Hnd        Pulse       Height
## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00
##          Age
## 1.008980e+00``````

The results indicate that only height is useful in predicting whether someone is male or female. The second piece of code shares the odds ratios. An odds ratio tells how a one-unit increase in the independent variable changes the odds of being male in our model. For example, for every one-unit increase in height, the odds of a particular example being male are multiplied by 1.23.
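As a quick sanity check on that interpretation, an odds ratio is just the exponential of the corresponding regression coefficient. A small sketch, copying the Height estimate from the summary output above:

```r
# exp() turns a log-odds coefficient into an odds ratio.
beta_height <- 0.21062   # Height coefficient from summary(fit)
exp(beta_height)         # ~1.234, matching the Height entry of exp(coef(fit))
```

This is exactly what `exp(coef(fit))` computed for every coefficient at once.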

We now need to see how well our model does on the train and test datasets. We first capture the probabilities and save them to the train dataset as “probs”. Next, we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable, changing any example that has a probability above 0.5 to “Male”. Then we make a table of our results to see the number correct and the false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code.

``````train\$probs<-predict(fit, type = 'response')
train\$predict<-rep('Female',123)
train\$predict[train\$probs>0.5]<-"Male"
table(train\$predict,train\$Sex)``````
``````##
##          Female Male
##   Female     61    7
##   Male        7   48``````
``mean(train\$predict==train\$Sex)``
``## [1] 0.8861789``

Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset.

``````test\$prob<-predict(fit,newdata = test, type = 'response')
test\$predict<-rep('Female',46)
test\$predict[test\$prob>0.5]<-"Male"
table(test\$predict,test\$Sex)``````
``````##
##          Female Male
##   Female     17    3
##   Male        0   26``````
``mean(test\$predict==test\$Sex)``
``## [1] 0.9347826``

As you can see, we do even better on the test set, with an accuracy of 93.4%. Our model is looking pretty good, and height is an excellent predictor of sex, which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess its quality.

# Teaching Vocabulary to ESL Students

Language acquisition requires learning thousands of words for fluent communication. This is a daunting task for even the most talented and eager student. Fortunately, there are some basic concepts to keep in mind when teaching students vocabulary. This post will share some suggestions for helping students to develop their vocabulary in the target language.

Learn Vocabulary in Context

A common technique for teaching vocabulary in language classrooms is out of context memorization. Students are given a long and often boring list of words to memorize. There is little immediate use of these words and they are quickly forgotten after the quiz.

Instead, it is better to teach new words within a framework in which they will be used. For example, students learn business terms through role play at a bank or store rather than through a stack of index cards. The context of the bank connects the words to a real-world setting, which is critical for retention in the long-term memory.

Reduce Reliance on Bilingual Dictionaries

This may come as a surprise; however, a bilingual dictionary provides the definition of a word but does not normally help with memorization and future use of the word. If the goal is communication, then bilingual dictionaries will slow a student’s ability to achieve mastery.

Children learn language much faster due in part to the immense effort it takes to learn what new words mean without the easy answer of a dictionary. The effort leads to memorization, which allows for the use of the language. This serves as a valuable lesson for adults who prefer the easy route of bilingual dictionaries.

Set Aside Class Time to Deal with Vocabulary

The teacher should have a systematic plan for helping students develop relevant vocabulary. This can be done through activities as well as the teaching of context clues. Vocabulary development needs to be intentional, which means there must be a deliberate plan for supporting students in it.

However, there are also times where unplanned vocabulary teaching can take place. For example, while the students are reading together, they become puzzled over a word you thought they knew (this is common). When this happens, a break with an explanation can be helpful. This is especially true if you let the students work together without dictionaries to try to determine the meaning of the word.

Conclusion

Vocabulary is a necessary element of language learning. It would be nice to ignore this, but normally this is impossible. As such, teachers need to support students in their vocabulary development.

# Probability, Odds, and Odds Ratio

In logistic regression, there are three terms that are used frequently but can be confusing if they are not thoroughly explained. These three terms are probability, odds, and odds ratio. In this post, we will look at these three terms and provide an explanation of them.

Probability

Probability is probably (no pun intended) the easiest of these three terms to understand. Probability is simply the likelihood that a certain event will happen. To calculate probability in the traditional sense, you need to know the number of favorable events and the total number of possible outcomes. For example, the probability of rolling a three with a fair die is 1 out of 6.

Bayesian probability uses prior probabilities to develop a posterior probability based on new evidence. For example, at one point during Super Bowl LI the Atlanta Falcons had a 99.7% chance of winning. This was based on such factors as the number of points they were ahead and the time remaining. As these changed, so did the probability of them winning. Yet the Patriots still found a way to win with less than a 1% chance.

Bayesian probability was also used for predicting who would win the 2016 US presidential race. It is important to remember that probability is an expression of confidence and not a guarantee as we saw in both examples.

Odds

Odds are the expression of relative probabilities. Odds are calculated using the following equation

probability of the event ⁄ (1 – probability of the event)

For example, at one point during Super Bowl LI the odds of the Atlanta Falcons winning were as follows

0.997 ⁄ (1 – 0.997) ≈ 332

This can be interpreted as odds of 332 to 1! This means that Atlanta was 332 times more likely to win the Super Bowl than to lose it.
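This conversion is easy to verify in R; the object names below are just illustrative:

```r
# Odds are the probability of an event over the probability of its complement.
p <- 0.997
odds <- p / (1 - p)
round(odds)   # 332
```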

Odds are commonly used in gambling, and this is probably (again, no pun intended) where most of us have heard the term before. Odds are just an extension of probabilities, and they are most commonly expressed as a fraction such as one in four, etc.

Odds Ratio

A ratio is a comparison of two numbers and indicates how many times one number contains or is contained by another. For example, if the ratio of boys to girls is 5 to 1, it means that there are five boys for every one girl.

By extension, an odds ratio is the comparison of two different odds. For example, if the odds of Team A making the playoffs are 0.45 and the odds of Team B making the playoffs are 0.35, the odds ratio is calculated as follows.

0.45 ⁄ 0.35 ≈ 1.29

Team A is about 1.29 times more likely to make the playoffs than Team B.
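The same arithmetic in R, with the two teams’ odds stored in illustrative variables:

```r
# An odds ratio is one set of odds divided by another.
odds_A <- 0.45
odds_B <- 0.35
odds_A / odds_B   # ~1.29
```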

The value of the odds and the odds ratio can sometimes be the same. Below is the odds ratio of the Atlanta Falcons winning versus the New England Patriots winning Super Bowl LI.

0.997 ⁄ 0.003 ≈ 332

As such, there is little difference between odds and an odds ratio, except that an odds ratio is the ratio of two odds. As you can tell, this can be confusing for the average person. However, understanding these terms is critical to the application of logistic regression.

# Assessing Writing from a Discourse Perspective

Often, when teachers provide feedback on a student’s writing, they tend to focus on the grammatical/punctuation aspects of the paper. However, this often does not make a lasting impression, and it can frequently cause students to freeze up when they need to write, as they become obsessed with the details of grammar rather than with the shaping of ideas.

Another approach to providing feedback to students is to analyze and assess their writing from the perspective of discourse. Discourse rules have to do with the overall structure of a paper; they are the big-picture aspects of writing. Clear discourse can often help to overcome poor grammar/punctuation, but excellent grammar/punctuation cannot overcome a poorly structured paper. This post will provide some of the components of discourse as they relate to writing a paper.

The Organizational Level

At the highest, broadest level is the organizational level. At this level, you are looking to be sure that the students have included an introduction, body, and conclusion in their paper. This seems elementary, but it is common for students to forget to include an introduction and/or a conclusion in their writing.

You also want to check that the introduction, body, and conclusion are in proportion to each other based on how long the paper was intended to be. Often, students write short introductions, have a long body section, and have little conclusion, as they are exhausted from the writing.

At this point, thorough reading is not taking place; rather, you are glancing to see if all the parts are there. You are also searching to see if the ideas in the introduction are present in the body and reiterated in the conclusion. Students frequently wander when writing, as they do not plan what to say but rather write and see what Google provides them.

The Section Level

At the section level, you are looking to make sure that the various parts that belong within the introduction, body, and conclusion are present. For the introduction, if it is a standard research paper, some of the things to look for include the following

• background to the problem
• problem statement
• objectives
• significance statement

For the body section, things to look for include

• Discussion of first objective
• Discussion of second objective
• etc

For the conclusion, there is more fluidity in how this can be done, but you can look for the following

• Summary of the introduction
• Main point of each objective
• Concluding remark(s)

First, you are checking that these components are there. Second, you are checking for clarity. Normally, if the problem and objectives are unclear, the entire paper is doomed to incomprehensibility.

However, bad grammar is not the reason that problems and objectives are unclear. Instead, it may be that the problem is too broad, cannot be dealt with in the space provided, etc. Objectives normally have the same problems but can also be unrelated to the problem statement as well.

Sometimes the problem and objectives are too narrowly defined in terms of the expertise of the student. As such, what works is highly subjective, but the comments given to the student need to be substantive and not something as vague as “look at this a second time.”

If you cannot give substantive feedback, it is normally better to set aside whatever weakness you found until you can articulate it clearly. If that is not possible, it’s better to remain silent.

The body section must address all objectives mentioned in the introduction. Otherwise, the reader will become confused, as promises made in the introduction were never fulfilled in the body.

The conclusion is more art than science. However, there should be an emphasis on what has been covered as well as what it means for the reader.

The Paragraph Level

At the paragraph level, you are looking for two things in every paragraph

• main idea
• supporting details

Every paragraph should have one main idea, which summarizes the point of the paragraph. The main idea is always singular. If there is more than one main idea, the student should develop a second paragraph for the second main idea.

In addition, the supporting details in the paragraph should be on topic with the main idea. Often, students will have inconsistencies between the main idea and the supporting details. This can be solved by doing one of the following

• Change the main idea to be consistent with the supporting details
• Change the supporting details to be consistent with the main idea

At the paragraph level, you are also assessing that the individual paragraphs support the objective of the section. This again has to do with focusing on a singular thought within a particular section and within each paragraph. Students love to wander when writing, as stated previously. Writing is about breaking down a problem into smaller and smaller pieces through explanation.

Conclusion

The assessment of the discourse of a paper should come before the grammatical marking of it. When ideas flow, the grammatical issues are often harder to notice. It is the shaping of discourse that engages the thinking and improves the writing of a student in ways that grammatical comments can never achieve.

# The Grader Report in Moodle VIDEO

This video explains the grader report view in Moodle

# Best Subset Regression in R

In this post, we will take a look at best subset regression. Best subset regression fits a model for all possible feature or variable combinations and the decision for the most appropriate model is made by the analyst based on judgment or some statistical criteria.

Best subset regression is an alternative to both forward and backward stepwise regression. Forward stepwise selection adds one variable at a time based on the lowest residual sum of squares until no additional variable continues to lower the residual sum of squares. Backward stepwise regression starts with all variables in the model and removes variables one at a time. The concern with stepwise methods is that they can produce biased regression coefficients, conflicting models, and inaccurate confidence intervals.

Best subset regression bypasses these weaknesses of stepwise models by creating all possible models and then allowing you to assess which variables should be included in your final model. The one drawback of best subset is that a large number of variables means a large number of potential models, which can make it difficult to decide among several choices.

In this post, we will use the “Fair” dataset from the “Ecdat” package to predict marital satisfaction based on age, Sex, the presence of children, years married, religiosity, education, occupation, and number of affairs in the past year. Below is some initial code.

``library(leaps);library(Ecdat);library(car);library(lmtest)``
`data(Fair)`

We begin our analysis by building the initial model with all variables in it. Below is the code

``````fit<-lm(rate~.,Fair)
summary(fit)``````
``````##
## Call:
## lm(formula = rate ~ ., data = Fair)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2049 -0.6661  0.2298  0.7705  2.2292
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.522875   0.358793   9.819  < 2e-16 ***
## sexmale     -0.062281   0.099952  -0.623  0.53346
## age         -0.009683   0.007548  -1.283  0.20005
## ym          -0.019978   0.013887  -1.439  0.15079
## childyes    -0.206976   0.116227  -1.781  0.07546 .
## religious    0.042142   0.037705   1.118  0.26416
## education    0.068874   0.021153   3.256  0.00119 **
## occupation  -0.015606   0.029602  -0.527  0.59825
## nbaffairs   -0.078812   0.013286  -5.932 5.09e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.03 on 592 degrees of freedom
## Multiple R-squared:  0.1405, Adjusted R-squared:  0.1289
## F-statistic:  12.1 on 8 and 592 DF,  p-value: 4.487e-16``````

The initial results are already interesting even though the r-squared is low. When couples have children, they have less marital satisfaction than couples without children when controlling for the other factors, and this is the strongest regression weight. In addition, the more education a person has, the greater their marital satisfaction. Lastly, as the number of affairs increases, marital satisfaction decreases. Keep in mind that the “rate” variable goes from 1 to 5, with one meaning a terrible marriage and five a great one. The mean marital satisfaction was 3.52 when controlling for the other variables.

We will now create our subset models. Below is the code.

``````sub.fit<-regsubsets(rate~.,Fair)
best.summary<-summary(sub.fit)``````

In the code above, we create the sub-models using the “regsubsets” function from the “leaps” package and save the result in the variable called “sub.fit”. We then save the summary of “sub.fit” in the variable “best.summary”. We will use the “best.summary” and “sub.fit” variables several times to determine which model to use.

There are many different ways to assess the model. We will use the following statistical methods that come with the results from the “regsubsets” function.

• Mallow’s Cp
• Bayesian Information Criteria

We will make two charts for each of the criteria above. The plot to the left will explain how many features to include in the model. The plot to the right will tell you which variables to include. It is important to note that for both of these methods, the lower the score the better the model. Below is the code for Mallow’s Cp.

``````par(mfrow=c(1,2))
plot(best.summary\$cp)
plot(sub.fit,scale = "Cp")``````

The plot on the left suggests that a four-feature model is the most appropriate. However, this chart does not tell us which four features. The chart on the right is read in reverse order: the high numbers are at the bottom and the low numbers are at the top of the y-axis. Knowing this, we can conclude that the most appropriate variables to include in the model are age, presence of children, education, and number of affairs. Below are the results using the Bayesian Information Criterion.

``````par(mfrow=c(1,2))
plot(best.summary\$bic)
plot(sub.fit,scale = "bic")``````

These results indicate that a three-feature model is appropriate. The features are years married, education, and number of affairs. Presence of children was not considered beneficial. Since our original model and Mallow’s Cp indicated that presence of children was significant, we will include it for now.

Below is the code for the model based on the subset regression.

``````fit2<-lm(rate~age+child+education+nbaffairs,Fair)
summary(fit2)``````
``````##
## Call:
## lm(formula = rate ~ age + child + education + nbaffairs, data = Fair)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2172 -0.7256  0.1675  0.7856  2.2713
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.861154   0.307280  12.566  < 2e-16 ***
## age         -0.017440   0.005057  -3.449 0.000603 ***
## childyes    -0.261398   0.103155  -2.534 0.011531 *
## education    0.058637   0.017697   3.313 0.000978 ***
## nbaffairs   -0.084973   0.012830  -6.623 7.87e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.029 on 596 degrees of freedom
## Multiple R-squared:  0.1352, Adjusted R-squared:  0.1294
## F-statistic: 23.29 on 4 and 596 DF,  p-value: < 2.2e-16``````

The results look okay. The older a person is, the less satisfied they are with their marriage. If children are present, the marriage is less satisfying. The more educated a person is, the more satisfied they are. Lastly, a higher number of affairs indicates less marital satisfaction. However, before we get excited we need to check for collinearity and homoscedasticity. Below is the code.

``vif(fit2)``
``````##       age     child education nbaffairs
##  1.249430  1.228733  1.023722  1.014338``````

No issues with collinearity; VIF values above 5 or 10 would indicate a problem. Let’s check for homoscedasticity.

``````par(mfrow=c(2,2))
plot(fit2)``````

The normal Q-Q plot and the residuals vs leverage plot can be used for locating outliers. The residuals vs fitted and the scale-location plots do not look good, as there appears to be a pattern in the dispersion, which indicates heteroscedasticity. To confirm this we will use the Breusch-Pagan test from the “lmtest” package. Below is the code.

``bptest(fit2)``
``````##
##  studentized Breusch-Pagan test
##
## data:  fit2
## BP = 16.238, df = 4, p-value = 0.002716``````

There you have it. Our model violates the assumption of homoscedasticity. However, this model was developed for demonstration purposes to provide an example of subset regression.

# Writing Techniques for the ESL Classroom

In-class writing is common in many ESL contexts. This post will provide several different ways that teachers can get their students writing in an ESL classroom.

Imitation

Perhaps the simplest way to get ESL students writing is to have them imitate what is read to them. This allows the students to learn the conventions of writing in the target language.

This is usually done through some form of dictation. The teacher reads a few words or reads slowly. This provides students with time to write down what they heard.

The actual marking of such an activity would involve the use of rubrics or some sort of count system for the number of words the student was able to write down. Often, spelling and punctuation are not considered major factors in the grade because of the rushed nature of the writing.

Controlled and Guided

Controlled writing involves having students modify an existing writing sample. For example, changing all the verbs in a paragraph from past to present tense. This will often require them to change more than just the verbs but other aspects of the writing as well.

Guided writing involves having the students respond to some sort of question or stimulus. For example, the students may watch a video and then be asked to write about it or answer questions. They may also try to rewrite something that they heard at normal speed.

Self-Writing

The most common form of self-writing is the writing of a journal. The writing is intended only for the student. Note-taking is also considered a form of self-writing even though it is not normally comprehensible to others.

Self-writing, particularly journals, can be useful in developing reflective thinking in students in general even with the language barriers of writing in another language.

Display and Real Writing

Display writing is writing that is primarily intended for the teacher, who already knows the answer that the student is addressing. Examples of this type of writing include essays and other writing for the purpose of a summative assessment. The student is literally displaying what they already know.

Real writing is writing in which the reader does not know the answer to the question the student is addressing. As such, one of the main differences between display and real writing is the knowledge that the audience of the writing has.

Conclusion

When working with students it is important to provide them with learning experiences that stimulate the growth and development that they need. Understanding the various forms of writing that can happen in an ESL classroom can provide teachers with ideas on how to help their students.

# Data Wrangling in R

Collecting and preparing data for analysis is the primary job of a data scientist. This process is called data wrangling. In this post, we will look at an example of data wrangling using a simple artificial dataset. You can create the table below in R or Excel. If you created it in Excel, just save it as a csv and load it into R. Below is the initial code.

``````library(readr)
apple<-read_csv("apple.csv") #assumed filename; use whatever you saved the csv as
apple``````
``````## # A tibble: 10 × 2
##        weight      location
##         <chr>         <chr>
## 1         3.2        Europe
## 2       4.2kg       europee
## 3      1.3 kg          U.S.
## 4  7200 grams           USA
## 5          42 United States
## 6         2.3       europee
## 7       2.1kg        Europe
## 8       3.1kg           USA
## 9  2700 grams          U.S.
## 10         24 United States``````

This is a small dataset with the columns “weight” and “location”. Here are some of the problems:

• Weights are in different units
• Weights are written in different ways
• Location is not consistent

In order to have any success with data wrangling you need to state specifically what it is you want to do. Here are our goals for this project:

• Convert the “Weight variable” to a numerical variable instead of character
• Remove the text and have only numbers in the “weight variable”
• Change weights in grams to kilograms
• Convert the “location” variable to a factor variable instead of character
• Have consistent spelling for Europe and United States in the “location” variable

We will begin with the “weight” variable. We want to convert it to a numerical variable and remove any non-numerical text. Below is the code for this

``````corrected.weight<-as.numeric(gsub(pattern = "[[:alpha:]]","",apple\$weight))
corrected.weight``````
``##  [1]    3.2    4.2    1.3 7200.0   42.0    2.3    2.1    3.1 2700.0   24.0``

Here is what we did.

1. We created a variable called “corrected.weight”
2. We used the function “as.numeric”, which converts whatever results inside it into a numerical variable
3. Inside “as.numeric” we used the “gsub” function, which allows us to substitute one value for another.
4. Inside “gsub” we set the pattern argument to “[[:alpha:]]” and the replacement to “”. This told R to look for any lower or uppercase letters and replace them with nothing, that is, remove them. This all pertains to the “weight” variable in the “apple” dataframe.
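As a small standalone sketch (the strings here are made up for illustration), the same “gsub” plus “as.numeric” idea can be tried on a few values:

```r
# Remove every letter, then parse what is left as a number
raw <- c("4.2kg", "1.3 kg", "7200 grams")
as.numeric(gsub(pattern = "[[:alpha:]]", replacement = "", x = raw))
## the letters are gone, leaving 4.2, 1.3, and 7200
```

Note that “as.numeric” ignores any leftover whitespace, which is why “1.3 kg” still parses cleanly after the letters are stripped.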

We now need to convert the weights in grams to kilograms so that everything is the same unit. Below is the code

``````gram.error<-grep(pattern = "[[:digit:]]{4}",apple\$weight)
corrected.weight[gram.error]<-corrected.weight[gram.error]/1000
corrected.weight``````
``##  [1]  3.2  4.2  1.3  7.2 42.0  2.3  2.1  3.1  2.7 24.0``

Here is what we did

1. We created a variable called “gram.error”
2. We used the “grep” function to search the “weight” variable in the “apple” data frame for input that is four digits in length; this is what the “[[:digit:]]{4}” pattern means. We do not change any values yet; we just store their positions in “gram.error”
3. Once this information is stored in “gram.error” we use it as a subset index for the “corrected.weight” variable.
4. We tell R to take the values of “corrected.weight” at the positions stored in “gram.error” and divide them by 1000. Dividing by 1000 converts the values from grams to kilograms.
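Note that “grep” returns positions, not values. A quick illustration with made-up entries:

```r
# grep() returns the positions of entries containing four digits in a row
weights <- c("3.2", "7200 grams", "1.3 kg", "2700 grams")
grep(pattern = "[[:digit:]]{4}", weights)
## positions 2 and 4, the gram-based entries
```

Those positions are what make the subset-and-divide step above possible.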

We have completed the transformation of the “weight” variable and will now move on to the problems with the “location” variable in the “apple” dataframe. We will first deal with the values that relate to Europe and then with the values related to the United States. Below is the code.

``````europe<-agrep(pattern = "europe",apple\$location,ignore.case = T,max.distance = list(insertion=c(1),deletions=c(2)))
america<-agrep(pattern = "us",apple\$location,ignore.case = T,max.distance = list(insertion=c(0),deletions=c(2),substitutions=0))
corrected.location<-apple\$location
corrected.location[europe]<-"europe"
corrected.location[america]<-"US"
corrected.location<-gsub(pattern = "United States","US",corrected.location)
corrected.location``````
``````##  [1] "europe" "europe" "US"     "US"     "US"     "europe" "europe"
##  [8] "US"     "US"     "US"``````

The code is a little complicated, but in short, we used the “agrep” function to tell R to search the “location” variable for values similar to our term “europe”. The other arguments set how far a value can differ from the pattern and still count as a match. This process is repeated for the term “us”. We then copy the “location” variable from the “apple” dataframe into a new variable called “corrected.location”. We then apply the two index objects we made, “europe” and “america”, to the “corrected.location” variable. Lastly, we deal with “United States” using the “gsub” function.
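Since “agrep” can be hard to follow, here is a small hypothetical sketch of its fuzzy matching, using made-up misspellings:

```r
# agrep() allows approximate matches: "europee" contains the pattern exactly,
# and "Europ" is one deletion away from "europe"
locations <- c("Europe", "europee", "Europ", "U.S.")
agrep(pattern = "europe", locations, ignore.case = TRUE,
      max.distance = list(insertions = 1, deletions = 2))
## matches positions 1, 2, and 3; "U.S." is too far from the pattern
```

Ordinary “grep” would have missed the misspelled variants, which is why fuzzy matching is the right tool here.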

We are almost done. Now we combine our two variables, “corrected.weight” and “corrected.location”, into a new data.frame. The code is below.

``````cleaned.apple<-data.frame(corrected.weight,corrected.location)
names(cleaned.apple)<-c('weight','location')
cleaned.apple``````
``````##    weight location
## 1     3.2   europe
## 2     4.2   europe
## 3     1.3       US
## 4     7.2       US
## 5    42.0       US
## 6     2.3   europe
## 7     2.1   europe
## 8     3.1       US
## 9     2.7       US
## 10   24.0       US``````

If you use the “str” function on the “cleaned.apple” dataframe you will see that “location” was automatically converted to a factor.
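One caveat, assuming you are on a newer version of R: since R 4.0, data.frame() defaults to stringsAsFactors = FALSE, so the conversion to a factor has to be made explicitly. A minimal sketch with made-up values:

```r
# On R >= 4.0 character columns stay character unless converted,
# so wrap the column in factor() explicitly
cleaned <- data.frame(weight = c(3.2, 4.2, 1.3),
                      location = factor(c("europe", "europe", "US")))
str(cleaned)  # "location" now shows as a Factor with 2 levels
```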

This looks much better especially if you compare it to the original dataframe that is printed at the top of this post.

# Understanding ESL Writing Patterns Across Cultures

When people learn English, they almost always bring how they communicate in their first language with them when speaking or writing in English. However, for native speakers of English, the written communication style of ESL students can be bewildering even if it is grammatically sound.

This phenomenon of the L1 influencing the writing style of the L2 is known as contrastive rhetoric. This post will provide examples from different cultures in terms of how they approach writing in English and compare it to how a native-speaking person from a Western country writes to show the differences.

The Native English Speaker Writing Example

Below is a simple paragraph written by a Native English speaking person.

Exercise is good for a person for several reasons. For example, exercise helps to strengthen the body. As a person moves, he or she is utilizing their muscles, which promotes maintenance and potentially growth of the muscle. Second, exercise helps to remove waste from the body. Strenuous exercise causes people to sweat and breathe deeply, and this increases the removal of harmful elements from the body. Lastly, exercise makes people feel good. Exercise encourages the release of various hormones that make a person feel better. Therefore, people should exercise in order to enjoy these clear benefits.

The writing style of an English speaker is usually highly linear in nature. In the example above, the first sentence is clearly the main idea or the point. Right from the beginning the English writer shares with you where they stand on the subject. There is little mystery or suspense as to what will be talked about.

The rest of the paragraph consists of supporting details for the main idea. The supporting details start with the discourse markers “for example”, “second”, and “lastly”. Everything in the paragraph is laid out in a step-by-step manner that is highly clear, as this is important for English speakers.

Unfortunately, this linear style of writing is what many ESL students from other cultures are compared to. The next examples have perfect “English”; however, the style of communication does not follow this linear manner.

Eastern Writing Style

According to Robert Kaplan, people from Eastern countries write in a circular, indirect manner. This means that Eastern writing lacks the direct point or main idea of Western writing and also lacks clearly structured supporting details. Below is the same paragraph as the one in the English example but written in a more Eastern style.

As a person moves, he or she is utilizing their muscles, which promotes maintenance and potentially growth of the muscle. Strenuous exercise causes people to sweat and breathe deeply, and this increases the removal of harmful elements from the body. Exercise encourages the release of various hormones that make a person feel better.

The example is grammatically sound, but for a native English speaker there are several problems with the writing.

1. There is no main idea. The entire paragraph is missing a point. The writer is laying down claims about their point but never actually tells you what the point is. Native speakers want a succinct summary of the point when information is shared with them. Eastern writers prefer an indirect or implied main idea because being too direct is considered rude. In addition, if you are too clear in an Eastern context, it is hard to evade and prevaricate if someone is offended by what is said.
2. The discourse markers are missing. There is no mention of “for example” or “second”. Discourse markers give a paragraph a strong sense of linear direction. The native English speaker can tell where they are in a train of thought when these markers are there. When they are missing, the English reader is left wondering when the experience is going to be over.
3. There are no transition sentences. In the native English speaking example, every discourse marker served as the first word in a transition sentence, which moved the reader from one supporting detail to the next. The Eastern example has only details, without any guideposts from one detail to the other. If a paragraph is really long, this can become overwhelming for the native English speaker.

The example is highly fluent, and this kind of writing is common in many English-speaking countries that are not in the West. Even with excellent knowledge of the language, differences in discourse style affect the ability to communicate.

Conclusion

My students have shared with me that English writing is clear and easy to understand but too direct in nature, whereas teachers complain that ESL students’ writing is unclear and indirect.

This is not a matter of right or wrong but of differences in how people communicate when writing. A student who is aware of how they communicate can make adjustments so that whoever they are writing for can understand them. The goal should not be to change students but to make them aware of their assumptions so they can adjust depending on the situation.