In sociolinguistics, a social network is the pattern of informal relationships that a person has and experiences on a consistent basis. Two dimensions can be used to describe a person’s social network: density and plexity.

**Density**

The density of a social network refers to how well the people in your network know each other. In other words, density is how well your friends know each other. We all have friends who know each other and friends who do not.

If many of your friends know each other then the density is high. If your friends do not know each other the density is low. An example of a high density network would be the typical family. Everybody knows each other. An example of a low density network would be employees at a large company. In such a situation it would not be hard to find a friend of a friend that you do not know.
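One common way to put a number on density is to divide the ties that actually exist by the ties that could exist. The sketch below assumes a made-up network of four people (the names and ties are hypothetical) and a helper we call `network_density`, which is our own illustration rather than a standard function.

```r
# Hypothetical friendship network: 1 = the two people know each other
people <- c("Ana", "Ben", "Cai", "Dee")
ties <- matrix(0, nrow = 4, ncol = 4, dimnames = list(people, people))
ties["Ana", "Ben"] <- 1
ties["Ana", "Cai"] <- 1
ties["Ben", "Cai"] <- 1
ties["Cai", "Dee"] <- 1
ties <- ties + t(ties)  # make every tie mutual

# Density = ties that exist / ties that could exist
network_density <- function(adj) {
  n <- nrow(adj)
  actual <- sum(adj[upper.tri(adj)])   # count each pair once
  possible <- n * (n - 1) / 2
  actual / possible
}

network_density(ties)   # 4 of the 6 possible ties exist
```

A family in which everybody knows everybody would score 1, while a large company where most pairs of employees have never met would score close to 0.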

**Plexity**

Plexity is a measure of the range of interactions that you have with other people. A relationship can be uniplex, which involves one type of interaction with a person, or multiplex, which involves many types of interactions with a person.

An example of a uniplex relationship is a worker and their boss who only interact at work. A multiplex relationship would again be with members of one’s family. Interactions with family could include school, work, recreation, shopping, etc. In all these examples it is the same people interacting in a multitude of settings.

**Language Use in Social Networks**

A person’s speech almost always reflects the network that they belong to. If the group is homogeneous, we will almost always speak the way everyone else does, assuming we want to be a part of the group. For example, a group of local construction workers will more than likely use similar language patterns due to the homogeneous nature of the group, while a group of bankers who speak English as a second language would not, as they come from many different countries.

When a person belongs to more than one social network, they will almost always unconsciously change the way they communicate based on the context. For example, anybody who has moved away from home communicates differently where they live than they do with family and friends back home. This is true even when moving from one place to another within the same province or state.

**Conclusion**

The language that people employ is affected by the dynamics of their social network. We naturally adjust our communication to accommodate whomever we are talking to.


In light of this challenge, there are at least five core tasks that you need to consider when preparing to analyze data. These five tasks are:

- Developing your question(s)
- Data exploration
- Developing a statistical model
- Interpreting the results
- Sharing the results

**Developing Your Question(s)**

You really cannot analyze data until you first determine what it is you want to know. It is tempting to just jump in and start looking for interesting stuff but you will not know if something you find is interesting unless it helps to answer your question(s).

There are several types of research questions. The point is you need to ask them in order to answer them.

**Data Exploration**

Data exploration allows you to determine if you can answer your questions with the data you have. In data science, the data is normally already collected by the time you are called upon to analyze it. As such, what you want to find may not be possible.

In addition, exploring the data allows you to determine if there are any problems with the data set, such as missing data or strange variables. If necessary, you can also develop a data dictionary so you know the characteristics of the variables.

Data exploration also allows you to determine what kind of data wrangling needs to be done. This involves preparing the data for a more formal analysis when you develop your statistical models. This process takes up the majority of a data scientist’s time and is not easy at all. Mastery of this in many ways means being a master of data science.
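As a rough sketch of what a first pass at exploration can look like, the code below uses R’s built-in “airquality” dataset. The dataset and the variable names “missing.counts” and “complete.rows” are our own choices for illustration, and your own data will call for different checks.

```r
# Built-in example data, used here only for illustration
data(airquality)

str(airquality)        # variable names and types
summary(airquality)    # ranges, quartiles, and NA counts

# Missing values per variable
missing.counts <- colSums(is.na(airquality))
missing.counts

# Rows that would survive if incomplete cases were dropped
complete.rows <- sum(complete.cases(airquality))
complete.rows
```

Checks like these tell you early whether the data can support your question and how much wrangling lies ahead.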

**Develop a Statistical Model**

Your research questions and the data exploration process help you to determine what kind of model to develop. The factors that affect this choice are whether the learning task is supervised or unsupervised and whether you want to classify or predict numerical values.

This is probably the most fun part of data analysis and is much easier than having to wrangle the data. Your goal is to determine if the model helps to answer your question(s).

**Interpreting the Results**

Once a model is developed, it is time to explain what it means. Sometimes you can make a really cool model that nobody (including yourself) can explain. This is especially true of “black box” methods such as support vector machines and artificial neural networks. Models normally need to be explainable to non-technical stakeholders.

With interpretation you are trying to determine “what does this answer mean to the stakeholders?” For example, if you find that people who smoke are 5 times more likely to die before the age of 50 what are the implications of this? How can the stakeholders use this information to achieve their own goals? In other words, why should they care about what you found out?

**Communication of Results**

Now is the time to actually share the answer(s) to your question(s). How this is done varies, but it can be written, verbal, or both. Whatever the mode of communication, it is necessary to consider the following:

- The audience or stakeholders
- The actual answers to the questions
- The benefits of knowing this

You must remember the stakeholders because this affects how you communicate. How you speak to business professionals would be different from how you speak to academics. Next, you must share the answers to the questions. This can be done with charts, figures, illustrations, etc. Data visualization is an expertise of its own. Lastly, you explain how this information is useful in a practical way.

**Conclusion**

The process shared here is one way to approach the analysis of data. Think of this as a framework from which to develop your own method of analysis.


For our example, we will use the “Mathlevel” dataset found in the “Ecdat” package. Our goal will be to predict the sex of a respondent based on SAT math score, major, foreign language proficiency, as well as the number of math, physics, and chemistry classes a respondent took. Below is some initial code to start our analysis.

`library(MASS);library(Ecdat)`

`data("Mathlevel")`

The first thing we need to do is clean up the data set. We have to remove any missing data in order to run our model. We will create a dataset called “math” that has the “Mathlevel” dataset but with the “NA”s removed using the “na.omit” function. After this, we need to set our seed for the purpose of reproducibility using the “set.seed” function. Lastly, we will split the data using the “sample” function with an approximate 70/30 split. The training dataset will be called “math.train” and the testing dataset will be called “math.test”. Below is the code.

```
math<-na.omit(Mathlevel)
set.seed(123)
math.ind<-sample(2,nrow(math),replace=T,prob = c(0.7,0.3))
math.train<-math[math.ind==1,]
math.test<-math[math.ind==2,]
```
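One detail worth knowing is that “sample” with “replace=T” and “prob” assigns each row independently, so the split is only approximately 70/30. The sketch below shows this on a hypothetical 1,000 rows, together with one alternative that gives exact group sizes. The names “train.rows” and “test.rows” are ours and are not used elsewhere in this post.

```r
set.seed(123)
n <- 1000  # hypothetical number of rows

# Approach used above: each row is independently assigned to group 1 or 2
ind <- sample(2, n, replace = TRUE, prob = c(0.7, 0.3))
prop.table(table(ind))   # close to, but usually not exactly, 0.7 / 0.3

# Alternative: draw exactly 70% of the row numbers for training
train.rows <- sample(n, size = round(0.7 * n))
test.rows <- setdiff(seq_len(n), train.rows)
length(train.rows)       # 700
length(test.rows)        # 300
```

Either approach is fine for a demonstration. The exact version is handy when you need the group sizes to be predictable.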

Now we will make our model, called “lda.math”, which will include all available variables in the “math.train” dataset. Next, we will check the results by calling the model. Finally, we will examine the plot to see how our model is doing. Below is the code.

```
lda.math<-lda(sex~.,math.train)
lda.math
```

```
## Call:
## lda(sex ~ ., data = math.train)
##
## Prior probabilities of groups:
## male female
## 0.5986079 0.4013921
##
## Group means:
## mathlevel.L mathlevel.Q mathlevel.C mathlevel^4 mathlevel^5
## male -0.10767593 0.01141838 -0.05854724 0.2070778 0.05032544
## female -0.05571153 0.05360844 -0.08967303 0.2030860 -0.01072169
## mathlevel^6 sat languageyes majoreco majoross majorns
## male -0.2214849 632.9457 0.07751938 0.3914729 0.1472868 0.1782946
## female -0.2226767 613.6416 0.19653179 0.2601156 0.1907514 0.2485549
## majorhum mathcourse physiccourse chemistcourse
## male 0.05426357 1.441860 0.7441860 1.046512
## female 0.07514451 1.421965 0.6531792 1.040462
##
## Coefficients of linear discriminants:
## LD1
## mathlevel.L 1.38456344
## mathlevel.Q 0.24285832
## mathlevel.C -0.53326543
## mathlevel^4 0.11292817
## mathlevel^5 -1.24162715
## mathlevel^6 -0.06374548
## sat -0.01043648
## languageyes 1.50558721
## majoreco -0.54528930
## majoross 0.61129797
## majorns 0.41574298
## majorhum 0.33469586
## mathcourse -0.07973960
## physiccourse -0.53174168
## chemistcourse 0.16124610
```

`plot(lda.math,type='both')`

Calling “lda.math” gives us the details of our model. It starts by indicating the prior probabilities of someone being male or female. Next are the means of each variable by sex. The last part is the coefficients of the linear discriminant. These coefficients are combined with an example’s values to produce a discriminant score, which is then used to determine the probability that the example is male or female. This is similar to a regression equation.
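To make the link between the coefficients and the scores concrete, here is a small sketch that rebuilds the discriminant scores by hand. It uses two species from the built-in “iris” data rather than “Mathlevel” so that it runs without extra packages, and the object names (“two”, “fit”, “manual.scores”) are our own.

```r
library(MASS)

# Two-class example built from the iris data (illustrative only)
two <- droplevels(subset(iris, Species != "setosa"))
fit <- lda(Species ~ ., data = two)

# Centre the predictors at the prior-weighted average of the group means,
# then multiply by the LD1 coefficients
X <- as.matrix(two[, 1:4])
centre <- colSums(fit$prior * fit$means)
manual.scores <- scale(X, center = centre, scale = FALSE) %*% fit$scaling

# Should match the scores that predict() reports
all.equal(as.vector(manual.scores), as.vector(predict(fit)$x))
```

The same idea applies to “lda.math”: each respondent’s score is a weighted sum of their centred predictor values, with the printed coefficients as the weights.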

The plot provides us with the densities of the discriminant scores for males and then for females. The output indicates a problem. There is a great deal of overlap between males and females in the model. What this indicates is that there is a lot of misclassification going on as the two groups are not clearly separated. Another method, such as logistic regression, may be a better choice for distinguishing between males and females. However, since this is for demonstration purposes we will not worry about this.

We will now use the “predict” function on the training set data to see how well our model classifies the respondents by gender. We will then compare the prediction of the model with the actual classification. Below is the code.

```
math.lda.predict<-predict(lda.math)
math.train$lda<-math.lda.predict$class
table(math.train$lda,math.train$sex)
```

```
##
## male female
## male 219 100
## female 39 73
```

`mean(math.train$lda==math.train$sex)`

`## [1] 0.6774942`

As you can see, we have a lot of misclassification happening. A large number of females are being classified as male (100 of 173). The overall accuracy is only about 68%, which is not much better than the roughly 60% we would get by always guessing male.
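The numbers in the training table can be checked by hand. The sketch below re-enters the table as a matrix (rows are predictions, columns are actual values) and computes the accuracy, the accuracy of always guessing the larger class, and the share of females correctly identified. The names “cm”, “baseline”, and “female.recall” are ours.

```r
# Training confusion matrix from above: rows = predicted, columns = actual
cm <- matrix(c(219, 39, 100, 73), nrow = 2,
             dimnames = list(predicted = c("male", "female"),
                             actual = c("male", "female")))

accuracy <- sum(diag(cm)) / sum(cm)        # share of correct predictions
baseline <- max(colSums(cm)) / sum(cm)     # always guess the larger class
female.recall <- cm["female", "female"] / sum(cm[, "female"])

round(c(accuracy = accuracy, baseline = baseline,
        female.recall = female.recall), 3)
```

Comparing accuracy against a baseline like this is a quick way to judge whether a classifier has learned anything beyond the class proportions.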

We will now conduct the same analysis on the test data set. Below is the code.

```
lda.math.test<-predict(lda.math,math.test)
math.test$lda<-lda.math.test$class
table(math.test$lda,math.test$sex)
```

```
##
## male female
## male 92 43
## female 23 20
```

`mean(math.test$lda==math.test$sex)`

`## [1] 0.6292135`

As you can see, the results are similar. To put it simply, our model is terrible. The main reason is that there is little distinction between males and females as shown in the plot. However, we can see if perhaps a quadratic discriminant analysis will do better.

QDA allows each class in the dependent variable to have its own covariance matrix rather than a shared covariance matrix as in LDA. This allows for quadratic terms in the development of the model. To complete a QDA we need to use the “qda” function from the “MASS” package. Below is the code for the training data set.

```
math.qda.fit<-qda(sex~.,math.train)
math.qda.predict<-predict(math.qda.fit)
math.train$qda<-math.qda.predict$class
table(math.train$qda,math.train$sex)
```

```
##
## male female
## male 215 84
## female 43 89
```

`mean(math.train$qda==math.train$sex)`

`## [1] 0.7053364`

You can see there is only a small improvement, from about 68% to about 71%. Below is the code for the test data.

```
math.qda.test<-predict(math.qda.fit,math.test)
math.test$qda<-math.qda.test$class
table(math.test$qda,math.test$sex)
```

```
##
## male female
## male 91 43
## female 24 20
```

`mean(math.test$qda==math.test$sex)`

`## [1] 0.6235955`

Still disappointing. However, in this post we reviewed linear discriminant analysis and learned about the use of quadratic discriminant analysis. Both of these statistical tools are used for predicting categorical dependent variables. LDA assumes a shared covariance matrix across the categories of the dependent variable, while QDA allows each category to have its own covariance matrix.
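To see the kind of data where separate covariance matrices pay off, here is a small simulation. Everything in it is made up for illustration: two groups share the same centre but differ in spread, which is a situation a single straight boundary cannot handle.

```r
library(MASS)
set.seed(123)

# Simulated data (hypothetical): both groups are centred at the origin,
# but group "b" is three times as spread out as group "a"
n <- 500
sim <- data.frame(
  x1 = c(rnorm(n, sd = 1), rnorm(n, sd = 3)),
  x2 = c(rnorm(n, sd = 1), rnorm(n, sd = 3)),
  group = factor(rep(c("a", "b"), each = n))
)

lda.sim <- lda(group ~ x1 + x2, data = sim)
qda.sim <- qda(group ~ x1 + x2, data = sim)

# Training accuracy of each model
lda.acc <- mean(predict(lda.sim)$class == sim$group)
qda.acc <- mean(predict(qda.sim)$class == sim$group)
c(lda = lda.acc, qda = qda.acc)
```

In a setup like this QDA should clearly outperform LDA. In the “Mathlevel” data the two methods performed about the same, which suggests the groups there do not differ much in either their means or their covariance structure.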


**Social Status**

There is a belief among many linguists that women use the most prestigious forms of their language because they are more status-conscious than men. By using the standard version of their language a woman is able to claim a higher status.

The implication of this is that women have a lower status in society and try to elevate themselves through their use of language. However, this conclusion has been refuted, as women who work outside the home use more of the standard form of their language than women who work in their home.

If the social status hypothesis were correct, women who work at home, and thus have the lowest status, should use more of the standard form than women who work outside the home. Currently, this is not the case.

**Women as Protector of Society’s Values**

The women as protector of values view sees social pressure as a constraint on how women communicate. Simply, women use more standard forms of their language than men because women are expected to behave better. It is thrust upon women to serve as an example for their community and especially for their children.

This answer is considered correct but depends highly on context. For example, this idea falls apart most frequently when women communicate with their children. The informal and intimate setting often leads to most women using the vernacular aspects of their language.

**Women as Subordinate Group**

A third suggestion is that women, who are often a subordinate group, use the more standard version of their language to show deference to those over them. In other words, women use the most polite forms of their language to avoid offending men.

However, this suggestion also fails because it equates politeness with the standard form of a language. People can be polite using vernacular and they can be rude using the most prestigious form of their language possible.

**Vernacular as Masculine**

A final common hypothesis on women’s use of standard forms is the perception that the use of the vernacular is masculine and tough. Women choose the standard form as a way of demonstrating behaviors traditionally associated with their gender in their culture. Men, on the other hand, use vernacular forms to show traits that are traditionally associated with male behaviors.

The problem with this belief is the informal settings. As mentioned previously, women and men use more vernacular forms of their language in informal settings. As such, it seems that context is one of the strongest factors in how language is used and not necessarily gender.


A reaction to these discrete methods came about with the idea that language is holistic, so testing should be integrative, or address many aspects of language simultaneously. In this post, we will take a closer look at discrete and integrative language testing methods by providing examples of each along with a comparison.

**Discrete-Point Testing**

Discrete-point testing works on the assumption that language can be reduced to several discrete component “points” and that these “points” can be assessed. Examples of discrete-point test items in language testing include multiple choice, true/false, fill in the blank, and spelling.

What all of these example items have in common is that they usually isolate an aspect of the language from the broader context. For example, a simple spelling test is highly focused on the orthographic characteristics of the language. True/false can be used to assess knowledge of various grammar rules, etc.

The primary criticism of discrete-point testing was its discreteness. Many believe that language is holistic and that in the real world students will never have to deal with language in such an isolated way. This led to the development of integrative language testing methods.

**Integrative Language Testing Methods**

Integrative language testing is based on the unitary trait hypothesis, which states that language is indivisible. This is in complete contrast to discrete-point methods, which support dividing language into specific components. Two common integrative language assessments include the cloze test and dictation.

A cloze test involves taking an authentic reading passage and removing words from it. Which words to remove depends on the test creator. Normally, it is every 6th or 7th word, but it could be more or less frequent, or only key vocabulary could be removed. In addition, sometimes potential words are given to the student to select from, and sometimes no list of words is given.

The student’s job is to look at the context of the entire passage to determine which words to write in the blank spaces. This is an integrative experience as the students have to consider grammar, vocabulary, context, etc. to complete the assessment.
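For teachers who want to build such a test quickly, the every-nth-word approach is easy to automate. The sketch below is one possible way to do it in R. The function name “make_cloze”, the blank marker, and the sample passage are all our own inventions, and the code simply splits on spaces, so punctuation stays attached to whichever word it follows.

```r
# Replace every nth word with a blank and keep the removed words as the key
make_cloze <- function(passage, n = 7, blank = "____") {
  words <- strsplit(passage, " ", fixed = TRUE)[[1]]
  if (length(words) < n) {
    return(list(text = passage, answers = character(0)))
  }
  gaps <- seq(n, length(words), by = n)
  answers <- words[gaps]
  words[gaps] <- blank
  list(text = paste(words, collapse = " "), answers = answers)
}

# Hypothetical passage, with every 6th word removed
passage <- "The students walked to the library because they needed a quiet place to study for the exam"
cloze <- make_cloze(passage, n = 6)
cloze$text
cloze$answers
```

The “answers” element can serve as the answer key or, shuffled, as the word bank given to students.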

Dictation is simply writing down what was heard. This also requires the use of several language skills simultaneously in a realistic context.

Integrative language testing also has faced criticism. For example, discrete-point testing has always shown that people score differently in different language skills and this fact has been replicated in many studies. As such, the exclusive use of integrative language approaches is not supported by most TESOL scholars.

**Conclusion**

As with many other concepts in education the best choice between discrete-point and integrative testing is a combination of both. The exclusive use of either will not allow the students to demonstrate mastery of the language.


```
library(MASS);library(bestglm);library(reshape2);library(corrplot);
library(ggplot2);library(ROCR)
```

```
data(survey)
# Drop the factor predictors, then remove rows with missing values
survey$Clap<-NULL
survey$W.Hnd<-NULL
survey$Fold<-NULL
survey$Exer<-NULL
survey$Smoke<-NULL
survey$M.I<-NULL
survey<-na.omit(survey)
# Boxplots of each numeric variable by sex
pm<-melt(survey, id.var="Sex")
ggplot(pm,aes(Sex,value))+geom_boxplot()+facet_wrap(~variable,ncol = 3)
# Correlations among the numeric predictors
pc<-cor(survey[,2:5])
corrplot.mixed(pc)
# 70/30 train/test split and the full logistic regression model
set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]
fit<-glm(Sex~.,binomial,train)
exp(coef(fit))
```

```
train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
```

`mean(train$predict==train$Sex)`

```
test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
```

`mean(test$predict==test$Sex)`

**Model Validation**

We will now do a K-fold cross-validation in order to further see how our model is doing. We cannot use the factor variable “Sex” with the K-fold code so we need to create a dummy variable. First, we create a variable called “y” that has 123 zeros, which is the same number of rows as the “train” dataset. Second, we fill “y” with 1 in every example that is coded “Male” in the “Sex” variable.

In addition, we also need to create a new dataset and remove some variables from our prior analysis, otherwise we will confuse the functions that we are going to use. We will remove “predict”, “Sex”, and “probs”.

```
train$y<-rep(0,123)
train$y[train$Sex=="Male"]=1
my.cv<-train[,-8]
my.cv$Sex<-NULL
my.cv$probs<-NULL
```
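Before handing the work over to a package, it may help to see what K-fold cross-validation does step by step. The sketch below is a hand-rolled version and not a reimplementation of what “bestglm” does internally. It uses a fresh copy of the survey data with only “Sex” and “Height”, and the names “sv”, “fold”, and “fold.acc” are ours.

```r
library(MASS)
set.seed(123)

# Keep only the outcome and one predictor, dropping incomplete rows
sv <- na.omit(MASS::survey[, c("Sex", "Height")])

k <- 10
# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(1:k, length.out = nrow(sv)))

fold.acc <- numeric(k)
for (i in 1:k) {
  held.out <- sv[fold == i, ]
  # Fit on the other k - 1 folds, then predict the held-out fold
  fit.i <- glm(Sex ~ Height, family = binomial, data = sv[fold != i, ])
  p.male <- predict(fit.i, newdata = held.out, type = "response")
  pred <- ifelse(p.male > 0.5, "Male", "Female")
  fold.acc[i] <- mean(pred == held.out$Sex)
}

mean(fold.acc)   # cross-validated accuracy
```

Each row is used for testing exactly once and for training the other nine times, which is what makes the averaged accuracy a fairer estimate than a single train/test split.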

We can now do our K-fold analysis. The code is complicated, so you can take it on trust for now and double-check it on your own later.

`bestglm(Xy=my.cv,IC="CV",CVArgs = list(Method="HTF",K=10,REP=1),family = binomial)`

`## Morgan-Tatar search since family is non-gaussian.`

```
## CV(K = 10, REP = 1)
## BICq equivalent for q in (6.66133814775094e-16, 0.0328567092272112)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -45.2329733 7.80146036 -5.798014 6.710501e-09
## Height 0.2615027 0.04534919 5.766425 8.097067e-09
```

The results confirm what we already knew: only the “Height” variable is valuable in predicting Sex. We will now create our new model using only the recommendation of the K-fold validation analysis. Then we check the new model against the train dataset and the test dataset. The code below is a repeat of prior code but based on the cross-validation.

```
reduce.fit<-glm(Sex~Height, family=binomial,train)
train$cv.probs<-predict(reduce.fit,type='response')
train$cv.predict<-rep('Female',123)
train$cv.predict[train$cv.probs>0.5]='Male'
table(train$cv.predict,train$Sex)
```

```
##
## Female Male
## Female 61 11
## Male 7 44
```

`mean(train$cv.predict==train$Sex)`

`## [1] 0.8536585`

```
test$cv.probs<-predict(reduce.fit,test,type = 'response')
test$cv.predict<-rep('Female',46)
test$cv.predict[test$cv.probs>0.5]='Male'
table(test$cv.predict,test$Sex)
```

```
##
## Female Male
## Female 16 7
## Male 1 22
```

`mean(test$cv.predict==test$Sex)`

`## [1] 0.826087`
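With only one predictor, the 0.5 cutoff translates directly into a height. The sketch below works this out from the coefficients printed in the “Best Model” output, assuming the reduced model’s coefficients match them. The function “p.male” and the variable “cutoff” are our own names.

```r
# Coefficients copied from the "Best Model" output above
b0 <- -45.2329733   # intercept
b1 <- 0.2615027     # Height

# Predicted probability of "Male" at a given height (cm)
p.male <- function(height) plogis(b0 + b1 * height)

# Height at which the predicted probability is exactly 0.5
cutoff <- -b0 / b1
cutoff

p.male(c(160, cutoff, 185))
```

The cutoff comes out at roughly 173 cm, so the reduced model amounts to a simple rule: respondents taller than that are classified as male and shorter ones as female.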

The results are consistent for both the train and test datasets. We are now going to create the ROC curve. This will provide a visual and the AUC number to further help us assess our model. However, a model is only good when it is compared to another model. Therefore, we will create a really bad model in order to compare it to the original model and the cross-validated model. We will first make a bad model and store the probabilities in the “test” dataset. The bad model will use “Age” to predict “Sex”, which doesn’t make any sense at all. Below is the code followed by the ROC curve of the bad model.

```
bad.fit<-glm(Sex~Age,family = binomial,test)
test$bad.probs<-predict(bad.fit,type='response')
pred.bad<-prediction(test$bad.probs,test$Sex)
perf.bad<-performance(pred.bad,'tpr','fpr')
plot(perf.bad,col=1)
```

The closer the line is to the diagonal, the worse the model is. As we can see, the bad model is really bad.

What we just did with the bad model we will now repeat for the full model and the cross-validated model. As before, we need to store the predictions in a way that the ROCR package can use. We will use the “prediction” function to create a variable called “pred.full” to begin the process of graphing the original full model from the last blog post. Next, we will create the “perf.full” variable to store the performance of the model. Notice the arguments ‘tpr’ and ‘fpr’ for true positive rate and false positive rate. Lastly, we plot the results.

```
pred.full<-prediction(test$prob,test$Sex)
perf.full<-performance(pred.full,'tpr','fpr')
plot(perf.full, col=2)
```

We repeat this process for the cross-validated model.

```
pred.cv<-prediction(test$cv.probs,test$Sex)
perf.cv<-performance(pred.cv,'tpr','fpr')
plot(perf.cv,col=3)
```

Now let’s put all the different models on one plot.

```
plot(perf.bad,col=1)
plot(perf.full, col=2, add=T)
plot(perf.cv,col=3,add=T)
legend(.7,.4,c("BAD","FULL","CV"), 1:3)
```

Finally, we can calculate the AUC for each model.

```
auc.bad<-performance(pred.bad,'auc')
auc.bad@y.values
```

```
## [[1]]
## [1] 0.4766734
```

```
auc.full<-performance(pred.full,"auc")
auc.full@y.values
```

```
## [[1]]
## [1] 0.959432
```

```
auc.cv<-performance(pred.cv,'auc')
auc.cv@y.values
```

```
## [[1]]
## [1] 0.9107505
```

The higher the AUC the better. As such, the full model with all variables is superior to the cross-validated model and the bad model. This is despite the fact that there are many high correlations in the full model. Another point to consider is that the cross-validated model is simpler, so this may be a reason to pick it over the full model. As such, the statistics provide support for choosing a model, but they do not trump the ability of the researcher to pick based on factors beyond just numbers.
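AUC also has a handy interpretation: it is the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it that way on four made-up predictions. The function “pairwise_auc” is our own illustration and is not part of ROCR.

```r
# AUC as the share of (positive, negative) pairs ranked correctly;
# ties count as half
pairwise_auc <- function(scores, labels, positive) {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  wins <- outer(pos, neg, ">")
  ties <- outer(pos, neg, "==")
  mean(wins + 0.5 * ties)
}

# Hypothetical predicted probabilities of "Male" for four people
scores <- c(0.8, 0.4, 0.6, 0.2)
labels <- c("Male", "Male", "Female", "Female")
pairwise_auc(scores, labels, positive = "Male")   # 3 of 4 pairs -> 0.75
```

Seen this way, an AUC near 0.5, like that of the bad model, means the scores rank males above females about as often as a coin flip would.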


`library(MASS);library(bestglm);library(reshape2);library(corrplot)`

```
data(survey)
?MASS::survey #explains the variables in the study
```

The first thing we need to do is remove the independent factor variables from our dataset. The reason for this is that the function that we will use for the cross-validation does not accept factors. We will first use the “str” function to identify factor variables and then remove them from the dataset. We also need to remove any examples that are missing data, so we use the “na.omit” function for this. Below is the code.

```
survey$Clap<-NULL
survey$W.Hnd<-NULL
survey$Fold<-NULL
survey$Exer<-NULL
survey$Smoke<-NULL
survey$M.I<-NULL
survey<-na.omit(survey)
```

We now need to check for collinearity using the “corrplot.mixed” function from the “corrplot” package.

```
pc<-cor(survey[,2:5])
corrplot.mixed(pc)
```

We have extreme correlation between “Wr.Hnd” and “NW.Hnd”. This makes sense because people’s hands are normally the same size. Since this blog post is a demonstration of logistic regression we will not worry about this too much.

We now need to divide our dataset into a train and a test set. We set the seed for reproducibility. First, we need to make a variable that we call “ind” that randomly assigns each row of “survey” a 1 with 70% probability or a 2 with 30% probability. We then subset the “train” dataset by taking all rows that are 1’s based on the “ind” variable and we create the “test” dataset for all the rows that line up with 2 in the “ind” variable. This means our data split is roughly 70% train and 30% test. Below is the code.

```
set.seed(123)
ind<-sample(2,nrow(survey),replace=T,prob = c(0.7,0.3))
train<-survey[ind==1,]
test<-survey[ind==2,]
```

We now make our model. We use the “glm” function for logistic regression. We set the family argument to “binomial”. Next, we look at the results as well as the odds ratios.

```
fit<-glm(Sex~.,family=binomial,train)
summary(fit)
```

```
##
## Call:
## glm(formula = Sex ~ ., family = binomial, data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9875 -0.5466 -0.1395 0.3834 3.4443
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -46.42175 8.74961 -5.306 1.12e-07 ***
## Wr.Hnd -0.43499 0.66357 -0.656 0.512
## NW.Hnd 1.05633 0.70034 1.508 0.131
## Pulse -0.02406 0.02356 -1.021 0.307
## Height 0.21062 0.05208 4.044 5.26e-05 ***
## Age 0.00894 0.05368 0.167 0.868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 169.14 on 122 degrees of freedom
## Residual deviance: 81.15 on 117 degrees of freedom
## AIC: 93.15
##
## Number of Fisher Scoring iterations: 6
```

`exp(coef(fit))`

```
## (Intercept) Wr.Hnd NW.Hnd Pulse Height
## 6.907034e-21 6.472741e-01 2.875803e+00 9.762315e-01 1.234447e+00
## Age
## 1.008980e+00
```

The results indicate that only height is useful in predicting if someone is male or female. The second piece of code shares the odds ratios. The odds ratio tells how a one-unit increase in the independent variable changes the odds of being male in our model. For example, for every one-unit increase in height, the odds of a particular example being male are multiplied by 1.23, an increase of about 23%.
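Because odds ratios multiply rather than add, the effect of a larger change in height grows quickly. The short sketch below works through the arithmetic using the “Height” coefficient from the summary output. Height in this dataset is recorded in centimetres, and the variable names are ours.

```r
# Height coefficient from the summary above (log-odds per centimetre)
b.height <- 0.21062

or.1cm <- exp(b.height)         # odds ratio for a 1 cm increase
or.10cm <- exp(10 * b.height)   # odds ratio for a 10 cm increase
pct.1cm <- (or.1cm - 1) * 100   # same effect as a percentage change in odds

round(c(or.1cm = or.1cm, or.10cm = or.10cm, pct.1cm = pct.1cm), 2)
```

Note that the 10 cm odds ratio is the 1 cm odds ratio raised to the tenth power, not ten times it.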

We now need to see how well our model does on the train and test dataset. We first capture the probabilities and save them to the train dataset as “probs”. Next we create a “predict” variable and place the string “Female” in the same number of rows as are in the “train” dataset. Then we rewrite the “predict” variable by changing any example that has a probability above 0.5 as “Male”. Then we make a table of our results to see the number correct, false positives/negatives. Lastly, we calculate the accuracy rate. Below is the code.

```
train$probs<-predict(fit, type = 'response')
train$predict<-rep('Female',123)
train$predict[train$probs>0.5]<-"Male"
table(train$predict,train$Sex)
```

```
##
## Female Male
## Female 61 7
## Male 7 48
```

`mean(train$predict==train$Sex)`

`## [1] 0.8861789`

Despite the weaknesses of the model with so many insignificant variables it is surprisingly accurate at 88.6%. Let’s see how well we do on the “test” dataset.

```
test$prob<-predict(fit,newdata = test, type = 'response')
test$predict<-rep('Female',46)
test$predict[test$prob>0.5]<-"Male"
table(test$predict,test$Sex)
```

```
##
## Female Male
## Female 17 3
## Male 0 26
```

`mean(test$predict==test$Sex)`

`## [1] 0.9347826`

As you can see, we do even better on the test set with an accuracy of 93.5%. Our model is looking pretty good and height is an excellent predictor of sex, which makes complete sense. However, in the next post we will use cross-validation and the ROC plot to further assess the quality of the model.
