Category Archives: Uncategorized

Multiple Regression in Python

In this post, we will go through the process of setting up and running a regression model with training and testing sets in Python. We will use the insurance dataset from Kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

  1. Data preparation
  2. Model training
  3. Model testing

Data Preparation

Below is a list of the modules we will need in order to complete the analysis

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model, model_selection, feature_selection, preprocessing
import statsmodels.formula.api as sm
from statsmodels.tools.eval_measures import mse
from statsmodels.tools.tools import add_constant
from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the .read_csv function from pandas to load the data and the .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')

[Output of insure.head()]

We need to create some dummy variables for sex, smoker, and region. We will address that in a moment; right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.
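The code for this step is not shown above, but a minimal sketch, using the ‘insure’ dataframe loaded earlier, would look something like this.

insure.describe()   # descriptive statistics for the continuous variables
insure.corr()       # correlations among the numeric variables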

[Output of .describe() and .corr()]

The descriptives are left for your own interpretation. As for the correlations, they are generally weak, which suggests that multicollinearity among the independent variables should not be a major concern for the regression.

As mentioned earlier, we need to make dummy variables for sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

  1. Use the pd.get_dummies function from pandas to create the dummy variable
  2. Save the dummy variable in an object called ‘dummy’
  3. Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
  4. Repeat this for each of the three variables

Below is the code for doing this

dummy=pd.get_dummies(insure['sex'])
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['smoker'])
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['region'])
insure=pd.concat([insure,dummy],axis=1)
insure.head()

[Output of insure.head() showing the new dummy variables]

The .get_dummies function requires the name of the dataframe and, in the brackets, the name of the variable to convert. The .concat function requires the names of the two datasets to combine as well as the axis along which to combine them.

We now need to remove the original text variables from the dataset. In addition, we need to separate out the y variable “charges” because it is the dependent variable.

y = insure.charges
insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

Model Training

Our train and test sets are made with the model_selection.train_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can now run our regression analysis. This requires the use of the .OLS function from the statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above, inside the parentheses, we put the dependent variable (y_train) and the independent variables (X_train). However, we had to use the add_constant function to get the intercept in the output. All of this information is then used by the .fit() function to fit the model.

To see the output you need to use the .summary() function as shown below.

answer.summary()

[OLS regression summary output]

The assumption is that you know regression but are reading this post to learn Python. Therefore, we will not go into great detail about the results. The r-squared is strong; however, the region and gender variables are not statistically significant.

We will now move to model testing.

Model Testing

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values on the new data with the model we made. This is shown in the code below.

ypred=answer.predict(add_constant(X_test))

We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.

print(mse(ypred,y_test))
33678660.23480476

For our final trick, we will make a scatterplot of the predicted and actual values of the test set. In addition, we will calculate the correlation of the predicted values and the test set values. This is an alternative metric for assessing a model.

[Code and scatterplot of predicted vs. actual test values]
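The code in the image is not reproduced here, but a minimal sketch of that step, reusing the objects created earlier, might look like the following (the exact code in the original may differ).

plt.scatter(ypred, y_test)
plt.show()
compare = pd.concat([ypred, y_test], axis=1)
compare.corr()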

You can see the first two lines are for making the plot. Lines 3-4 are for making the correlation matrix and involve the .concat() function. The correlation is high at 0.86, which indicates the model is good at accurately predicting the values. This is confirmed by the scatterplot, which is almost a straight line.

Conclusion

In this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and then tested and evaluated that model.


Working with a Dataframe in Python

In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the tasks we will complete include the following…

  • Import data
  • Examine data
  • Work with strings
  • Calculate descriptive statistics

Import Data 

First, you need data, therefore, we will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.

import pandas as pd
df=pd.read_csv('FILE LOCATION HERE')

In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to be typed in quotes inside the parentheses. Having completed this we can now examine the data.

Data Examination

Now we want to get an idea of the size of our dataset and of any problems with missing data. To determine the size we use the .shape function as shown below.

df.shape
Out[33]: (891, 12)

Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many, we can use the .isnull() function in combination with .value_counts().

df['Cabin'].isnull().value_counts()
Out[36]: 
True     687
False    204
Name: Cabin, dtype: int64

The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see we have 687 missing values for this variable. For categorical variables, you can also see how many examples are part of each category as shown below.

df['Embarked'].value_counts()
Out[39]: 
S    644
C    168
Q     77
Name: Embarked, dtype: int64

This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. Then we will check the size of the dataframe again with the .shape function.

df=df.dropna(how='any')
df.shape
Out[40]: (183, 12)

You can see our dataframe is much smaller, going from 891 examples to 183. We can now move to other operations such as dealing with strings.

Working with Strings

What you do with strings really depends on your goals. We are going to look at extraction, subsetting, and determining length. Our first step will be to extract the last name of the first five people. We will do this with the code below.

df['Name'][0:5].str.extract('(\w+)')
Out[44]: 
1 Cumings
3 Futrelle
6 McCarthy
10 Sandstrom
11 Bonnell
Name: Name, dtype: object

As you can see we got the last names of the first five examples. We did this by using the following format…

dataframe_name[‘Variable Name’].str.function(‘argument’)

.str is a function for dealing with strings in dataframes. The .extract() function does what its name implies.

If you want, you can even determine how many letters each name is. We will do this with the .str and .len() function on the first five names in the dataframe.

df['Name'][0:5].str.len()
Out[64]: 
1 51
3 44
6 23
10 31
11 24
Name: Name, dtype: int64

Hopefully, the code is becoming easier to read and understand.

Aggregation

We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes, so we will run all of them at once. Below we are calculating the mean, max, minimum, and standard deviation for the price of a fare on the Titanic.

df['Fare'].mean()
Out[77]: 78.68246885245901

df['Fare'].max()
Out[78]: 512.32920000000001

df['Fare'].min()
Out[79]: 0.0

df['Fare'].std()
Out[80]: 76.34784270040574

Conclusion

This post provided you with some ways in which you can maneuver around a dataframe in Python.

Numpy Arrays in Python

In this post, we are going to explore arrays as created by the numpy package in Python. Understanding how arrays are created and manipulated is useful when you need to perform complex coding and/or analysis. In particular, we will address the following.

  1. Creating and exploring arrays
  2. Math with arrays
  3. Manipulating arrays

Creating and Exploring an Array

Creating an array is simple. You need to import the numpy package and then use the np.array function to create the array. Below is the code.

import numpy as np
example=np.array([[1,2,3,4,5],[6,7,8,9,10]])

Making an array requires the use of square brackets. If you want multiple dimensions or columns then you must use inner square brackets. In the example above I made an array with two dimensions and each dimension has its own set of brackets.

Also, notice that we imported numpy as np. This is a shorthand so that we do not have to type the word numpy but only np. In addition, we now created an array with ten data points spread in two dimensions.

There are several functions you can use to get an idea of the size of a data set. Below is a list with the function and explanation.

  • .ndim = number of dimensions
  • .shape = Shows the number of rows and columns
  • .size = Counts the number of individual data points
  • .dtype.name = Tells you the data structure type

Below is code that uses all four of these functions with our array.

example.ndim
Out[78]: 2

example.shape
Out[79]: (2, 5)

example.size
Out[80]: 10

example.dtype.name
Out[81]: 'int64'

You can see we have 2 dimensions. The .shape function tells us we have 2 dimensions and 5 examples in each one. The .size function tells us we have 10 total examples (5 * 2). Lastly, the .dtype.name function tells us that this is an integer data type.

Math with Arrays

All mathematical operations can be performed on arrays. Below are examples of addition, subtraction, multiplication, and conditionals.

example=np.array([[1,2,3,4,5],[6,7,8,9,10]]) 
example+2
Out[83]: 
array([[ 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12]])

example-2
Out[84]: 
array([[-1, 0, 1, 2, 3],
[ 4, 5, 6, 7, 8]])

example*2
Out[85]: 
array([[ 2, 4, 6, 8, 10],
[12, 14, 16, 18, 20]])

example<3
Out[86]: 
array([[ True, True, False, False, False],
[False, False, False, False, False]], dtype=bool)

Each number inside the example array was manipulated as indicated. For example, if we typed example + 2 all the values in the array increased by 2. Lastly, the example < 3 tells python to look inside the array and find all the values in the array that are less than 3.

Manipulating Arrays

There are also several ways you can manipulate or access data inside an array. For example, you can pull a particular element in an array by doing the following.

example[0,0]
Out[92]: 1

The information in the brackets tells Python to access the first row and the first value in that row. Recall that Python starts counting from 0. You can also access a range of values using the colon as shown below.

example=np.array([[1,2,3,4,5],[6,7,8,9,10]]) 
example[:,2:4]
Out[96]: 
array([[3, 4],
[8, 9]])

In this example, the colon before the comma means take all values in that dimension, which here means take both rows. After the comma we have 2:4, which means take the 3rd and 4th columns but not the 5th.

It is also possible to turn a multidimensional array into a single dimension with the .ravel() function and also to transpose with the transpose() function. Below is the code for each.

example=np.array([[1,2,3,4,5],[6,7,8,9,10]]) 
example.ravel()
Out[97]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

example.transpose()
Out[98]: 
array([[ 1, 6],
[ 2, 7],
[ 3, 8],
[ 4, 9],
[ 5, 10]])

You can see the .ravel() function made a one-dimensional array. The .transpose() function swapped the rows and columns, producing a 5 x 2 array.

Conclusion

We now have a basic understanding of how numpy arrays work in Python. As mentioned before, this is valuable information to understand when wrestling with different data science questions.

Lists in Python

Lists allow you to organize information. In the real world, we make lists all the time to keep track of things. This same concept applies in Python when making lists. A list is a sequence of stored data. By sequence, it is meant a data structure that allows multiple items to exist in a single storage unit. By making lists we are explaining to the computer how to store the data in the computer’s memory.

In this post, we learn the following about lists

  • How to make a list
  • Accessing items in a list
  • Looping through a list
  • Modifying a list

Making a List

Making a list is not difficult at all. To make one you first create a variable name followed by the equal sign and then place your content inside square brackets. Below is an example of two different lists.

numList=[1,2,3,4,5]
alphaList=['a','b','c','d','e']
print(numList,alphaList)
[1, 2, 3, 4, 5] ['a', 'b', 'c', 'd', 'e']

Above we made two lists, a numeric and a character list. We then printed both. In general, you want your list to have similar items such as all numbers or all characters. This makes it easier to recall what is in them than if you mixed them. However, Python can handle mixed lists as well.

Access a List

Accessing individual items in a list is the same as for a string. Just employ brackets with the index that you want. Below are some examples.

numList=[1,2,3,4,5]
alphaList=['a','b','c','d','e']

numList[0]
Out[255]: 1

numList[0:3]
Out[256]: [1, 2, 3]

alphaList[0]
Out[257]: 'a'

alphaList[0:3]
Out[258]: ['a', 'b', 'c']

numList[0] gives us the first value in the list. numList[0:3] gives us the first three values. This is repeated with the alphaList as well.

Looping through a List

A list can be looped through as well. Below is a simple example.

for item in numList :
    print(item)


for item in alphaList :
    print(item)


1
2
3
4
5
a
b
c
d
e

By making the two for loops above we are able to print all of the items inside each list.

Modifying List

There are several functions for modifying lists. Below are a few

The append() function adds a new item to the end of the list.

numList.append(9)
print(numList)
alphaList.append('h')
print(alphaList)

[1, 2, 3, 4, 5, 9]
['a', 'b', 'c', 'd', 'e', 'h']

You can see our lists now have one new member each at the end.

You can also remove the last member of a list with the pop() function.

numList.pop()
print(numList)
alphaList.pop()
print(alphaList)
[1, 2, 3, 4, 5]
['a', 'b', 'c', 'd', 'e']

By using the pop() function we have returned our lists back to their original size.

Another trick is to merge lists together with the extend() function. For this, we will merge each list with itself. This will cause the list to have duplicates of all of its original values.

numList.extend(numList)
print(numList)
alphaList.extend(alphaList)
print(alphaList)

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e']

All the values in each list have been duplicated. Finally, you can sort a list using the sort() function.

numList.sort()
print(numList)
alphaList.sort()
print(alphaList)
[1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e']

Now all the numbers and letters are sorted.

Conclusion

There is far more that could be done with lists. However, the purpose here was just to cover some of the basic ways that lists can be used in Python.

Introducing Google Classroom

Google Classroom is yet another player in the learning management system industry. This platform provides most of the basics that are expected in an LMS. This post is not a critique of Google Classroom. Rather, the focus here is on how to use it. It is better for you to decide for yourself about the quality of Google Classroom.

In this post, we will learn how to set up a class in order to prepare the learning experience.

Before we begin it is assumed that you have a Gmail account, as this is needed to access Google Classroom. In addition, this demonstration is from an individual account and not through the institutional account that a school would set up with Google if it adopted Google Classroom.

Creating a Google Class

Once you are logged in to your Gmail account you can access Google Classroom by clicking on the little gray squares in the upper right-hand corner of your browser. Doing so will show the following.

[Screenshot: the Google apps menu]

In the example above, Google Classroom is the icon in the bottom row in the middle. When you click on it you will see the following.

[Screenshot: the Google Classroom home screen]

You might see a screen before this asking if you are a student or teacher. In the screen above, Google tells you where to click to make your first class. Therefore, click on the plus sign and click on “create class” and you will see the following.

[Screenshot: the confirmation dialog]

Click on the box which promises Google you will only use your classroom with adults. After this, you will see a dialog box where you can give your class a name as shown below.

[Screenshot: the class name dialog]

Give your course a name and click “create”. Then you will see the following.

[Screenshot: the class stream page]

There is a lot of information here. The name of the class is at the top followed by the name of the teacher below. In the middle of the page, you have something called the “stream”. This is where most of the action happens in terms of posting assignments, leading discussions, and making announcements. To the left are some options for dealing with the stream, a calendar, and a way to organize information in the stream by topic.

The topic feature is valuable because it allows you to organize information in a way similar to topics in Moodle. When creating an activity just be sure to assign it to a topic so students can see expectations for that week’s work. This will be explained more in the future.

One thing that was not mentioned was the tabs at the very top of the screen.

[Screenshot: the tabs at the top of the class page]

We started in the “stream” tab. If you click on the “students” tab you will see the following.

[Screenshot: the “students” tab]

The “invite students” button allows you to add students by typing their email. To the left, you have the class code. This is the code people need in order to add your course.

If you click on the “about” tab you will see the following.

[Screenshot: the “about” tab]

Here you can access the drive where all files are saved, the class calendar, your Google calendar, and even invite teachers. In the middle, you can edit the information about the course as well as additional materials that the students will need. This page is useful because it is not dynamic like the stream page. Posted files stay easy to find when using the “about” page.

Conclusion

Google Classroom is not extremely difficult to learn. You can set up a course with minimal computer knowledge in less than 20 minutes. The process shared here was simply the development of a course. In a future post, we will look at how to set up teaching activities and other components of a balanced learning experience.

Luther and Educational Reform

Martin Luther (1483-1546) is best known for his religious work as one of the main catalysts of the Protestant Reformation. However, Luther was also a powerful influence on education during his lifetime. This post will take a look at Luther’s early life and his contributions to education.

Early Life

Luther was born during the late 15th century. His father was a tough miner with a severe disciplinarian streak. You would think that this would be a disaster but rather the harsh discipline gave Luther a toughness that would come in handy when standing alone for his beliefs.

Upon reaching adulthood Luther studied law, as his father desired for him to become a lawyer. However, Luther decided instead to become a monk, much to the consternation of his father.

As a monk, Luther was a diligent student and studied for several additional degrees. Eventually, he was given an opportunity to visit Rome, which was the headquarters of his church. However, Luther saw things there that troubled him and that in many ways laid the foundation for his doubt in the direction of his church.

Eventually, Luther had a serious issue with several church doctrines. This motivated him to nail his 95 theses onto the door of a church in 1517. This act was a challenge to defend the statements in the theses and was actually a common behavior among the scholarly community at the time.

For the next several years it was a back-and-forth intellectual battle with the church. A common pattern was that the church would use some sort of psychological torture, such as the eternal damnation of his soul, and Luther would ask for biblical evidence, which was normally not given. Finally, in 1521 at the Diet of Worms, Luther was forced to flee for his life, and the Protestant Reformation had in many ways begun.

Views on Education

Luther’s views on education would not be considered radical or innovative today but they were during his lifetime. For our purposes, we will look at three tenets of Luther’s position on education

  • People should be educated so they can read the scriptures
  • Men and women should receive an education
  • Education  should benefit the church and state

People Should be Educated so they Can Read the Scriptures

The thought that everyone should be educated was rather radical. By education, we mean developing literacy skills and not some form of vocational training. Education was primarily for those who needed it which was normally the clergy, merchants, and some of the nobility.

If everyone was able to read it would significantly weaken the church’s position to control spiritual ideas and the state’s ability to maintain secular control, which is one reason why widespread literacy was uncommon. Luther’s call for universal education would not truly be repeated until Horace Mann and the common school movement.

The idea of universal literacy also held with it a sense of personal responsibility. No one could rely on another to understand scripture. Everyone needed to know how to read and interpret scripture for themselves.

Men and Women Should be Educated

The second point is related to the first. When Luther said that everyone should be educated, he truly meant everyone. This means men and women should learn literacy. A woman could not hide behind a man for her spiritual development but needed to read for herself.

Again, the idea of women’s education was controversial at the time. The Greeks believed that educating women was embarrassing, although this view was not shared by all in any manner.

Women were not only educated for spiritual reasons but also so they could manage the household. Therefore, there was a spiritual and a practical purpose to the education of women for Luther.

Education Benefits the Church and the State

Although it was mentioned earlier that education had been neglected in order to maintain the power of the church and state, Luther believed that educated citizens would be of greater benefit to both church and state.

The rationale is that the church would receive ministers, teachers, pastors, etc. and the state would receive future civil servants. Therefore, education would not tear down society but would rather build it up.

Conclusion

Luther was primarily a reformer but was also a powerful force in education. His plea for the development of education in Germany led to the construction of schools all over the Protestant-controlled parts of Germany. His work was of such importance that he has been viewed as one of the leading educational reformers of the 16th century.

Education During the Reformation

By the 16th century, Europe was facing some major challenges to the established order of doing things. Some of the causes of the upheaval are less obvious than others.

For example, the invention of gunpowder made knights useless. This was significant because now any common soldier could be more efficient and useful in battle than a knight that took over ten years to train. This weakened the prestige of the nobility at least temporarily while adjustments were made within the second estate and led to a growth in the prestige of the third estate who were adept at using guns.

The church was also facing major issues. After it had held power for almost 1000 years, people began to chafe at the religious power of Europe. There was a revival in learning that was aggressively attacked by monks, who condemned the study of biblical languages as the source of all heresies.

The scholars of the day mocked religion as a superstition. Furthermore, the church was accused of corruption and of abusing power. The scholars, or humanists, called for a return to the Greek and Roman classics, which represented the prevailing worldview before the ascension of Catholicism.

Out of the chaos sprang the Protestant Reformation, which rejected the teachings of the medieval church. The Protestants did not only have a different view on religion but also on how to educate, as we shall see.

Protestant Views of Education

A major tenet of Protestantism that influenced their view on education was the idea of personal responsibility. What this meant was that people needed to study for themselves and not just listen to the teacher. In a spiritual sense that meant reading the Bible for one’s self. In an educational sense, it meant confirming authority with personal observation and study.

Out of this first principle spring two other principles: education that matches an individual’s interests, and the study of nature. Protestants believed that education should support the natural interests and abilities of a person rather than the interests of the church.

This was and still is a radical idea. Most education today is about the student adjusting themselves to various standards and benchmarks developed by the government. Protestants challenged this view and said education should match the talents of the child. If a child shows interest in woodworking teach this to him. If he shows interest in agriculture teach that to him.

To be fair, attempts have been made in education to “meet the needs” of the child and to differentiate instruction. However, these goals are made in order to take a previously determined curriculum and make it palatable to the student rather than designing something specifically for the individual student. The point is that a child is more than a cog in a machine to be trained as a screwdriver or hammer but rather an individual whose value is priceless.

Protestants also supported the study of nature. Actually observing nature reduced a great deal of the superstition of the time. At one point, the religious power of Europe forbade the study of human anatomy through the performing of autopsies. In addition, Galileo was in serious trouble for denying the geocentric model of the solar system. Such restrictions stalled science for years and were removed through Protestantism.

Conclusion

The destabilization of the Reformation marks a major break in history. With the decline of the church came the rise of the common man to a position of independent thought and action. These ideas of personal responsibility came from the growing influence of Protestants in the world.

Education in Ancient China

As one of the oldest civilizations in the world, China has a rich past when it comes to education. This post will explore education in Ancient China by providing a brief overview of the following topics.

  1. Background
  2. What was Taught
  3. How was it Taught
  4. The Organization of what was Taught
  5. The Evidence Students Provided of their Learning

Background

Ancient Chinese education is an interesting contrast. On the one hand, the Chinese were major innovators of some of the greatest inventions of mankind, which include paper, printing, gunpowder, and the compass. On the other hand, Chinese education in the past was strongly collective in nature with heavy governmental control. There was extreme pressure to conform to ancient customs, and independent, deviant behavior was looked down upon. Despite this, there was still innovation.

Most communities had a primary school and most major cities had a college. Completing university study was a great way to achieve a government position in ancient China.

What Did they Teach

Ancient Chinese education focused almost exclusively on the Chinese Classics. By classics, it is meant the writings of mainly Confucius. Confucius emphasized strict obedience in a hierarchical setting. The order was loosely King, Father, Mother, then the child. Deference to authority was the ultimate duty of everyone. There is little surprise that the government supported an education that demanded obedience to it.

Another aspect of Confucius writings that was stressed was the Five Cardinal Virtues which were charity, justice, righteousness, sincerity, and conformity to tradition. This was the heart of the moral training that young people received. Even leaders needed to demonstrate these traits which limited abuses of power at times.

What China is also famous for in their ancient curriculum is what they did not teach.  Supposedly, they did not cover in great detail geography, history, math, science, or language. The focus was on Confucius apparently almost exclusively.

How Did they Teach

Ancient Chinese education was taught almost exclusively by rote memory. Students were expected to memorize large amounts of information. This contributed to a focus on the conservation of knowledge rather than the expansion of it. If something new or unusual happened it was difficult to deal with since there was no prior way already developed to address it.

How was Learning Organized

School began at around 6-7 years of age in the local school. After completing studies at the local school, some students went to the academy for additional studies. From the academy, some students would go to university with the hope of completing their studies to obtain a government position.

Generally, the education was for male students as it was considered shameful not to educate a boy. Girls often did not go to school and often handled traditional roles in the home.

Evidence of Learning

Evidence of learning in the Chinese system was almost strictly through examinations. The examinations were exceedingly demanding and stressful. If a student was able to pass the gauntlet of rote memory exams he would achieve his dream of completing college and joining the prestigious Imperial Academy as a Mandarin.

Conclusion

Education in Ancient China was focused on memorization, tradition, and examination. Even with this focus, Ancient China developed several inventions that have had a significant influence on the world. Explaining this would only lead to speculation, but what can be said is that progress happens whether it is encouraged or not.

Augmented Matrix for a System of Equations

Matrices are a common tool used in algebra. They provide a way to deal with equations that have commonly held variables. In this post, we learn some of the basics of developing matrices.

From Equation to Matrix

Using a matrix involves making sure that the same variables and constants are all in the same column in the matrix. This will allow you to do any elimination or substitution you may want to do in the future. Below is an example

System of equations:          Augmented matrix:
2x - y = -3                   [ 2  -1 | -3 ]
3x + 3y = 6                   [ 3   3 |  6 ]

Above we have a system of equations to the left and an augmented matrix to the right. If you look at the first column in the matrix it has the same values as the x variables in the system of equations (2 & 3). This is repeated for the y variable (-1 & 3) and the constant (-3 & 6).

The number of variables that can be included in a matrix is unlimited. Generally, when learning algebra, you will commonly see 2- and 3-variable matrices. The example above is a two-variable matrix; below is a three-variable matrix.

[Image: a three-variable augmented matrix with an additional column for z]
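Since the image is not reproduced here, a hypothetical three-variable system and its augmented matrix might look like this (these particular numbers are made up for illustration):

x + 2y - z = 4                [ 1   2  -1 |  4 ]
2x - y + 3z = 7               [ 2  -1   3 |  7 ]
x + y + z = 6                 [ 1   1   1 |  6 ]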

If you look closely you can see there is nothing here new except the z variable with its own column in the matrix.

Row Operations 

When a system of equations is in an augmented matrix we can perform calculations on the rows to achieve an answer. You can switch the order of rows as in the following.

[Image: switching the order of rows 1 and 2]

You can multiply a row by a constant of your choice. Below we multiply all values in row 2 by 2. Notice the notation in the middle as it indicates the action performed.

[Image: multiplying each value in row 2 by 2]

You can also add rows together. In the example below, row 1 and row 2 are summed to create a new row 1.

[Image: adding rows 1 and 2 to create a new row 1]

You can even multiply a row by a constant and then sum it with another row to make a new row. Below we multiply row 2 by 2 and then sum it with row 1 to make a new row 1.

[Image: multiplying row 2 by 2 and adding it to row 1 to make a new row 1]

The purpose of row operations is to provide a way to solve a system of equations in a matrix. In addition, writing out the matrices provides a way to track the work that was done. It is easy to get confused even though the actual math is simple.
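If you want to experiment with row operations programmatically, below is a minimal sketch in Python using numpy and the two-variable system from earlier in this post (the library choice is mine, not part of the original post).

import numpy as np

# Augmented matrix for the system 2x - y = -3 and 3x + 3y = 6
A = np.array([[2., -1., -3.],
              [3.,  3.,  6.]])

A[1] = 2 * A[1]        # multiply every value in row 2 by a constant
A[0] = A[0] + A[1]     # add row 2 to row 1 to form a new row 1
print(A)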

Conclusion

Systems of equations can be difficult to solve. However, the use of matrices can reduce the computational load needed to solve them. You do need to be careful with how you modify the rows and columns, and this is where the use of row operations can be beneficial.

System of Equations and Mixture Application

Solving a system of equations with a mixture application involves combining two or more quantities. The general setup for the equations is as follows

Quantity * value = total

This equation is used for both equations. You simply read the problem and plug in the information. The examples in this post are primarily related to business as this is one of the more practical applications of solving a system of equations for the average person. However, a system of equations for mixtures can also be used for determining solutions but this is more common in chemistry.

Example 1: Making Food 

John wants to make 20 lbs of granola using nuts and raisins. His budget requires that the granola cost $3.80 per pound. Nuts are $4.50 per pound and raisins are $1.00 per pound. How many pounds of nuts and raisins can he use?

The first thing we need to do is determine what we know.

  • cost of the raisins
  • cost of the nuts
  • total cost of the granola
  • number of pounds of granola to make

Below is all of our information in a table

          Pounds   *   Price   =   Total
Nuts        n          4.50        4.5n
Raisins     r          1.00        r
Granola     20         3.80        3.8(20) = 76

What we need to know is how many pounds of nuts and raisins can we use to have the total price per pound be $3.80.

With this information, we can set up our system of equations. We take the pounds column and create the first equation and the total column to create the second equation.

n + r = 20
4.5n + r = 76

We will use elimination to solve this system. We will multiply the first equation by -1 and combine them. Then we solve for n as in the steps below

-n - r = -20
4.5n + r = 76
--------------
3.5n = 56
n = 16

We know n = 16 or that we can have 16 pounds of nuts. To determine the amount of raisins we use our first equation in the system.

n + r = 20
16 + r = 20
r = 4

You can check this yourself if you desire.
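One quick way to check, if you want to, is to solve the same system with numpy (this check is my addition and was not part of the original post).

import numpy as np

# n + r = 20 (pounds) and 4.5n + 1r = 76 (total cost)
coefficients = np.array([[1., 1.],
                         [4.5, 1.]])
totals = np.array([20., 76.])
print(np.linalg.solve(coefficients, totals))   # [16.  4.]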

Example 2: Interest

Below is an example that involves two loans with different interest rates. Our job will be to determine the principal amount of the loan.

Tom owes $43,080 on two student loans. The bank’s interest rate is 5.25% and the federal loan rate is 2.95%. The total amount of interest he paid last year was $1,752.45. What was the principal for each loan?

The first thing we need to do is determine what we know.

  • bank interest rate
  • Federal interest rate
  • time of repayment
  • Amount of loan
  • Interest paid so far

Below is all of our information in a table

          Principal   *   Rate     Time   =   Total
Bank          b            0.0525    1        0.0525b
Federal       f            0.0295    1        0.0295f
Total       43,080                            1,752.45

Below is our system of equations.

b + f = 43080
0.0525b + 0.0295f = 1752.45

To solve the system of equations we will use substitution. First, we need to solve for b as shown below

b + f = 43080
b = 43080 - f

We now substitute and solve.

0.0525(43080 - f) + 0.0295f = 1752.45
2261.70 - 0.0525f + 0.0295f = 1752.45
-0.023f = -509.25
f = 22141.30

We know the federal loan is $22,141.30. We can use this information to find the bank loan amount using the first equation.

b + 22141.30 = 43080
b = 20938.70

The bank loan was $20,938.70.
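As with the first example, you can verify the answer in Python with numpy if you wish (again, this check is my addition, not part of the original post).

import numpy as np

# b + f = 43080 and 0.0525b + 0.0295f = 1752.45
coefficients = np.array([[1., 1.],
                         [0.0525, 0.0295]])
totals = np.array([43080., 1752.45])

b, f = np.linalg.solve(coefficients, totals)
print(round(b, 2), round(f, 2))   # 20938.7 22141.3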

Conclusion

Hopefully, it is clear by now that solving a system of equations can have real-world significance. Applications of this concept can be useful in the context of business as shown here.

Education in Ancient India

In this post, we take a look at Indian education in the ancient past. The sub-continent of India has one of the oldest civilizations in the world. Its culture has had a strong influence on both the East and the West.

Background

One unique characteristic of ancient education in India is the influence of religion. The effect of Hinduism was strong. The idea of the caste system is derived from Hinduism, with people being divided primarily into four groups.

  1. Brahmins-teachers/religious leaders
  2. Kshatriyas-soldiers kings
  3. Vaisyas-farmers/merchants
  4. Sudras-slaves

This system was rigid. There was no moving between castes, and marriages between castes were generally forbidden. The Brahmins were the only teachers as it was embarrassing to allow one’s children to be taught by another class. They received no salary but rather received gifts from their students.

What Did they Teach

The Brahmins served as the teachers and made it their life work to reinforce the caste system through education. All children were taught to understand the importance of this system as well as the role of the Brahmins at the top of it.

Other subjects taught at the elementary level included the 3 R’s. At the university level, the subjects included grammar, math, history, poetry, philosophy, law, medicine, and astronomy. Only the Brahmins completed formal university studies so that they could become teachers. Other classes might receive practical technical training to work in the government, serve in the military, or manage a business.

Something that was missing from education in ancient India was physical education. For whatever reason, this was not normally considered important and was rarely emphasized.

How Did they Teach

The teaching style was almost exclusively rote memorization. Students would daily recite mathematical tables and the alphabet. It would take a great deal of time to learn to read and write through this system.

There was also the assistance of an older student to help the younger ones to learn. In a way, this could be considered as a form of tutoring.

How was Learning Organized

School began at 6-7. The next stage of learning was university 12 years later. Women did not go to school beyond the cultural training everyone received in early childhood.

Evidence of Learning

Learning mastery was demonstrated through the ability to memorize. Other forms of thought and effort were not the main criteria for demonstrating mastery.


Conclusion

Education in India served a purpose that is familiar to many parts of the world. That purpose was social stability. With the focus on the caste system before other forms of education, India was seeking stability before knowledge expansion and personal development. This can be seen in many ways, but what can be agreed upon is that the country is still mostly intact after several thousand years, and few can make such a claim even if their style of education is superior to India’s.

Solving a System of Equations with Direct Translation

In this post, we will look at two simple problems that require us to solve a system of equations. Recall that a system of equations involves two or more variables that must be solved for. With each problem, we will use direct translation to set up the problem so that it can be solved.

Direct Translation 

Direct translation involves reading a problem and translating it into a system of equations. In order to do this, you must consider the following steps

  1. Determine what you want to know
  2. Assign variables to what you want to know
  3. Set up the system of equations
  4. Solve the system

Example 1

Below is an example followed by a step-by-step breakdown.

The sum of two numbers is zero. One number is 18 less than the other. Find the numbers.

Step 1: We want to know what the two numbers are

Step 2: n = first number & m =  second number

Step 3: Set up system

n + m = 0
n = m - 18

Solving this is simple: we know n = m – 18, so we plug this into the first equation n + m = 0 and solve for m.

(m - 18) + m = 0
2m - 18 = 0
2m = 18
m = 9

Now that we know m, we can solve for n using the second equation.

n = m - 18
n = 9 - 18
n = -9

The answer is m = 9 and n = -9. If you add these together they would come to zero and meet the criteria for the problem.
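A quick check of this answer in Python (my addition, not part of the original post):

m = 9
n = m - 18
print(n + m)   # 0, so the sum of the two numbers is zero
print(m - n)   # 18, so one number is 18 less than the other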

Example 2

Below is a second example involving a decision for salary options.

Dan has been offered two options for his salary as a salesman. Option A would pay him $50,000 plus $30 for each sale he closes. Option B would pay him $35,000 plus $80 for each sale he closes. How many sales must he close before the salaries are equal?

Step 1: We want to know when the salaries are equal based on sales

Step 2: d =  Dan’s salary & s = number of sales

Step 3: Set up system

d = 50000 + 30s
d = 35000 + 80s

To solve this problem we can simply set the two expressions for d equal to each other, as shown below.

50000 + 30s = 35000 + 80s
15000 = 50s
s = 300

You can check to see if this answer is correct yourself. In order for the two salaries to equal each other, Dan would need to close 300 sales. After 300 sales, option B is more lucrative. Deciding which salary option to take would probably depend on how many sales Dan expects to make in a year.
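Here is a quick check of the break-even point in Python (again, my addition rather than the original author’s):

sales = 300
option_a = 50000 + 30 * sales
option_b = 35000 + 80 * sales
print(option_a, option_b)   # both print 59000, so the salaries are equal at 300 sales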

Conclusion

Algebraic concepts can move beyond theoretical ideas and the rearranging of numbers to practical applications. This post showed how something as seemingly obscure as a system of equations can actually be used to make financial decisions.

Solving a System of Equations by Substitution and Elimination

A system of equations involves trying to solve for more than one variable. What this means is that a system of equations helps you to see how two different equations relate, or where they intersect if you were to graph them.

There are several different ways to solve a system of equations. In this post, we will solve a system by using the substitution and elimination methods.

Substitution

Substitution involves choosing one of the two equations and solving for one variable. Once this is done, we substitute the resulting expression into the other equation. When this is done, the second equation has only one unknown variable, and solving it is basic algebra.

The explanation above is abstract so here is a mathematical example

[Image: worked example of solving the system by substitution, giving x = 4]

We are not done. We now need to use our x value to find our y value. We will use the first equation and replace x to find y.

[Image: substituting x = 4 into the first equation to get y = -1]

This means that our ordered pair is (4, -1) and this is the solution to the system. You can check this answer by plugging both numbers into the x and y variable in both equations.

Elimination

Elimination begins with two equations and two variables but eliminates one variable so as to have one equation with one variable. This is done through the use of the addition property of equality, which states that when you add the same quantity to both sides of an equation you still have equality. For example, 2 = 2, and if I add 5 to both sides I get 7 = 7. The equality remains.

Therefore, we can change one equation using the addition property of equality until one of the variables has the same absolute value for both equations. Then we add across to eliminate one of the variables. If one variable is positive in one equation and negative in the other and has the same absolute value they will eliminate each other. Below is an example using the same system of equations as the previous example.

[Image: worked example of solving the same system by elimination]

You can take the x value and plug it in to find y. We already know y = -1 from the previous example, so we will skip this.
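To make the elimination idea concrete, below is a small Python sketch using a hypothetical system (these numbers are made up for illustration and are not the system from the images above).

import numpy as np

# Hypothetical system: x + y = 3 and x - y = 1
eq1 = np.array([1., 1., 3.])    # coefficients of x and y, then the constant
eq2 = np.array([1., -1., 1.])

combined = eq1 + eq2            # adding the equations eliminates y: 2x = 4
x = combined[2] / combined[0]   # x = 2
y = eq1[2] - eq1[0] * x         # substitute back into the first equation: y = 1
print(x, y)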

There are also times when you need to multiply both equations by a constant so that you can eliminate one of the variables.

[Image: multiplying both equations by constants and eliminating a variable to get x = 0]

We now replace x with 0 in the second equation.

[Image: substituting x = 0 into the second equation to get y = -3]

Our ordered pair is (0, -3) which also means this is where the two lines intersect if they were graphed.

Conclusion

Solving a system of equations allows you to handle two variables (or more) simultaneously. In terms of what method to use it really boils down to personal choice as all methods should work. Generally, the best method is the one with the least amount of calculation.

Writing Discussion & Conclusions in Research

The Discussion & Conclusion section of a research article/thesis/dissertation is probably the trickiest part of a project to write. Unlike the other parts of a paper, the Discussion & Conclusions are hard to plan in advance as they depend on the results. In addition, since this is the end of a paper, the writer is often excited and wants to finish it quickly, which can lead to superficial analysis.

This post will discuss common components of the Discussion & Conclusion section of a paper. Not all disciplines have all of these components nor do they use the same terms as the ones mentioned below.

Discussion

The discussion is often a summary of the findings of a paper. For a thesis/dissertation, you would provide the purpose of the study again but you probably would not need to share this in a short article. In addition, you also provide highlights of what you learn with interpretation. In the results section of a paper, you simply state the statistical results. In the discussion section, you can now explain what those results mean for the average person.

The ordering of the summary matters as well. Some recommend that you go from the most important finding to the least important. Personally, I prefer to share the findings by the order in which the research questions are presented. This maintains a cohesiveness across sections of a paper that a reader can appreciate. However, there is nothing superior to either approach. Just remember to connect the findings with the purpose of the study as this helps to connect the themes of the paper together.

What really makes this a discussion is to compare/contrast your results with the results of other studies and to explain why the results are similar and or different. You also can consider how your results extend the works of other writers. This takes a great deal of critical thinking and familiarity with the relevant literature.

Recommendation/Implications

The next component of this final section of the paper is either recommendations or implications but almost never both. Recommendations are practical ways to apply the results of this study through action. For example, if your study finds that sleeping 8 hours a night improves test scores then the recommendation would be that students should sleep 8 hours a night to improve their test scores. This is not an amazing insight but the recommendations must be grounded in the results and not just opinion.

Implications, on the other hand, explain why the results are important. Implications are often more theoretical in nature and lack the application of recommendations. Often implications are used when it is not possible to provide a strong recommendation.

The terms conclusion and implications are often used interchangeably in different disciplines and this is highly confusing. Therefore, keep in mind your own academic background when considering what these terms mean.

There is one type of recommendation that is almost always present in a study and that is recommendations for further study. This is self-explanatory but recommendations for further study are especially important if the results are preliminary in nature. A common way to recommend further studies is to deal with inconclusive results in the current study. In other words, if something weird happened in your current paper or if something surprised you this could be studied in the future. Another term for this is “suggestions for further research.”

Limitations

Limitations involve discussing some of the weaknesses of your paper. There is always some sort of weakness with the sampling method, statistical analysis, measurement, data collection, etc. This section is an opportunity to confess these problems in a transparent manner, noting issues that future researchers may want to control for.

Conclusion

Finally, the conclusion of the Discussion & Conclusion is where you try to summarize the results in a sentence or two and connect them with the purpose of the study. In other words, trying to shrink the study down to a one-liner. If this sounds repetitive it is and often the conclusion just repeats parts of the discussion.

Blog Conclusion

This post provides an overview of writing the final section of a research paper. The explanation here provides just one view on how to do this. Every discipline and every researcher has their own view on how to construct this section of a paper.

Common Problems with Research for Students

I have worked for several years supporting undergrad and graduate students with research projects. This post covers what I consider to be the top reasons why students, and even the occasional faculty member, struggle to conduct research. The reasons are as follows.

  1. They don’t read
  2. No clue what  a problem is
  3. No questions
  4. No clue how to measure
  5. No clue how to analyze
  6. No clue how to report

Lack of Reading

The first obstacle to conducting research is that students frequently do not read enough to conceptualize how research is done. Reading not just anything but specifically research allows a student to absorb the vocabulary and format of research writing. You cannot do research unless you first read research. This axiom applies to all genres of writing.

A common complaint is the difficulty with understanding research articles. For whatever reason, the academic community has chosen to write research articles in an exceedingly dense and unclear manner. This is not going to change because one graduate student cannot understand what the experts are saying. Therefore, the only solution to understand research English is exposure to this form of communication.

Determining the Problem

If a student actually reads they often go to the extreme of trying to conduct Nobel Prize type research. In other words, their expectations are overinflated given what they know. What this means is that the problem they want to study is infeasible given the skillset they currently possess.

The opposite extreme is to find such a minute problem that nobody cares about it. Again, reading will help in avoiding these two pitfalls.

Another problem is not knowing exactly how to articulate a problem. A student will come to me with excellent examples of a problem but they never abstract or take a step away from the examples of the problem to develop a researchable problem. There can be no progress without a clearly defined research problem.

Lack the Ability to Ask Questions about the Problem

If a student actually has a problem, they often never think of questions that they want to answer about the problem. Another extreme is that they ask questions they cannot answer. Without questions, you can never better understand your problem. Bad questions or no questions mean no answers.

Generally, there are three types of quantitative research questions while qualitative is more flexible. If a student does not know this they have no clue how to even begin to explore their problem.

Issues with Measurement

Let’s say a student does know what their questions are; the next mystery for many is measuring the variables if the study is quantitative. This is where applying statistical knowledge, rather than simply taking quizzes and tests, comes into play. The typical student often does not understand how to operationalize their variables and determine what type of variables they will include in their study. If you don’t know how you will measure your variables you cannot answer any questions about your problem.

Lost at the Analysis Stage

The measurement affects the analysis. I cannot tell you how many times a student or even a colleague has wanted me to analyze their data without telling me what the research questions were. How can you find answers without questions? The type of measurement affects the potential ways of analyzing data. How you summarize categorical data is different from how you summarize continuous data. Lacking this knowledge leads to inaction.

No Plan for the Write-Up

If a student makes it to this stage, firstly congratulations are in order, however, many students have no idea what to report or how. This is because students lose track of the purpose of their study which was to answer their research questions about the problem. Therefore, in the write-up, you present the answers systematically. First, you answer question 1, then 2, etc.

If necessary you include visuals of the answers. Again, visuals are determined by the type of variable as well as the type of question. A top reason for article rejection is an unclear write-up. Therefore, great care is needed in order for this process to be successful.

Conclusion

Whenever I deal with research students I often walk through these six concepts. Most students never make it past the second or third concept. Perhaps the results will differ for others.

Successful research writing requires the ability to see the big picture and to connect the various sections of a paper so that they present a cohesive whole. Too many students focus on the little details and forget the purpose of their study. Losing the main idea makes the details worthless.

If I left out any common problems with research please add them in the comments section.

Reading Comprehension Strategies

Students frequently struggle with understanding what they read. There can be many reasons for this, from vocabulary issues to struggles with just sounding out the text. Another common problem, frequently seen among native speakers of a language, is that students just read without taking a moment to think about what they read. This lack of reflection and intellectual wrestling with the text can make it so that the student knows they read something but knows nothing about what they read.

In this post, we will look at several common strategies to support reading comprehension. These strategies include the following…

Walking a Student Through the Text

As students get older, there is a tendency for many teachers to ignore the need to guide students through a reading before the students read it. One way to improve reading comprehension is to go through the assigned reading and give an idea to the students of what to expect from the text.

Doing this provides a framework within the student’s mind to which they can add the details as they do the reading. When walking through a text with students, the teacher can provide insights into important ideas, explain complex words, explain visuals, and give a general idea of what is important.

Ask Questions

Asking questions either before or after a reading is another great way to support students’ understanding. Prior questions give an idea of what the students should be expected to know after reading. On the other hand, questions after the reading should aim to help students coalesce the ideas they were exposed to in the reading.

The type of questions is endless. The questions can be based on Bloom’s taxonomy in order to stimulate various thinking skills. Another skill is probing and soliciting responses from students through encouraging and asking reasonable follow-up questions.

Develop Relevance

Connecting what a student knows to what they do not know is known as relevance. If a teacher can stretch a student from what they know and use it to understand what is new, it will dramatically improve comprehension.

This is trickier than it sounds. It requires the teacher to have a firm grasp of the subject as well as the habits and knowledge of the students. Therefore, patience is required.

Conclusion

Reading is a skill that can improve a great deal through practice. However, mastery will require the knowledge and application of strategies. Without this next level of training, a student will often become more and more frustrated with reading challenging text.

Criticism of Grades

Grading has recently been under attack with people bringing strong criticism against the practice. Some schools have even stopped using grades altogether. In this post, we will look at problems with grading as well as alternatives.

It Depends on the Subject

The weaknesses of grading are often seen much more clearly in subjects that have a more subjective nature to them, such as those from the social sciences and humanities like English, history, or music. Subjects from the hard sciences such as biology, math, and engineering are more objective in nature. If a student states that 2 + 2 = 5, there is little room for persuasion or critical thinking to influence the grade.

However, when it comes to judging thinking or musical performance, it is much more difficult to assess this without bringing in the subjectivity of opinion. This is not bad, as a teacher should be an expert in their domain, but it still brings an arbitrary unpredictability to the system of grading that is difficult to avoid.

Returning to the math problem, if a student states 2 + 2 = 4, this answer is always right whether the teacher likes the student or not. However, an excellent historical essay on slavery can be graded poorly if the history teacher has issues with the thesis of the student. To assess the essay requires subjective thought about the quality of the student’s writing, and subjectivity means that the assessment cannot be objective.

Obsession of Students

Many students become obsessed with and almost worship the grades they receive. This often means that the focus becomes more about getting an ‘A’ than about actually learning. This means that the students take no risks in their learning and conform strictly to the directions of the teacher. Mindless conformity is not a sign of future success.

There are many comments on the internet about the differences between ‘A’ and ‘C’ students: how ‘A’ students are conformists and ‘C’ students are innovators. The point is that the better the academic performance of a student, the better they are at obeying orders and not necessarily at thinking independently.

Alternatives to Grades

There are several alternatives to grading. One of the most common is Pass/fail. Either the student passes the course or they do not. This is common at the tertiary level especially in highly subjective courses such as writing a thesis or dissertation. In such cases, the student meets the “mysterious” standard or they do not.

Another alternative has been the explosion in the use of gamification. As the student acquires badges, hit points, etc., it is evidence of learning. Of course, this idea is applied primarily at the K-12 level, but the concept of gamification seems to be used in almost all of the game apps available on cellphones as well as many websites.

Lastly, observation is another alternative. In this approach, the teacher makes weekly observations of each student. These observations are then used to provide feedback for the students. Although time-consuming this is a way to support students without grades.

Conclusion

As long as there is education there must be some sort of way to determine if students are meeting expectations. Grades are the current standard. As with any system, grades have their strengths and weaknesses. With this in mind, it is the responsibility of teachers to always search for ways to improve how students are assessed.

Supporting ESL Student’s Writing

ESL students usually need to learn to write in the second language. This is especially true for those who have academic goals. Learning to write is difficult even in one’s mother tongue let alone in a second language.

In this post, we will look at several practical ways to help students to learn to write in their L2. Below are some useful strategies

  • Build on what they know
  • Encourage coherency in writing
  • Encourage collaboration
  • Support Consistency

Build on Prior Knowledge

It is easier for most students to write about what they know rather than what they do not know. As such, as a teacher, it is better to have students write about a familiar topic. This reduces the cognitive load on the students and allows them to focus more on their language issues.

In addition, building on prior knowledge is consistent with constructivism. Therefore, students are deepening their learning through using writing to express ideas and opinions.

Support Coherency 

Coherency has to do with whether the paragraph makes sense or not. In order to support this, the teacher needs to guide the students in developing main ideas and supporting details and illustrate how these concepts work together at the paragraph level. For more complex writing this involves how various paragraphs work together to support a thesis or purpose statement.

Students struggle tremendously with these big-picture ideas. This is in part due to the average student’s obsession with grammar. Grammar becomes critical after the student has ideas to share clearly, and never before that.

Encourage Collaboration

Students should work together to improve their writing. This can involve peer editing and/or brainstorming activities. These forms of collaboration give students different perspectives on their writing beyond just depending on the teacher.

Collaboration is also consistent with cooperative learning. In today’s marketplace, few people are granted the privilege of working exclusively alone on anything.  In addition, working together can help the students to develop their English speaking communication skills.

Consistency

Writing needs to be scheduled and happen frequently in order to see progress at the ESL level. This is different from a native speaking context in which the students may have several large papers that they work on alone. In the ESL classroom, the students should write smaller and more frequent papers to provide more feedback and scaffolding.

Small incremental growth should be the primary goal for ESL students. This should be combined with support from the teacher through a consistent commitment to writing.

Conclusion

Writing is a major component of academic life. Many ESL students learn a second language in order to pursue academic goals. Therefore, it is important that teachers have ideas on how they can support ESL students in achieving the fluency they desire in their writing for further academic success.

Tips for Teaching Online

Teaching online is a unique experience due in part to the platform of instruction. Often, there is no face-to-face interaction and all communication is in some sort of digital format. Although this can be a rewarding experience there are still several things to consider when teaching in this format. Some tips for successful online teaching include the following.

  • Planning in advance
  • Having a presence
  • Knowing your technology
  • Being consistent

Plan in Advance

All teaching involves advance planning. However, there are those teaching moments in a regular classroom where a teacher can change midstream to hit a particular interest in the class. In addition, more experienced teachers tend to plan less as they are so comfortable with the content and have an intuitive sense of how to support students.

In online teaching, the entire course should be planned and laid out accordingly before the course starts. It is a nightmare to try and develop course material while trying to teach online. This is partially due to the fact that there are so many reminders and due dates sprinkled throughout the course that are inflexible. This means a teacher must know the end from the beginning in terms of what the curriculum covers and what assignments are coming. Changing midstream is really tough.

In addition, the asynchronous nature of online teaching means that instructional material must be thoroughly clear or students will be lost. This again places an emphasis on strong preparation. Online teaching isn’t really for the person who likes to live in the moment but rather for the person who plans ahead.

Have Presence

Having presence means making clear that you are monitoring progress and communicating with students frequently. When students complete assignments they should receive feedback. There should be announcements made in terms of assignments due, general feedback about activities, as well as Q&A with students.

Many people think that teaching online takes less time and allows for larger classes. This is far from the case. Online teaching is as time-intensive as regular teaching because you must provide feedback and communication or the students will often feel abandoned.

Know Your Technology

An online teacher must be familiar with and a proponent of technology. This does not mean that you know everything but rather that you know how to get stuff done. You don’t need a master’s in web design, but knowing the basics of HTML can really help when communicating with the IT people.

Whatever learning management system you use, you should actually be familiar with it and not just be a consumer of it. Too many people just upload text for students to read, provide several forums, and call that online learning. In many ways, that’s online boredom, especially for younger students.

Consistency

Consistency is about the user experience. The different modules in the course should have the same format with different activities. This way, students focus on learning and not on trying to figure out what you want them to do. This applies across classes as well. There needs to be some sense of stability in terms of how content is delivered. There is no single best way, but it needs to be similar within and across courses for the sake of learning.

Conclusion

These are just some of many ideas to consider when teaching an online course. The main point is the need for preparation and dedication when teaching online.

Principal Component Regression in R

This post will explain and provide an example of principal component regression (PCR). Principal component regression involves having the model construct components that are linear combinations of the independent variables. This is similar to principal component analysis, except the components are then used as the predictors in a regression model for the dependent variable. Doing this often allows you to use fewer variables in your model and can improve the fit of your model as well.

Since PCR is based on principal component analysis it is an unsupervised method, which means the dependent variable has no influence on the development of the components. As such, there are times when the components that are developed may not be beneficial for explaining the dependent variable.

Our example will use the “Mroz” dataset from the “Ecdat” package. Our goal will be to predict “income” based on the variables in the dataset. Below is the initial code

library(pls);library(Ecdat)
data(Mroz)
str(Mroz)
## 'data.frame':    753 obs. of  18 variables:
##  $ work      : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hoursw    : int  1610 1656 1980 456 1568 2032 1440 1020 1458 1600 ...
##  $ child6    : int  1 0 1 0 1 0 0 0 0 0 ...
##  $ child618  : int  0 2 3 3 2 0 2 0 2 2 ...
##  $ agew      : int  32 30 35 34 31 54 37 54 48 39 ...
##  $ educw     : int  12 12 12 12 14 12 16 12 12 12 ...
##  $ hearnw    : num  3.35 1.39 4.55 1.1 4.59 ...
##  $ wagew     : num  2.65 2.65 4.04 3.25 3.6 4.7 5.95 9.98 0 4.15 ...
##  $ hoursh    : int  2708 2310 3072 1920 2000 1040 2670 4120 1995 2100 ...
##  $ ageh      : int  34 30 40 53 32 57 37 53 52 43 ...
##  $ educh     : int  12 9 12 10 12 11 12 8 4 12 ...
##  $ wageh     : num  4.03 8.44 3.58 3.54 10 ...
##  $ income    : int  16310 21800 21040 7300 27300 19495 21152 18900 20405 20425 ...
##  $ educwm    : int  12 7 12 7 12 14 14 3 7 7 ...
##  $ educwf    : int  7 7 7 7 14 7 7 3 7 7 ...
##  $ unemprate : num  5 11 5 5 9.5 7.5 5 5 3 5 ...
##  $ city      : Factor w/ 2 levels "no","yes": 1 2 1 1 2 2 1 1 1 1 ...
##  $ experience: int  14 5 15 6 7 33 11 35 24 21 ...

Our first step is to divide our dataset into a train and test set. We will do a simple 50/50 split for this demonstration.

train<-sample(c(T,F),nrow(Mroz),rep=T) #50/50 train/test split
test<-(!train)

In the code above we use the “sample” function to create a “train” index based on the number of rows in the “Mroz” dataset. Basically, R is making a vector that randomly marks each row in the “Mroz” dataset as TRUE or FALSE. Next, we assign every row that is not in the “train” vector to the “test” vector by using the exclamation mark.
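Keep in mind that “sample” is random, so the split will be different every time the code is run. If you want a reproducible split, you could set a seed first, as in the small sketch below (the seed value is arbitrary and not part of the original example).

set.seed(777) #any fixed seed makes the random split reproducible
train<-sample(c(T,F),nrow(Mroz),rep=T)
test<-(!train)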

We are now ready to develop our model. Below is the code

set.seed(777)
pcr.fit<-pcr(income~.,data=Mroz,subset=train,scale=T,validation="CV")

To make our model we use the “pcr” function from the “pls” package. The “subset” argument tells R to use the “train” vector to select examples from the “Mroz” dataset. The “scale” argument makes sure everything is measured the same way. This is important when using a component analysis tool, as variables with different scales have a different influence on the components. Lastly, the “validation” argument enables cross-validation. This will help us to determine the number of components to use for prediction. Below are the results of the model using the “summary” function.

summary(pcr.fit)
## Data:    X dimension: 381 17 
##  Y dimension: 381 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           12102    11533    11017     9863     9884     9524     9563
## adjCV        12102    11534    11011     9855     9878     9502     9596
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        9149     9133     8811      8527      7265      7234      7120
## adjCV     9126     9123     8798      8877      7199      7172      7100
##        14 comps  15 comps  16 comps  17 comps
## CV         7118      7141      6972      6992
## adjCV      7100      7123      6951      6969
## 
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X        21.359    38.71    51.99    59.67    65.66    71.20    76.28
## income    9.927    19.50    35.41    35.63    41.28    41.28    46.75
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         80.70    84.39     87.32     90.15     92.65     95.02     96.95
## income    47.08    50.98     51.73     68.17     68.29     68.31     68.34
##         15 comps  16 comps  17 comps
## X          98.47     99.38    100.00
## income     68.48     70.29     70.39

There is a lot of information here. The VALIDATION: RMSEP section gives you the root mean squared error of the model broken down by component. The TRAINING section is similar to the printout of any PCA, but it shows the cumulative variance explained by the components as well as the variance explained for the dependent variable “income.” In this model, we are able to explain up to 70% of the variance if we use all 17 components.
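If you would rather pull the cross-validated error numbers out directly instead of reading them from the summary, the “pls” package also provides the “RMSEP” function. Below is a small sketch under that assumption; check the documentation of your version of the package.

RMSEP(pcr.fit,estimate="CV") #cross-validated RMSEP for 0 to 17 components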

We can graph the MSE using the “validationplot” function with the argument “val.type” set to “MSEP”. The code is below.

validationplot(pcr.fit,val.type = "MSEP")

[Validation plot: cross-validated MSEP by number of components]

How many components to pick is subjective; however, there is almost no improvement beyond 13, so we will use 13 components in our prediction model and calculate the mean squared error.

set.seed(777)
pcr.pred<-predict(pcr.fit,Mroz[test,],ncomp=13)
mean((pcr.pred-Mroz$income[test])^2)
## [1] 48958982

MSE is what you would use to compare this model to other models that you developed. Below is the performance of a least squares regression model

set.seed(777)
lm.fit<-lm(income~.,data=Mroz,subset=train)
lm.pred<-predict(lm.fit,Mroz[test,])
mean((lm.pred-Mroz$income[test])^2)
## [1] 47794472

If you compare the MSE values, the least squares model performs slightly better than the PCR one. However, there are a lot of non-significant features in the model, as shown below.

summary(lm.fit)
## 
## Call:
## lm(formula = income ~ ., data = Mroz, subset = train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -27646  -3337  -1387   1860  48371 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.215e+04  3.987e+03  -5.556 5.35e-08 ***
## workno      -3.828e+03  1.316e+03  -2.909  0.00385 ** 
## hoursw       3.955e+00  7.085e-01   5.582 4.65e-08 ***
## child6       5.370e+02  8.241e+02   0.652  0.51512    
## child618     4.250e+02  2.850e+02   1.491  0.13673    
## agew         1.962e+02  9.849e+01   1.992  0.04709 *  
## educw        1.097e+02  2.276e+02   0.482  0.63013    
## hearnw       9.835e+02  2.303e+02   4.270 2.50e-05 ***
## wagew        2.292e+02  2.423e+02   0.946  0.34484    
## hoursh       6.386e+00  6.144e-01  10.394  < 2e-16 ***
## ageh        -1.284e+01  9.762e+01  -0.132  0.89542    
## educh        1.460e+02  1.592e+02   0.917  0.35982    
## wageh        2.083e+03  9.930e+01  20.978  < 2e-16 ***
## educwm       1.354e+02  1.335e+02   1.014  0.31115    
## educwf       1.653e+02  1.257e+02   1.315  0.18920    
## unemprate   -1.213e+02  1.148e+02  -1.057  0.29140    
## cityyes     -2.064e+02  7.905e+02  -0.261  0.79421    
## experience  -1.165e+02  5.393e+01  -2.159  0.03147 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6729 on 363 degrees of freedom
## Multiple R-squared:  0.7039, Adjusted R-squared:   0.69 
## F-statistic: 50.76 on 17 and 363 DF,  p-value: < 2.2e-16

After removing these, the MSE is almost the same for the PCR and least squares models.

set.seed(777)
lm.fit2<-lm(income~work+hoursw+hearnw+hoursh+wageh,data=Mroz,subset=train)
lm.pred2<-predict(lm.fit2,Mroz[test,])
mean((lm.pred2-Mroz$income[test])^2)
## [1] 47968996

Conclusion

Since the least squares model is simpler, it is probably the superior model. PCR is strongest when there are a lot of variables involved and when there are issues with multicollinearity.

Leave One Out Cross Validation in R

Leave one out cross validation (LOOCV) is a variation of the validation approach in that, instead of splitting the dataset in half, LOOCV uses one example as the validation set and all the rest as the training set. This helps to reduce bias and randomness in the results but, unfortunately, can increase variance. Remember that the goal is always to reduce the error rate, which is often calculated as the mean squared error.

In this post, we will use the “Hedonic” dataset from the “Ecdat” package to assess several different models that predict the taxes of homes. In order to do this, we will also need to use the “boot” package. Below is the code.

library(Ecdat);library(boot)
data(Hedonic)
str(Hedonic)
## 'data.frame':    506 obs. of  15 variables:
##  $ mv     : num  10.09 9.98 10.45 10.42 10.5 ...
##  $ crim   : num  0.00632 0.02731 0.0273 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 ...
##  $ chas   : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ nox    : num  28.9 22 22 21 21 ...
##  $ rm     : num  43.2 41.2 51.6 49 51.1 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 ...
##  $ dis    : num  1.41 1.6 1.6 1.8 1.8 ...
##  $ rad    : num  0 0.693 0.693 1.099 1.099 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 ...
##  $ blacks : num  0.397 0.397 0.393 0.395 0.397 ...
##  $ lstat  : num  -3 -2.39 -3.21 -3.53 -2.93 ...
##  $ townid : int  1 2 2 3 3 3 4 4 4 4 ...

First, we need to develop our basic least squares regression model. We will do this with the “glm” function. This is because the “cv.glm” function (more on this later) only works when models are developed with the “glm” function. Below is the code.

tax.glm<-glm(tax ~ mv+crim+zn+indus+chas+nox+rm+age+dis+rad+ptratio+blacks+lstat, data = Hedonic)

We now need to calculate the MSE. To do this we will use the “cv.glm” function. Below is the code.

cv.error<-cv.glm(Hedonic,tax.glm)
cv.error$delta
## [1] 4536.345 4536.075

cv.error$delta contains two numbers. The first is the raw cross-validation estimate of the prediction error and the second is a bias-adjusted version of that estimate. As you can see, the two numbers are almost identical.
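To make clearer what “cv.glm” is doing behind the scenes, below is a rough hand-rolled sketch of LOOCV for the same model. This is for illustration only; the object names are my own and the loop is much slower than “cv.glm”.

n<-nrow(Hedonic)
errors<-rep(0,n)
for (i in 1:n){
        fit<-glm(tax ~ mv+crim+zn+indus+chas+nox+rm+age+dis+rad+ptratio+blacks+lstat, data = Hedonic[-i,]) #refit without observation i
        errors[i]<-(Hedonic$tax[i]-predict(fit,Hedonic[i,]))^2 #squared error for the held-out row
}
mean(errors) #should be close to cv.error$delta[1]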

We will now repeat this process but with the inclusion of different polynomial models. The code for this is a little more complicated and is below.

cv.error=rep(0,5)
for (i in 1:5){
        tax.loocv<-glm(tax ~ mv+poly(crim,i)+zn+indus+chas+nox+rm+poly(age,i)+dis+rad+ptratio+blacks+lstat, data = Hedonic)
        cv.error[i]=cv.glm(Hedonic,tax.loocv)$delta[1]
}
cv.error
## [1] 4536.345 4515.464 4710.878 7047.097 9814.748

Here is what happened.

  1. First, we created an empty object called “cv.error” with five empty spots, which we will use to store information later.
  2. Next, we created a for loop that repeats 5 times
  3. Inside the for loop, we create the same regression model except we added the “poly” function in front of “age” and also “crim”. These are the variables we want to try polynomials of degree 1 through 5 on to see if this reduces the error.
  4. The results of the polynomial models are stored in the “cv.error” object, and we specifically request the first value of “delta”. Finally, we printed “cv.error” to the console.

From the results, you can see that the error decreases with a second-order polynomial but then increases after that. This means that higher-order polynomials are generally not beneficial.
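If you want R to pick the degree with the lowest error for you, the “which.min” function can be applied to the “cv.error” vector from above. Below is a small sketch.

which.min(cv.error) #index of the lowest LOOCV error, the 2nd degree polynomial here
cv.error[which.min(cv.error)] #the error itself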

Conclusion

LOOCV is another option for assessing different models and determining which is most appropriate. As such, this is a tool that is used by many data scientists.

Conversational Analysis: Questions & Responses

Conversational analysis (CA) is the study of social interactions in everyday life. In this post, we will look at how questions and responses are categorized in CA.

Questions

In CA, there are generally three types of questions and they are as follows…

  • Identification question
  • Polarity question
  • Confirmation question

Identification Question

Identification questions are questions that employ one of the five W’s (who, what, where, when, why). The response can be open-ended or closed-ended. An example is below.

Where are the keys?

Polarity Question

A polarity question is a question that calls for a yes/no response.

Can you come to work tomorrow?

Confirmation Question

Similar to the polarity question, a confirmation question is a question that seeks to gather support for something the speaker already said.

Didn’t Sam go to the store already?

This question is seeking an affirmative yes.

Responses

There are also several ways in which people respond to a question. Below is a list of common ways.

  • Comply
  • Supply
  • Imply
  • Evade
  • Disclaim

Comply

Complying means giving a clear, direct answer to a question. Below is an example.

A: What time is it?
B: 6:30pm

Supply

Supplying is the act of giving a partial response that is often irrelevant and fails to answer the question.

A: Is this your dog?
B: Well…I do feed it once in awhile

In the example above, person A asks a clear question. However, person B states what they do for the dog (feed it) rather than indicate if the dog belongs to them. Feeding the dog is irrelevant to ownership.

Imply

Implying is providing information indirectly to answer a question.

A: What time do you want to leave?
B: Not too late

The response from person B does not indicate any sort of specific time to leave. This leaves it up to person A to determine what is meant by “too late.”

Disclaim

Disclaiming is the person stating that they do not know the answer.

A: Where are the keys?
B: I don’t know

Evade

Evading is the act of answering without really answering the question.

A: Where is the car
B: David needed to go shopping

In the example above, person B never states where the car is. Rather, they share what someone is doing with the car. By doing this, the speaker never shares where the car is.

Conclusions

The interaction of a question and response can be interesting if it is examined more closely from a sociolinguistic perspective. The categories provided here can support the deeper analysis of conversation.

Terms Related to Language

This post will examine different uses of the word language. There are several different ways that this word can be defined. We will look at the following terms for language.

  • Vernacular
  • Standard
  • National
  • Official
  • Lingua Franca

Vernacular Language

The term vernacular language can mean many different things. It can mean a language that is not standardized or a language that is not the standard language of a nation. Generally, a vernacular language is a language that lacks official status in a country.

Standard Language

A standard language is a language that has been codified. By this, it is meant that the language has dictionaries and other grammatical sources that describe and even prescribe the use of the language.

Most languages have experienced codification. However, codification is just one part of being a standard language. A language must also be perceived as prestigious and serve a high function.

By prestigious it is meant that the language has influence in a community. For example, Japanese is a prestigious language in Japan. By high function, it is meant that the language is used in official settings such as government, business, etc., which Japanese is used for.

National Language

A national language is a language used for political and cultural reasons to unite a people. Many countries that have a huge number of languages and ethnic groups will select one language as a way to forge an identity. For example, in the Philippines, the national language is Tagalog even though hundreds of other languages are spoken.

In Myanmar, Burmese is the national language even though dozens of other languages are spoken. The selection of the language is politically motivated, with the dominant group imposing their language on others.

Official Language

An official language is the language of government business. Many former colonized nations will still use an official language that comes from the people who colonized them. This is especially true in African countries such as Ivory Coast and Chad which use French as their official language despite having other indigenous languages available.

Lingua Franca

A lingua franca is a language that serves as a vehicle of communication between two language groups whose mother tongues are different. For example, English is often the de facto lingua franca of people who do not speak the same language.

Multiple Categories

A language can fit into more than one of the definitions above. For example, English is a vernacular language in many countries such as Thailand and Malaysia. However, English is not considered a vernacular language in the United States.

To make things more confusing, English is the language of the United States, but it is neither the national nor the official language, as this has never been legislated. Yet English is a standard language as it has been codified and meets the other criteria for standardization.

Currently, English is viewed by many as an international Lingua Franca with a strong influence on the world today.

Lastly, a language can be in more than one category. Thai is the official, national, and standard language of Thailand.

Conclusion

Language is a term that can have many meanings. In this post, we looked at the different ways to see this word.

Code-Switching & Lexical Borrowing

Code-switching involves a speaker changing languages as they talk. This post will explore some of the reasons why people code-switch. In addition, we will look at lexical borrowing and its use in communication.

Code-Switching

Code-switching is most commonly caused by social factors and social dimensions of pragmatics. By social factors, it is meant the who, what, where, when and why of communication. Social dimensions involve distance, status, formality, emotions, referential traits.

For example, two people from the same ethnicity may briefly switch to their language to say hello to each other before returning to English. The “what” is two people meeting each other and the use of the mother-tongue indicates high intimacy with each other.

The topic of discussion can also lead to code-switching. For example, I have commonly seen students with the same mother tongue switch to using English when discussing academic subjects. This may be because their academic studies use the English language as a medium of instruction.

Switching can also take place for emotional reasons. For example, a person may switch languages to communicate anger such as a mother switching to the mother-tongue to scold their child.

There is a special type of code-switching called metaphorical switching. This type of switching happens when the speaker switches languages for symbolic reasons. For example, when a person agrees about something, they use their mother tongue. However, when they disagree about something, they may switch to English. This switching back and forth is a way to indicate their opinion on a matter without having to express it too directly.

Lexical Borrowing

Lexical borrowing is used when a person takes a word from one language to replace an unknown word in a different language. Code-switching happens at the sentence level whereas lexical borrowing happens at the individual word level.

Borrowing does not always happen because of a poor memory. Another reason for lexical borrowing is that some words do not translate into another language. This forces the speaker to borrow. For example, many languages do not have a word for computer or internet. Therefore, these words are borrowed when speaking.

Perceptions

Often, people have no idea that they are code-switching or even borrowing. However, those who are conscious of it usually have a negative attitude towards it. The criticism of code-switching often involves complaints of how it destroys both languages. However, it takes a unique mastery of both languages to effectively code-switch or borrow lexically.

Conclusion

Code-switching and lexical borrowing are characteristics of communication. For those who want to prescribe language, it may be frustrating to watch two languages being mixed together. However, from a descriptive perspective, this is a natural result of language interaction.

Social Dimensions of Language

In sociolinguistics, social dimensions are the characteristics of the context that affect how language is used. Generally, there are four dimensions to the social context, and they are measured and analyzed through the use of five scales. The four dimensions and five scales are as follows.

  • Social distance
  • Status
  • Formality
  • Functional (which includes a referential and affective function)

This post will explore each of these four social dimensions of language.

Social Distance

Social distance is an indicator of how well we know someone that we are talking to.  Many languages have different pronouns and even declensions in their verbs based on how well they know someone.

For example, in English, a person might say “what’s up?” to a friend. However, when speaking to a stranger, regardless of the stranger’s status, a person may say something such as “How are you?”. The only reason for the change in language use is the lack of intimacy with the stranger as compared to the friend.

Status

Status is related to social ranking. The way we speak to peers is different than how we speak to superiors. Friends are called by their first name while a boss, in some cultures, is always referred to by Mr/Mrs or sir/madam.

The rules for status can be confusing. Frequently we will refer to our parents as mom or dad but never Mr/Mrs. Even though Mr/Mrs is a sign of respect it violates the intimacy of the relationship between a parent and child. As such, often parents would be upset if their children called them Mr/Mrs.

Formality

Formality can be seen as the presence or absence of colloquialisms and slang in a person’s communication. In a highly formal setting, such as a speech, the language will often lack the more earthy style of speaking. Contractions may disappear, idioms may be reduced, etc. However, when spending time with friends at home, a more laid-back manner of speaking will emerge. One’s accent becomes more prominent, slang terms are permissible, etc.

Function (Referential & Affective)

Referential is a measure of the amount of information being shared in a discourse, such as the use of facts, statistics, directions, etc. Affective relates to the emotional content of communication and indicates how someone feels about the topic.

Often the referential and affective functions are interrelated, such as in the following example.

James is a 45 year-old professor of research who has written several books but is still a complete idiot!

The example above shares a lot of information, as it gives the person’s name, job, and accomplishments. However, the emotions of the speaker are highly negative towards James, as they call James a “complete idiot.”

Conclusion 

The social dimensions of language are useful to know in order to understand what is affecting how people communicate. The concepts behind the four dimensions impact how we talk without most of us knowing why or how. This can be frustrating but also empowering, as people will understand why they adjust to various contexts of language use.

Journal Writing

A journal is a log that a student uses to record their thoughts about something. This post will provide examples of journals as well as guidelines for using journals in the classroom.

Types of Journals

There are many different types of journals. Normally, all journals have some sort of dialog happening between the student and the teacher. This allows both parties to get to know each other better.

Normally, journals will have a theme or focus. Examples in TESOL would include journals that focus on grammar, learning strategies, language-learning, or recording feelings. Most journals will focus on one of these to the exclusion of the others.

Guidelines for Using Journals

Journals can be useful if they are properly planned. As such, a teacher should consider the following when using journals.

  1. Provide purpose-Students need to know why they are writing journals. Most students seem to despise reflection and will initially reject this learning experience.
  2. Forget grammar-Journals are for writing. Students need to set aside the obsession they have acquired for perfect grammar and focus on developing their thoughts about something. There is a time and place for grammar and that is for summative assessments such as final drafts of research papers.
  3. Explain the grading process-Students need to know what they must demonstrate in order to receive adequate credit.
  4. Provide feedback-Journals are a dialog. As such, the feedback should encourage and/or instruct the students. The feedback should also be provided consistently at scheduled intervals.

Journals take a lot of time to read and provide feedback on. In addition, the handwriting quality of students can vary radically, which means that some students’ journals are unreadable.

Conclusion

Journaling is an experience that allows students to focus on the process of learning rather than the product. This is often neglected in the school experience. Through journals, students are able to focus on the development of ideas without wasting working memory capacity on grammar and syntax. As such, journals can be a powerful tool for developing critical thinking skills.

Cradle Approach to Portfolio Development

Portfolio development is one of many forms of alternative assessment available to teachers. When this approach is used, the students generally collect their work and try to make sense of it through reflection.

It is surprisingly easy for portfolio development to amount to nothing more than archiving work. However, the CRADLE approach was developed by Gottlieb to alleviate potential confusion over this process. CRADLE stands for the following

C ollecting
R eflecting
A ssessing
D ocumenting
L inking
E valuating

Collecting

Collecting is the process in which the students gather materials to include in their portfolio. It is left to the students to decide what to include. However, it is still necessary for the teacher to provide clear guidelines in terms of what can be potentially selected.

Clear guidelines include stating the objectives as well as explaining how the portfolio will be assessed. It is also important to set aside class time for portfolio development.

Some examples of work that can be included in a portfolio include the following.

  • tests, quizzes
  • compositions
  • electronic documents (powerpoints, pdfs, etc)

Reflecting

Reflecting happens through the student thinking about the work they have placed in the portfolio. This can be demonstrated in many different ways. Common ways to reflect include the use of journals in which students comment on their work. Another way, for young students, is the use of a checklist in which students simply check the characteristics that are present in their work. As such, the teacher’s role is to provide class time so that students are able to reflect on their work.

Assessing

Assessing involves checking and maintaining the quality of the portfolio over time. Normally, there should be a gradual improvement in work quality in a portfolio. This is a subjective matter that is negotiated by the student and teacher, often in the form of conferences.

Documenting

Documenting serves more as a reminder than an action. Simply, documenting means that the teacher and student maintain the importance of the portfolio over the course of its usefulness. This is critical as it is easy to forget about portfolios through the pressure of the daily teaching experience.

Linking

Linking is the use of a portfolio to serve as a mode of communication between students, peers, teachers, and even parents. Students can look at each other’s portfolios and provide feedback. Parents can also examine the work of their child through the use of portfolios.

Evaluating

Evaluating is the process of receiving a grade for this experience. For the teacher, the goal is to provide positive washback when assessing the portfolios. The focus is normally less on grades and more qualitative in nature.

Conclusions

Portfolios provide rich opportunities for developing intrinsic motivation, individualized learning, and critical thinking. However, trying to affix a grade to such a learning experience is often impractical. As such, portfolios are useful, but it can be hard to prove that any learning took place.

Data Munging with Dplyr

Data preparation, aka data munging, is what most data scientists spend the majority of their time doing. Extracting and transforming data is difficult, to say the least. Every dataset is different, with unique problems. This makes it hard to generalize best practices for transforming data so that it is suitable for analysis.

In this post, we will look at how to use the various functions in the “dplyr” package. This package provides numerous ways to develop features as well as explore the data. We will use the “attitude” dataset from base R for our analysis. Below is some initial code.

library(dplyr)
data("attitude")
str(attitude)
## 'data.frame':    30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

You can see we have seven variables and only 30 observations. The first function that we will learn to use is the “select” function. This function allows you to select the columns of data you want to use. In order to use this feature, you need to know the names of the columns you want. Therefore, we will first use the “names” function to determine the names of the columns and then use the “select” function.

names(attitude)[1:3]
## [1] "rating"     "complaints" "privileges"
smallset<-select(attitude,rating:privileges)
head(smallset)
##   rating complaints privileges
## 1     43         51         30
## 2     63         64         51
## 3     71         70         68
## 4     61         63         45
## 5     81         78         56
## 6     43         55         49

The difference is probably obvious. Using the “select” function we have 3 instead of 7 variables. We can also exclude columns we do not want by placing a negative in front of the names of the columns. Below is the code

head(select(attitude,-(rating:privileges)))
##   learning raises critical advance
## 1       39     61       92      45
## 2       54     63       73      47
## 3       69     76       86      48
## 4       47     54       84      35
## 5       66     71       83      47
## 6       44     54       49      34

We can also use the “rename” function to change the names of columns. In our example below, we will change the name of the “rating” to “rates.” The code is below. Keep in mind that the new name for the column is to the left of the equal sign and the old name is to the right

attitude<-rename(attitude,rates=rating)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34

The “select” function can be used in combination with other functions to find specific columns in the dataset. For example, we will use the “ends_with” function inside the “select” function to find all columns that end with the letter s.

s_set<-head(select(attitude,ends_with("s")))
s_set
##   rates complaints privileges raises
## 1    43         51         30     61
## 2    63         64         51     63
## 3    71         70         68     76
## 4    61         63         45     54
## 5    81         78         56     71
## 6    43         55         49     54

The “filter” function allows you to select rows from a dataset based on criteria. In the code below we will select only rows that have a 75 or higher in the “raises” variable.

bigraise<-filter(attitude,raises>75)
bigraise
##   rates complaints privileges learning raises critical advance
## 1    71         70         68       69     76       86      48
## 2    77         77         54       72     79       77      46
## 3    74         85         64       69     79       79      63
## 4    66         77         66       63     88       76      72
## 5    78         75         58       74     80       78      49
## 6    85         85         71       71     77       74      55

If you look closely, all values in the “raises” column are greater than 75. Of course, you can have more than one criterion. In the code below there are two.

filter(attitude, raises>70 & learning<67)
##   rates complaints privileges learning raises critical advance
## 1    81         78         56       66     71       83      47
## 2    65         70         46       57     75       85      46
## 3    66         77         66       63     88       76      72

The “arrange” function allows you to sort the order of the rows. In the code below we first sort the data in ascending order by the “critical” variable. Then we sort it in descending order by adding the “desc” function.

ascCritical<-arrange(attitude, critical)
head(ascCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         55         49       44     54       49      34
## 2    81         90         50       72     60       54      36
## 3    40         37         42       58     50       57      49
## 4    69         62         57       42     55       63      25
## 5    50         40         33       34     43       64      33
## 6    71         75         50       55     70       66      41
descCritical<-arrange(attitude, desc(critical))
head(descCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    71         70         68       69     76       86      48
## 3    65         70         46       57     75       85      46
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    72         82         72       67     71       83      31

The “mutate” function is useful for engineering features. In the code below we will transform the “learning” variable by subtracting its mean from each value.

attitude<-mutate(attitude,learningtrend=learning-mean(learning))
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend
## 1    -17.366667
## 2     -2.366667
## 3     12.633333
## 4     -9.366667
## 5      9.633333
## 6    -12.366667

You can also create logical variables with the “mutate” function. In the code below, we create a logical variable that is TRUE when the “critical” variable is 80 or higher and FALSE otherwise. The new variable is called “highCritical”.

attitude<-mutate(attitude,highCritical=critical>=80)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend highCritical
## 1    -17.366667         TRUE
## 2     -2.366667        FALSE
## 3     12.633333         TRUE
## 4     -9.366667         TRUE
## 5      9.633333         TRUE
## 6    -12.366667        FALSE

The “group_by” function is used for creating summary statistics based on a specific variable. It is similar to the “aggregate” function in R. This function works in combination with the “summarize” function for our purposes here. We will group our data by the “highCritical” variable. This means our data will be viewed as either TRUE for “highCritical” or FALSE. The results of this function will be saved in an object called “hcgroups”

hcgroups<-group_by(attitude,highCritical)
head(hcgroups)
## # A tibble: 6 x 9
## # Groups:   highCritical [2]
##   rates complaints privileges learning raises critical advance
##   <dbl>      <dbl>      <dbl>    <dbl>  <dbl>    <dbl>   <dbl>
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
## # ... with 2 more variables: learningtrend <dbl>, highCritical <lgl>

Looking at the data you probably saw no difference. This is because we are not done yet. We need to summarize the data in order to see the results for our two groups in the “highCritical” variable.

We will now generate the summary statistics by using the “summarize” function. We specifically want to know the mean of the “complaints” variable based on the variable “highCritical.” Below is the code.

summarize(hcgroups,complaintsAve=mean(complaints))
## # A tibble: 2 x 2
##   highCritical complaintsAve
##   <lgl>              <dbl>
## 1        FALSE      67.31579
## 2         TRUE      65.36364

Of course, you could have learned this through doing a t.test but this is another approach.
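For those curious, below is a sketch of what that comparison might look like with base R’s “t.test” function, using the logical “highCritical” column as the grouping variable.

t.test(complaints~highCritical,data=attitude) #compares mean complaints for the two groups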

Conclusion

The “dplyr” package is one powerful tool for wrestling with data. There is nothing fundamentally new in this package. Instead, the coding is simpler than what you can execute using base R.
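As one last illustration of why the syntax feels simpler, the verbs shown above can be chained together with the “%>%” pipe that loads with “dplyr”. The columns used below are chosen only for demonstration.

attitude %>%
        filter(raises>70) %>%
        group_by(highCritical) %>%
        summarize(complaintsAve=mean(complaints)) #mean complaints by group for the filtered rows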

Guiding the Writing Process

How a teacher guides the writing process can depend on a host of factors. Generally, how you support a student at the beginning of the writing process is different from how you support them at the end. In this post, we will look at the differences between these two stages of writing.

The Beginning

At the beginning of writing, there are a lot of decisions that need to be made as well as extensive planning. Generally, at this point, grammar is not the deciding factor in terms of the quality of the writing. Rather, the teacher is trying to help the students to determine the focus of the paper as well as the main ideas.

The teacher needs to help the student to focus on the big picture of the purpose of their writing. This means that only major issues are addressed, at least initially. You only want to point out potentially disastrous decisions rather than mundane details.

It is tempting to try and fix everything when looking at rough drafts. This not only takes up a great deal of your time, but it is also discouraging to students as they deal with intense criticism while still trying to determine what they truly want to do. As such, it is better to view your role at this point as a counselor or guide and not as a detail-oriented control freak.

At this stage, the focus is on the discourse and not so much on the grammar.

The End

At the end of the writing process, there is a move from general comments to specific concerns. As the student gets closer and closer to the final draft the “little things” become more and more important. Grammar comes to the forefront. In addition, referencing and the strength of the supporting details become more important.

Now is the time to get “picky.” This is because the major decisions have been made, and the cognitive load of fixing small stuff is less stressful once the core of the paper is in place. The analogy I like to give is that first you build the house, which involves lots of big movements such as pouring a foundation, adding walls, and including a roof. This is the beginning of writing. The end of building a house includes more refined aspects such as painting the walls, adding the furniture, etc. This is the end of the writing process.

Conclusion

For writers and teachers, it is important to know where they are in the writing process. In my experience, it seems as if it is all about grammar from the beginning when this is not necessarily the case. At the beginning of a writing experience, the focus is on ideas. At the end of a writing experience, the focus is on grammar. The danger is always in trying to do too much at the same time.

Review of “First Encyclopedia of the Human Body”

The First Encyclopedia of the Human Body (First Encyclopedias) by Fiona Chandler (pp. 64) provides insights into science for young children.

The Summary
This book explains all of the major functions of the human body as well as some aspects of health and hygiene. Students will learn about the brain, heart, hormones, where babies come from, as well as healthy eating and visiting the doctor.

The Good
This book is surprisingly well-written. The author was able to take the complexities of the human body and word them in a way that a child can understand. In addition, the illustrations are rich and interesting. For example, there are pictures of an infrared scan of a child’s hands, x-rays of broken bones, as well as pictures of people doing things with their bodies such as running or jumping.

There is also a good mix of small and large photos which allows this book to be used individually or for whole class reading. The large size of the text also allows for younger readers to appreciate not only the pictures but also the reading.

There are also several activities in the book at different places. For example, students are invited to take their pulse, determine how much air is in their lungs, and test their sense of touch.

In every section of the book, there are links to online activities as well. It seems as though this book has every angle covered in terms of learning.

The Bad
There is little to criticize in this book. It’s a really fun text. Perhaps if you are an expert in the human body you may find things that are disappointing. However, for a layman called to teach young people science, this text is more than adequate.

The Recommendation
I would give this book 5/5 stars. My students loved it and I was able to use it in so many different ways to build activities and discussions. I am sure that the use of this book would be beneficial to almost any teacher in any classroom.

Reading Assessment at the Perceptual and Selective Level

This post will provide examples of assessments that can be used for reading at the perceptual and selective level.

Perceptual Level

The perceptual level is focused on bottom-up processing of text. Comprehension ability is not critical at this point. Rather, you are just determining if the student can accomplish the mechanical process of reading.

Examples

Reading Aloud-How this works is probably obvious to most teachers. The students read a text out loud in the presence of an assessor.

Picture-Cued-Students are shown a picture. At the bottom of the picture are words. The students read the word and point to a visual example of it in the picture. For example, if the picture has a cat in it, the word cat would appear at the bottom of the picture. The student would read the word cat and point to the actual cat in the picture.

This can be extended by using sentences instead of words. For example, if the picture shows a man driving a car, there may be a sentence at the bottom of the picture that says “a man is driving a car.” The student would then point to the man in the picture who is driving.

Another option is T/F statements. Using our cat example from above, we might write “There is one cat in the picture,” and the student would then select T/F.

Other Examples-These include multiple-choice and written short answer.

Selective Level

The selective level is the next level above the perceptual level. At this level, the student should be able to recognize various aspects of grammar.

Examples

Editing Task-Students are given a reading passage and are asked to fix the grammar. This can happen many different ways. They could be asked to pick the incorrect word in a sentence or to add or remove punctuation.

Pictured-Cued Task-This task appeared at the perceptual level. Now it is more complicated. For example, the students might be required to read statements and label a diagram appropriately, such as the human body or aspects of geography.

Gap-Filling Task-Students read a sentence and complete it appropriately

Other Examples-These include multiple-choice and matching. The multiple-choice may focus on grammar, vocabulary, etc. Matching attempts to assess a student’s ability to pair similar items.

Conclusion

Reading assessment can take many forms. The examples here provide ways to deal with this for students who are still highly immature in their reading abilities. As fluency develops, more complex measures can be used to determine a student’s reading capability.

Types of Speaking in ESL

In the context of ESL teaching, there are at least five types of speaking that take place in the classroom. This post will define and provide examples of each. The five types are as follows…

  • Imitative
  • Intensive
  • Responsive
  • Interactive
  • Extensive

The list above is ordered from simplest to most complex in terms of the requirements of oral production for the student.

Imitative

At the imitative level, it is probably already clear what the student is trying to do. At this level, the student is simply trying to repeat what was said to them in a way that is understandable and with some adherence to pronunciation as defined by the teacher.

It does not matter whether the student comprehends what they are saying or can carry on a conversation. The goal is only to reproduce what was said to them. One common example of this is a “repeat after me” activity in the classroom.

Intensive

Intensive speaking involves producing a limited amount of language in a highly controlled context. An example of this would be reading a passage aloud or giving a direct response to a simple question.

Competency at this level is shown through achieving certain grammatical or lexical mastery. This depends on the teacher’s expectations.

Responsive

Responsive is slightly more complex than intensive but the difference is blurry, to say the least. At this level, the dialog includes a simple question with a follow-up question or two. Conversations take place by this point but are simple in content.

Interactive

The unique feature of interactive speaking is that it is usually more interpersonal than transactional. By interpersonal it is meant speaking for the purpose of maintaining relationships. Transactional speaking is for sharing information, as is common at the responsive level.

The challenge of interpersonal speaking is the context or pragmatics. The speaker has to keep in mind the use of slang, humor, ellipsis, etc., when attempting to communicate. This is much more complex than saying yes or no or giving directions to the bathroom in a second language.

Extensive

Extensive communication is normally some sort of monologue. Examples include speeches, story-telling, etc. This involves a great deal of preparation and is not typically improvisational communication.

It is one thing to survive a conversation with someone in a second language; you can rely on each other’s body language to make up for communication challenges. However, with extensive communication, either the student can speak in a comprehensible way without relying on feedback or they cannot. In my personal experience, the typical ESL student cannot do this in a convincing manner.

Intensive Listening and ESL

Intensive listening is listening for the elements (phonemes, intonation, etc.) in words and sentences. This form of listening is often assessed in an ESL setting as a way to measure an individual’s phonological and morphological awareness as well as their ability to paraphrase. In this post, we will look at these three forms of assessment with examples.

Phonological Elements

Phonological elements include phonemic consonant pairs and phonemic vowel pairs. A phonemic consonant pair has to do with distinguishing consonant sounds. Below is an example of what an ESL student would hear, followed by the potential choices they may have on a multiple-choice test.

Recording: He’s from Thailand

Choices:
(a) He’s from Thailand
(b) She’s from Thailand

The answer is clearly (a). The confusion comes from the added ‘S’ sound at the beginning of choice (b). If someone is not listening carefully, they could make a mistake. Below is an example of a phonemic pair involving vowels.

Recording: The girl is leaving?

Choices:
(a)The girl is leaving?
(b)The girl is living?

Again, if someone is not listening carefully they will miss the small change in the vowel.

Morphological Elements

Morphological elements follow the same approach as phonological elements. You can manipulate endings, stress patterns, or play with words.  Below is an example of ending manipulation.

Recording: I smiled a lot.

Choices:
(a) I smiled a lot.
(b) I smile a lot.

A sharp listener needs to hear the ‘d’ sound at the end of the word ‘smiled’, which can be challenging for an ESL student. Below is an example of a stress pattern.

Recording: My friend doesn’t smoke.

Choices:
(a) My friend doesn’t smoke.
(b) My friend does smoke.

The contraction in the example is the stress pattern the listener needs to hear. Below is an example of a play with words.

Recording: wine

Choices:
(a) wine
(b) vine

This is especially tricky for languages that do not have both a ‘v’ and ‘w’ sound, such as the Thai language.

Paraphrase recognition

Paraphrase recognition involves listening to an example and being able to reword it in an appropriate manner. This involves not only listening but also vocabulary selection and summarizing skills. Below is one example of sentence paraphrasing.

Recording: My name is James. I come from California

Choices:
(a) James is Californian
(b) James loves California

This is trickier because both can be true. However, the goal is to try and rephrase what was heard. Another form of paraphrasing is dialogue paraphrasing, as shown below.

Recording: 

Man: My name is Thomas. What is your name?
Woman: My name is Janet. Nice to meet you. Are you from Africa?
Man: No, I am an American

Choices:
(a) Thomas is from America
(b)Thomas is African

You can see the slight rephrasing that makes choice (b) wrong. This requires the student to listen to slightly longer audio while still having to rephrase it appropriately.

Conclusion

Intensive listening involves listening for the little details of an audio passage. This is a skill that provides a foundation for much more complex levels of listening.

Recommendation Engines in R

In this post, we will look at how to make a recommendation engine. We will use data on movie ratings, and we will use the “recommenderlab” package to build several different engines. The data comes from

http://grouplens.org/datasets/movielens/latest/

At this link, you need to download “ml-latest.zip”. From there, we will use the “ratings” and “movies” files in this post. The ratings file provides the ratings of the movies while the movies file provides their titles. Before going further, it is important to know that the “recommenderlab” package offers several techniques for developing recommendation engines; we will use five of them (IBCF, UBCF, POPULAR, RANDOM, & SVD) for comparative purposes. Below is the code for getting started.

library(recommenderlab)
ratings <- read.csv("~/Downloads/ml-latest-small/ratings.csv")
movies <- read.csv("~/Downloads/ml-latest-small/movies.csv")

We now need to merge the two datasets so that they become one. This way the titles and ratings are in one place. We will then coerce our “movieRatings” dataframe into a “realRatingMatrix” in order to continue our analysis. Below is the code

movieRatings<-merge(ratings, movies, by='movieId') #merge two files
movieRatings<-as(movieRatings,"realRatingMatrix") #coerce to realRatingMatrix

We will now create two histograms of the ratings. The first is raw data and the second will be normalized data. The function “getRatings” is used in combination with the “hist” function to make the histogram. The normalized data includes the “normalize” function. Below is the code.

hist(getRatings(movieRatings),breaks =10)

1.png

hist(getRatings(normalize(movieRatings)),breaks =10)

1.png

We are now ready to create the evaluation scheme for our analysis. In this object we need to set the data name (movieRatings), the method we want to use (cross-validation), the proportion of data we want to use for the training set (80%), and how many ratings the algorithm is given for each test user (1), with the rest being used to compute the error. We also need to tell R what counts as a good rating (4 or higher) and the number of folds for the cross-validation (10). Below is the code for all of this.

set.seed(123)
eSetup<-evaluationScheme(movieRatings,method='cross-validation',train=.8,given=1,goodRating=4,k=10)

Below is the code for developing our models. To do this we need to use the “Recommender” function and the “getData” function to get the dataset. Remember we are using all five modeling techniques.

ubcf<-Recommender(getData(eSetup,"train"),"UBCF")
ibcf<-Recommender(getData(eSetup,"train"),"IBCF")
svd<-Recommender(getData(eSetup,"train"),"SVD")
popular<-Recommender(getData(eSetup,"train"),"POPULAR")
random<-Recommender(getData(eSetup,"train"),"RANDOM")

The models have been created. We can now make our predictions using the “predict” function in addition to the “getData” function. We also need to set the argument “type” to “ratings”. Below is the code.

ubcf_pred<-predict(ubcf,getData(eSetup,"known"),type="ratings")
ibcf_pred<-predict(ibcf,getData(eSetup,"known"),type="ratings")
svd_pred<-predict(svd,getData(eSetup,"known"),type="ratings")
pop_pred<-predict(popular,getData(eSetup,"known"),type="ratings")
rand_pred<-predict(random,getData(eSetup,"known"),type="ratings")

We can now look at the accuracy of the models. We will do this in two steps. First, we will look at the error rates. After completing this, we will do a more detailed analysis of the stronger models. Below is the code for the first step

ubcf_error<-calcPredictionAccuracy(ubcf_pred,getData(eSetup,"unknown")) #calculate error
ibcf_error<-calcPredictionAccuracy(ibcf_pred,getData(eSetup,"unknown"))
svd_error<-calcPredictionAccuracy(svd_pred,getData(eSetup,"unknown"))
pop_error<-calcPredictionAccuracy(pop_pred,getData(eSetup,"unknown"))
rand_error<-calcPredictionAccuracy(rand_pred,getData(eSetup,"unknown"))
error<-rbind(ubcf_error,ibcf_error,svd_error,pop_error,rand_error) #combine objects into one data frame
rownames(error)<-c("UBCF","IBCF","SVD","POP","RAND") #give names to rows
error
##          RMSE      MSE       MAE
## UBCF 1.278074 1.633473 0.9680428
## IBCF 1.484129 2.202640 1.1049733
## SVD  1.277550 1.632135 0.9679505
## POP  1.224838 1.500228 0.9255929
## RAND 1.455207 2.117628 1.1354987

The results indicate that the “RAND” and “IBCF” models are clearly worse than the remaining three. We will now move to the second step and take a closer look at the “UBCF”, “SVD”, and “POP” models. We will do this by making a list and using the “evaluate” function to get other model evaluation metrics. We will make a list called “algorithms” that stores the three strongest models. Then we will make an object called “evlist”; in this object we use the “evaluate” function and give it the evaluation scheme (“eSetup”), the list (“algorithms”), and the number of movies to assess (5, 10, 15, 20).

algorithms<-list(POPULAR=list(name="POPULAR"),SVD=list(name="SVD"),UBCF=list(name="UBCF"))
evlist<-evaluate(eSetup,algorithms,n=c(5,10,15,20))
avg(evlist)
## $POPULAR
##           TP        FP       FN       TN  precision     recall        TPR
## 5  0.3010965  3.033333 4.917105 661.7485 0.09028443 0.07670381 0.07670381
## 10 0.4539474  6.214912 4.764254 658.5669 0.06806016 0.11289681 0.11289681
## 15 0.5953947  9.407895 4.622807 655.3739 0.05950450 0.14080354 0.14080354
## 20 0.6839912 12.653728 4.534211 652.1281 0.05127635 0.16024740 0.16024740
##            FPR
## 5  0.004566269
## 10 0.009363021
## 15 0.014177091
## 20 0.019075070
## 
## $SVD
##           TP        FP       FN       TN  precision     recall        TPR
## 5  0.1025219  3.231908 5.115680 661.5499 0.03077788 0.00968336 0.00968336
## 10 0.1808114  6.488048 5.037390 658.2938 0.02713505 0.01625454 0.01625454
## 15 0.2619518  9.741338 4.956250 655.0405 0.02620515 0.02716656 0.02716656
## 20 0.3313596 13.006360 4.886842 651.7754 0.02486232 0.03698768 0.03698768
##            FPR
## 5  0.004871678
## 10 0.009782266
## 15 0.014689510
## 20 0.019615377
## 
## $UBCF
##           TP        FP       FN       TN  precision     recall        TPR
## 5  0.1210526  2.968860 5.097149 661.8129 0.03916652 0.01481106 0.01481106
## 10 0.2075658  5.972259 5.010636 658.8095 0.03357173 0.02352752 0.02352752
## 15 0.3028509  8.966886 4.915351 655.8149 0.03266321 0.03720717 0.03720717
## 20 0.3813596 11.978289 4.836842 652.8035 0.03085246 0.04784538 0.04784538
##            FPR
## 5  0.004475151
## 10 0.009004466
## 15 0.013520481
## 20 0.018063361

Well, the numbers indicate that all the models are terrible. All metrics are scored rather poorly. True positives, false positives, false negatives, true negatives, precision, recall, true positive rate, and false positive rate are low for all models. Remember that these values are averages across the cross-validation folds. As such, for the “POPULAR” model, when looking at the top five movies, the average number of true positives was .3.
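
As a quick sanity check, the precision in any row can be recomputed from the averaged true and false positives in that row. Below is a minimal sketch using the “POPULAR” row for n = 5 from the table above.

TP<-0.3010965 #average true positives for POPULAR at n=5
FP<-3.033333 #average false positives for POPULAR at n=5
TP/(TP+FP) #roughly 0.090, in line with the reported precision
#note: recommenderlab averages these metrics per user, so recomputing them from the averaged counts is only approximate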

Even though the numbers are terrible the “POPULAR” model always performed the best. We can even view the ROC curve with the code below

plot(evlist,legend="topleft",annotate=T)

1.png

We can now determine individual recommendations. We first need to build a model using the POPULAR algorithm. Below is the code.

Rec1<-Recommender(movieRatings,method="POPULAR")
Rec1
## Recommender of type 'POPULAR' for 'realRatingMatrix' 
## learned using 9066 users.

We will now pull the top five recommendations for the first five raters and make a list. The numbers are the movie ids and not the actual titles.

recommend<-predict(Rec1,movieRatings[1:5],n=5)
as(recommend,"list")
## $`1`
## [1] "78"  "95"  "544" "102" "4"  
## 
## $`2`
## [1] "242" "232" "294" "577" "95" 
## 
## $`3`
## [1] "654" "242" "30"  "232" "287"
## 
## $`4`
## [1] "564" "654" "242" "30"  "232"
## 
## $`5`
## [1] "242" "30"  "232" "287" "577"

Below we can see the predicted score for specific movies. The names of the movies come from the original “movies” dataset.

rating<-predict(Rec1,movieRatings[1:5],type='ratings')
rating
## 5 x 671 rating matrix of class 'realRatingMatrix' with 2873 ratings.
movieresult<-as(rating,'matrix')[1:5,1:3]
colnames(movieresult)<-c("Toy Story","Jumanji","Grumpier Old Men")
movieresult
##   Toy Story  Jumanji Grumpier Old Men
## 1  2.859941 3.822666         3.724566
## 2  2.389340 3.352066         3.253965
## 3  2.148488 3.111213         3.013113
## 4  1.372087 2.334812         2.236711
## 5  2.255328 3.218054         3.119953

This is what the model thinks the person would rate the movie. The error is calculated from the difference between this predicted number and the actual rating. In addition, because “type” was set to “ratings”, movies a person has already rated are not re-predicted and show up as NA in that spot.
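
To tie this back to the error table earlier in the post, the RMSE can be recomputed by hand as the root of the mean squared difference between the predicted and the held-out ratings. Below is a minimal sketch, assuming the “ubcf_pred” and “eSetup” objects from earlier are still in memory.

predMat<-as(ubcf_pred,"matrix") #predicted ratings, NA where there is no prediction
actualMat<-as(getData(eSetup,"unknown"),"matrix") #held-out ratings, NA where the user did not rate the movie
diffs<-predMat-actualMat #non-NA only where a prediction and a held-out rating both exist
sqrt(mean(diffs^2,na.rm=TRUE)) #should closely match the UBCF RMSE reported above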

Conclusion

This was a lot of work. However, with additional work, you can have your own recommendation system based on data that was collected.

Clustering Mixed Data in R

One of the major problems with hierarchical and k-means clustering is that they cannot handle nominal data. The reality is that most data is mixed or a combination of both interval/ratio data and nominal/ordinal data.

One of many ways to deal with this problem is by using the Gower coefficient. This coefficient compares the cases in the dataset pairwise and calculates a dissimilarity between them, where the dissimilarity is the weighted mean of the per-variable differences for that pair of rows.
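
To make the idea concrete, below is a minimal sketch on a small made-up data frame (the variable names are invented for illustration) using the “daisy” function that we will use on the real data shortly.

library(cluster)
toy<-data.frame(age=c(25,40,63),smoker=factor(c("yes","no","no"))) #made-up mixed data
daisy(toy,metric="gower") #pairwise Gower dissimilarities
#for rows 1 and 2, age contributes |25-40|/(63-25), about 0.39, and smoker contributes 1,
#so the dissimilarity is the mean of the two, roughly 0.70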

Once the dissimilarity calculations are completed using the gower coefficient (there are naturally other choices), you can then use regular kmeans clustering (there are also other choices) to find the traits of the various clusters. In this post, we will use the “MedExp” dataset from the “Ecdat” package. Our goal will be to cluster the mixed data into four clusters. Below is some initial code.

library(cluster);library(Ecdat);library(compareGroups)
data("MedExp")
str(MedExp)
## 'data.frame':    5574 obs. of  15 variables:
##  $ med     : num  62.1 0 27.8 290.6 0 ...
##  $ lc      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idp     : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
##  $ lpi     : num  6.91 6.91 6.91 6.91 6.11 ...
##  $ fmde    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ndisease: num  13.7 13.7 13.7 13.7 13.7 ...
##  $ health  : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
##  $ linc    : num  9.53 9.53 9.53 9.53 8.54 ...
##  $ lfam    : num  1.39 1.39 1.39 1.39 1.1 ...
##  $ educdec : num  12 12 12 12 12 12 12 12 9 9 ...
##  $ age     : num  43.9 17.6 15.5 44.1 14.5 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
##  $ child   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
##  $ black   : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

You can clearly see that our data is mixed with both numerical and factor variables. Therefore, the first thing we must do is calculate the gower coefficient for the dataset. This is done with the “daisy” function from the “cluster” package.

disMat<-daisy(MedExp,metric = "gower")

Now we can use “kmeans” to make our clusters. This is possible because the Gower calculation has converted the factor variables into numerical dissimilarities. We will set the number of clusters to 4. Below is the code.

set.seed(123)
mixedClusters<-kmeans(disMat, centers=4)

We can now look at a table of the clusters

table(mixedClusters$cluster)
## 
##    1    2    3    4 
## 1960 1342 1356  916

The groups seem reasonably balanced. We now need to add the results of the kmeans to the original dataset. Below is the code

MedExp$cluster<-mixedClusters$cluster

We now can build a descriptive table that will give us the proportions of each variable in each cluster. To do this we need to use the “compareGroups” function. We will then take the output of the “compareGroups” function and use it in the “createTable” function to get our actual descriptive stats.

group<-compareGroups(cluster~.,data=MedExp)
clustab<-createTable(group)
clustab
## 
## --------Summary descriptives table by 'cluster'---------
## 
## __________________________________________________________________________ 
##                    1            2            3            4      p.overall 
##                  N=1960       N=1342       N=1356       N=916              
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯ 
## med            211 (1119)   68.2 (333)   269 (820)   83.8 (210)   <0.001   
## lc            4.07 (0.60)  4.05 (0.60)  0.04 (0.39)  0.03 (0.34)   0.000   
## idp:                                                              <0.001   
##     no        1289 (65.8%) 922 (68.7%)  1123 (82.8%) 781 (85.3%)           
##     yes       671 (34.2%)  420 (31.3%)  233 (17.2%)  135 (14.7%)           
## lpi           5.72 (1.94)  5.90 (1.73)  3.27 (2.91)  3.05 (2.96)  <0.001   
## fmde          6.82 (0.99)  6.93 (0.90)  0.00 (0.12)  0.00 (0.00)   0.000   
## physlim:                                                          <0.001   
##     no        1609 (82.1%) 1163 (86.7%) 1096 (80.8%) 789 (86.1%)           
##     yes       351 (17.9%)  179 (13.3%)  260 (19.2%)  127 (13.9%)           
## ndisease      11.5 (8.26)  10.2 (2.97)  12.2 (8.50)  10.6 (3.35)  <0.001   
## health:                                                           <0.001   
##     excellent 910 (46.4%)  880 (65.6%)  615 (45.4%)  612 (66.8%)           
##     good      828 (42.2%)  382 (28.5%)  563 (41.5%)  261 (28.5%)           
##     fair      183 (9.34%)   74 (5.51%)  137 (10.1%)  42 (4.59%)            
##     poor       39 (1.99%)   6 (0.45%)    41 (3.02%)   1 (0.11%)            
## linc          8.68 (1.22)  8.61 (1.37)  8.75 (1.17)  8.78 (1.06)   0.005   
## lfam          1.05 (0.57)  1.49 (0.34)  1.08 (0.58)  1.52 (0.35)  <0.001   
## educdec       12.1 (2.87)  11.8 (2.58)  12.0 (3.08)  11.8 (2.73)   0.005   
## age           36.5 (12.0)  9.26 (5.01)  37.0 (12.5)  9.29 (5.11)   0.000   
## sex:                                                              <0.001   
##     male      893 (45.6%)  686 (51.1%)  623 (45.9%)  482 (52.6%)           
##     female    1067 (54.4%) 656 (48.9%)  733 (54.1%)  434 (47.4%)           
## child:                                                             0.000   
##     no        1960 (100%)   0 (0.00%)   1356 (100%)   0 (0.00%)            
##     yes        0 (0.00%)   1342 (100%)   0 (0.00%)   916 (100%)            
## black:                                                            <0.001   
##     yes       1623 (82.8%) 986 (73.5%)  1148 (84.7%) 730 (79.7%)           
##     no        337 (17.2%)  356 (26.5%)  208 (15.3%)  186 (20.3%)           
## ¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯

The table speaks for itself. Results for factor variables are shown as counts with proportions. For example, in cluster 1, 1289 people or 65.8% responded “no” to having an individual deductible plan (idp). Numerical variables show the mean with the standard deviation in parentheses. For example, in cluster 1 the average log family size (lfam) was 1.05 with a standard deviation of 0.57.

Conclusion

Mixed data can be partitioned into clusters with the help of the Gower coefficient or a similar measure. In addition, kmeans is not the only way to cluster the data; there are other choices, such as partitioning around medoids. The example provided here simply serves as a basic introduction to this.
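
As a brief illustration of that alternative, below is a minimal sketch of partitioning around medoids that reuses the “disMat” dissimilarity object created earlier (the “pam” function from the “cluster” package accepts a dissimilarity directly).

pamClusters<-pam(disMat,k=4,diss=TRUE) #partition around medoids using the Gower dissimilarities
table(pamClusters$clustering) #cluster sizes, comparable to the kmeans table above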

Hierarchical Clustering in R

Hierarchical clustering is a form of unsupervised learning. What this means is that the data points lack any form of label and the purpose of the analysis is to generate labels for them. In other words, we have no Y values in our data.

Hierarchical clustering is an agglomerative technique. This means that each data point starts as its own cluster and clusters are merged over successive iterations. This is great for small datasets but is difficult to scale. In addition, you need to set the linkage, which determines how observations are grouped into clusters. There are several choices (ward, complete, single, etc.) and the best choice depends on context.
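
Below is a small sketch on made-up data showing how the linkage choice is passed to “hclust”; the “ward.D2” option is the one used later in this post.

set.seed(1)
toy<-matrix(rnorm(20),ncol=2) #ten made-up observations with two variables
d<-dist(toy) #Euclidean distances by default
plot(hclust(d,method="complete")) #complete linkage
plot(hclust(d,method="single")) #single linkage
plot(hclust(d,method="ward.D2")) #Ward's method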

In this post, we will make a hierarchical clustering analysis of the “MedExp” data from the “Ecdat” package. We are trying to identify distinct subgroups in the sample. The hierarchical clustering creates what is called a dendrogram. Below is some initial code.

library(cluster);library(compareGroups);library(NbClust);library(HDclassif);library(sparcl);library(Ecdat)
data("MedExp")
str(MedExp)
## 'data.frame':    5574 obs. of  15 variables:
##  $ med     : num  62.1 0 27.8 290.6 0 ...
##  $ lc      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ idp     : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 1 1 ...
##  $ lpi     : num  6.91 6.91 6.91 6.91 6.11 ...
##  $ fmde    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ physlim : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 1 1 1 1 ...
##  $ ndisease: num  13.7 13.7 13.7 13.7 13.7 ...
##  $ health  : Factor w/ 4 levels "excellent","good",..: 2 1 1 2 2 2 2 1 2 2 ...
##  $ linc    : num  9.53 9.53 9.53 9.53 8.54 ...
##  $ lfam    : num  1.39 1.39 1.39 1.39 1.1 ...
##  $ educdec : num  12 12 12 12 12 12 12 12 9 9 ...
##  $ age     : num  43.9 17.6 15.5 44.1 14.5 ...
##  $ sex     : Factor w/ 2 levels "male","female": 1 1 2 2 2 2 2 1 2 2 ...
##  $ child   : Factor w/ 2 levels "no","yes": 1 2 2 1 2 2 1 1 2 1 ...
##  $ black   : Factor w/ 2 levels "yes","no": 2 2 2 2 2 2 2 2 2 2 ...

For the purposes of this post, the dataset is too big. If we try to do the analysis with over 5500 observations, it will take a long time. Therefore, we will only use the first 1000 observations. In addition, we need to remove the factor variables, as this form of hierarchical clustering cannot analyze factor variables. Below is the code.

MedExp_small<-MedExp[1:1000,]
MedExp_small$sex<-NULL
MedExp_small$idp<-NULL
MedExp_small$child<-NULL
MedExp_small$black<-NULL
MedExp_small$physlim<-NULL
MedExp_small$health<-NULL

We now need to scale our data. This is important because different scales will cause different variables to have more or less influence on the results. Below is the code.

MedExp_small_df<-as.data.frame(scale(MedExp_small))

We now need to determine how many clusters to create. There is no hard rule on this, but we can use statistical analysis to help us. The “NbClust” package will conduct several different analyses to provide a suggested number of clusters to create. You have to set the distance, the min/max number of clusters, the method, and the index. The graphs can be understood by looking for the bend or elbow in them; that point suggests the best number of clusters.

numComplete<-NbClust(MedExp_small_df,distance = 'euclidean',min.nc = 2,max.nc = 8,method = 'ward.D2',index = c('all'))

1.png

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

1.png

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 9 proposed 3 as the best number of clusters 
## * 6 proposed 6 as the best number of clusters 
## * 1 proposed 8 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************
numComplete$Best.nc
##                     KL       CH Hartigan     CCC    Scott      Marriot
## Number_clusters 2.0000   2.0000   6.0000  8.0000    3.000 3.000000e+00
## Value_Index     2.9814 292.0974  56.9262 28.4817 1800.873 4.127267e+24
##                   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters      6.0   6.0000   3.0000  6.0000  2.000 3.0000
## Value_Index     166569.3 265.6967   5.3929 -0.0913  0.112 1.0987
##                 Silhouette   Duda PseudoT2  Beale Ratkowsky     Ball
## Number_clusters     2.0000 2.0000   2.0000 2.0000    6.0000    3.000
## Value_Index         0.2809 0.9567  16.1209 0.2712    0.2707 1435.833
##                 PtBiserial Frey McClain   Dunn Hubert SDindex Dindex
## Number_clusters     6.0000    1   3.000 3.0000      0  3.0000      0
## Value_Index         0.4102   NA   0.622 0.1779      0  1.9507      0
##                   SDbw
## Number_clusters 3.0000
## Value_Index     0.5195

A simple majority indicates that three clusters is the most appropriate choice. However, four clusters are probably just as good. Every time you do the analysis you will get slightly different results unless you set the seed.

To make our actual clusters we need to calculate the distances between observations using the “dist” function while also specifying the way to calculate it. We will calculate distance using the “euclidean” method. Then we will take the distance information and make the actual clustering using the “hclust” function. Below is the code.

distance<-dist(MedExp_small_df,method = 'euclidean')
hiclust<-hclust(distance,method = 'ward.D2')

We can now plot the results. We will plot “hiclust” and set hang to -1 so the observations are placed at the bottom of the plot. Next, we use the “cutree” function to identify 4 clusters and store this in the “comp” variable. Lastly, we use the “ColorDendrogram” function to highlight our actual clusters.

plot(hiclust,hang=-1, labels=F)
comp<-cutree(hiclust,4)
ColorDendrogram(hiclust,y=comp,branchlength = 100)

1.jpeg

We can also create some descriptive stats such as the number of observations per cluster.

table(comp)
## comp
##   1   2   3   4 
## 439 203 357   1

We can also make a table that looks at the descriptive stats by cluster by using the “aggregate” function.

aggregate(MedExp_small_df,list(comp),mean)
##   Group.1         med         lc        lpi       fmde     ndisease
## 1       1  0.01355537 -0.7644175  0.2721403 -0.7498859  0.048977122
## 2       2 -0.06470294 -0.5358340 -1.7100649 -0.6703288 -0.105004408
## 3       3 -0.06018129  1.2405612  0.6362697  1.3001820 -0.002099968
## 4       4 28.66860936  1.4732183  0.5252898  1.1117244  0.564626907
##          linc        lfam    educdec         age
## 1  0.12531718 -0.08861109  0.1149516  0.12754008
## 2 -0.44435225  0.22404456 -0.3767211 -0.22681535
## 3  0.09804031 -0.01182114  0.0700381 -0.02765987
## 4  0.18887531 -2.36063161  1.0070155 -0.07200553

Keep in mind that these are standardized values, so they are relative to the overall mean. Cluster 4 is a single observation with an extremely high medical cost (‘med’). Cluster 2 stands out as having the lowest annual incentive payment (‘lpi’), income (‘linc’), education (‘educdec’), and age. Cluster 3 has a high coinsurance rate (‘lc’) and deductible expenditure (‘fmde’), while cluster 1 has the lowest coinsurance rate and is close to average on most other variables. You can make boxplots of each of the stats above. Below is just an example of age by cluster.

MedExp_small_df$cluster<-comp
boxplot(age~cluster,MedExp_small_df)

1.jpeg

Conclusion

Hierarchical clustering is one way to provide labels for data that does not have them. The main challenge is determining how many clusters to create. However, this can be dealt with by using the recommendations that come from various functions in R.