Data Munging with dplyr

Data preparation, also known as data munging, is what most data scientists spend the majority of their time doing. Extracting and transforming data is difficult, to say the least. Every dataset is different, with unique problems. This makes it hard to generalize best practices for transforming data so that it is suitable for analysis.

In this post, we will look at how to use the various functions in the “dplyr” package. This package provides numerous ways to develop features as well as explore the data. We will use the “attitude” dataset from base R for our analysis. Below is some initial code.

library(dplyr)
data("attitude")
str(attitude)
## 'data.frame':    30 obs. of  7 variables:
##  $ rating    : num  43 63 71 61 81 43 58 71 72 67 ...
##  $ complaints: num  51 64 70 63 78 55 67 75 82 61 ...
##  $ privileges: num  30 51 68 45 56 49 42 50 72 45 ...
##  $ learning  : num  39 54 69 47 66 44 56 55 67 47 ...
##  $ raises    : num  61 63 76 54 71 54 66 70 71 62 ...
##  $ critical  : num  92 73 86 84 83 49 68 66 83 80 ...
##  $ advance   : num  45 47 48 35 47 34 35 41 31 41 ...

You can see we have seven variables and only 30 observations. The first function that we will learn to use is the “select” function. This function allows you to select the columns of data you want to use. In order to use this function, you need to know the names of the columns you want. Therefore, we will first use the “names” function to determine the names of the columns and then use the “select” function.

names(attitude)[1:3]
## [1] "rating"     "complaints" "privileges"
smallset<-select(attitude,rating:privileges)
head(smallset)
##   rating complaints privileges
## 1     43         51         30
## 2     63         64         51
## 3     71         70         68
## 4     61         63         45
## 5     81         78         56
## 6     43         55         49

The difference is probably obvious. Using the “select” function, we have 3 instead of 7 variables. We can also exclude columns we do not want by placing a negative sign in front of the names of the columns. Below is the code.

head(select(attitude,-(rating:privileges)))
##   learning raises critical advance
## 1       39     61       92      45
## 2       54     63       73      47
## 3       69     76       86      48
## 4       47     54       84      35
## 5       66     71       83      47
## 6       44     54       49      34

We can also use the “rename” function to change the names of columns. In our example below, we will change the name of the “rating” variable to “rates.” The code is below. Keep in mind that the new name for the column is to the left of the equal sign and the old name is to the right.

attitude<-rename(attitude,rates=rating)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34

The “select” function can be used in combination with other functions to find specific columns in the dataset. For example, we will use the “ends_with” function inside the “select” function to find all columns that end with the letter s.

s_set<-head(select(attitude,ends_with("s")))
s_set
##   rates complaints privileges raises
## 1    43         51         30     61
## 2    63         64         51     63
## 3    71         70         68     76
## 4    61         63         45     54
## 5    81         78         56     71
## 6    43         55         49     54
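The “select” function comes with other helper functions besides “ends_with.” As a sketch (assuming “dplyr” is loaded as above), “starts_with” finds columns that begin with a given string, while “contains” finds columns that contain it anywhere in the name.

head(select(attitude, starts_with("c")))  # "complaints" and "critical"
head(select(attitude, contains("i")))     # any column with the letter i in its name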

The “filter” function allows you to select rows from a dataset based on criteria. In the code below we will select only rows that have a 75 or higher in the “raises” variable.

bigraise<-filter(attitude,raises>75)
bigraise
##   rates complaints privileges learning raises critical advance
## 1    71         70         68       69     76       86      48
## 2    77         77         54       72     79       77      46
## 3    74         85         64       69     79       79      63
## 4    66         77         66       63     88       76      72
## 5    78         75         58       74     80       78      49
## 6    85         85         71       71     77       74      55

If you look closely, all values in the “raises” column are greater than 75. Of course, you can have more than one criterion. In the code below there are two.

filter(attitude, raises>70 & learning<67)
##   rates complaints privileges learning raises critical advance
## 1    81         78         56       66     71       83      47
## 2    65         70         46       57     75       85      46
## 3    66         77         66       63     88       76      72
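The & in the code above requires both criteria to be met. If meeting either criterion alone is acceptable, the | (or) operator can be used instead. Below is a sketch.

filter(attitude, raises>70 | learning<67)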

The “arrange” function allows you to sort the order of the rows. In the code below, we first sort the data in ascending order by the “critical” variable. Then we sort it in descending order by adding the “desc” function.

ascCritical<-arrange(attitude, critical)
head(ascCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         55         49       44     54       49      34
## 2    81         90         50       72     60       54      36
## 3    40         37         42       58     50       57      49
## 4    69         62         57       42     55       63      25
## 5    50         40         33       34     43       64      33
## 6    71         75         50       55     70       66      41
descCritical<-arrange(attitude, desc(critical))
head(descCritical)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    71         70         68       69     76       86      48
## 3    65         70         46       57     75       85      46
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    72         82         72       67     71       83      31
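The “arrange” function can also sort by more than one variable, with later variables breaking ties in earlier ones. The sketch below sorts ascending by “critical” and breaks any ties with “raises” in descending order.

head(arrange(attitude, critical, desc(raises)))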

The “mutate” function is useful for engineering features. In the code below, we will transform the “learning” variable by subtracting its mean from itself.

attitude<-mutate(attitude,learningtrend=learning-mean(learning))
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend
## 1    -17.366667
## 2     -2.366667
## 3     12.633333
## 4     -9.366667
## 5      9.633333
## 6    -12.366667
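It is worth noting that “mutate” can create several variables in a single call, and later arguments can even refer to variables created earlier in the same call. Below is a sketch; “raisestrend” and “positivetrend” are hypothetical names used only for illustration.

head(mutate(attitude, raisestrend=raises-mean(raises), positivetrend=raisestrend>0))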

You can also create logical variables with the “mutate” function. In the code below, we create a logical variable called “highCritical” that is TRUE when the “critical” variable is 80 or higher and FALSE when it is less than 80.

attitude<-mutate(attitude,highCritical=critical>=80)
head(attitude)
##   rates complaints privileges learning raises critical advance
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
##   learningtrend highCritical
## 1    -17.366667         TRUE
## 2     -2.366667        FALSE
## 3     12.633333         TRUE
## 4     -9.366667         TRUE
## 5      9.633333         TRUE
## 6    -12.366667        FALSE

The “group_by” function is used for creating summary statistics based on a specific variable. It is similar to the “aggregate” function in base R. For our purposes here, it works in combination with the “summarize” function. We will group our data by the “highCritical” variable, which means our data will be viewed as either TRUE or FALSE for “highCritical.” The results of this function will be saved in an object called “hcgroups.”

hcgroups<-group_by(attitude,highCritical)
head(hcgroups)
## # A tibble: 6 x 9
## # Groups:   highCritical [2]
##   rates complaints privileges learning raises critical advance
##   <dbl>      <dbl>      <dbl>    <dbl>  <dbl>    <dbl>   <dbl>
## 1    43         51         30       39     61       92      45
## 2    63         64         51       54     63       73      47
## 3    71         70         68       69     76       86      48
## 4    61         63         45       47     54       84      35
## 5    81         78         56       66     71       83      47
## 6    43         55         49       44     54       49      34
## # ... with 2 more variables: learningtrend <dbl>, highCritical <lgl>

Looking at the data you probably saw no difference. This is because we are not done yet. We need to summarize the data in order to see the results for our two groups in the “highCritical” variable.

We will now generate the summary statistics using the “summarize” function. We specifically want to know the mean of the “complaints” variable for each level of the “highCritical” variable. Below is the code.

summarize(hcgroups,complaintsAve=mean(complaints))
## # A tibble: 2 x 2
##   highCritical complaintsAve
##          <lgl>         <dbl>
## 1        FALSE      67.31579
## 2         TRUE      65.36364

Of course, you could have learned this through a t-test, but this is another approach.
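As a side note, “dplyr” verbs are commonly chained together with the %>% pipe operator, which passes the result of one function into the next. The grouped summary above could be sketched in a single step like this.

attitude %>%
  group_by(highCritical) %>%
  summarize(complaintsAve=mean(complaints))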

Conclusion

The “dplyr” package is a powerful tool for wrestling with data. There is nothing fundamentally new in this package; rather, the coding is simpler than what you can execute using base R.


Review of “Eye Wonder: Space”

In this post, we will take a look at the book Eye Wonder: Space (Eye Wonder) by Simon Holland (48 pp.).

The Summary

This book takes on a journey defining the various characteristics related to space. The journey begins on earth where you look at the stars. From there, the book talks about the moon, the sun, the planets of the solar system, the Milky Way, and places in space beyond our galaxy.

The Good

This book is rich in photos, which is consistent with its title. Students get to see Mars and asteroids, and even what life is like in space for humans. The book also offers explanations of the characteristics of various features of space. For example, it explains why Mercury is so hot, how stars die, as well as why Mars is red.

This text is definitely for individual reading. The way the text is set up and the pictures make it that way.

The Bad

One of the biggest problems with this text is the choice of font color. If the background is black, the font color is always white, which is acceptable. However, if the background is any other color, the font color is black. This often leads to problems with trying to read black font on the surface of red Mars, on a night sky filled with stars, or when looking at the deep blue Neptune.

There were also times when the text was probably too small for younger readers. However, the small text was normally used for details that did not affect the big picture.

The Recommendation

This book deserves 3/5 stars. It can provide some entertainment for one student or a small group of students. It can also provide supplemental information for both the teacher and the students. Add it to your library if you are looking to broaden the number of available books.

Teaching a Child to Read

Learning to read is in no way an easy experience. Reading at even the most basic level requires mastery of syntax, phonology, morphology, and semantics at a minimum. These are skills that we expect a child, normally under the age of 8, to show some proficiency in.

This post will explain a process for teaching reading to small children that worked. Of course, there is no claim here that this is the way but it does provide an example. When I began this experience I had been an educator for years at higher grades but had never actually taught anybody how to read. My training and experience have mostly been in improving reading comprehension skills.

The Process

The process I stumbled upon is as follows:

  1. Letter recognition
  2. Letter sound production
  3. Word family phonics
  4. Sight words
  5. Reading stories with support from steps 3 & 4

Each step builds on the steps before it.

Letter Recognition

The first step in this process was to have the child recognize the letters of the alphabet. This was done through the use of flashcards. In many ways, this was the easiest step. I thought it would take a year for a 4-year-old to learn this, but it only took 3-4 months.

Letter recognition relates to morphology, as letters are in many ways units that cannot be further divided. At this point, the learning experience is simply memorization, with no application.

Letter Sound Production

Once the alphabet was memorized, I exposed the student to the sounds of the letters. The student then had to reproduce the sound in addition to recognizing what letter it was.

This was much tougher. The student would either forget what letter it was or forget the sound or both. There was a lot of frustration. However, after several more months, we were ready to move on.

Letter sound production is an example of phonology or the understanding of the sounds letters make. This is a crucial step in learning to read.

Word Family Phonics

At this stage, we combined several letters and “sounded” them out to produce words. Often, the words used had the same ending, or morpheme, such as “-ap,” “-at,” “-ad,” etc., and only the first letter would change. This helps the student recognize patterns quickly, at least in theory.

There was also an introduction to vowels and other common morphemes. Looking back, I consider this a mistake, as it seemed to confuse the student. In addition, although phonics is valuable in learning to sound out words, I found that it lacked context, and reading “cap,” “tap,” and “map” outside the setting of some story was boring for the student.

Sight Words

Sight words are words that are so common in English that they need to be memorized. Often they cannot be sounded out because they violate the rules of phonology but this is not always the case.

There are two common systems of sight words: Dolch and Fry. In terms of which is better, it doesn’t really matter. I used Fry’s, and again I think the lack of context was a problem, as I was asking the student to learn words that lacked an immediate application.

Reading Stories

After about a year of preparatory training, we finally began reading stories. The stories were little short stories appropriate for kindergarteners. At first, it was difficult, but the student began to improve rapidly. It was much easier (usually) to get them to cooperate as well.

Conclusion 

The most important point is perhaps not the most obvious one. Despite my inexperience and mistakes in pedagogy, the student still learned to read. In many ways, the student learned to read in spite of me. This should be reassuring for many teachers. Even bad teaching can get good results if the aspects of planning, discipline, and commitment to success are there. Students seem to grow as long as they have some guidance.

I would say the most important thing in teaching reading is to actually make them read. Reading provides context and motivation, as the student can see what they cannot do. Studying all of the theoretical aspects of reading, such as phonics and letters, is only beneficial when the child knows they need to know this.

Therefore, if you are provided with an opportunity to teach a child to read, start with stories, and as they struggle, teach only what they are struggling with. For example, if they are having a hard time with the long “o” sound, reinforcing that with supplemental theoretical work will make sense to the child. As such, children learn best by doing rather than talking about what they will do.

Types of Rubrics for Writing

Grading essays, papers, and other forms of writing is subjective and frustrating for teachers at times. One tool that helps improve the consistency of the marking, as well as the speed, is the use of rubrics. In this post, we will look at three commonly used rubrics, which are…

  • Holistic
  • Analytical
  • Primary trait

Holistic Rubric

A holistic rubric looks at the overall quality of the writing. Normally, there are several levels on the rubric, and each level has several descriptors. Below is an example template.

The descriptors must be systematic, which means that they are addressed in each level and in the same order. Below is an actual holistic rubric for writing.

In the example above, there are four levels of marking. The descriptors are

  • idea explanation
  • coherency
  • grammar

Between levels, different adverbs and adjectives are used to distinguish the levels.  For example, in level one, “ideas are thoroughly explained” becomes “ideas are explained” in the second level. The use of adverbs is one of the easiest ways to distinguish between levels in a holistic rubric.

Holistic rubrics offer the convenience of fast marking that is easy to interpret and comes with high reliability. The downside is that there is a lack of strong feedback for improvement.

Analytical Rubrics

Analytical rubrics assign a score to each individual attribute the teacher is looking for in the writing. In other words, instead of lumping all the descriptors together as is done in a holistic rubric, each trait is given its own score. Below is a template of an analytical rubric.


You can see that the levels are across the top and the descriptors across the side. Best performance moves from left to right all the way to worst performance. Each level is assigned a range of potential point values.

Below is an actual analytical writing rubric.


Analytical rubrics provide much more washback and learning than holistic rubrics. Of course, they also take a lot more time for the teacher to complete.

Primary Trait

A lesser-known way of marking papers is the use of a primary trait rubric. With a primary trait rubric, the student is only assessed on one specific function of writing: for example, persuasion if they are writing an essay, or perhaps vocabulary use for an ESL student writing paragraphs.

The template would be similar to a holistic rubric except that there would only be one descriptor instead of several. The advantage of this is that it allows the teacher and the student to focus on one aspect of writing. Naturally, this can also be a disadvantage, as writing involves more than one specific skill.

Conclusion

Rubrics are useful for a variety of purposes. For writing, it is critical that you understand what the levels and descriptors are when deciding on what kind of rubric you want to use. In addition, the context affects which type of rubric to use as well.

Review of “The Usborne Book of Houses and Homes”

Houses and Homes (World Geography) by Carol Bowyer (32 pp.) provides insights into how people live all over the world.

The Summary

This book covers how people live in various climates and locations throughout the world. Living in water, living in caves, in icy places, and the jungle are just some of the examples from the text.

The text is not limited to just housing but also discusses the cultures of various people groups. Students learn about the Turcoman women of Iran making felt for their tents, the Huichol of Mexico grinding maize, and the hunting style of the Eskimos of Alaska to name a few.

The Good 

The multitude of illustrations is always a strength of books from Usborne. Students will be able to see how these people live. There are also activities provided in the book that the students can do. For example, they can play an Eskimo game, learn how to make good luck crosses like the Huichol, and learn how to make a tent.

The text is readable for older elementary students. Younger students would enjoy and learn a great deal from seeing the pictures. In many ways, there is a little bit for everybody in this text.

The Bad

Some of the illustrations are small, which relegates this book to the library of your classroom. With so much rich illustration, many kids can bypass reading and just learn through the pictures. This is only a problem if you are trying to get the kids to read. For more sensitive people, there is a little nudity, as the illustrator drew pictures of what the people actually wear or do not wear.

The Recommendation

I would give this book 3.5/5 stars. It’s great supplementary material for any social studies course. The activities provided are more for fun than learning. However, the visuals are excellent for exposing children to, and stimulating discussion about, how people live in the world today.

Guiding the Writing Process

How a teacher guides the writing process can depend on a host of factors. Generally, how you support a student at the beginning of the writing process is different from how you support them at the end. In this post, we will look at the differences between these two stages of writing.

The Beginning

At the beginning of writing, there are a lot of decisions that need to be made as well as extensive planning. Generally, at this point, grammar is not the deciding factor in terms of the quality of the writing. Rather, the teacher is trying to help the students to determine the focus of the paper as well as the main ideas.

The teacher needs to help the student focus on the big picture of the purpose of their writing. This means that only major issues are addressed, at least initially. You only want to point out potentially disastrous decisions rather than mundane details.

It is tempting to try and fix everything when looking at rough drafts. This not only takes up a great deal of your time but is also discouraging to students as they deal with intense criticism while still trying to determine what they truly want to do. As such, it is better to view your role at this point as a counselor or guide and not as a detail-oriented control freak.

At this stage, the focus is on the discourse and not so much on the grammar.

The End

At the end of the writing process, there is a move from general comments to specific concerns. As the student gets closer and closer to the final draft the “little things” become more and more important. Grammar comes to the forefront. In addition, referencing and the strength of the supporting details become more important.

Now is the time to get “picky.” This is because major decisions have been made, and the cognitive load of fixing the small stuff is less stressful once the core of the paper is in place. The analogy I like to give is that first you build the house, which involves lots of big movements such as pouring a foundation, adding walls, and including a roof. This is the beginning of writing. The end of building a house includes more refined aspects such as painting the walls, adding the furniture, etc. This is the end of the writing process.

Conclusion

For writers and teachers, it is important to know where they are in the writing process. In my experience, it often seems as if writing is all about grammar from the beginning, when this is not necessarily the case. At the beginning of a writing experience, the focus is on ideas. At the end of a writing experience, the focus is on grammar. The danger is always in trying to do too much at the same time.

Academic vs Applied Research

Academic and applied research are perhaps the only two ways that research can be performed. In this post, we will look at the differences between these two perspectives on research.

Academic Research

Academic research falls into two categories. These two categories are

  • Research ON your field
  • Research FOR your field

Research ON your field is research that searches for best practice. It looks at how your academic area is practiced in the real world. A scholar will examine how well a theory is being applied or used in a real-world setting and make recommendations.

For example, in education, if a scholar does research in reading comprehension, they may want to determine some of the most appropriate strategies for teaching reading comprehension. The scholar will look at existing theories and see which one(s) are most appropriate for supporting students.

Research ON your field is focused on existing theories that are tested with the goal of developing recommendations for improving practice.

Research FOR your field is slightly different. This perspective seeks to expand theoretical knowledge about your field. In other words, the scholar develops new theories rather than assessing the application of older ones.

An example of this in education would be developing a new theory of reading comprehension. By theory is meant an explanation. Famous theories in education include Piaget’s stages of development, Kohlberg’s stages of moral development, and more. In its time, each of these theories pushed the boundaries of our understanding of something.

The main thing about academic research is that it leads to recommendations but not necessarily to answers that solve problems. Answering problems is something that is done with applied research.

Applied Research

Applied research is also known as research IN your field. This type of research is often performed by practitioners in the field.

There are several forms of research IN your field, and they are as follows:

  • Formative
  • Monitoring
  • Summative

Formative research is for identifying problems. For example, a teacher may notice that students are not performing well or not doing their homework. Formative applied research is when the detective hat is put on and the teacher begins to search for the cause of this behavior.

The results of formative research lead to some sort of action plan to solve the problem. Monitoring applied research is then conducted during the implementation of the solution to see how things are going.

For example, if the teacher discovers that students are struggling with reading because they are struggling with phonological awareness, they may implement a review program for this skill. Monitoring would involve assessing students’ reading performance during the program.

Summative applied research is conducted at the end of implementation to see if the objectives of the program were met. Returning to the reading example, if the teacher’s objective was to improve reading comprehension scores by 10%, the summative research would assess how well the students can now read and whether there was a 10% improvement.

In education, applied research is also known as action research.

Conclusion

Research can serve many different purposes. Academics focus on recommendations rather than action, while practitioners want to solve problems and perhaps not make as many recommendations. The point is that understanding what type of research you are trying to conduct can help you shape the direction of your study.

Review of “A Child’s History of the World”

The history textbook A Child’s History of the World by V.M. Hillyer (pp. 432) was originally written almost 100 years ago. Since then it has been revised and expanded by several other authors. This review is based on the 2014 edition of the text.

The Summary

This textbook is a survey of world history written at the comprehension level of a child. As with most surveys, the text covers a little bit of everything. Examples of topics in the book include the Egyptian, Jewish, Greek, Roman, African, and British civilizations, and even the rise of the US and USSR. Naturally, many of the major wars of the past 5,000 years are covered as well.

Famous characters from history discussed in the book range from Alexander the Great to Jesus Christ, as well as Emperor Constantine and even Richard Wagner, the famous German composer of the 19th century.

The Good 

For a child’s book, there is a surprising amount of detail. For example, the book explains Zoroastrianism, which was the religion of the Medo-Persian empire. How many students today are familiar with such a topic? In addition, the text is written in an easy-to-read format.

The chapters are short, which is critical for young readers. There is also support with pronouncing various words that may be unusual to a Western student. There are also some illustrations throughout the book.

The Bad

Given its age (almost 100 years), the pedagogical approach of the book is outdated. It’s heavy on text and light on illustrations. Furthermore, the book lacks any sort of learning tools common in today’s textbooks, such as inserts, vocabulary words, questions, discussion items, etc. It is literally just text.

At the time it was written, this text could probably be read by a small child. Today, however, the writing style would probably be more appropriate for high school, as in-depth reading is not as common as it once was. With so much text, it is almost impossible to read this to a class. My students became extremely bored and antsy when I attempted this, even though a chapter is at times only three pages in length. I had to scrap reading it aloud and try another way to teach historical concepts. As such, both whole-class and individual reading of this text is difficult because peoples’ reading habits have changed since the Depression.

The Recommendation

I would give this book 1.5/5 stars. It needs significant pedagogical support in order to be effective in the 21st-century classroom. The teacher would need to prepare support materials in order to help students with understanding the text. All textbooks require scaffolding support from the teacher but this book requires an extraordinary amount of help to provide learning experiences.

However, this book could be useful as a resource for a teacher who needs additional knowledge to teach history to children. In addition, if a regular textbook is already in use then A Child’s History of the World could serve as supplementary material that would allow the class to go deeper on a particular topic. The days of this text being the main source on history for children are probably over.

Types of Writing

This post will look at several types of writing that are done for assessment purposes. In particular, we will look at this from the four levels of writing, which are

  • Imitative
  • Intensive
  • Responsive
  • Extensive

Imitative 

Imitative writing is focused strictly on the grammatical aspects of writing. The student simply reproduces what they see. This is a common way to teach children how to write. Additional examples of activities at this level include cloze tasks in which the student has to write the word in the blank from a list, spelling tests, matching, and even converting numbers to their word equivalents.

Intensive

Intensive writing is more concerned with selecting the appropriate word for a given context. Example activities include grammatical transformation, such as changing all verbs to past tense, sequencing pictures, describing pictures, completing short sentences, and ordering tasks.

Responsive 

Responsive writing involves the development of sentences into paragraphs. The focus is almost exclusively on the context or function of the writing. Form concerns are primarily at the discourse level, which means how the sentences work together to make paragraphs and how the paragraphs work to support a thesis statement. Writing is normally no more than 2-3 paragraphs at this level.

Example activities at the responsive level include short reports, interpreting visual aids, and summaries.

Extensive

Extensive writing is responsive writing over the course of an entire essay or research paper. The student is able to shape a purpose, objectives, main ideas, conclusions, etc., into a coherent paper.

For many students, this is exceedingly challenging in their mother tongue and is further exacerbated in a second language. Writing at this level also normally involves multiple drafts of a single paper.

Marking Responsive & Extensive Papers

Marking higher-level papers requires a high degree of subjectivity. This is because of the authentic nature of this type of assessment. As such, it is critical that the teacher communicate expectations clearly through the use of rubrics or some other form of communication.

Another challenge is the issue of time. Higher-level papers take much more time to develop. This means that they normally cannot be used as a form of in-class assessment. If they are, the authenticity of the assessment decreases.

Conclusion

Writing is a critical component of the academic experience. Students need to learn how to shape and develop their ideas in print. For teachers, it is important to know at what level the student is capable of writing in order to support them for further growth.

Analyzing Twitter Data in R

In this post, we will look at analyzing tweets from Twitter using R. Before beginning, if you plan to replicate this on your own, you will need to set up a developer account with Twitter. Below are the steps.

Twitter Setup

  1. Go to https://dev.twitter.com/apps
  2. Create a twitter account if you do not already have one
  3. Next, you want to click “create new app”
  4. After entering the requested information, be sure to keep the following information for R: consumer key, consumer secret, request token URL, authorize URL, access token URL

The instructions here are primarily for Linux users. If you are using a Windows machine, you also need to download a cacert.pem file. Below is the code.

download.file(url='http://curl.haxx.se/ca/cacert.pem',destfile='/YOUR_LOCATION/cacert.pem')

You need to save this file where it is appropriate. Below we will begin the analysis by loading the appropriate libraries.

R Setup

library(twitteR);library(ROAuth);library(RCurl);library(tm);library(wordcloud)

Next, we need to use all of the application information we generated when we created the developer account at Twitter. We will save the information in objects to use in R. In the example code below, “XXXX” is used where you should provide your own information, since sharing this kind of information would allow others to use my Twitter developer account. Below is the code.

my.key<-"XXXX" #consumer key
my.secret<-"XXXX" #consumer secret
my.accesstoken<-'XXXX' #access token
my.accesssecret<-'XXXX' ##access secret

Some of the information we just stored now needs to be passed to the “OAuthFactory” function of the “ROAuth” package. We will be passing the “my.key” and “my.secret”. We also need to add the request URL, access URL, and auth URL. Below is the code for all this.

cred<-OAuthFactory$new(consumerKey=my.key,consumerSecret=my.secret,requestURL='https://api.twitter.com/oauth/request_token',
                       accessURL='https://api.twitter.com/oauth/access_token',authURL='https://api.twitter.com/oauth/authorize')

If you are a Windows user, you need the cacert.pem file from above. Use “cred$handshake(cainfo='LOCATION OF CACERT.PEM')” to complete the setup process. Make sure to save your authentication and then use “registerTwitterOAuth(cred)” to finish. For Linux users, the code is below.

setup_twitter_oauth(my.key, my.secret, my.accesstoken, my.accesssecret)

Data Preparation

We can now begin the analysis. We are going to search Twitter for the term “data science.” We will look for 1,500 of the most recent tweets that contain this term. To do this we will use the “searchTwitter” function. The code is as follows.

ds_tweets<-searchTwitter("data science",n=1500)

We now need to do some things that are a little more complicated. First, we need to convert our “ds_tweets” object to a dataframe. This is just to save our search so we don’t have to run it again. The code below performs this.

ds_tweets.df<-do.call(rbind,lapply(ds_tweets,as.data.frame))
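The do.call/rbind idiom above can be seen on a toy list; the sample strings here are made up purely for illustration.

```r
# lapply turns each list element into a one-row data frame,
# then do.call(rbind, ...) stacks the rows into a single data frame
toy.list <- list("first tweet", "second tweet", "third tweet")
toy.df <- do.call(rbind, lapply(toy.list, function(x) data.frame(text = x)))
nrow(toy.df)  # 3
```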

Second, we need to pull all the text out of our “ds_tweets” object and convert it into a list. We will use the “sapply” function along with the “getText” method. Below is the code.

ds_tweets.list<-sapply(ds_tweets,function(x) x$getText())

Third, we need to turn our “ds_tweets.list” into a corpus.

ds_tweets.corpus<-Corpus(VectorSource(ds_tweets.list))  

Now we need to do a lot of cleaning of the text. In particular, we need to

  • make all words lower case
  • remove punctuation
  • get rid of special characters (i.e. #, /, etc.)
  • remove stopwords (words that lack meaning, such as “the”)

To do this we need to use a combination of functions in the “tm” package as well as some custom functions.

ds_tweets.corpus<-tm_map(ds_tweets.corpus,removePunctuation)
removeSpecialChars <- function(x) gsub("[^a-zA-Z0-9 ]","",x)#remove garbage terms
ds_tweets.corpus<-tm_map(ds_tweets.corpus,removeSpecialChars) #application of custom function
ds_tweets.corpus<-tm_map(ds_tweets.corpus,function(x) removeWords(x,stopwords())) #removes stop words
ds_tweets.corpus<-tm_map(ds_tweets.corpus,tolower)
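As a check, the same cleaning steps can be sketched on a single made-up tweet using only base R functions; the example string below is hypothetical.

```r
tweet <- "Big #DataScience news!! Check http://example.com :)"
tweet <- gsub("[[:punct:]]", "", tweet)    # remove punctuation
tweet <- gsub("[^a-zA-Z0-9 ]", "", tweet)  # mirrors the removeSpecialChars step
tweet <- tolower(tweet)                    # make all words lower case
tweet <- trimws(tweet)                     # tidy up stray whitespace
tweet  # "big datascience news check httpexamplecom"
```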

Data Analysis

Now we can make a word cloud for fun.

wordcloud(ds_tweets.corpus)

We now need to convert our corpus to a matrix for further analysis. In addition, we need to remove sparse terms as this reduces the size of the matrix without losing much information. The value to set it to is at the discretion of the researcher. Below is the code

ds_tweets.tdm<-TermDocumentMatrix(ds_tweets.corpus)
ds_tweets.tdm<-removeSparseTerms(ds_tweets.tdm,sparse = .8)#remove sparse terms

We’ve looked at how to find the most frequent terms in another post. Below is the code for finding the words that appear at least 15 times.

findFreqTerms(ds_tweets.tdm,15)
##  [1] "datasto"      "demonstrates" "download"     "executed"    
##  [5] "hard"         "key"          "leaka"        "locally"     
##  [9] "memory"       "mitchellvii"  "now"          "portable"    
## [13] "science"      "similarly"    "data"
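The frequency-threshold idea behind “findFreqTerms” can be sketched in base R on a toy word vector.

```r
# count each word, then keep the names of words at or above a threshold
words <- c(rep("data", 4), rep("science", 3), "hard")
freqs <- table(words)
names(freqs[freqs >= 3])  # terms occurring at least 3 times: "data" "science"
```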

Below are words that are highly correlated with the term “key”.

findAssocs(ds_tweets.tdm,'key',.95)
## $key
## demonstrates     download     executed        leaka      locally 
##         0.99         0.99         0.99         0.99         0.99 
##       memory      datasto         hard  mitchellvii     portable 
##         0.99         0.98         0.98         0.98         0.98 
##    similarly 
##         0.98

For the final trick, we will make a hierarchical agglomerative cluster. This will clump words that are more similar next to each other. We first need to convert our current “ds_tweets.tdm” into a regular matrix. Then we need to scale it because the distances need to be standardized. Below is the code.

ds_tweets.mat<-as.matrix(ds_tweets.tdm)
ds_tweets.mat.scale<-scale(ds_tweets.mat)

Now, we need to calculate the distances between the terms.

ds_tweets.dist<-dist(ds_tweets.mat.scale,method = 'euclidean')

At last, we can make the clusters.

ds_tweets.fit<-hclust(ds_tweets.dist,method = 'ward.D') #newer versions of R use 'ward.D' rather than 'ward'
plot(ds_tweets.fit)


Looking at the chart, it appears we have six main clusters. We can highlight them using the code below.

plot(ds_tweets.fit)
groups<-cutree(ds_tweets.fit,k=6)
rect.hclust(ds_tweets.fit,k=6)
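The scale-dist-cluster pipeline can be verified on a small random matrix; the toy data below stands in for the term-document matrix and uses base R only.

```r
set.seed(42)
toy.mat <- matrix(rnorm(40), nrow = 8)           # 8 "terms" across 5 "documents"
toy.scaled <- scale(toy.mat)                     # standardize each column
toy.dist <- dist(toy.scaled, method = "euclidean")
toy.fit <- hclust(toy.dist, method = "ward.D")   # newer R uses "ward.D" for Ward's method
toy.groups <- cutree(toy.fit, k = 3)             # assign each "term" to one of 3 clusters
length(unique(toy.groups))  # 3
```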


Conclusion

This post provided an example of how to pull data from Twitter for text analysis. There are many steps, but useful insights can be gained from this sort of research.

Review of “First Encyclopedia of the Human Body”

The First Encyclopedia of the Human Body (First Encyclopedias) by Fiona Chandler (pp. 64) provides insights into science for young children.

The Summary
This book explains all of the major functions of the human body as well as some aspects of health and hygiene. Students will learn about the brain, heart, hormones, where babies come from, as well as healthy eating and visiting the doctor.

The Good
This book is surprisingly well-written. The author was able to take the complexities of the human body and word them in a way that a child can understand. In addition, the illustrations are rich and interesting. For example, there are pictures of an infrared scan of a child’s hands, x-rays of broken bones, as well as pictures of people doing things with their bodies such as running or jumping.

There is also a good mix of small and large photos which allows this book to be used individually or for whole class reading. The large size of the text also allows for younger readers to appreciate not only the pictures but also the reading.

There are also several activities at different places in the book. For example, students are invited to take their pulse, determine how much air is in their lungs, and test their sense of touch.

In every section of the book, there are links to online activities as well. It seems as though this book has every angle covered in terms of learning.

The Bad
There is little to criticize in this book. It’s a really fun text. Perhaps if you are an expert in the human body you may find things that are disappointing. However, for a layman called to teach young people science, this text is more than adequate.

The Recommendation
I would give this book 5/5 stars. My students loved it and I was able to use it in so many different ways to build activities and discussions. I am sure that the use of this book would be beneficial to almost any teacher in any classroom.

Reading Assessment at the Interactive and Extensive Level

In reading assessment, the interactive and extensive levels are the highest levels of reading. This post will provide examples of assessments at each of these two levels.

Interactive Level

Reading at this level is focused on both form and meaning of the text, with an emphasis on top-down processing. Below are some assessment examples.

Cloze

Cloze assessment involves removing certain words from a paragraph and expecting the student to supply them. Words are removed either at fixed intervals (every nth word, aka fixed-ratio deletion) or based on their meaning (aka rational deletion).

In terms of marking, you have the choice of marking based on the student providing the exact wording or an appropriate wording. Exact wording is strict but consistent, while appropriate wording can be subjective.

Read and Answer the Question

This is perhaps the most common form of reading assessment. The student simply reads a passage and then answers questions in T/F, multiple choice, or some other format.

Information Transfer

Information transfer involves students interpreting something. For example, they may be asked to interpret a graph and answer some questions. They may also be asked to elaborate on the graph, make predictions, or explain. Explaining a visual is a common requirement for the IELTS.

Extensive Level

This is the highest level of reading. It is strictly top-down and requires the ability to see the “big picture” within a text. Marking at this level is almost always subjective.

Summarize and React

Summarizing and reacting requires the student to read a large amount of information, share the main ideas, and then provide their own opinion on the topic. This is difficult, as the student must understand the text to a certain extent and then form an opinion about what they understand.

I also like to have my students write several questions they have about the text. This teaches them to identify what they do not know. These questions are then shared in class so that they can be discussed.

For marking purposes, you can provide directions about the number of words, paragraphs, etc. to provide guidance. However, marking at this level of reading is still subjective. The primary purpose of marking should probably be evidence that the student read the text.

Conclusion

The interactive and extensive levels of reading are where teaching can become enjoyable. Students have moved beyond just learning to read to reading to learn. This opens up many possibilities in terms of learning experiences.

Reading Assessment at the Perceptual and Selective Level

This post will provide examples of assessments that can be used for reading at the perceptual and selective level.

Perceptual Level

The perceptual level is focused on bottom-up processing of text. Comprehension ability is not critical at this point. Rather, you are just determining if the student can accomplish the mechanical process of reading.

Examples

Reading Aloud-How this works is probably obvious to most teachers. The students read a text out loud in the presence of an assessor.

Picture-Cued-Students are shown a picture with words at the bottom. The students read a word and point to a visual example of it in the picture. For example, if the picture has a cat in it, the word cat would appear at the bottom. The student would read the word cat and point to the actual cat in the picture.

This can be extended by using sentences instead of words. For example, if the picture shows a man driving a car, a sentence at the bottom might say “a man is driving a car”. The student would then point to the man in the picture who is driving.

Another option is T/F statements. Using our cat example from above, we might write that “There is one cat in the picture” and the student would then select T/F.

Other Examples-These include multiple-choice and written short answer.

Selective Level

The selective level is the next level above perceptual. At this level, the student should be able to recognize various aspects of grammar.

Examples

Editing Task-Students are given a reading passage and are asked to fix the grammar. This can happen many different ways. They could be asked to pick the incorrect word in a sentence or to add or remove punctuation.

Pictured-Cued Task-This task appeared at the perceptual level. Now it is more complicated. For example, the students might be required to read statements and label a diagram appropriately, such as the human body or aspects of geography.

Gap-Filling Task-Students read a sentence and complete it appropriately.

Other Examples-These include multiple-choice and matching. The multiple-choice may focus on grammar, vocabulary, etc. Matching attempts to assess a student’s ability to pair similar items.

Conclusion

Reading assessment can take many forms. The examples here provide ways to assess students who are still highly immature in their reading abilities. As fluency develops, more complex measures can be used to determine a student’s reading capability.

Review of “See How It’s Made”

This is a review of the book See How It’s Made written by Penny Smith and Lorrie Mack.

The Summary

This book takes several everyday products such as ice cream, CDs, t-shirts, crayons, etc. and illustrates the process of how each item is made. The authors take you into the factory where these products are produced and show you, through the use of photographs, how each item is made. It can be surprising even for teachers to learn how much work goes into making CDs or apple juice.

The Good

The photo rich environment of the text makes it as realistic as possible. In addition, choosing common everyday items really helps in relevancy for students. Many kids find it interesting to know how pencils and crayons are made. The book is truly engaging at least in a one-on-one situation.

The Bad

The text is small in this book. This would make reading it difficult for younger students. In addition, although I appreciate the photos, there are so many jammed onto a single page that it would be difficult to share this book with an entire class. This leaves the book for use only in the class library for individual students.

Lastly, kids learn a lot of relevant, interesting things, but there seems to be no overall point to the text. It is just a collection of different processes for making things. It is left to the teacher to come up with a reason for reading this book.

The Recommendation

This book is 3/5 stars. It’s a great text in terms of visual stimulus, but it can be difficult to read and lacks a sense of direction.

Types of Reading in ESL

Reading for comprehension involves two forms of processing which are bottom-up and top-down. Bottom-up processing involves pulling letters together to make words, words to make sentences, etc. This is most commonly seen as students sounding out words when they read. The goal is primarily to just read the word.

Top-down processing is the use of prior knowledge, usually organized as schemas in the mind to understand what is being read. For example, after a student reads the word “cat” using bottom-up processing they then use top-down processing of what they know about cats such as their appearance, diet, habits, etc.

These two processes work together in order for us to read. Generally, they happen simultaneously as we are frequently reading and using our background knowledge to understand what we are reading.

In the context of reading, there are four types of reading, from simplest to most complex:

  • Perceptive
  • Selective
  • Interactive
  • Extensive

We will now look at each in detail.

Perceptive

Perceptive reading is focused primarily on bottom-up processing. In other words, if a teacher is trying to assess this type of reading, they simply want to know if the student can read or not. The ability to understand or comprehend the text is not the primary goal at this level.

Selective

Selective reading involves looking at a reader’s ability to recognize grammar, discourse features, etc. This is done with brief paragraphs and short reading passages. Assessment involves standard items such as multiple-choice, short answer, true/false, etc.

In order to be successful at this level, the student needs to use both bottom-up and top-down processing. Charts and graphs can also be employed.

Interactive

Interactive reading involves deriving meaning from the text. This places even more emphasis on top-down processing. Readings are often chosen from genres that employ implied rather than stated main ideas. The readings are also more authentic in nature and can include announcements, directions, recipes, etc.

Students who lack background knowledge will struggle with this type of reading regardless of their language ability. In addition, inability to think critically will impair performance even if the student can read the text.

Extensive

Extensive reading is reading large amounts of information and being able to understand the “big picture”. The student needs to be able to separate the details from the main ideas. Many students struggle with this in their native language. As such, this is even more difficult when students are trying to digest large amounts of information in a second language.

Conclusion

Reading is a combination of making sense of the words and using prior knowledge to comprehend text. The levels of reading vary in their difficulty. In order to have success at reading, students need to be exposed to many different experiences in order to have the background knowledge they need that they can call on when reading something new.

Diversity and Lexical Dispersion Analysis in R

In this post, we will learn how to conduct a diversity and lexical dispersion analysis in R. Diversity analysis is a measure of the breadth of an author’s vocabulary in a text. R provides several calculations of this in its output.

Lexical dispersion is used for knowing where or when certain words are used in a text. This is useful for identifying patterns if this is a goal of your data exploration.

We will conduct our two analyses by comparing two famous philosophical texts:

  • Analects
  • The Prince

These books are available at the Gutenberg Project. You can go to the site, type in the titles, and download them to your computer.

We will use the “qdap” package in order to complete the analysis. Below is some initial code.

library(qdap)

Data Preparation

Below are the steps we need to take to prepare the data

  1. Paste the text files into R
  2. Convert the text files to ASCII format
  3. Convert the ASCII format to data frames
  4. Split the sentences in the data frame
  5. Add a variable that indicates the book name
  6. Combine the two books into one dataframe

We now need to prepare the two texts. First, we move them into R using the “paste” function.

analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")
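What the “scan”/“paste” combination accomplishes can be seen on a small character vector, with no file needed; the sample words are made up.

```r
# scan() returns a character vector of words; paste(collapse = " ")
# joins them into a single long string
words <- c("The", "Master", "said")
one.string <- paste(words, collapse = " ")
one.string  # "The Master said"
```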

We must convert the text files to ASCII format so that R is able to interpret them.

analects<-iconv(analects,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")
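A quick made-up example shows what “iconv” does with a non-ASCII character: the empty string argument substitutes nothing for characters that do not survive the conversion.

```r
txt <- "caf\xe9 culture"            # "café" with a latin1-encoded accent
clean <- iconv(txt, "latin1", "ASCII", "")  # the accented character is dropped
clean  # "caf culture"
```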

For each book, we need to make a dataframe. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code.

analects<-data.frame(texts=analects)
prince<-data.frame(texts=prince)

With the dataframes completed, we can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.

analects<-sentSplit(analects,'texts')
prince<-sentSplit(prince,'texts')

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be valuable for comparative purposes.

analects$book<-"analects"
prince$book<-"prince"

Now we combine the two books into one dataframe. The data preparation is now complete.

twobooks<-rbind(analects,prince)
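The combining step can be checked with two toy dataframes that mirror the structure above; the sentence strings are placeholders.

```r
# rbind() stacks dataframes that share the same columns
a <- data.frame(texts = c("s1", "s2"), book = "analects")
p <- data.frame(texts = "s3", book = "prince")
both <- rbind(a, p)
table(both$book)  # analects: 2, prince: 1
```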

Data Analysis

We will begin with the diversity analysis

div<-diversity(twobooks$texts,twobooks$book)
div
##       book    wc simpson shannon collision berger_parker brillouin
## 1 analects 30995   0.989   6.106     4.480         0.067     5.944
## 2   prince 52105   0.989   6.324     4.531         0.059     6.177

For most of the metrics, the diversity in the use of vocabulary is the same despite being different books from different eras in history. How these numbers are calculated is beyond the scope of this post.
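For the curious, the Shannon index is one of the simpler metrics: it is computed from the proportion of each word in the text. Below is a minimal base R sketch on a toy word vector.

```r
# Shannon diversity: H = -sum(p * log(p)), where p is each word's proportion
words <- c("the", "the", "master", "said", "said", "said")
p <- table(words) / length(words)
H <- -sum(p * log(p))
round(H, 3)  # 1.011
```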

Next, we will calculate the lexical dispersion of the two books. We will look at three common themes in history: money, war, and marriage.

dispersion_plot(twobooks$texts,grouping.var=twobooks$book,c("money","war",'marriage'))


The tick marks show where each word appears. For example, money appears at the beginning of Analects only but is more spread out in The Prince. War is evenly dispersed in both books, and marriage only appears in The Prince.

Conclusion

This analysis showed additional tools that can be used to analyze text in R.

Review of “The Great Wall of China”

In this post, we will look at another book that I have used as a K-12 teacher, The Great Wall Of China (Aladdin Picture Books) by Leonard Everett Fisher (pp. 31).

The Summary

The title clearly lets you know what the book is about. It focuses on Ch’in Shih Huang Ti and his quest to build a wall that would protect his empire from the Mongols. According to the text, Ch’in Shih Huang Ti was the first supreme emperor of China as he conquered several other small kingdoms to make what we now know as China.

The book depicts how the Mongols were coming and burning down border villages in China and how the Emperor plans and builds the wall. Men were dragged from their families to go and work on the wall. The Emperor even sent his oldest son and crown prince to help build the wall.

The project was a combination of building a new wall while also restoring walls that were in disrepair. Workers who complained or ran away were buried alive. It took a total of ten years to complete what is now called the Great Wall of China.

The Good 

My favorite aspect of the book is the iconic black and white drawings depicting ancient China. The stern look of the Emperor and the soldiers reminds me of the toughness of characters in old western movies. Nobody smiles in the book until the last page, when the Emperor is rejoicing over the completion of the wall.

The book doesn’t include a lot of text. Rather, the pictures do the majority of the storytelling. The pictures are large enough that you can use this book for a whole-class reading experience where the kids sit around you as you read the text and show them the pictures.

The author also did an excellent job of simplifying the complexity of the building of the Great Wall into a few pages for young children. For example, there is much more to the Emperor’s son being sent to help build the Great Wall. However, the author reduces this complex problem down to the accusation that the Emperor thought his son was a “whiner.”

The Bad

I can’t say there is anything bad in this text, as it depends on what your purpose is for buying the book. There is not a lot of text, as it is primarily picture-based. If you want your students to read on their own, there is not much to read. For those who have a background in Chinese history, the text may be oversimplified.

The Recommendation

I would give this book 4.5/5 stars. Whether for your library or for sharing with your entire class this book will provide a great learning experience about a part of history that is normally not studied as much as it should be.

Assessing Speaking in ESL

In this post, we will look at different activities that can be used to assess a language learner’s speaking ability. Unfortunately, we will not go over how to mark or grade the activities; we will only provide examples.

Directed Response

In this activity, the teacher tries to have the student use a particular grammatical form by having the student modify something the teacher says. Below is an example.

Teacher: Tell me he went home
Student: He went home

This is obviously not deep. However, the student had to know to remove the words “tell me” from the sentence and they also had to know that they needed to repeat what the teacher said. As such, this is an appropriate form of assessment for beginning students.

Read Aloud

Read aloud is simply having the student read a passage verbatim out loud. Normally, the teacher will assess such things as pronunciation and fluency. There are several problems with this approach. First, reading aloud is not authentic, as it is not an in-demand skill in today’s workplace. Second, it blends reading with speaking, which can be a problem if you do not want to assess both at the same time.

Oral Questionnaires 

Students are expected to respond and/or complete sentences. Normally, there is some sort of setting, such as a mall, school, or bank, that provides the context or pragmatics. Below is an example in which a student has to respond to a bank teller. The blank lines indicate where the student would speak.

Teacher (as bank teller): Would you like to open an account?
Student:_______________________
Teacher (as bank teller): How much would you like to deposit?
Student:___________________________

Visual Cues

Visual cues are highly open-ended. For example, you can give the students a map and ask them to give you directions to a location on the map. In addition, students can describe things in the picture or point to things as you ask them to. You can also ask the students to make inferences about what is happening in a picture. Of course, all of these choices are difficult to grade and may be best suited for formative assessment.

Translation

Translating can be a highly appropriate skill to develop in many contexts. In order to assess this, the teacher provides a word, phrase, or perhaps something more complicated, such as directly translating their speech. The student then takes the input and reproduces it in the second language.

This is tricky to do. For one, it is required to be done on the spot, which is challenging for anybody. In addition, this also requires the teacher to have some mastery of the student’s mother tongue, which for many is not possible.

Other Forms

There are many more examples that cannot be covered here, including interviews, role play, and presentations. However, these are much more common forms of speaking assessment, so most teachers are already familiar with them.

Conclusion

Speaking assessment is a major component of the ESL teaching experience. The ideas presented here will hopefully provide some additional ways that this can be done.

Readability and Formality Analysis in R

In this post, we will look at how to assess the readability and formality of a text using R. By readability, we mean the use of a formula that will provide us with the grade level at which the text is roughly written. This is highly useful information in the fields of education and even medicine.

Formality provides insights into how the text relates to the reader. The more formal the writing the greater the distance between author and reader. Formal words are nouns, adjectives, prepositions, and articles while informal (contextual) words are pronouns, verbs, adverbs, and interjections.

The F-measure counts and calculates a score of formality based on the proportions of the formal and informal words.

We will conduct our two analyses by comparing two famous philosophical texts:

  • Analects
  • The Prince

These books are available at the Gutenberg Project. You can go to the site, type in the titles, and download them to your computer.

We will use the “qdap” package in order to complete the analysis. Below is some initial code.

library(qdap)

Data Preparation

Below are the steps we need to take to prepare the data

  1. Paste the text files into R
  2. Convert the text files to ASCII format
  3. Convert the ASCII format to data frames
  4. Split the sentences in the data frame
  5. Add a variable that indicates the book name
  6. Combine the two books into one dataframe

We now need to prepare the two texts. The “paste” function will move the text into the R environment.

analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")

The texts need to be converted to ASCII format, and the code below does this.

analects<-iconv(analects,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")

For each book, we need to make a dataframe. The argument “texts” gives our dataframe one variable called “texts” which contains all the words in each book. Below is the code.

analects<-data.frame(texts=analects)
prince<-data.frame(texts=prince)

With the dataframes completed, we can now split the variable “texts” in each dataframe by sentence. The “sentSplit” function will do this.

analects<-sentSplit(analects,'texts')
prince<-sentSplit(prince,'texts')

Next, we add the variable “book” to each dataframe. What this does is that for each row or sentence in the dataframe the “book” variable will tell you which book the sentence came from. This will be useful for comparative purposes.

analects$book<-"analects"
prince$book<-"prince"

Lastly, we combine the two books into one dataframe. The data preparation is now complete.

twobooks<-rbind(analects,prince)

Data Analysis

We will begin with readability. The “automated_readability_index” function will calculate this for us.

ari<-automated_readability_index(twobooks$texts,twobooks$book)
ari
##       book word.count sentence.count character.count Automated_Readability_Index
## 1 analects      30995           3425          132981                       3.303
## 2   prince      52105           1542          236605                      16.853

Analects is written at a third-grade level, but The Prince is written at grade 16, which is equivalent to a senior in college. As such, The Prince is a challenging book to read.
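The values in the last column can be reproduced by hand from the counts in the table. The Automated Readability Index is defined as 4.71 * (characters/words) + 0.5 * (words/sentences) - 21.43:

```r
# ARI formula applied to the character, word, and sentence counts above
ari_score <- function(chars, words, sentences) {
  4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43
}
round(ari_score(132981, 30995, 3425), 3)  # analects: 3.303
round(ari_score(236605, 52105, 1542), 3)  # prince: 16.853
```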

Next, we will calculate the formality of the two books. The “formality” function is used for this.

form<-formality(twobooks$texts,twobooks$book)
form
##       book word.count formality
## 1   prince      52181     60.02
## 2 analects      31056     58.36

Both books are mildly formal. The code below gives you the breakdown of word use by percentage.

form$form.prop.by
##       book word.count  noun  adj  prep articles pronoun  verb adverb
## 1 analects      31056 25.05 8.63 14.23     8.49   10.84 22.92   5.86
## 2   prince      52181 21.51 9.89 18.42     7.59   10.69 20.74   5.94
##   interj other
## 1   0.05  3.93
## 2   0.00  5.24

The proportions are consistent when the two books are compared. Below is a visual of the table we just examined.
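The formality score itself comes straight from this table. qdap uses Heylighen and Dewaele’s F-measure, F = 50 * ((formal - contextual)/100 + 1), where “formal” is the summed percentage of nouns, adjectives, prepositions, and articles, and “contextual” is the summed percentage of pronouns, verbs, adverbs, and interjections. Applying it to the rows above reproduces the scores reported earlier:

```r
# Heylighen & Dewaele F-measure from part-of-speech percentages
formality_score <- function(noun, adj, prep, article, pronoun, verb, adverb, interj) {
  formal <- noun + adj + prep + article
  contextual <- pronoun + verb + adverb + interj
  50 * ((formal - contextual) / 100 + 1)
}
formality_score(25.05, 8.63, 14.23, 8.49, 10.84, 22.92, 5.86, 0.05)  # analects: ~58.36
formality_score(21.51, 9.89, 18.42, 7.59, 10.69, 20.74, 5.94, 0.00)  # prince: ~60.02
```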

plot(form)


Conclusion

Readability and formality are additional text-mining tools that can provide insights for data scientists. Both analyses can suggest ways to enhance communication or to compare different authors and writing styles.

Review of “The Usborne Book of World History”

As a teacher, I have used hundreds of books in my career to instruct and guide students. As such, I decided I would share my thoughts on some of these books to provide other educators with insights into potential instructional materials.

The Summary

“The Usborne Book of World History” covers, as you can probably guess, the history of the world from the beginning of recorded history to the dawn of the 20th century. Early civilizations such as the Sumerians, Egyptians, and Hittites are covered, as well as more recent civilizations such as the Greeks, the Romans, and the British Empire. There are also mentions of African civilizations and several Asian ones, such as the Chinese and Japanese.

The Good

This book is rich with illustrations of historical events and cultural topics. For young readers who are visual learners, this is a superb text. There are illustrations of fighting between the Canaanites and the Philistines, an Assyrian king fighting a lion, life in the city of ancient Athens, and even depictions of settlers moving out west in what would later become the United States. This is a completely visual learning experience.

The Bad

The focus on illustrations is a strength but also a weakness, depending on your goals. Students can get so obsessed with the pictures that they never actually read the text in the book. This can be a problem if you are trying to get your students to develop their reading skills. In addition, the reading level of the text is probably at the 4th- to 5th-grade level, which is beyond younger students.

This book is also not appropriate for a whole class read aloud because several illustrations are crammed onto one page. This would make it challenging for several students to all see what the teacher was talking about at the same time, which could quickly lead to behavioral problems.

Lastly, the book is somewhat detail-oriented in that it provides many little facts about each kingdom or historical period. If you want the students to see the big picture, you have to trace the themes of history yourself, as there seem to be no pedagogical aids beyond the rich illustrations.

The Recommendation

The Usborne Book of World History deserves 4/5 stars. Buy it and put it in your library as an opportunity for your students to “see” history rather than read about it. If you need something for whole-class instruction, however, you had better keep looking.

Homeschooling Bilingual Children

A colleague of mine has kids that are half Thai and half African (like Tiger Woods). In the home, both Thai and English are spoken frequently. A major problem with bilingual children is that one of the languages is never truly mastered. This is called semilingualism. The problem was not with the kids learning Thai, because their mother was Thai. Instead, my colleague was worried about his kids developing broken, poorly understood pidgin English.

About 8 years ago there was another family whose children were half Thai and half American and they had faced the same problem. However, they overreacted and never spoke Thai in their home in order to make sure their children learned English. This led to the kids knowing only English even though they were half-Thai and lived in Thailand. My friend did not want to make this mistake.

What He Did

I suggested to my friend that he needed to set some sort of schedule in which time was set aside in the home for the use of both languages. Below is the schedule that he developed.

  • Monday – Friday from waking up until 2 pm Thai language
  • Monday-Friday 2 pm to bedtime English language
  • Weekends-English only
  • Exceptions-Home school curriculum is in English with the exception of Thai language

This has worked relatively well. The children are exposed to both languages each day for several hours at a time. Generally, the rule is when dad is home English is used.

To further support the acquisition of English, I encouraged my friend never to speak any Thai to his children. This has stunted his own development in the language, but it is more important that the children learn than that he does.

For the oldest daughter, who is homeschooled, Dan and his wife taught her to read and write in Thai and English at the same time. Many language experts would disagree with this and suggest that it is better to learn one language first and then transfer those skills to learning a second language. I see their point, but my friend wanted his daughter to have native fluency in both languages, to the point that if she is having a dream, either language could be used without a problem, so to speak.

Challenges

With bilingual children, all language goals are delayed. This is because the child has to acquire double the vocabulary of a monolingual child. My friend’s daughter didn’t really talk until she was three. However, by five, things started to move at a normal pace, with some “problems”:

  • Word order is sometimes wrong, i.e., my friend’s daughter will use Thai syntax in English and vice versa.
  • Mixing of the two languages at times (code-switching)

Most kids grow out of this.

Conclusion

Raising bilingual children requires finding a balance between the two languages in the home. I have provided one example but I would like to know how you have dealt with this with your children.

Sentiment Analysis in R

In this post, we will perform a sentiment analysis in R. Sentiment analysis employs dictionaries to give each word in a sentence a score. A more positive word is given a higher positive number, while a more negative word is given a more negative number. The score is then calculated based on the position of the word, its weight, and other more complex factors. This is then done for the entire corpus to give it a score.
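As a rough illustration of the idea (and only that; qdap’s “polarity” function also handles negators, amplifiers, and context windows), a dictionary-based scorer can be sketched in base R. The dictionary values here are made up for the example:

```r
# hypothetical polarity dictionary: word -> score
polDict <- c(good = 1, great = 2, pleasant = 1, bad = -1, terrible = -2)

# score a sentence: sum the scores of any dictionary words it contains,
# normalized by the square root of the sentence length
scoreSentence <- function(sentence, dict) {
  words <- gsub("[[:punct:]]", "", tolower(strsplit(sentence, "\\s+")[[1]]))
  sum(dict[words[words %in% names(dict)]]) / sqrt(length(words))
}

scoreSentence("Is it not pleasant to learn?", polDict)  # positive score
scoreSentence("A terrible and bad outcome.", polDict)   # negative score
```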

We will do a sentiment analysis in which we will compare three famous philosophical texts

  • Analects
  • The Prince
  • Pensees

These books are available from Project Gutenberg. You can go to the site, type in the titles, and download them to your computer.

We will use the “qdap” package in order to complete the sentiment analysis. Below is some initial code.

library(qdap)

Data Preparation

Below are the steps we need to take to prepare the data

  1. Paste the text files into R
  2. Convert the text files to ASCII format
  3. Convert the ASCII format to data frames
  4. Split the sentences in the data frame
  5. Add a variable that indicates the book name
  6. Combine the three books into one dataframe

We now need to prepare the three texts. First, we move them into R using the “paste” function.

analects<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Analects.txt",what='character'),collapse=" ")
pensees<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Pascal.txt",what='character'),collapse=" ")
prince<-paste(scan(file ="C:/Users/darrin/Documents/R/R working directory/blog/blog/Text/Prince.txt",what='character'),collapse=" ")

We need to convert the text files to the ASCII format so that R is able to read them.

analects<-iconv(analects,"latin1","ASCII","")
pensees<-iconv(pensees,"latin1","ASCII","")
prince<-iconv(prince,"latin1","ASCII","")

Now we make our dataframe for each book. The argument “texts” gives each dataframe one variable, called “texts”, which contains all the words in that book. Below is the code.

analects<-data.frame(texts=analects)
pensees<-data.frame(texts=pensees)
prince<-data.frame(texts=prince)

With the dataframes completed, we can now split the variable “texts” in each dataframe by sentence. We will use the “sentSplit” function to do this.

analects<-sentSplit(analects,'texts')
pensees<-sentSplit(pensees,'texts')
prince<-sentSplit(prince,'texts')

Next, we add the variable “book” to each dataframe. For each row, or sentence, in the dataframe, the “book” variable will tell you which book the sentence came from. This will be valuable for comparative purposes.

analects$book<-"analects"
pensees$book<-"pensees"
prince$book<-"prince"

Now we combine all three books into one dataframe. The data preparation is now complete.

threebooks<-rbind(analects,pensees,prince)

Data Analysis

We are now ready to perform the actual sentiment analysis. We will use the “polarity” function for this. Inside the function, we need to use the text and the book variables. Below is the code.

pol<-polarity(threebooks$texts,threebooks$book)

We can see the results and a plot in the code below.

pol
##       book total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1 analects            3425       31383        0.076       0.254              0.299
## 2  pensees            7617      101043        0.008       0.278              0.028
## 3   prince            1542       52281        0.017       0.296              0.056

The table is mostly self-explanatory. We have the total number of sentences and words in the first two columns. Next are the average polarity and the standard deviation. Lastly, we have the standardized mean. The last column is commonly used for comparison purposes. As such, it appears that Analects is the most positive book by a large margin, with Pensees and The Prince being about the same and generally neutral.
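The standardized mean in the last column is simply the average polarity divided by its standard deviation, which you can verify directly from the table (small discrepancies come from the rounding of the displayed values):

```r
# standardized mean polarity = average polarity / standard deviation
round(0.076 / 0.254, 3)  # analects: 0.299
round(0.008 / 0.278, 3)  # pensees: ~0.029 (table shows 0.028 from unrounded inputs)
```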

plot(pol)


The top plot shows the polarity of each sentence over time, that is, through the book. The bluer a sentence, the more negative it is; the redder, the more positive. The second plot shows the dispersion of the polarity.

There are many things to interpret from the second plot. For example, Pensees is more dispersed than the other two books in terms of polarity. The Prince is much less dispersed in comparison to the other books.

Another interesting task is to find the most negative and most positive sentences. We take the sentence-level information from the “pol” dataframe and then use the “which.min” function to find the lowest-scoring sentence. The “which.min” function only gives the row number. Therefore, we then use that row number to look up the actual sentence and the book. Below is the code.

pol.df<-pol$all #take the sentence-level polarity scores from pol
which.min(pol.df$polarity) #find the lowest scored sentence
## [1] 6343
pol.df$text.var[6343] #find the actual sentence
## [1] "Apart from Him there is but vice, misery, darkness, death, despair."
pol.df$book[6343] #find the actual book name
## [1] "pensees"

Pensees had the most negative sentence. You can see for yourself the clearly negative words: vice, misery, darkness, death, and despair. We can repeat this process for the most positive sentence.

which.max(pol.df$polarity)
## [1] 4839
pol.df$text.var[4839]
## [1] "You will be faithful, honest, humble, grateful, generous, a sincere friend, truthful."
pol.df$book[4839]
## [1] "pensees"

Again, Pensees has the most positive sentence, with words such as faithful, honest, humble, grateful, generous, sincere, friend, and truthful all being positive.

Conclusion

Sentiment analysis allows for the efficient analysis of a large body of text in a highly quantitative manner. There are weaknesses to this approach; for example, the dictionary used to classify the words can affect the results. In addition, sentiment analysis only looks at individual sentences and not larger contextual circumstances, such as a paragraph. As such, sentiment analysis provides descriptive insights rather than generalizations.

Struggles with Early Childhood Education

I had a friend (Dan) share with me his experience of homeschooling his oldest daughter (Jina) and the challenges he faced as he tried, in his own opinion, to start her education too early. He began homeschooling his oldest daughter when she was about four years of age. His goals for the first year were simply for his daughter

  • to learn to count to 10
  • to recognize the letters of the alphabet

That was all he wanted for the first year of instruction. Dan knew Jina was young, perhaps too young, so he did not want to push it. He just wanted to develop a rhythm of learning and instruction in the family along with the two goals above. In addition, his family was one of only two families in his community who homeschooled their kids, and he wanted to make sure his daughter was always on par academically with the other children in the neighborhood, as a witness to the benefits of homeschooling.

Yet, a strange thing happened. Both academic goals were achieved in less than four months, and Jina was already getting bored with school. This meant that Dan had to raise the level of complexity with more goals:

  • Recognize numbers
  • Know the sounds of all the letters of the alphabet

By the end of the first year (age five now), without any pressure and going at her own pace, my friend’s daughter could read simple words, count objects, recognize numbers, do simple addition and subtraction, and had the rudiments of telling time. However, near the end of the first year of learning, some strange things began to happen.

  • One day Jina would complete a task with no problems, but the next day she could not seem to remember how to do it at all. She also seemed to lose motivation for no reason.
  • Some concepts (such as telling time) never stuck, no matter how many times they were taught and reviewed.
  • She was inconsistent in her ability to recognize words and seemed to lack any ability to generalize (transfer) concepts to other settings; for example, realizing that ‘cap’, ‘snap’, and ‘lap’ all share the -ap ending.

When she turned five, Dan and his wife formally started Jina in an official homeschool curriculum rather than the ad-hoc approach they used for the first year. Jina now had the ability to do 1st-grade work thanks to her parents’ prior teaching. Old struggles subsided and new ones appeared. Unlike the ad-hoc curriculum, the formal homeschool curriculum had weekly lesson plans, and Dan was determined to stick to the “schedule.”

Why the Struggle

Dan still wondered what the problem was. Jina was progressing, but it was a chore, and he could not understand why. Isn’t it good to start kids in school early? That’s when he asked me.

I explained to him some of the basics of Piaget’s theory of cognitive development. This is not just any theory. Piaget’s ideas are taught to almost all undergrad education majors on the planet.

Piaget proposes that there are four stages of cognitive development

  1. Sensorimotor (0-2 years)-Learning only through the senses
  2. Preoperational (2-7)-Symbolic thinking and pretend play
  3. Concrete Operational (7-11)-Ideas applied to concrete objects; understands time and quantity
  4. Formal Operations (12-adult)-Abstract thinking, logic, and transfer possible

Dan was teaching his daughter all of these abstract ideas (counting, reading, telling time, etc) when she was at a preoperational level cognitively.

Reading is a highly abstract experience. Letters on a page have sounds attached to them, and these letters can be combined to make words. This is astounding for a child, and their minds will struggle with it if they are not ready. Numbers on a page represent an actual amount in the real world? This is another astounding breakthrough for a young child. Dan was teaching his daughter to tell time when she had no idea what time was! He was frustrated when she could not transfer knowledge to new settings, when this is normally not possible until children are 11 years or older.

If a child is not developmentally ready for these complex ideas, they will struggle with school. If Piaget’s theory is correct (and not everyone agrees), formal schooling, meaning the study of math and reading, should not begin until age 7 for most children. Traditionally, however, students have been studying these subjects for several years by the age of 7.

This is not a totally radical idea. Many parents are delaying the enrollment of their child in kindergarten by a year in order to allow them to develop more. The term for this is redshirting.

What He Did

By the time I spoke with Dan, Jina was six years old and already in second grade. She was doing better, but now Dan and his wife worried about burnout. He did not want to stop her studies completely, because stopping now would mean having to fight with her to begin again. I suggested that they slow down the instruction. Now they complete a weekly lesson plan over two weeks instead of one. This helps to minimize the damage that has taken place while still maintaining a structure of learning in the home. Unfortunately, Jina is learning multiplication when she should be learning to count.

Conclusion

I can say that there is evidence that early education is not best for children. If Piaget is correct, a child under 7 is not ready for rigorous study and should be allowed more hands-on experiences rather than abstract ones. Of course, there are exceptions, but generally, you can start too early, while it is difficult to start too late. If a child starts too early, they will be in a constant state of struggle. All children are different, but I think parents should be aware that waiting is an option when it comes to formal instruction, and one benefit of homeschooling is the ability to have authority over your child’s education.

Types of Speaking in ESL

In the context of ESL teaching, there are at least five types of speaking that take place in the classroom. This post will define and provide examples of each. The five types are as follows:

  • Imitative
  • Intensive
  • Responsive
  • Interactive
  • Extensive

The list above is ordered from simplest to most complex in terms of the requirements of oral production for the student.

Imitative

At the imitative level, it is probably already clear what the student is trying to do. At this level, the student is simply trying to repeat what was said to them in a way that is understandable and with some adherence to pronunciation, as defined by the teacher.

It doesn’t matter whether the student comprehends what they are saying or is carrying on a conversation. The goal is only to reproduce what was said to them. One common example of this is a “repeat after me” exercise in the classroom.

Intensive

Intensive speaking involves producing a limited amount of language in a highly controlled context. An example would be reading a passage aloud or giving a direct response to a simple question.

Competency at this level is shown through achieving certain grammatical or lexical mastery. This depends on the teacher’s expectations.

Responsive

Responsive is slightly more complex than intensive but the difference is blurry, to say the least. At this level, the dialog includes a simple question with a follow-up question or two. Conversations take place by this point but are simple in content.

Interactive

The unique feature of interactive speaking is that it is usually more interpersonal than transactional. By interpersonal, it is meant speaking to maintain relationships. Transactional speaking is for sharing information, as is common at the responsive level.

The challenge of interpersonal speaking is the context, or pragmatics. The speaker has to keep in mind the use of slang, humor, ellipsis, etc., when attempting to communicate. This is much more complex than saying yes or no, or giving directions to the bathroom, in a second language.

Extensive

Extensive communication is normally some sort of monologue. Examples include speeches, storytelling, etc. It involves a great deal of preparation and is not typically improvisational communication.

It is one thing to survive having a conversation with someone in a second language. You can rely on each other’s body language to make up for communication challenges. However, with extensive communication either the student can speak in a comprehensible way without relying on feedback or they cannot. In my personal experience, the typical ESL student cannot do this in a convincing manner.

Visualizing Clustered Data in R

In this post, we will look at how to visualize multivariate clustered data. We will use the “Hitters” dataset from the “ISLR” package. We will use the features of the various baseball players as the dimensions for the clustering. Below is the initial code

library(ISLR);library(cluster)
data("Hitters")
str(Hitters)
## 'data.frame':    322 obs. of  20 variables:
##  $ AtBat    : int  293 315 479 496 321 594 185 298 323 401 ...
##  $ Hits     : int  66 81 130 141 87 169 37 73 81 92 ...
##  $ HmRun    : int  1 7 18 20 10 4 1 0 6 17 ...
##  $ Runs     : int  30 24 66 65 39 74 23 24 26 49 ...
##  $ RBI      : int  29 38 72 78 42 51 8 24 32 66 ...
##  $ Walks    : int  14 39 76 37 30 35 21 7 8 65 ...
##  $ Years    : int  1 14 3 11 2 11 2 3 2 13 ...
##  $ CAtBat   : int  293 3449 1624 5628 396 4408 214 509 341 5206 ...
##  $ CHits    : int  66 835 457 1575 101 1133 42 108 86 1332 ...
##  $ CHmRun   : int  1 69 63 225 12 19 1 0 6 253 ...
##  $ CRuns    : int  30 321 224 828 48 501 30 41 32 784 ...
##  $ CRBI     : int  29 414 266 838 46 336 9 37 34 890 ...
##  $ CWalks   : int  14 375 263 354 33 194 24 12 8 866 ...
##  $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
##  $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
##  $ PutOuts  : int  446 632 880 200 805 282 76 121 143 0 ...
##  $ Assists  : int  33 43 82 11 40 421 127 283 290 0 ...
##  $ Errors   : int  20 10 14 3 4 25 7 9 19 0 ...
##  $ Salary   : num  NA 475 480 500 91.5 750 70 100 75 1100 ...
##  $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...

Data Preparation

We need to remove all of the factor variables, as the k-means algorithm cannot support factor variables. In addition, we need to remove the “Salary” variable because it has missing data. Lastly, we need to scale the data, because differences in scale affect the results of the clustering. The code for all of this is below.

hittersScaled<-scale(Hitters[,c(-14,-15,-19,-20)])
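Scaling matters because k-means relies on Euclidean distance: without it, a variable measured in the thousands (such as “CAtBat”) would swamp one measured in the tens (such as “Errors”). The “scale” function centers each column to mean 0 and standard deviation 1, which we can verify on the built-in “mtcars” data so the snippet stands on its own:

```r
# after scaling, every column has mean ~0 and standard deviation 1
m <- scale(mtcars)
round(colMeans(m), 10)      # all effectively 0
round(apply(m, 2, sd), 10)  # all 1
```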

Data Analysis

We will set k for the k-means algorithm to 3. This can be set to any number, and it often requires domain knowledge to determine what is most appropriate. Below is the code.

kHitters<-kmeans(hittersScaled,3)
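When domain knowledge is lacking, one common heuristic for choosing k is the “elbow” method: run k-means for a range of k values and plot the total within-cluster sum of squares, looking for the point where the curve flattens. Below is a self-contained sketch using the built-in “mtcars” data; the same idea applies to “hittersScaled”:

```r
set.seed(42)  # kmeans uses random starting centers
wss <- sapply(1:8, function(k) {
  kmeans(scale(mtcars), centers = k, nstart = 25)$tot.withinss
})
# look for the "elbow" where adding clusters stops helping much
plot(1:8, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```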

We now look at some descriptive stats. First, we will see how many examples are in each cluster.

table(kHitters$cluster)
## 
##   1   2   3 
## 116 144  62

The groups are mostly balanced. Next, we will look at the mean of each feature by cluster. This will be done with the “aggregate” function. We will use the original data and make a list by the three clusters.

round(aggregate(Hitters[,c(-14,-15,-19,-20)],FUN=mean,by=list(kHitters$cluster)),1)
##   Group.1 AtBat  Hits HmRun Runs  RBI Walks Years CAtBat  CHits CHmRun
## 1       1 522.4 143.4  15.1 73.8 66.0  51.7   5.7 2179.1  597.2   51.3
## 2       2 256.6  64.5   5.5 30.9 28.6  24.3   5.6 1377.1  355.6   24.7
## 3       3 404.9 106.7  14.8 54.6 59.4  48.1  15.1 6480.7 1783.4  207.5
##   CRuns  CRBI CWalks PutOuts Assists Errors
## 1 299.2 256.1  199.7   380.2   181.8   11.7
## 2 170.1 143.6  122.2   209.0    62.4    5.8
## 3 908.5 901.8  694.0   303.7    70.3    6.4

Now we can see some differences. It seems group 1 is young starters (5.7 years of experience), based on the high number of at-bats they receive. Group 2 is young players (5.6 years) who may not get to start, given the lower number of at-bats they receive. Group 3 is veteran players (15.1 years) who receive significant playing time and have put together impressive career statistics.

Now we will create our visual of the three clusters. For this, we use the “clusplot” function from the “cluster” package.

clusplot(hittersScaled,kHitters$cluster,color = T,shade = T,labels = 4)


In general, there is little overlap between the clusters. The overlap between groups 1 and 3 may be due to how both groups receive substantial playing time.

Conclusion

Visualizing the clusters can help with developing insights into the groups found during the analysis. This post provided one example of this.

Multidimensional Scale in R

In this post, we will explore multidimensional scaling (MDS) in R. The main benefit of MDS is that it allows you to plot multivariate data into two dimensions. This allows you to create visuals of complex models. In addition, the plotting of MDS allows you to see relationships among examples in a dataset based on how far or close they are to each other.

We will use the “College” dataset from the “ISLR” package to create an MDS of the colleges that are in the data set. Below is some initial code.

library(ISLR);library(ggplot2)
data("College")
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

Data Preparation

After using the “str” function, we know that we need to remove the variable “Private” because it is a factor, and the type of MDS we are doing can only accommodate numerical variables. After removing this variable, we will make a matrix using the “as.matrix” function. Once the matrix is ready, we can use the “cmdscale” function to create the actual two-dimensional MDS. Another point to mention is that, for the sake of simplicity, we are only going to use the first ten colleges in the dataset; using all 777 would make the plots hard to understand. Below is the code.

collegedata<-as.matrix(College[,-1])
collegemds<-cmdscale(dist(collegedata[1:10,]))
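To get a feel for what “cmdscale” is doing, here is a self-contained example with the built-in “eurodist” dataset, which holds road distances between 21 European cities. The two-dimensional coordinates it returns approximately preserve the original pairwise distances:

```r
# classical MDS on road distances between 21 European cities
loc <- cmdscale(eurodist, k = 2)
dim(loc)  # 21 cities in 2 dimensions
# pairwise distances in the 2-D solution track the originals closely
cor(as.numeric(eurodist), as.numeric(dist(loc)))
```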

Data Analysis

We can now make our initial plot. The xlim and ylim arguments had to be played with a little for the plot to display properly. In addition, the “text” function was used to provide additional information such as the names of the colleges.

plot(collegemds,xlim=c(-15000,10000),ylim=c(-15000,10000))
text(collegemds[,1],collegemds[,2],rownames(collegemds))


From the plot, you can see that even with only ten names it is messy. The colleges are mostly clumped together, which makes it difficult to interpret. We can instead plot this as a four-quadrant graph using “ggplot2”. First, we need to convert the matrix that we created to a dataframe.

collegemdsdf<-as.data.frame(collegemds)

We are now ready to use “ggplot” to create the four quadrant plot.

p<-ggplot(collegemdsdf, aes(x=V1, y=V2)) +
        geom_point() +
        lims(x=c(-10000,8000),y=c(-4000,5000)) +
        theme_minimal() +
        coord_fixed() +  
        geom_vline(xintercept = 5) + geom_hline(yintercept = 5)+geom_text(aes(label=rownames(collegemdsdf)))
p+theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())


We set the vertical and horizontal lines near the x- and y-intercepts, respectively. This makes the graph much easier to understand and interpret. Agnes Scott College is way off to the left, while Alaska Pacific University, Abilene Christian College, and even Alderson-Broaddus College are clumped together. The rest of the colleges straddle the area just below the x-axis.

Conclusion

In this example, we took several variables and condensed them to two dimensions. This is the primary benefit of MDS: it allows you to visualize what cannot normally be visualized. The visualization lets you see the structure of the data, from which you can draw inferences.

Topic Models in R

Topic modeling is a tool that can group texts by their main themes. It involves the use of probability based on word frequencies. The algorithm that does this is called the Latent Dirichlet Allocation (LDA) algorithm.

In this post, we will use some text-mining tools to analyze religious/philosophical texts. The five texts we will look at are:

  • The King James Bible
  • The Quran
  • The Book of Mormon
  • The Gospel of Buddha
  • Meditations, by Marcus Aurelius

The five text files can be downloaded at the following link: https://www.kaggle.com/tentotheminus9/religious-and-philosophical-texts/downloads/religious-and-philosophical-texts.zip

Once you unzip it you will need to rename each file appropriately.

The next few paragraphs are almost verbatim from the post on text mining in R, because the data preparation is essentially the same. Small changes were made, but the original material is found in the analysis section of this post.

We will now begin the actual analysis. The packages we need are “tm” and “topicmodels”. Below is some initial code.

library(tm);library(topicmodels)

Data Preparation

We need to do three things with each text file:

  1. Paste it into R
  2. Convert it to ASCII format
  3. Write it to a table

Below is the code for pasting the text into R. Keep in mind that your code will be slightly different, as the location of the files on your computer will be different. The “what” argument tells R what to take from the file, and the “collapse” argument deals with whitespace.

bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")
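If you want to see the scan/paste pattern in isolation, here is a self-contained version using a temporary file instead of the paths above. Notice that “scan” reads whitespace-separated words, so “paste” with collapse = " " rejoins them with single spaces:

```r
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello world.", "Second   line."), tmp)
txt <- paste(scan(tmp, what = "character"), collapse = " ")
txt  # "Hello world. Second line." -- the run of spaces has collapsed
```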

Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.

bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")

The last step of the preparation is the creation of tables. Here you take the objects you have already created and move them to their own folder. The text files need to be alone in that folder in order to conduct the analysis. Below is the code.

write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")

Corpus Development

We are now ready to create the corpus. This is the object we use to clean the texts together rather than individually, as before. First, we need to make the corpus object. Below is the code. Notice how it contains the directory where our tables are.

docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))

There are many different ways to prepare the corpus. For our example, we will do the following…

  • Lower case all letters-This avoids the same word being counted separately (i.e., sheep and Sheep)
  • Remove numbers
  • Remove punctuation-Simplifies the document
  • Remove whitespace-Simplifies the document
  • Remove stopwords-Words that have a function but not a meaning (i.e., to, the, this, etc.)
  • Remove custom words-Provides additional clarity

Below is the code for this

docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
                                "thee","even","yet","every","said","this","can","unto","upon",
                                "cant",'shall',"will","that","weve","dont","wont"))

We now need to create the document-term matrix. This matrix is what R will actually analyze. We will then remove sparse terms, which are terms that do not occur in a certain percentage of the documents. For our purposes, we will set the sparsity to .60, which drops any term absent from more than 60% of the documents. In other words, a word must appear in at least two of the five books in our analysis. Below is the code. The “dim” function will show how drastically the number of terms is reduced. This is done without losing a great deal of data while speeding up computational time.

dtm<-DocumentTermMatrix(docs)
dim(dtm)
## [1]     5 24368
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
## [1]    5 5265

Analysis

We will now create our topics or themes. If there is no a priori information on how many topics to make, it is up to you to decide how many. We will create three topics. The “LDA” function is used, and the argument “k” is set to three, indicating we want three topics. Below is the code.

set.seed(123)
lda3<-LDA(dtm,k=3)

We can see which topic each book was assigned to using the “topics” function. Below is the code.

topics(lda3)
##       bible.txt      buddha.txt meditations.txt      mormon.txt 
##               2               3               3               1 
##       quran.txt 
##               3

According to the results, the Book of Mormon and the Bible were so unique that they each had their own topic (1 and 2, respectively). The other three texts (the Gospel of Buddha, Meditations, and the Quran) were all placed in topic 3. It’s surprising that the Bible and the Book of Mormon were in separate topics since they are both Christian texts. It is also surprising that the Gospel of Buddha, Meditations, and the Quran are all under the same topic, as these texts seem to have little in common.
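The “topics” function reports only the single most likely topic for each book. To see how confident the model is in those assignments, the “posterior” function from “topicmodels” returns the full document-topic probability matrix. Below is a sketch that assumes the “lda3” object fit above.

```r
# Inspect the document-topic probabilities behind the hard assignments.
# Assumes the lda3 model fit earlier in this post.
probs <- posterior(lda3)$topics
round(probs, 3)
```

Each row sums to one, so a book whose largest probability is close to 1 is firmly assigned, while a more even row suggests the book mixes several topics.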

We can also use the “terms” function to see what the most common words are for each topic. The first argument in the function is the model name followed by the number of words you want to see. We will look at 10 words per topic.

terms(lda3, 10)
##       Topic 1  Topic 2  Topic 3 
##  [1,] "people" "lord"   "god"   
##  [2,] "came"   "god"    "one"   
##  [3,] "god"    "israel" "things"
##  [4,] "behold" "man"    "say"   
##  [5,] "pass"   "son"    "truth" 
##  [6,] "lord"   "king"   "man"   
##  [7,] "yea"    "house"  "lord"  
##  [8,] "land"   "one"    "life"  
##  [9,] "now"    "come"   "see"   
## [10,] "things" "people" "good"

Interpreting these results takes qualitative skill and is subjective. They all seem to be talking about similar things. Topic 2 (Bible) focuses on Israel and the Lord, while topic 1 (Mormon) is about God and people. Topic 3 (Buddha, Meditations, and the Quran) speaks of God as well, but the emphasis has moved to truth and the word “one”.

Conclusion

This post provided insight into developing topic models using R. The results of a topic model analysis are highly subjective and will often require strong domain knowledge. Furthermore, the number of topics is flexible; in the example in this post, we could have tried different numbers of topics for comparative purposes.
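As a rough sketch of how one might compare different numbers of topics, the model can be refit for several values of “k” and the log-likelihoods compared. This assumes the “dtm” object created earlier; note that the log-likelihood generally improves as topics are added, so this is only a crude guide rather than a definitive selection criterion.

```r
# Refit the LDA model for several candidate numbers of topics and
# compare log-likelihoods. Assumes the dtm object from above.
ks <- 2:5
fits <- lapply(ks, function(k) LDA(dtm, k = k, control = list(seed = 123)))
data.frame(k = ks, logLik = sapply(fits, logLik))
```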

Authentic Listening Tasks

There are many different ways in which a teacher can assess the listening skills of their students. Recognition, paraphrasing, cloze tasks, transfer, etc. are all ways to determine a student’s listening proficiency.

One criticism of the tasks above is that they are primarily inauthentic. This means that they do not strongly reflect something that happens in the real world.

In response to this, several authentic listening assessments have been developed over the years. These authentic listening assessments include the following.

  • Editing
  • Note-taking
  • Retelling
  • Interpretation

This post will examine each of the authentic listening assessments listed above.

Editing

An editing task involves the student receiving reading material. The student reviews the material and then listens to a recording of someone reading the same material aloud. The student marks their hard copy wherever the recording differs from the text.

Such an assessment requires the student to listen carefully for discrepancies between the reading material and the recording. This requires strong reading abilities and phonological knowledge.

Note-Taking

For those who are developing language skills for academic reasons, note-taking is a highly authentic form of assessment. In this approach, the students listen to some type of lecture and attempt to write down what they believe is important from the lecture.

The students are then assessed on some sort of rubric/criteria developed by the teacher. As such, marking note-taking can be highly subjective. However, the authenticity of note-taking can make it a valuable learning experience even if providing a grade is difficult.

Retelling

How retelling works should be somewhat obvious. The student listens to some form of talk. After listening, the student needs to retell or summarize what they heard.

Assessing the accuracy of the retelling has the same challenges as the note-taking assessment. However, it may be better to use retelling to encourage learning rather than provide evidence of the mastery of a skill.

Interpretation

Interpretation involves the students listening to some sort of input. After listening, the student then needs to infer the meaning of what they heard. The input can be a song, poem, news report, etc.

For example, if the student listens to a song, they may be asked to explain why the singer was happy or sad depending on the context of the song. Naturally, they cannot hope to answer such a question unless they understood what they were listening to.

Conclusion

Listening assessment does not need to be artificial. There are several ways to make listening tasks authentic. The examples in this post are just some of the potential ways.

Text Mining in R

Text mining is a descriptive analysis tool that is applied to unstructured textual data. By unstructured, we mean data that is not stored in relational databases. The majority of data on the Internet, and in the business world in general, is of an unstructured nature. As such, the use of text mining tools has grown in importance over the past two decades.

In this post, we will use some text mining tools to analyze religious/philosophical texts. The five texts we will look at are

  • The King James Bible
  • The Quran
  • The Book Of Mormon
  • The Gospel of Buddha
  • Meditations, by Marcus Aurelius

The link for access to these five text files is as follows

https://www.kaggle.com/tentotheminus9/religious-and-philosophical-texts/downloads/religious-and-philosophical-texts.zip

Once you unzip it you will need to rename each file appropriately.

The actual process of text mining is rather simple and does not involve a great deal of complex coding compared to other machine learning applications. Primarily, you need to do the following

  1. Prep the data by scanning it into R, converting it to ASCII format, and writing a table for each text
  2. Create a corpus that is then cleaned of unnecessary characters
  3. Conduct the actual descriptive analysis

We will now begin the actual analysis. The packages we need are “tm” for text mining and “wordcloud” and “RColorBrewer” for visuals. Below is some initial code.

library(tm);library(wordcloud);library(RColorBrewer)

Data Preparation

We need to do three things for each text file

  • Paste it
  • Convert it
  • Write a table

Below is the code for pasting the text into R. Keep in mind that your code will be slightly different, as the location of the files on your computer will be different. The “what” argument tells “scan” what to take from the file, and the “collapse” argument deals with whitespace.

bible<-paste(scan(file ="/home/darrin/Desktop/speech/bible.txt",what='character'),collapse=" ")
buddha<-paste(scan(file ="/home/darrin/Desktop/speech/buddha.txt",what='character'),collapse=" ")
meditations<-paste(scan(file ="/home/darrin/Desktop/speech/meditations.txt",what='character'),collapse=" ")
mormon<-paste(scan(file ="/home/darrin/Desktop/speech/mormon.txt",what='character'),collapse=" ")
quran<-paste(scan(file ="/home/darrin/Desktop/speech/quran.txt",what='character'),collapse=" ")

Now we need to convert the new objects we created to ASCII text. This removes a lot of “funny” characters from the objects. For this, we use the “iconv” function. Below is the code.

bible<-iconv(bible,"latin1","ASCII","")
meditations<-iconv(meditations,"latin1","ASCII","")
buddha<-iconv(buddha,"latin1","ASCII","")
mormon<-iconv(mormon,"latin1","ASCII","")
quran<-iconv(quran,"latin1","ASCII","")

The last step of the preparation is the creation of tables. Here you take the objects you have already created and move them to their own folder. The text files need to be alone in that folder in order to conduct the analysis. Below is the code.

write.table(bible,"/home/darrin/Documents/R working directory/textminingegw/mine/bible.txt")
write.table(meditations,"/home/darrin/Documents/R working directory/textminingegw/mine/meditations.txt")
write.table(buddha,"/home/darrin/Documents/R working directory/textminingegw/mine/buddha.txt")
write.table(mormon,"/home/darrin/Documents/R working directory/textminingegw/mine/mormon.txt")
write.table(quran,"/home/darrin/Documents/R working directory/textminingegw/mine/quran.txt")

For fun, you can see a snippet of each object by simply typing its name into R as shown below.

bible
##[1] "x 1 The Project Gutenberg EBook of The King James Bible This eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org Title: The King James Bible Release Date: March 2, 2011 [EBook #10] [This King James Bible was orginally posted by Project Gutenberg in late 1989] Language: English *** START OF THIS PROJECT
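If dumping the whole object to the console is too much, a gentler sketch is to print only the opening characters with “substr”; this assumes the “bible” object created above.

```r
# Print just the first 200 characters of the text instead of the whole object.
# Assumes the bible object created earlier in this post.
substr(bible, 1, 200)
nchar(bible)  # total number of characters in the object
```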

Corpus Creation

We are now ready to create the corpus. This is the object we use to clean the texts together rather than individually, as before. First, we need to make the corpus object. Below is the code. Notice how it contains the directory where our tables are.

docs<-Corpus(DirSource("/home/darrin/Documents/R working directory/textminingegw/mine"))

There are many different ways to prepare the corpus. For our example, we will do the following…

  • Lower case all letters-This avoids the same word being counted separately (i.e., sheep and Sheep)
  • Remove numbers
  • Remove punctuation-Simplifies the document
  • Remove whitespace-Simplifies the document
  • Remove stopwords-Words that have a function but not a meaning (i.e., to, the, this, etc.)
  • Remove custom words-Provides additional clarity

Below is the code for this

docs<-tm_map(docs,tolower)
docs<-tm_map(docs,removeNumbers)
docs<-tm_map(docs,removePunctuation)
docs<-tm_map(docs,removeWords,stopwords('english'))
docs<-tm_map(docs,stripWhitespace)
#docs<-tm_map(docs,stemDocument)
docs<-tm_map(docs,removeWords,c("chapter","also","no","thee","thy","hath","thou","thus","may",
                                "thee","even","yet","every","said","this","can","unto","upon",
                                "cant",'shall',"will","that","weve","dont","wont"))

We now need to create the document-term matrix. This matrix is what R will actually analyze. We will then remove sparse terms, which are terms that do not occur in a certain percentage of the documents. For our purposes, we will set the sparsity to .60, which drops any term absent from more than 60% of the documents. In other words, a word must appear in at least two of the five books in our analysis. Below is the code. The “dim” function will show how drastically the number of terms is reduced. This is done without losing a great deal of data while speeding up computational time.

dtm<-DocumentTermMatrix(docs)
dim(dtm)
## [1]     5 24368
dtm<-removeSparseTerms(dtm,0.6)
dim(dtm)
## [1]    5 5265

Analysis

We can now explore the text. First, we need to make a vector that holds the sums of the columns of the document-term matrix. Then we need to order the terms so that the most frequent come first. Below is the code for this.

freq<-colSums(as.matrix(dtm))
ord<-order(-freq)#changes the order to descending

We can now make a simple bar plot to see what the most common words are. Below is the code

barplot(freq[head(ord)])

[Bar plot of the six most common terms]

As expected with religious texts, the most common terms are religious terms. You can also determine which words appeared least often with the code below.

freq[tail(ord)]
##   posting   secured    smiled      sway swiftness worthless 
##         3         3         3         3         3         3

Notice how each word appeared 3 times. This may mean that each of these terms appears once in three of the five books. Remember, we removed sparse terms, so every remaining term must appear in more than one book.
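We can check this guess directly. A term's total count does not tell us how many books it appears in, but counting the nonzero entries in each column of the matrix does. Below is a small sketch that assumes the “dtm” object from above.

```r
# For each kept term, count the number of books (documents) it appears in,
# then tabulate the distribution. Assumes the dtm object created earlier.
doc_counts <- colSums(as.matrix(dtm) > 0)
table(doc_counts)
```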

Another analysis is to determine how many words appear a certain number of times. For example, how many words appear 200 times or 300. Below is the code.

head(table(freq))
## freq
##   3   4   5   6   7   8 
## 117 230 172 192 191 187

Using the “head” function with the “table” function gives us the six lowest word-frequency values. Note that “table” counts how many words share each frequency: 117 words appear 3 times, 230 words appear 4 times, etc. Remember, “head” gives the first few values regardless of their amount.
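To see exactly what “table” is tabulating here, below is a tiny self-contained example with a made-up frequency vector (the words are hypothetical).

```r
# table() tabulates frequencies of frequencies: it reports how many
# words share each count value.
freq_toy <- c(lamb = 3, staff = 5, vine = 5, stone = 7)
table(freq_toy)
# freq_toy
# 3 5 7
# 1 2 1   -> one word occurs 3 times, two words occur 5 times, one occurs 7 times
```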

The “findFreqTerms” function allows you to set a cutoff for how frequent a word needs to be. For example, if we want to know which words appeared at least 3000 times, we would use the following code.

findFreqTerms(dtm,3000)
##  [1] "behold" "came"   "come"   "god"    "land"   "lord"   "man"   
##  [8] "now"    "one"    "people"

The “findAssocs” function finds the correlation between two words in the text. This provides insight into how frequently these words appear together. For our example, we will see which words are associated with war, which is a common subject in many religious texts. We will set the correlation high to keep the list short for the blog post. Below is the code

findAssocs(dtm,"war",corlimit =.998) 
## $war
##     arrows      bands   buildeth    captive      cords     making 
##          1          1          1          1          1          1 
##  perisheth prosperity      tower      wages      yield 
##          1          1          1          1          1

The interpretation of the results can take many forms. It makes sense for ‘arrows’ and ‘captive’ to be associated with ‘war’, but ‘yield’ seems confusing. We also do not know the sample size behind the associations.

Our last technique is the development of a word cloud. This allows you to see word frequency based on where the word is located in the cloud as well as its size. For our example, we will set it so that a word must appear at least 1000 times in the corpus with more common words in the middle. Below is the code.

wordcloud(names(freq),freq,min.freq=1000,scale=c(3,.5),colors=brewer.pal(6,"Dark2"),random.color = F,random.order = F)

[Word cloud of terms appearing at least 1000 times]

Conclusion

This post provided an introduction to text mining in R. There are many more complex features available for the more serious user of R than what is described here.