
Different Views of Research

People have been doing research, formally or informally, since the beginning of time. We are always trying to figure out how to do things or why something is the way that it is. In this post, we will look at different ways to view and/or conduct research. These perspectives are empirical, theoretical, and analytical.

Empirical

Perhaps the most common form or approach to doing research is the empirical approach. This approach involves observing reality and developing hypotheses and theories based on what was observed. This is an inductive approach to doing research because the researcher starts with their observations to make a theory. In other words, you start with examples and abstract them to theories.

An example of this is found in the work of Charles Darwin and evolution. Darwin collected a lot of examples and observations of birds during his travels. Based on what he saw he inferred that animals evolved over time. This was his conclusion based on his interpretation of the data. Later, other researchers tried to further bolster Darwin’s theory by finding mathematical support for his claims.

The order in which empirical research is conducted is as follows…

  1. Identify the phenomenon
  2. Collect data
  3. Abstraction/model development
  4. Hypothesis
  5. Test

You can see that hypotheses and theory are derived from data, which is similar to qualitative research. However, steps 4 and 5 are where equations are developed and/or statistical tools are used. As such, the empirical view of research is valuable when there is a large amount of data available and can include many variables, which is again common for quantitative methods.

To summarize this, empirical research is focused on what happened, which is one way in which scientific laws are derived.
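The five steps above can be sketched in code. Below is a minimal Python illustration (my own example, not from the original post; the scenario and numbers are invented) that simulates collecting data on a phenomenon, abstracts it to a model, states a hypothesis, and tests it with a crude permutation test.

```python
import random
import statistics

random.seed(42)

# 1. Identify the phenomenon: do plants with fertilizer grow taller?
# 2. Collect data (simulated here for illustration)
fertilized = [random.gauss(12, 2) for _ in range(50)]
unfertilized = [random.gauss(10, 2) for _ in range(50)]

# 3. Abstraction/model development: summarize each group by its mean
mean_f = statistics.mean(fertilized)
mean_u = statistics.mean(unfertilized)

# 4. Hypothesis: the fertilized group's mean height is greater
# 5. Test: a crude permutation test of the observed mean difference
observed = mean_f - mean_u
combined = fertilized + unfertilized
trials = 1000
count = 0
for _ in range(trials):
    random.shuffle(combined)
    diff = statistics.mean(combined[:50]) - statistics.mean(combined[50:])
    if diff >= observed:
        count += 1
p_value = count / trials
print(round(observed, 2), p_value)
```

A small p-value here means the observed difference is unlikely under random group assignment, which is the sense in which the data support the abstracted model.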

Theoretical

The theoretical perspective is essentially the same process as the empirical one but moving in the opposite direction. Theorists will start with what they think about the phenomenon and how things should be. This approach starts with a general principle, and then the researcher goes and looks for evidence that supports it. Another way of stating this is that the theoretical approach is deductive in nature.

A classic example of this is Einstein’s theory of relativity. Apparently, he deduced this theory through logic and left it to others to determine whether the theory was correct. To put it simply, he knew without knowing, if this makes sense. In this approach, the steps are as follows…

  1. Theory
  2. Hypotheses
  3. Model abstraction
  4. Data collection
  5. Phenomenon

You collect data to confirm the hypotheses. Common statistical tools can include simulations or any other method suitable for situations in which there is little data available. The caveat is that the data must match the phenomenon to have meaning. For example, if I am trying to understand some sort of phenomenon about women, I cannot collect data from men, as this does not match the phenomenon.

In general, theoretical research is focused on why something happens, which is the goal of most theories: explaining why.
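As a toy illustration of the simulation tools mentioned above, here is a short Python sketch (my own example, not from the post): starting from the theoretical claim that a unit quarter-circle has area π/4, we generate simulated data to check the deduction rather than collecting real-world observations.

```python
import random

random.seed(0)

# Theory first: the fraction of random points in the unit square that
# fall inside the quarter-circle should approach pi/4.
n = 100_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1)
pi_estimate = 4 * inside / n
print(pi_estimate)
```

The estimate lands close to π, illustrating how simulated data can support a deduced principle when little real data is available.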

Analytical

Analytical research is probably the hardest to explain and understand. Essentially, analytical research tries to understand how people develop their empirical or theoretical research. How did Darwin make his collection, or how did Einstein develop his ideas?

In other words, analytical research is commonly used to judge the research of others. Examples of this can be people who spend a lot of time critiquing the works of others. An analytical approach looks for the strengths and weaknesses of various research. Therefore, this approach is focused on how research is done and can use tools from both empirical and theoretical research.

Conclusion

The point here was to explain different views on conducting research. The goal was not to state that one is superior to another. Rather, the goal was to show how different tools can be used in different ways.

Medieval Universities Costs

This post will discuss some of the characteristics and costs of university studies during the Medieval period. Naturally, there are a lot of similarities to modern times. However, many aspects of university life took time to grow and develop, as we will see.

University

Universities during the Middle Ages were distinct from what we see today. There were essentially no buildings that made up the university. This means that initially, in many places in Europe, there were no libraries, no laboratories, no halls, no endowments or money, and not even sports. Today, we often think of universities in terms of their physical presence. In the past, universities were thought of in terms of the students and teachers who learned and taught, regardless of the physical location.

A university was defined as the totality of students and teachers in a particular location. Both the teachers and the students organized themselves into groups for bargaining power. The university of students would work together to control rent, book prices, and tuition. If local businesses tried to abuse them, the students would threaten to leave. The students also placed expectations on the teachers, such as no absences without permission, no leaving the city without leaving a deposit (this prevented crooks from taking tuition and running), and maintaining a regular schedule.

College

Professors formed their own guild, called the college, and set expectations for people who wanted to become professors. In addition to colleges, teachers would organize themselves into faculties, a faculty being several teachers from the same discipline. Faculties were allowed to confer degrees and promote students to the academic rank of master. In addition, it was common for teachers to be celibate.

The term “college” was also used to refer to the hospice or residence hall where students lived. This is similar to modern-day dormitories. Originally, colleges were for religious students and not secular ones. To this day, institutions of higher learning are referred to as colleges and/or universities. The success of universities put the cathedral, monastic, and provincial schools out of business.

Textbooks

Textbooks were hard to find during the Middle Ages. This was before the printing press, which means that books were copied by hand. This was highly time-consuming and kept the price of books high. To get around this, it was common for students to rent books rather than purchase them. This is a strategy that is still being used today, especially with ebooks.

Books were so valuable that they were not even supposed to leave the city. In addition, professors were expected to turn over their lecture notes occasionally so that they could be converted into books. Famous textbooks from this time include Peter Lombard’s “Sentences,” a theological book, and the jurist Gratian’s “Decretum.” The rental system actually postponed the need for libraries.

Degree

Completing a degree involved 3-4 years for the BA, which included passing an examination before four teachers. Since nobody owned books, memorization was heavily emphasized. For many students, the BA was the end of their academic career, but those who wanted to continue were often expected to teach for two years before pursuing the master’s.

The master’s was often focused on obtaining the license to teach. This process involved attending lectures until a student believed he was ready for the examinations. This varied by discipline, but a student could take another 2-4 years after completing the BA, for a total of 5-8 years.

Conclusion

University life was different yet somewhat similar to the modern era. The features of the modern university crept in gradually as schools adjusted to the demands of modern life. As such, we can be sure that higher education will continue to change as it adapts to the needs of its students.

The Shocks of Teaching

New teachers often experience the shock of being a teacher. In this post, we will look at three common shocks new teachers face. These shocks are the shock of the classroom, administration, and peers.

Shock of the Classroom

A new teacher has to deal with the reality of the classroom. The problem here is that teachers are highly familiar with the classroom experience as students. This warps their perception of the classroom as they are no longer a student but a teacher. In other words, the student is now on the other side of the desk as a teacher.

This change can be difficult to adjust to. For example, it is common for new teachers to struggle with developing appropriate relationships with students. By appropriate, it is meant avoiding the pitfall of trying to be buddies with the students. Cordial relationships are good, but the teacher is still an authority figure who needs to be respected and obeyed by the students. This balance is difficult for many new teachers to find, as many want to be liked.

Another major challenge is the implementation of all the various teaching strategies that were acquired as a student-teacher. All teaching styles work, but not all teaching styles work for all teachers. It takes time to develop a personal style of teaching, and this is learned mainly through trial and error. Unfortunately, the students are the guinea pigs in this process of instructional mastery.

Shock of Administration

Working with the principal also demands a shift in perspective. All teachers were students who interacted with principals before but at a larger social distance. Now as a teacher, the social distance is smaller but this can actually make things more confusing in terms of how to relate.

The principal is a colleague but also superior. They can support a teacher’s teaching with advice and counsel but could also, and even simultaneously, believe that a teacher is unfit for their school. This dual role of supporter and judge can be uncomfortable for many.

Some principals have an open-door policy, while others merely say they have one because that is what they are supposed to say. Some will help, while others will say they will help because that is what they are supposed to say. The mixed messages can be frustrating. However, if there are any significant problems at a school, it is the principal who is the first to pay the price. Therefore, many leaders are not only watching the teachers but also trying to watch their own backs.

Shock of Peers

Another needed shift in perspective is in dealing with peers. Again, a new teacher brings the viewpoint of a former student when interacting with fellow teachers. Now, a new teacher gets to see what teachers are really like: the gossip in the breakroom, political intrigue with the administration, complaints about parents and students, and more. Sometimes the atmosphere can be somewhat negative, to say the least.

Dealing with other teachers is not always negative. There are opportunities for collaboration and learning from more experienced teachers. However, it is important to know both sides of the experience so that a new teacher is not disappointed with what they see.

Conclusion

The main problem here is that a new teacher has to deal with changing their perspective on how they see education. Going forward, a new teacher is an authority figure and not a friend and a colleague/employee and not a student. With this transition comes confusion that can be overcome with time.

Paraphrasing Tips for ESL Students

Paraphrasing is an essential skill in a professional setting. By paraphrasing, it is meant the ability to take someone else’s words and rephrase them while giving credit to the original source. Whenever a student fails to do this, it is called plagiarism, which is a major problem in academia. In this post, we will look at several tips on how to paraphrase.

The ability to paraphrase academically requires near-native writing ability. This is because you have to be able to play with the language in a highly complex manner. To be able to do this after a few semesters of ESL is difficult for the typical student. Despite this, there are several ways to make paraphrasing work. Below are just some ideas.

  • Use synonyms
  • Change the syntax
  • Make several sentences
  • Condense/summarize

One tip not mentioned is reading. Next to actually writing, nothing will improve writing skills like reading. Being exposed to different texts helps you develop an intuitive understanding of the second language in a way that copying and pasting never will.

Use Synonyms

Using synonyms is a first step in paraphrasing an idea, but this approach cannot be used by itself, as that is considered plagiarism by many people. With synonyms, you replace some words with others. The easiest words to replace are adjectives and verbs, followed by nouns. Below is an example. The first sentence is the original and the second is the paraphrase.

The man loves to play guitar
The man likes to play guitar

In the example above, all we did was change the word “loves” to “likes”. This is a superficial change that is often still considered plagiarism because of how easy it is to do. We can take this a step further by modifying the infinitive verb “to play.”

The man loves to play guitar
The man likes to play guitar
The man likes playing guitar

Again, this is superficial but a step above the first example. In addition, most word processors will provide synonyms if you right-click on a word, and of course there are online options as well. Remember that this is a beginning and a tool to use in addition to more complex approaches.

Change the Syntax

Changing the syntax has to do with the word order of the sentence or sentences. Below is an example.

The man loves to play guitar
Playing the guitar is something the man loves

In this example, we move the infinitive phrase “to play” to the front and change it to a present participle. Other adjustments also needed to be made to maintain the flow of the sentence. This is a more advanced form of paraphrasing, and it may be enough on its own to avoid plagiarism. However, you can combine synonyms and syntax changes, as shown in the example below.

The man loves to play guitar
Playing the guitar is something the man likes

Make Several Sentences

Another approach is to convert a sentence into several sentences, as shown below.

The man loves to play guitar
This man has a hobby. He likes playing guitar.

You can see that there are two sentences now. The first sentence indicates the man has a hobby, and the second explains what the hobby is and how much he likes it. In addition, in the second sentence, the verb “to play” was changed to the present participle “playing.”

Condense/Summarize

Condensing or summarizing is not considered by everyone to be paraphrasing. The separation between paraphrasing and summarizing is fuzzy and it is more of a continuum than black and white. With this technique, you try to reduce the length of the statement you are paraphrasing as shown below.

The man loves to play guitar
He likes guitar

This was a difficult sentence to summarize because it was already so short. However, we were able to shrink it from six words to three by removing what it was about the guitar that he liked.

Academic Examples

We will now look at several academic examples to show the application of these rules in a real context. The passage below is some academic text.

There is also a push within Southeast Asia for college graduates to have
interpersonal skills. For example, Malaysia is calling for graduates to
have soft skills and that these need to be part of the curriculum of tertiary schools.
In addition, a lack of these skills has been found to limit graduates’ employability.

Example 1: Paraphrase with synonyms and syntax changes

There are several skills graduates need for employability in Southeast Asia. For example, people skills are needed. The ability to relate to others is being pushed for inclusion in higher education in parts of Southeast Asia (Thomas, 2018).

You can see how difficult this can be. We rearranged several concepts and changed several verbs to try and make this our own sentence. Below is an example of condensing.

Example 2: Condensing

There is demand in Southeast Asia for higher education to develop the interpersonal skills of their students as this is limiting the employability of graduates (Thomas, 2018).

With this example, we reduced the paragraph to one sentence.

Culture and Plagiarism

There are major differences in how plagiarism is viewed across cultures. In the West, plagiarism is universally condemned both in and out of academia as essentially stealing ideas from other people. However, in other places, the idea of plagiarism is much more nuanced or even acceptable.

In some cultures, one way to honor what someone has said or taught is to literally repeat it verbatim. The thought process goes something like this

  • This person is a great teacher/elder
  • What they said is insightful
  • As a student or lower person, I cannot improve what they said
  • Therefore, I should copy these perfect words into my own paper.

Of course, not everyone thinks like this, but I have experienced enough to know that it does happen.

Whether the West likes it or not, plagiarism is a cultural position rather than an ethical one. Reducing plagiarism requires showing students that it is culturally unacceptable in an academic/professional setting. The tips in this post will at least provide tools for supporting students as they overcome this habit.

Whole Language vs Phonics

Among educators who specialize in reading instruction, there has been a long controversy over how to teach students to read. Generally, the two main schools of thought are phonics on one side and the whole language approach on the other. In this post, we will look at both of these approaches as well as a compromise position.

Phonics

Phonics is an approach that has students decode the words they see by sounding out individual letters and letter combinations. By blending the individual sounds of a word together, the student is able to read the word. This requires that the student know what sounds different letters make. Without this phonemic awareness, there is no hope for reading.

One benefit of this approach is that it is clear whether a student can do this or not. This makes it easy to provide the support needed to help the student, and it makes it easy to assess the student’s development. Another benefit is that it focuses on the smallest units of speech sound, which helps a child keep track of one thing at a time.

A problem with a phonics-based approach is that the importance of context is lost, because students focus only on sounding out words rather than developing reading comprehension. This can lead to students who can read and sound out words well but have no idea what they read or what the text means. The idea of seeing the passage as a whole is lost.

Whole Language

Whole language is a literature-based approach that emphasizes relevancy for the student and culture. Activities used include oral reading, silent reading, journal writing, group activities, etc. Students do not focus on sounding out words but rather on knowing the whole word through knowledge of the context. Allowance is even made for inventive spelling, in which students make up their own spellings, to avoid discouraging them through frequent correction of misspelled words.

An extreme example of the whole language approach is when students are allowed to substitute words in a text they are reading rather than use the word the author wrote. For example, if the author mentions the word “pony” and the student does not understand this word, the student can substitute the word “horse” for “pony” in the author’s story, and this is considered acceptable by whole language standards.

One benefit of this approach is that it is much more enjoyable than the phonics approach. Students immediately begin reading content that is relevant to their lives and interesting, and their prior knowledge supposedly helps with understanding.

The drawbacks of whole language are that students at times struggle to generalize their reading skills to new contexts. In addition, the replacement of unknown words with words the student already knows can make it difficult for the teacher to understand where students are struggling. If all students are doing this, it becomes difficult for them to communicate with each other about a commonly read text. This may be one reason why whole language has been rejected over the past 30 years in favor of an emphasis on phonics.

Balanced Approach

Currently, there is more of a push for a mixture of both methods. Phonics supports a bottom-up approach to reading, while whole language supports a top-down approach. By blending the two methods, it is possible to capture the strengths of both approaches without the corresponding weaknesses.

In the classroom, this may look like using literature relevant to the student along with reading instruction that matches the student’s needs. If the student can read without extensive phonemic awareness training, a whole-word approach might be more appropriate. When the student cannot read a word, phonics may be beneficial.

Conclusion

It is better to match the system to the student than to match the student to the system. Whenever extreme positions are taken it helps some while hurting others. A teacher needs to have the flexibility to find the best tool for the context they are working in rather than based on what they were taught as students.

School as a Socializing Agent: Cultural Preservation

Many would agree that education, as found in schools, has an obligation to socialize students to help them fit into society. With this goal in mind, it is logical to conclude that there will be different views on how to socialize students. The two main extreme positions on this continuum of socialization would be

  • Socializing through the preservation of cultural norms from one generation to the next
  • Socializing through the questioning of prior norms and pushing for social change

As I have already mentioned, these may be the two extremes on a continuum going from complete and total cultural preservation to complete and total anarchy. In this post, we will focus the discussion on schools as agents of cultural preservation.

School as Cultural Preserver

In the view of the school as a cultural preserver, the responsibility of the school to society is to support the dominant ideas and views of the culture. This is done through teaching and explaining things from a dominant group’s perspective and excluding or censoring other viewpoints to some degree. In other words, American schools should produce Americans who support and live American values, Chinese schools should produce Chinese who support and live Chinese values, etc.

This approach to schooling has been used throughout history to compel people from minority groups to conform to the views of the dominant group. In the US, there were boarding schools for Native Americans to try to “civilize” them. This was also seen in many parts of Asia, where ethnic tribes were sent to government schools, forbidden to use their mother tongue in favor of the national language, and made to pledge devotion and loyalty to the dominant culture. Many believe that weakening local identities helps to strengthen the state, or at least maintain the status quo. If you are in a position of dominance, either of these would benefit you.

What this view lacks in diversity, due to minority views being absent, it makes up for in stability. Schools that support cultural preservation show students their place in society and how to interact with those around them. By limiting students to a specific predefined worldview, it lowers, though does not eliminate, internal social strife.

Problems and Pushback

A natural consequence of schools as cultural preservers is a strong sense of pride in those who belong to the culture that is being preserved. This can lead, at times, to a sense of superiority. Of course, if you are not from the dominant culture, it can be suffocating to constantly have other people’s values and beliefs pushed upon you.

This sense of exclusion can lead to serious challenges from minority groups. There are countless examples of this in the United States, where it seems everyone is pushing back against the established dominant culture. There are those who are pushing for Black, Latino, Asian, feminist, and other worldviews to be a part of the education of the school. This is not inherently a problem; however, if everyone has an equal voice and everyone is talking at the same time, this means that nobody is listening. In other words, a voice needs an ear as much as an ear needs a voice.

Conclusion

It is convenient to take an extreme position and say that using school to preserve culture is wrong. The problem with this is that the people who say this want to preserve the belief that using school to preserve culture is wrong. In other words, it is not the preservation of culture that is the problem. The real battle is over what culture is going to be preserved. Whether it is the current dominant view or the view of a challenger.

Understanding Variables

In research, there are many terms that have the same underlying meaning, which can be confusing for researchers as they try to complete a project. The problem is that people have different backgrounds and learn different terms during their studies, and when they try to work with others there is often confusion over what is what.

In this post, we will try to clarify as much as possible various terms that are used when referring to variables. We will look at the following during this discussion

  • Definition of a variable
  • Minimum knowledge of the characteristics of a variable in research
  • Various synonyms of variable

Definition

The word variable has the root “vary” and the suffix “able”. This literally means that a variable is something that is able to change. Examples include such concepts as height, weight, salary, etc. All of these concepts change as you gather data from different people. Statistics is primarily about trying to explain and/or understand the variability of variables.

However, to make things more confusing, there are times in research when a variable does not change, or remains constant. This will be explained in greater detail in a moment.

Minimum You Need to Know

Two broad concepts that you need to understand regardless of the specific variable terms you encounter are the following

  • Whether the variable(s) are independent or dependent
  • Whether the variable(s) are categorical or continuous

When we speak of independent and dependent variables we are looking at the relationship(s) between variables. Dependent variables are explained by independent variables. Therefore, one dimension of variables is understanding how they relate to each other and the most basic way to see this is independent vs dependent.

The second dimension to consider when thinking about variables is how they are measured, which is captured by the terms categorical and continuous. A categorical variable has a finite number of possible values. Examples include gender, hair color, or cellphone brand. A person can only be male or female, have blond or brown hair, and carry one brand of cellphone.

Continuous variables are variables that can take on an infinite number of values. Salary, temperature, etc., are all continuous in nature. It is possible to convert a continuous variable into a categorical one by creating intervals in which to place values. This is commonly done when creating bins for histograms. In sum, here are the four possible general variable types below

  1. Independent categorical
  2. Independent continuous
  3. Dependent categorical
  4. Dependent continuous

Naturally, most models have one dependent variable, categorical or continuous; however, you can have any combination of continuous and categorical variables as independents. Remember that all variables have the above characteristics regardless of whatever term is used for them.
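As a quick illustration, here is a small Python sketch (hypothetical data, not from the post) showing categorical and continuous variables, including converting a continuous variable into a categorical one by binning:

```python
# Hypothetical records: each has categorical and continuous variables.
people = [
    {"gender": "male", "salary": 42000, "brand": "Apple"},
    {"gender": "female", "salary": 58000, "brand": "Samsung"},
    {"gender": "female", "salary": 71000, "brand": "Apple"},
]

# Categorical: a finite set of possible values
genders = {p["gender"] for p in people}

# Continuous: can take any value in a range, e.g. salary.
# Converting continuous -> categorical by creating intervals (bins):
def salary_band(salary):
    if salary < 50000:
        return "low"
    elif salary < 65000:
        return "mid"
    return "high"

bands = [salary_band(p["salary"]) for p in people]
print(genders, bands)
```

In a model predicting salary from gender and brand, salary would be the dependent continuous variable and the other two would be independent categorical variables.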

Variable Synonyms

Below is a list of various names that variables go by in different disciplines. This is by no means an exhaustive list.

Experimental variable

A variable whose values are independent of any changes in the values of other variables. In other words, an experimental variable is just another term for independent variable.

Manipulated Variable

A variable that is independent in an experiment but whose value/behavior the researcher is able to control or manipulate. This is also another term for an independent variable.

Control Variable

A variable whose value does not change. Controlling a variable helps to explain the relationship between the independent and dependent variables in an experiment by making sure the control variable has no influence in the model.

Responding Variable

The dependent variable in an experiment. It responds to the experimental variable.

Intervening Variable

This is a hypothetical variable. It is used to explain the causal links between variables. Since they are hypothetical, they are not observed in an actual experiment. For example, suppose you examine the relationship between income and life expectancy and find a positive relationship. The intervening variable may be access to healthcare: people who make more money have more access to healthcare, and this contributes to them often living longer.

Mediating Variable

This is the same thing as an intervening variable, the difference often being that a mediating variable is not hypothetical in nature and is often measured itself.

Confounding Variable

A confounder is a variable that influences both the independent and dependent variables, causing a spurious or false association. A confounding variable is often a causal idea and cannot be described purely in terms of correlations or associations with other variables. It is similar to an intervening variable, though a confounder influences both variables rather than lying on the causal path between them.

Explanatory Variable

This variable is the same as an independent variable, the difference being that an independent variable is assumed not to be influenced by any other variables. When this independence is not certain, the variable is called an explanatory variable.

Predictor Variable

A predictor variable is an independent variable. This term is commonly used for regression analysis.

Outcome Variable

An outcome variable is a dependent variable in the context of regression analysis.

Observed Variable

This is a variable that is measured directly. An example would be gender or height. There is no psychological construct needed to infer the meaning of such variables.

Unobserved Variable

Unobserved variables are constructs that cannot be measured directly. In such situations, observed variables are used to try to determine the characteristics of the unobserved variable. For example, it is hard to measure addiction directly. Instead, other things, such as health, drug use, and performance, will be measured to infer addiction. The measures of these observed variables indicate the level of the unobserved variable of addiction.

Features

A feature is an independent variable in the context of machine learning and data science.

Target Variable

A target variable is the dependent variable in the context of machine learning and data science.

To conclude this, below is a summary of the different variables discussed and whether they are independent, dependent, or neither.

Independent    Dependent    Neither
Experimental   Responding   Control
Manipulated    Target       Explanatory
Predictor      Outcome      Intervening
Feature                     Mediating
                            Observed
                            Unobserved
                            Confounding

You can see how confusing this can be. Even though variables are mostly independent or dependent, there is a class of variables that do not fall into either category. However, for most purposes, the first two columns cover the majority of needs in simple research.

Conclusion

The confusion over variables is mainly due to an inconsistency in terms across fields. There is nothing right or wrong about the different terms. They all developed in different places to address the same common problems. However, for students or those new to research, this can be confusing, and this post hopefully helps to clarify things.

T-SNE Visualization and R

It is common in research to want to visualize data in order to search for patterns. When the number of features increases, this can often become even more important. Common tools for visualizing numerous features include principal component analysis and linear discriminant analysis. Not only do these tools work for visualization, they can also be beneficial for dimension reduction.

However, the available tools are not limited to these two options. Another option for achieving either of these goals is t-Distributed Stochastic Neighbor Embedding. This relatively young algorithm (2008) is the focus of this post. We will explain what it is and provide an example using a simple dataset from the Ecdat package in R.

t-sne Defined

t-sne is a nonlinear dimension reduction and visualization tool. Essentially, it identifies clusters among the observations. However, it is not a clustering algorithm, because it reduces the dimensions (normally to 2) for visualization. This means that the input features are no longer present in their original form, which limits the ability to make inferences. Therefore, t-sne is often used for exploratory purposes.

The nonlinear character of t-sne is what often makes it appear superior to PCA, which is linear. Without getting too technical, t-sne takes a simultaneously global and local approach to mapping points, while PCA can only use a global approach.

The downside to the t-sne approach is that it requires a large number of calculations. The calculations are often pairwise comparisons, which grow quadratically with the number of observations in large datasets.
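As a point of comparison for readers working in Python rather than R, the same kind of embedding can be sketched with scikit-learn's TSNE. This uses toy data, not the Fair dataset analyzed below.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two toy clusters in 5 dimensions stand in for real features
X = np.vstack([rng.normal(0, 1, (25, 5)),
               rng.normal(5, 1, (25, 5))])

# Reduce to 2 dimensions for plotting; perplexity must be smaller
# than the number of observations
emb = TSNE(n_components=2, perplexity=10, init='random',
           random_state=0).fit_transform(X)
print(emb.shape)  # (50, 2)
```

The resulting two columns can be scattered and colored by a grouping variable, exactly as the R example does further down.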

Initial Packages

We will use the “Rtsne” package for the analysis, and we will use the “Fair” dataset from the “Ecdat” package. The “Fair” dataset contains survey data about extramarital affairs. We want to see if we can find patterns among the respondents based on their occupation. Below is some initial code.

library(Rtsne)
library(Ecdat)

Dataset Preparation

To prepare the data, we first remove rows with missing data using the “na.omit” function. This is saved in a new object called “train”. Next, we change our outcome variable into a factor variable. The categories range from 1 to 9:

  1. Farm laborer, day laborer,
  2. Unskilled worker, service worker,
  3. Machine operator, semiskilled worker,
  4. Skilled manual worker, craftsman, police,
  5. Clerical/sales, small farm owner,
  6. Technician, semiprofessional, supervisor,
  7. Small business owner, farm owner, teacher,
  8. Mid-level manager or professional,
  9. Senior manager or professional.

Below is the code.

train<-na.omit(Fair)
train$occupation<-as.factor(train$occupation)

Visualization Preparation

Before we do the analysis we need to set the colors for the different categories. This is done with the code below.

colors<-rainbow(length(unique(train$occupation)))
names(colors)<-unique(train$occupation)

We can now do our analysis. We will use the “Rtsne” function. When you input the dataset you must exclude the dependent variable as well as any other factor variables. You also set the dimensions and the perplexity. Perplexity determines how many neighbors are used to determine the location of each datapoint after the calculations. verbose simply provides information during the calculation, which is useful if you want to know what progress is being made. max_iter is the number of iterations to take to complete the analysis, and check_duplicates checks for duplicates, which could be a problem in the analysis. Below is the code.

tsne<-Rtsne(train[,-c(1,4,7)],dims=2,perplexity=30,verbose=T,max_iter=1500,check_duplicates=F)
## Performing PCA
## Read the 601 x 6 data matrix successfully!
## OpenMP is working. 1 threads.
## Using no_dims = 2, perplexity = 30.000000, and theta = 0.500000
## Computing input similarities...
## Building tree...
## Done in 0.05 seconds (sparsity = 0.190597)!
## Learning embedding...
## Iteration 1450: error is 0.280471 (50 iterations in 0.07 seconds)
## Iteration 1500: error is 0.279962 (50 iterations in 0.07 seconds)
## Fitting performed in 2.21 seconds.

Below is the code for making the visual.

plot(tsne$Y,t='n',main='tsne',xlim=c(-30,30),ylim=c(-30,30))
text(tsne$Y,labels=train$occupation,col = colors[train$occupation])
legend(25,5,legend=unique(train$occupation),col = colors,pch=c(1))


You can see that there are clusters; however, the clusters are all mixed across the different occupations. What this indicates is that the features we used to make the two dimensions do not discriminate between the different occupations.

Conclusion

T-SNE is an improved way to visualize data. This is not to say that there is no place for PCA anymore. Rather, this newer approach provides a different way of quickly visualizing complex data without the limitations of PCA.

Check out ERT online courses

Drag, Pan, & Zoom Elements with D3.js

Mouse events can be combined in order to create some rather complex interactions using d3.js, such as dragging, panning, and zooming. These events are handled with tools called behaviors.

In this post, we will look at these three behaviors in two examples.

  • Dragging
  • Panning and zooming

Dragging

Dragging allows the user to move an element around on the screen. What we are going to do is make three circles that are different colors that we can move around as we desire within the element. We start by setting the width, height of the svg element as well as the radius of the circles we will make (line 7). Next, we create our svg by appending it to the body element. We also set a black line around the element so that the user knows where the borders are (lines 8-14).

The next part involves setting the colors for the circles and then creating the circles and setting all of their attributes (lines 21-30). Setting the drag behavior comes later: we use the .drag() and .on() methods to create this behavior, and the .call() method connects the information in this section to our circles variable.

The last part is the use of the onDrag function. This function retrieves the position of the moving element and transforms the element within the svg element (lines 36-46). This involves using an if statement as well as setting attributes. If this sounds confusing, below is the code followed by a visual of what the code does.


If you look carefully you will notice I can never move the circles beyond the border. This is because the border represents the edge of the element. This is important because you can limit how far an element can travel by determining the size of the element's space.

Panning and Zooming

Panning allows you to move all visuals around at once inside an element. Zooming allows you to expand or contract what you see. Most of this code is an extension of what we did in the previous example. The new additions are explained below.

  1. A variable called zoomAction sets the zoom behavior by determining the scale of the zoom and setting the .on() method (Lines 9-11)
  2. We add the .call() method to the svg variable as well as the .append(‘g’) so that this behavior can be used (Lines 20-21).
  3. The dragAction variable is created to allow us to pan or move the entire element around. This same variable is placed inside a .call() method for the circles variable that was created earlier (Lines 40-46).
  4. Lines 48-60 update the position of the element by making two functions. The onDrag function deals with panning and the onZoom function deals with zooming.

Below is the code and a visual of what it does.

You can clearly see that we can move the circles individually or as a group. In addition, you can also see how we can zoom in and out. Unlike the first example, this example allows you to leave the border. This is probably due to the zoom capability.

Conclusion

The behaviors shared here provide additional tools that you can use as you design visuals using D3.js. There are other more practical ways to use these tools as we shall see.

Intro to Interactivity with D3.js

D3.js provides many ways in which the user can interact with visual data. Interaction with a visual can help the user to better understand the nature and characteristics of the data, which can lead to insights. In this post, we will look at three basic examples of interactivity involving mouse events.

Mouse events are actions taken by the browser in response to some action by the mouse. The handler for mouse events is primarily the .on() method. The three examples of mouse events in this post are listed below.

  • Tracking the mouse’s position
  • Highlighting an element based on mouse position
  • Communicating to the user when they have clicked on an element

Tracking the Mouse’s Position

The code for tracking the mouse’s position is rather simple. What is new is that we need to create a variable that appends a text element to the svg element. When we do this we need to indicate the position and size of the text as well.

Next, we need to use the .on() method on the svg variable we created. Inside this method is the type of behavior to monitor which in this case is the movement of the mouse. We then create a simple way for the browser to display the x, y coordinates.  Below is the code followed by the actual visual.


You can see that as the mouse moves the x,y coordinates move as well. The browser is watching the movement of the mouse and communicating this through the changes in the coordinates in the clip above.

Highlighting an Element Based on Mouse Position

This example allows an element to change color when the mouse comes in contact with it. To do this we need to create some data that will contain the radius of four circles with their x,y position (line 13).

Next we use the .selectAll() method to select all circles, load the data, enter the data, append the circles, set the color of the circles to green, and create a function that sets the position of the circles (lines 15-26).

Lastly, we will use the .on() function twice: once for when the mouse touches a circle and a second time for when the mouse leaves it. When the mouse touches a circle, the circle will turn black. When the mouse leaves a circle, the circle will return to the original color of green (lines 27-32). Below is the code followed by the visual.


Indicating when a User Clicks on an Element

This example is an extension of the previous one. All the code is the same except you add the following at the bottom of the code right before the close of the script element.

.on('click', function (d, i) {

alert(d + ' ' + i);

});

This .on() method has an alert inside the function. When this is used it will tell the user when they have clicked on an element, and it will also tell the user the radius of the circle as well as what position in the array the data comes from. Below is the visual of this code.

Conclusion

You can perhaps see the fun that is possible with interaction when using D3.js. There is much more that can be done in ways that are much more practical than what was shown here.

Tweening with D3.js

Tweening is a tool that allows you to tell D3.js how to calculate attributes during transitions without keyframe tracking. The problem with keyframe tracking is that it can cause performance issues if there is a lot of animation.

We are going to look at three examples of the use of tweening in this post. The examples are as follows.

  • Counting numbers animation
  • Changing font size animation
  • Spinning shape animation

Counting Numbers Animation

This simple animation involves using the .tween() method to count from 0 to 25. The other information in the code determines the position of the element, the font-size, and the length of the animation.

In order to use the .tween() method you must make a function. You first give the function a name, followed by the arguments to be used. Inside the function we indicate what it should do using the .interpolateRound() method, which tells d3.js to count from 0 to 25. Below is the code followed by the animation.


You can see that the speed of the numbers is not constant. This is because we did not control for this.

Changing Font-Size Animation

The next example is more of the same. This time we simply make the size of some text change. To do this you use the .text() method in your svg element. In addition, you now use the .styleTween() method. Inside this method we use the .interpolate() method and set arguments for the font and font-size at the beginning and the end of the animation. Below is the code and the animation.

Spinning Shape Animation

The last example is somewhat more complicated. It involves creating a shape that spins in place. To achieve this we do the following.

  1. Set the width and height of the element
  2. Set the svg element to the width and height
  3. Append a group element to the svg element.
  4. Transform and translate the g element in order to move it
  5. Append a path to the g element
  6. Set the shape to a diamond using the .symbol(), .type(), and .size() methods.
  7. Set the color of the shape using .style()
  8. Set the .each() method to follow the cycle function
  9. Create the cycle function
  10. Set the .transition(), and .duration() methods
  11. Use the .attrTween() method with the .interpolateString() method to set the rotation of the spinning.
  12. Finish with the .each() method

Below is the code followed by the animation.


This animation never stops because we are using a cycle.

Conclusion

Animations can be a lot of fun when using d3.js. The examples here may not be the most practical, but they provide you with an opportunity to look at the code and decide how you will want to use d3.js in the future.

Adding Labels to Graphs in D3.js

In this post, we will look at how to add the following to a bar graph using d3.js.

  • Labels
  • Margins
  • Axes

Before we begin, you need the initial code that has a bar graph already created. This is shown below, followed by what it should look like before we make any changes.


The first change is in lines 16-19. Here, we change the name of the variable and modify the type of element it creates.


Our next change begins at line 27 and continues until line 38. Here we make two changes. First, we make a variable called barGroup, which selects all the group elements of the variable g. We also use the data, enter, append and attr methods. Starting in line 33 and continuing until line 38 we use the append method on our new variable barGroup to add rect elements as well as the color and size of each bar. Below is the code.


The last step for adding text appears in lines 42-50. First, we make a variable called textTranslator to move our text. Then we append the text to the barGroup variable. The color, font type, and font size are all set in the code below, followed by a visual of what our graph looks like now.


Margin

Margins serve to provide spacing in a graph. This is especially useful if you want to add axes. The changes in the code take place in lines 16-39 and include an extensive reworking of the code. In lines 16-20 we create several variables that are used for calculating the margins and the size and shape of the svg element. In lines 22-30 we set the attributes for the svg variable. In lines 32-34 we add a group element to hold the main parts of the graph. Lastly, in lines 36-40 we add a gray background for effect. Below is the code followed by our new graph.


Axes

In order for this to work, we have to change the value of the variable maxValue to 150. This gives a little more space at the top of the graph. The code for the axis goes from line 74 to line 98.

  • Lines 74-77 create variables to set up the axis so that it is on the left
  • Lines 78-85 create two more variables that set the scale and the orientation of the axis
  • Lines 87-99 set the visual characteristics of the axis.

Below is the code followed by the updated graph


You can see the scale off to the left as planned.

Conclusion

Making bar graphs is a basic task in d3.js. Although the code can seem cumbersome to people who do not use JavaScript, the ability to design visuals like this often outweighs the challenges.

Defining Terms in Debates

Defining terms in debates is an important part of the process that can be tricky at times. In this post, we will look at three criteria to consider when dealing with terms in debates. Below are the three criteria

  • When to define
  • What to define
  • How to define

When to Define

Definitions are almost always given at the beginning of the debate. This is because it helps to set up limits about what is discussed. It also makes it clear what the issue and potential propositions are.

Some debates focus exclusively on defining terms. For example, with highly controversial ideas such as abortion or non-traditional marriage, the focus is often just on such definitions as when life begins or what marriage is. Defining terms helps to remove the fuzziness of the controversy and to focus the exchange of ideas.

What to Define

It is not always clear what needs to be defined when starting a debate. Consider the following proposition of value.

Resolved: That playing video games is detrimental to the development of children

Here are just a few things that may need to be defined.

  • Video games: Does this refer to online, mobile, or console games? What about violent vs. non-violent games? Do educational games fall into this category as well?
  • Development: What kind of development? Is this referring to emotional, physical, social, or some other form of development?
  • Children: Is this referring only to small children (0-6), young children (7-12) or teenagers?

These are just some of the questions to consider when trying to determine what to define. Again this is important because the affirmative may be arguing that video games are bad for small children but not for teenagers, while the negative may be preparing a debate for the opposite.

How to Define

There are several ways to define a term. Below are just a few examples of how to do this.

Common Usage

Common usage is the everyday meaning of the term. For example,

We define children as individuals who are under the age of 18

This is clear and simple.

Example

Example definitions give an example of the term to illustrate it as shown below.

An example of a video game would be PlayerUnknown’s Battlegrounds

This provides a context for the type of video games the debate may focus on.

Operation

An operational definition is a working definition limited to the specific context. For example,

Video games, for us, are any games that are played on an electronic device

Few would define video games this broadly, but this is an example.

Authority

An authority definition is a term that is defined by an expert.

According to Techopedia, a video game is…..

Authorities use their experience and knowledge to set what a term means, and debaters can make use of this.

Negation

Negation is defining a word by what it is not. For example,

When we speak of video games we are not talking about educational games such as Oregon Trail. Rather, we are speaking of violent games such as Grand Theft Auto

The contrast between the types of games here is what the debater is using to define their term.

Conclusion

Defining terms is part of debating. Debaters need to be trained to understand the importance of this so that they can enhance their communication and persuasion.

Making Bar Graphs with D3.js

This post will provide an example of how to make a basic bar graph using d3.js. Visualizing data is important, and developing bar graphs is one way to communicate information efficiently.

This post has the following steps

  1. Initial Template
  2. Enter the data
  3. Setup for the bar graphs
  4. Svg element
  5. Positioning
  6. Make the bar graph

Initial Template

Below is what the initial code should look like.


Entering the Data

For the data, we will hard-code it into the script using an array. This is not the most optimal way of doing this, but it is the simplest for a first-time experience. This code is placed inside the second script element. Below is a picture.


The new code is in lines 10-11, saved as the variable data.

Setup for the Bars in the Graph

We are now going to create three variables. Each is explained below

  • The barWidth variable will indicate how wide the bars should be for the graph
  • The barPadding variable will put space between the bars in the graph. If this is set to 0, it would make a histogram
  • The maxValue variable scales the height of the bars relative to the largest observation in the array. This variable uses the .max() method to find the largest value.

Below is the code for these three variables


The new information was added in lines 13-14

SVG Element

We can now begin to work with the svg element. We are going to create another variable called mainGroup. This will assign the svg element inside the body element using the .select() method. We will append the svg using .append and will set the width and height using .attr. Lastly, we will append a group element inside the svg so that all of our bars are inside the group that is inside the svg element.

The code is getting longer, so we will only show the new additions in the pictures with a reference to older code. Below is the new code in lines 16-19, directly under the maxValue variable.


Positioning

Next, we need to make three functions.

  • The first function will calculate the x location of the bar graph
  • The second function  will calculate the y location of the bar graph
  • The last function will combine the work of the first two functions to place the bar in the proper x,y coordinate in the svg element.

Below is the code for the three functions. These are added in lines 21-25.

The xloc function starts at the bottom left of the mainGroup element and adds the barWidth plus the barPadding to make the next bar. The yloc function starts at the top left and subtracts the given data point from maxValue to calculate the y position. Lastly, the translator combines the output of both the xloc and yloc functions to position each bar using the translate method.

Making the Graph

We can now make our graph. We will use our mainGroup variable with the .selectAll method with the rect argument inside. Next, we use .data(data) to add the data, .enter() to update the element, and .append("rect") to add the rectangles. Lastly, we use .attr() to set the color, transformation, and height of the bars. Below is the code in lines 27-36, followed by the actual bar graph.


The graph is complete but you can see that there is a lot of work that needs to be done in order to improve it. However, that will be done in a future post.

Quadratic Discriminant Analysis with Python

Quadratic discriminant analysis allows the classifier to assess nonlinear relationships. This is of course something that linear discriminant analysis is not able to do. This post will go through the steps necessary to complete a QDA analysis using Python. The steps that will be conducted are as follows.

  1. Data preparation
  2. Model training
  3. Model testing

Our goal will be to predict the gender of examples in the “Wages1” dataset using the available independent variables.

Data Preparation

We will begin by first loading the libraries we will need

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis as QDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import (confusion_matrix,accuracy_score)
import seaborn as sns
from matplotlib.colors import ListedColormap

Next, we will load our data. "Wages1" comes from the "pydataset" library. After loading the data, we will use the .head() method to look at it briefly.


We need to transform the variable 'sex', our dependent variable, into a dummy variable using numbers instead of text. We will use the .get_dummies() method to make the dummy variables and then add them to the dataset using the .concat() method.
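A minimal sketch of this step is below, using a toy frame in place of the full Wages1 data (the real frame is loaded from pydataset as shown earlier):

```python
import pandas as pd

# Toy stand-in for the Wages1 frame; the real one comes from pydataset
df = pd.DataFrame({'sex': ['male', 'female', 'male'],
                   'wage': [6.2, 5.1, 7.0]})

# One 0/1 column per category of 'sex', bolted onto the original frame
dummy = pd.get_dummies(df['sex'])
df = pd.concat([df, dummy], axis=1)
print(list(df.columns))  # ['sex', 'wage', 'female', 'male']
```

The new 'male' column is what serves as the numeric dependent variable later in the analysis.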

In the code below we have the histogram for the continuous independent variables.  We are using the .distplot() method from seaborn to make the histograms.

fig = plt.figure()
fig, axs = plt.subplots(figsize=(15, 10),ncols=3)
sns.set(font_scale=1.4)
sns.distplot(df['exper'],color='black',ax=axs[0])
sns.distplot(df['school'],color='black',ax=axs[1])
sns.distplot(df['wage'],color='black',ax=axs[2])


The variables look reasonably normal. Below are the proportions of the categorical dependent variable.

round(df.groupby('sex').count()/3294,2)
Out[247]: 
exper school wage female male
sex 
female 0.48 0.48 0.48 0.48 0.48
male 0.52 0.52 0.52 0.52 0.52

About half male and half female.

We will now make the correlation matrix.

corrmat=df.corr(method='pearson')
f,ax=plt.subplots(figsize=(12,12))
sns.set(font_scale=1.2)
sns.heatmap(round(corrmat,2),
vmax=1.,square=True,
cmap="gist_gray",annot=True)


There appears to be no major problems with correlations. The last thing we will do is set up our train and test datasets.

X=df[['exper','school','wage']]
y=df['male']
X_train,X_test,y_train,y_test=train_test_split(X,y,
test_size=.2, random_state=50)

We can now move to model development.

Model Development

To create our model we will make an instance of the quadratic discriminant analysis classifier and use the .fit() method.

qda_model=QDA()
qda_model.fit(X_train,y_train)

There are some descriptive statistics that we can pull from our model. For our purposes, we will look at the group means, shown below.

        exper  school  wage
Female   7.73   11.84  5.14
Male     8.28   11.49  6.38

You can see from the table that men generally have more experience and higher wages, but slightly less education.
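The group means above can be reproduced with a groupby on the training data. Here is a sketch on a toy frame, since the full dataset is not reproduced in this post:

```python
import pandas as pd

# Toy stand-in with the same columns the model uses
df = pd.DataFrame({'sex':   ['female', 'female', 'male', 'male'],
                   'exper': [7, 8, 8, 9],
                   'wage':  [5.0, 5.2, 6.3, 6.5]})

# Mean of each numeric column within each category of the outcome
means = df.groupby('sex').mean()
print(means.loc['male', 'exper'])  # 8.5
```

On the real frame, `df.groupby('sex').mean()` yields the table shown above.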

We will now use the qda_model we created to predict the classifications for the training set. This information will be used to make a confusion matrix.

y_pred=qda_model.predict(X_train)
cm = confusion_matrix(y_train, y_pred)
ax= plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The information in the upper-left corner is the number of people who were female and correctly classified as female. The lower-right corner is for the men who were correctly classified as men. The upper-right corner is females who were classified as male. Lastly, the lower-left corner is males who were classified as females. Below is the actual accuracy of our model.

round(accuracy_score(y_train, y_pred),2)
Out[256]: 0.6

Sixty percent accuracy is not that great. However, we will now move to model testing.

Model Testing

Model testing involves using the .predict() method again but this time with the testing data. Below is the prediction with the confusion matrix.

y_pred=qda_model.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
ax= plt.subplots(figsize=(10,10))
sns.set(font_scale=3.4)
with sns.axes_style('white'):
    sns.heatmap(cm, cbar=False, square=True, annot=True, fmt='g',
                cmap=ListedColormap(['gray']), linewidths=2.5)


The results seem similar. Below is the accuracy.

round(accuracy_score(y_test, y_pred),2)
Out[259]: 0.62

About the same; our model generalizes, even though it performs somewhat poorly.

Conclusion

This post provided an explanation of how to do a quadratic discriminant analysis using Python. This is just another potential tool that may be useful for the data scientist.

Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a data scientist, with as much as 60% of their time devoted to it. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website Kaggle. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.

  1. Load the libraries and dataset
  2. Deal with missing data
  3. Some descriptive stats
  4. Normality check
  5. Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here.

Load Libraries and Data

Here are some packages we will need

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics

You can load the data with the code below

df_train=pd.read_csv('/application_train.csv')

You can examine what variables are available with the code below. This is not displayed here because it is rather long.

df_train.columns
df_train.head()

Missing Data

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
 
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)

You can use the .shape attribute if you want to see how many variables are left.

Data Description & Visualization

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

round(df_train['AMT_CREDIT'].describe())
Out[8]: 
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0

sns.distplot(df_train['AMT_CREDIT'])


round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]: 
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL'])


I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables by creating dummy variables so that we can develop a model in the future. Below is the code for dealing with all the categorical variables and converting them to dummy variables.

cat_vars=['NAME_CONTRACT_TYPE','CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY',
          'NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS',
          'NAME_HOUSING_TYPE','ORGANIZATION_TYPE']
for col in cat_vars:
    df_train.groupby(col).count()          # inspect the categories first if desired
    dummy=pd.get_dummies(df_train[col])
    df_train=pd.concat([df_train,dummy],axis=1)
    df_train=df_train.drop([col],axis=1)

You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove one category (the reference level) in order for the model to work properly. Below we did this manually.

df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)
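As an aside, pandas can handle the reference category automatically: get_dummies accepts a drop_first argument that keeps k-1 dummies per variable. A small sketch with a made-up column:

```python
import pandas as pd

# Made-up column standing in for one of the categorical variables above
df = pd.DataFrame({'NAME_CONTRACT_TYPE': ['Cash loans', 'Revolving loans', 'Cash loans']})

# drop_first=True keeps k-1 dummies per variable, avoiding the dummy-variable trap
dummies = pd.get_dummies(df['NAME_CONTRACT_TYPE'], drop_first=True)
print(list(dummies.columns))  # ['Revolving loans']; 'Cash loans' becomes the baseline
```

This saves the manual bookkeeping of deciding which category to drop from each variable.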

Below are some boxplots with the target variable and other variables in the dataset.

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])


There is a clear outlier there. Below is another boxplot with a different variable.

f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])


It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique

corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)


The heatmap is nice but it is hard to really appreciate what is happening. The code below will sort the correlations from least to strongest, so we can remove high correlations.

c = df_train.corr().abs()

s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())

FLAG_DOCUMENT_12 FLAG_MOBIL 0.000005
FLAG_MOBIL FLAG_DOCUMENT_12 0.000005
Unknown FLAG_MOBIL 0.000005
FLAG_MOBIL Unknown 0.000005
Cash loans FLAG_DOCUMENT_14 0.000005
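Note that .head() shows the weakest pairs; to find candidates for removal you want the strongest off-diagonal pairs. A self-contained sketch with synthetic data (the variable names are made up):

```python
import pandas as pd
import numpy as np

# Synthetic frame: 'b' is a near-duplicate of 'a', 'c' is independent
rng = np.random.RandomState(0)
a = rng.normal(size=100)
df = pd.DataFrame({'a': a,
                   'b': a + rng.normal(scale=0.01, size=100),
                   'c': rng.normal(size=100)})

c = df.corr().abs()
s = c.unstack()
# Drop the self-correlations (always 1.0) before sorting
s = s[s.index.get_level_values(0) != s.index.get_level_values(1)]
so = s.sort_values(ascending=False)
print(so.index[0])  # the near-duplicate pair comes out on top
```

Sorting in descending order puts the redundant pairs at the top, which is what you scan when deciding what to drop.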

The list is too long to show here, but the following variables were removed for having a high correlation with other variables.

df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)

Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms.

sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)


This is not normal.

sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)


This is not normal either. We could do transformations, or we can make a non-linear model instead.

Model Development

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below we make our X and y datasets.

X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']

The code below fits our model and makes the predictions.

clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)

Below is the confusion matrix followed by the accuracy

print (pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0                
0       280873  18493
1         1813   6332
accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047

Lastly, we can look at the precision, recall, and f1 score

print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511

This model looks rather good in terms of accuracy on the training set. It is actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy. Keep in mind, however, that performance on the training data tends to be optimistic, and the precision for the minority class is low.

Conclusion

Data exploration and analysis are primary tasks of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this, but the simplistic nature of this analysis yielded strong results.

Teaching Materials

Regardless of what level a teacher teaches at, you are always looking for materials and activities to use in your class. It is always a challenge to have new ideas and activities to support and help students because the world changes. This leads to a constant need to remove old, inefficient activities and bring in fresh new ones. The primary purpose of activities is to provide practical application of the skills taught and experience in school.

For the math teacher, you can naturally make your own math problems. However, this can quickly become tedious. One solution is to employ worksheets made by others that provide growth opportunities for the students without stressing out the teacher.

There are many great websites for this. For example, education.com provides many different types of worksheets to help students. They have some great simple math worksheets like the ones below.


There are many more resources available at education.com as well as other sites. There is no purpose or benefit to reinventing the wheel. Incorporating assignments made by others is a great way to expand the resources you have available without the stress of developing them yourself.

Data Science Pipeline

One of the challenges of conducting a data analysis or any form of research is making decisions. You have to decide primarily two things

  1. What to do
  2. When to do it

People who are familiar with statistics may know what to do but may struggle with timing or when to do it. Others who are weaker when it comes to numbers may not know what to do or when to do it. Generally, it is rare for someone to know when to do something but not know what to do.

In this post, we will look at a process that can be used to perform an analysis in the context of data science. Keep in mind that this is just an example and there are naturally many ways to perform an analysis. The purpose here is to provide some basic structure for people who are not sure of what to do and when. One caveat: this process is focused primarily on supervised learning, which has a clearer beginning, middle, and end in terms of the process.

Generally, there are three steps that probably always take place when conducting a data analysis and they are as follows.

  1. Data preparation (data munging)
  2. Model training
  3. Model testing

Of course, it is much more complicated than this, but this is the minimum. Within each of these steps there are several substeps. However, depending on the context, the substeps can be optional.

There is one pre-step that you have to consider. How you approach these three steps depends a great deal on the algorithm(s) you have in mind to use for developing different models. The assumptions and characteristics of one algorithm are different from another and shape how you prepare the data and develop models. With this in mind, we will go through each of these three steps.

Data Preparation

Data preparation involves several substeps. Some of these steps are necessary, but generally not all of them happen in every analysis. Below is a list of steps at this level.

  • Data munging
  • Scaling
  • Normality
  • Dimension reduction/feature extraction/feature selection
  • Train, test, validation split

Data munging is often the first step in data preparation and involves making sure your data is in a readable structure for your algorithm. This can involve changing the format of dates, removing punctuation/text, changing text into dummy variables or factors, combining tables, splitting tables, etc. This is probably the hardest and most unclear aspect of data science because the problems you will face will be highly unique to the dataset you are working with.

Scaling involves making sure all the variables/features are on the same scale. This is important because most algorithms are sensitive to the scale of the variables/features. Scaling can be done through normalization or standardization. Normalization rescales the variables to a range of 0 – 1. Standardization converts the values in the variable to their respective z-scores. Which one you use depends on the situation, but some form of scaling is normally expected.
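The two options can be sketched with scikit-learn's scalers (one common way to do this):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Toy single-feature data
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])

# Normalization: rescale to the 0-1 range
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: convert each value to its z-score
X_std = StandardScaler().fit_transform(X)

print(X_norm.min(), X_norm.max())  # endpoints of the 0-1 range
print(X_std.mean(), X_std.std())   # mean 0, standard deviation 1
```

Both scalers are fit on the training data and then applied to new data with .transform(), which matters once the data is split.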

Normality is often an optional step because there are so many variables that can be involved with big data and data science in a given project. However, when fewer variables are involved, checking for normality is doable with a few tests and some visualizations. If normality is violated, various transformations can be used to deal with this problem. Keep in mind that many machine learning algorithms are robust against the influence of non-normal data.

Dimension reduction involves reducing the number of variables that will be included in the final analysis. This is often done through factor analysis or principal component analysis. This reduction in the number of variables is also an example of feature extraction. In some contexts, feature extraction is the goal in itself. Some algorithms make their own features, such as neural networks through the use of hidden layers.
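A minimal sketch of dimension reduction with principal component analysis, on synthetic data built from two underlying factors:

```python
import numpy as np
from sklearn.decomposition import PCA

# 100 examples, 6 correlated features generated from 2 underlying factors
rng = np.random.RandomState(1)
factors = rng.normal(size=(100, 2))
X = factors @ rng.normal(size=(2, 6)) + rng.normal(scale=0.1, size=(100, 6))

# Extract 2 components from the 6 original features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                        # (100, 2)
print(pca.explained_variance_ratio_.sum())    # close to 1: two components suffice
```

Because the six features were built from two factors, two components recover most of the variance; with real data, the explained-variance ratio guides how many components to keep.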

Feature selection is the process of determining which variables to keep for future analysis. This can be done through the use of regularization or, in smaller datasets, with subset regression. Whether you extract or select features depends on the context.

After all this is accomplished, it is necessary to split the dataset. Traditionally, the data was split in two. This led to the development of a training set and a testing set. You trained the model on the training set and tested the performance on the test set.

However, now many analysts split the data into three parts to avoid overfitting to the test set. There is now a training set, a validation set, and a testing set. The validation set allows you to check the model's performance several times. Once you are satisfied, you use the test set once at the end.
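One common way to get the three-way split is to call scikit-learn's train_test_split twice; the exact proportions here are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset of 100 examples
X = np.arange(100).reshape(100, 1)
y = np.arange(100)

# First split off the test set, then carve a validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# 0.25 of the remaining 80% gives a 60/20/20 split overall
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```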

Once the data is prepared, which again is perhaps the most difficult part, it is time to train the model.

Model training

Model training involves several substeps as well

  1. Determine the metric(s) for success
  2. Creating a grid of several hyperparameter values
  3. Cross-validation
  4. Selection of the most appropriate hyperparameter values

The first thing you have to do, and this is probably required, is to determine how you will know if your model is performing well. This involves selecting a metric. It can be accuracy for classification, mean squared error for a regression model, or something else. What you pick depends on your goals. You use these metrics to determine the best algorithm and hyperparameter settings.

Most algorithms have some sort of hyperparameter(s). A hyperparameter is a value or estimate that the algorithm cannot learn and must be set by you. Since there is no way of knowing what values to select it is common practice to have several values tested and see which one is the best.

Cross-validation is another consideration. Using cross-validation allows you to stabilize the results by averaging the model's performance over several folds of the data if you are using k-fold cross-validation. This also helps to improve the hyperparameter estimates. There are several types of cross-validation, but k-fold is probably best initially.

The metric, hyperparameters, and cross-validation settings are usually put into a grid that then runs the model. Whether you are using R or Python, the printout will tell you which combination of hyperparameters is the best based on the metric you chose.
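A sketch of such a grid using scikit-learn's GridSearchCV with k-fold cross-validation (the dataset and hyperparameter values are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Grid of candidate hyperparameter values, scored by accuracy with 5-fold CV
param_grid = {'n_neighbors': [1, 3, 5, 7, 9]}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, scoring='accuracy', cv=5)
grid.fit(X, y)

print(grid.best_params_)  # the winning n_neighbors value
print(grid.best_score_)   # mean cross-validated accuracy for that value
```

The grid object then behaves like a fitted model using the best hyperparameters, so it can be carried forward to validation.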

Validation set

When you know what your hyperparameters are, you can now move your model to validation or straight to testing. If you are using a validation set, you assess your model's performance on this new data. If the results are satisfying based on your metric, you can move to testing. If not, you may move back and forth between training and the validation set, making the necessary adjustments.

Test set

The final step is testing the model. You want to use the testing dataset as little as possible. The purpose here is to see how your model generalizes to data it has not seen before. There is little turning back after this point as there is an intense danger of overfitting now. Therefore, make sure you are ready before playing with the test data.

Conclusion

This is just one approach to conducting data analysis. Keep in mind the need to prepare data, train your model, and test it. This is the big picture of a somewhat complex process.

Bagging Classification with Python

Bootstrap aggregation, aka bagging, is a technique used in machine learning that relies on resampling from the sample and running multiple models from the different samples. The mean or some other value is calculated from the results of each model. For example, if you are using decision trees, bagging would have you run the model several times with several different subsamples to help deal with variance in the statistics.

Bagging is an excellent tool for algorithms that are considered weaker or more susceptible to variances such as decision trees or KNN. In this post, we will use bagging to develop a model that determines whether or not people voted using the turnout dataset. These results will then be compared to a model that was developed in a traditional way.

We will use the turnout dataset available in the pydataset module. Below is some initial code.

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

We will load our dataset. Then we will separate the independent and dependent variables from each other and create our train and test sets. The code is below.

df=data("turnout")
X=df[['age','educate','income']]
y=df['vote']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)

We can now prepare to run our model. We need to first set up the bagging function. There are several arguments that need to be set. The max_samples argument determines the largest amount of the dataset to use in resampling. The max_features argument is the max number of features to use in a sample. Lastly, the n_estimators is for determining the number of subsamples to draw. The code is as follows

 h=BaggingClassifier(KNeighborsClassifier(n_neighbors=7),max_samples=0.7,max_features=0.7,n_estimators=1000)

Basically, what we told Python was to use up to 70% of the samples, 70% of the features, and make 1,000 different KNN models that each use seven neighbors to classify. Now we run the model with the fit function, make a prediction with the predict function, and check the accuracy with the classification_report function.

h.fit(X_train,y_train)
y_pred=h.predict(X_test)
print(classification_report(y_test,y_pred))


This looks okay. Below are the results when you do a traditional model without bagging.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))


The improvement is not much. However, this depends on the purpose and scale of your project. A small improvement can mean millions in the right context, such as for large companies like Google that deal with billions of people per day.

Conclusion

This post provided an example of the use of bagging in the context of classification. Bagging provides a way to improve your model through the use of resampling.

Support Vector Machines Classification with Python

Support vector machines (SVM) are algorithms used to fit non-linear models. The details are complex, but to put it simply, SVM tries to create the largest boundaries possible between the various groups it identifies in the sample. The mathematics behind this is complex, especially if you are unfamiliar with vectors as defined in linear algebra.

This post will provide an example of SVM using Python broken into the following steps.

  1. Data preparation
  2. Model Development

We will use two different kernels in our analysis: the linear kernel and the rbf kernel. The difference between the kernels has to do with how the boundaries between the different groups are made.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. We want to predict if someone is single or not. Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection

We now need to load our dataset and remove any missing values.

df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()


Looking at the dataset we need to do something with the variables that have text. We will create dummy variables for all except region and hlth. The code is below.

dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)

For each variable, we did the following

  1. Created a dummy in the dummy dataset
  2. Combined the dummy variable with our df dataset
  3. Renamed the dummy variable based on yes or no
  4. Dropped the other dummy variable from the dataset, since Python creates two dummies instead of one.

If you look at the dataset now you will see a lot of variables that are not necessary. Below is the code to remove the information we do not need.

df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df.head()


This is much cleaner. Now we need to scale the data. This is because SVM is sensitive to scale. The code for doing this is below.

df = (df - df.min()) / (df.max() - df.min())
df.head()


We can now create our dataset with the independent variables and a separate dataset with our dependent variable. The code is as follows.

X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','faminc','black_person','Male','job','insured']]
y=df['single']

We can now move to model development

Model Development

We need to make our test and train sets first. We will use a 70/30 split.

X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)

Now, we need to create the models or the hypothesis we want to test. We will create two hypotheses. The first model is using a linear kernel and the second is one using the rbf kernel. For each of these kernels, there are hyperparameters that need to be set which you will see in the code below.

h1=svm.LinearSVC(C=1)
h2=svm.SVC(kernel='rbf',degree=3,gamma=0.001,C=1.0)
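The models above still need to be fitted and evaluated to produce the results discussed below; the exact evaluation code is not shown in this post, so here is a hedged sketch of that step, substituting scikit-learn's built-in breast cancer data for OFP:

```python
from sklearn import svm
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Built-in dataset standing in for OFP; same min-max scaling as above
X, y = load_breast_cancer(return_X_y=True)
X = MinMaxScaler().fit_transform(X)  # SVM is sensitive to scale
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

h1 = svm.LinearSVC(C=1)
h2 = svm.SVC(kernel='rbf', degree=3, gamma=0.001, C=1.0)

# Fit each model, predict on the test set, and print the metrics
for model in (h1, h2):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(type(model).__name__, round(accuracy_score(y_test, y_pred), 2))
    print(classification_report(y_test, y_pred))
```

The numbers will differ from the OFP results in the post since the dataset is different, but the fit/predict/report pattern is the same.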

The details about the hyperparameters are beyond the scope of this post. Below are the results for the first model.


The overall accuracy is 73%. The crosstab() function provides a breakdown of the results and the classification_report() function provides other metrics related to classification. In this situation, 0 means not single (married) while 1 means single. Below are the results for model 2.


You can see the results are similar, with the first model having a slight edge. The second model really struggles with predicting people who are actually single. You can see that the recall in particular is really poor.

Conclusion

This post showed how to use SVM in Python. How this algorithm works can be somewhat confusing. However, it can be powerful if used appropriately.

Multiple Regression in Python

In this post, we will go through the process of setting up a regression model with a training and testing set using Python. We will use the insurance dataset from kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

  1. Data preparation
  2. Model training
  3. Model testing

Data Preparation

Below is a list of the modules we will need in order to complete the analysis

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model,model_selection, feature_selection,preprocessing
import statsmodels.formula.api as sm
from statsmodels.tools.eval_measures import mse
from statsmodels.tools.tools import add_constant
from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the  .read_csv function from pandas to load the data and .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')


We need to create some dummy variables for sex, smoker, and region. We will address that in a moment; right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.


The descriptives are left for your own interpretation. As for the correlations, they are generally weak which is an indication that regression may be appropriate.
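The descriptive stats and correlations step might look like the following on a tiny stand-in frame (the values are invented; the column names follow the kaggle insurance data):

```python
import pandas as pd

# Stand-in for the insurance data's continuous columns (values invented)
insure = pd.DataFrame({
    'age':      [19, 33, 28, 45, 62],
    'bmi':      [27.9, 22.7, 33.0, 25.7, 26.3],
    'children': [0, 1, 3, 2, 0],
    'charges':  [16884.92, 1725.55, 4449.46, 8240.59, 27808.73],
})

print(insure.describe())  # count, mean, std, min, quartiles, max per column
print(insure.corr())      # pairwise Pearson correlations
```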

As mentioned earlier, we need to make dummy variables for sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

  1. Use the pd.get_dummies function from pandas to create the dummy
  2. Save the dummy variable in an object called ‘dummy’
  3. Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
  4. Repeat this three times

Below is the code for doing this

dummy=pd.get_dummies(insure['sex'])
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['smoker'])
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['region'])
insure=pd.concat([insure,dummy],axis=1)
insure.head()


The .get_dummies function requires the name of the dataframe and, in the brackets, the name of the variable to convert. The .concat function requires the names of the two datasets to combine as well as the axis on which to perform it.

We now need to remove the original text variables from the dataset. In addition, we need to remove the y variable “charges” because this is the dependent variable.

y = insure.charges
insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

Model Training

Our train and test sets are made with the model_selection.train_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can now run our regression analysis. This requires the use of the .OLS function from the statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above inside the parentheses, we put the dependent variable(y_train) and the independent variables (X_train). However, we had to use the function add_constant to get the intercept for the output. All of this information is then used inside the .fit() function to fit a model.

To see the output you need to use the .summary() function as shown below.

answer.summary()


The assumption is that you know regression but are reading this post to learn Python. Therefore, we will not go into great detail about the results. The r-squared is strong; however, region and gender are not statistically significant.

We will now move to model testing

Model Testing

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values with the model we made with the new data. This is shown in the code below

ypred=answer.predict(add_constant(X_test))

We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.

print(mse(ypred,y_test))
33678660.23480476

For our final trick, we will make a scatterplot with the predicted and actual values of the test set. In addition, we will calculate the correlation of the predicted values and the test set values. This is an alternative metric for assessing a model.

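The code that produced the plot and correlation appears only as a screenshot in the original, so here is a hedged sketch of what it might look like, using synthetic predictions in place of the regression output (all numbers illustrative):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-ins for the test-set actual charges and the model's predictions
rng = np.random.RandomState(0)
y_test = pd.Series(rng.normal(10000, 3000, size=50), name='actual')
ypred = pd.Series(y_test.values + rng.normal(0, 1000, size=50), name='predicted')

# Scatterplot of predicted vs. actual values
plt.scatter(ypred, y_test)
plt.xlabel('predicted')
plt.ylabel('actual')

# Correlation between predicted and actual values via .concat() and .corr()
both = pd.concat([ypred, y_test], axis=1)
print(round(both.corr().loc['predicted', 'actual'], 2))
```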

The plot is made with matplotlib, and the correlation is calculated by combining the predicted and actual values with the .concat() function before calling .corr(). The correlation is high at 0.86, which indicates the model is good at accurately predicting the values. This is confirmed by the scatterplot, which is almost a straight line.

Conclusion

In this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and tested the model with an evaluation of it.

Working with a Dataframe in Python

In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the tasks we will complete include the following…

  • Import data
  • Examine data
  • Work with strings
  • Calculating descriptive statistics

Import Data 

First, you need data, therefore, we will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.

import pandas as pd
df=pd.read_csv('FILE LOCATION HERE')

In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to be typed in quotes inside the parentheses. Having completed this, we can now examine the data.

Data Examination

Now we want to get an idea of the size of our dataset and any problems with missing data. To determine the size we use the .shape attribute as shown below.

df.shape
Out[33]: (891, 12)

Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many, we can use .isnull() in combination with .value_counts().

df['Cabin'].isnull().value_counts()
Out[36]: 
True     687
False    204
Name: Cabin, dtype: int64

The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see we have 687 missing examples and only 204 complete ones. For categorical variables, you can also see how many examples are part of each category as shown below.

df['Embarked'].value_counts()
Out[39]: 
S    644
C    168
Q     77
Name: Embarked, dtype: int64

This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. Then we will check the size of the dataframe again with the .shape attribute.

df=df.dropna(how='any')
df.shape
Out[40]: (183, 12)

You can see our dataframe is much smaller, going from 891 examples to 183. We can now move to other operations such as dealing with strings.

Working with Strings

What you do with strings really depends on your goals. We are going to look at extraction, subsetting, and determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.

df['Name'][0:5].str.extract('(\w+)')
Out[44]: 
1 Cumings
3 Futrelle
6 McCarthy
10 Sandstrom
11 Bonnell
Name: Name, dtype: object

As you can see we got the last names of the first five examples. We did this by using the following format…

dataframe[‘Variable Name’][rows].str.function(‘pattern’)

.str is a function for dealing with strings in dataframes. The .extract() function does what its name implies.

If you want, you can even determine how many letters each name is. We will do this with the .str and .len() function on the first five names in the dataframe.

df['Name'][0:5].str.len()
Out[64]: 
1 51
3 44
6 23
10 31
11 24
Name: Name, dtype: int64

Hopefully, the code is becoming easier to read and understand.

Aggregation

We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes, so we will run all of them at once. Below we are calculating the mean, max, minimum, and standard deviation for the price of a fare on the Titanic.

df['Fare'].mean()
Out[77]: 78.68246885245901

df['Fare'].max()
Out[78]: 512.32920000000001

df['Fare'].min()
Out[79]: 0.0

df['Fare'].std()
Out[80]: 76.34784270040574

Conclusion

This post provided you with some ways in which you can maneuver around a dataframe in Python.

Numpy Arrays in Python

In this post, we are going to explore arrays created by the numpy package in Python. Understanding how arrays are created and manipulated is useful when you need to perform complex coding and/or analysis. In particular, we will address the following.

  1. Creating and exploring arrays
  2. Math with arrays
  3. Manipulating arrays

Creating and Exploring an Array

Creating an array is simple. You need to import the numpy package and then use the np.array function to create the array. Below is the code.

import numpy as np
example=np.array([[1,2,3,4,5],[6,7,8,9,10]])

Making an array requires the use of square brackets. If you want multiple dimensions or columns, then you must use inner square brackets. In the example above I made an array with two dimensions, and each dimension has its own set of brackets.

Also, notice that we imported numpy as np. This is a shorthand so that we do not have to type the word numpy but only np. In addition, we now created an array with ten data points spread in two dimensions.

There are several functions you can use to get an idea of the size of a data set. Below is a list with the function and explanation.

  • .ndim = number of dimensions
  • .shape = gives the number of rows and columns
  • .size = counts the number of individual data points
  • .dtype.name = tells you the data type

Below is code that uses all four of these functions with our array.

example.ndim
Out[78]: 2

example.shape
Out[79]: (2, 5)

example.size
Out[80]: 10

example.dtype.name
Out[81]: 'int64'

You can see we have 2 dimensions. The .shape function tells us we have 2 dimensions and 5 examples in each one. The .size function tells us we have 10 total examples (5 * 2). Lastly, the .dtype.name function tells us that this is an integer data type.

Math with Arrays

All mathematical operations can be performed on arrays. Below are examples of addition, subtraction, multiplication, and conditionals.

example=np.array([[1,2,3,4,5],[6,7,8,9,10]]) 
example+2
Out[83]: 
array([[ 3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12]])

example-2
Out[84]: 
array([[-1,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  8]])

example*2
Out[85]: 
array([[ 2,  4,  6,  8, 10],
       [12, 14, 16, 18, 20]])

example<3
Out[86]: 
array([[ True,  True, False, False, False],
       [False, False, False, False, False]], dtype=bool)

Each number inside the example array was manipulated as indicated. For example, when we typed example + 2, all the values in the array increased by 2. Lastly, example < 3 tells Python to look inside the array and mark which values are less than 3, returning True or False for each one.
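The boolean array produced by a comparison can itself be used as an index. This short sketch filters the array down to just the values where the comparison is True.

```python
import numpy as np

example = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# A comparison returns a boolean mask; using the mask as an index
# pulls out only the values where the mask is True
mask = example < 3
print(example[mask])  # [1 2]
```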

Manipulating Arrays

There are also several ways you can manipulate or access data inside an array. For example, you can pull a particular element in an array by doing the following.

example[0,0]
Out[92]: 1

The information in the brackets tells Python to access the first row and the first number in that row. Recall that Python starts counting from 0. You can also access a range of values using the colon, as shown below.

example=np.array([[1,2,3,4,5],[6,7,8,9,10]]) 
example[:,2:4]
Out[96]: 
array([[3, 4],
       [8, 9]])

In this example, the colon before the comma means take every row, so both rows are used. After the comma we have 2:4, which means take the 3rd and 4th values but not the 5th.

It is also possible to turn a multidimensional array into a single dimension with the .ravel() function and to transpose with the .transpose() function. Below is the code for each.

example=np.array([[1,2,3,4,5],[6,7,8,9,10]]) 
example.ravel()
Out[97]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

example.transpose()
Out[98]: 
array([[ 1,  6],
       [ 2,  7],
       [ 3,  8],
       [ 4,  9],
       [ 5, 10]])

You can see the .ravel() function made a one-dimensional array. The .transpose() function swapped the rows and columns, turning our 2 x 5 array into a 5 x 2 array.

Conclusion

We now have a basic understanding of how numpy arrays work in Python. As mentioned before, this is valuable information when wrestling with different data science questions.

Lists in Python

Lists allow you to organize information. In the real world, we make lists all the time to keep track of things. The same concept applies in Python. A list is a sequence of stored data. By sequence, we mean a data structure that allows multiple items to exist in a single storage unit. By making lists, we are telling the computer how to store the data in its memory.

In this post, we learn the following about lists:

  • How to make a list
  • Accessing items in a list
  • Looping through a list
  • Modifying a list

Making a List

Making a list is not difficult at all. To make one you first create a variable name followed by the equal sign and then place your content inside square brackets. Below is an example of two different lists.

numList=[1,2,3,4,5]
alphaList=['a','b','c','d','e']
print(numList,alphaList)
[1, 2, 3, 4, 5] ['a', 'b', 'c', 'd', 'e']

Above we made two lists, a numeric and a character list, and then printed both. In general, you want a list to have similar items, such as all numbers or all characters. This makes it easier to recall what is in them than if you mixed them. However, Python can handle mixed lists as well.
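As a quick sketch of a mixed list, the hypothetical example below stores an integer, a string, a float, and a boolean in a single list and prints the type of each item.

```python
# A list can mix data types, though keeping items uniform is easier to manage
mixedList = [1, 'a', 2.5, True]

for item in mixedList:
    print(type(item).__name__)  # int, str, float, bool
```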

Access a List

Accessing individual items in a list is the same as for a string. Just use brackets with the index that you want. Below are some examples.

numList=[1,2,3,4,5]
alphaList=['a','b','c','d','e']

numList[0]
Out[255]: 1

numList[0:3]
Out[256]: [1, 2, 3]

alphaList[0]
Out[257]: 'a'

alphaList[0:3]
Out[258]: ['a', 'b', 'c']

numList[0] gives us the first value in the list. numList[0:3] gives us the first three values. This is repeated with the alphaList as well.
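One related trick worth knowing is that negative indexes count from the end of the list. Below is a short sketch using the same numList.

```python
numList = [1, 2, 3, 4, 5]

# Negative indexes count backward from the end of the list
print(numList[-1])   # 5, the last item
print(numList[-2:])  # [4, 5], the last two items
```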

Looping through a List

A list can be looped through as well. Below is a simple example.

for item in numList :
    print(item)


for item in alphaList :
    print(item)


1
2
3
4
5
a
b
c
d
e

By making the two for loops above we are able to print all of the items inside each list.
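If you also need each item's position while looping, Python's built-in enumerate() function pairs the index with the value. Below is a short sketch with alphaList.

```python
alphaList = ['a', 'b', 'c', 'd', 'e']

# enumerate() yields (index, item) pairs as the loop runs
for index, item in enumerate(alphaList):
    print(index, item)  # 0 a, 1 b, 2 c, 3 d, 4 e
```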

Modifying a List

There are several functions for modifying lists. Below are a few

The append() function adds a new item to the end of the list.

numList.append(9)
print(numList)
alphaList.append('h')
print(alphaList)

[1, 2, 3, 4, 5, 9]
['a', 'b', 'c', 'd', 'e', 'h']

You can see our lists now have one new member each at the end.

You can also remove the last member of a list with the pop() function.

numList.pop()
print(numList)
alphaList.pop()
print(alphaList)
[1, 2, 3, 4, 5]
['a', 'b', 'c', 'd', 'e']

By using the pop() function we have returned our lists to their original size.
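One detail worth noting is that pop() does not just remove the last member; it also returns it, so the value can be saved. Below is a short sketch.

```python
numList = [1, 2, 3, 4, 5]

# pop() removes AND returns the last item
last = numList.pop()
print(last)     # 5
print(numList)  # [1, 2, 3, 4]
```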

Another trick is to merge lists together with the extend() function. For this, we will merge the same list with itself. This will cause the list to contain duplicates of all of its original values.

numList.extend(numList)
print(numList)
alphaList.extend(alphaList)
print(alphaList)

[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e']

All the values in each list have been duplicated. Finally, you can sort a list using the sort() function.

numList.sort()
print(numList)
alphaList.sort()
print(alphaList)
[1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e']

Now all the numbers and letters are sorted.
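If you want a sorted copy without changing the original list, the built-in sorted() function returns a new list and also accepts a reverse argument. Below is a short sketch.

```python
numList = [3, 1, 2]

# sorted() returns a new sorted list and leaves the original unchanged,
# unlike the in-place sort() method
newList = sorted(numList, reverse=True)
print(newList)  # [3, 2, 1]
print(numList)  # [3, 1, 2]
```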

Conclusion

There is much more that could be done with lists. However, the purpose here was just to cover some of the basic ways that lists can be used in Python.

Introducing Google Classroom

Google Classroom is yet another player in the learning management system industry. This platform provides most of the basics that are expected in an LMS. This post is not a critique of Google Classroom. Rather, the focus here is on how to use it. It is better for you to decide for yourself about the quality of Google Classroom.

In this post, we will learn how to set up a class in order to prepare the learning experience.

Before we begin, it is assumed that you have a Gmail account, as this is needed to access Google Classroom. In addition, this demonstration is from an individual account and not through the institutional account that a school would set up with Google if it adopted Google Classroom.

Creating a Google Class

Once you are logged in to your Gmail account you can access Google Classroom by clicking on the little gray squares in the upper right-hand corner of your browser. Doing so will show the following.


In the example above, Google Classroom is the icon in the bottom row in the middle. When you click on it you will see the following.


You might see a screen before this asking if you are a student or teacher. In the screen above, Google tells you where to click to make your first class. Therefore, click on the plus sign and click on “create class” and you will see the following.


Click on the box which promises Google you will only use your classroom with adults. After this, you will see a dialog box where you can give your class a name as shown below.


Give your course a name and click “create”. Then you will see the following.

There is a lot of information here. The name of the class is at the top followed by the name of the teacher below. In the middle of the page, you have something called the “stream”. This is where most of the action happens in terms of posting assignments, leading discussions, and making announcements. To the left are some options for dealing with the stream, a calendar, and a way to organize information in the stream by topic.

The topic feature is valuable because it allows you to organize information in a way similar to topics in Moodle. When creating an activity just be sure to assign it to a topic so students can see expectations for that week’s work. This will be explained more in the future.

One thing that was not mentioned was the tabs at the very top of the screen.


We started in the “stream” tab. If you click on the “students” tab you will see the following.


The “invite students” button allows you to add students by typing their email. To the left, you have the class code. This is the code people need in order to add your course.

If you click on the “about” tab you will see the following.


Here you can access the drive where all files are saved, the class calendar, your Google calendar, and even invite other teachers. In the middle, you can edit the information about the course as well as add materials that the students will need. This page is useful because it is not dynamic like the stream page; posted files stay easy to find when using the “about” page.

Conclusion

Google Classroom is not extremely difficult to learn. You can set up a course with minimal computer knowledge in less than 20 minutes. The process shared here was simply the creation of a course. In a future post, we will look at how to set up teaching activities and other components of a balanced learning experience.

Luther and Educational Reform

Martin Luther (1483-1546) is best known for his religious work as one of the main catalysts for the Protestant Reformation. However, Luther was also a powerful influence on education during his lifetime. This post will take a look at Luther’s early life and his contributions to education.

Early Life

Luther was born during the late 15th century. His father was a tough miner with a severe disciplinarian streak. You would think that this would be a disaster but rather the harsh discipline gave Luther a toughness that would come in handy when standing alone for his beliefs.

Upon reaching adulthood, Luther studied law, as his father desired for him to become a lawyer. However, Luther decided instead to become a monk, much to the consternation of his father.

As a monk, Luther was a diligent student and studied for several additional degrees. Eventually, he was given an opportunity to visit Rome, which was the headquarters of his church. However, Luther saw things there that troubled him and in many ways laid the foundation for his doubt in the direction of his church.

Eventually, Luther had a serious issue with several church doctrines. This motivated him to nail his 95 theses onto the door of a church in 1517. This act was a challenge to defend the statements in the theses and was actually a common behavior among the scholarly community at the time.

For the next several years it was a back-and-forth intellectual battle with the church. A common pattern was that the church would use some form of psychological pressure, such as the eternal damnation of his soul, and Luther would ask for biblical evidence, which was normally not given. Finally, in 1521 at the Diet of Worms, Luther was forced to flee for his life, and the Protestant Reformation had in many ways begun.

Views on Education

Luther’s views on education would not be considered radical or innovative today, but they were during his lifetime. For our purposes, we will look at three tenets of Luther’s position on education:

  • People should be educated so they can read the scriptures
  • Men and women should receive an education
  • Education should benefit the church and state

People Should be Educated so they Can Read the Scriptures

The thought that everyone should be educated was rather radical. By education, we mean developing literacy skills and not some form of vocational training. Education was primarily for those who needed it, which normally meant the clergy, merchants, and some of the nobility.

If everyone were able to read, it would significantly weaken the church’s position to control spiritual ideas and the state’s ability to maintain secular control, which is one reason why widespread literacy was uncommon. Luther’s call for universal education would not truly be repeated until Horace Mann and the common school movement.

The idea of universal literacy also held with it a sense of personal responsibility. No one could rely on another to understand scripture. Everyone needs to know how to read and interpret scripture for themselves.

Men and Women Should be Educated

The second point is related to the first. When Luther said that everyone should be educated, he truly meant everyone: both men and women should learn literacy. A woman could not hide behind a man for her spiritual development but needed to read for herself.

Again, the idea of educating women was controversial at the time. The Greeks, for example, believed that educating women was embarrassing, although this view was not shared by all in any manner.

Women were not only educated for spiritual reasons but also so they could manage the household. Therefore, there was a spiritual and a practical purpose to the education of women for Luther.

Education Benefits the Church and the State

As mentioned above, education had been neglected in part to maintain the power of the church and state. For Luther, however, educated citizens would be of greater benefit to both the church and the state.

The rationale is that the church would receive ministers, teachers, pastors, etc. and the state would receive future civil servants. Therefore, education would not tear down society but would rather build it up.

Conclusion

Luther was primarily a reformer but also was a powerful force in education. His plea for the development of education in Germany led to the construction of schools all over the Protestant controlled parts of Germany. His work was of such importance that he has been viewed as one of the leading educational reformers of the 16th century.