Author Archives: Dr. Darrin

Artificial Intelligence in the Classroom

In 1990, a little-known film called “Class of 1999” came out. In this movie, three military-grade robots are placed in an inner-city, war-zone school with the job of teaching.

As with any movie, things quickly get out of hand and the robot teachers begin killing the naughty students and eventually manipulating the local gangs into fighting and killing each other. Eventually, in something that can only happen in a movie, three military-grade AI robots similar to the Terminator are destroyed by a delinquent teenager.

There has been a lot of hype and excitement over the growth of data science, machine learning, and artificial intelligence. With this growth, these ideas have begun to expand into supporting education. This has even led to speculation among some that algorithms and artificial intelligence could replace teachers in the classroom.

There are several reasons why this is unlikely. My reasons are as follows.

  • People Need People
  • Computers Need People
  • Computers Help People

People Need People

When it comes to education, people need people. Originally, education was often passed on through apprenticeship for trades and one-on-one tutoring for elites. There has always been some form of mass education, but it has always involved people helping each other.

There are certain social-emotional needs that people have that cannot be satisfied by even the most life-like machine. When humans communicate, they always convey some form of emotion, even the most hardened, computer-like individual. Although AI is making strides in attempting to read emotions, machines are far from convincingly portraying them. Besides, students want someone who can laugh, joke, smile, and do all those little things that involve being human. Even such mundane things as tripping over one’s shoes or forgetting someone’s name add a human element to the learning experience.

Furthermore, even if a computer is able to share emotions in a human-like manner what child would really feel satisfaction from pleasing an Amazon Alexa? People need people and AI teachers cannot provide this even if they can provide top-level content.

Another concern is that people are highly unpredictable. Again, this relates to the emotional aspects of human nature. Even humans who have the same emotional characteristics are surprised by the behavior of fellow humans. When an algorithm is coldly calculating what is an appropriate action this inability to deal with unpredictable actions can be a problem.

A classic example of this is classroom management. If a student is not paying attention, not doing their work, or showing defiance in one way or another, how would a computer handle this? In the movie “Class of 1999” the answer for disruptive behavior was to kill. Few parents and administrators would approve of such an action coming from an artificial neural network.

People need people in the context of education for the socio-emotional aspect of education as well as for the tribulation of classroom management. Computers are not humans and therefore they cannot provide the motivation or inspiration that so many students need to be successful in school.

Computers Need People

A second reason AI teachers are unlikely is that computers need people. Computers break down, there are bugs in code, updates have to be made, etc. All this precludes a machine from going completely independent. With everything that can go wrong, there have to be people there to monitor the use and interaction of machines with people.

Even in the movie “Class of 1999” there was a research professor and administrator monitoring the school. This continued until they were killed by the AI teachers.

With all the amazing advances in AI and machine learning, it is still people who tweak the algorithms and provide the data for the machine to learn from. After this is done, the algorithm is still monitored to see how it performs. Computers cannot escape their reliance on humans to maintain their functionality, which implies that they cannot be turned loose in a classroom alone.

Computers Help People

The way forward is that perhaps AI and other aspects of machine learning and data science can support teachers to be better teachers. For example, in some versions of Moodle there is an algorithm that will monitor students’ participation and predict if students are at risk of failing. There is also an algorithm that predicts if a teacher is teaching. This is an excellent use of machine learning in that it deals with a routine task and simply flags a problem rather than trying to intervene itself.

Another useful application more in line with AI is tutoring, which involves providing feedback on progress and adjusting what the student does based on their performance. Again, in a supporting role, AI can be excellent. The problem is when AI becomes the leader.


The advances in technology are going to continue. However, even with the amazing breakthroughs in this field, people still need interaction with other people and the example of others in a social context. Computers will never be able to provide this. Computers also need the constant support of humans in order to function. The proper role for AI and data science in education may be as a supporter to a teacher rather than the one leading and making critical decisions about other people.


Computational Thinking

Computational thinking is the process of expressing a problem in a way that a computer can solve. In general, there are four ways that computational thinking can be done. These four ways are decomposition, pattern recognition, abstraction, and algorithmic thinking.

Although computational thinking is dealt with in the realm of computer science, everyone thinks computationally at one time or another, especially in school. Awareness of these subconscious strategies can help people to know how they think at times as well as to be aware of the various ways in which thinking is possible.


Decomposition

Decomposition is the process of breaking a large problem down into smaller and smaller parts or problems. The benefit of this is that by addressing all of the created little problems you can solve the large problem.

In education, decomposition can show up in many ways. Teachers often have to break goals down into objectives, and sometimes down into procedures in a daily lesson plan. Seeing the big picture of the content students need and breaking it down into pieces that students can comprehend, such as with chunking, is critical to education.

For the student, decomposition involves breaking down the parts of a project such as writing a paper. The student has to determine what to do and how it helps to achieve the completion of their project.

Pattern Recognition

Pattern recognition refers to seeing how various aspects of a problem have things in common. For a teacher, this may involve the development of a thematic unit. Developing such a unit requires the teacher to see what various subjects or disciplines have in common as they try to create the thematic unit.

For the student, pattern recognition can support the development of shortcuts. Examples include seeing similarities in assignments that need to be completed and completing similar assignments together.


Abstraction

Abstraction is the ability to remove irrelevant information from a problem. This is perhaps the most challenging form of thinking to develop because people often fall into the trap that everything is important.

For a teacher, abstraction involves teaching only the critical information that is in the content and not stressing the small stuff. This is not easy, especially when the teacher has a passion for their subject. This passion often keeps them from sharing only the most relevant information about their field with their students.

For students, abstraction involves being able to share the most critical information. Students are guilty of the same problems as teachers in that they share everything when writing or presenting. Determining what is important requires the development of an opinion to judge the relevance of something. This is a skill that is hard to find among graduates.

Algorithmic Thinking

Algorithmic thinking is being able to develop a step-by-step plan to do something. For teachers, this happens every day through planning activities and leading a class. Planning may be the most common form of thinking for the teacher.

For students, algorithmic thinking is somewhat more challenging. It is common for younger people to rely heavily on intuition to accomplish tasks. This means that they did something but they do not know how they did it.

Another common mistake for young people is doing things through brute force. Rather than planning, they will just keep pounding away until something works. As such, it is common for students to do things the “hard way” as the saying goes.


Computational thinking is really how humans think in many situations in which emotions are not the primary mover. As such, what is really happening is not that computers are thinking as much as they are trying to model how humans think. In education, there are several situations in which computational thinking can be employed for success.

Mentoring New Teachers

A career in teaching is an attractive option for many young adults. One of the major challenges in a career in teaching is the student teaching experience that is normally placed at the end of the degree program. This post will provide some suggestions for teacher mentors.

Go Over Local Expectations

Every school has its own set of policies and expectations that all employees need to adhere to. Often, the student teacher is not aware of these and it is the mentoring teacher’s responsibility to provide some idea of what is expected. This includes such things as showing them around the campus, communicating expectations for how to dress, discipline procedures, and even how to deal with grades.

Knowing these little things can allow the new teacher to focus on teaching rather than the administrative aspects of the classroom.

Provide Feedback

Feedback is critical so that the new teacher knows what they are doing well and what they are doing wrong. It is, of course, important to mention what the student teacher does well. However, growth happens by providing support to overcome weaknesses.

The temptation for many supervising teachers is simply to mention what the problems are and let the student figure out what to do. This approach may work for an experienced or highly independent teacher. However, most new teachers need specific support on what to do in order to improve their teaching and overcome a weakness.

Therefore, criticism without some sort of suggestion for how to overcome the problem is not beneficial. In addition, it is important to only address major problems that can cripple the educational experience of the students rather than every single weakness in the student’s teaching. We all have issues and problems with our teaching, and for beginners only the big problems should be corrected.

The student also should provide feedback on how they view their own teaching. Most teacher education programs require this in the form of a journal. However, the benefit of the journal is only in discussing it with others such as the mentor teacher.

Lead By Example

In reality, in order for a student to be a successful teacher, they need to see what successful teaching is so they can imitate it until perfection. What this means for you as a supervising teacher is that you need to set the example for the student to imitate. Everyone has their own style, but a good example goes a long way in molding the teaching approach of a student.

This also means that a mentor teacher needs to do a lot of verbalizing in terms of what they do. Often, as an experienced teacher, things become automatic in the classroom. You know what to do without much thought or discussion. The problem is that if there is a lack of explanation in terms of what is happening, the student teacher is not able to determine why you are doing certain things. Therefore, a mentor teacher must explain explicitly what they are doing and why while they are providing the example of teaching.


Students who dream of teaching need support in order to have success. This involves bringing in people with more experience to support these young teachers as they develop their skillset. This means that even experienced teachers need some support in order to determine how to help new teachers.

3 Steps to Successful Research

When students have to conduct a research project they often struggle with determining what to do. There are many decisions that have to be made that can impede a student’s chances of achieving success.  However, there are ways to overcome this problem.

This post will essentially reduce the decision-making process for conducting research down to three main questions that need to be addressed. These questions are as follows.

  • What Do You Want to Know?
  • How Do You Get the Answer?
  • What Does Your Answer Mean?

Answering these three questions makes it much easier to develop a sense of direction and scope in order to complete a project.

What Do You Want to Know?

Often, students want to complete a project but it is unclear to them what they are trying to figure out. In other words, the students do not know what it is that they want to know. Therefore, one of the first steps in research is to determine exactly what it is you want to know.

Understanding what you want to know will allow you to develop a problem as well as research questions to facilitate your ability to understand exactly what it is that you are looking for. Research always begins with a problem and questions about the problem and this is simply another way of stating what it is that you want to know.

How Do You Get the Answer?

Once it is clear what it is that you want to know it is critical that you develop a process for determining how you will obtain the answers. It is often difficult for students to develop a systematic way in which to answer questions. However, in a research paradigm, a scientific way of addressing questions is critical.

When you are determining how to get answers to what you want to know, this is essentially the development of your methodology section. This section includes such matters as the research design, sample, ethics, data analysis, etc. The purpose here is again to explain the way to get the answer(s).

What Does Your Answer Mean?

After you actually get the answer, you have to explain what it means. Many students fall into the trap of doing something without understanding why or determining the relevance of the outcome. However, a research project requires some sort of interpretation or explanation of the results. Just getting the answer is not enough; it is the meaning that holds the power.

Often, the answers to the research questions are found in the results section of a paper and the meaning is found in the discussion and conclusion section. In the discussion section, you explain the major findings with interpretation, share recommendations, and provide a conclusion. This requires thought into the usefulness of what you wanted to know. In other words, you are explaining why someone else should care about your work. This is much harder to do than many realize.


Research is challenging, but if you keep these three keys in mind it will help you to see the big picture of research and to focus on the goals of your study and not so much on the tiny details that make up the process.

Undergrad and Grad Students

In this post, we will look at a comparison of grad and undergrad students.

Student Quality

Generally, graduate students are of a higher quality academically than undergrad students. Of course, this varies widely from institution to institution. New graduate programs may have a lower quality of student than established undergrad programs. This is because the new program is trying to fill seats initially and quality is often compromised.


At the graduate level, there is an expectation of a much more focused and rigorous curriculum. This makes sense as the primary purpose of graduate school is usually specialization and not generalization. This requires that the teachers at this level have a deep expert-level mastery of the content.

In comparison to graduate school, undergrad is a generalized experience with some specialization. However, this depends on the country in which the studies take place. Some countries require a rather intense specialization from the beginning with a minimum of general education, while others take a more American-style approach with a wide exposure to various fields.


Graduate students are usually older. This means that they require fewer institution-sponsored social activities and may not socialize at all. In addition, some graduate students are married, which adds a whole other level of complexity to their studies. Although they are probably less inclined to be “wild” due to their family, they are also going to struggle due to the time commitment their loved ones require.

Assuming that an undergraduate student is a traditional one, they will tend to be straight from high school and require some social support, but will also have the free time needed to study. The challenge with these students is the maturity level and self-regulation skills that are needed for academic success but are often missing.

For the teacher, graduate students generally offer higher motivation and commitment when compared to undergrads. This is reasonable as people often feel compelled to complete a bachelor’s degree but normally do not face the same level of pressure to go to graduate school. This means that undergrad is often compulsory due to external circumstances while grad school is by choice.


Despite the differences, both types of students hold in common an experience that is filled with exposure to various ideas and content for several years. Grad students and undergrad students are individuals who are developing skills for the goal of eventually finding a purpose in the world.

Using Blended Learning in Your Classroom

Blended learning is becoming a reality in education. Many schools now require some sort of online presence, not only for the school but also for the individual classes that teachers teach. This has led to much more pressure for teachers to figure out some way to get content online to support students. This post will take a look at the pros and cons of blended learning and provide tips on how to approach the use of blended learning.

Pros and Cons

Blended learning gives you flexibility. You are not tied to either traditional or elearning completely. This allows you to find the right balance for your teaching style and the students’ learning. Some teachers want more online presence in the form of activities and submission of assignments. Others just want a centralized place for communicating with their students and tracking academic progress. Whatever works for you can probably be accommodated when employing blended learning.

Communication and documentation is another benefit of blended learning. Announcements and messaging can be handled by the online platform and these forms of transactions are usually logged by the system and saved. This can be useful for referencing in the future if confusion arises.

The drawback to the flexibility is actually the flexibility. When employing blended learning a teacher has to be proficient in both e-learning and traditional teaching. In other words, you have to become a jack of all trades. Without strength in both methodologies, it will be difficult to determine what you want to do online and to determine how the online experiences augment or replace in-class learning opportunities.

Another problem is the confusion over what is done in class and what is done online. When learning takes place in two mediums it increases opportunities for misunderstanding and miscommunication. I have frequently had students confused over what was to be submitted online vs in class no matter how clear I was in the course outline and calendar.

Success in a Blended Learning Context

Having success with blended learning involves doing some of the following.

  • Focus on using the online platform less for learning and more for communication when you initially begin using blended learning. There is a lot for you to learn as the teacher and trying to move everything online will lead to confusion for you and the students.
  • Consider having students submit the final version of assignments online. Final versions of assignments usually require the least amount of feedback because they have already been vetted by you in person. This will allow you to focus on the grade rather than on providing more support.
  • Online activities should support learning and probably not replace it. Blended learning is often more effective if it helps students to understand content in class rather than replace it. This means the blended learning platform is a study tool to scaffold students’ learning outside of class. If an assignment can be completely done online without having to go to class then perhaps this is no longer blended learning since the in-class part is not needed for support.
  • Planning is critical to using any web-based resource. Websites are designed in advance before they are set up. Even posting a blog requires you to develop a draft or two. Therefore, online activities and expectations must be planned well in advance and not just thrown online at the whim of the teacher throughout the semester. Many teachers fall into the trap of just making stuff up as they go. This is a poor methodology in a traditional classroom and a disaster in a blended learning context.
  • When in doubt go traditional. If you are unsure how to achieve a specific learning goal online it is better to stick to a traditional approach until you can figure it out. In-class teaching is old but it still has a place in the 21st century especially when it is unclear how to do it online.


Blended learning can be a powerful tool for helping students or a major headache that annoys everyone. The secret to success lies with the teacher who understands what they want from the online aspect of the students’ learning as well as what they want in the classroom. When this is clear, it is critical that the teacher determine how to meet these goals through the use of various learning experiences.

Elearning Academic Success

Studying online has become almost an expectation now. Even if you never earn a degree or take a class for credit online there are still many opportunities to train and develop skills over the internet. The role of the teacher is to try and find ways to engage and support their students as they begin their learning experience physically alone with support perhaps thousands of miles away.

In this post, we will look at ways to encourage the academic success of students while studying online. Two ways to support academic success in elearning involve providing feedback and encouraging engagement.

Provide Feedback

Feedback is critical in every aspect of teaching. However, in elearning, it is even more important. This is because the students have no face-to-face communication with you, so they have no idea how they are doing beyond a letter grade. In addition, there is no body language to examine or other paralinguistic features that the student can infer meaning from.

Giving feedback requires timeliness. In other words, mark assignments quickly and indicate progress. In addition, if students do not meet expectations, it is critical that you point them towards resources that will help them to understand. For example, students seem to neglect reading rubrics. When a student gets feedback from a rubric they can see where they were not successful.

In terms of a more formative feedback approach, there may be times where it is beneficial to live stream a lecture. This allows the students to chime in whenever they do not understand an idea or point. Furthermore, the teacher can ask a question or two of the students and get feedback from them.

Engage Them

Engaging is almost synonymous with active. In other words, students should be doing something in order to learn. Unfortunately, listening is a passive activity which implies that lecturing is not the best way to inspire learning.

In the context of elearning, one of the ways to inspire active learning is to have the students go out and do something in the real world and report what happens online in the form of a reflection. For example, students studying English teaching will go out and teach English in the real world. They will then come and share their experience. The teacher is then able to provide insights and feedback to improve the students’ teaching. This provides a connection to the real world as well as a sense of relevance.

In a more abstract subject, such as history, music theory, or engineering, students can become active through sharing their insights with laymen or explaining how they are already applying this information at their job or in the home. The goal of using the content provides the purpose for learning it.


Feedback and engagement are critical to success in a situation in which the student is primarily learning alone which is found in the context of elearning.

Scatter Plots in Python

Scatterplots are one of many crucial forms of visualization in statistics. With scatterplots, you can examine the relationship between two variables. This can lead to insights in terms of decision making or additional analysis.

We will be using the “Prestige” dataset from the pydataset module to look at scatterplot use. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
df = data('Prestige')  # load the Prestige dataset as a DataFrame

We will begin by making a correlation matrix. This will help us to determine which pairs of variables have strong relationships with each other. This will be done with the .corr() function. Below is the code.
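
A minimal sketch of the correlation step is shown here. To keep it runnable without pydataset, it uses a small made-up stand-in frame that only borrows the Prestige column names (the values are invented for illustration and are an assumption, not the real data). Selecting the numeric columns first is a design choice so that the text column does not get in the way of .corr().

```python
import pandas as pd

# Made-up stand-in rows that borrow the Prestige column names (an assumption).
# With pydataset installed, load the real data with: df = data('Prestige')
df = pd.DataFrame({
    "education": [13.1, 12.3, 15.9, 8.5, 7.6, 11.4, 10.0, 6.7],
    "income": [12000, 8400, 11000, 4500, 3900, 5200, 4800, 3500],
    "women": [11.0, 16.0, 9.0, 20.0, 5.0, 68.0, 75.0, 14.0],
    "prestige": [69.0, 62.0, 74.0, 35.0, 30.0, 47.0, 41.0, 28.0],
    "type": ["prof", "prof", "prof", "bc", "bc", "wc", "wc", "bc"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
corr = df.select_dtypes("number").corr()
print(corr.round(2))
```

Each cell of the printed matrix is the correlation between the row variable and the column variable, so the diagonal is always 1.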


You can see that there are several strong relationships. For our purposes, we will look at the relationship between education and income.

The seaborn library is rather easy to use for making visuals. To make a plot you can use the .lmplot() function. Below is a basic scatterplot of our data.
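
A basic scatterplot along these lines might look like the sketch below. The stand-in data frame is made up so the example runs on its own (an assumption); with pydataset installed you would pass the real Prestige frame instead.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in values using two Prestige column names (an assumption)
df = pd.DataFrame({
    "education": [13.1, 12.3, 15.9, 8.5, 7.6, 11.4, 10.0, 6.7],
    "income": [12000, 8400, 11000, 4500, 3900, 5200, 4800, 3500],
})

# fit_reg=False draws only the points, with no regression line
facet = sns.lmplot(data=df, x="education", y="income", fit_reg=False)
```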


The code should be self-explanatory. The only thing that might be unknown is the fit_reg argument. This is set to False so that the function does not make a regression line. Below is the same visual but this time with the regression line.

facet = sns.lmplot(data=df, x='education', y='income', fit_reg=True)


It is also possible to add a third variable to our plot. One of the more common ways is through including a categorical variable. Therefore, we will look at job type and see what the relationship is. To do this we use the same .lmplot() function but include several additional arguments. These include the hue and the indication of a legend. Below is the code and output.
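
One way to sketch this is shown below. The hue and legend arguments are real .lmplot() parameters, but the data frame is a made-up stand-in (an assumption) so that the example is self-contained.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import pandas as pd
import seaborn as sns

# Made-up stand-in values using three Prestige column names (an assumption)
df = pd.DataFrame({
    "education": [13.1, 12.3, 15.9, 8.5, 7.6, 11.4, 10.0, 6.7],
    "income": [12000, 8400, 11000, 4500, 3900, 5200, 4800, 3500],
    "type": ["prof", "prof", "prof", "bc", "bc", "wc", "wc", "bc"],
})

# hue colors the points by job type; legend=True adds a key for the colors
facet = sns.lmplot(data=df, x="education", y="income",
                   hue="type", fit_reg=False, legend=True)
```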


You can clearly see that type separates education and income. A look at the boxplots for these variables confirms this.
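
The boxplots could be produced with something like the sketch below, which places education and income side by side and splits each by job type. Again, the data frame is a made-up stand-in (an assumption), and the side-by-side layout is just one possible choice.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in values using three Prestige column names (an assumption)
df = pd.DataFrame({
    "education": [13.1, 12.3, 15.9, 8.5, 7.6, 11.4, 10.0, 6.7],
    "income": [12000, 8400, 11000, 4500, 3900, 5200, 4800, 3500],
    "type": ["prof", "prof", "prof", "bc", "bc", "wc", "wc", "bc"],
})

# One panel per variable, each split by job type
fig, (ax_edu, ax_inc) = plt.subplots(1, 2, figsize=(10, 4))
sns.boxplot(data=df, x="type", y="education", ax=ax_edu)
sns.boxplot(data=df, x="type", y="income", ax=ax_inc)
```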


As you can see, job type is associated with both education and income in this example.


This post focused primarily on making scatterplots with the seaborn package. Scatterplots are a tool that all data analysts should be familiar with, as they can be used to communicate information to people who must make decisions.

Data Visualization in Python

In this post, we will look at how to set up various features of a graph in Python. The fine-tuned tweaks that can be made when creating a data visualization can enhance the communication of results to an audience. This will all be done using the matplotlib module available for Python. Our objectives are as follows.

  • Make a graph with two lines
  • Set the tick marks
  • Change the linewidth
  • Change the line color
  • Change the shape of the line
  • Add a label to each axis
  • Annotate the graph
  • Add a legend and title

We will use two variables from the “toothpaste” dataset from the pydataset module for this demonstration. Below is some initial code.

from pydataset import data
import matplotlib.pyplot as plt
DF = data('toothpaste')

Make Graph with Two Lines

To make a plot you use the .plot() function. Inside the parentheses you put the dataframe and variable you want. If you want more than one line or graph you use the .plot() function several times. Below is the code for making a graph with two line plots using variables from the toothpaste dataset.
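
A minimal sketch of the two-line plot follows. The column names meanA and meanB and all of the values are assumptions made up for illustration; with pydataset installed you would use the DF loaded above and whichever two columns you want to compare.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the toothpaste data; column names are assumptions
DF = pd.DataFrame({
    "meanA": [8, 14, 13, 14, 9, 6, 19, 11, 7],
    "meanB": [7, 12, 11, 13, 8, 5, 17, 10, 6],
}, index=range(1, 10))

plt.figure()            # start from a fresh figure
plt.plot(DF["meanA"])   # first line
plt.plot(DF["meanB"])   # second line, drawn on the same axes
```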



To get the graph above you must run both lines of code simultaneously. Otherwise, you will get two separate graphs.

Set Tick Marks

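Setting the tick marks requires the use of the .axes() function. It is common to save the result in a variable called axes as a handle, which makes coding easier. Note that in recent versions of matplotlib, calling .axes() with no arguments can add a new, empty set of axes rather than returning the existing one, so .gca() (get current axes) is the safer way to get the handle after plotting. Once this is done you can use the .set_xticks() function for the x-axis and .set_yticks() for the y-axis. In our example below, we are setting the tick marks for the odd numbers only. Below is the code.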
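
A sketch of the tick mark step is below. It uses plt.gca() to get the handle for the current axes, and the data frame is a made-up stand-in with assumed column names, so the tick ranges are chosen to suit these invented values.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the toothpaste data; column names are assumptions
DF = pd.DataFrame({
    "meanA": [8, 14, 13, 14, 9, 6, 19, 11, 7],
    "meanB": [7, 12, 11, 13, 8, 5, 17, 10, 6],
}, index=range(1, 10))

plt.figure()
plt.plot(DF["meanA"])
plt.plot(DF["meanB"])

axes = plt.gca()                       # handle to the axes we just drew on
axes.set_xticks([1, 3, 5, 7, 9])       # odd numbers only on the x-axis
axes.set_yticks([5, 7, 9, 11, 13, 15, 17, 19])  # odd numbers on the y-axis
```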



Changing the Line Type

It is also possible to change the line type and width. There are several options for the line type. The important thing here is to put this information after the data you want to plot inside the code. Line width is changed with an argument that has the same name. Below is the code and visual.
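
As a sketch, a dashed line and a dotted line could be drawn as shown below. The format strings "--" and ":" are standard matplotlib line styles; the data frame and its column names are made-up assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the toothpaste data; column names are assumptions
DF = pd.DataFrame({
    "meanA": [8, 14, 13, 14, 9, 6, 19, 11, 7],
    "meanB": [7, 12, 11, 13, 8, 5, 17, 10, 6],
}, index=range(1, 10))

plt.figure()
# The style string comes right after the data; linewidth is a keyword argument
line_a, = plt.plot(DF["meanA"], "--", linewidth=3)  # dashed
line_b, = plt.plot(DF["meanB"], ":", linewidth=3)   # dotted
```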



Changing the Line Color

It is also possible to change the line color. There are several options available. The important thing is that the argument for the line color goes inside the same parentheses as the line type. Below is the code. r means red and k means black.
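
A sketch of the color change is below, with the color letter placed in the same string as the line type. The data frame and its column names are made-up assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the toothpaste data; column names are assumptions
DF = pd.DataFrame({
    "meanA": [8, 14, 13, 14, 9, 6, 19, 11, 7],
    "meanB": [7, 12, 11, 13, 8, 5, 17, 10, 6],
}, index=range(1, 10))

plt.figure()
line_a, = plt.plot(DF["meanA"], "r--", linewidth=3)  # red, dashed
line_b, = plt.plot(DF["meanB"], "k:", linewidth=3)   # black, dotted
```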



Change the Point Type

Changing the point type requires more code inside the same quotation marks where the line color and line type are. Again, there are several choices here. The code is below.
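
A sketch with circle and triangle markers is below. The marker letters "o" and "^" are standard matplotlib markers, chosen here only as examples; the data frame and its column names are made-up assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the toothpaste data; column names are assumptions
DF = pd.DataFrame({
    "meanA": [8, 14, 13, 14, 9, 6, 19, 11, 7],
    "meanB": [7, 12, 11, 13, 8, 5, 17, 10, 6],
}, index=range(1, 10))

plt.figure()
# One string holds the color, the marker, and the line type
line_a, = plt.plot(DF["meanA"], "ro--", linewidth=3)  # red circles, dashed
line_b, = plt.plot(DF["meanB"], "k^:", linewidth=3)   # black triangles, dotted
```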



Add X and Y Labels

Adding labels is simple. You just use the .xlabel() function or .ylabel() function. Inside the parentheses, you put the text you want in quotation marks. Below is the code.

plt.xlabel('X Example')
plt.ylabel('Y Example')


Adding Annotation, Legend, and Title

Annotation allows you to write text directly inside the plot wherever you want. This involves the use of the .annotate() function. Inside this function, you must indicate the location of the text and the actual text you want added to the plot. For our example, we will add the word ‘python’ to the plot for fun.

The .legend() function allows you to give a description of the line types that you have included. Lastly, the .title() function allows you to add a title. Below is the code.

plt.xlabel('X Example')
plt.ylabel('Y Example')
plt.title("Plot Example")
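
Putting the pieces together, a fuller sketch that also includes the annotation and the legend might look like this. The label text, the annotation position, and the stand-in data with its column names are all assumptions chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the toothpaste data; column names are assumptions
DF = pd.DataFrame({
    "meanA": [8, 14, 13, 14, 9, 6, 19, 11, 7],
    "meanB": [7, 12, 11, 13, 8, 5, 17, 10, 6],
}, index=range(1, 10))

plt.figure()
# label= supplies the text that .legend() will show for each line
plt.plot(DF["meanA"], "ro--", linewidth=3, label="meanA")
plt.plot(DF["meanB"], "k^:", linewidth=3, label="meanB")
plt.xlabel("X Example")
plt.ylabel("Y Example")
plt.annotate("python", xy=(3, 17))  # xy is the text position in data units
plt.legend()
plt.title("Plot Example")
```

The legend only has something to show when each line is given a label, which is why the label argument is added to the .plot() calls here.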



Now you have a practical understanding of how you can communicate information visually with matplotlib in Python. This is barely scratching the surface in terms of the potential that is available.

Critical Thinking and Problem Solving

There have been concerns for years that critical thinking and problem-solving skills are in decline not only among students but also the general public. On the surface, this appears to be true. However, throughout human history, the average person was not much of a deep thinker but rather a doer. How much time can you spend on thinking for the sake of thinking when you are dealing with famine, war, and disease? This internal focus vs external focus is one of the differences between critical thinking and problem-solving.

Critical Thinking

There is no agreed-upon definition of critical thinking. This makes sense as any agreement would indicate a lack of critical thinking. In general, critical thinking is about questioning and testing the claims and statements made through external evidence as well as internal thought. Critical thinking is the ability to know what you don’t know and seek answers through finding information. To test and assert claims means taking time to develop them, which is often a lonely process.

Thinking for the sake of thinking is a somewhat introverted process. There are few people who want to sit and ponder in the fast-paced 21st century.  This is one reason why it appears that critical thinking is in decline. It’s not that people are incapable of thinking critical they would just rather not do it and seek a quick answer and or entertainment. Critical thinking is just too slow for many people.

Whenever I give my students any form of opened assignment that requires them to develop an argument I am usually shocked by the superficial nature of the work. Having thought about this I have come to the conclusion that the students lacked the mental energy to demonstrate the critical thought needed to write a paper or even to share their opinion about something a little deeper then facebook videos.

Problem Solving

Problem-solving is about getting stuff done. When an obstacle is in the way, a problem solver finds a way around it. Problem-solving is often focused on tangible things and objects in a practical way. Therefore, problem-solving is more of an extroverted experience. It is common and easy to solve problems with friends gregariously. However, thinking critically is somewhat more difficult to do in groups and cannot move as fast as we want when we are discussing.

Due to the potential of working in groups and the fast pace that it can take, problem-solving skills are in better shape than critical thinking skills. This is because when people work in groups several superficial ideas can be combined to overcome a problem. This groupthink, if you will, allows for success even though the individual members are probably not the brightest.

Problem-solving has been the focus of mankind for most of its existence. Please keep in mind that for most of human history people could not even read and write. Instead, they were farmers and soldiers concerned with food and protecting their civilization from invasion. These problems led to amazing discoveries for the sake of sustaining life and not for the sake of thinking or questioning for its own sake.


There is some overlap in critical thinking and problem-solving. Solutions to problems have to be critically evaluated. However, often a potential solution is judged good or bad by whether it works or not, which requires observation and not in-depth thinking. The question in problem-solving is always “does this solve the problem” or “does this solve the problem better”. These are important criteria but critical thinking involves much broader and deeper issues than just “does this work.” Critical thinking is on a quest for truth and satisfying curiosity. These are ideas that problem-solvers struggle to appreciate.

The world is mostly focused on people who can solve problems and not necessarily on deep thinkers who can ponder complex ideas alone. As such, perhaps critical thinking was a fad that has ceased to be relevant as problem solvers do not see how critical thinking solves problems. Both forms of thought are needed and they do overlap yet most of the world simply wants to know what the answer is to their problem rather than to think deeply about why they have a problem in the first place.

Random Forest Regression in Python

Random forest is simply the making of dozens if not thousands of decision trees. The decisions each tree makes about an example are then tallied as votes, with the classification that receives the most votes winning. For regression, the results of the trees are averaged in order to give the most accurate results.

In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.

import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below

df = data('cancer')
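The lines that drop the NAs and separate the variables are not shown above. A rough sketch of those two steps on a small made-up DataFrame follows. The column names other than age are assumptions standing in for the real columns of the cancer dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the cancer data; values and column names are illustrative
df = pd.DataFrame({'age': [74, 68, np.nan, 57],
                   'sex': [1, 1, 2, 1],
                   'meal.cal': [1175, 1225, np.nan, 1150],
                   'wt.loss': [np.nan, 15, 10, 11]})

df = df.dropna()            # remove every row that has a missing value
X = df.drop('age', axis=1)  # independent variables
y = df['age']               # dependent variable

print(X.shape, y.shape)
```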

Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.
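Since the model code is not shown, here is a minimal sketch of setting up, fitting, and testing a random forest regressor. It uses synthetic data from make_regression as a stand-in for the cancer dataset, and n_estimators=100 is an assumed value rather than the one used for the results in this post.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the independent variables (X) and age (y)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# n_estimators is the number of trees; random_state supports reproducibility
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(x_train, y_train)          # run the model

y_pred = rf.predict(x_test)       # test the model
mse = mean_squared_error(y_test, y_pred)
print(mse)
```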

The MSE of 71.75 is only useful for model comparison and has little meaning by itself. Another way to assess the model is by determining variable importance. This helps you to determine in a descriptive way the strongest variables for the regression model. The code is below followed by the plot of the variables.
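One way variable importance could be pulled out of a fitted random forest is sketched here. The data is synthetic and the feature names are hypothetical placeholders, so the ranking it prints says nothing about the cancer dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Hypothetical names for five synthetic features
feature_names = ['var1', 'var2', 'var3', 'var4', 'var5']
X, y = make_regression(n_samples=200, n_features=5, n_informative=3, random_state=0)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X, y)

# feature_importances_ holds one score per column; the scores sum to 1
importances = pd.Series(rf.feature_importances_, index=feature_names)
importances = importances.sort_values(ascending=False)
print(importances)
```

Calling importances.plot(kind='bar') would turn the sorted values into a bar chart.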



As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.


This post provided an example of the use of regression with random forest. Through the use of ensemble voting, you can improve the accuracy of your models. This is a distinct strength that algorithms built on a single model do not have.

Bagging Classification with Python

Bootstrap aggregation, aka bagging, is a technique used in machine learning that relies on resampling from the sample and running multiple models on the different samples. The mean or some other value is calculated from the results of each model. For example, if you are using decision trees, bagging would have you run the model several times with several different subsamples to help deal with variance.

Bagging is an excellent tool for algorithms that are considered weaker or more susceptible to variances such as decision trees or KNN. In this post, we will use bagging to develop a model that determines whether or not people voted using the turnout dataset. These results will then be compared to a model that was developed in a traditional way.

We will use the turnout dataset available in the pydataset module. Below is some initial code.

from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report

We will load our dataset. Then we will separate the independent and dependent variables from each other and create our train and test sets. The code is below.


We can now prepare to run our model. We need to first set up the bagging function. There are several arguments that need to be set. The max_samples argument determines the largest amount of the dataset to use in resampling. The max_features argument is the max number of features to use in a sample. Lastly, the n_estimators is for determining the number of subsamples to draw. The code is as follows


Basically, what we told Python was to use up to 70% of the samples, 70% of the features, and make 100 different KNN models that use seven neighbors to classify. Now we run the model with the fit function, make a prediction with the predict function, and check the accuracy with the classification_report function.


This looks okay. Below are the results when you use a traditional model without bagging.
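A minimal sketch of the settings described above is given here. It runs on synthetic data from make_classification instead of the turnout dataset, so its scores will not match the results in this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the turnout data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 KNN models, each using up to 70% of the samples and 70% of the features
bag = BaggingClassifier(KNeighborsClassifier(n_neighbors=7),
                        n_estimators=100, max_samples=0.7,
                        max_features=0.7, random_state=0)
bag.fit(x_train, y_train)
y_pred = bag.predict(x_test)
print(classification_report(y_test, y_pred))
```

The base model is passed as the first positional argument because the keyword for it has changed names between scikit-learn versions.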



The improvement is not much. However, this depends on the purpose and scale of your project. A small improvement can mean millions in the right context, such as for large companies like Google that deal with billions of people per day.


This post provides an example of the use of bagging in the context of classification. Bagging provides a way to improve your model through the use of resampling.

K Nearest Neighbor Classification with Python

K Nearest Neighbor uses the idea of proximity to predict class. What this means is that with KNN Python will look at K neighbors to determine what the unknown example's class should be. It is your job to determine the K, or number of neighbors, that should be used to determine the unlabeled example's class.

KNN is great for a small dataset. However, it normally does not scale well when the dataset gets larger and larger. As such, unless you have an exceptionally powerful computer KNN is probably not going to do well in a Big Data context.

In this post, we will go through an example of the use of KNN with the turnout dataset from the pydataset module. We want to predict whether someone voted or not based on the independent variables. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

We now need to load the data and separate the independent variables from the dependent variable by making two datasets.


Next, we will make our train and test sets with a 70/30 split. The random_state is set to 0. This argument allows us to reproduce our model if we want. After this, we will set K to 7 and run the model. This means that Python will look at the 7 closest examples to predict the value of an unknown example. Below is the code.


We can now predict with our model and see the results with the classification reports function.
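The steps just described can be sketched as follows. The data is synthetic and stands in for the turnout dataset, so the report it prints is only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the turnout data
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# 70/30 split; random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsClassifier(n_neighbors=7)   # K is 7
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```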



The results are shown above. Judging the quality of the model relies largely on domain knowledge. What we can say for now is that the model is better at classifying people who vote than people who do not vote.


This post shows you how to work with Python when using KNN. This algorithm is useful for using neighboring examples to predict the class of an unknown example.

Naive Bayes with Python

Naive Bayes is a probabilistic classifier that is often employed when you have more than two classes in which you want to place your data. This algorithm is particularly used when you are dealing with text classification with large datasets and many features.

If you are familiar with statistics you know that Bayes developed a method of probability that is highly influential today. In short, his system takes conditional probability into account. In the case of naive Bayes, the classifier assumes that the presence of a certain feature in a class is not related to the presence of any other feature. This assumption is why Naive Bayes is naive.

For our purposes, we will use Naive Bayes to predict the type of insurance a person has in the DoctorAUS dataset in the pydataset module. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Next, we will load our dataset DoctorAUS. Then we will separate the independent variables that we will use from the dependent variable of insurance in two different datasets. If you want to know more about the dataset and the variables you can type data("DoctorAUS", show_doc=True)


Now, we will create our train and test datasets. We will do a 70/30 split. We will also use Gaussian Naive Bayes as our algorithm. This algorithm assumes the data is normally distributed. There are other algorithms available for Naive Bayes as well.  We will also create our model with the .fit function.


Finally, we will predict with our model and run the classification report to determine the success of the model.
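Here is a short sketch of the split, fit, predict, and report steps with Gaussian Naive Bayes. It uses synthetic three-class data in place of the DoctorAUS dataset, so the numbers it prints are not the results discussed in this post.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in with three classes, like several types of insurance
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB()            # assumes each feature is normally distributed within a class
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))
```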



You can see that our overall numbers are not that great. This means that the current algorithm is probably not the best choice for classification. Of course, there could be other problems as well that need to be explored.


This post was simply a demonstration of how to conduct an analysis with Naive Bayes using Python. The process is not all that complicated and is similar to other algorithms that are used.

K Nearest Neighbor Regression with Python

K Nearest Neighbor Regression (KNN) works in much the same way as KNN for classification. The difference lies in the characteristics of the dependent variable. With classification KNN the dependent variable is categorical. With regression KNN the dependent variable is continuous. Both involve the use of neighboring examples to predict the class or value of other examples.

This post will provide an example of KNN regression using the turnout dataset from the pydataset module. Our purpose will be to predict the age of a voter through the use of other variables in the dataset. Below is some initial code.

from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

We now need to set up our data. We need to load our actual dataset. Then we need to separate the independent and dependent variables. Once this is done we need to create our train and test sets using the train_test_split function. Below is the code to accomplish each of these steps.


We are now ready to train our model. We need to call the function we will use and determine the size of K, which will be 11 in our case. Then we need to train our model and then predict with it. Lastly, we will print out the mean squared error. This value is useful for comparing models but does not have much value by itself. The MSE is calculated by comparing the actual test set with the predicted test data. The code is below
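The training and testing steps can be sketched as shown here. Synthetic data from make_regression stands in for the turnout dataset, so the MSE it prints is not comparable to any result from the real data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the independent variables and age
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = KNeighborsRegressor(n_neighbors=11)   # K is 11
clf.fit(X_train, y_train)

# Each prediction is the average of the 11 nearest training values
y_pred = clf.predict(X_test)

# Compare the actual test values with the predicted ones
mse = mean_squared_error(y_test, y_pred)
print(mse)
```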


If we were to continue with model development we might look for ways to improve our MSE through different methods such as regular linear regression. However, for our purposes this is adequate.


This post provides an example of regression with KNN in Python. This tool is a practical and simple way to make numeric predictions that can be accurate at times.

Support Vector Machines Regression with Python

This post will provide an example of how to do regression with support vector machines (SVM). SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.

The steps that we will use are listed below

  1. Data preparation
  2. Model Development

We will use two different estimators in our analysis, LinearSVR and SVR. LinearSVR fits a linear model, while SVR supports kernels, such as rbf, that allow for non-linear fits.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is to predict family income (faminc), which is a continuous variable. Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from sklearn.metrics import mean_squared_error as mse

We now need to load our dataset and remove any missing values.


As in the previous post, we need to change the text variables into dummy variables and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and also scales the data.

df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
df = (df - df.min()) / (df.max() - df.min())


We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable


We can now move to model development

Model Development

We now need to create our train and test sets for our X and y datasets. We will do a 70/30 split of the data. Below is the code.


Next, we will create our two models with the code below.


We will now run our first model and assess the results. Our metric is the mean squared error. Generally, the lower the number the better. We will use the .fit() function to train the model and the .predict() function to test the model.


The mse was 0.27. This number means nothing on its own and is only useful for comparison. Therefore, the second model will be judged as better only if its mse is lower than 0.27. Below are the results of the second model.
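A rough sketch of creating, fitting, and scoring the two models is given here. It uses synthetic data, and the hyperparameters are left at assumed default-style values, so it will not reproduce the mse figures reported in this post.

```python
from sklearn import svm
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error as mse
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled OFP data
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
y = (y - y.min()) / (y.max() - y.min())   # min-max scale y to the 0-1 range

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameter values here are assumptions, not the ones from the post
h1 = svm.LinearSVR(random_state=0, max_iter=10000)
h2 = svm.SVR(kernel='rbf')

results = {}
for name, model in [('LinearSVR', h1), ('SVR', h2)]:
    model.fit(X_train, y_train)                          # train
    results[name] = mse(y_test, model.predict(X_test))   # test
print(results)
```

The model with the lower value in results would be preferred under this metric.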


We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.


This post provided an example of how to use SVM for regression.

Support Vector Machines Classification with Python

Support vector machines (SVM) is an algorithm used to fit non-linear models. The details are complex but to put it simply  SVM tries to create the largest boundaries possible between the various groups it identifies in the sample. The mathematics behind this is complex especially if you are unaware of what a vector is as defined in algebra.

This post will provide an example of SVM using Python broken into the following steps.

  1. Data preparation
  2. Model Development

We will use two different kernels in our analysis, the linear kernel and the rbf kernel. The difference between the kernels has to do with how the boundaries between the different groups are made.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. We want to predict if someone is single or not. Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection

We now need to load our dataset and remove any missing values.



Looking at the dataset we need to do something with the variables that have text. We will create dummy variables for all except region and hlth. The code is below.

df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)

For each variable, we did the following

  1. Created a dummy in the dummy dataset
  2. Combined the dummy variable with our df dataset
  3. Renamed the dummy variable based on yes or no
  4. Drop the other dummy variable from the dataset. Python creates two dummies instead of one.
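The four steps listed above can be sketched for one variable as follows. The small DataFrame and its column names are made up for illustration and are not taken from the OFP dataset.

```python
import pandas as pd

# Made-up data with one text variable
df = pd.DataFrame({'sex': ['male', 'female', 'female', 'male'],
                   'age': [6.9, 7.4, 6.6, 7.6]})

dummy = pd.get_dummies(df['sex'], dtype=int)     # 1. create the dummies
df = pd.concat([df, dummy], axis=1)              # 2. combine them with df
df = df.rename(columns={"male": "Male"})         # 3. rename the dummy we keep
df = df.drop('female', axis=1)                   # 4. drop the other dummy
df = df.drop('sex', axis=1)                      # the original text column is no longer needed

print(df)
```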

If you look at the dataset now you will see a lot of variables that are not necessary. Below is the code to remove the information we do not need.



This is much cleaner. Now we need to scale the data. This is because SVM is sensitive to scale. The code for doing this is below.

df = (df - df.min()) / (df.max() - df.min())


We can now create our dataset with the independent variables and a separate dataset with our dependent variable. The code is as follows.


We can now move to model development

Model Development

We need to make our test and train sets first. We will use a 70/30 split.


Now, we need to create the models or the hypothesis we want to test. We will create two hypotheses. The first model is using a linear kernel and the second is one using the rbf kernel. For each of these kernels, there are hyperparameters that need to be set which you will see in the code below.
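A minimal sketch of the two hypotheses is shown here on synthetic data. The values of C and gamma are assumptions chosen for illustration and are not the settings behind the results in this post.

```python
import pandas as pd
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the scaled OFP data
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Hyperparameter values are assumed
h1 = svm.SVC(kernel='linear', C=1.0)
h2 = svm.SVC(kernel='rbf', C=1.0, gamma='scale')

predictions = {}
for name, model in [('linear', h1), ('rbf', h2)]:
    model.fit(X_train, y_train)
    predictions[name] = model.predict(X_test)
    print(name)
    print(pd.crosstab(y_test, predictions[name]))          # breakdown of actual vs predicted
    print(classification_report(y_test, predictions[name]))
```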


The details about the hyperparameters are beyond the scope of this post. Below are the results for the first model.


The overall accuracy is 73%. The crosstab() function provides a breakdown of the results and the classification_report() function provides other metrics related to classification. In this situation, 0 means not single (married) while 1 means single. Below are the results for model 2.


You can see the results are similar with the first model having a slight edge. The second model really struggles with predicting people who are actually single. You can see that the recall in particular is really poor.


This post showed how to use SVM in Python. How this algorithm works can be somewhat confusing. However, it can be powerful if used appropriately.

Linear Discriminant Analysis in Python

Linear discriminant analysis (LDA) is a classification algorithm commonly used in data science. In this post, we will learn how to use LDA with Python. The steps we will follow are listed below.

  1. Data preparation
  2. Model training and evaluation

Data Preparation

We will be using the bioChemists dataset which comes from the pydataset module. We want to predict whether someone is married or single based on academic output and prestige. Below is some initial code.

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Now we will load our data and take a quick look at it using the .head() function.


There are two variables that contain text, so we need to convert these into dummy variables for our analysis. The code is below with the output.


Here is what we did.

  1. We created the dummy variable by using the .get_dummies() function.
  2. We saved the output in an object called dummy
  3. We then combine the dummy and df dataset with the .concat() function
  4. We repeat this process for the second variable

The output shows that we have our original variables and the dummy variables. However, we do not need all of this information. Therefore, we will create a dataset that has the X variables we will use and a separate dataset that will have our y values. Below is the code.


The X dataset has our five independent variables and the y dataset has our dependent variable, which is married or not. We can now split our data into a train and test set. The code is below.


The data was split 70% for training and 30% for testing. We made a train and test set for the independent and dependent variables which meant we made 4 sets altogether. We can now proceed to model development and testing

Model Training and Testing

Below is the code to run our LDA model. We will use the .fit() function for this.


We will now use this model to predict using the .predict function


Now for the results, we will use the classification_report function to get all of the metrics associated with a confusion matrix.
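The fit, predict, and report steps can be sketched as follows. The data is synthetic and stands in for the bioChemists variables, so the accuracy it prints is not the 71% reported in this post.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in with five independent variables
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LDA()
clf.fit(X_train, y_train)          # train the model

y_pred = clf.predict(X_test)       # predict on the test set
print(classification_report(y_test, y_pred))
```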


The interpretation of this information is described in another place. For our purposes, we have an accuracy of 71% for our prediction.  Below is a visual of our model using the ROC curve.


Here is what we did

  1. We had to calculate the roc_curve for the model; this is explained in detail elsewhere.
  2. Next, we plotted our own curve and compared it to a baseline curve, which is the dotted line.

An area under the ROC curve of 0.67 is considered fair by many. Our classification model is not that great but there are worse models out there.
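The two steps above could look roughly like this. The sketch fits LDA on synthetic data, so the area under the curve it reports is illustrative only.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = LDA().fit(X_train, y_train)

# 1. The ROC curve is built from the predicted probability of class 1
probs = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
auc = roc_auc_score(y_test, probs)

# 2. Plot the model's curve against a dotted baseline
plt.plot(fpr, tpr, label='LDA (AUC = %0.2f)' % auc)
plt.plot([0, 1], [0, 1], linestyle='--', label='baseline')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
```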


This post went through an example of developing and evaluating a linear discriminant model. To do this you need to prepare the data, train the model, and evaluate.

Factor Analysis in Python

Factor analysis (FA) is a dimensionality reduction technique commonly used in statistics. FA is similar to principal component analysis. The differences are highly technical but include the fact that FA does not have an orthogonal decomposition and that FA assumes there are latent variables influencing the observed variables in the model. For FA the goal is to explain the covariance among the observed variables.

Our purpose here will be to use the bioChemists dataset from the pydataset module and perform an FA that creates two components. This dataset has data on people completing PhDs and their mentors. We will also create a visual of our two-factor solution. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt

We now need to prepare the dataset. The code is below

df = data('bioChemists')

In the code above, we did the following

  1. The first line creates our dataframe called "df" and is made up of the dataset bioChemists
  2. The second line reduces the df to 250 rows. This is done for the visual that we will make. To take the whole dataset and graph it would make a giant blob of color that would be hard to interpret.
  3. The last line pulls the variables we want to use for our analysis. The meaning of these variables can be found by typing data("bioChemists",show_doc=True)

In the code below we need to set the number of factors we want and then run the model.


The first line tells Python how many factors we want. The second line takes this information along with our revised dataset X to create the actual factors that we want. We can now make our visualization.
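The two lines described above might look like the sketch here. Random numbers stand in for the selected bioChemists variables, and the object names are hypothetical.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Random stand-in for 250 rows of four numeric variables
rng = np.random.RandomState(0)
X = rng.normal(size=(250, 4))

fact_2c = FactorAnalysis(n_components=2)   # how many factors we want
X_factor = fact_2c.fit_transform(X)        # factor scores, one row per observation

print(X_factor.shape)   # (250, 2)
```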

To make the visualization requires several steps. We want to identify how well the two components separate students who are married from students who are not married. First, we need to make a dictionary that can be used to convert the single or married status to a number. Below is the code.

thisdict = {
"Single": 1,
"Married": 2,}

Now we are ready to make our plot. The code is below. Pay close attention to the ‘c’ argument as it uses our dictionary.



You can perhaps tell why we created the dictionary now. By mapping the dictionary to the mar variable it automatically changed every single and married entry in the df dataset to a 1 or 2. The c argument needs a number in order to set a color and this is what the dictionary was able to supply it with.
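The mapping step on its own can be sketched with a few made-up values. Here the dictionary values are numbers, and the short Series stands in for the mar column of df.

```python
import pandas as pd

thisdict = {"Single": 1, "Married": 2}

# Made-up stand-in for df['mar']
mar = pd.Series(["Married", "Single", "Single", "Married"])

# .map() looks up each entry in the dictionary
colors = mar.map(thisdict)
print(colors.tolist())   # [2, 1, 1, 2]
```

A result like colors is the kind of numeric sequence that can be passed to the c argument of plt.scatter().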

You can see that two factors do not do a good job of separating the people by their marital status. Additional factors may be useful but after two factors it becomes impossible to visualize them.


This post provided an example of factor analysis in Python. Here the focus was primarily on visualization but there are so many other ways in which factor analysis can be deployed.

Analyzing Twitter Data in Python

In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barack Obama.

  • Conduct a sentiment analysis
  • Create a word cloud

This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.

  1. Obtain all of the necessary information from your twitter apps account
  2. Download the tweets & clean
  3. Perform the analysis

Before we begin, here is a list of modules we will need to load to complete our analysis

import wordcloud
import matplotlib.pyplot as plt
import twython
import re
import numpy

Obtain all Needed Information

From your twitter app account, you need the following information

  • App key
  • App key secret
  • Access token
  • Access token secret

All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.


In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.

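Next, we need to create a function that will pull the tweets from Twitter. Below is the code.

def get_tweets(twython_object,query,n):
   count=0
   # cursor pages through the search results for the query
   result_generator=twython_object.cursor(twython_object.search,q=query)
   result_set=[]
   for r in result_generator:
      result_set.append(r['text'])
      count+=1
      if count ==n: break

   return result_set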

You will have to figure out the code yourself. We can now download the tweets.

Downloading Tweets & Clean

Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary one for Trump and the other for Obama because we are downloading tweets about these two people.

There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.

tweets={}
tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#trump',1500)]
tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#obama',1500)]

The get_tweets function is also used in the code above along with our Twitter app information, which is stored in the object t. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis.


To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.


We will now make a function to calculate the sentiment.

def sentiment_score(text,pos_list,neg_list):
   positive_score=0
   negative_score=0
   for w in text.split(' '):
      if w in pos_list:positive_score+=1
      if w in neg_list:negative_score+=1
   return positive_score-negative_score

Now we create an empty dictionary and run the analysis for Trump and then for Obama.

tweets_sentiment={}
tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['trump']]
tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['obama']]
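To make the scoring concrete, here is a small worked example that runs on its own. The word lists and tweets are tiny made-up stand-ins, and the scoring function is repeated only so the sketch is self-contained.

```python
def sentiment_score(text, pos_list, neg_list):
    # Count positive and negative words, then take the difference
    positive_score = 0
    negative_score = 0
    for w in text.split(' '):
        if w in pos_list: positive_score += 1
        if w in neg_list: negative_score += 1
    return positive_score - negative_score

# Hypothetical tiny lexicons and already-cleaned tweets
positive_words = {'great', 'love', 'win'}
negative_words = {'bad', 'sad', 'lose'}
tweets = {'trump': ['great win today', 'sad news'],
          'obama': ['love this', 'bad bad day', 'ok']}

tweets_sentiment = {}
for key in tweets:
    tweets_sentiment[key] = [sentiment_score(t, positive_words, negative_words)
                             for t in tweets[key]]

# Mean score per person
means = {key: sum(s) / len(s) for key, s in tweets_sentiment.items()}
print(tweets_sentiment)   # {'trump': [2, -1], 'obama': [1, -2, 0]}
print(means['trump'])     # 0.5
```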

Now we can make visuals of our results with the code below


Obama is on the left and Trump is on the right. It seems that Trump tweets are consistently more positive. Below are the means for both.

Out[133]: 0.36363636363636365

Out[134]: 0.2222222222222222

Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud.


Here is the code for the Obama word cloud


A lot of speculation can be drawn from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter. People are always posting tweets, which changes the results.


This post provided an example of how to download and analyze tweets from Twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis as it is easy to become confused and not accomplish anything.

Word Clouds in Python

Word clouds are a type of data visualization in which various words from a dataset are displayed. Words that are larger in the word cloud are more common and words in the middle are also more common. In addition, some word clouds even use various colors to indicate importance.

This post will provide an example of how to make a word cloud using Python. We will be using the “Women’s E-Commerce Clothing Reviews” dataset available on the Kaggle website. We are going to only use the text reviews to make our word cloud even though other data is in the dataset. To prepare our dataset for making the word cloud we need to do the following.

  1. Lowercase all words
  2. Remove punctuation
  3. Remove stopwords

After completing these steps we can make the word cloud. First, we need to load all of the necessary modules.

import pandas as pd
import re
from nltk.corpus import stopwords
import wordcloud
import matplotlib.pyplot as plt

We now need to load our dataset. We will store it as the object ‘df’.

df=pd.read_csv('YOUR LOCATION HERE')


It’s hard to read but we will be working only with the “Review Text” column as this has the text data we need. Here is what our column looks like up close.

df['Review Text'].head()

0 Absolutely wonderful - silky and sexy and comf...
1 Love this dress! it's sooo pretty. i happene...
2 I had such high hopes for this dress and reall...
3 I love, love, love this jumpsuit. it's fun, fl...
4 This shirt is very flattering to all due to th...
Name: Review Text, dtype: object

We will now make all words lower case and remove punctuation with the code below.

df["Review Text"]=df['Review Text'].str.lower()
df["Review Text"]=df['Review Text'].str.replace(r'[-./?!,":;()\']',' ', regex=True)

The first line in the code above lowercases all words. The second line removes the punctuation. The second line is trickier as you have to tell Python exactly what punctuation you want to remove and what to replace it with. Everything we want to remove is inside the square brackets in the first set of single quotes. We want to replace the punctuation with a space, which is the second set of single quotation marks with a space in the middle. The r at the beginning of the pattern marks it as a raw string, so the backslash is passed to the regular expression unchanged. The regex=True argument tells pandas to treat the pattern as a regular expression, which newer versions of pandas no longer assume by default.
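To see the two cleaning steps on something small, here is a minimal sketch. The two sample reviews are made up for illustration and are not from the Kaggle dataset.

```python
import pandas as pd

# Hypothetical two-row sample standing in for the 'Review Text' column
reviews = pd.Series(["Love this dress! It's sooo pretty.", "Fun, flirty - and comfy"])

# Lowercase first, then swap each listed punctuation character for a space.
# regex=True is passed explicitly so the pattern is read as a character class.
cleaned = reviews.str.lower().str.replace(r'[-./?!,":;()\']', ' ', regex=True)
print(cleaned.tolist())
```

Because each punctuation mark becomes a space rather than disappearing, the cleaned text contains runs of extra spaces. That is harmless for a word cloud, which splits on whitespace anyway.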

Here is what our data looks like after making these two changes

df['Review Text'].head()

0 absolutely wonderful silky and sexy and comf...
1 love this dress it s sooo pretty i happene...
2 i had such high hopes for this dress and reall...
3 i love love love this jumpsuit it s fun fl...
4 this shirt is very flattering to all due to th...
Name: Review Text, dtype: object

All the words are in lowercase. In addition, you can see that the dash in line 0 is gone, as is all the punctuation in the other lines. We now need to remove stopwords. Stopwords are the functional words that glue a sentence together without carrying much meaning on their own. Examples include and, for, but, etc. We are trying to make a cloud of substantive words and not stopwords, so these words need to be removed.

If you have never done this on your computer before you may need to import the nltk module and run nltk.download_gui(). Once this is done you need to download the stopwords package.

Below is the code for removing the stopwords. First, we need to load the stopwords. This is done below.

stopwords_list = stopwords.words('english')
stopwords_list.append('to')

We create an object called stopwords_list which has all the English stopwords. The second line just adds the word ‘to’ to the list. Next, we need to make an object that will look for the pattern of words we want to remove. Below is the code

pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))

This code is basically telling Python what to look for. Using regular expressions, Python will look for any whole word that matches one of the words supplied by the .join function. The \b on each side marks a word boundary, and '|'.join places an “or” between each of the words in our stopwords_list. We will now take this object called ‘pat’ and use it on our ‘Review Text’ variable.
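Here is a small sketch of how the pattern behaves, using only Python's built-in re module. The short stopword list is a hypothetical stand-in for the full NLTK list.

```python
import re

# Hypothetical short stopword list; the post uses NLTK's full English list
stopwords_list = ["and", "it", "s", "this", "to"]

# Join the words with '|' (regex "or") and wrap them in word boundaries (\b)
# so that only whole words match: "it" is removed, but "fit" is left alone
pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))
print(pat)  # \b(?:and|it|s|this|to)\b

text = "love this dress it s sooo pretty and a great fit"
result = re.sub(pat, '', text)
print(result)
```

If a stopword ever contained a special regex character, wrapping each word in re.escape() before joining would be a safer choice.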

df['Split Text'] = df['Review Text'].str.replace(pat, '', regex=True)
df['Split Text'].head()

0      absolutely wonderful   silky  sexy  comfortable
1    love  dress     sooo pretty    happened  find ...
2       high hopes   dress  really wanted   work   ...
3     love  love  love  jumpsuit    fun  flirty   f...
4     shirt   flattering   due   adjustable front t...
Name: Split Text, dtype: object

You can see that we have created a new column called ‘Split Text’ and the result is text that has lost many stopwords.

We are now ready to make our word cloud. Below is the code and the output.

wordcloud1=wordcloud.WordCloud(width=1000,height=500).generate(' '.join(map(str, df['Split Text'])))


This code is complex. We used the WordCloud function and we had to use generate, map, and join as inner functions. All of these functions were needed to take the words from the dataframe and turn them into simple text for the WordCloud function.
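To unpack what map and join are doing, here is a rough sketch on a made-up column. The missing value is an assumption about why map(str, ...) was needed, since join fails on anything that is not a string. The word count at the end is only a preview of what the cloud would emphasize, not part of the wordcloud module.

```python
from collections import Counter
import pandas as pd

# Hypothetical cleaned column; the NaN mimics a review with no text
split_text = pd.Series(["love dress", "dress fits", float('nan')])

# map(str, ...) turns every entry into a string, and ' '.join glues the
# rows together into one long string separated by spaces
text = ' '.join(map(str, split_text))
print(text)

# Counting the words previews which ones would be drawn largest
counts = Counter(text.split())
print(counts.most_common(1))
```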

The rest of the code is common to matplotlib so it does not require much explanation. As you look at the word cloud, you can see that the most common words include top, look, dress, shirt, fabric, etc. This is reasonable given that these are women’s reviews of clothing.


This post provided an example of text analysis using word clouds in Python. The insights here are primarily descriptive in nature. This means that if the desire is prediction or classification, additional tools would need to be built upon what is shown here.

KMeans Clustering in Python

Kmeans clustering is a technique in which the examples in a dataset are divided through segmentation. The segmentation relies on a statistical analysis in which examples within a group are more similar to each other than to examples outside of the group.

The application of this is that it provides the analyst with various groups that have similar characteristics, which can be used to tailor services in industries such as business or education. In this post, we will look at how to do this using Python. We will use the steps below to complete this process.

  1. Data preparation
  2. Determine the number of clusters
  3. Conduct analysis

Data Preparation

Our data for this example comes from the sat.act dataset available in the pydataset module. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

We will now load our dataset and drop any NAs that may be present.


You can see there are six variables that will be used for the clustering. Next, we will turn to determining the number of clusters.

Determine the Number of Clusters

Before you can actually do a kmeans analysis you must specify the number of clusters. This can be tricky as there is no single way to determine this. For our purposes, we will use the elbow method.

The elbow method measures the within-cluster error for each number of clusters. As the number of clusters increases, this error decreases. However, at a certain point the return on adding more clusters becomes minimal, and this point is known as the elbow. Below is the code to calculate this.

distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)
    distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])

Here is what we did

  1. We made an empty list called ‘distortions’. We will save our results there.
  2. In the second line, we told Python the range of clusters we want to consider. Because range(1,10) stops before 10, we consider anywhere from 1 to 9 clusters.
  3. In lines 3 and 4, we use a for loop to fit a KMeans model to the df object for each number of clusters.
  4. In line 5, we save the average distance from each example to its nearest cluster center in the distortions list.
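To make the distortion calculation in the last step more concrete, here is a minimal sketch using plain numpy on a made-up set of four points and two cluster centers. The numpy expression for the distances is a swapped-in equivalent of cdist with the 'euclidean' option.

```python
import numpy as np

# Hypothetical toy data: four points and two candidate cluster centers
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])

# Pairwise Euclidean distances with shape (n_points, n_centers)
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)

# Keep each point's distance to its nearest center, then average
distortion = dists.min(axis=1).sum() / points.shape[0]
print(distortion)  # 1.0
```

Every point here sits exactly one unit from its nearest center, so the average is 1.0. Adding more centers can only shrink this number, which is why we look for the elbow rather than the minimum.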

Below is a visual to determine the number of clusters

plt.plot(K, distortions, 'bx-')
plt.title('The Elbow Method showing the optimal k')


The graph indicates that 3 clusters are sufficient for this dataset. We can now perform the actual kmeans clustering.

KMeans Analysis

The code for the kmeans analysis is as follows

  1. We use the KMeans function and tell Python the number of clusters, the type of initialization, and we set the seed with the random_state argument. All this is saved in the object called km
  2. The km object has the .fit function used on it with df.values
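The two steps above might look something like the sketch below. The synthetic data, the 'k-means++' initialization, and the seed value are all assumptions for illustration, since the settings used in this post are not shown.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data with three well-separated groups of 20 points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.1, size=(20, 2)) for c in (0, 5, 10)])

# Number of clusters, initialization type, and seed (values are assumptions)
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
km.fit(X)

# predict returns the cluster label assigned to each row
labels = km.predict(X)
print(np.bincount(labels))
```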

Next, we will predict with the predict function and look at the first few lines of the modified df with the .head() function.


You can see we created a new variable called predict. This variable contains the kmeans algorithm's prediction of which group each example belongs to. We then printed the first five values as an illustration. Below are the descriptive statistics for the three clusters that were produced for the variables in the dataset.


It is clear that the clusters are mainly divided based on the performance on the various tests used. In the last piece of code, gender is used. 1 represents male and 2 represents female.

We will now make a visual of the clusters using two dimensions. First, we need to make a map of the clusters that is saved as a dictionary. Then we will create a new variable in which we take the numerical value of each cluster and convert it to a string using our cluster map dictionary.
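As a rough sketch of the dictionary mapping, the example below uses pandas' .map function. The cluster names and the tiny dataframe are hypothetical, since the names chosen in this post are not shown.

```python
import pandas as pd

# Hypothetical cluster labels and names for illustration
df_demo = pd.DataFrame({'predict': [0, 2, 1, 0]})
clust_map = {0: 'low', 1: 'mid', 2: 'high'}

# .map looks up each numeric label in the dictionary and returns the string
df_demo['perf'] = df_demo['predict'].map(clust_map)
print(df_demo['perf'].tolist())  # ['low', 'high', 'mid', 'low']
```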


Next, we make a different dictionary to color the points in our graph.



Here is what is happening in the code above.

  1. We set the ax object to a value.
  2. A for loop is used to go through every example in clust_map.values so that they are colored according to their cluster
  3. Lastly, a plot is called which lines up the perf and clust values for color.

The groups are clearly separated when looking at them in two dimensions.


Kmeans is a form of unsupervised learning in which there is no dependent variable which you can use to assess the accuracy of the classification or the reduction of error in regression. As such, it can be difficult to know how well the algorithm did with the data. Despite this, kmeans is commonly used in situations in which people are trying to understand the data rather than predict.

Random Forest in Python

This post will provide a demonstration of the use of the random forest algorithm in Python. Random forest is similar to decision trees except that instead of one tree a multitude of trees are grown to make predictions. The various trees all vote on how to classify an example and the majority vote normally wins. Through making many trees the accuracy of the model normally improves.

The steps are as follows for the use of random forest

  1. Data preparation
  2. Model development & evaluation
  3. Model comparison
  4. Determine variable importance

Data Preparation

We will use the cancer dataset from the pydataset module. We want to predict if someone is censored or dead in the status variable. The other variables will be used as predictors. Below is some code that contains all of the modules we will use.

import pandas as pd
import sklearn.ensemble as sk
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt

We will now load the cancer data in an object called ‘df’. Then we will remove all NAs using the .dropna() function. Below is the code.

df = data('cancer')
df = df.dropna()

We now need to make two datasets. One dataset, called X, will contain all of the predictor variables. Another dataset, called y, will contain the outcome variable. In the y dataset, we need to change the numerical values to a string. This will make interpretation easier as we will not need to look up what the numbers represent. Below is the code.
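One way this could be done is sketched below on a made-up four-row stand-in for the cancer data. The use of .drop and .map here is an assumption about the approach, not necessarily the exact code from this post.

```python
import pandas as pd

# Hypothetical four-row stand-in for the cancer data
df_demo = pd.DataFrame({'time': [306, 455, 1010, 210],
                        'age': [74, 68, 56, 57],
                        'status': [2, 2, 1, 2]})

# X keeps the predictors only; y holds the outcome with readable labels
X = df_demo.drop(columns='status')
y = df_demo['status'].map({1: 'censored', 2: 'dead'})
print(y.tolist())  # ['dead', 'dead', 'censored', 'dead']
```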


Instead of 1 we now have the string “censored” and instead of 2 we now have the string “dead” in the status variable. The final step is to set up our train and test sets. We will do a 70/30 split. We will have a train set for the X and y dataset as well as a test set for the X and y datasets. This means we will have four datasets in all. Below is the code.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to move to model development

Model Development and Evaluation

We now need to create our classifier and fit the data to it. This is done with the following code.


The clf object has our random forest algorithm. The number of estimators is set to 100. This is the number of trees that will be generated. In the second line of code, we use the .fit function and use the training datasets x and y.

We now will test our model and evaluate it. To do this we will use the .predict() function with the test dataset. Then we will make a confusion matrix followed by common metrics in classification. Below is the code and the output.
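Here is a minimal sketch of fitting, predicting, and evaluating a random forest. It uses synthetic data from scikit-learn's make_classification as a stand-in for the cancer dataset, so the numbers it prints will not match the results discussed in this post.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic data standing in for the cancer dataset (an assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# 100 trees, fit on the training data only
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf = clf.fit(x_train, y_train)

# Predict on the held-out data, then build a confusion matrix and metrics
y_pred = clf.predict(x_test)
print(pd.crosstab(y_test, y_pred, rownames=['actual'], colnames=['predicted']))
print(metrics.classification_report(y_test, y_pred))
```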


You can see that our model is good at predicting who is dead but struggles with predicting who is censored. The metrics are reasonable for dead but terrible for censored.

We will now make a second model for the purpose of comparison

Model Comparison

In this model, we will use out of bag samples to determine accuracy, set the minimum split size at 5 examples, and require that each leaf has at least 2 examples. Below is the code and the output.
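The settings described above map onto three arguments of scikit-learn's RandomForestClassifier. The sketch below shows them on synthetic data, which is an assumption used only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the cancer dataset (an assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# oob_score=True scores each tree on the rows left out of its bootstrap sample
clf2 = RandomForestClassifier(n_estimators=100, oob_score=True,
                              min_samples_split=5, min_samples_leaf=2,
                              random_state=0)
clf2.fit(X, y)
print(clf2.oob_score_)
```

The out of bag score is an accuracy estimate that comes for free during training, so it can be checked before touching the test set.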


There was some improvement in classifying people who were censored as well as those who were dead.

Variable Importance

We will now look at which variables were most important in classifying our examples. Below is the code


We create an object called model_ranks and we indicate the following.

  1. Classify the features by importance
  2. Set index to the columns in the training dataset of x
  3. Sort the features from most to least importance
  4. Make a barplot
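The four steps above can be sketched as follows. The synthetic data and the column names f1 to f4 are hypothetical, and the plotting call is left as a comment so the example runs without opening a figure.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical column names
X, y = make_classification(n_samples=300, n_features=4, n_informative=2,
                           n_redundant=0, random_state=0)
x_train = pd.DataFrame(X, columns=['f1', 'f2', 'f3', 'f4'])
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y)

# Importances indexed by column name, sorted from most to least important
model_ranks = pd.Series(clf.feature_importances_, index=x_train.columns)
model_ranks = model_ranks.sort_values(ascending=False)
print(model_ranks)
# model_ranks.plot.bar() would draw the bar plot
```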

Below is the output


You can see that time is the strongest classifier. How long someone has had cancer is the strongest predictor of whether they are censored or dead. Next is the number of calories per meal, followed by weight loss and age.


Here we learned how to use random forest in Python. This is another tool commonly used in the context of machine learning.

Decision Trees in Python

Decision trees are used in machine learning. They are easy to understand and are able to deal with data that is less than ideal. In addition, because of the pictorial nature of the results decision trees are easy for people to interpret. We are going to use the ‘cancer’ dataset to predict mortality based on several independent variables.

We will follow the steps below for our decision tree analysis

  1. Data preparation
  2. Model development
  3. Model evaluation

Data Preparation

We need to load the following modules in order to complete this analysis.

import pandas as pd
import statsmodels.api as sm
import sklearn
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
import matplotlib.pyplot as plt
from io import StringIO
from IPython.display import Image 
from sklearn.tree import export_graphviz
import pydotplus

The ‘cancer’ dataset comes from the ‘pydataset’ module. You can learn more about the dataset by typing the following

data('cancer', show_doc=True)

This provides all you need to know about our dataset in terms of what each variable is measuring. We need to load our data as ‘df’. In addition, we need to remove rows with missing values and this is done below.

df = data('cancer')
len(df)
Out[58]: 228
df = df.dropna()
len(df)
Out[59]: 167

The initial number of rows in the data set was 228. After removing missing data it dropped to 167. We now need to set up our list with the independent variables and a second list with the dependent variable. While doing this, we need to recode our dependent variable “status” so that the numerical values are replaced with a string. This will help us to interpret our decision tree later. Below is the code


Next,  we need to make our train and test sets using the train_test_split function.  We want a 70/30 split. The code is below.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to develop our model.

Model Development

The code for the model is below


We first make an object called “clf” which calls the DecisionTreeClassifier. Inside the parentheses, we tell Python that we do not want the tree to split any node that contains fewer than 10 examples. The second line uses the .fit function and calls the training datasets.
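A minimal sketch of such a model is below. It uses synthetic data in place of the cancer dataset, and it prints a text version of the tree with scikit-learn's export_text, which is a swapped-in alternative to the graphical output used in this post.

```python
from sklearn.datasets import make_classification
from sklearn import tree

# Synthetic data standing in for the cancer dataset (an assumption)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# min_samples_split=10: a node needs at least 10 examples before it can split
clf = tree.DecisionTreeClassifier(min_samples_split=10, random_state=0)
clf = clf.fit(X, y)

print(clf.get_depth(), clf.get_n_leaves())
# Text rendering of the top of the tree
print(tree.export_text(clf, max_depth=2))
```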

We can also make a visual of our decision tree.

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
                filled=True, rounded=True,
                feature_names=list(x_train.columns.values))
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())


If we interpret the nodes furthest to the left we get the following

  • If a person has had cancer less than 171 days and
  • If the person is less than 74.5 years old then
  • The person is dead

If you look closely, every node is classified as ‘dead’. This may indicate a problem with our model. The evaluation metrics are below.

Model Evaluation

We will use the .crosstab function and the metrics classification functions


You can see that the metrics are not that great in general. This may be why everything was classified as ‘dead’. Another reason is that few people were classified as ‘censored’ in the dataset.


Decision trees are another machine learning tool. Python allows you to develop trees rather quickly that can provide insights into how to take action.

Multiple Regression in Python

In this post, we will go through the process of setting up a regression model with a training and testing set using Python. We will use the insurance dataset from Kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

  1. Data preparation
  2. Model training
  3. Model testing

Data Preparation

Below is a list of the modules we will need in order to complete the analysis

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model,model_selection, feature_selection,preprocessing
import statsmodels.api as sm
from statsmodels.tools.eval_measures import mse  # module path assumed
from statsmodels.tools.tools import add_constant  # module path assumed
from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the  .read_csv function from pandas to load the data and .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')


We need to create some dummy variables for sex, smoker, and region. We will address that in a moment, right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.


The descriptives are left for your own interpretation. As for the correlations, they are generally weak, which suggests the variables are not redundant with one another and that regression may be appropriate.

As mentioned earlier, we need to make dummy variables for sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

  1. Use the pd.get_dummies function from pandas to create the dummy
  2. Save the dummy variable in an object called ‘dummy’
  3. Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
  4. Repeat this three times

Below is the code for doing this



The .get_dummies function requires the name of the dataframe and, in the brackets, the name of the variable to convert. The .concat function requires the names of the two datasets to combine as well as the axis on which to perform it.
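To illustrate the dummy variable steps, here is a short sketch for one variable on a made-up three-row version of the insurance data. The same two lines would be repeated for smoker and region.

```python
import pandas as pd

# Hypothetical three-row stand-in for the insurance data
insure_demo = pd.DataFrame({'age': [19, 18, 28],
                            'sex': ['female', 'male', 'male']})

# One new column per category of 'sex'
dummy = pd.get_dummies(insure_demo['sex'])

# axis=1 attaches the new columns side by side rather than stacking rows
insure_demo = pd.concat([insure_demo, dummy], axis=1)
print(insure_demo.columns.tolist())  # ['age', 'sex', 'female', 'male']
```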

We now need to remove the original text variables from the dataset. In addition, we need to remove the y variable “charges” because this is the dependent variable.

y = insure.charges
insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

Model Training

Our train and test sets are made with the model_selection.train_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can now run our regression analysis. This requires the use of the .OLS function from the statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above inside the parentheses, we put the dependent variable(y_train) and the independent variables (X_train). However, we had to use the function add_constant to get the intercept for the output. All of this information is then used inside the .fit() function to fit a model.
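To see why the constant matters, here is a small sketch that uses numpy's least squares solver instead of statsmodels. Adding a column of ones is what add_constant does, and the made-up data follow y = 3 + 2x exactly.

```python
import numpy as np

# Hypothetical data generated from y = 3 + 2x with no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 3.0 + 2.0 * x

# Without a column of ones the fitted line is forced through the origin
slope_only, *_ = np.linalg.lstsq(x[:, None], y, rcond=None)
print(slope_only)

# With a column of ones the model can estimate the intercept as well
X_const = np.column_stack([np.ones_like(x), x])
coefs, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print(coefs)  # approximately [3. 2.]
```

Without the constant, the slope is distorted because it has to absorb the missing intercept.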

To see the output you need to use the .summary() function as shown below.



The assumption is that you know regression but are reading this post to learn Python. Therefore, we will not go into great detail about the results. The r-square is strong; however, region and gender are not statistically significant.

We will now move to model testing

Model Testing

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values with the model we made with the new data. This is shown in the code below


We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.
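As a quick illustration of what the mean squared error measures, here is a sketch with numpy on three made-up actual and predicted values.

```python
import numpy as np

# Hypothetical actual and predicted values
y_test_demo = np.array([10.0, 20.0, 30.0])
ypred_demo = np.array([12.0, 18.0, 33.0])

# Average of the squared differences: (4 + 4 + 9) / 3
mse_demo = np.mean((y_test_demo - ypred_demo) ** 2)
print(mse_demo)

# The square root puts the error back in the units of the outcome
print(np.sqrt(mse_demo))
```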


For our final trick, we will make a scatterplot with the predicted and actual values of the test set. In addition, we will calculate the correlation of the predicted values and test set values. This is an alternative metric for assessing a model.


You can see the first two lines are for making the plot. Lines 3-4 are for making the correlation matrix and involve the .concat() function. The correlation is high at 0.86, which indicates the model is good at accurately predicting the values. This is confirmed with the scatterplot, which is almost a straight line.


In this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and tested a model with an evaluation of it.

Principal Component Analysis in Python

Principal component analysis is a form of dimension reduction commonly used in statistics. By dimension reduction, it is meant to reduce the number of variables without losing too much overall information. This has the practical application of speeding up computational times if you want to run other forms of analysis such as regression but with fewer variables.

Another application of principal component analysis is for data visualization. Sometimes, you may want to reduce many variables to two in order to see subgroups in the data.

Keep in mind that in either situation PCA works better when there are high correlations among the variables. The explanation is complex but has to do with the rotation of the data which helps to separate the overlapping variance.

Prepare the Data

We will be using the pneumon dataset from the pydataset module. We want to try and explain the variance with fewer variables than in the dataset. Below is some initial code.

import pandas as pd
from sklearn.decomposition import PCA
from pydataset import data
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

Next, we will set up our dataframe. We will only take the first 200 examples from the dataset. If we take all (over 3000 examples), the visualization will be a giant blob of dots that cannot be interpreted. We will also drop any missing values. Below is the code

df = data('pneumon')

When doing a PCA, it is important to scale the data because PCA is sensitive to this. The result of the scaling process is an array, which no longer carries the column names. The PCA function will accept an array, but converting it back to a dataframe makes the later steps easier to follow. When this happens you also have to rename the columns in the new dataframe. All this is done in the code below.

scaler = StandardScaler() #instance
df_scaled = scaler.fit_transform(df) #scaled the data
df_scaled= pd.DataFrame(df_scaled) #made the dataframe
df_scaled=df_scaled.rename(index=str, columns={0: "chldage", 1: "hospital",2:"mthage",3:"urban",4:"alcohol",5:"smoke",6:"region",7:"poverty",8:"bweight",9:"race",10:"education",11:"nsibs",12:"wmonth",13:"sfmonth",14:"agepn"}) # renamed columns


We are now ready to do our analysis. We first use the PCA function to indicate how many components we want. For our first example, we will have two components. Next, you use the .fit_transform function to fit the model. Below is the code.
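The steps just described could look like the sketch below. It uses randomly generated data with three correlated columns as a stand-in for the pneumon dataset, so the variance figures it prints are not the ones reported in this post.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three correlated columns plus two unrelated ones
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
correlated = [base + 0.1 * rng.normal(size=(200, 1)) for _ in range(3)]
data_demo = np.hstack(correlated + [rng.normal(size=(200, 2))])

# Scale first, then ask for two components and fit
scaled = StandardScaler().fit_transform(data_demo)
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)

print(components.shape)  # (200, 2)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())
```

Because three of the columns move together, the first component picks up most of the variance, which shows why PCA works better when the variables are highly correlated.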


Now we can see the variance explained by component and the sum

Out[199]: array([0.18201588, 0.12022734])

Out[200]: 0.30224321247148167

In the first output, we can see that the first component explained 18% of the variance and the second explained 12%. This leads to a total of about 30%. Below is a visual of our 2 component model. The color represents the race of the respondent. The three different colors represent three different races.


Our two components do a reasonable job of separating the data. Below is the code for making four components. We cannot graph four components since our graph can only handle two, but you will see that as we increase the components we also increase the variance explained.



Out[209]: array([0.18201588, 0.12022734, 0.09290502, 0.08945079])

Out[210]: 0.4845990164486457

With four components we now have almost 50% of the variance explained.


PCA is for summarising and reducing the number of variables used in an analysis or for the purposes of data visualization. Once this process is complete you can use the results to do further analysis if you desire.

Data Exploration with Python

In this post, we will explore a dataset using Python. The dataset we will use is the Ghouls, Goblins, and Ghosts (GGG) dataset available at the Kaggle website. The analysis will not be anything complex. We will simply do the following.

  • Data preparation
  • Data  visualization
  • Descriptive statistics
  • Regression analysis

Data Preparation

The GGG dataset is fictitious data on the characteristics of spirits. Below are the modules we will use for our analysis.

import pandas as pd
import statsmodels.regression.linear_model as sm
import numpy as np

Once you download the dataset to your computer you need to load it into Python using the .read_csv function from pandas. Below is the code.

df=pd.read_csv('FILE LOCATION HERE')

We store the data as “df” in the example above. Next, we will take a peek at the first few rows of data to see what we are working with.


Using the print function and accessing the first five rows reveals the structure of the data. It appears the first five columns are continuous data and the last two columns are categorical. The ‘id’ variable is useless for our purposes so we will remove it with the code below.


The code above uses the drop function to remove the variable ‘id’. This is all saved into the object ‘df’. In other words, we wrote over our original ‘df’.

Data Visualization

We will start with our categorical variables for the data visualization. Below is a table and a graph of the ‘color’ and ‘type’ variables.

First, we make an object called ‘spirits’ using the groupby function to organize the table by the ‘type’ variable.


Below we make a graph of the data above using the .plot function. A professional wouldn’t make this plot but we are just practicing how to code.


We now know how many ghosts, goblins, and ghouls there are in the dataset. We will now do a breakdown of ‘type’ by ‘color’ using the .crosstab function from pandas.


We will now make bar graphs of both of the categorical variables using the .plot function.


We will now turn our attention to the continuous variables. We will simply make histograms and calculate the correlation between them. First the histograms

The code simply subsets the variable you want in the brackets and then calls .plot.hist() to access the histogram function. It appears that all of our data is approximately normally distributed. Now for the correlation


Using the .corr() function has shown that there are no high correlations among the continuous variables. We will now do an analysis in which we combine the continuous and categorical variables through making boxplots

The code is repetitive. We use the .boxplot() function and tell Python the column which is continuous and the ‘by’ which is the categorical variable.

Descriptive Stats

We are simply going to calculate the mean and standard deviation of the continuous variables.

Out[65]: 0.43415996604821117

Out[66]: 0.13265391313941383

Out[67]: 0.5291143100058727

Out[68]: 0.16967268504935665

Out[69]: 0.47139203219259107

Out[70]: 0.17589180837106724

The mean is calculated with the .mean() function. Standard deviation is calculated using the .std() function from the numpy package.
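One detail worth knowing is that pandas and numpy do not compute the standard deviation the same way by default. The sketch below shows the difference on four made-up values. Which version produced the outputs above is not something this example can confirm.

```python
import numpy as np
import pandas as pd

# Hypothetical values for illustration
s = pd.Series([1.0, 2.0, 3.0, 4.0])

print(s.mean())              # 2.5
print(s.std())               # pandas divides by n - 1 (sample std)
print(np.std(s.to_numpy()))  # numpy divides by n (population std)
```

With a few hundred rows the two numbers are nearly identical, but on small samples the gap is noticeable.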

Multiple Regression

Our final trick is we want to explain the variable “has_soul” using the other continuous variables that are available. Below is the code

X = df[["bone_length", "rotting_flesh","hair_length"]]

y = df["has_soul"]

model = sm.OLS(y, X).fit()

In the code above we create two new objects. X contains our independent variables and y contains the dependent variable. Then we create an object called model and use the OLS() function. We place the y and X inside the parentheses and we then use the .fit() function as well. Below is the summary of the analysis


There is obviously a lot of information in the output. The r-square is 0.91, which is surprisingly high given that there were not high correlations in the matrix. A likely explanation is that no constant was added to X, so the model was fit without an intercept, and in that case statsmodels reports an uncentered r-square that tends to be much larger. The coefficients for the three independent variables are listed and all are significant. The AIC and BIC are for model comparison and do not mean much in isolation. The JB stat indicates that our distribution is not normal. The Durbin-Watson test indicates negative autocorrelation, which is important in time-series analysis.


Data exploration can be an insightful experience. Using Python, we found many different patterns and ways to describe the data.

Logistic Regression in Python

This post will provide an example of a logistic regression analysis in Python. Logistic regression is commonly used when the dependent variable is categorical. Our goal will be to predict the gender of an example based on the other variables in the model. Below are the steps we will take to achieve this.

  1. Data preparation
  2. Model development
  3. Model testing
  4. Model evaluation

Data Preparation

The dataset we will use is the ‘Survey of Labour and Income Dynamics’ (SLID) dataset available in the pydataset module in Python. This dataset contains basic data on labor and income along with some demographic information. The initial code that we need is below.

import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from pydataset import data

The code above loads all the modules and other tools we will need in this example. We now can load our data. In addition to loading the data, we will also look at the count and the characteristics of the variables. Below is the code.


At the top of this code, we create the ‘df’ object which contains our data from the “SLID”. Next, we used the .count() function to determine if there was any missing data and to see what variables were available. It appears that we have five variables and a lot of missing data, as each variable has a different amount of data. Lastly, we used the .head() function to see what each variable contained. It appears that wages, education, and age are continuous variables while sex and language are categorical. The categorical variables will need to be converted to dummy variables as well.

The next thing we need to do is drop all the rows that are missing data since it is hard to create a model when data is missing. Below is the code and the output for this process.


In the code above, we used the .dropna() function to remove missing data. Then we used the .count() function to see how many rows remained. You can see that all the variables have the same number of rows which is important for model analysis. We will now make our dummy variables for sex and language in the code below.


Here is what we did,

  1. We used the .get_dummies function from pandas first on the sex variable. All this was stored in a new object called “dummy”
  2. We then combined the dummy and df datasets using the .concat() function. The axis=1 argument is for combining by column.
  3. We repeat steps 1 and 2 for the language variable
  4. Lastly, we used the .head() function to see the results

With this, we are ready to move to model development.

Model Development

The first thing we need to do is put all of the independent variables in one dataframe and the dependent variable in its own dataframe. Below is the code for this


Notice that we did not use every variable that was available. For the language variables, we only used “French” and “Other”. This is because when you make dummy variables you only need k-1 dummies created. Since the language variable had three categories we only need two dummy variables. Therefore, we excluded “English” because when “French” and “Other” are coded 0 it means that “English” is the characteristic of the example.

In addition, we only took “male” as our dependent variable because if “male” is set to 0 it means that example is female. We now need to create our train and test dataset. The code is below.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We created four datasets

  • train dataset with the independent variables
  • train dataset with the dependent variable
  • test dataset with the independent variables
  • test dataset with the dependent variable

We used the train_test_split function to do this. The split is 70/30, with 70% being used for training and 30% being used for testing. This is the purpose of the “test_size” argument. We can now run our model and get the results. Below is the code.
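
For illustration, here is a small sketch of selecting the variables and splitting them. The ten rows of data are made up, and the column names simply follow the ones mentioned in this post.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical prepared data: dummy variables already created.
df = pd.DataFrame({
    "wages": [10.5, 14.0, 22.3, 8.0, 17.5, 12.1, 30.0, 9.9, 15.2, 11.0],
    "education": [12, 15, 16, 10, 14, 12, 18, 11, 13, 12],
    "age": [40, 19, 49, 46, 33, 27, 52, 38, 24, 61],
    "French": [0, 1, 0, 0, 0, 1, 0, 0, 0, 0],
    "Other": [0, 0, 1, 0, 0, 0, 0, 1, 0, 0],
    "male": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
})

X = df[["wages", "education", "age", "French", "Other"]]  # independent variables
y = df["male"]                                            # dependent variable

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)
print(X_train.shape, X_test.shape)
```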


Here is what we did

  1. We used the .Logit() function from statsmodels to create the logistic model. Notice we used only the training data.
  2. We then used the .fit() function to get the results and stored them in the ‘result’ object.
  3. Lastly, we printed the results in the ‘result’ object using the .summary() function.

There are some problems with the results. The Pseudo R-squared is infinity, which is unusual. Also, you may have some error output about Hessian inversion in your output. For these reasons, we cannot trust the results but will continue for the sake of learning.

The coefficients are clear. Only wage, education, and age are significant. A logistic regression coefficient is on the log-odds scale, so to interpret it you take the coefficient from the model and use the .exp() function from numpy. This gives the odds ratio. Below are the results.

Out[107]: 1.0832870676749586

Out[108]: 0.9417645335842487

Out[109]: 0.9900498337491681

For the first value, for every one-unit increase in wages, the odds that the person is male increase by about 8%. For every one-unit increase in education, the odds of the person being male decrease by about 6%. Lastly, for every one-unit increase in age, the odds of the person being male decrease by about 1%. Notice that we subtract 1 from the outputs to find the percentage change in the odds.
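
The arithmetic can be sketched as follows. Exponentiating a logistic regression coefficient gives an odds ratio. The coefficient values below (0.08, -0.06, -0.01) are an assumption: they are the rounded values that reproduce the three outputs above.

```python
import numpy as np

# Assumed (rounded) coefficients that reproduce the outputs above.
coefs = {"wages": 0.08, "education": -0.06, "age": -0.01}

odds_ratios = {}
for name, b in coefs.items():
    odds_ratios[name] = np.exp(b)                  # log-odds -> odds ratio
    pct_change = (odds_ratios[name] - 1) * 100     # percentage change in odds
    print(f"{name}: odds ratio {odds_ratios[name]:.4f}, change {pct_change:+.1f}%")
```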

We will now move to model testing

Model Testing

To do this we first test our model with the code below


We made the result object earlier. Now we just use the .predict() function with the X_test data. Next, we need to flag examples that the model believes have a 60% chance or greater of being male. The code is below.


This creates a boolean object with True and False as the output. Now we will make our confusion matrix as well as other metrics for classification.


The results speak for themselves. There are a lot of false positives if you look at the confusion matrix. In addition, precision, recall, and f1 are all low. Hopefully, the coding is clear. The main point is to be sure to use the test set dependent variable (y_test) with the flag data you made in the previous step.
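
As a sketch of the flagging and metrics steps, assuming a handful of made-up predicted probabilities in place of the real output of .predict():

```python
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Hypothetical actual labels and predicted probabilities of being male.
y_test = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0.9, 0.7, 0.65, 0.2, 0.4, 0.61, 0.1, 0.8])

y_flag = y_pred >= 0.6          # boolean flag: True when probability is 60% or more
y_flag_int = y_flag.astype(int) # cast to 0/1 so both inputs share one label type

cm = confusion_matrix(y_test, y_flag_int)   # rows = actual, columns = predicted
print(cm)
print(classification_report(y_test, y_flag_int))
```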

We will now make the ROC curve. For a strong model, it should have a strong elbow shape, while with a weak model it will be a diagonal straight line.


The first plot is of our data. The second plot is what a really bad model would look like. As you can see, there is little difference between the two. Again, this is because of all the false positives we have in the model. The actual coding should be clear: fpr is the false positive rate and tpr is the true positive rate. The function is .roc_curve(). Inside it go the actual test data and the predictions.
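
A minimal sketch of the ROC calculation follows, again with made-up labels and probabilities. It stops short of plotting, and the area under the curve is an extra summary that was not part of the original walkthrough.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical actual labels and predicted probabilities.
y_test = np.array([1, 0, 1, 0, 1, 0, 0, 1])
y_pred = np.array([0.9, 0.7, 0.65, 0.2, 0.4, 0.61, 0.1, 0.8])

# roc_curve takes the actual labels first, then the predicted scores.
fpr, tpr, thresholds = roc_curve(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)   # 0.5 is the diagonal line, 1.0 is perfect
print(fpr, tpr, auc)
```

The fpr and tpr arrays can then be handed to a plotting library to draw the curve.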


This post provided a demonstration of the use of logistic regression in Python. It is necessary to follow the steps above but keep in mind that this was a demonstration and the results are dubious.

Working with a Dataframe in Python

In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the tasks we will complete include the following…

  • Import data
  • Examine data
  • Work with strings
  • Calculating descriptive statistics

Import Data 

First, you need data, therefore, we will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.

import pandas as pd
df=pd.read_csv('FILE LOCATION HERE')

In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to be typed in quotes inside the parentheses. Having completed this, we can now examine the data.
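
To try this without a file on disk, here is a small sketch that reads CSV text from memory. The three rows are a hypothetical excerpt with only a few columns, standing in for the full Titanic file.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt standing in for the Titanic CSV file.
csv_text = """PassengerId,Name,Fare,Cabin,Embarked
1,"Braund, Mr. Owen Harris",7.25,,S
2,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",71.2833,C85,C
3,"Heikkinen, Miss. Laina",7.925,,S
"""

# pd.read_csv accepts a file path or, as here, a file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
```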

Data Examination

Now we want to get an idea of the size of our dataset and any problems with missing data. To determine the size we use .shape as shown below.

Out[33]: (891, 12)

Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many, we can use the .isnull() function in combination with the .value_counts() function.

True     687
False    204
Name: Cabin, dtype: int64

The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see we have 687 missing examples and only 204 with a value. For a categorical variable, you can also see how many examples are part of each category as shown below.

S    644
C    168
Q     77
Name: Embarked, dtype: int64
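
The two counts above can be sketched as follows, assuming a tiny made-up dataframe with the same two column names:

```python
import numpy as np
import pandas as pd

# Hypothetical five-row stand-in for the Titanic data.
df = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "C123", np.nan],
    "Embarked": ["C", "S", "S", "S", "Q"],
})

print(df.shape)                                  # (rows, columns)
missing = df["Cabin"].isnull().value_counts()    # True = missing, False = present
ports = df["Embarked"].value_counts()            # examples per category
print(missing)
print(ports)
```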

This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. Then we will check the size of the dataframe again with “shape”.

Out[40]: (183, 12)

You can see our dataframe is much smaller, going from 891 examples to 183. We can now move to other operations such as dealing with strings.

Working with Strings

What you do with strings really depends on your goals. We are going to look at extraction, subsetting, and determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.

1 Cumings
3 Futrelle
6 McCarthy
10 Sandstrom
11 Bonnell
Name: Name, dtype: object

As you can see we got the last names of the first five examples. We did this by using the following format…

dataframe name[‘Variable Name’].function.function(‘whole word’)

.str gives access to the string functions for a dataframe column. The .extract() function does what its name implies.

If you want, you can even determine how many characters each name has, including spaces and punctuation. We will do this with the .str and .len() functions on the first five names in the dataframe.

1 51
3 44
6 23
10 31
11 24
Name: Name, dtype: int64

Hopefully, the code is becoming easier to read and understand.
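
Both string operations can be sketched as follows. The names are three of the passengers shown above, and the regular expression is an assumption about how the last name was pulled out (it grabs the first whole word).

```python
import pandas as pd

names = pd.Series([
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "McCarthy, Mr. Timothy J",
], name="Name")

# Assumed pattern: capture the first run of word characters (the last name).
# expand=False returns a Series rather than a one-column dataframe.
last_names = names.str.extract(r"(\w+)", expand=False)
lengths = names.str.len()    # number of characters in each full name
print(last_names)
print(lengths)
```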


We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes, so we will run all of them at once. Below we calculate the mean, maximum, minimum, and standard deviation of the price of a fare on the Titanic.

Out[77]: 78.68246885245901

Out[78]: 512.32920000000001

Out[79]: 0.0

Out[80]: 76.34784270040574
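
A short sketch of the four calculations, assuming a made-up “Fare” column with four values rather than the real data:

```python
import pandas as pd

# Hypothetical fares; the real column has many more values.
df = pd.DataFrame({"Fare": [7.25, 71.2833, 7.925, 53.1]})

fare_mean = df["Fare"].mean()
fare_max = df["Fare"].max()
fare_min = df["Fare"].min()
fare_std = df["Fare"].std()    # pandas uses the sample standard deviation here
print(fare_mean, fare_max, fare_min, fare_std)
```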


This post provided you with some ways in which you can maneuver around a dataframe in Python.

Numpy Arrays in Python

In this post, we are going to explore arrays as created by the numpy package in Python. Understanding how arrays are created and manipulated is useful when you need to perform complex coding or analysis. In particular, we will address the following:

  1. Creating and exploring arrays
  2. Math with arrays
  3. Manipulating arrays

Creating and Exploring an Array

Creating an array is simple. You need to import the numpy package and then use the np.array function to create the array. Below is the code.

import numpy as np

Making an array requires the use of square brackets. If you want multiple rows, then you must use inner square brackets. In the example above I made an array with two rows, and each row has its own set of brackets.

Also, notice that we imported numpy as np. This is a shorthand so that we do not have to type the word numpy but only np. In addition, we have now created an array with ten data points spread across two rows.
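
For reference, an array consistent with the outputs shown in the rest of this post can be built as follows. The name “example” and the values 1 through 10 are inferred from those outputs.

```python
import numpy as np

# Two rows of five values; each row gets its own inner brackets.
example = np.array([[1, 2, 3, 4, 5],
                    [6, 7, 8, 9, 10]])
print(example)
```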

There are several attributes you can use to get an idea of the size of a data set. Below is a list with each one and an explanation.

  • .ndim = number of dimensions
  • .shape = Shows the number of rows and columns
  • .size = Counts the number of individual data points
  • .dtype = Tells you the data type

Below is code that uses all four of these functions with our array.

Out[78]: 2

Out[79]: (2, 5)

Out[80]: 10
Out[81]: 'int64'

You can see we have 2 dimensions. The .shape function tells us we have 2 rows and 5 columns. The .size function tells us we have 10 total examples (5 * 2). Lastly, the .dtype function tells us that this is an integer data type.
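
These four checks can be sketched as follows, assuming the array is named “example” and holds the values 1 through 10. The exact integer type reported can differ by platform.

```python
import numpy as np

example = np.array([[1, 2, 3, 4, 5],
                    [6, 7, 8, 9, 10]])

print(example.ndim)         # number of dimensions
print(example.shape)        # (rows, columns)
print(example.size)         # total number of data points
print(example.dtype.name)   # data type as a string, e.g. 'int64'
```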

Math with Arrays

Common mathematical operations can be performed on arrays, and they are applied to every element. Below are examples of addition, subtraction, multiplication, and conditionals.

array([[ 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12]])

array([[-1, 0, 1, 2, 3],
[ 4, 5, 6, 7, 8]])

array([[ 2, 4, 6, 8, 10],
[12, 14, 16, 18, 20]])

array([[ True, True, False, False, False],
[False, False, False, False, False]], dtype=bool)

Each number inside the example array was manipulated as indicated. For example, if we typed example + 2 all the values in the array increased by 2. Lastly, the example < 3 tells python to look inside the array and find all the values in the array that are less than 3.
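
The four operations can be sketched as follows. The operand 2 is confirmed by the text for addition and inferred from the outputs for subtraction and multiplication.

```python
import numpy as np

example = np.array([[1, 2, 3, 4, 5],
                    [6, 7, 8, 9, 10]])

added = example + 2        # every value increased by 2
subtracted = example - 2   # every value decreased by 2
doubled = example * 2      # every value multiplied by 2
small = example < 3        # True where the value is less than 3
print(added, subtracted, doubled, small, sep="\n\n")
```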

Manipulating Arrays

There are also several ways you can manipulate or access data inside an array. For example, you can pull a particular element in an array by doing the following.

Out[92]: 1

The information in the brackets tells Python to access the first row and the first number in that row. Recall that Python starts counting from 0. You can also access a range of values using the colon as shown below.

array([[3, 4],
[8, 9]])

In this example, the colon before the comma means take every row, which here is rows 1 and 2. After the comma we have 2:4, which means take the 3rd and 4th values in each row but not the 5th.
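
Both kinds of access can be sketched as follows, with the index values inferred from the outputs above:

```python
import numpy as np

example = np.array([[1, 2, 3, 4, 5],
                    [6, 7, 8, 9, 10]])

first = example[0, 0]      # first row, first value
block = example[:, 2:4]    # every row, positions 2 and 3 (the 3rd and 4th values)
print(first)
print(block)
```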

It is also possible to turn a multidimensional array into a single dimension with the .ravel() function and to transpose it with the .transpose() function. Below is the code for each.

Out[97]: array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

array([[ 1, 6],
[ 2, 7],
[ 3, 8],
[ 4, 9],
[ 5, 10]])

You can see the .ravel() function made a one-dimensional array. The .transpose() function swapped the rows and columns, turning the array of 2 rows and 5 columns into one with 5 rows and 2 columns.
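
A short sketch of both functions on the same assumed array:

```python
import numpy as np

example = np.array([[1, 2, 3, 4, 5],
                    [6, 7, 8, 9, 10]])

flat = example.ravel()          # one-dimensional array of all ten values
flipped = example.transpose()   # rows become columns and columns become rows
print(flat)
print(flipped)
```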


We now have a basic understanding of how numpy arrays work using Python. As mentioned before, this is valuable information to understand when trying to wrestle with different data science questions.