Monthly Archives: October 2018

Support Vector Machines Regression with Python

This post will provide an example of how to do regression with support vector machines SVM. SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.

The steps that we will use are listed below

  1. Data preparation
  2. Model Development

We will use two different kernels in our analysis. The LinearSVR kernel and SVR kernel. The difference between these two kernels has to do with slight changes in the calculations of the boundaries between classes.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is that we want to predict family inc (famlinc), which is a continuous variable.  Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from statsmodels.tools.eval_measures import mse

We now need to load our dataset and remove any missing values.

df=pd.DataFrame(data('OFP'))
df=df.dropna()

AS in the previous post, we need to change the text variables into dummy variables and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and also scales the data.

dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df = (df - df.min()) / (df.max() - df.min())
df.head()

1

We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable

X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']]
y=df['faminc']

We can now move to model development

Model Development

We now need to create our train and test sets for or X and y datasets. We will do a 70/30 split of the data. Below is the code

X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)

Next, we will create our two models with the code below.

h1=svm.SVR()
h2=svm.LinearSVR()

We will now run our first model and assess the results. Our metric is the mean squared error. Generally, the lower the number the better.  We will use the .fit() function to train the model and the .predict() function for test the model

1

The mse was 0.27. This number means nothing only and is only beneficial for comparison reasons. Therefore, the second model will be judged as better or worst only if the mse is lower than 0.27. Below are the results of the second model.

1.png

We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.

Conclusion

This post provided an example of how to use SVM for regression.

Support Vector Machines Classification with Python

Support vector machines (SVM) is an algorithm used to fit non-linear models. The details are complex but to put it simply  SVM tries to create the largest boundaries possible between the various groups it identifies in the sample. The mathematics behind this is complex especially if you are unaware of what a vector is as defined in algebra.

This post will provide an example of SVM using Python broken into the following steps.

  1. Data preparation
  2. Model Development

We will use two different kernels in our analysis. The linear kernel and he rbf kernel. The difference in terms of kernels has to do with how the boundaries between the different groups are made.

Data Preparation

We are going to use the OFP dataset available in the pydataset module. We want to predict if someone single or not. Below is some initial code.

import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection

We now need to load our dataset and remove any missing values.

df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()

1

Looking at the dataset we need to do something with the variables that have text. We will create dummy variables for all except region and hlth. The code is below.

dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)

For each variable, we did the following

  1. Created a dummy in the dummy dataset
  2. Combined the dummy variable with our df dataset
  3. Renamed the dummy variable based on yes or no
  4. Drop the other dummy variable from the dataset. Python creates two dummies instead of one.

If you look at the dataset now you will see a lot of variables that are not necessary. Below is the code to remove the information we do not need.

df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df.head()

1

This is much cleaner. Now we need to scale the data. This is because SVM is sensitive to scale. The code for doing this is below.

df = (df - df.min()) / (df.max() - df.min())
df.head()

1

We can now create our dataset with the independent variables and a separate dataset with our dependent variable. The code is as follows.

X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','faminc','black_person','Male','job','insured']]
y=df['single']

We can now move to model development

Model Development

We need to make our test and train sets first. We will use a 70/30 split.

X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)

Now, we need to create the models or the hypothesis we want to test. We will create two hypotheses. The first model is using a linear kernel and the second is one using the rbf kernel. For each of these kernels, there are hyperparameters that need to be set which you will see in the code below.

h1=svm.LinearSVC(C=1)
h2=svm.SVC(kernel='rbf',degree=3,gamma=0.001,C=1.0)

The details about the hyperparameters are beyond the scope of this post. Below are the results for the first model.

1.png

The overall accuracy is 73%. The crosstab() function provides a breakdown of the results and the classification_report() function provides other metrics related to classification. In this situation, 0 means not single or married while 1 means single. Below are the results for model 2

1.png

You can see the results are similar with the first model having a slight edge. The second model really struggls with predicting people who are actually single. You can see thtat the recall in particular is really poor.

Conclusion

This post provided how to ob using SVM in python. How this algorithm works can be somewhat confusing. However, its use can be powerful if use appropriately.

Linear Discriminant Analysis in Python

Linear discriminant analysis is a classification algorithm commonly used in data science. In this post, we will learn how to use LDA with Python. The steps we will for this are as follows.

  1. Data preparation
  2. Model training and evaluation

Data Preparation

We will be using the bioChemists dataset which comes from the pydataset module. We want to predict whether someone is married or single based on academic output and prestige. Below is some initial code.

import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

Now we will load our data and take a quick look at it using the .head() function.

1.png

There are two variables that contain text so we need to convert these two dummy variables for our analysis the code is below with the output.

1

Here is what we did.

  1. We created the dummy variable by using the .get_dummies() function.
  2. We saved the output in an object called dummy
  3. We then combine the dummy and df dataset with the .concat() function
  4. We repeat this process for the second variable

The output shows that we have our original variables and the dummy variables. However, we do not need all of this information. Therefore, we will create a dataset that has the X variables we will use and a separate dataset that will have our y values. Below is the code.

X=df[['Men','kid5','phd','ment','art']]
y=df['Married']

The X dataset has our five independent variables and the y dataset has our dependent variable which is married or not. We can not split our data into a train and test set.  The code is below.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)

The data was split 70% for training and 30% for testing. We made a train and test set for the independent and dependent variables which meant we made 4 sets altogether. We can now proceed to model development and testing

Model Training and Testing

Below is the code to run our LDA model. We will use the .fit() function for this.

clf=LDA()
clf.fit(X_train,y_train)

We will now use this model to predict using the .predict function

y_pred=clf.predict(X_test)

Now for the results, we will use the classification_report function to get all of the metrics associated with a confusion matrix.

1.png

The interpretation of this information is described in another place. For our purposes, we have an accuracy of 71% for our prediction.  Below is a visual of our model using the ROC curve.

1

Here is what we did

  1. We had to calculate the roc_curve for the model this is explained in detail here
  2. Next, we plotted our own curve and compared to a baseline curve which is the dotted lines.

A ROC curve of 0.67 is considered fair by many. Our classification model is not that great but there are worst models out there.

Conclusion

This post went through an example of developing and evaluating a linear discriminant model. To do this you need to prepare the data, train the model, and evaluate.

Factor Analysis in Python

Factor analysis is a dimensionality reduction technique commonly used in statistics. FA is similar to principal component analysis. The difference are highly technical but include the fact the FA does not have an orthogonal decomposition and FA assumes that there are latent variables and that are influencing the observed variables in the model. For FA the goal is the explanation of the covariance among the observed variables present.

Our purpose here will be to use the BioChemist dataset from the pydataset module and perform a FA that creates two components. This dataset has data on the people completing PhDs and their mentors. We will also create a visual of our two-factor solution. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt

We now need to prepare the dataset. The code is below

df = data('bioChemists')
df=df.iloc[1:250]
X=df[['art','kid5','phd','ment']]

In the code above, we did the following

  1. The first line creates our dataframe called “df” and is made up of the dataset bioChemist
  2. The second line reduces the df to 250 rows. This is done for the visual that we will make. To take the whole dataset and graph it would make a giant blob of color that would be hard to interpret.
  3. The last line pulls the variables we want to use for our analysis. The meaning of these variables can be found by typing data(“bioChemists”,show_doc=True)

In the code below we need to set the number of factors we want and then run the model.

fact_2c=FactorAnalysis(n_components=2)
X_factor=fact_2c.fit_transform(X)

The first line tells Python how many factors we want. The second line takes this information along with or revised dataset X to create the actual factors that we want. We can now make our visualization

To make the visualization requires several steps. We want to identify how well the two components separate students who are married from students who are not married. First, we need to make a dictionary that can be used to convert the single or married status to a number. Below is the code.

thisdict = {
"Single": "1",
"Married": "2",}

Now we are ready to make our plot. The code is below. Pay close attention to the ‘c’ argument as it uses our dictionary.

plt.scatter(X_factor[:,0],X_factor[:,1],c=df.mar.map(thisdict),alpha=.8,edgecolors='none')

1

You can perhaps tell why we created the dictionary now. By mapping the dictionary to the mar variable it automatically changed every single and married entry in the df dataset to a 1 or 2. The c argument needs a number in order to set a color and this is what the dictionary was able to supply it with.

You can see that two factors do not do a good job of separating the people by their marital status. Additional factors may be useful but after two factors it becomes impossible to visualize them.

Conclusion

This post provided an example of factor analysis in Python. Here the focus was primarily on visualization but there are so many other ways in which factor analysis can be deployed.

Analyzing Twitter Data in Python

In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barrack Obama.

  • Conduct a sentiment analysis
  • Create a word cloud

This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.

  1. Obtain all of the necessary information from your twitter apps account
  2. Download the tweets & clean
  3. Perform the analysis

Before we begin, here is a list of modules we will need to load to complete our analysis

import wordcloud
import matplotlib.pyplot as plt
import twython
import re
import numpy

Obtain all Needed Information

From your twitter app account, you need the following information

  • App key
  • App key secret
  • Access token
  • Access token secret

All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.

TWITTER_APP_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXX
TWITTER_APP_KEY_SECRET=XXXXXXXXXXXXXXXXXXX
TWITTER_ACCESS_TOKEN=XXXXXXXXXXXXXXXXXXXXX
TWITTER_ACCESS_TOKEN_SECRET=XXXXXXXXXXXXXX
t=twython.Twython(app_key=TWITTER_APP_KEY,app_secret=TWITTER_APP_KEY_SECRET,oauth_token=TWITTER_ACCESS_TOKEN,oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)

In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.

Next, we need to create a function that will pull the tweets from Twitter. Below is the code,

def get_tweets(twython_object,query,n):
   count=0
   result_generator=twython_object.cursor(twython_object.search,q=query)

   result_set=[]
   for r in result_generator:
      result_set.append(r['text'])
      count+=1
      if count ==n: break

   return result_set

You will have to figure out the code yourself. We can now download the tweets.

Downloading Tweets & Clean

Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary one for Trump and the other for Obama because we are downloading tweets about these two people.

There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.

tweets={}
tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#trump',1500)]
tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#obama',1500)]

The get_tweets function is also used in the code above along with our twitter app information. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis

Analysis

To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.

positive_words=open('XXXXXXXXXXXX').read().split('\n')
negative_words=open('XXXXXXXXXXXX').read().split('\n')

We now will make a function to calculate the sentiment

def sentiment_score(text,pos_list,neg_list):
   positive_score=0
   negative_score=0

   for w in text.split(' '):
      if w in pos_list:positive_score+=1
      if w in neg_list:negative_score+=1
   return positive_score-negative_score

Now we create an empty dictionary and run the analysis for Trump and then for Obama

tweets_sentiment={}
tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['trump']]
tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['obama']]

Now we can make visuals of our results with the code below

trump=plt.hist(tweets_sentiment['trump'],5)
obama=plt.hist(tweets_sentiment['obama'],5)

Obama is on the left and trump is on the right. It seems that trump tweets are consistently more positive. Below are the means for both.

numpy.mean(tweets_sentiment['trump'])
Out[133]: 0.36363636363636365

numpy.mean(tweets_sentiment['obama'])
Out[134]: 0.2222222222222222

Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud

1.png

Here is the code for the Obama word cloud

1

A lot of speculating can be made from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter. People are always posting tweets which changes the results.

Conclusion

This post provided an example of how to download and analyze tweets from twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis as it is easy to become confused and not accomplish anything.

Word Clouds in Python

Word clouds are a type of data visualization in which various words from a dataset are actuated. Words that are larger in the word cloud are more common and words in the middle are also more common. In addition, some word clouds even use various colors to indicated importance.

This post will provide an example of how to make a word cloud using python. We will be using the “Women’s E-Commerce Clothing Reviews” available on the kaggle website.  We are going to only use the text reviews to make our word cloud even though other data is in the dataset. To prepare our dataset for making the word cloud we need to the following.

  1. Lowercase all words
  2. Remove punctuation
  3. Remove stopwords

After completing these steps we can make the word cloud. First, we need to load all of the necessary modules.

import pandas as pd
import re
from nltk.corpus import stopwords
import wordcloud
import matplotlib.pyplot as plt

We now need to load our dataset we will store it as the object ‘df’

df=pd.read_csv('YOUR LOCATION HERE')
df.head()

1.png

It’s hard to read but we will be working only with the “Review Text” column as this has the text data we need. Here is what our column looks like up close.

df['Review Text'].head()

Out[244]: 
0 Absolutely wonderful - silky and sexy and comf...
1 Love this dress! it's sooo pretty. i happene...
2 I had such high hopes for this dress and reall...
3 I love, love, love this jumpsuit. it's fun, fl...
4 This shirt is very flattering to all due to th...
Name: Review Text, dtype: object

We will now make all words lower case and remove punctuation with the code below.

df["Review Text"]=df['Review Text'].str.lower()
df["Review Text"]=df['Review Text'].str.replace(r'[-./?!,":;()\']',' ')

The first line in the code above lower cases all words. The second line removes the punctuation. The second line is trickier as you have to explain to python exactly what type of punctuation you want to remove and what to replace it with. Everything we want to remove is in the first set of single quotes. We want to replace the punctuation with a space which is the second set of single quotation marks with a space in the middle. THe r at the beginning of the parentheses stands for remove.

Here is what our data looks like after making these two changes

df['Review Text'].head()

Out[249]: 
0 absolutely wonderful silky and sexy and comf...
1 love this dress it s sooo pretty i happene...
2 i had such high hopes for this dress and reall...
3 i love love love this jumpsuit it s fun fl...
4 this shirt is very flattering to all due to th...
Name: Review Text, dtype: object

All the words are in lowercase. In addition, you can see that the dash in line 0 is gone as all the punctuation in the other lines. We now need to remove stopwords. Stopwords are the functional words that glue the meaning together without. Examples include and, for, but, etc. We are trying to make a cloud of substantial words and not stopwords so these words need to be removed.

If you have never done this on your computer before you may need to import the nltk module and run nltk.download_gui(). Once this is done you need to download the stopwords package.

Below is the code for removing the stopwords. First, we need to load the stopwords this is done below.

stopwords_list=stopwords.words('english')
stopwords_list=stopwords_list+['to']

We create an object called stopwords_list which has all the English stopwords. The second line just adds the word ‘to’ to the list. Nex,t we need to make an object that will look for the pattern of words we want to remove. Below is the code

pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))

This code is the basically telling Python what to look for. Using regularized expressions Python will look for any word whos pattern on the left is the same as the pattern on the right after the .join function. Inside the .join function is our stopwords_list. We will now take this object called ‘pat’ and use it on our ‘Review Text’ variable.

df['Split Text'] = df['Review Text'].str.replace(pat, '')
df['Split Text'].head()

Out[258]: 
0      absolutely wonderful   silky  sexy  comfortable
1    love  dress     sooo pretty    happened  find ...
2       high hopes   dress  really wanted   work   ...
3     love  love  love  jumpsuit    fun  flirty   f...
4     shirt   flattering   due   adjustable front t...
Name: Split Text, dtype: object

You can see that we have created a new column called ‘Split Text’ and the results is a text that has lost many stop words.

We are now ready to make our word cloud below is the code and the output.

wordcloud1=wordcloud.WordCloud(width=1000,height=500).generate(' '.join(map(str, df['Split Text'])))
plt.figure(figsize=(15,8))
plt.imshow(wordcloud1)
plt.axis('off')

1.png

This code is complex. We used the word cloud function and we had to use both generate map, and join as inner functions. All of these function were needed to take the words from the dataframe and make them simple text for the wordcloud function.

The rest of the code is common to mathplotlib so does not require much explanation. Ass you look at the word cloud, you can see that the most common words include top, look, dress, shirt, fabric. etc. This is reasonable given that these are women’s reviews of clothing.

Conclusion

This post provided an example of text analysis using word clouds in Python. The insights here are primarily descriptive in nature. This means that if the desire is prediction or classification other additional tools would need to build upon what is shown here.

KMeans Clustering in Python

Kmeans clustering is a technique in which the examples in a dataset our divided through segmentation. The segmentation has to do with complex statistical analysis in which examples within a group are more similar the examples outside of a group.

The application of this is that it provides the analysis with various groups that have similar characteristics which can be used to cater services to in various industries such as business or education. In this post, we will look at how to do this using Python. We will use the steps below to complete this process.

  1. Data preparation
  2. Determine the number of clusters
  3. Conduct analysis

Data Preparation

Our data for this examples comes from the sat.act dataset available in the pydataset module. Below is some initial code.

import pandas as pd
from pydataset import data
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt

We will now load our dataset and drop any NAs they may be present

1

You can see there are six variables that will be used for the clustering. Next, we will turn to determining the number of clusters.

Determine the Number of Clusters

Before you can actually do a kmeans analysis you must specify the number of clusters. This can be tricky as there is no single way to determine this. For our purposes, we will use the elbow method.

The elbow method measures the within sum of error in each cluster. As the number of clusters increases this error decreases. However,  a certain point the return on increasing clustering becomes minimal and this is known as the elbow. Below is the code to calculate this.

distortions = []
K = range(1,10)
for k in K:
kmeanModel = KMeans(n_clusters=k).fit(df)
distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])

Here is what we did

  1. We made an empty list called ‘distortions’ we will save our results there.
  2. In the second line, we told R the range of clusters we want to consider. Simply, we want to consider anywhere from 1 to 10 clusters.
  3. Line 3 and 4, we use a for loop to calculate the number of clusters when fitting it to the df object.
  4. In Line 5, we save the sum of the cluster distance in the distortions list.

Below is a visual to determine the number of clusters

plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')

1.png

The graph indicates that 3 clusters are sufficient for this dataset. We can now perform the actual kmeans clustering.

KMeans Analysis

The code for the kmeans analysis is as follows

km=KMeans(3,init='k-means++',random_state=3425)
km.fit(df.values)
  1. We use the KMeans function and tell Python the number of clusters, the type of, initialization, and we set the seed with the random_state argument. All this is saved in the object called km
  2. The km object has the .fit function used on it with df.values

Next, we will predict with the predict function and look at the first few lines of the modified df with the .head() function.

1

You can see we created a new variable called predict. This variable contains the kmeans algorithm prediction of which group each example belongs too. We then printed the first five values as an illustration. Below are the descriptive statistics for the three clusters that were produced for the variable in the dataset.

1.png

It is clear that the clusters are mainly divided based on the performance on the various test used. In the last piece of code, gender is used. 1 represents male and 2 represents female.

We will now make a visual of the clusters using two dimensions. First, w e need to make a map of the clusters that is saved as a dictionary. Then we will create a new variable in which we take the numerical value of each cluster and convert it to a string in our cluster map dictionary.

clust_map={0:'Weak',1:'Average',2:'Strong'}
df['perf']=df.predict.map(clust_map)

Next, we make a different dictionary to color the points in our graph.

d_color={'Weak':'y','Average':'r','Strong':'g'}

1.png

Here is what is happening in the code above.

  1. We set the ax object to a value.
  2. A for loop is used to go through every example in clust_map.values so that they are colored according the color
  3. Lastly, a plot is called which lines up the perf and clust values for color.

The groups are clearly separated when looking at them in two dimensions.

Conclusion

Kmeans is a form of unsupervised learning in which there is no dependent variable which you can use to assess the accuracy of the classification or the reduction of error in regression. As such, it can be difficult to know how well the algorithm did with the data. Despite this, kmeans is commonly used in situations in which people are trying to understand the data rather than predict.

Random Forest in Python

This post will provide a demonstration of the use of the random forest algorithm in python. Random forest is similar to decision trees except that instead of one tree a multitude of trees are grown to make predictions. The various trees all vote in terms of how to classify an example and majority vote is normally the winner. Through making many trees the accuracy of the model normally improves.

The steps are as follows for the use of random forest

  1. Data preparation
  2. Model development & evaluation
  3. Model comparison
  4. Determine variable importance

Data Preparation

We will use the cancer dataset from the pydataset module. We want to predict if someone is censored or dead in the status variable. The other variables will be used as predictors. Below is some code that contains all of the modules we will use.

import pandas as pd
import sklearn.ensemble as sk
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt

We will now load our data cancer in an object called ‘df’. Then we will remove all NA’s use the .dropna() function. Below is the code.

df = data('cancer')
df=df.dropna()

We now need to make two datasets. One dataset, called X, will contain all of the predictor variables. Another dataset, called y, will contain the outcome variable. In the y dataset, we need to change the numerical values to a string. This will make interpretation easier as we will not need to lookup what the numbers represents. Below is the code.

X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']

Instead of 1 we now have the string “censored” and instead of 2 we now have the string “dead” in the status variable. The final step is to set up our train and test sets. We will do a 70/30 split. We will have a train set for the X and y dataset as well as a test set for the X and y datasets. This means we will have four datasets in all. Below is the code.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to move to model development

Model Development and Evaluation

We now need to create our classifier and fit the data to it. This is done with the following code.

clf=sk.RandomForestClassifier(n_estimators=100)
clf=clf.fit(x_train,y_train)

The clf object has our random forest algorithm,. The number of estimators is set to 100. This is the number of trees that will be generated. In the second line of code, we use the .fit function and use the training datasets x and y.

We now will test our model and evaluate it. To do this we will use the .predict() with the test dataset Then we will make a confusion matrix followed by common metrics in classification. Below is the code and the output.

1.png

You can see that our model is good at predicting who is dead but struggles with predicting who is censored. The metrics are reasonable for dead but terrible for censored.

We will now make a second model for the purpose of comparison

Model Comparison

We will now make a different model for the purpose of comparison. In this model, we will use out of bag samples to determine accuracy, set the minimum split size at 5 examples, and that each leaf has at least 2 examples. Below is the code and the output.

1.png

There was some improvement in classify people who were censored as well as for those who were dead.

Variable Importance

We will now look at which variables were most important in classifying our examples. Below is the code

model_ranks=pd.Series(clf.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
ax=model_ranks.plot(kind='barh')

We create an object called model_ranks and we indicate the following.

  1. Classify the features by importance
  2. Set index to the columns in the training dataset of x
  3. Sort the features from most to least importance
  4. Make a barplot

Below is the output

1.png

You can see that time is the strongest classifier. How long someone has cancer is the strongest predictor of whether they are censored or dead. Next is the number of calories per meal followed by weight and lost and age.

Conclusion

Here we learned how to use random forest in Python. This is another tool commonly used in the context of machine learning.

Decision Trees in Python

Decision trees are used in machine learning. They are easy to understand and are able to deal with data that is less than ideal. In addition, because of the pictorial nature of the results decision trees are easy for people to interpret. We are going to use the ‘cancer’ dataset to predict mortality based on several independent variables.

We will follow the steps below for our decision tree analysis

  1. Data preparation
  2. Model development
  3. Model evaluation

Data Preparation

We need to load the following modules in order to complete this analysis.

import pandas as pd
import statsmodels.api as sm
import sklearn
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
import matplotlib.pyplot as plt
from sklearn.externals.six import StringIO 
from IPython.display import Image 
from sklearn.tree import export_graphviz
import pydotplus

The ‘cancer’ dataset comes from the ‘pydataset’ module. You can learn more about the dataset by typing the following

data('cancer', show_doc=True)

This provides all you need to know about our dataset in terms of what each variable is measuring. We need to load our data as ‘df’. In addition, we need to remove rows with missing values and this is done below.

df = data('cancer')
len(df)
Out[58]: 228
df=df.dropna()
len(df)
Out[59]: 167

The initial number of rows in the data set was 228. After removing missing data it dropped to 167. We now need to setup up our lists with the independent variables and a second list with the dependent variable. While doing this, we need to recode our dependent variable “status” so that the numerical values are replaced with a string. This will help us to interpret our decision tree later. Below is the code

X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']

Next,  we need to make our train and test sets using the train_test_split function.  We want a 70/30 split. The code is below.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to develop our model.

Model Development

The code for the model is below

clf=tree.DecisionTreeClassifier(min_samples_split=10)
clf=clf.fit(x_train,y_train)

We first make an object called “clf” which calls the DecisionTreeClassifier. Inside the parentheses, we tell Python that we do not want any split in the tree to contain less than 10 examples. The second “clf” object uses the  .fit function and calls the training datasets.

We can also make a visual of our decision tree.

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, 
filled=True, rounded=True,feature_names=list(x_train.columns.values),
special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue()) 
Image(graph.create_png())

1

If we interpret the nodes furthest to the left we get the following

  • If a person has had cancer less than 171 days and
  • If the person is less than 74.5 years old then
  • The person is dead

If you look closely every node is classified as ‘dead’ this may indicate a problem with our model. The evaluation metrics are below.

Model Evaluation

We will use the .crosstab function and the metrics classification functions

1.png

You can see that the metrics are not that great in general. This may be why everything was classified as ‘dead’. Another reason is that few people were classified as ‘censored’ in the dataset.

Conclusion

Decisions trees are another machine learning tool. Python allows you to develop trees rather quickly that can provide insights into how to take action.

Multiple Regression in Python

In this post, we will go through the process of setting up and a regression model with a training and testing set using Python. We will use the insurance dataset from kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

  1. Data preparation
  2. Model training
  3. model testing

Data Preparation

Below is a list of the modules we will need in order to complete the analysis

import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model,model_selection, feature_selection,preprocessing
import statsmodels.formula.api as sm
from statsmodels.tools.eval_measures import mse
from statsmodels.tools.tools import add_constant
from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the  .read_csv function from pandas to load the data and .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')

1.png

We need to create some dummy variables for sex, smoker, and region. We will address that in a moment, right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.

1.png

The descriptives are left for your own interpretation. As for the correlations, they are generally weak which is an indication that regression may be appropriate.

As mentioned earlier, we need to make dummy variables sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

  1. Use the pd.get_dummies function from pandas to create the dummy
  2. Save the dummy variable in an object called ‘dummy’
  3. Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
  4. Repeat this three times

Below is the code for doing this

dummy=pd.get_dummies(insure['sex'])
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['smoker'])
insure=pd.concat([insure,dummy],axis=1)
dummy=pd.get_dummies(insure['region'])
insure=pd.concat([insure,dummy],axis=1)
insure.head()

1.png

The .get_dummies function requires the name of the dataframe and in the brackets the name of the variable to convert. The .concat function requires the name of the two datasets to combine as well the axis on which to perform it.

We now need to remove the original text variables from the dataset. In addition, we need to remove the y variable “charges” because this is the dependent variable.

y = insure.charges
insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

Model Training

Are train and test sets are model with the model_selection.trainin_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can not run our regression analysis. This requires the use of the .OLS function from statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above inside the parentheses, we put the dependent variable(y_train) and the independent variables (X_train). However, we had to use the function add_constant to get the intercept for the output. All of this information is then used inside the .fit() function to fit a model.

To see the output you need to use the .summary() function as shown below.

answer.summary()

1.png

The assumption is that you know regression but our reading this post to learn python. Therefore, we will not go into great detail about the results. The r-square is strong, however, the region and gender are not statistically significant.

We will now move to model testing

Model Testing

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values with the model we made with the new data. This is shown in the code below

ypred=answer.predict(add_constant(X_test))

We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.

print(mse(ypred,y_test))
33678660.23480476

For our final trick, we will make a scatterplot with the predicted and actual values of the test set. In addition, we will calculate the correlation of the predict values and test set values. This is an alternative metric for assessing a model.

1.png

You can see the first two lines are for making the plot. Lines 3-4 are for making the correlation matrix and involves the .concat() function. The correlation is high at 0.86 which indicates the model is good at accurately predicting the values. THis is confirmed with the scatterplot which is almost a straight line.

Conclusion

IN this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and tested a model with an evaluation of it.

Principal Component Analysis in Python

Principal component analysis is a form of dimension reduction commonly used in statistics. By dimension reduction, it is meant to reduce the number of variables without losing too much overall information. This has the practical application of speeding up computational times if you want to run other forms of analysis such as regression but with fewer variables.

Another application of principal component analysis is for data visualization. Sometimes, you may want to reduce many variables to two in order to see subgroups in the data.

Keep in mind that in either situation PCA works better when there are high correlations among the variables. The explanation is complex but has to do with the rotation of the data which helps to separate the overlapping variance.

Prepare the Data

We will be using the pneumon dataset from the pydataset module. We want to try and explain the variance with fewer variables than in the dataset. Below is some initial code.

import pandas as pd
from sklearn.decomposition import PCA
from pydataset import data
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

Next, we will set up our dataframe. We will only take the first 200 examples from the dataset. If we take all (over 3000 examples), the visualization will be a giant blob of dotes that cannot be interpreted. We will also drop in missing values. Below is the code

df = data('pneumon')
df=df.dropna()
df=df.iloc[0:199,]

When doing a PCA, it is important to scale the data because PCA is sensitive to this. The result of the scaling process is an array. This is a problem because the PCA function needs a dataframe. This means we have to convert the array to a dataframe. When this happens you also have to rename the columns in the new dataframe. All this is done in the code below.

scaler = StandardScaler() #instance
df_scaled = scaler.fit_transform(df) #scaled the data
df_scaled= pd.DataFrame(df_scaled) #made the dataframe
df_scaled=df_scaled.rename(index=str, columns={0: "chldage", 1: "hospital",2:"mthage",3:"urban",4:"alcohol",5:"smoke",6:"region",7:"poverty",8:"bweight",9:"race",10:"education",11:"nsibs",12:"wmonth",13:"sfmonth",14:"agepn"}) # renamed columns

Analysis

We are now ready to do our analysis. We first use the PCA function to indicate how many components we want. For our first example, we will have two components. Next, you use the .fit_transform function to fit the model. Below is the code.

pca_2c=PCA(n_components=2)
X_pca_2c=pca_2c.fit_transform(df_scaled)

Now we can see the variance explained by component and the sum

pca_2c.explained_variance_ratio_
Out[199]: array([0.18201588, 0.12022734])

pca_2c.explained_variance_ratio_.sum()
Out[200]: 0.30224321247148167

In the first line of code, we can see that the first component explained 18% of the variance and the second explained 12%. This leads to a total of about 30%. Below is a visual of our 2 component model the color represents the race of the respondent. The three different colors represent three different races.

1.png

Our two components do a reasonable separating the data. Below is the code for making four components. We can not graph four components since our graph can only handle two but you will see that as we increase the components we also increase the variance explained.

pca_4c=PCA(n_components=4)

X_pca_4c=pca_4c.fit_transform(df_scaled)

pca_4c.explained_variance_ratio_
Out[209]: array([0.18201588, 0.12022734, 0.09290502, 0.08945079])

pca_4c.explained_variance_ratio_.sum()
Out[210]: 0.4845990164486457

With four components we now have almost 50% of the variance explained.

Conclusion

PCA is for summarising and reducing the number of variables used in an analysis or for the purposes of data visualization. Once this process is complete you can use the results to do further analysis if you desire.

Data Exploration with Python

In this post, we will explore a dataset using Python. The dataset we will use is the Ghouls, Goblins, and Ghost (GGG) dataset available at the kaggle website. The analysis will not be anything complex we will simply do the following.

  • Data preparation
  • Data  visualization
  • Descriptive statistics
  • Regression analysis

Data Preparation

The GGG dataset is fictitious data on the characteristics of spirits. Below are the modules we will use for our analysis.

import pandas as pd
import statsmodels.regression.linear_model as sm
import numpy as np

Once you download the dataset to your computer you need to load it into Python using the pd.read.csv function. Below is the code.

df=pd.read_csv('FILE LOCATION HERE')

We store the data as “df” in the example above. Next, we will take a peek at the first few rows of data to see what we are working with.

1.png

Using the print function and accessing the first five rows reveals. It appears the first five columns are continuous data and the last two columns are categorical. The ‘id’ variable is useless for our purposes so we will remove it with the code below.

df=df.drop(['id'],axis=1)

The code above uses the drop function to remove the variable ‘id’. This is all saved into the object ‘df’. In other words, we wrote over are original ‘df’.

Data Visualization

We will start with our categorical variables for the data visualization. Below is a table and a graph of the ‘color’ and ‘type’ variables.

First, we make an object called ‘spirits’ using the groupby function to organize the table by the ‘type’ variable.

1

Below we make a graph of the data above using the .plot function. A professional wouldn’t make this plot but we are just practicing how to code.

1.png

We now know how many ghosts, goblins and, ghouls there are in the dataset. We will now do a breakdown of ‘type’ by ‘color’ using the .crosstab function from pandas.

1.png

We will now make bar graphs of both of the categorical variables using the .plot function.

1

We will now turn our attention to the continuous variables. We will simply make histograms and calculate the correlation between them. First the histograms

The code is simply subset the variable you want in the brackets and then type .plot.hist() to access the histogram function. It appears that all of our data is normally distributed. Now for the correlation

1

Using the .corr() function has shown that there are now high correlations among the continuous variables. We will now do an analysis in which we combine the continuous and categorical variables through making boxplots

The code is redundant. We use the .boxplot() function and tell python the column which is continuous and the ‘by’ which is the categorical variable.

Descriptive Stats

We are simply going to calcualte the mean and standard deviation of the continuous variables.

df["bone_length"].mean()
Out[65]: 0.43415996604821117

np.std(df["bone_length"])
Out[66]: 0.13265391313941383

df["hair_length"].mean()
Out[67]: 0.5291143100058727

np.std(df["hair_length"])
Out[68]: 0.16967268504935665

df["has_soul"].mean()
Out[69]: 0.47139203219259107

np.std(df["has_soul"])
Out[70]: 0.17589180837106724

The mean is calcualted with the .mean(). Standard deviation is calculated using the .std() function from the numpy package.

Multiple Regression

Our final trick is we want to explain the variable “has_soul” using the other continuous variables that are available. Below is the code

X = df[["bone_length", "rotting_flesh","hair_length"]]

y = df["has_soul"]

model = sm.OLS(y, X).fit()

In the code above we crate to new list. X contains are independent variables and y contains the dependent variable. Then we create an object called model and use the  OLS() function. We place the y and X inside the parenthesis and we then use the .fit() function as well. Below is the summary of the analysis

1.png

There is obviously a lot of information in the output. The r-square is 0.91 which is surprisingly high given that there were not high correlations in the matrix. The coefficiencies for the three independent variables are listed and all are significant. The AIC and BIC are for model comparison and do not mean much in isolation. The JB stat indicates that are distribution is not normal. Durbin watson test indicates negative autocorrelation which is important in time-series analysis.

Conclusion

Data exploration can be an insightful experience. Using Python, we found mant different patterns and ways to describe the data.

Logistic Regression in Python

This post will provide an example of a logistic regression analysis in Python. Logistic regression is commonly used when the dependent variable is categorical. Our goal will be to predict the gender of an example based on the other variables in the model. Below are the steps we will take to achieve this.

  1. Data preparation
  2. Model development
  3. Model testing
  4. Model evaluation

Data Preparation

The dataset we will use is the ‘Survey of Labour and Income Dynamics’ (SLID) dataset available in the pydataset module in Python. This dataset contains basic data on labor and income along with some demographic information. The initial code that we need is below.

import pandas as pd
import statsmodels.api as sm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from pydataset import data

The code above loads all the modules and other tools we will need in this example. We now can load our data. In addition to loading the data, we will also look at the count and the characteristics of the variables. Below is the code.

1

At the top of this code, we create the ‘df’ object which contains our data from the “SLID”. Next, we used the .count() function to determine if there was any missing data and to see what variables were available. It appears that we have five variables and a lot of missing data as each variable has different amounts of data. Lastly, we used the .head() function to see what each variable contained. It appears that wages, education, and age are continuous variables well sex and language are categorical. The categorical variables will need to be converted to dummy variables as well.

The next thing we need to do is drop all the rows that are missing data since it is hard to create a model when data is missing. Below is the code and the output for this process.

1

In the code above, we used the .dropna() function to remove missing data. Then we used the .count() function to see how many rows remained. You can see that all the variables have the same number of rows which is important for model analysis. We will now make our dummy variables for sex and language in the code below.

1.png

Here is what we did,

  1. We used the .get_dummies function from pandas first on the sex variable. All this was stored in a new object called “dummy”
  2. We then combined the dummy and df datasets using the .concat() function. The axis =1 argument is for combing by column.
  3. We repeat steps 1 and 2 for the language variable
  4. Lastly, we used the .head() function to see the results

With this, we are ready to move to model development.

Model Development

The first thing we need to do is put all of the independent variables in one dataframe and the dependent variable in its own dataframe. Below is the code for this

X=df[['wages','education','age',"French","Other"]]
y=df['Male']

Notice that we did not use every variable that was available. For the language variables, we only used “French” and “Other”. This is because when you make dummy variables you only need k-1 dummies created. Since the language variable had three categories we only need two dummy variables. Therefore, we excluded “English” because when “French” and “Other” are coded 0 it means that “English” is the characteristic of the example.

In addition, we only took “male” as our dependent variable because if “male” is set to 0 it means that example is female. We now need to create our train and test dataset. The code is below.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We created four datasets

  • train dataset with the independent variables
  • train dataset with the dependent variable
  • test dataset with the independent variables
  • test dataset with the independent variable

The split is 70/30 with 70% being used for the training and 30% being used for testing. This is the purpose of the “test_size” argument. we used the train_test_split function to do this. We can now run our model and get the results. Below is the code.

1

Here is what we did

  1. We used the .Logit() function from statsmodel to create the logistic model. Notice we used only the training data.
  2. We then use the .fit() function to get the results and stored this in the result object.
  3. Lastly, we printed the results in the ‘result’ object using the .summary()

There are some problems with the results. The Pseudo R-square is infinity which is usually. Also, you may have some error output about hessian inversion in your output. For these reasons, we cannot trust the results but will continue for the sake of learning.

The coefficients are clear.  Only wage, education, and age are significant. In order to determine the probability you have to take the coefficient from the model and use the .exp() function from numpy. Below are the results.

np.exp(.08)
Out[107]: 1.0832870676749586

np.exp(-0.06)
Out[108]: 0.9417645335842487

np.exp(-.01)
Out[109]: 0.9900498337491681

For the first value, for every unit wages increaser the probability that they are male increase 8%. For every 1 unit increase in education there probability of the person being male decrease 6%. Lastly, for every one unit increase in age the probability of the person being male decrease by 1%. Notice that we subtract 1 from the outputs to find the actual probability.

We will now move to model testing

Model Testing

To do this we first test our model with the code below

y_pred=result.predict(X_test)

We made the result object earlier. Now we just use the .predict() function with the X_test data. Next, we need to flag examples that the model believes has a 60% chance or greater of being male. The code is below

y_pred_flag=y_pred>.6

This creates a boolean object with True and False as the output. Now we will make our confusion matrix as well as other metrics for classification.

1

The results speak for themselves. There are a lot of false positives if you look at the confusion matrix. In addition precision, recall, and f1 are all low. Hopefully, the coding should be clear the main point is to be sure to use the test set dependent dataset (y_test) with the flag data you made in the previous step.

We will not make the ROC curve. For a strong model, it should have a strong elbow shape while with a weak model it will be a diagonal straight line.

1.png

The first plot is of our data. The second plot is what a really bad model would look like. As you can see there is littte difference between the two. Again this is because of all the false positives we have in the model. The actual coding should be clear. fpr is the false positive rate, tpr is the true positive rate. The function is .roc_curve. Inside goes the predict vs actual test data.

Conclusion

This post provided a demonstration of the use of logistic regression in Python. It is necessary to follow the steps above but keep in mind that this was a demonstration and the results are dubious.

Working with a Dataframe in Python

In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the task we will complete include the following…

  • Import data
  • Examine data
  • Work with strings
  • Calculating descriptive statistics

Import Data 

First, you need data, therefore, we will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.

import pandas as pd
df=pd.read_csv('FILE LOCATION HERE')

In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to type in quotes inside the parentheses. Having completed this we can now examine the data.

Data Examination

Now we want to get an idea of the size of our dataset, any problems with missing. To determine the size we use the .shape function as shown below.

df.shape
Out[33]: (891, 12)

Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many we can use is.null() in combination with the values_count. variables.

df['Cabin'].isnull().value_counts()
Out[36]: 
True     687
False    204
Name: Cabin, dtype: int64

The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see we have over 200 missing examples. For categorical varable, you can also see how many examples are part of each category as shown below.

df['Embarked'].value_counts()
Out[39]: 
S    644
C    168
Q     77
Name: Embarked, dtype: int64

This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. THen we will check the size of the dataframe again with the “shape” function.

df=df.dropna(how='any')
df.shape
Out[40]: (183, 12)

You can see our dataframe is much smaller going 891 examples to 183. We can now move to other operations such as dealing with strings.

Working with Strings

What you do with strings really depends or your goals. We are going to look at extraction, subsetting, determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.

df['Name'][0:5].str.extract('(\w+)')
Out[44]: 
1 Cumings
3 Futrelle
6 McCarthy
10 Sandstrom
11 Bonnell
Name: Name, dtype: object

As you can see we got the last names of the first five examples. We did this by using the following format…

dataframe name[‘Variable Name’].function.function(‘whole word’))

.str is a function for dealing with strings in dataframes. The .extract() function does what its name implies.

If you want, you can even determine how many letters each name is. We will do this with the .str and .len() function on the first five names in the dataframe.

df['Name'][0:5].str.len()
Out[64]: 
1 51
3 44
6 23
10 31
11 24
Name: Name, dtype: int64

Hopefully, the code is becoming easier to read and understand.

Aggregation

We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes so we will run all of them at once. Below we are calculating the mean, max, minimum, and standard deviation  for the price of a fare on the Titanic

df['Fare'].mean()
Out[77]: 78.68246885245901

df['Fare'].max()
Out[78]: 512.32920000000001

df['Fare'].min()
Out[79]: 0.0

df['Fare'].std()
Out[80]: 76.34784270040574

Conclusion

This post provided you with some ways in which you can maneuver around a dataframe in Python.