Decision trees are a common tool used in data science and machine learning. In the video below, we will learn how to develop a simple decision tree using Python.

# Category Archives: python

# Principal Component Analysis with Python VIDEO

Principal component analysis is a tool for reducing the number of variables in a dataset without losing too much information. This is a great way to summarize information or to simplify things for a more complex analysis. The video provides a simple example of how to do this.

# Data Visualization with Altair VIDEO

Python has a great library called Altair that makes it easy to create a wide range of data visualizations. The primary strengths of this library are its ease of use and its support for interactive plots. The video below provides an introduction to this tool.

# Visualizations with Altair

We are going to take a look at Altair, which is a data visualization library for Python. What is unique about Altair compared to other packages covered on this blog is that it allows for interactions.

The interactions can take place inside Jupyter, or they can be exported and loaded onto websites, as we shall see. In the past, interactive visualizations for websites were often built with a JavaScript library such as D3.js. D3.js works but is cumbersome for the average non-coder. Altair addresses this problem, as Python is often seen as easier to work with than JavaScript.

**Installing Altair**

If Altair is not already installed on your computer, you can install it with one of the following commands.

```
pip install altair vega_datasets
# OR
conda install -c conda-forge altair vega_datasets
```

Which one of the lines above you use will depend on the type of Python installation you have.

**Goal**

We are going to make some simple visualizations with Altair using the “Duncan” dataset from the pydataset library. If you do not have pydataset installed on your computer, you can use the commands listed above to install it. Simply replace “altair vega_datasets” with “pydataset.” Below is the initial code.

```
import pandas as pd
from pydataset import data
df=data("Duncan")
df.head()
```

In the code above, we load pandas and import “data” from the “pydataset” library. Next, we load the “Duncan” dataset as the object “df”. Lastly, we use the .head() function to take a look at the dataset and see what variables are available.

Our first visualization is a simple bar graph. The code is below followed by the visualization.

```
import altair as alt
alt.Chart(df).mark_bar().encode(
x= "type",
y = "prestige"
)
```

In the code above, we did the following:

- Line 1 loads the altair library.
- Line 2 chains several functions together to make the bar graph. .Chart(df) loads the data for the plot. .mark_bar() assigns the geometric shape for the plot, which in this case is bars. Lastly, the .encode() function contains the information for the variables that will be assigned to the x and y axes. In this case, we are looking at job type and prestige.

The three dots in the upper right provide options for saving or editing the plot. We will learn more about saving plots later. In addition, Altair follows the grammar of graphics for creating plots. This has been discussed in another post, but a summary of the components is below.

- Data
- Aesthetics
- Scale
- Statistical transformation
- Geometric object
- Facets
- Coordinate system

We will not deal with all of these, but we have dealt with the following:

- Data as .Chart()
- Geometric object as .mark_bar()
- Aesthetics as .encode()

In our second example, we will make a scatterplot. The code and output are below.

```
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige"
)
```

The code is mostly the same. We simply use .mark_circle() to indicate the type of geometric object. For .encode(), we made sure to use two continuous variables.

In the next plot, we add a categorical variable to the scatterplot by manipulating the color.

```
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type'
)
```

The only change is the addition of the “color” argument, which is set to the categorical variable “type.”

It is also possible to use bubbles to indicate size. In the plot below, we add the income variable to the plot using bubbles.

```
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
size="income"
)
```

The latest argument that was added was the “size” argument which was used to map income to the plot.

You can also place plots side by side by piping. The code below makes two plots and saves them as objects. You then display both by typing the names of the objects separated by the pipe symbol (|), which you can find above the enter key on most keyboards. Below you will find two different plots created through this piping process.

```
educationPlot=alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
)
incomePlot=alt.Chart(df).mark_circle().encode(
x= "income",
y = "prestige",
color='type',
)
educationPlot | incomePlot
```

With this code you can make multiple plots. Simply keep adding pipes to make more plots.

**Interaction and Saving Plots**

It is also possible to make plots interactive. In the code below, we add the tooltip argument. This allows us to add an additional variable called “income” to the chart. When the mouse hovers over a data point, the income will display.

However, since we are in a browser right now, this will not work unless we save the chart as an html file. The last line of code saves the plot as an html file and renders it using svg. We also remove the three dots in the upper right corner by setting ‘actions’ to False. Below is the code and the plot once the html was loaded to this blog.

```
interact_plot=alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
tooltip=["income"]
)
interact_plot.save('interact_plot.html',embed_options={'renderer':'svg','actions':False})
```

I’ve made a lot of visuals in the past, and it has never been this simple.

**Conclusion**

Altair is another tool for visualizations. This may be the easiest way to make complex and interactive charts that I have seen. As such, it is a great option whenever you need to visualize data.

# Random Forest Classification with Python

Random forest is a machine learning algorithm that builds multiple decision trees, each of which may use different features and a different subsample of the data, up to the number of trees you specify. The trees then vote to determine the class of an example. This approach helps to deal with the high variance that is a problem when making only one decision tree.
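The voting idea can be illustrated without any machine learning library. The sketch below is only a minimal illustration under stated assumptions: `bootstrap_sample` and `majority_vote` are hypothetical helper names, and the five votes are made-up values for a single example.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as bagging does for each tree."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = list(range(10))                # stand-in for ten training examples
sample = bootstrap_sample(rows, rng)  # the rows one tree would be trained on

# Hypothetical predictions from five trees for one example
votes = ["dead", "censored", "dead", "dead", "censored"]
print(majority_vote(votes))
```

Note that scikit-learn’s RandomForestClassifier combines trees by averaging their predicted class probabilities rather than counting hard votes, so the sketch conveys the general idea only.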

In this post, we will learn how to develop a random forest model in Python. We will use the cancer dataset from the pydataset module to classify whether a person’s status is censored or dead based on several independent variables. The steps we need to perform to complete this task are listed below.

- Data preparation
- Model development and evaluation

**Data Preparation**

Below are some initial modules we need to complete all of the tasks for this project.

```
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
```

We will now load our dataset “cancer” and drop any rows that contain NA using the .dropna() function.

```
df = data('cancer')
df=df.dropna()
```

Next, we need to separate our independent variables from our dependent variable. We will do this by making two datasets. The X dataset will contain all of our independent variables and the y dataset will contain our dependent variable. You can check the documentation for the dataset using the code data('cancer', show_doc=True).

Before we make the y dataset we need to change the numerical values in the status variable to text. Doing this will aid in the interpretation of the results. If you look at the documentation of the dataset you will see that a 1 in the status variable means censored while a 2 means dead. We will change the 1 to censored and the 2 to dead when we make the y dataset. This involves the use of the .replace() function. The code is below.

```
X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']
```

We can now proceed to model development.

**Model Development and Evaluation**

We will first make our train and test datasets. We will use a 70/30 split. Next, we initialize the actual random forest classifier. There are many options that can be set. For our purposes, we will set the number of trees to make to 100. Setting the random_state option is similar to setting the seed for the purpose of reproducibility. Below is the code.

```
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestClassifier(n_estimators=100,random_state=1)
```
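To make the role of random_state more concrete, here is a rough sketch of a seeded 70/30 split using only the standard library. The helper `split_indices` is an assumption made for illustration. train_test_split has its own shuffling logic, so the exact rows it selects will differ.

```python
import random

def split_indices(n_rows, test_size=0.3, seed=0):
    """Shuffle row positions with a fixed seed and cut them into train/test."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # same seed, same shuffle every run
    n_test = round(n_rows * test_size)
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(10, test_size=0.3, seed=0)
print(len(train_idx), len(test_idx))
```

Calling the function twice with the same seed returns the same split, which is what makes an analysis reproducible.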

We can now run our model with the .fit() function and test it with the .predict() function. The code is below.

```
h.fit(x_train,y_train)
y_pred=h.predict(x_test)
```

We will now print two tables. The first will provide the raw results for the classification using the .crosstab() function. The classification_report function will provide the various metrics used for determining the value of a classification model.

```
print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))
```

Our overall accuracy is about 75%. How good this is depends on context. We are quite good at predicting that people are dead but have much more trouble predicting whether people are censored.

**Conclusion**

This post provided an example of using random forest in Python. Through the use of a forest of trees, it is often possible to get more accurate results than with a single decision tree. This is one of many reasons for the use of random forest in machine learning.

# Data Exploration Case Study: Credit Default

Exploratory data analysis is the main task of a Data Scientist with as much as 60% of their time being devoted to this task. As such, the majority of their time is spent on something that is rather boring compared to building models.

This post will provide a simple example of how to analyze a dataset from the website called Kaggle. This dataset looks at who is likely to default on their credit. The following steps will be conducted in this analysis.

- Load the libraries and dataset
- Deal with missing data
- Some descriptive stats
- Normality check
- Model development

This is not an exhaustive analysis but rather a simple one for demonstration purposes. The dataset is available here

**Load Libraries and Data**

Here are some packages we will need

```
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm
from sklearn import tree
from scipy import stats
from sklearn import metrics
```

You can load the data with the code below

```
df_train=pd.read_csv('/application_train.csv')
```

You can examine what variables are available with the code below. The output is not displayed here because it is rather long.

```
df_train.columns
df_train.head()
```

**Missing Data**

I prefer to deal with missing data first because missing values can cause errors throughout the analysis if they are not dealt with immediately. The code below calculates the percentage of missing data in each column.

```
total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
missing_data.head()
                           Total   Percent
COMMONAREA_MEDI           214865  0.698723
COMMONAREA_AVG            214865  0.698723
COMMONAREA_MODE           214865  0.698723
NONLIVINGAPARTMENTS_MODE  213514  0.694330
NONLIVINGAPARTMENTS_MEDI  213514  0.694330
```

Only the first five values are printed. You can see that some variables have a large amount of missing data. As such, they are probably worthless for inclusion in additional analysis. The code below removes all variables with any missing data.

```
pct_null = df_train.isnull().sum() / len(df_train)
missing_features = pct_null[pct_null > 0.0].index
df_train.drop(missing_features, axis=1, inplace=True)
```

You can use the .head() function or the .shape attribute if you want to see how many variables are left.
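The logic of this step can also be sketched without pandas. In the toy example below, the table values are made up and `missing_fraction` and `drop_incomplete` are hypothetical helpers. The `threshold` argument plays the same role as the 0.0 cutoff in the code above and could be raised to keep columns with only a few missing values.

```python
# Toy table: each key is a column, None marks a missing value (made-up data)
table = {
    "AMT_CREDIT": [100.0, 250.0, 90.0, 400.0],
    "COMMONAREA_AVG": [None, 0.2, None, None],
    "OWN_CAR_AGE": [3.0, None, 7.0, 1.0],
}

def missing_fraction(column):
    """Share of entries in a column that are missing."""
    return sum(value is None for value in column) / len(column)

def drop_incomplete(data, threshold=0.0):
    """Keep only columns whose missing share does not exceed the threshold."""
    return {name: values for name, values in data.items()
            if missing_fraction(values) <= threshold}

fractions = {name: missing_fraction(values) for name, values in table.items()}
complete = drop_incomplete(table)            # strict: no missing values allowed
lenient = drop_incomplete(table, 0.3)        # tolerate up to 30% missing
print(fractions)
print(list(complete), list(lenient))
```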

**Data Description & Visualization**

For demonstration purposes, we will print descriptive stats and make visualizations of a few of the variables that are remaining.

```
round(df_train['AMT_CREDIT'].describe())
Out[8]:
count     307511.0
mean      599026.0
std       402491.0
min        45000.0
25%       270000.0
50%       513531.0
75%       808650.0
max      4050000.0
# distplot is deprecated in recent seaborn releases; histplot or displot replace it
sns.distplot(df_train['AMT_CREDIT'])
```

```
round(df_train['AMT_INCOME_TOTAL'].describe())
Out[10]:
count       307511.0
mean        168798.0
std         237123.0
min          25650.0
25%         112500.0
50%         147150.0
75%         202500.0
max      117000000.0
sns.distplot(df_train['AMT_INCOME_TOTAL'])
```

I think you are getting the point. You can also look at categorical variables using the groupby() function.

We also need to address categorical variables by creating dummy variables. This is so that we can develop a model in the future. Below is the code for converting all of the categorical variables to dummy variables.

```
categorical=['NAME_CONTRACT_TYPE','CODE_GENDER','FLAG_OWN_CAR','FLAG_OWN_REALTY',
             'NAME_INCOME_TYPE','NAME_EDUCATION_TYPE','NAME_FAMILY_STATUS',
             'NAME_HOUSING_TYPE','ORGANIZATION_TYPE']
for col in categorical:
    # df_train.groupby(col).count() can be run first to inspect the categories
    dummy=pd.get_dummies(df_train[col])
    df_train=pd.concat([df_train,dummy],axis=1)
    df_train=df_train.drop([col],axis=1)
```

You have to be careful with this because now you have many variables that are not necessary. For every categorical variable you must remove at least one category in order for the model to work properly. Below we did this manually.

```
df_train=df_train.drop(['Revolving loans','F','XNA','N','Y','SK_ID_CURR','Student','Emergency','Lower secondary','Civil marriage','Municipal apartment'],axis=1)
```
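The reason for removing one category per variable is easier to see on a tiny example. The sketch below uses a hypothetical `one_hot` helper and three made-up rows. It drops the first category in alphabetical order, which is a different choice from the manual selection above.

```python
def one_hot(values, drop_first=True):
    """Build 0/1 indicator columns, optionally omitting the first category."""
    categories = sorted(set(values))
    kept = categories[1:] if drop_first else categories
    return {cat: [1 if v == cat else 0 for v in values] for cat in kept}

contract = ["Cash loans", "Revolving loans", "Cash loans"]  # made-up rows

full = one_hot(contract, drop_first=False)
dummies = one_hot(contract)

# With every category kept, the indicator columns always sum to 1 per row,
# so any one of them is fully determined by the others
row_sums = [sum(col[i] for col in full.values()) for i in range(len(contract))]
print(dummies, row_sums)
```

pandas offers the same behavior through the drop_first argument of get_dummies, which can save the manual step when it does not matter which category is left out.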

Below are some boxplots with the target variable and other variables in the dataset.

```
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['AMT_INCOME_TOTAL'])
```

There is a clear outlier there. Below is another boxplot with a different variable.

```
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=df_train['TARGET'],y=df_train['CNT_CHILDREN'])
```

It appears several people have more than 10 children. This is probably a typo.

Below is a correlation matrix using a heatmap technique.

```
# with pandas 2.0 or later, use df_train.corr(numeric_only=True) if non-numeric columns remain
corrmat=df_train.corr()
f,ax=plt.subplots(figsize=(12,9))
sns.heatmap(corrmat,vmax=.8,square=True)
```

The heatmap is nice, but it is hard to really appreciate what is happening. The code below will sort the correlations from weakest to strongest, so we can remove high correlations.

```
c = df_train.corr().abs()
s = c.unstack()
so = s.sort_values(kind="quicksort")
print(so.head())
FLAG_DOCUMENT_12  FLAG_MOBIL          0.000005
FLAG_MOBIL        FLAG_DOCUMENT_12    0.000005
Unknown           FLAG_MOBIL          0.000005
FLAG_MOBIL        Unknown             0.000005
Cash loans        FLAG_DOCUMENT_14    0.000005
```

The list is too long to show here, but the following variables were removed for having a high correlation with other variables.

```
df_train=df_train.drop(['WEEKDAY_APPR_PROCESS_START','FLAG_EMP_PHONE','REG_CITY_NOT_WORK_CITY','REGION_RATING_CLIENT','REG_REGION_NOT_WORK_REGION'],axis=1)
```

Below we check a few variables for homoscedasticity, linearity, and normality using plots and histograms.

```
sns.distplot(df_train['AMT_INCOME_TOTAL'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_INCOME_TOTAL'],plot=plt)
```

This is not normal.

```
sns.distplot(df_train['AMT_CREDIT'],fit=norm)
fig=plt.figure()
res=stats.probplot(df_train['AMT_CREDIT'],plot=plt)
```

This is not normal either. We could do transformations, or we can make a non-linear model instead.

**Model Development**

Now comes the easy part. We will make a decision tree using only some variables to predict the target. In the code below, we make our X and y datasets.

```
X=df_train[['Cash loans','DAYS_EMPLOYED','AMT_CREDIT','AMT_INCOME_TOTAL','CNT_CHILDREN','REGION_POPULATION_RELATIVE']]
y=df_train['TARGET']
```

The code below fits our model and makes the predictions.

```
clf=tree.DecisionTreeClassifier(min_samples_split=20)
clf=clf.fit(X,y)
y_pred=clf.predict(X)
```

Below is the confusion matrix followed by the accuracy.

```
print(pd.crosstab(y_pred,df_train['TARGET']))
TARGET       0      1
row_0
0       280873  18493
1         1813   6332
# accuracy_score lives in sklearn.metrics, which was imported as metrics
metrics.accuracy_score(y_pred,df_train['TARGET'])
Out[47]: 0.933966589813047
```

Lastly, we can look at the precision, recall, and f1 score.

```
print(metrics.classification_report(y_pred,df_train['TARGET']))
              precision    recall  f1-score   support

           0       0.99      0.94      0.97    299366
           1       0.26      0.78      0.38      8145

   micro avg       0.93      0.93      0.93    307511
   macro avg       0.62      0.86      0.67    307511
weighted avg       0.97      0.93      0.95    307511
```

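This model looks rather good in terms of accuracy on the training set. It is actually impressive that we could use so few variables from such a large dataset and achieve such a high degree of accuracy. Keep in mind, though, that classification_report expects the true labels as its first argument. Because the predictions were passed first here, the precision and recall columns are swapped, so the model identifies only about 26% of the actual defaults (class 1).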
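The figures in the confusion matrix above can be turned into metrics by hand, which helps when reading the report. The sketch below copies the four counts from the crosstab, where rows are predictions and columns are the true TARGET values. The variable names are chosen here for illustration.

```python
# Counts copied from the crosstab above (rows are predictions, columns are TARGET)
true_negative = 280873   # predicted 0, actual 0
false_negative = 18493   # predicted 0, actual 1
false_positive = 1813    # predicted 1, actual 0
true_positive = 6332     # predicted 1, actual 1

total = true_negative + false_negative + false_positive + true_positive
accuracy = (true_negative + true_positive) / total
precision = true_positive / (true_positive + false_positive)
recall = true_positive / (true_positive + false_negative)

# Accuracy from always predicting the majority class (0), for comparison
baseline = (true_negative + false_positive) / total

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(baseline, 3))
```

With these definitions, precision for class 1 is roughly 0.78 and recall roughly 0.26. Always predicting class 0 would already reach about 92% accuracy, which is useful context for the 93% figure.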

**Conclusion**

Data exploration and analysis is the primary task of a data scientist. This post was just an example of how this can be approached. Of course, there are many other creative ways to do this, but the simplistic nature of this analysis yielded strong results.

# RANSAC Regression in Python

RANSAC is an acronym for Random Sample Consensus. What this algorithm does is fit a regression model on a subset of data that the algorithm judges as inliers while removing outliers. This naturally improves the fit of the model due to the removal of some data points.

The process that is used to determine inliers and outliers is described below.

- The algorithm randomly selects a subset of samples to be treated as inliers.
- A model is fit to this subset, and all samples that fall within a certain tolerance of the fitted model are relabeled as inliers.
- The model is refitted with the new inliers.
- The error of the fitted model versus the inliers is calculated.
- Terminate, or go back to step 1 if a certain criterion of iterations or performance is not met.
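The steps above can be sketched for a single predictor in plain Python. This is a simplified illustration, not the scikit-learn implementation: `fit_line` and `ransac_line` are hypothetical helpers, the stopping rule is just a fixed number of trials, and the data are made up (ten points on the line y = 0.5x + 1 plus one obvious outlier).

```python
import random

def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def ransac_line(points, threshold=2.0, n_trials=50, min_samples=2, seed=0):
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_trials):
        # Step 1: pick a random subset and fit a model to it
        sample = rng.sample(points, min_samples)
        if len({x for x, _ in sample}) < 2:
            continue  # degenerate sample: a line cannot be fit
        slope, intercept = fit_line(sample)
        # Step 2: every point within the tolerance counts as an inlier
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= threshold]
        # Keep the largest consensus set seen so far
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    # Step 3: refit the model on the best set of inliers
    return fit_line(best_inliers), best_inliers

points = [(float(x), 0.5 * x + 1) for x in range(10)] + [(4.5, 30.0)]
(slope, intercept), inliers = ransac_line(points)
print(slope, intercept, len(inliers))
```

The residual threshold plays the same role as the residual_threshold argument used later in this post: a larger value lets more points count as inliers.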

In this post, we will use the tips data from the pydataset module. Our goal will be to predict the tip amount using two different models.

- Model 1 will use simple regression and will include total bill as the independent variable and tips as the dependent variable
- Model 2 will use multiple regression and includes several independent variables and tips as the dependent variable

The process we will use to complete this example is as follows

- Data preparation
- Simple Regression Model fit
- Simple regression visualization
- Multiple regression model fit
- Multiple regression visualization

Below are the packages we will need for this example

```
import pandas as pd
from pydataset import data
from sklearn.linear_model import RANSACRegressor
from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score
```

**Data Preparation**

For the data preparation, we need to do the following

- Load the data
- Create X and y dataframes
- Convert several categorical variables to dummy variables
- Drop the original categorical variables from the X dataframe

Below is the code for these steps

```
df=data('tips')
# .copy() avoids pandas warnings when new columns are added to X
X,y=df[['total_bill','sex','size','smoker','time']].copy(),df['tip']
male=pd.get_dummies(X['sex'])
X['male']=male['Male']
smoker=pd.get_dummies(X['smoker'])
X['smoker']=smoker['Yes']
dinner=pd.get_dummies(X['time'])
X['dinner']=dinner['Dinner']
# axis must be passed by keyword in recent pandas versions
X=X.drop(['sex','time'],axis=1)
```

Most of this is self-explanatory. We first load the tips dataset and divide the independent and dependent variables into an X and y dataframe respectively. Next, we convert the sex, smoker, and time variables into dummy variables, and then we drop the original categorical variables.

We can now move to fitting the first model that uses simple regression.

**Simple Regression Model**

For our model, we want to use total bill to predict tip amount. All this is done in the following steps.

- Instantiate an instance of the RANSACRegressor. We pass it the LinearRegression function, and we also set the residual_threshold to 2, which means an example more than 2 units away from the line is treated as an outlier.
- Next we fit the model
- We predict the values
- We calculate the r-square and the mean absolute error

Below is the code for all of this.

```
ransacReg1= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg1.fit(X[['total_bill']],y)
prediction1=ransacReg1.predict(X[['total_bill']])
```

```
r2_score(y,prediction1)
Out[150]: 0.4381748268686979
mean_absolute_error(y,prediction1)
Out[151]: 0.7552429811944833
```

The r-square is 44% while the MAE is 0.75. These values are mostly for comparison and will be looked at again when we create the multiple regression model.
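For readers who want to see what these two metrics measure, here is a small hand-rolled version. The function names and the four actual and predicted values are made up for illustration. In practice the r2_score and mean_absolute_error functions used above do this work.

```python
def r2_manual(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

def mae_manual(actual, predicted):
    """Average size of the prediction errors, in the units of the target."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

tips = [1.0, 2.0, 3.0, 4.0]    # made-up actual values
preds = [1.5, 2.0, 2.5, 4.0]   # made-up predictions

r2 = r2_manual(tips, preds)
mae = mae_manual(tips, preds)
print(r2, mae)
```

Because the MAE is expressed in the units of the dependent variable, the 0.75 above means the predicted tip is off by about 75 cents on average.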

The next step is to make the visualization. The code below will create a plot that shows the X and y variables and the regression line. It also identifies which samples are inliers and outliers. The code will not be explained because of its complexity.

```
inlier=ransacReg1.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(3,51,2)
line_y=ransacReg1.predict(line_X[:,np.newaxis])
plt.scatter(X[['total_bill']][inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(X[['total_bill']][outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Total Bill')
plt.ylabel('Tip')
plt.legend(loc='upper left')
```

The plot is self-explanatory: a handful of samples were considered outliers. We will now move to creating our multiple regression model.

**Multiple Regression Model Development**

The steps for making the model are mostly the same. The real difference takes place in making the plot, which we will discuss in a moment. Below is the code for developing the model.

```
ransacReg2= RANSACRegressor(LinearRegression(),residual_threshold=2,random_state=0)
ransacReg2.fit(X,y)
prediction2=ransacReg2.predict(X)
```

```
r2_score(y,prediction2)
Out[154]: 0.4298703800652126
mean_absolute_error(y,prediction2)
Out[155]: 0.7649733201032204
```

Things have actually gotten slightly worse in terms of r-square and MAE.

For the visualization, we cannot directly plot several variables at once. Therefore, we will compare the predicted values with the actual values. The better they are correlated, the better our prediction is. Below is the code for the visualization.

```
inlier=ransacReg2.inlier_mask_
outlier=np.logical_not(inlier)
line_X=np.arange(1,8,1)
line_y=(line_X[:,np.newaxis])
plt.scatter(prediction2[inlier],y[inlier],c='lightblue',marker='o',label='Inliers')
plt.scatter(prediction2[outlier],y[outlier],c='green',marker='s',label='Outliers')
plt.plot(line_X,line_y,color='black')
plt.xlabel('Predicted Tip')
plt.ylabel('Actual Tip')
plt.legend(loc='upper left')
```

The plots are mostly the same, as you can see for yourself.

**Conclusion**

This post provided an example of how to use the RANSAC regressor algorithm. This algorithm will remove samples from the model based on a criterion you set. The biggest complaint about this algorithm is that it removes data from the model. Generally, we want to avoid losing data when developing models. In addition, the algorithm removes outliers objectively. This is a problem because outlier removal is often subjective. Despite these flaws, RANSAC regression is another tool that can be used in machine learning.

# Combining Algorithms for Classification with Python

Many approaches in machine learning involve making many models that combine their strengths and weaknesses to make more accurate classifications. Generally, when this is done, it is the same algorithm being used. For example, random forest is simply many decision trees being developed. Even when bagging or boosting is being used, it is the same algorithm but with variations in sampling and the use of features.

In addition to this common form of ensemble learning, there is also a way to combine different algorithms to make predictions. One way of doing this is through a technique called stacking, in which the predictions of several models are passed to a higher model that uses the individual model predictions to make a final prediction. In this post, we will look at how to do this using Python.

**Assumptions**

This blog usually tries to explain as much as possible about what is happening. However, due to the complexity of this topic there are several assumptions about the reader’s background.

- Already familiar with python
- Can use various algorithms to make predictions (logistic regression, linear discriminant analysis, decision trees, K nearest neighbors)
- Familiar with cross-validation and hyperparameter tuning

We will be using the Mroz dataset in the pydataset module. Our goal is to use several of the independent variables to predict whether someone lives in the city or not.

The steps we will take in this post are as follows

- Data preparation
- Individual model development
- Ensemble model development
- Hyperparameter tuning of ensemble model
- Ensemble model testing

Below are all of the libraries we will be using in this post.

```
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from mlxtend.classifier import EnsembleVoteClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
```

**Data Preparation**

We need to perform the following steps for the data preparation

- Load the data
- Select the independent variables to be used in the analysis
- Scale the independent variables
- Convert the dependent variable from text to numbers
- Split the data in train and test sets

Not all of the variables in the Mroz dataset were used. Some were left out because they were highly correlated with others. This analysis is not in this post, but you can explore this on your own. The data was also scaled because many algorithms are sensitive to the scale of the variables, so it is best practice to scale the data. We will use the StandardScaler function for this. Lastly, the dependent variable currently consists of values of “yes” and “no”; these need to be converted to the numbers 1 and 0. We will use the LabelEncoder function for this. The code for all of this is below.

```
df=data('Mroz')
X,y=df[['hoursw','child6','child618','educw','hearnw','hoursh','educh','wageh','educwm','educwf','experience']],df['city']
sc=StandardScaler()
X_scale=sc.fit_transform(X)
X=pd.DataFrame(X_scale, index=X.index, columns=X.columns)
le=LabelEncoder()
y=le.fit_transform(y)
X_train, X_test,y_train, y_test=train_test_split(X,y,test_size=.3,random_state=5)
```
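What the two preprocessing steps do can be shown on a few made-up values. The helpers `standardize` and `encode_labels` below are hypothetical stand-ins written with the standard library. They mirror the general behavior of StandardScaler (subtract the mean, divide by the population standard deviation) and LabelEncoder (number the classes in sorted order).

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[label] for label in labels], mapping

hours = [0, 10, 20, 30, 40]                      # made-up values
scaled = standardize(hours)
codes, mapping = encode_labels(["yes", "no", "yes"])
print(scaled)
print(codes, mapping)
```

Sorting the classes is what makes “no” become 0 and “yes” become 1, as described above.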

We can now proceed to individual model development.

**Individual Model Development**

Below are the steps for this part of the analysis

- Instantiate an instance of each algorithm
- Check accuracy of each model
- Check roc curve of each model

We will create four different models, and they are logistic regression, decision tree, k nearest neighbor, and linear discriminant analysis. We will also set some initial values for the hyperparameters for each. Below is the code

```
logclf=LogisticRegression(penalty='l2',C=0.001, random_state=0)
treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0)
knnclf=KNeighborsClassifier(n_neighbors=5,p=2,metric='minkowski')
LDAclf=LDA()
```

We can now assess the accuracy and roc curve of each model. This will be done through using two separate for loops. The first will have the accuracy results and the second will have the roc curve results. The results will also use k-fold cross validation with the cross_val_score function. Below is the code with the results.

```
clf_labels=['Logistic Regression','Decision Tree','KNN','LDAclf']
for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf],clf_labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [Logistic Regression]
accuracy: 0.72 (+/- 0.06) [Decision Tree]
accuracy: 0.66 (+/- 0.06) [KNN]
accuracy: 0.70 (+/- 0.05) [LDAclf]
roc auc: 0.71 (+/- 0.08) [Logistic Regression]
roc auc: 0.70 (+/- 0.07) [Decision Tree]
roc auc: 0.62 (+/- 0.10) [KNN]
roc auc: 0.70 (+/- 0.08) [LDAclf]
```

The results can speak for themselves. We have a general accuracy of around 70% but our roc auc is poor. Despite this we will now move to the ensemble model development.
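As a reminder of what k-fold cross-validation does with the training data, the sketch below splits row positions into folds by hand. `k_fold_indices` is a hypothetical helper and the 23 rows are an arbitrary example. For classifiers, cross_val_score also stratifies the folds by class, which this simplified version does not do.

```python
def k_fold_indices(n_rows, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = [i for i in range(n_rows) if i < start or i >= start + size]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_rows=23, k=10))
for train, validation in folds[:2]:
    print(len(train), validation)
```

Each row is used for validation exactly once, and the mean and standard deviation reported above are computed over the ten validation scores.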

**Ensemble Model Development**

The ensemble model requires the use of the EnsembleVoteClassifier function. Inside this function are the four models we made earlier. Other than this the rest of the code is the same as the previous step. We will assess the accuracy and the roc auc. Below is the code and the results

```
mv_clf= EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])
# labels for the four individual models plus the combined model
labels=['LR','tree','knn','LDA','combine']
for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='accuracy')
    print("accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

for clf, label in zip ([logclf,treeclf,knnclf,LDAclf,mv_clf],labels):
    scores=cross_val_score(estimator=clf,X=X_train,y=y_train,cv=10,scoring='roc_auc')
    print("roc auc: %0.2f (+/- %0.2f) [%s]" % (scores.mean(),scores.std(),label))

accuracy: 0.69 (+/- 0.04) [LR]
accuracy: 0.72 (+/- 0.06) [tree]
accuracy: 0.66 (+/- 0.06) [knn]
accuracy: 0.70 (+/- 0.05) [LDA]
accuracy: 0.70 (+/- 0.04) [combine]
roc auc: 0.71 (+/- 0.08) [LR]
roc auc: 0.70 (+/- 0.07) [tree]
roc auc: 0.62 (+/- 0.10) [knn]
roc auc: 0.70 (+/- 0.08) [LDA]
roc auc: 0.72 (+/- 0.09) [combine]
```

You can see that the combined model has similar performance to the individual models. This means that, in this situation, the ensemble learning did not make much of a difference. However, we have not tuned our hyperparameters yet. This will be done in the next step.
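To see what the weights argument changes, consider a weighted majority vote on a single example. The sketch below assumes hard voting and uses a hypothetical `weighted_vote` helper with made-up predictions. It is not the mlxtend implementation.

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Sum the weights behind each predicted class and return the heaviest."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

weights = [1.5, 1, 1, 1]   # same weights as above: LR, tree, KNN, LDA

# Made-up predictions for one example, in the order LR, tree, KNN, LDA
split_vote = weighted_vote([1, 0, 0, 1], weights)    # two against two
outvoted = weighted_vote([1, 0, 0, 0], weights)      # LR alone against three
print(split_vote, outvoted)
```

With these weights, logistic regression decides any two-against-two split but can still be outvoted when the other three models agree.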

**Hyperparameter Tuning of Ensemble Model**

We are going to tune the decision tree, logistic regression, and KNN models. There are many different hyperparameters we can tune. For demonstration purposes, we are only tuning one hyperparameter per algorithm. Once we set the hyperparameters, we will run the model and pull the best hyperparameter values using the ROC AUC as the metric. Below is the code and the output.

```
params={'decisiontreeclassifier__max_depth':[2,3,5],
        'logisticregression__C':[0.001,0.1,1,10],
        'kneighborsclassifier__n_neighbors':[5,7,9,11]}
grid=GridSearchCV(estimator=mv_clf,param_grid=params,cv=10,scoring='roc_auc')
grid.fit(X_train,y_train)

grid.best_params_
Out[34]: 
{'decisiontreeclassifier__max_depth': 3,
 'kneighborsclassifier__n_neighbors': 9,
 'logisticregression__C': 10}

grid.best_score_
Out[35]: 0.7196051482279385
```

The best values are as follows:

- Decision tree max depth set to 3
- KNN number of neighbors set to 9
- Logistic regression C set to 10

These values give us a ROC AUC of 0.72, which is still poor. We can now use these values when we test our final model.
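The keys in the parameter grid follow a `name__parameter` pattern: the part before the double underscore identifies one of the models inside the ensemble, and the part after it is that model's hyperparameter. The sketch below uses only the standard library to split the keys and to count how many combinations the grid search has to evaluate. It does not call scikit-learn, and the variable names are chosen for this illustration.

```python
from itertools import product

params = {
    'decisiontreeclassifier__max_depth': [2, 3, 5],
    'logisticregression__C': [0.001, 0.1, 1, 10],
    'kneighborsclassifier__n_neighbors': [5, 7, 9, 11],
}

# Split each key into the model name and the hyperparameter it refers to.
for key in params:
    model_name, hyperparameter = key.split('__')
    print(model_name, '->', hyperparameter)

# Every combination of values is tried once per cross-validation fold.
combinations = list(product(*params.values()))
n_folds = 10
n_fits = len(combinations) * n_folds
print(len(combinations), 'combinations,', n_fits, 'model fits')
```

Counting the fits in advance gives a rough idea of how long a search will take before any model is trained.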

**Ensemble Model Testing**

The following steps are performed in the analysis:

- Create new instances of the algorithms with the adjusted hyperparameters
- Run the ensemble model
- Predict with the test data
- Check the results

Below is the first step

```
logclf=LogisticRegression(penalty='l2',C=10, random_state=0)
treeclf=DecisionTreeClassifier(max_depth=3,criterion='entropy',random_state=0)
knnclf=KNeighborsClassifier(n_neighbors=9,p=2,metric='minkowski')
LDAclf=LDA()
```

Below is step two

```
mv_clf=EnsembleVoteClassifier(clfs=[logclf,treeclf,knnclf,LDAclf],weights=[1.5,1,1,1])
mv_clf.fit(X_train,y_train)
```

Below are steps three and four

```
y_pred=mv_clf.predict(X_test)
print(accuracy_score(y_test,y_pred))
print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))

0.6902654867256637
col_0   0    1
row_0
0      29   58
1      12  127
             precision    recall  f1-score   support

          0       0.71      0.33      0.45        87
          1       0.69      0.91      0.78       139

avg / total       0.69      0.69      0.66       226
```

The accuracy is about 69%. One thing that is noticeably low is the recall for people who do not live in the city (class 0): only about a third of them are identified correctly. This is probably one reason why the overall ROC AUC score is so low. The f1-score is also low for this group, since the f1-score is the harmonic mean of precision and recall. If we really wanted to improve performance, we would probably start by improving the recall of the no’s.
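The numbers in the classification report can be reproduced by hand from the four counts in the crosstab above. The short sketch below does this with plain Python; the variable names are chosen for this illustration.

```python
# Counts taken from the crosstab above (rows are actual, columns are predicted).
tn, fp = 29, 58    # actual 0: predicted 0, predicted 1
fn, tp = 12, 127   # actual 1: predicted 0, predicted 1

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

accuracy = (tn + tp) / (tn + fp + fn + tp)

# Class 0 is treated as the class of interest here.
precision_0 = tn / (tn + fn)
recall_0 = tn / (tn + fp)

# Class 1 is treated as the class of interest here.
precision_1 = tp / (tp + fp)
recall_1 = tp / (tp + fn)

print(round(accuracy, 2))
print(round(precision_0, 2), round(recall_0, 2), round(f1(precision_0, recall_0), 2))
print(round(precision_1, 2), round(recall_1, 2), round(f1(precision_1, recall_1), 2))
```

The low recall for class 0 comes from the 58 people who do not live in the city but were predicted as living there.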

**Conclusion**

This post provided an example of how you can combine different algorithms to make predictions in Python. This is a powerful technique to use. Of course, it is offset by the complexity of the analysis, which makes it hard to explain exactly what the results mean if you were asked to do so.

# Gradient Boosting Regression in Python

In this post, we will take a look at gradient boosting for regression. Gradient boosting builds models sequentially, with each new model trying to explain the examples that the previous models did not explain well. Many consider this approach to perform better than AdaBoost.
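The sequential idea can be sketched in a few lines. The example below assumes a squared-error loss, in which case "what the previous models did not explain" is simply the residuals. It is a simplified illustration on synthetic data, not the internals of GradientBoostingRegressor, and all variable names are made up for this sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, made up for this sketch.
rng = np.random.RandomState(1)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=200)

learning_rate = 0.1
n_estimators = 100

# Start from a constant prediction (the mean of the target).
prediction = np.full(len(y), y.mean())
initial_mse = np.mean((y - prediction) ** 2)

for _ in range(n_estimators):
    residuals = y - prediction
    # Each new shallow tree is fit to what the current ensemble still gets wrong.
    stump = DecisionTreeRegressor(max_depth=1, random_state=1)
    stump.fit(X, residuals)
    # The learning rate shrinks each tree's contribution.
    prediction += learning_rate * stump.predict(X)

final_mse = np.mean((y - prediction) ** 2)
print(initial_mse, final_mse)
```

The training error falls as trees are added, which is also why the number of estimators and the learning rate need to be tuned together to avoid overfitting.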

Regression trees are most commonly teamed with boosting. There are some additional hyperparameters that need to be set, including the following

- number of estimators
- learning rate
- subsample
- max depth

We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.

- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development

Below is some initial code

```
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
```

**Data Preparation**

The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.

```
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
```

We can now move to creating our baseline model.

**Baseline Model**

The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create several regression trees. The difference between the regression trees will be the max depth. The max depth has to do with how many levels of splits Python can make in trying to reduce the prediction error. We will then decide which tree is best based on the mean squared error.

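The first thing we need to do is set the arguments for the cross-validation. Cross-validating the results helps to check how well they generalize. The rest of the code relies on a for loop and an if statement, which will not be explained in this post. Below is the code.

```
# Cross-validation settings; assumed here to match the KFold call used in the classification posts
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    # Stop once the fitted tree no longer grows to the requested depth
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
    print(depth, score)
```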

You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.

However, before we create our gradient boosting model, we need to tune the hyperparameters of the algorithm.

**Hyperparameter Tuning**

Hyperparameter tuning has to do with setting the values of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you rarely know in advance which values are best. Therefore, the common approach is to have the computer try several combinations of values and keep the combination that works best for the model. Having said this, there are several hyperparameters we need to tune, and they are as follows.

- number of estimators
- learning rate
- subsample
- max depth

The number of estimators is how many trees to create. Adding more trees can increase the risk of overfitting. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample used to fit each tree. Max depth was explained previously.

What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then place this grid inside GridSearchCV so that we can prepare to run our model. There are some arguments that need to be set inside GridSearchCV, such as estimator, param_grid, and cv. Below is the code.

```
GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,4],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
```

We can now run the code and determine the best combination of hyperparameters and how well the model did based on the mean squared error metric. Below is the code and the output.

```
search.fit(X,y)
search.best_params_
Out[13]: 
{'learning_rate': 0.01,
 'max_depth': 1,
 'n_estimators': 500,
 'random_state': 1,
 'subsample': 0.5}

search.best_score_
Out[14]: -160.51398257591643
```

The hyperparameter results speak for themselves. With this tuning, the mean squared error is about 160.5, which is lower than the baseline model's 176. We can now move to the final step of taking these hyperparameter settings and seeing how they do on the dataset. The results should be almost the same.
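The scores are negative because scikit-learn negates the mean squared error so that a larger score is always better. The sketch below flips the sign back and takes the square root to express the error on the scale of the outcome (weight loss). The baseline value of 176 is the rounded figure quoted in the text, and the variable names are chosen for this illustration.

```python
import math

baseline_mse = 176.0                  # rounded baseline MSE quoted in the text
tuned_score = -160.51398257591643     # best_score_ from the grid search

# scikit-learn negates the MSE so that larger scores are always better.
tuned_mse = -tuned_score

# The square root puts the error back on the scale of the outcome variable.
baseline_rmse = math.sqrt(baseline_mse)
tuned_rmse = math.sqrt(tuned_mse)

# Share of the baseline error removed by the tuned model.
relative_reduction = (baseline_mse - tuned_mse) / baseline_mse
print(round(baseline_rmse, 2), round(tuned_rmse, 2), round(relative_reduction, 3))
```

Seen this way, the tuned model's typical error is a little under 13 units of weight loss, a modest gain over the baseline.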

**Gradient Boosting Model Development**

Below is the code and the output for the tuned gradient boosting model

```
GBR2=GradientBoostingRegressor(n_estimators=500,learning_rate=0.01,subsample=.5,max_depth=1,random_state=1)
score=np.mean(cross_val_score(GBR2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[18]: -160.77842893572068
```

These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.

**Conclusion**

In this post, we looked at how to use gradient boosting to improve on a regression tree. By creating multiple models, gradient boosting will often perform better than algorithms that rely on only one model.

# Gradient Boosting Classification in Python

Gradient boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than AdaBoost. One difference between the two algorithms is that gradient boosting uses optimization to weight the estimators. Like AdaBoost, gradient boosting can be used with most algorithms but is commonly associated with decision trees.

In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth limits how many levels of splits a tree can have. The higher the number, the purer the classification becomes. The downside to this is the risk of overfitting.

Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a small decimal value up to 1. If the value is set below 1, the method becomes stochastic gradient boosting.

This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (alive or dead) using the available independent variables. The steps we will use are as follows.

- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development

Below is some initial code.

from sklearn.ensemble import GradientBoostingClassifier

from sklearn import tree

from sklearn.model_selection import GridSearchCV

import numpy as np

from pydataset import data

import pandas as pd

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import KFold

**Data Preparation**

The data preparation is simple in this situation. All we need to do is load our dataset, drop the missing values, and create our X dataset and y dataset. All this happens in the code below.

df=data('cancer').dropna()

X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]

y=df['status']

We will now develop our baseline decision tree model.

**Baseline Model**

The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.

The criterion for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The most accurate model will be the baseline model.

To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.

```
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)

1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
```

It appears that limiting the max depth to 1 gives the best accuracy, at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm.
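As a side note on the cross-validation object used above, the sketch below shows what KFold does with a tiny made-up dataset. It uses 5 splits instead of 10 to keep the printout short, and `X_demo` and `test_rows` are hypothetical names for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold

X_demo = np.arange(20).reshape(10, 2)  # 10 made-up rows with 2 features each

crossvalidation = KFold(n_splits=5, shuffle=True, random_state=1)

test_rows = []
for train_idx, test_idx in crossvalidation.split(X_demo):
    # Each fold trains on 8 rows and holds out the remaining 2 for testing.
    print('train:', train_idx, 'test:', test_idx)
    test_rows.extend(test_idx.tolist())

# Across all folds, every row is used for testing exactly once.
print(sorted(test_rows))
```

Because each row is held out exactly once, the averaged score uses all of the data for testing without ever testing on a row the model was trained on in that fold.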

**Hyperparameter Tuning**

There are several hyperparameters we need to tune. The ones we will tune are as follows

- number of estimators
- learning rate
- subsample
- max depth

First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross-validation settings we made earlier, and n_jobs all together in one place. Below is the code for this.

GBC=GradientBoostingClassifier()

search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}

search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then a second run-through is done to find the learning rate. In our example, we are doing everything at once, which is why it takes longer. Below is the code with the output for the best parameters and best score.

```
search.fit(X,y)
search.best_params_
Out[11]: 
{'learning_rate': 0.01,
 'max_depth': 5,
 'n_estimators': 2000,
 'random_state': 1,
 'subsample': 0.75}

search.best_score_
Out[12]: 0.7425149700598802
```

You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.
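To see why tuning everything at once takes longer than the two-stage approach mentioned earlier, the sketch below simply counts the combinations each approach would have to fit. It uses only the standard library, and the stage split is one possible way to organize the search rather than a fixed recipe.

```python
from itertools import product

n_estimators = [500, 1000, 2000]
learning_rate = [.001, 0.01, .1]
max_depth = [1, 3, 5]
subsample = [.5, .75, 1]

# One pass over the full grid, as done in this post.
full_grid = list(product(n_estimators, learning_rate, max_depth, subsample))

# Two-stage alternative: tune the tree settings first with the learning rate
# held fixed, then tune the learning rate with the best tree settings fixed.
stage_one = list(product(n_estimators, max_depth, subsample))
stage_two = learning_rate
two_stage_total = len(stage_one) + len(stage_two)

print(len(full_grid), two_stage_total)
```

The two-stage search is cheaper, but it can miss a combination that is only best when all four hyperparameters are varied together.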

**Gradient Boosting Model**

Below is the code and results for the model with the predetermined hyperparameter values.

ada2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)

score=np.mean(cross_val_score(ada2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))

score

Out[17]: 0.742279411764706

You can see that the results are similar. This is additional evidence that the gradient boosting model does outperform the baseline decision tree model.

**Conclusion**

This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

# AdaBoost Regression with Python

This post will share how to use the AdaBoost algorithm for regression in Python. Boosting makes multiple models in a sequential manner. Each newer model tries to successfully predict what older models struggled with. For regression, the predictions of the models are then combined into a single prediction, for example by averaging. It is most common to use boosting with decision trees, but this approach can be used with any machine learning algorithm that deals with supervised learning.

Boosting is associated with ensemble learning because several models are created and then combined. An assumption of boosting is that combining several weak models can make one really strong and accurate model.
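How the individual models are combined differs between implementations. scikit-learn's AdaBoostRegressor, for instance, uses a weighted median of the models' predictions rather than a plain average. The sketch below shows that idea with NumPy; the function `weighted_median` and the example numbers are made up for this illustration and are not the library's code.

```python
import numpy as np

def weighted_median(predictions, weights):
    """Return the prediction at which the cumulative weight first reaches half."""
    predictions = np.asarray(predictions, dtype=float)
    weights = np.asarray(weights, dtype=float)
    order = np.argsort(predictions)
    cumulative = np.cumsum(weights[order])
    # Index of the first sorted prediction whose cumulative weight is >= 50%.
    position = np.searchsorted(cumulative, 0.5 * cumulative[-1])
    return predictions[order][position]

# Hypothetical predictions from three models and their weights.
model_predictions = [10.0, 12.0, 30.0]
model_weights = [0.2, 0.5, 0.3]

combined = weighted_median(model_predictions, model_weights)
weighted_mean = np.average(model_predictions, weights=model_weights)
print(combined, weighted_mean)
```

In this example the weighted median stays at the prediction of the most trusted model, while the weighted mean is pulled upward by the one extreme prediction.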

For our purposes, we will be using AdaBoost regression to improve the performance of a decision tree in Python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.

- Data preparation
- Regression decision tree baseline model
- Hyperparameter tuning of Adaboost regression model
- AdaBoost regression model development

Below is some initial code

from sklearn.ensemble import AdaBoostRegressor

from sklearn import tree

from sklearn.model_selection import GridSearchCV

import numpy as np

from pydataset import data

import pandas as pd

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import train_test_split

from sklearn.model_selection import KFold

from sklearn.metrics import mean_squared_error

**Data Preparation**

There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.

df=data('cancer').dropna()

X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]

y=df['wt.loss']

We will now proceed to creating the baseline regression decision tree model.

**Baseline Regression Tree Model**

The purpose of the baseline model is to compare it with the performance of our model that utilizes AdaBoost. In order to make this model, we need to initiate a K-fold cross-validation. This will help in stabilizing the results. Next, we will create a for loop so that we can create several trees that vary based on their depth. By depth, we mean how many levels of splits the tree can use to reduce the prediction error. More depth often leads to a higher likelihood of overfitting.

Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results

```
# Cross-validation settings; assumed here to match the KFold call used in the classification posts
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
    print(depth, score)

1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
```

Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.

**Hyperparameter Tuning**

For hyperparameter tuning, we need to start by initiating our AdaBoostRegressor() class. Then we need to create our grid. The grid will address two hyperparameters, which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make, and the learning rate indicates how much each tree contributes to the overall results. There is one more parameter, random_state, but this is just for setting the seed and never changes.

After making the grid, we need to use GridSearchCV to finish this process. Here you have to set the estimator, which is the AdaBoostRegressor, the parameter grid, which we just made, the cross-validation, which we made when we created the baseline model, and n_jobs, which allocates resources for the calculation. Below is the code.

ada=AdaBoostRegressor()

search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}

search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)

Next, we can run the model with the desired grid in place. Below is the code for fitting the model as well as the best parameters and the score to expect when using the best parameters.

search.fit(X,y)

search.best_params_

Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}

search.best_score_

Out[32]: -164.93176650920856

The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean squared error of about 165, which is a little lower than the 176 of our single decision tree. We will see how this works when we run our model with the refined hyperparameters.

**AdaBoost Regression Model**

Below is our model but this time with the refined hyperparameters.

ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)

score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))

score

Out[36]: -174.52604137201791

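You can see the score is not as good, but it is within reason. Note that the code above sets the learning rate to 0.001 rather than the tuned value of 0.01, which likely explains the gap between this score and the grid search result.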

**Conclusion**

In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can help reduce the prediction error compared with a single decision tree.

# AdaBoost Classification in Python

Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successfully predict what prior models were unable to. The models are then combined, using the average for regression and a majority vote for classification. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.

Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.
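The way each new model focuses on what earlier models got wrong can be sketched with one round of the textbook binary AdaBoost update. The numbers below are made up, and scikit-learn's implementation differs in its details, so treat this only as an illustration of the idea.

```python
import numpy as np

# Five training examples that start with equal weights.
sample_weights = np.full(5, 0.2)
# Hypothetical outcome of the first weak model: only the last example is wrong.
misclassified = np.array([False, False, False, False, True])

# Weighted error rate of this weak model.
error = np.sum(sample_weights[misclassified])
# Say of the model in the final vote (textbook binary AdaBoost formula).
alpha = 0.5 * np.log((1 - error) / error)

# Increase the weight of misclassified examples and decrease the rest.
sample_weights = sample_weights * np.exp(np.where(misclassified, alpha, -alpha))
sample_weights = sample_weights / sample_weights.sum()

print(round(float(alpha), 3), sample_weights.round(3))
```

After the update, the one misclassified example carries half of the total weight, so the next weak model is pushed to get it right.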

For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.

- Data preparation
- Decision tree baseline model
- Hyperparameter tuning of Adaboost model
- AdaBoost model development

Below is some initial code

from sklearn.ensemble import AdaBoostClassifier

from sklearn import tree

from sklearn.model_selection import GridSearchCV

import numpy as np

from pydataset import data

import pandas as pd

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import KFold

**Data Preparation**

Data preparation is minimal in this situation. We will load our data and at the same time drop any NA values using the .dropna() function. In addition, we will place the independent variables in a dataframe called X and the dependent variable in a dataset called y. Below is the code.

df=data('cancer').dropna()

X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]

y=df['status']

**Decision Tree Baseline Model**

We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth, the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output.

```
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)

1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
```

You can see that the most accurate decision tree had a depth of 1. After that there was a general decline in accuracy.

We now can determine if the adaBoost model is better based on whether the accuracy is above 72%. Before we develop the AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.

**Hyperparameter Tuning AdaBoost Model**

In order to tune the hyperparameters, there are several things that we need to do. First, we need to initiate our AdaBoostClassifier with some basic settings. Then we need to create our search grid with the hyperparameters. There are two hyperparameters that we will set, and they are the number of estimators (n_estimators) and the learning rate.

Number of estimators has to do with how many trees are developed. The learning rate indicates how much each tree contributes to the overall results. We have to place in the grid several values for each of these. Once we set the arguments for the AdaBoostClassifier and the search grid, we combine all this information into an object called search. This object uses GridSearchCV and includes additional arguments for scoring, n_jobs, and cross-validation. Below is the code for all of this.

ada=AdaBoostClassifier()

search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}

search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)

We can now run the model of hyperparameter tuning and see the results. The code is below.

search.fit(X,y)

search.best_params_

Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}

search.best_score_

Out[34]: 0.7425149700598802

We can see that if the learning rate is set to 0.01 and the number of estimators to 1000, we can expect an accuracy of 74%. This is superior to our baseline model.

**AdaBoost Model**

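We can now run our AdaBoost classifier based on the recommended hyperparameters. Note that the `ada` object must first be re-created with the tuned values, for example `AdaBoostClassifier(n_estimators=1000, learning_rate=0.01)`, because the instance defined earlier uses the default settings. Below is the code.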

score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))

score

Out[36]: 0.7415441176470589

We knew we would get around 74% and that is what we got. It’s only an improvement of about 2 percentage points over the baseline, but depending on the context that can be a substantial difference.

**Conclusion**

In this post, we looked at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general uses many models to determine the most accurate classification in a sequential manner. Doing this will often lead to an improvement in the prediction of a model.

# Recommendation Engine with Python

Recommendation engines make future suggestions to a person based on their prior behavior. There are several ways to develop recommendation engines, but for our purposes we will be looking at the development of a user-based collaborative filter. This type of filter uses the ratings of other users to suggest future items to a given user.

Making a recommendation engine in Python actually does not take much code and is somewhat easy considering what can be done through coding. We will make a movie recommendation engine using data from movielens.

Below is the link for downloading the zip file

Inside the zip file are several files we will use. We will use each in a few moments. Below is the initial code to get started

```
import pandas as pd
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
import numpy as np
```

We will now make 4 dataframes. Dataframes 1-3 will be the user, rating, and movie title data. The last dataframe will be a merger of the first 3. The code is below with a printout of the final result.

```
user = pd.read_table('/home/darrin/Documents/python/new/ml-1m/users.dat', sep='::', header=None, names=['user_id', 'gender', 'age', 'occupation', 'zip'],engine='python')
rating = pd.read_table('/home/darrin/Documents/python/new/ml-1m/ratings.dat', sep='::', header=None, names=['user_id', 'movie_id', 'rating', 'timestamp'],engine='python')
movie = pd.read_table('/home/darrin/Documents/python/new/ml-1m/movies.dat', sep='::', header=None, names=['movie_id', 'title', 'genres'],engine='python')
MovieAll = pd.merge(pd.merge(rating, user), movie)
```

We now need to create a matrix using the .pivot_table function. This matrix will have one row per user_id and one column per movie title, with the ratings from our “MovieAll” dataframe as the values. We will then save the column labels in an object called “movie_index”. This index will help us keep track of what movie each column represents. The code is below.

`rating_mtx_df = MovieAll.pivot_table(values='rating', index='user_id', columns='title', fill_value=0)`

`movie_index = rating_mtx_df.columns`

There are many variables in our matrix. This makes the computation long and expensive. To address this, we will reduce the dimensions using TruncatedSVD. We will reduce the matrix to 20 components. We also need to transpose the data because we want the Vh matrix and not the U matrix. All this is handled in the code below.

```
recomm = TruncatedSVD(n_components=20, random_state=10)
R = recomm.fit_transform(rating_mtx_df.values.T)
```

We saved our modified dataset as “R”. If we were to print this, it would show that each row (one per movie) has 20 columns of numbers that cannot be interpreted by us directly. Instead, we will move to the actual recommendation part of this post.
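To give a feel for what the reduced matrix looks like, the sketch below performs a comparable reduction on a tiny made-up ratings matrix using NumPy's full SVD. TruncatedSVD uses a different solver, so the signs and exact values can differ, and the names `ratings` and `R_demo` are hypothetical.

```python
import numpy as np

# Made-up ratings: 4 users (rows) by 5 movies (columns), 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [0, 1, 5, 4, 3],
    [1, 0, 4, 5, 4],
], dtype=float)

n_components = 2

# Transpose so that each row is a movie, as in the post.
U, s, Vt = np.linalg.svd(ratings.T, full_matrices=False)

# Keep only the strongest components; each movie becomes a short vector.
R_demo = U[:, :n_components] * s[:n_components]
print(R_demo.shape)
```

Each of the 5 movies is now described by 2 numbers instead of 4 user ratings, in the same way that each movie in the post is described by 20 numbers instead of thousands.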

To get a recommendation, you have to tell Python the movie that you watched first. Python will then compare this movie with the other movies in the dataset and provide recommendations based on which movies have the highest correlation with the movie that was watched. Here the comparison is based on the rating patterns captured in “R”.

We are going to tell Python that we watched “One Flew Over the Cuckoo’s Nest” and see what movies it recommends.

First, we need to pull the information for just “One Flew Over the Cuckoo’s Nest” and place this in a matrix. Then we need to calculate the correlations of all our movies using the modified dataset we named “R”. These two steps are completed below.

```
cuckoo_idx = list(movie_index).index("One Flew Over the Cuckoo's Nest (1975)")
correlation_matrix = np.corrcoef(R)
```

Now we can determine which movies have the highest correlation with our movie. However, to determine this, we must give Python a range of acceptable correlations. For our purposes we will set this between 0.93 and 1.0. The code is below with the recommendations.

```
P = correlation_matrix[cuckoo_idx]
print (list(movie_index[(P > 0.93) & (P < 1.0)]))
['Graduate, The (1967)', 'Taxi Driver (1976)']
```

You can see that the engine recommended two movies, which are “The Graduate” and “Taxi Driver”. We could increase the number of recommendations by lowering the correlation requirement if we desired.
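The same lookup can be tried on a toy example to see how the correlation threshold behaves. In the sketch below the titles and the matrix `R_demo` are made up, and the watched movie is excluded by its index rather than with an upper bound of 1.0, since floating point rounding may leave a movie's correlation with itself slightly below 1.

```python
import numpy as np

titles = np.array(['Movie A', 'Movie B', 'Movie C', 'Movie D'])  # hypothetical titles

# One row per movie, standing in for the reduced matrix "R".
R_demo = np.array([
    [1.0, 2.0, 3.0, 4.0],   # Movie A
    [2.0, 4.0, 6.0, 8.1],   # Movie B: almost the same pattern as A
    [4.0, 3.0, 2.0, 1.0],   # Movie C: the opposite pattern
    [1.0, 3.0, 2.0, 4.0],   # Movie D: loosely similar to A
])

correlation_matrix = np.corrcoef(R_demo)

watched_idx = list(titles).index('Movie A')
P = correlation_matrix[watched_idx]

# Keep highly correlated movies, but never recommend the watched movie itself.
mask = P > 0.93
mask[watched_idx] = False
recommended = titles[mask].tolist()
print(recommended)
```

Lowering the threshold from 0.93 to, say, 0.75 would also let Movie D through, which mirrors the trade-off between the number and the quality of recommendations.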

**Conclusion**

Recommendation engines are a great tool for generating sales automatically for customers. Understanding the basics of how to do this is a practical application of machine learning.

# Elastic Net Regression in Python

Elastic net regression combines the power of ridge and lasso regression into one algorithm. What this means is that elastic net can remove weak variables altogether, as with lasso, or reduce their coefficients to close to zero, as with ridge. All of these algorithms are examples of regularized regression.
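The mix of the two penalties is controlled by two settings that appear later in this post: alpha (the overall strength of the penalty) and l1_ratio (how much of it is lasso-style). The sketch below computes the penalty term in the form given in scikit-learn's ElasticNet documentation; the function name and the coefficients are made up for this illustration.

```python
import numpy as np

def elastic_net_penalty(coefficients, alpha, l1_ratio):
    """Penalty term in the form used by scikit-learn's ElasticNet objective."""
    coefficients = np.asarray(coefficients, dtype=float)
    l1_part = np.sum(np.abs(coefficients))   # lasso-style penalty
    l2_part = np.sum(coefficients ** 2)      # ridge-style penalty
    return alpha * l1_ratio * l1_part + 0.5 * alpha * (1 - l1_ratio) * l2_part

coefs = [1.0, -2.0, 0.0]   # made-up coefficients

mixed = elastic_net_penalty(coefs, alpha=0.001, l1_ratio=0.8)
lasso_only = elastic_net_penalty(coefs, alpha=0.001, l1_ratio=1.0)
ridge_only = elastic_net_penalty(coefs, alpha=0.001, l1_ratio=0.0)
print(mixed, lasso_only, ridge_only)
```

Setting l1_ratio to 1 leaves only the lasso-style part and setting it to 0 leaves only the ridge-style part, which is why values in between give a blend of the two behaviors.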

This post will provide an example of elastic net regression in Python. Below are the steps of the analysis.

- Data preparation
- Baseline model development
- Elastic net model development

To accomplish this, we will use the Fair dataset from the pydataset library. Our goal will be to predict marriage satisfaction based on the other independent variables. Below is some initial code to begin the analysis.

from pydataset import data

import numpy as np

import pandas as pd

pd.set_option('display.max_rows', 5000)

pd.set_option('display.max_columns', 5000)

pd.set_option('display.width', 10000)

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import ElasticNet

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

**Data Preparation**

We will now load our data. The only preparation that we need to do is convert the factor variables to dummy variables. Then we will make our X and y datasets. Below is the code.

df=pd.DataFrame(data('Fair'))

df.loc[df.sex== 'male', 'sex'] = 0

df.loc[df.sex== 'female','sex'] = 1

df['sex'] = df['sex'].astype(int)

df.loc[df.child== 'no', 'child'] = 0

df.loc[df.child== 'yes','child'] = 1

df['child'] = df['child'].astype(int)

X=df[['religious','age','sex','ym','education','occupation','nbaffairs']]

y=df['rate']

We can now proceed to creating the baseline model.

**Baseline Model**

This model is a basic regression model for the purpose of comparison. We will instantiate our regression model, use the fit command and finally calculate the mean squared error of the data. The code is below.

regression=LinearRegression()

regression.fit(X,y)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

1.0498738644696668

This mean squared error score of 1.05 is our benchmark for determining if the elastic net model will be better or worse. Below are the coefficients of this first model. We use a for loop to go through the model and the zip function to combine the two columns.

coef_dict_baseline = {}

for coef, feat in zip(regression.coef_,X.columns):

    coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[63]:

{'religious': 0.04235281110639178,

'age': -0.009059645428673819,

'sex': 0.08882013337087094,

'ym': -0.030458802565476516,

'education': 0.06810255742293699,

'occupation': -0.005979506852998164,

'nbaffairs': -0.07882571247653956}

We will now move to making the elastic net model.

**Elastic Net Model**

Elastic net, just like ridge and lasso regression, requires normalized data. This is requested through the normalize argument inside the ElasticNet function. The second thing we need to do is create our grid. This is the same grid that we created for ridge and lasso in prior posts. The only thing that is new is the l1_ratio argument.

When the l1_ratio is set to 0 it is the same as ridge regression. When l1_ratio is set to 1 it is lasso. Elastic net is somewhere between 0 and 1 when setting the l1_ratio. Therefore, in our grid, we need to set several values of this argument. Below is the code.

elastic=ElasticNet(normalize=True)

search=GridSearchCV(estimator=elastic,param_grid={'alpha':np.logspace(-5,2,8),'l1_ratio':[.2,.4,.6,.8]},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)
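To make the role of l1_ratio more concrete, here is a minimal sketch of the penalty term that scikit-learn documents for ElasticNet. The function name and the coefficient values are hypothetical and only serve as an illustration; the real estimator adds this penalty to a scaled squared-error loss.

```python
def elastic_net_penalty(coefs, alpha, l1_ratio):
    """Penalty added to the loss: a mix of an l1 (lasso) and an l2 (ridge) term."""
    l1 = sum(abs(c) for c in coefs)   # lasso part: sum of absolute values
    l2 = sum(c * c for c in coefs)    # ridge part: sum of squares
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1 - l1_ratio) * l2

coefs = [0.5, -0.25, 0.0]  # hypothetical coefficients, not taken from the Fair model

ridge_like = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.0)  # only the l2 term remains
lasso_like = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=1.0)  # only the l1 term remains
mixed = elastic_net_penalty(coefs, alpha=1.0, l1_ratio=0.8)       # mostly l1, a little l2
print(ridge_like, lasso_like, mixed)
```

With l1_ratio at 0 only the squared term is left, and with l1_ratio at 1 only the absolute-value term is left, which is why the grid tries several values in between.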

We will now fit our model and display the best parameters and the best results we can get with that setup.

search.fit(X,y)

search.best_params_

Out[73]: {'alpha': 0.001, 'l1_ratio': 0.8}

abs(search.best_score_)

Out[74]: 1.0816514028705004

The best hyperparameters were an alpha of 0.001 and an l1_ratio of 0.8. With these settings we got a cross-validated MSE of 1.08. This is above the MSE of 1.05 for the baseline model, which suggests that elastic net is doing slightly worse than linear regression, although the baseline score was computed on the training data rather than through cross-validation. For clarity, we will set our hyperparameters close to the recommended values (the code below uses an l1_ratio of 0.75) and run the model on the data.

elastic=ElasticNet(normalize=True,alpha=0.001,l1_ratio=0.75)

elastic.fit(X,y)

second_model=(mean_squared_error(y_true=y,y_pred=elastic.predict(X)))

print(second_model)

1.0566430678343806

Now our values are about the same. Below are the coefficients.

coef_dict_baseline = {}

for coef, feat in zip(elastic.coef_,X.columns):

    coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[76]:

{'religious': 0.01947541724957858,

'age': -0.008630896492807691,

'sex': 0.018116464568090795,

'ym': -0.024224831274512956,

'education': 0.04429085595448633,

'occupation': -0.0,

'nbaffairs': -0.06679513627963515}

The coefficients are similar in sign, though most are shrunk toward zero. Notice that occupation was completely removed from the model in the elastic net version. This means that this variable was of no use to the algorithm. Traditional regression cannot do this.
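One practical note on the normalize argument used above: it was deprecated in scikit-learn 1.0 and removed in 1.2, so ElasticNet(normalize=True) raises an error on recent versions. A minimal sketch of one alternative is to standardize inside a pipeline. This assumes synthetic data in place of the Fair dataset, and the variable names are hypothetical. StandardScaler does not scale in exactly the same way as the old normalize option, so the best alpha values will likely differ from those shown in this post.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: only the first two columns carry signal.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

# Scaling happens inside each cross-validation fold because it is part of the pipeline.
pipe = make_pipeline(StandardScaler(), ElasticNet(max_iter=10000))

# make_pipeline names each step after its lowercased class name.
param_grid = {
    "elasticnet__alpha": np.logspace(-5, 2, 8),
    "elasticnet__l1_ratio": [0.2, 0.4, 0.6, 0.8],
}
search_demo = GridSearchCV(pipe, param_grid, scoring="neg_mean_squared_error", cv=10)
search_demo.fit(X_demo, y_demo)

best_params = search_demo.best_params_
best_mse = abs(search_demo.best_score_)
print(best_params, best_mse)
```

Keeping the scaler inside the pipeline means the test fold never influences the scaling, which keeps the cross-validated score honest.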

**Conclusion**

This post provided an example of elastic net regression. Elastic net regression allows for the maximum flexibility in terms of finding the best combination of ridge and lasso regression characteristics. This flexibility is what gives elastic net its power.

# Lasso Regression with Python

Lasso regression is another form of regularized regression. With this particular version, the coefficient of a variable can be reduced all the way to zero through the use of the l1 regularization. This is in contrast to ridge regression which never completely removes a variable from an equation as it employs l2 regularization.

Regularization helps to stabilize estimates as well as deal with bias and variance in a model. In this post, we will use the “Caschool” dataset from the pydataset library. Our goal will be to predict test scores based on several independent variables. The steps we will follow are listed below.

- Data preparation
- Develop a baseline linear model
- Develop lasso regression model

The initial code is as follows

from pydataset import data

import numpy as np

import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Lasso

df=pd.DataFrame(data('Caschool'))

**Data Preparation**

The data preparation is simple in this example. We only have to store the desired variables in our X and y datasets. We are not using all of the variables. Some were left out because they were highly correlated. Lasso is able to deal with this to a certain extent, but it was decided to leave them out anyway. Below is the code.

X=df[['teachers','calwpct','mealpct','compstu','expnstu','str','avginc','elpct']]

y=df['testscr']

**Baseline Model**

We can now run our baseline model. This will give us a measure of comparison for the lasso model. Our metric is the mean squared error. Below is the code with the results of the model.

regression=LinearRegression()

regression.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

69.07380530137416

First, we instantiate the LinearRegression class. Then, we run the .fit method to do the analysis. Next, we predict values with our regression model, calculate the mean squared error of those predictions, and save the result to the object first_model. Lastly, we print the result.

Below are the coefficients for the baseline regression model.

coef_dict_baseline = {}

for coef, feat in zip(regression.coef_,X.columns):

    coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[52]:

{'teachers': 0.00010011947964873427,

'calwpct': -0.07813766458116565,

'mealpct': -0.3754719080127311,

'compstu': 11.914006268826652,

'expnstu': 0.001525630709965126,

'str': -0.19234209691788984,

'avginc': 0.6211690806021222,

'elpct': -0.19857026121348267}

The for loop simply combines the features in our model with their coefficients. With this information we can now make our lasso model and compare the results.

**Lasso Model**

For our lasso model, we have to determine what value to set the l1 penalty, or alpha, to prior to creating the model. This can be done with the GridSearchCV class, which allows you to assess several models with different alpha settings. Python will then tell us which setting is the best. Below is the code.

lasso=Lasso(normalize=True)

search=GridSearchCV(estimator=lasso,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

search.fit(X,y)

We start by instantiating lasso with normalization set to true. It is important to scale data when doing regularized regression. Next, we set up our grid, which includes the estimator, the parameter grid, and the scoring metric. The alpha is set using logspace. We want exponents between -5 and 2, that is, values from 10^-5 to 10^2, and we want 8 settings for alpha evenly spaced on the log scale. The other arguments include cv, which stands for cross-validation and sets the number of folds. n_jobs affects how many processes are used, and refit retrains the best model on the full dataset once the search is done.
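The alpha grid described above can be reproduced by hand to see exactly which values are tested. The helper below is a plain-Python stand-in written for illustration (its name is hypothetical); it is meant to mirror what np.logspace(-5, 2, 8) returns.

```python
def logspace(start, stop, num):
    """Return num values 10**e, with the exponent e evenly spaced from start to stop."""
    step = (stop - start) / (num - 1)
    return [10.0 ** (start + i * step) for i in range(num)]

alphas = logspace(-5, 2, 8)
print(alphas)  # eight candidates, each ten times larger than the previous one
```

Because the spacing is even in the exponent, each candidate alpha is ten times the one before it, which covers a very wide range with only a few models.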

After completing this, we used the fit function. The code below indicates the appropriate alpha and the expected score if we ran the model with this alpha setting.

search.best_params_

Out[55]: {'alpha': 1e-05}

abs(search.best_score_)

Out[56]: 85.38831122904011

The alpha is set almost to zero, which makes the model nearly the same as an ordinary regression model. You can also see that the cross-validated mean squared error is actually worse than in the baseline model. In the code below, we run the lasso model with the recommended alpha setting and print the results.

lasso=Lasso(normalize=True,alpha=1e-05)

lasso.fit(X,y)

second_model=(mean_squared_error(y_true=y,y_pred=lasso.predict(X)))

print(second_model)

69.0738055527604

The value for the second model is almost the same as the first one. The tiny difference is due to the fact that there is some penalty involved. Below are the coefficient values.

coef_dict_baseline = {}

for coef, feat in zip(lasso.coef_,X.columns):

    coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[63]:

{'teachers': 9.795933425676567e-05,

'calwpct': -0.07810938255735576,

'mealpct': -0.37548182158171706,

'compstu': 11.912164626067028,

'expnstu': 0.001525439984250718,

'str': -0.19225486069458508,

'avginc': 0.6211695477945162,

'elpct': -0.1985510490295491}

The coefficient values are also only slightly different. The teachers variable is essentially set to zero, although its coefficient was already close to zero in the baseline model. This suggests that it is not a useful variable for predicting test scores. That is ironic to say the least.

**Conclusion**

Lasso regression is able to remove variables that are not adequate predictors of the outcome variable. Doing this in Python is fairly simple. This is yet another tool that can be used in statistical analysis.

# Ridge Regression in Python

Ridge regression is one of several regularized linear models. Regularization is the process of penalizing the coefficients of variables, either by removing them or by reducing their impact. Ridge regression reduces the effect of problematic variables close to zero but never fully removes them.

We will go through an example of ridge regression using the VietNamI dataset available in the pydataset library. Our goal will be to predict expenses based on the variables available. We will complete this task using the following steps.

- Data preparation
- Baseline model development
- Ridge regression model

Below is the initial code

from pydataset import data

import numpy as np

import pandas as pd

from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import Ridge

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

**Data Preparation**

The data preparation is simple. All we have to do is load the data and convert the sex variable to a dummy variable. We also need to set up our X and y datasets. Below is the code.

df=pd.DataFrame(data('VietNamI'))

df.loc[df.sex== 'male', 'sex'] = 0

df.loc[df.sex== 'female','sex'] = 1

df['sex'] = df['sex'].astype(int)

X=df[['pharvis','age','sex','married','educ','illness','injury','illdays','actdays','insurance']]

y=df['lnhhexp']

We can now create our baseline regression model.

**Baseline Model**

The metric we are using is the mean squared error. Below is the code and output for our baseline regression model. This is a model that has no regularization to it.

regression=LinearRegression()

regression.fit(X,y)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

0.35528915032173053

This value of 0.355289 will be our indicator to determine if the regularized ridge regression model is superior or not.
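For reference, the mean squared error used as the metric here is simply the average of the squared differences between the actual and predicted values. The sketch below uses made-up numbers and a hypothetical function name; it is meant to show the arithmetic behind sklearn's mean_squared_error, not to replace it.

```python
def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

y_true_demo = [3.0, 5.0, 2.0]  # made-up actual values
y_pred_demo = [2.5, 5.0, 4.0]  # made-up predictions

mse_demo = mean_squared_error_manual(y_true_demo, y_pred_demo)
print(mse_demo)  # (0.25 + 0.0 + 4.0) / 3
```

Because the errors are squared, one large miss (the third value here) dominates the score, and lower values indicate a better fit.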

**Ridge Model**

In order to create our ridge model we need to first determine the most appropriate value for the l2 regularization. The strength of the l2 penalty is the hyperparameter used in ridge regression, and scikit-learn calls it alpha. Determining the value of a hyperparameter requires the use of a grid. In the code below, we first create our ridge model and indicate normalization in order to get better estimates. Next we set up the grid that we will use. Below is the code.

ridge=Ridge(normalize=True)

search=GridSearchCV(estimator=ridge,param_grid={'alpha':np.logspace(-5,2,8)},scoring='neg_mean_squared_error',n_jobs=1,refit=True,cv=10)

The search object has several arguments within it. Alpha is the hyperparameter we are trying to set. The log space is the range of values we want to test. We want exponents from -5 to 2, but we only take 8 values from within that range, evenly spread out on the log scale. Our metric is the mean squared error. Setting refit to true means the best model is refit on the whole dataset, and cv is the number of folds to develop for the cross-validation. We can now use the .fit function to run the model and then use the .best_params_ and .best_score_ attributes to determine the model's strength. Below is the code.

search.fit(X,y)

search.best_params_

{'alpha': 0.01}

abs(search.best_score_)

0.3801489007094425

The best_params_ tells us what to set alpha to, which in this case is 0.01. The best_score_ tells us what the best cross-validated mean squared error is. In this case, the value of 0.38 is worse than what the baseline model was. We can check this by fitting our model with the ridge information and finding the mean squared error. This is done below.

ridge=Ridge(normalize=True,alpha=0.01)

ridge.fit(X,y)

second_model=(mean_squared_error(y_true=y,y_pred=ridge.predict(X)))

print(second_model)

0.35529321992606566

The 0.35 is lower than the 0.38. This is because the last results are not cross-validated. In addition, these results indicate that there is little difference between the ridge and baseline models. This is confirmed with the coefficients of each model found below.

coef_dict_baseline = {}

for coef, feat in zip(regression.coef_,X.columns):

    coef_dict_baseline[feat] = coef

coef_dict_baseline

Out[188]:

{'pharvis': 0.013282050886950674,

'age': 0.06480086550467873,

'sex': 0.004012412278795848,

'married': -0.08739614349708981,

'educ': 0.075276463838362,

'illness': -0.06180921300600292,

'injury': 0.040870384578962596,

'illdays': -0.002763768716569026,

'actdays': -0.006717063310893158,

'insurance': 0.1468784364977112}

coef_dict_ridge = {}

for coef, feat in zip(ridge.coef_,X.columns):

    coef_dict_ridge[feat] = coef

coef_dict_ridge

Out[190]:

{'pharvis': 0.012881937698185289,

'age': 0.06335455237380987,

'sex': 0.003896623321297935,

'married': -0.0846541637961565,

'educ': 0.07451889604357693,

'illness': -0.06098723778992694,

'injury': 0.039430607922053884,

'illdays': -0.002779341753010467,

'actdays': -0.006551280792122459,

'insurance': 0.14663287713359757}

The coefficient values are about the same. This means that the penalization made little difference with this dataset.

**Conclusion**

Ridge regression allows you to penalize variables based on their usefulness in developing the model. With this form of regularized regression the coefficients of the variables are never set to zero. Other forms of regularized regression allow for the total removal of variables. One example of this is lasso regression.

# Hyperparameter Tuning in Python

Hyperparameters are settings you must choose yourself when developing a model. This is often one of the last steps of model development. Choosing an algorithm and determining which variables to include often come before this step.

Algorithms cannot determine hyperparameters themselves, which is why you have to do it. The problem is that the typical person has no idea what an optimal choice for a hyperparameter is. To deal with this confusion, a range of values is often supplied and it is left to Python to determine which combination of hyperparameters is most appropriate.

In this post, we will learn how to set hyperparameters by developing a grid in Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and not married based on several independent variables. The steps of this process are as follows.

- Data preparation
- Baseline model (for comparison)
- Grid development
- Revised model

Below is some initial code that includes all the libraries and classes that we need.

import pandas as pd

import numpy as np

from pydataset import data

pd.set_option('display.max_rows', 5000)

pd.set_option('display.max_columns', 5000)

pd.set_option('display.width', 10000)

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score

from sklearn.model_selection import GridSearchCV

**Data Preparation**

The dataset PSID has several problems that we need to address.

- We need to remove all NAs
- The married variable will be converted to a dummy variable. It will simply be changed to married or not rather than all of the other possible categories.
- The educatnn and kids variables have codes that are 98 and 99. These need to be removed because they do not make sense.

Below is the code that deals with all of this

df=data('PSID').dropna()

df.loc[df.married!= 'married', 'married'] = 0

df.loc[df.married== 'married','married'] = 1

df['married'] = df['married'].astype(int)

df['marry']=df.married

df.drop(df.loc[df['kids']>90].index, inplace=True)

df.drop(df.loc[df['educatn']>90].index, inplace=True)

- Line 1 loads the dataset and drops the NAs
- Lines 2-5 create our dummy variable for marriage. We create a new variable called marry to hold the results
- Lines 6-7 drop the values in kids and educatn that are above 90.

Below we create our X and y datasets and then are ready to make our baseline model.

X=df[['age','educatn','hours','kids','earnings']]

y=df['marry']

**Baseline Model**

The purpose of the baseline model is to see how much better or worse the model performs after hyperparameter tuning. We are using K Nearest Neighbors for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.

- number of neighbors
- weight of neighbors
- metric for measuring distance
- power parameter for minkowski

Below is the baseline model with the set hyperparameters. The second line shows the accuracy of the model after a k-fold cross-validation that was set to 10.

classifier=KNeighborsClassifier(n_neighbors=5,weights='uniform', metric='minkowski',p=2)

np.mean(cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1))

0.6188104238047426

Our model has an accuracy of about 62%. We will now move to setting up our grid so we can see if tuning the hyperparameters improves the performance
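The 10-fold cross-validation used above can be pictured as splitting the rows into 10 groups and letting each group serve once as the test set. Below is a minimal sketch of that idea with hypothetical helper names. It uses simple contiguous folds; for a classifier, cross_val_score with an integer cv uses stratified folds that also preserve the class proportions, so this illustrates the principle rather than the exact splits.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    # The first (n_samples % k) folds receive one extra row.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def train_test_pairs(n_samples, k):
    """Yield (train_indices, test_indices), with each fold used once as the test set."""
    folds = k_fold_indices(n_samples, k)
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx

pairs = list(train_test_pairs(n_samples=25, k=10))
for train_idx, test_idx in pairs[:2]:
    print(len(train_idx), test_idx)
```

The reported accuracy is the mean over the 10 test folds, so every row is used for testing exactly once and for training nine times.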

**Grid Development**

The grid allows you to develop scores of models with all of the hyperparameters tuned slightly differently. In the code below, we create our grid object, and then we calculate how many models we will run

grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}

np.prod([len(grid[element]) for element in grid])

96

You can see we made a simple list that has several values for each hyperparameter

- Number of neighbors can be 1 to 12
- weight of neighbors can be uniform or distance
- metric can be manhattan or minkowski
- p can be 1 or 2
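The grid described above can be expanded by hand to see where the model count comes from. This is a minimal sketch using only the standard library; GridSearchCV does this expansion internally, and the variable names here are chosen for illustration.

```python
from itertools import product

grid = {
    'n_neighbors': range(1, 13),          # 12 values: 1 through 12
    'weights': ['uniform', 'distance'],
    'metric': ['manhattan', 'minkowski'],
    'p': [1, 2],
}

# One dictionary per combination of settings.
names = list(grid)
combinations = [dict(zip(names, values)) for values in product(*grid.values())]

n_models = len(combinations)   # 12 * 2 * 2 * 2
n_fits = n_models * 10         # each combination is fit once per fold when cv=10
print(n_models, n_fits)
print(combinations[0])
```

Since range(1,13) stops at 12, there are 12 neighbor settings, and every extra hyperparameter value multiplies the total, so grids can grow quickly.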

We will develop 96 models all together. Below is the code to begin tuning the hyperparameters.

search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)

search.fit(X,y)

The estimator is the algorithm we are using, which we set earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs has to do with the amount of resources committed to the process. refit retrains the best model on all of the data, and cv is the number of cross-validation folds. The search.fit command runs the models.

The code below provides the output for the results.

print(search.best_params_)

print(search.best_score_)

{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}

0.6503975265017667

The best_params_ attribute tells us what the most appropriate parameters are. The best_score_ tells us what the accuracy of the model is with the best parameters. Our model accuracy improves from 62% to 65% from adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyperparameters.

**Model Revision**

Below is the code for the revised model.

classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)

np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1)) #new res

Out[24]: 0.6503909993913031

Exactly as we thought. This is a small improvement, but it can make a big difference in some situations, such as in a data science competition.

**Conclusion**

Tuning hyperparameters is one of the final pieces to improving a model. With this tool, small, gradual changes can be seen in a model. It is important to keep this aspect of model development in mind in order to have the best final results.

# Variable Selection in Python

A key concept in machine learning and data science in general is variable selection. Sometimes, a dataset can have hundreds of variables to include in a model. The benefit of variable selection is that it reduces the amount of useless information aka noise in the model. By removing noise it can improve the learning process and help to stabilize the estimates.

In this post, we will look at two ways to do this. These two common approaches are the univariate approach and the greedy approach. The univariate approach selects variables that are most related to the dependent variable based on a metric. The greedy approach will only remove a variable if getting rid of it does not affect the model’s performance.

We will now move to our first example which is the univariate approach using Python. We will use the VietNamH dataset from the pydataset library. Our goal is to predict how much a family spends on medical expenses. Below is the initial code.

import pandas as pd

import numpy as np

from pydataset import data

from sklearn.linear_model import LinearRegression

from sklearn.feature_selection import SelectPercentile

from sklearn.feature_selection import f_regression

df=data('VietNamH').dropna()

Our data is called df. If you use the head function, you will see that we need to convert several variables to dummy variables. Below is the code for doing this.

df.loc[df.sex== 'female', 'sex'] = 0

df.loc[df.sex== 'male','sex'] = 1

df.loc[df.farm== 'no', 'farm'] = 0

df.loc[df.farm== 'yes','farm'] = 1

df.loc[df.urban== 'no', 'urban'] = 0

df.loc[df.urban== 'yes','urban'] = 1

We now need to set up our X and y datasets as shown below.

X=df[['age','educyr','sex','hhsize','farm','urban','lnrlfood']]

y=df['lnmed']

We are now ready to actually use the univariate approach. This involves the use of two different tools from scikit-learn. The SelectPercentile class allows you to only include the variables that meet a certain percentile rank such as 25%. The f_regression function is designed for checking a variable’s performance in the context of regression. Below is the code to run the analysis.

selector_f=SelectPercentile(f_regression,percentile=25)

selector_f.fit(X,y)

We can now see the results using a for loop. We want the scores from our selector_f object. To do this we set up a for loop and use the zip function to iterate over the data. The output is placed in the print statement. Below is the code and output for this.

for n,s in zip(X,selector_f.scores_):

    print('F-score: %3.2f\t for feature %s ' % (s,n))

F-score: 62.42 for feature age

F-score: 33.86 for feature educyr

F-score: 3.17 for feature sex

F-score: 106.35 for feature hhsize

F-score: 14.82 for feature farm

F-score: 5.95 for feature urban

F-score: 97.77 for feature lnrlfood

You can see the f-score for all of the independent variables. You can decide for yourself which to include.
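To give a feel for where such scores come from, here is a minimal sketch of a univariate F-score for a single feature. It assumes the usual conversion from the Pearson correlation r between the feature and the target, F = r² / (1 - r²) × (n - 2), which is my understanding of how f_regression scores each variable. The data and function names are made up for illustration.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def univariate_f_score(x, y):
    """Convert the correlation into an F statistic with n - 2 degrees of freedom."""
    r = pearson_r(x, y)
    dof = len(x) - 2
    return r ** 2 / (1 - r ** 2) * dof

x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_strong = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]  # nearly linear in x_demo
y_weak = [3.0, 1.0, 4.0, 1.0, 5.0, 2.0]    # almost unrelated to x_demo

f_strong = univariate_f_score(x_demo, y_strong)
f_weak = univariate_f_score(x_demo, y_weak)
print(round(f_strong, 2), round(f_weak, 2))
```

A feature that tracks the target closely gets a large score and a weakly related one gets a score near zero, which matches the pattern in the output above where hhsize and lnrlfood rank highest and sex ranks lowest.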

**Greedy Approach**

The greedy approach only removes variables if they do not impact model performance. We are using the same dataset so all we have to do is run the code. We need the RFECV class from the feature_selection module. We then use RFECV and set the estimator, cross-validation, and scoring metric. Finally, we run the analysis and print the results. The code is below with the output.

from sklearn.feature_selection import RFECV

regression=LinearRegression()

select=RFECV(estimator=regression,cv=10,scoring='neg_mean_squared_error')

select.fit(X,y)

print(select.n_features_)

7

The number 7 represents how many independent variables to include in the model. Since we only had 7 total variables we should include all variables in the model.

**Conclusion**

With the help of the univariate and greedy approaches, it is possible to deal with a large number of variables efficiently when developing models. The example here involves only a handful of variables. However, bear in mind that the approaches mentioned here are highly scalable and useful.

# Scatter Plots in Python

Scatterplots are one of many crucial forms of visualization in statistics. With scatterplots, you can examine the relationship between two variables. This can lead to insights in terms of decision making or additional analysis.

We will be using the “Prestige” dataset from the pydataset module to look at scatterplot use. Below is some initial code.

from pydataset import data

import matplotlib.pyplot as plt

import pandas as pd

import seaborn as sns

df=data('Prestige')

We will begin by making a correlation matrix. This will help us to determine which pairs of variables have strong relationships with each other. This will be done with the .corr() function. Below is the code.

You can see that there are several strong relationships. For our purposes, we will look at the relationship between education and income.

The seaborn library is rather easy to use for making visuals. To make a plot you can use the .lmplot() function. Below is a basic scatterplot of our data.

facet = sns.lmplot(data=df, x='education', y='income',fit_reg=False)

The code should be self-explanatory. The only thing that might be unknown is the fit_reg argument. This is set to False so that the function does not make a regression line. Below is the same visual but this time with the regression line.

facet = sns.lmplot(data=df, x='education', y='income',fit_reg=True)

It is also possible to add a third variable to our plot. One of the more common ways is through including a categorical variable. Therefore, we will look at job type and see what the relationship is. To do this we use the same .lmplot() function but include several additional arguments. These include the hue and the indication of a legend. Below is the code and output.

You can clearly see that type separates education and income. A look at the boxplots for these variables confirms this.

As you can see, we can conclude that job type influences both education and income in this example.

**Conclusion**

This post focused primarily on making scatterplots with the seaborn package. Scatterplots are a tool that all data analysts should be familiar with, as they can be used to communicate information to people who must make decisions.

# Data Visualization in Python

In this post, we will look at how to set up various features of a graph in Python. The fine-tuning tweaks that can be made when creating a data visualization can enhance the communication of results with an audience. This will all be done using the matplotlib module available for Python. Our objectives are as follows.

- Make a graph with two lines
- Set the tick marks
- Change the linewidth
- Change the line color
- Change the shape of the line
- Add a label to each axes
- Annotate the graph
- Add a legend and title

We will use two variables from the “toothpaste” dataset from the pydataset module for this demonstration. Below is some initial code.

from pydataset import data

import matplotlib.pyplot as plt

DF = data('toothpaste')

**Make Graph with Two Lines**

To make a plot you use the .plot() function. Inside the parentheses you put the dataframe and variable you want. If you want more than one line or graph you use the .plot() function several times. Below is the code for making a graph with two line plots using variables from the toothpaste dataset.

plt.plot(DF['meanA'])

plt.plot(DF['sdA'])

To get the graph above you must run both lines of code simultaneously. Otherwise, you will get two separate graphs.

**Set Tick Marks**

Setting the tick marks requires the use of the .axes() function. However, it is common to save the result of this function in a variable called ax as a handle. This makes coding easier. Once this is done you can use the .set_xticks() function for the x-axis and .set_yticks() for the y-axis. In our example below, we are setting the tick marks for the odd numbers only. Below is the code.

ax=plt.axes()

ax.set_xticks([1,3,5,7,9])

ax.set_yticks([1,3,5,7,9])

plt.plot(DF['meanA'])

plt.plot(DF['sdA'])

**Changing the Line Type**

It is also possible to change the line type and width. There are several options for the line type. The important thing here is to put this information after the data you want to plot inside the code. Line width is changed with an argument that has the same name. Below is the code and visual

ax=plt.axes()

ax.set_xticks([1,3,5,7,9])

ax.set_yticks([1,3,5,7,9])

plt.plot(DF['meanA'],'--',linewidth=3)

plt.plot(DF['sdA'],':',linewidth=3)

**Changing the Line Color**

It is also possible to change the line color. There are several options available. The important thing is that the argument for the line color goes inside the same parentheses as the line type. Below is the code. r means red and k means black.

ax=plt.axes()

ax.set_xticks([1,3,5,7,9])

ax.set_yticks([1,3,5,7,9])

plt.plot(DF['meanA'],'r--',linewidth=3)

plt.plot(DF['sdA'],'k:',linewidth=3)

**Change the Point Type**

Changing the point type requires more code inside the same quotation marks where the line color and line type are. Again there are several choices here. The code is below

ax=plt.axes()

ax.set_xticks([1,3,5,7,9])

ax.set_yticks([1,3,5,7,9])

plt.plot(DF['meanA'],'or--',linewidth=3)

plt.plot(DF['sdA'],'Dk:',linewidth=3)

**Add X and Y Labels**

Adding labels is simple. You just use the .xlabel() function or .ylabel() function. Inside the parentheses, you put the text you want in quotation marks. Below is the code.

ax=plt.axes()

ax.set_xticks([1,3,5,7,9])

ax.set_yticks([1,3,5,7,9])

plt.xlabel('X Example')

plt.ylabel('Y Example')

plt.plot(DF['meanA'],'or--',linewidth=3)

plt.plot(DF['sdA'],'Dk:',linewidth=3)

**Adding Annotation, Legend, and Title**

Annotation allows you to write text directly inside the plot wherever you want. This involves the use of the .annotate function. Inside this function, you must indicate the location of the text and the actual text you want added to the plot. For our example, we will add the word ‘python’ to the plot for fun.

The .legend() function allows you to give a description of the line types that you have included. Lastly, the .title() function allows you to add a title. Below is the code.

ax=plt.axes()

ax.set_xticks([1,3,5,7,9])

ax.set_yticks([1,3,5,7,9])

plt.xlabel('X Example')

plt.ylabel('Y Example')

plt.annotate(xy=[3,4],text='Python')

plt.plot(DF['meanA'],'or--',linewidth=3)

plt.plot(DF['sdA'],'Dk:',linewidth=3)

plt.legend(['1st','2nd'])

plt.title("Plot Example")

**Conclusion**

Now you have a practical understanding of how you can communicate information visually with matplotlib in python. This is barely scratching the surface in terms of the potential that is available.

# Random Forest Regression in Python

Random forest is simply the making of dozens if not thousands of decision trees. For classification, the decisions the trees make about an example are tallied as votes, and the class that receives the most votes wins. For regression, the results of the trees are averaged in order to give the most accurate results.
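The two ways of combining trees described above can be sketched in a few lines. The per-tree outputs below are made up and the function names are hypothetical. Note also that scikit-learn's own random forest classifier averages class probabilities across trees rather than counting hard votes, so this is an illustration of the idea and not of that library's internals.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the class label predicted by the most trees."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the average of the numeric predictions from all trees."""
    return mean(tree_predictions)

votes = ['yes', 'no', 'yes', 'yes', 'no']  # made-up class votes from five trees
ages = [61.0, 64.5, 58.0, 63.0, 60.5]      # made-up age predictions from five trees

majority = aggregate_classification(votes)
average_age = aggregate_regression(ages)
print(majority, average_age)
```

Averaging many trees smooths out the mistakes of any single tree, which is the reason a forest is usually more stable than one decision tree.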

In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.

import pandas as pd

import numpy as np

from pydataset import data

from sklearn.model_selection import train_test_split

from sklearn.model_selection import cross_val_score

from sklearn.ensemble import RandomForestRegressor

from sklearn.metrics import mean_squared_error

We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below

df = data('cancer')

df=df.dropna()

X=df[['time','status',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]

y=df['age']

Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

h=RandomForestRegressor(n_estimators=100,random_state=1)

We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.

```
h.fit(x_train,y_train)
y_pred=h.predict(x_test)
mean_squared_error(y_test,y_pred)

71.75780196078432
```
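To see what the number returned by mean_squared_error() represents, here is a small sketch that computes the same quantity by hand. The lists are made-up values chosen for easy arithmetic, not the cancer data, and `mean_squared_error_manual` is a name invented for this example.

```python
def mean_squared_error_manual(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical ages and the predictions of two competing models
actual = [60, 70, 55, 65]
model_a = [62, 66, 58, 63]
model_b = [50, 80, 45, 75]

mse_a = mean_squared_error_manual(actual, model_a)
mse_b = mean_squared_error_manual(actual, model_b)
print(mse_a, mse_b)  # the model with the lower value fits these points better
```

Because the errors are squared, the result is in squared units. Taking the square root puts it back in the units of the target; for the value above that is roughly 8.5 years.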

The MSE of 71.76 is only useful for model comparison and has little meaning by itself. Another way to assess the model is by determining variable importance. This helps you to determine in a descriptive way the strongest variables for the regression model. The code is below followed by the plot of the variables.

```
model_ranks=pd.Series(h.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
ax=model_ranks.plot(kind='barh')
```

As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.

**Conclusion**

This post provided an example of the use of regression with random forest. By combining the predictions of many trees (voting for classification, averaging for regression), you can often improve on the accuracy of a single decision tree. This ensemble approach is a key strength of random forest compared with single-model algorithms.

# Bagging Classification with Python

Bootstrap aggregation, aka bagging, is a technique used in machine learning that relies on resampling from the sample and running multiple models from the different samples. The mean or some other value is calculated from the results of each model. For example, if you are using decision trees, bagging would have you run the model several times with several different subsamples to help reduce the variance of the predictions.
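The resample-then-aggregate idea can be sketched with the standard library alone. In this illustration the "model" is deliberately trivial (just the mean of each bootstrap sample), and the function names and data are assumptions made for the example rather than anything from scikit-learn.

```python
import random
from statistics import mean

def bootstrap_sample(values, rng):
    """Draw a sample of the same size as values, with replacement."""
    return [rng.choice(values) for _ in values]

def bagged_estimate(values, n_estimators=100, seed=0):
    """Fit a simple 'model' (the mean) on many bootstrap samples and average the results."""
    rng = random.Random(seed)
    estimates = [mean(bootstrap_sample(values, rng)) for _ in range(n_estimators)]
    return mean(estimates)

data_points = [2, 4, 4, 5, 7, 9, 10, 12]
estimate = bagged_estimate(data_points)
print(estimate)
```

With a real learner such as a decision tree or KNN, each bootstrap sample would train its own model and the predictions would be averaged or voted on in the same way.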

Bagging is an excellent tool for algorithms that are considered weaker or more susceptible to variances such as decision trees or KNN. In this post, we will use bagging to develop a model that determines whether or not people voted using the turnout dataset. These results will then be compared to a model that was developed in a traditional way.

We will use the turnout dataset available in the pydataset module. Below is some initial code.

```
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
```

We will load our dataset. Then we will separate the independent and dependent variables from each other and create our train and test sets. The code is below.

```
df=data("turnout")
X=df[['age','educate','income',]]
y=df['vote']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
```

We can now prepare to run our model. We need to first set up the bagging function. There are several arguments that need to be set. The max_samples argument determines the largest amount of the dataset to use in resampling. The max_features argument is the max number of features to use in a sample. Lastly, the n_estimators is for determining the number of subsamples to draw. The code is as follows

```
h=BaggingClassifier(KNeighborsClassifier(n_neighbors=7),max_samples=0.7,max_features=0.7,n_estimators=1000)
```

Basically, what we told Python was to use up to 70% of the samples, 70% of the features, and make 1,000 different KNN models that use seven neighbors to classify. Now we run the model with the fit function, make a prediction with the predict function, and check the accuracy with the classification_report function.

```
h.fit(X_train,y_train)
y_pred=h.predict(X_test)
print(classification_report(y_test,y_pred))
```

This looks okay. Below are the results when you fit a traditional model without bagging.

```
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
```

The improvement is not much. However, this depends on the purpose and scale of your project. A small improvement can mean millions in the right context, such as for large companies like Google that deal with billions of people per day.

**Conclusion**

This post provides an example of the use of bagging in the context of classification. Bagging provides a way to improve your model through the use of resampling.

# K Nearest Neighbor Classification with Python

K Nearest Neighbor (KNN) uses the idea of proximity to predict class. What this means is that with KNN Python will look at the K nearest neighbors to determine what the unknown example's class should be. It is your job to determine the K, or number of neighbors, that should be used to determine the unlabeled example's class.
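The proximity idea can be shown with a small from-scratch sketch. This is an illustration under simple assumptions (Euclidean distance, a plain majority vote, made-up points); `knn_predict` is a hypothetical helper, not the scikit-learn implementation.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the label of new_point by majority vote among its k nearest neighbors."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: dist(pair[0], new_point),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two small made-up clusters of labeled examples
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
labels = ["no", "no", "no", "yes", "yes", "yes"]

print(knn_predict(points, labels, (2, 2), k=3))
print(knn_predict(points, labels, (7, 8), k=3))
```

Every prediction sorts all training points by distance, which hints at why the method slows down as the dataset grows.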

KNN is great for a small dataset. However, it normally does not scale well when the dataset gets larger and larger. As such, unless you have an exceptionally powerful computer KNN is probably not going to do well in a Big Data context.

In this post, we will go through an example of the use of KNN with the turnout dataset from the pydataset module. We want to predict whether someone voted or not based on the independent variables. Below is some initial code.

```
from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
```

We now need to load the data and separate the independent variables from the dependent variable by making two datasets.

```
df=data("turnout")
X=df[['age','educate','income']]
y=df['vote']
```

Next, we will make our train and test sets with a 70/30 split. The random_state argument is set to 0, which allows us to reproduce our model if we want. After this, we will set K to 7 and run the model. This means that Python will look at the 7 closest examples to predict the value of an unknown example. Below is the code.

```
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
```

We can now predict with our model and see the results with the classification_report function.

```
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
```

The results are shown above. Judging the quality of the model relies more on domain knowledge. What we can say for now is that the model is better at classifying people who vote than people who do not vote.

**Conclusion**

This post shows you how to work with Python when using KNN. This algorithm is useful for using neighboring examples to predict the class of an unknown example.

# Naive Bayes with Python

Naive Bayes is a probabilistic classifier that is often employed when you have more than two classes in which you want to place your data. This algorithm is particularly used when you are dealing with text classification with large datasets and many features.

If you are more familiar with statistics you know that Bayes developed a method of probability that is highly influential today. In short, his system takes conditional probability into account. In the case of naive Bayes, the classifier assumes that the presence of a certain feature in a class is not related to the presence of any other feature. This assumption is why Naive Bayes is naive.
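The independence assumption means the score for each class is just the class prior multiplied by one likelihood per feature. Below is a minimal sketch for categorical features; it is not the Gaussian variant used later in this post, and the tiny dataset, the function names, and the add-one smoothing (which assumes each feature has two possible values) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    return class_counts, feature_counts

def predict_naive_bayes(row, class_counts, feature_counts):
    """Pick the class with the largest prior times the product of per-feature likelihoods."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Add-one smoothing; the 2 assumes every feature has two possible values
            score *= (feature_counts[(label, i)][value] + 1) / (count + 2)
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical (age group, income level) rows with an insurance label
rows = [("young", "high"), ("young", "low"), ("old", "high"), ("old", "low"), ("old", "low")]
labels = ["private", "private", "private", "public", "public"]
class_counts, feature_counts = train_naive_bayes(rows, labels)

print(predict_naive_bayes(("young", "high"), class_counts, feature_counts))
print(predict_naive_bayes(("old", "low"), class_counts, feature_counts))
```

Each feature contributes its own factor to the product without reference to the others, which is exactly the "naive" part.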

For our purposes, we will use Naive Bayes to predict the type of insurance a person has in the DoctorAUS dataset in the pydataset module. Below is some initial code.

```
from pydataset import data
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
```

Next, we will load our dataset DoctorAUS. Then we will separate the independent variables that we will use from the dependent variable of insurance in two different datasets. If you want to know more about the dataset and the variables you can type data("DoctorAUS", show_doc=True).

```
df=data("DoctorAUS")
X=df[['age','income','sex','illness','actdays','hscore','doctorco','nondocco','hospadmi','hospdays','medecine','prescrib']]
y=df['insurance']
```

Now, we will create our train and test datasets. We will do a 70/30 split. We will also use Gaussian Naive Bayes as our algorithm. This algorithm assumes the data is normally distributed. There are other algorithms available for Naive Bayes as well. We will also create our model with the .fit function.

```
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=GaussianNB()
clf.fit(X_train,y_train)
```

Finally, we will predict with our model and run the classification report to determine the success of the model.

```
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
```

You can see that our overall numbers are not that great. This means that the current algorithm is probably not the best choice for classification. Of course, there could be other problems as well that need to be explored.

**Conclusion**

This post was simply a demonstration of how to conduct an analysis with Naive Bayes using Python. The process is not all that complicated and is similar to other algorithms that are used.

# K Nearest Neighbor Regression with Python

K Nearest Neighbor Regression (KNN) works in much the same way as KNN for classification. The difference lies in the characteristics of the dependent variable. With classification KNN the dependent variable is categorical. With regression KNN the dependent variable is continuous. Both involve the use of neighboring examples to predict the class or value of other examples.
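For regression the only change from the classification case is the final step: instead of voting, the neighbors' values are averaged. A minimal sketch, assuming Euclidean distance and an unweighted mean (the helper name and data are made up for illustration):

```python
from math import dist
from statistics import mean

def knn_regress(train_points, train_values, new_point, k=3):
    """Predict a number as the mean of the values of the k nearest training points."""
    by_distance = sorted(
        zip(train_points, train_values),
        key=lambda pair: dist(pair[0], new_point),
    )
    return mean(value for _, value in by_distance[:k])

# One-feature toy data: the value rises with the feature
points = [(1,), (2,), (3,), (10,), (11,)]
values = [10.0, 12.0, 14.0, 30.0, 32.0]

print(knn_regress(points, values, (2,), k=3))
print(knn_regress(points, values, (10.5,), k=2))
```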

This post will provide an example of KNN regression using the turnout dataset from the pydataset module. Our purpose will be to predict the education level (educate) of a voter through the use of other variables in the dataset. Below is some initial code.

```
from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
```

We now need to set up our data. We need to load our actual dataset. Then we need to separate the independent and dependent variables. Once this is done we need to create our train and test sets using the train_test_split function. Below is the code to accomplish each of these steps.

```
df=data("turnout")
X=df[['age','income','vote']]
y=df['educate']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
```

We are now ready to train our model. We need to call the function we will use and determine the size of K, which will be 11 in our case. Then we need to train our model and then predict with it. Lastly, we will print out the mean squared error. This value is useful for comparing models but does not have much value by itself. The MSE is calculated by comparing the actual test set with the predicted test data. The code is below

```
clf=KNeighborsRegressor(11)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(mean_squared_error(y_test,y_pred))
```

9.239

If we were to continue with model development we may look for ways to improve our MSE through different methods such as regular linear regression. However, for our purposes this is adequate.

**Conclusion**

This post provides an example of regression with KNN in Python. This tool is a practical and simple way to make numeric predictions that can be accurate at times.

# Support Vector Machines Regression with Python

This post will provide an example of how to do regression with support vector machines (SVM). SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.

The steps that we will use are listed below

- Data preparation
- Model Development

We will use two different estimators in our analysis: LinearSVR and SVR. LinearSVR fits a linear model, while SVR uses the non-linear rbf kernel by default, so the two differ in how the regression function is calculated.

**Data Preparation**

We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is to predict family income (faminc), which is a continuous variable. Below is some initial code.

```
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from statsmodels.tools.eval_measures import mse
```

We now need to load our dataset and remove any missing values.

```
df=pd.DataFrame(data('OFP'))
df=df.dropna()
```

As in the previous post, we need to change the text variables into dummy variables and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and also scales the data.

```
dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)

df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df = (df - df.min()) / (df.max() - df.min())
df.head()
```

We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable

```
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']]
y=df['faminc']
```

We can now move to model development

**Model Development**

We now need to create our train and test sets for our X and y datasets. We will do a 70/30 split of the data. Below is the code.

```
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
```

Next, we will create our two models with the code below.

```
h1=svm.SVR()
h2=svm.LinearSVR()
```

We will now run our first model and assess the results. Our metric is the mean squared error. Generally, the lower the number the better. We will use the .fit() function to train the model and the .predict() function to test the model.

The mse was 0.27. This number means little on its own and is only beneficial for comparison. Therefore, the second model will be judged as better only if its mse is lower than 0.27. Below are the results of the second model.

We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.

**Conclusion**

This post provided an example of how to use SVM for regression.

# Support Vector Machines Classification with Python

Support vector machines (SVM) is an algorithm used to fit non-linear models. The details are complex but to put it simply SVM tries to create the largest boundaries possible between the various groups it identifies in the sample. The mathematics behind this is complex especially if you are unaware of what a vector is as defined in algebra.

This post will provide an example of SVM using Python broken into the following steps.

- Data preparation
- Model Development

We will use two different kernels in our analysis: the linear kernel and the rbf kernel. The difference in terms of kernels has to do with how the boundaries between the different groups are made.

**Data Preparation**

We are going to use the OFP dataset available in the pydataset module. We want to predict if someone is single or not. Below is some initial code.

```
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection
```

We now need to load our dataset and remove any missing values.

```
df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()
```

Looking at the dataset we need to do something with the variables that have text. We will create dummy variables for all except region and hlth. The code is below.

```
dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)

dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)

dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)

dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
```

For each variable, we did the following

- Created a dummy in the dummy dataset
- Combined the dummy variable with our df dataset
- Renamed the dummy variable based on yes or no
- Dropped the other dummy variable from the dataset. Python creates two dummies instead of one.
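The reason one of the two dummies can be dropped is that it carries no extra information: it is always one minus the other. A tiny standard-library sketch makes this visible (the helper `make_dummy` and the sample values are made up for illustration and do not use pandas):

```python
def make_dummy(values, positive):
    """Return 1 where the value equals positive and 0 otherwise."""
    return [1 if v == positive else 0 for v in values]

black = ["no", "yes", "no", "no"]
black_person = make_dummy(black, "yes")   # the dummy we keep
not_black = make_dummy(black, "no")       # the dummy we drop

print(black_person)
print(not_black)
```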

If you look at the dataset now you will see a lot of variables that are not necessary. Below is the code to remove the information we do not need.

```
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df.head()
```

This is much cleaner. Now we need to scale the data. This is because SVM is sensitive to scale. The code for doing this is below.

```
df = (df - df.min()) / (df.max() - df.min())
df.head()
```
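The scaling line above is min-max scaling: each column is shifted and stretched so that its smallest value becomes 0 and its largest becomes 1. Here is the same arithmetic for a single column in plain Python. The helper name and the ages are made up, and returning zeros for a constant column is a choice made for this sketch (the pandas expression would give missing values there instead).

```python
def min_max_scale(values):
    """Rescale values to the range 0 to 1 using (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero (a choice made for this sketch)
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [66, 72, 80, 90]
scaled = min_max_scale(ages)
print(scaled)
```

After scaling, a variable measured in years and one measured in thousands of dollars sit on the same 0 to 1 range, so neither dominates the distance calculations inside SVM.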

We can now create our dataset with the independent variables and a separate dataset with our dependent variable. The code is as follows.

```
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','faminc','black_person','Male','job','insured']]
y=df['single']
```

We can now move to model development

**Model Development**

We need to make our test and train sets first. We will use a 70/30 split.

```
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
```

Now, we need to create the models or the hypothesis we want to test. We will create two hypotheses. The first model is using a linear kernel and the second is one using the rbf kernel. For each of these kernels, there are hyperparameters that need to be set which you will see in the code below.

```
h1=svm.LinearSVC(C=1)
h2=svm.SVC(kernel='rbf',degree=3,gamma=0.001,C=1.0)
```

The details about the hyperparameters are beyond the scope of this post. Below are the results for the first model.

The overall accuracy is 73%. The crosstab() function provides a breakdown of the results and the classification_report() function provides other metrics related to classification. In this situation, 0 means not single or married while 1 means single. Below are the results for model 2

You can see the results are similar with the first model having a slight edge. The second model really struggles with predicting people who are actually single. You can see that the recall in particular is really poor.

**Conclusion**

This post showed how to use SVM in Python. How this algorithm works can be somewhat confusing. However, it can be powerful if used appropriately.

# Linear Discriminant Analysis in Python

Linear discriminant analysis (LDA) is a classification algorithm commonly used in data science. In this post, we will learn how to use LDA with Python. The steps we will take are as follows.

- Data preparation
- Model training and evaluation

**Data Preparation**

We will be using the bioChemists dataset which comes from the pydataset module. We want to predict whether someone is married or single based on academic output and prestige. Below is some initial code.

```
import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
```

Now we will load our data and take a quick look at it using the .head() function.

There are two variables that contain text, so we need to convert them into dummy variables for our analysis. The code is below with the output.

Here is what we did.

- We created the dummy variable by using the .get_dummies() function.
- We saved the output in an object called dummy
- We then combine the dummy and df dataset with the .concat() function
- We repeat this process for the second variable

The output shows that we have our original variables and the dummy variables. However, we do not need all of this information. Therefore, we will create a dataset that has the X variables we will use and a separate dataset that will have our y values. Below is the code.

```
X=df[['Men','kid5','phd','ment','art']]
y=df['Married']
```

The X dataset has our five independent variables and the y dataset has our dependent variable which is married or not. We can now split our data into a train and test set. The code is below.

```
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
```

The data was split 70% for training and 30% for testing. We made a train and test set for the independent and dependent variables which meant we made 4 sets altogether. We can now proceed to model development and testing

**Model Training and Testing**

Below is the code to run our LDA model. We will use the .fit() function for this.

```
clf=LDA()
clf.fit(X_train,y_train)
```

We will now use this model to predict using the .predict function

```
y_pred=clf.predict(X_test)
```

Now for the results, we will use the classification_report function to get all of the metrics associated with a confusion matrix.
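The main figures in such a report can be computed by hand from the counts of a confusion matrix. The sketch below uses made-up labels and a hypothetical helper, `classification_metrics`, to show the arithmetic for one class; the real report repeats it for every class.

```python
def classification_metrics(actual, predicted, positive=1):
    """Compute accuracy, precision, recall and F1 for the class treated as positive."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = sum(1 for a, p in pairs if a == p) / len(pairs)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical true labels and predictions (1 = married, 0 = single)
actual = [1, 1, 1, 0, 0, 1, 0, 1]
predicted = [1, 1, 0, 0, 1, 1, 0, 1]

metrics = classification_metrics(actual, predicted)
print(metrics)
```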

The interpretation of this information is described in another place. For our purposes, we have an accuracy of 71% for our prediction. Below is a visual of our model using the ROC curve.

Here is what we did

- We had to calculate the roc_curve for the model (this is explained in detail elsewhere)
- Next, we plotted our own curve and compared to a baseline curve which is the dotted lines.

An area under the ROC curve of 0.67 is considered fair by many. Our classification model is not that great but there are worse models out there.
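One way to read this number: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The sketch below computes it directly from that definition on made-up scores; `roc_auc` is a hypothetical helper, and scikit-learn computes the same quantity more efficiently.

```python
def roc_auc(actual, scores):
    """Share of positive/negative pairs in which the positive has the higher score."""
    positives = [s for a, s in zip(actual, scores) if a == 1]
    negatives = [s for a, s in zip(actual, scores) if a == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and predicted probabilities
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

print(roc_auc(actual, scores))
```

A value of 0.5 corresponds to the dotted baseline (random guessing) and 1.0 to a perfect ranking.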

**Conclusion**

This post went through an example of developing and evaluating a linear discriminant model. To do this you need to prepare the data, train the model, and evaluate.

# Factor Analysis in Python

Factor analysis (FA) is a dimensionality reduction technique commonly used in statistics. FA is similar to principal component analysis. The differences are highly technical but include the fact that FA does not have an orthogonal decomposition and that FA assumes there are latent variables influencing the observed variables in the model. For FA the goal is the explanation of the covariance among the observed variables present.

Our purpose here will be to use the bioChemists dataset from the pydataset module and perform a FA that creates two factors. This dataset has data on the people completing PhDs and their mentors. We will also create a visual of our two-factor solution. Below is some initial code.

```
import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
```

We now need to prepare the dataset. The code is below

```
df = data('bioChemists')
df=df.iloc[:250]  # keep the first 250 rows
X=df[['art','kid5','phd','ment']]
```

In the code above, we did the following

- The first line creates our dataframe called “df” and is made up of the dataset bioChemists
- The second line reduces the df to 250 rows. This is done for the visual that we will make. To take the whole dataset and graph it would make a giant blob of color that would be hard to interpret.
- The last line pulls the variables we want to use for our analysis. The meaning of these variables can be found by typing data("bioChemists",show_doc=True)

In the code below we need to set the number of factors we want and then run the model.

```
fact_2c=FactorAnalysis(n_components=2)
X_factor=fact_2c.fit_transform(X)
```

The first line tells Python how many factors we want. The second line takes this information along with our revised dataset X to create the actual factors that we want. We can now make our visualization.

To make the visualization requires several steps. We want to identify how well the two components separate students who are married from students who are not married. First, we need to make a dictionary that can be used to convert the single or married status to a number. Below is the code.

```
thisdict = {"Single": 1, "Married": 2}  # numbers, so that c can map them to colors
```

Now we are ready to make our plot. The code is below. Pay close attention to the ‘c’ argument as it uses our dictionary.

```
plt.scatter(X_factor[:,0],X_factor[:,1],c=df.mar.map(thisdict),alpha=.8,edgecolors='none')
```

You can perhaps tell why we created the dictionary now. Mapping the dictionary to the mar variable converts every Single and Married entry to a 1 or 2. The c argument needs a number in order to set a color and this is what the dictionary was able to supply it with.

You can see that two factors do not do a good job of separating the people by their marital status. Additional factors may be useful but after two factors it becomes impossible to visualize them.

**Conclusion**

This post provided an example of factor analysis in Python. Here the focus was primarily on visualization but there are so many other ways in which factor analysis can be deployed.

# Analyzing Twitter Data in Python

In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barack Obama.

- Conduct a sentiment analysis
- Create a word cloud

This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.

- Obtain all of the necessary information from your twitter apps account
- Download the tweets & clean
- Perform the analysis

Before we begin, here is a list of modules we will need to load to complete our analysis

```
import wordcloud
import matplotlib.pyplot as plt
import twython
import re
import numpy
```

**Obtain all Needed Information**

From your twitter app account, you need the following information

- App key
- App key secret
- Access token
- Access token secret

All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.

```
# Replace each placeholder string with your own credentials
TWITTER_APP_KEY='XXXXXXXXXXXXXXXXXXXXXXXXXX'
TWITTER_APP_KEY_SECRET='XXXXXXXXXXXXXXXXXXX'
TWITTER_ACCESS_TOKEN='XXXXXXXXXXXXXXXXXXXXX'
TWITTER_ACCESS_TOKEN_SECRET='XXXXXXXXXXXXXX'
t=twython.Twython(app_key=TWITTER_APP_KEY,
                  app_secret=TWITTER_APP_KEY_SECRET,
                  oauth_token=TWITTER_ACCESS_TOKEN,
                  oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)
```

In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.

Next, we need to create a function that will pull the tweets from Twitter. Below is the code,

```
def get_tweets(twython_object,query,n):
    count=0
    result_generator=twython_object.cursor(twython_object.search,q=query)
    result_set=[]
    for r in result_generator:
        result_set.append(r['text'])
        count+=1
        if count ==n: break
    return result_set
```
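The counting loop in get_tweets is a general pattern for taking the first n items from a generator that could otherwise run for a very long time. The same pattern can be written with itertools.islice. In the sketch below, `fake_search` is a hypothetical stand-in for the Twitter cursor so the example runs without credentials, and `take_texts` is a name invented for the illustration.

```python
from itertools import islice

def take_texts(result_generator, n):
    """Return the 'text' field of at most n results from any iterable of dicts."""
    return [r["text"] for r in islice(result_generator, n)]

def fake_search():
    """Hypothetical stand-in for a paged search cursor; yields results forever."""
    i = 0
    while True:
        yield {"text": f"tweet number {i}"}
        i += 1

first_three = take_texts(fake_search(), 3)
print(first_three)
```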

In short, the function pages through the search results for the query, saves the text of each tweet, and stops once n tweets have been collected. We can now download the tweets.

**Downloading Tweets & Clean**

Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary one for Trump and the other for Obama because we are downloading tweets about these two people.

There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.

```
tweets={}
tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#trump',1500)]
tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower())for tweet in get_tweets(t,'#obama',1500)]
```

The get_tweets function is also used in the code above along with our twitter app information. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis

**Analysis**

To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.

```
positive_words=open('XXXXXXXXXXXX').read().split('\n')
negative_words=open('XXXXXXXXXXXX').read().split('\n')
```

We now will make a function to calculate the sentiment

```
def sentiment_score(text,pos_list,neg_list):
    positive_score=0
    negative_score=0
    for w in text.split(' '):
        if w in pos_list:positive_score+=1
        if w in neg_list:negative_score+=1
    return positive_score-negative_score
```
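To see how the score behaves, here is a small self-contained variant with a few example sentences. It follows the same logic but makes two changes that are choices for this sketch rather than part of the original: it uses sets (faster membership checks than lists) and split() without an argument (which skips the empty strings left by repeated spaces). The word lists are tiny made-up samples, not the dictionaries used in the post.

```python
def sentiment_score(text, pos_words, neg_words):
    """Count positive words minus negative words in a piece of text."""
    score = 0
    for w in text.split():
        if w in pos_words:
            score += 1
        if w in neg_words:
            score -= 1
    return score

# Hypothetical miniature word lists
positive = {"great", "good", "love"}
negative = {"bad", "awful", "hate"}

print(sentiment_score("i love this great idea", positive, negative))
print(sentiment_score("bad and awful but good", positive, negative))
```

A score above zero means more positive than negative words were found, below zero the reverse, and zero means they balanced out or none matched.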

Now we create an empty dictionary and run the analysis for Trump and then for Obama

```
tweets_sentiment={}
tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['trump']]
tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words)for tweet in tweets['obama']]
```

Now we can make visuals of our results with the code below

```
trump=plt.hist(tweets_sentiment['trump'],5)
obama=plt.hist(tweets_sentiment['obama'],5)
```

Obama is on the left and Trump is on the right. It seems that Trump tweets are consistently more positive. Below are the means for both.

```
numpy.mean(tweets_sentiment['trump'])
Out[133]: 0.36363636363636365

numpy.mean(tweets_sentiment['obama'])
Out[134]: 0.2222222222222222
```

Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud

Here is the code for the Obama word cloud

A lot of speculation can be made from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter. People are always posting tweets, which changes the results.

**Conclusion**

This post provided an example of how to download and analyze tweets from Twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis as it is easy to become confused and not accomplish anything.

# Word Clouds in Python

Word clouds are a type of data visualization in which various words from a dataset are displayed. Words that are larger in the word cloud are more common and words in the middle are also more common. In addition, some word clouds even use various colors to indicate importance.

This post will provide an example of how to make a word cloud using Python. We will be using the “Women’s E-Commerce Clothing Reviews” dataset available on the Kaggle website. We are going to only use the text reviews to make our word cloud even though other data is in the dataset. To prepare our dataset for making the word cloud we need to do the following.

- Lowercase all words
- Remove punctuation
- Remove stopwords

After completing these steps we can make the word cloud. First, we need to load all of the necessary modules.

```
import pandas as pd
import re
from nltk.corpus import stopwords
import wordcloud
import matplotlib.pyplot as plt
```

We now need to load our dataset. We will store it as the object ‘df’.

```
df=pd.read_csv('YOUR LOCATION HERE')
df.head()
```

It’s hard to read but we will be working only with the “Review Text” column as this has the text data we need. Here is what our column looks like up close.

    df['Review Text'].head()
    Out[244]:
    0    Absolutely wonderful - silky and sexy and comf...
    1    Love this dress! it's sooo pretty. i happene...
    2    I had such high hopes for this dress and reall...
    3    I love, love, love this jumpsuit. it's fun, fl...
    4    This shirt is very flattering to all due to th...
    Name: Review Text, dtype: object

We will now make all words lower case and remove punctuation with the code below.

    df["Review Text"]=df['Review Text'].str.lower()
    df["Review Text"]=df['Review Text'].str.replace(r'[-./?!,":;()\']',' ',regex=True)  # regex=True is required in pandas 2.0+

The first line in the code above lowercases all words. The second line removes the punctuation. The second line is trickier as you have to tell Python exactly what punctuation you want to remove and what to replace it with. Everything we want to remove is listed inside the square brackets in the first set of single quotes. We replace the punctuation with a space, which is the second set of single quotation marks with a space in the middle. The r in front of the first string marks it as a raw string, so the backslash is passed through to the regular expression unchanged.
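To see the pattern at work outside of pandas, here is a minimal sketch using Python's built-in re module. The sample sentence is made up for illustration and is not taken from the dataset.

```python
import re

# Hypothetical sample review, invented for this illustration
review = "Love this dress! It's sooo pretty."

# Raw string: the backslash before the single quote reaches the regex engine as-is
punct = r'[-./?!,":;()\']'

# Lowercase first, then swap every listed punctuation mark for a space
cleaned = re.sub(punct, ' ', review.lower())
print(cleaned.split())
```

Each punctuation mark becomes a space rather than disappearing, which is why "it's" ends up as the two tokens "it" and "s".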

Here is what our data looks like after making these two changes

    df['Review Text'].head()
    Out[249]:
    0    absolutely wonderful silky and sexy and comf...
    1    love this dress it s sooo pretty i happene...
    2    i had such high hopes for this dress and reall...
    3    i love love love this jumpsuit it s fun fl...
    4    this shirt is very flattering to all due to th...
    Name: Review Text, dtype: object

All the words are in lowercase. In addition, you can see that the dash in line 0 is gone, as is all the punctuation in the other lines. We now need to remove stopwords. Stopwords are the functional words that glue a sentence together without carrying much meaning on their own. Examples include and, for, but, etc. We are trying to make a cloud of substantial words and not stopwords, so these words need to be removed.

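If you have never done this on your computer before you may need to import the nltk module and run nltk.download_gui(). Once this is done you need to download the stopwords package. Alternatively, running nltk.download('stopwords') fetches just that package without the graphical interface.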

Below is the code for removing the stopwords. First, we need to load the stopwords, which is done below.

    stopwords_list=stopwords.words('english')
    stopwords_list=stopwords_list+['to']

We create an object called stopwords_list which has all the English stopwords. The second line just adds the word ‘to’ to the list. Next, we need to make an object that will look for the pattern of words we want to remove. Below is the code.

pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))

This code is basically telling Python what to look for. The .join function glues all the words in our stopwords_list together with the | symbol, which means “or” in a regular expression. The \b on each side marks a word boundary, so only whole words that match one of the stopwords are found. We will now take this object called ‘pat’ and use it on our ‘Review Text’ variable.
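The sketch below builds the same kind of pattern from a tiny, hypothetical stopword list (not the nltk list) and applies it with the built-in re module, so you can see what the word boundaries protect.

```python
import re

# Hypothetical short stopword list, standing in for stopwords.words('english')
stopwords_list = ['and', 'it', 's', 'this', 'to']

# Same construction as in the post: word boundary, alternation of stopwords, word boundary
pat = r'\b(?:{})\b'.format('|'.join(stopwords_list))

text = "love this dress it s sooo pretty and soft"
tokens = re.sub(pat, '', text).split()
print(tokens)
```

The standalone "s" is removed, but the "s" at the start of "sooo" and "soft" is left alone because no word boundary follows it.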

    df['Split Text'] = df['Review Text'].str.replace(pat, '', regex=True)  # regex=True is required in pandas 2.0+
    df['Split Text'].head()
    Out[258]:
    0    absolutely wonderful silky sexy comfortable
    1    love dress sooo pretty happened find ...
    2    high hopes dress really wanted work ...
    3    love love love jumpsuit fun flirty f...
    4    shirt flattering due adjustable front t...
    Name: Split Text, dtype: object

You can see that we have created a new column called ‘Split Text’ and the result is text that has lost many stopwords.

We are now ready to make our word cloud. Below is the code and the output.

    wordcloud1=wordcloud.WordCloud(width=1000,height=500).generate(' '.join(map(str, df['Split Text'])))
    plt.figure(figsize=(15,8))
    plt.imshow(wordcloud1)
    plt.axis('off')

This code is complex. We used the WordCloud function and had to use generate, map, and join as inner functions. All of these functions were needed to take the words from the dataframe and turn them into one plain string for the WordCloud function.
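As a rough illustration of what the inner functions do, the sketch below applies the same map and join steps to a small made-up list instead of the dataframe column. It then counts the words with collections.Counter, which is only a stand-in for the frequency counting that a word cloud is based on.

```python
from collections import Counter

# Hypothetical cleaned reviews; the float NaN mimics a missing review in the column
reviews = ["love dress", float("nan"), "high hopes dress"]

# map(str, ...) converts every entry to text, join glues them into one long string
corpus = ' '.join(map(str, reviews))

# Word frequencies, the quantity a word cloud uses to size words
counts = Counter(corpus.split())
print(counts.most_common(3))
```

Note that map(str, ...) turns a missing value into the literal text "nan", so it may be worth dropping missing reviews before building the string.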

The rest of the code is common to matplotlib so it does not require much explanation. As you look at the word cloud, you can see that the most common words include top, look, dress, shirt, fabric, etc. This is reasonable given that these are women’s reviews of clothing.

**Conclusion**

This post provided an example of text analysis using word clouds in Python. The insights here are primarily descriptive in nature. This means that if the goal is prediction or classification, additional tools would need to build upon what is shown here.

# KMeans Clustering in Python

Kmeans clustering is a technique in which the examples in a dataset are divided into groups through segmentation. The segmentation is based on a statistical analysis in which examples within a group are more similar to each other than to examples outside of the group.

The application of this is that it provides the analyst with various groups that have similar characteristics, which can be used to tailor services in industries such as business or education. In this post, we will look at how to do this using Python. We will use the steps below to complete this process.

- Data preparation
- Determine the number of clusters
- Conduct analysis

**Data Preparation**

Our data for this example comes from the sat.act dataset available in the pydataset module. Below is some initial code.

    import pandas as pd
    from pydataset import data
    from sklearn.cluster import KMeans
    from scipy.spatial.distance import cdist
    import numpy as np
    import matplotlib.pyplot as plt

We will now load our dataset and drop any NAs that may be present.

You can see there are six variables that will be used for the clustering. Next, we will turn to determining the number of clusters.

**Determine the Number of Clusters**

Before you can actually do a kmeans analysis you must specify the number of clusters. This can be tricky as there is no single way to determine this. For our purposes, we will use the elbow method.

The elbow method measures the within-cluster error for each number of clusters. As the number of clusters increases this error decreases. However, at a certain point the return on adding more clusters becomes minimal, and this point is known as the elbow. Below is the code to calculate this.

    distortions = []
    K = range(1,10)
    for k in K:
        kmeanModel = KMeans(n_clusters=k).fit(df)
        distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])

Here is what we did

- We made an empty list called ‘distortions’ where we will save our results.
- In the second line, we told Python the range of clusters we want to consider. Because range(1,10) stops before 10, we consider anywhere from 1 to 9 clusters.
- In lines 3 and 4, we use a for loop to fit a KMeans model to the df object for each number of clusters.
- In line 5, we save the average distance from each example to its nearest cluster center in the distortions list.
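The steps above can also be sketched with the inertia_ attribute that scikit-learn stores on a fitted KMeans model, which is the within-cluster sum of squared distances. This is a slightly different error measure from the average distance used in the post, and the data below is synthetic (three made-up blobs), not the sat.act dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs of 50 points each (an assumption for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [0, 10]])
X = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])

inertias = []
K = range(1, 10)
for k in K:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squared distances for this k
    inertias.append(model.inertia_)

for k, value in zip(K, inertias):
    print(k, round(value, 1))
```

With data like this, the error falls sharply up to three clusters and only slightly afterwards, which is the elbow.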

Below is a visual to determine the number of clusters

    plt.plot(K, distortions, 'bx-')
    plt.xlabel('k')
    plt.ylabel('Distortion')
    plt.title('The Elbow Method showing the optimal k')

The graph indicates that 3 clusters are sufficient for this dataset. We can now perform the actual kmeans clustering.

**KMeans Analysis**

The code for the kmeans analysis is as follows

    km=KMeans(3,init='k-means++',random_state=3425)
    km.fit(df.values)

- We use the KMeans function and tell Python the number of clusters and the type of initialization, and we set the seed with the random_state argument. All this is saved in the object called km.
- The km object has the .fit function used on it with df.values

Next, we will predict with the predict function and look at the first few lines of the modified df with the .head() function.

You can see we created a new variable called predict. This variable contains the kmeans algorithm's prediction of which group each example belongs to. We then printed the first five values as an illustration. Below are the descriptive statistics for the three clusters that were produced for the variables in the dataset.
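The prediction code is not reproduced here, so the following is only a sketch of what such a step could look like. The dataframe is synthetic, and the column names ACT, SATV, and SATQ and their distributions are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic test scores (hypothetical stand-in for the sat.act data)
rng = np.random.default_rng(1)
scores = pd.DataFrame({
    'ACT': rng.normal(28, 5, 90),
    'SATV': rng.normal(610, 110, 90),
    'SATQ': rng.normal(610, 115, 90),
})

features = ['ACT', 'SATV', 'SATQ']
km = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=3425)
km.fit(scores[features].values)

# Store the cluster label of every row in a new column
scores['predict'] = km.predict(scores[features].values)

print(scores.head())
# Mean of each variable within each cluster
print(scores.groupby('predict')[features].mean())
```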

It is clear that the clusters are mainly divided based on performance on the various tests used. In the last piece of code, gender is used. 1 represents male and 2 represents female.

We will now make a visual of the clusters using two dimensions. First, we need to make a map of the clusters that is saved as a dictionary. Then we will create a new variable in which we take the numerical value of each cluster and convert it to a string using our cluster map dictionary.

    clust_map={0:'Weak',1:'Average',2:'Strong'}
    df['perf']=df.predict.map(clust_map)

Next, we make a different dictionary to color the points in our graph.

d_color={'Weak':'y','Average':'r','Strong':'g'}

Here is what is happening in the code above.

- We set the ax object to a value.
- A for loop is used to go through every value in clust_map.values so that the points are colored according to the color dictionary.
- Lastly, a plot is called which lines up the perf and clust values for color.

The groups are clearly separated when looking at them in two dimensions.

**Conclusion**

Kmeans is a form of unsupervised learning in which there is no dependent variable which you can use to assess the accuracy of the classification or the reduction of error in regression. As such, it can be difficult to know how well the algorithm did with the data. Despite this, kmeans is commonly used in situations in which people are trying to understand the data rather than predict.

# Random Forest in Python

This post will provide a demonstration of the use of the random forest algorithm in Python. Random forest is similar to decision trees except that instead of one tree a multitude of trees are grown to make predictions. The various trees all vote on how to classify an example and the majority vote normally wins. By growing many trees the accuracy of the model normally improves.

The steps are as follows for the use of random forest

- Data preparation
- Model development & evaluation
- Model comparison
- Determine variable importance

**Data Preparation**

We will use the cancer dataset from the pydataset module. We want to predict if someone is censored or dead in the status variable. The other variables will be used as predictors. Below is some code that contains all of the modules we will use.

    import pandas as pd
    import sklearn.ensemble as sk
    from pydataset import data
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    import matplotlib.pyplot as plt

We will now load the cancer data into an object called ‘df’. Then we will remove all NAs using the .dropna() function. Below is the code.

    df = data('cancer')
    df=df.dropna()

We now need to make two datasets. One dataset, called X, will contain all of the predictor variables. Another dataset, called y, will contain the outcome variable. In the y dataset, we need to change the numerical values to strings. This will make interpretation easier as we will not need to look up what the numbers represent. Below is the code.

    X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
    df['status']=df.status.replace(1,'censored')
    df['status']=df.status.replace(2,'dead')
    y=df['status']

Instead of 1 we now have the string “censored” and instead of 2 we now have the string “dead” in the status variable. The final step is to set up our train and test sets. We will do a 70/30 split. We will have a train set for the X and y dataset as well as a test set for the X and y datasets. This means we will have four datasets in all. Below is the code.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to move to model development

**Model Development and Evaluation**

We now need to create our classifier and fit the data to it. This is done with the following code.

    clf=sk.RandomForestClassifier(n_estimators=100)
    clf=clf.fit(x_train,y_train)

The clf object has our random forest algorithm. The number of estimators is set to 100. This is the number of trees that will be generated. In the second line of code, we use the .fit function with the training datasets x and y.

We will now test our model and evaluate it. To do this we will use .predict() with the test dataset. Then we will make a confusion matrix followed by common classification metrics. Below is the code and the output.

You can see that our model is good at predicting who is dead but struggles with predicting who is censored. The metrics are reasonable for dead but terrible for censored.
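Since the evaluation code is not shown here, the following is a hedged sketch of one way to produce a confusion matrix and the usual metrics. It runs on synthetic data from make_classification rather than the cancer dataset, and the class labels are simply borrowed from the post for readability.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Synthetic, imbalanced two-class data (an assumption, not the cancer data)
X, y = make_classification(n_samples=200, n_features=8, weights=[0.3, 0.7], random_state=0)
y = pd.Series(y).map({0: 'censored', 1: 'dead'})

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(x_train, y_train)
y_pred = clf.predict(x_test)

# Confusion matrix: rows are the actual classes, columns the predicted classes
cm = pd.crosstab(y_test.values, y_pred, rownames=['actual'], colnames=['predicted'])
print(cm)

# Precision, recall, and f1-score for each class
print(metrics.classification_report(y_test, y_pred))
```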

We will now make a second model for the purpose of comparison

**Model Comparison**

We will now make a different model for the purpose of comparison. In this model, we will use out of bag samples to determine accuracy, set the minimum split size at 5 examples, and that each leaf has at least 2 examples. Below is the code and the output.

There was some improvement in classifying people who were censored as well as those who were dead.
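The second model's code is not shown here. A minimal sketch of a classifier with the settings described (out-of-bag scoring, a minimum split size of 5, and at least 2 examples per leaf) could look like the following. The data is synthetic and the variable name clf2 is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

clf2 = RandomForestClassifier(
    n_estimators=100,
    oob_score=True,        # score each tree on the rows it did not see during bootstrapping
    min_samples_split=5,   # a node needs at least 5 examples before it can be split
    min_samples_leaf=2,    # every leaf must keep at least 2 examples
    random_state=0,
)
clf2.fit(X, y)

# Out-of-bag accuracy, an estimate of performance without a separate test set
print(clf2.oob_score_)
```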

**Variable Importance**

We will now look at which variables were most important in classifying our examples. Below is the code

    model_ranks=pd.Series(clf.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
    ax=model_ranks.plot(kind='barh')

We create an object called model_ranks and we indicate the following.

- Classify the features by importance
- Set index to the columns in the training dataset of x
- Sort the features in ascending order of importance, which puts the most important feature at the top of the horizontal bar plot
- Make a barplot

Below is the output

You can see that time is the strongest classifier. How long someone has had cancer is the strongest predictor of whether they are censored or dead. Next is the number of calories per meal, followed by weight loss and age.

**Conclusion**

Here we learned how to use random forest in Python. This is another tool commonly used in the context of machine learning.

# Decision Trees in Python

Decision trees are used in machine learning. They are easy to understand and are able to deal with data that is less than ideal. In addition, because of the pictorial nature of the results decision trees are easy for people to interpret. We are going to use the ‘cancer’ dataset to predict mortality based on several independent variables.

We will follow the steps below for our decision tree analysis

- Data preparation
- Model development
- Model evaluation

**Data Preparation**

We need to load the following modules in order to complete this analysis.

    import pandas as pd
    import statsmodels.api as sm
    import sklearn
    from pydataset import data
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    from sklearn import tree
    import matplotlib.pyplot as plt
    from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
    from IPython.display import Image
    from sklearn.tree import export_graphviz
    import pydotplus

The ‘cancer’ dataset comes from the ‘pydataset’ module. You can learn more about the dataset by typing the following

data('cancer', show_doc=True)

This provides all you need to know about our dataset in terms of what each variable is measuring. We need to load our data as ‘df’. In addition, we need to remove rows with missing values and this is done below.

    df = data('cancer')
    len(df)
    Out[58]: 228
    df=df.dropna()
    len(df)
    Out[59]: 167

The initial number of rows in the data set was 228. After removing missing data it dropped to 167. We now need to set up one object with the independent variables and a second object with the dependent variable. While doing this, we need to recode our dependent variable “status” so that the numerical values are replaced with strings. This will help us to interpret our decision tree later. Below is the code.

    X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
    df['status']=df.status.replace(1,'censored')
    df['status']=df.status.replace(2,'dead')
    y=df['status']

Next, we need to make our train and test sets using the train_test_split function. We want a 70/30 split. The code is below.

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We are now ready to develop our model.

**Model Development**

The code for the model is below

    clf=tree.DecisionTreeClassifier(min_samples_split=10)
    clf=clf.fit(x_train,y_train)

We first make an object called “clf” which calls the DecisionTreeClassifier. Inside the parentheses, we tell Python that a node must contain at least 10 examples before the tree is allowed to split it. The second line uses the .fit function on “clf” with the training datasets.
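To check what min_samples_split actually constrains, the sketch below fits a tree on synthetic data (not the cancer dataset) and inspects the fitted tree_ structure. Every node that was split holds at least 10 examples, while the leaves that result from a split are allowed to be smaller.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic noisy data so that the tree grows several levels deep
X, y = make_classification(n_samples=150, n_features=8, flip_y=0.2, random_state=0)

clf = DecisionTreeClassifier(min_samples_split=10, random_state=0).fit(X, y)

t = clf.tree_
# In scikit-learn's tree arrays, a leaf has children_left equal to -1
is_internal = t.children_left != -1

smallest_split = int(t.n_node_samples[is_internal].min())
smallest_leaf = int(t.n_node_samples[~is_internal].min())
print("smallest node that was split:", smallest_split)
print("smallest leaf:", smallest_leaf)
```

If the goal is to control the size of the leaves themselves, min_samples_leaf is the argument to look at instead.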

We can also make a visual of our decision tree.

    dot_data = StringIO()
    export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,feature_names=list(x_train.columns.values), special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    Image(graph.create_png())

If we interpret the nodes furthest to the left we get the following

- If a person has had cancer less than 171 days and
- If the person is less than 74.5 years old then
- The person is dead

If you look closely, every node is classified as ‘dead’. This may indicate a problem with our model. The evaluation metrics are below.

**Model Evaluation**

We will use the .crosstab function and the metrics classification functions

You can see that the metrics are not that great in general. This may be why everything was classified as ‘dead’. Another reason is that few people were classified as ‘censored’ in the dataset.

**Conclusion**

Decision trees are another machine learning tool. Python allows you to develop trees rather quickly that can provide insights into how to take action.

# Multiple Regression in Python

In this post, we will go through the process of setting up a regression model with a training and testing set using Python. We will use the insurance dataset from Kaggle. Our goal will be to predict charges. In this analysis, the following steps will be performed.

- Data preparation
- Model training
- Model testing

**Data Preparation**

Below is a list of the modules we will need in order to complete the analysis

    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn import linear_model,model_selection, feature_selection,preprocessing
    import statsmodels.api as sm  # statsmodels.api exposes the OLS class used below
    from statsmodels.tools.eval_measures import mse
    from statsmodels.tools.tools import add_constant
    from sklearn.metrics import mean_squared_error

After you download the dataset you need to load it and take a look at it. You will use the .read_csv function from pandas to load the data and .head() function to look at the data. Below is the code and the output.

insure=pd.read_csv('YOUR LOCATION HERE')

We need to create some dummy variables for sex, smoker, and region. We will address that in a moment, right now we will look at descriptive stats for our continuous variables. We will use the .describe() function for descriptive stats and the .corr() function to find the correlations.

The descriptives are left for your own interpretation. As for the correlations, they are generally weak which is an indication that regression may be appropriate.

As mentioned earlier, we need to make dummy variables sex, smoker, and region in order to do the regression analysis. To complete this we need to do the following.

- Use the pd.get_dummies function from pandas to create the dummy
- Save the dummy variable in an object called ‘dummy’
- Use the pd.concat function to add our new dummy variable to our ‘insure’ dataset
- Repeat this three times

Below is the code for doing this

    dummy=pd.get_dummies(insure['sex'])
    insure=pd.concat([insure,dummy],axis=1)
    dummy=pd.get_dummies(insure['smoker'])
    insure=pd.concat([insure,dummy],axis=1)
    dummy=pd.get_dummies(insure['region'])
    insure=pd.concat([insure,dummy],axis=1)
    insure.head()

The .get_dummies function requires the name of the dataframe and in the brackets the name of the variable to convert. The .concat function requires the name of the two datasets to combine as well the axis on which to perform it.
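Here is a small sketch of the two functions on a made-up three-row dataframe (the values are invented, not taken from the insurance data). It also shows the optional drop_first argument, which the post does not use but which keeps one fewer column per variable so that the dummies are not redundant with an intercept.

```python
import pandas as pd

# Hypothetical miniature version of the insurance data
insure_demo = pd.DataFrame({
    'sex': ['female', 'male', 'female'],
    'charges': [100.0, 250.0, 175.0],
})

# One column per category of the text variable
dummy = pd.get_dummies(insure_demo['sex'])

# axis=1 places the new columns side by side with the existing ones
combined = pd.concat([insure_demo, dummy], axis=1)
print(combined)

# Optional alternative: drop the first category to avoid a redundant column
trimmed = pd.get_dummies(insure_demo['sex'], drop_first=True)
print(trimmed)
```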

We now need to remove the original text variables from the dataset. In addition, we need to remove the y variable “charges” because this is the dependent variable.

    y = insure.charges
    insure=insure.drop(['sex', 'smoker','region','charges'], axis=1)

We can now move to model development.

**Model Training**

Our train and test sets are made with the model_selection.train_test_split function. We will do an 80-20 split of the data. Below is the code.

X_train, X_test, y_train, y_test = model_selection.train_test_split(insure, y, test_size=0.2)

In this single line of code, we create a train and test set of our independent variables and our dependent variable.

We can now run our regression analysis. This requires the use of the .OLS function from the statsmodels module. Below is the code.

answer=sm.OLS(y_train, add_constant(X_train)).fit()

In the code above inside the parentheses, we put the dependent variable(y_train) and the independent variables (X_train). However, we had to use the function add_constant to get the intercept for the output. All of this information is then used inside the .fit() function to fit a model.
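To illustrate why the constant matters, the sketch below uses plain NumPy least squares as a stand-in for statsmodels. The data is made up so that the true intercept is 2 and the true slope is 3. Adding a column of ones, which is what add_constant does, lets the fit recover both values, while leaving it out forces the line through the origin.

```python
import numpy as np

# Made-up data that follows y = 2 + 3x exactly
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# With a column of ones, the first coefficient is the intercept
X_const = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X_const, y, rcond=None)
print("with constant:", coef)

# Without it, the model has no intercept and the slope is distorted
slope_only, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
print("without constant:", slope_only)
```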

To see the output you need to use the .summary() function as shown below.

answer.summary()

The assumption is that you know regression but are reading this post to learn Python. Therefore, we will not go into great detail about the results. The r-square is strong; however, region and gender are not statistically significant.

We will now move to model testing

**Model Testing**

Our goal here is to take the model that we developed and see how it does on other data. First, we need to predict values with the model we made with the new data. This is shown in the code below

ypred=answer.predict(add_constant(X_test))

We use the .predict() function for this action and we use the X_test data as well. With this information, we will calculate the mean squared error. This metric is useful for comparing models. We only made one model so it is not that useful in this situation. Below is the code and results.

    print(mse(ypred,y_test))
    33678660.23480476
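For reference, the mean squared error is just the average of the squared differences between predicted and actual values. The sketch below computes it by hand with NumPy on three made-up numbers. Taking the square root puts the error back in the units of the outcome, so the value above corresponds to a typical error of roughly 5,800 in charges.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([10.0, 20.0, 30.0])
y_hat = np.array([12.0, 18.0, 33.0])

# Mean of the squared differences
mse_value = np.mean((y_hat - y_true) ** 2)

# Root mean squared error, in the same units as y
rmse_value = np.sqrt(mse_value)

print(mse_value, rmse_value)
```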

For our final trick, we will make a scatterplot with the predicted and actual values of the test set. In addition, we will calculate the correlation of the predict values and test set values. This is an alternative metric for assessing a model.

You can see the first two lines are for making the plot. Lines 3-4 are for making the correlation matrix and involve the .concat() function. The correlation is high at 0.86, which indicates the model is good at predicting the values. This is confirmed by the scatterplot, which is almost a straight line.

**Conclusion**

In this post we learned how to do a regression analysis in Python. We prepared the data, developed a model, and then tested and evaluated it.

# Principal Component Analysis in Python

Principal component analysis is a form of dimension reduction commonly used in statistics. By dimension reduction, it is meant to reduce the number of variables without losing too much overall information. This has the practical application of speeding up computational times if you want to run other forms of analysis such as regression but with fewer variables.

Another application of principal component analysis is for data visualization. Sometimes, you may want to reduce many variables to two in order to see subgroups in the data.

Keep in mind that in either situation PCA works better when there are high correlations among the variables. The explanation is complex but has to do with the rotation of the data which helps to separate the overlapping variance.

**Prepare the Data**

We will be using the pneumon dataset from the pydataset module. We want to try and explain the variance with fewer variables than in the dataset. Below is some initial code.

import pandas as pd from sklearn.decomposition import PCA from pydataset import data from sklearn.preprocessing import StandardScaler import matplotlib.pyplot as plt

Next, we will set up our dataframe. We will only take roughly the first 200 examples from the dataset (the slice 0:199 in the code keeps 199 rows). If we take all of them (over 3000 examples), the visualization will be a giant blob of dots that cannot be interpreted. We will also drop any missing values. Below is the code.

    df = data('pneumon')
    df=df.dropna()
    df=df.iloc[0:199,]

When doing a PCA, it is important to scale the data because PCA is sensitive to the scale of the variables. The result of the scaling process is an array rather than a dataframe. PCA can work with an array, but the column names are lost, so we convert the array back to a dataframe and rename the columns. All this is done in the code below.

    scaler = StandardScaler() #instance
    df_scaled = scaler.fit_transform(df) #scaled the data
    df_scaled= pd.DataFrame(df_scaled) #made the dataframe
    df_scaled=df_scaled.rename(index=str, columns={0: "chldage", 1: "hospital",2:"mthage",3:"urban",4:"alcohol",5:"smoke",6:"region",7:"poverty",8:"bweight",9:"race",10:"education",11:"nsibs",12:"wmonth",13:"sfmonth",14:"agepn"}) # renamed columns
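A shorter alternative, sketched below on a made-up two-column dataframe, is to pass the original column names and index straight to pd.DataFrame. This avoids typing the rename dictionary by hand. The values are invented and only the column names are borrowed from the dataset.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature dataframe
demo = pd.DataFrame({'mthage': [18, 22, 30, 26], 'bweight': [0, 1, 0, 0]})

scaler = StandardScaler()

# fit_transform returns an array; reuse the original labels when rebuilding the dataframe
demo_scaled = pd.DataFrame(scaler.fit_transform(demo), columns=demo.columns, index=demo.index)

print(demo_scaled)
# After scaling, each column has a mean of (approximately) zero
print(demo_scaled.mean())
```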

**Analysis**

We are now ready to do our analysis. We first use the PCA function to indicate how many components we want. For our first example, we will have two components. Next, you use the .fit_transform function to fit the model. Below is the code.

    pca_2c=PCA(n_components=2)
    X_pca_2c=pca_2c.fit_transform(df_scaled)

Now we can see the variance explained by component and the sum

    pca_2c.explained_variance_ratio_
    Out[199]: array([0.18201588, 0.12022734])
    pca_2c.explained_variance_ratio_.sum()
    Out[200]: 0.30224321247148167

In the first line of code, we can see that the first component explained 18% of the variance and the second explained 12%. This leads to a total of about 30%. Below is a visual of our two-component model. The color represents the race of the respondent, and the three different colors represent three different races.

Our two components do a reasonable job of separating the data. Below is the code for making four components. We cannot graph four components since our graph can only handle two, but you will see that as we increase the components we also increase the variance explained.

    pca_4c=PCA(n_components=4)
    X_pca_4c=pca_4c.fit_transform(df_scaled)
    pca_4c.explained_variance_ratio_
    Out[209]: array([0.18201588, 0.12022734, 0.09290502, 0.08945079])
    pca_4c.explained_variance_ratio_.sum()
    Out[210]: 0.4845990164486457

With four components we now have almost 50% of the variance explained.
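Rather than refitting for each number of components, one option is to fit a PCA with all components once and look at the cumulative sum of the explained variance. The sketch below does this on synthetic data (six made-up variables driven by two hidden factors), and the 90% threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 6 observed variables built from 2 latent factors plus noise
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 6)) + rng.normal(scale=0.3, size=(200, 6))

# Leaving n_components unset keeps every component
pca_all = PCA().fit(X)

# Running total of variance explained as components are added
cumulative = np.cumsum(pca_all.explained_variance_ratio_)
print(cumulative)

# Smallest number of components whose cumulative share reaches 90%
n_needed = int(np.searchsorted(cumulative, 0.90) + 1)
print(n_needed)
```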

**Conclusion**

PCA is for summarising and reducing the number of variables used in an analysis or for the purposes of data visualization. Once this process is complete you can use the results to do further analysis if you desire.

# for Loops in Python VIDEO

for loops in python

# Data Exploration with Python

In this post, we will explore a dataset using Python. The dataset we will use is the Ghouls, Goblins, and Ghosts (GGG) dataset available at the Kaggle website. The analysis will not be anything complex. We will simply do the following.

- Data preparation
- Data visualization
- Descriptive statistics
- Regression analysis

**Data Preparation**

The GGG dataset is fictitious data on the characteristics of spirits. Below are the modules we will use for our analysis.

    import pandas as pd
    import statsmodels.regression.linear_model as sm
    import numpy as np

Once you download the dataset to your computer you need to load it into Python using the pd.read_csv function. Below is the code.

df=pd.read_csv('FILE LOCATION HERE')

We store the data as “df” in the example above. Next, we will take a peek at the first few rows of data to see what we are working with.

Using the print function and accessing the first five rows reveals the structure of the data. It appears the first five columns are continuous data and the last two columns are categorical. The ‘id’ variable is useless for our purposes so we will remove it with the code below.

df=df.drop(['id'],axis=1)

The code above uses the drop function to remove the variable ‘id’. This is all saved into the object ‘df’. In other words, we wrote over our original ‘df’.

**Data Visualization**

We will start with our categorical variables for the data visualization. Below is a table and a graph of the ‘color’ and ‘type’ variables.

First, we make an object called ‘spirits’ using the groupby function to organize the table by the ‘type’ variable.

Below we make a graph of the data above using the .plot function. A professional wouldn’t make this plot but we are just practicing how to code.

We now know how many ghosts, goblins, and ghouls there are in the dataset. We will now do a breakdown of ‘type’ by ‘color’ using the .crosstab function from pandas.
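The table code is not shown here, so the following is a hedged sketch of the two steps on a made-up four-row dataframe. The rows are invented and only the column names ‘type’ and ‘color’ come from the post.

```python
import pandas as pd

# Hypothetical miniature version of the GGG data
demo = pd.DataFrame({
    'type': ['Ghost', 'Ghoul', 'Goblin', 'Ghost'],
    'color': ['white', 'green', 'green', 'clear'],
})

# Number of rows for each type
counts = demo.groupby('type').size()
print(counts)

# Breakdown of type (rows) by color (columns)
table = pd.crosstab(demo['type'], demo['color'])
print(table)
```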

We will now make bar graphs of both of the categorical variables using the .plot function.

We will now turn our attention to the continuous variables. We will simply make histograms and calculate the correlation between them. First the histograms

The code simply subsets the variable you want in the brackets and then calls .plot.hist() to access the histogram function. It appears that all of our data is normally distributed. Now for the correlation.

Using the .corr() function has shown that there are no high correlations among the continuous variables. We will now do an analysis in which we combine the continuous and categorical variables by making boxplots.

The code is repetitive. We use the .boxplot() function and tell Python the column which is continuous and the ‘by’ argument which is the categorical variable.

**Descriptive Stats**

We are simply going to calculate the mean and standard deviation of the continuous variables.

    df["bone_length"].mean()
    Out[65]: 0.43415996604821117
    np.std(df["bone_length"])
    Out[66]: 0.13265391313941383
    df["hair_length"].mean()
    Out[67]: 0.5291143100058727
    np.std(df["hair_length"])
    Out[68]: 0.16967268504935665
    df["has_soul"].mean()
    Out[69]: 0.47139203219259107
    np.std(df["has_soul"])
    Out[70]: 0.17589180837106724

The mean is calculated with the .mean() function. Standard deviation is calculated using the .std() function from the numpy package.
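One detail worth knowing is that np.std divides by n by default (the population formula), whereas the .std() method in pandas divides by n - 1 by default (the sample formula). The sketch below shows the difference on a small made-up array using NumPy's ddof argument.

```python
import numpy as np

# Made-up values with mean 5
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Default ddof=0: divide by n (population standard deviation)
pop_sd = np.std(values)

# ddof=1: divide by n - 1 (sample standard deviation, the pandas default)
sample_sd = np.std(values, ddof=1)

print(pop_sd, sample_sd)
```

With several hundred rows the two numbers are nearly identical, but on small samples the gap can be noticeable.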

**Multiple Regression**

Our final trick is we want to explain the variable “has_soul” using the other continuous variables that are available. Below is the code

    X = df[["bone_length", "rotting_flesh","hair_length"]]
    y = df["has_soul"]
    model = sm.OLS(y, X).fit()

In the code above we create two new objects. X contains our independent variables and y contains the dependent variable. Then we create an object called model and use the OLS() function. We place the y and X inside the parentheses and then use the .fit() function as well. Below is the summary of the analysis.

There is obviously a lot of information in the output. The r-square is 0.91, which is surprisingly high given that there were no high correlations in the matrix. The coefficients for the three independent variables are listed and all are significant. The AIC and BIC are for model comparison and do not mean much in isolation. The Jarque-Bera (JB) statistic indicates that the distribution of the residuals is not normal. The Durbin-Watson test indicates negative autocorrelation, which matters mainly in time-series analysis.
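One likely reason for the high r-square is that no constant was added to X. When a model has no intercept, statsmodels reports an uncentered r-square, which compares the residuals to the raw size of y rather than to its spread around the mean. The NumPy sketch below illustrates the effect with two made-up variables that are unrelated to each other. It is not a rerun of the model above.

```python
import numpy as np

# Two independent made-up variables, both scaled between 0.2 and 0.8
rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=200)
y = rng.uniform(0.2, 0.8, size=200)

# Least-squares slope for a model with no intercept
beta = (x @ y) / (x @ x)
resid = y - beta * x

# Uncentered: total sum of squares is measured from zero
r2_uncentered = 1 - (resid @ resid) / (y @ y)

# Centered: total sum of squares is measured from the mean of y
y_dev = y - y.mean()
r2_centered = 1 - (resid @ resid) / (y_dev @ y_dev)

print(r2_uncentered, r2_centered)
```

The uncentered value looks impressive even though x tells us nothing about y, so adding a constant before fitting gives a fairer picture.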

**Conclusion**

Data exploration can be an insightful experience. Using Python, we found many different patterns and ways to describe the data.

# Logistic Regression in Python

This post will provide an example of a logistic regression analysis in Python. Logistic regression is commonly used when the dependent variable is categorical. Our goal will be to predict the gender of an example based on the other variables in the model. Below are the steps we will take to achieve this.

- Data preparation
- Model development
- Model testing
- Model evaluation

**Data Preparation**

The dataset we will use is the ‘Survey of Labour and Income Dynamics’ (SLID) dataset available in the pydataset module in Python. This dataset contains basic data on labor and income along with some demographic information. The initial code that we need is below.

    import pandas as pd
    import statsmodels.api as sm
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn import metrics
    from pydataset import data

The code above loads all the modules and other tools we will need in this example. We now can load our data. In addition to loading the data, we will also look at the count and the characteristics of the variables. Below is the code.

At the top of this code, we create the ‘df’ object which contains our data from the “SLID” dataset. Next, we used the .count() function to determine if there was any missing data and to see what variables were available. It appears that we have five variables and a lot of missing data, as each variable has a different number of values. Lastly, we used the .head() function to see what each variable contained. It appears that wages, education, and age are continuous variables while sex and language are categorical. The categorical variables will need to be converted to dummy variables as well.
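The steps described above can be sketched roughly as follows. Since pydataset may not be installed, this sketch builds a tiny stand-in dataframe with the same five column names. The values are invented for illustration and are not the real SLID data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the SLID data (values invented for illustration).
# With pydataset installed, the post's data would come from: df = data("SLID")
df = pd.DataFrame({
    "wages": [10.56, 11.00, np.nan, 17.76, np.nan],
    "education": [15.0, 13.2, 16.0, 14.0, 8.0],
    "age": [40, 19, 49, 46, 71],
    "sex": ["Male", "Female", "Male", "Female", "Male"],
    "language": ["English", "English", "Other", "French", None],
})

# .count() reports the number of non-missing values per column,
# so unequal counts reveal which columns contain missing data.
counts = df.count()
print(counts)

# .head() shows the first rows (five by default).
print(df.head())
```

Columns whose count is lower than the others are the ones with missing values.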

The next thing we need to do is drop all the rows that are missing data since it is hard to create a model when data is missing. Below is the code and the output for this process.

In the code above, we used the .dropna() function to remove missing data. Then we used the .count() function to see how many rows remained. You can see that all the variables have the same number of rows which is important for model analysis. We will now make our dummy variables for sex and language in the code below.
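A minimal sketch of this step, again using a small invented dataframe rather than the real SLID data, might look like this:

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; two of the four rows have a missing value.
df = pd.DataFrame({
    "wages": [10.56, np.nan, 17.76, 14.00],
    "age": [40, 19, 46, 50],
    "language": ["English", "French", None, "Other"],
})

# Drop every row that has at least one missing value.
clean = df.dropna()

# After dropping, every column should report the same count.
counts = clean.count()
print(counts)
```

Keep in mind that .dropna() returns a new dataframe, so the result has to be assigned to a name (or back to df) to have any effect.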

Here is what we did:

- We first used the get_dummies() function from pandas on the sex variable. The result was stored in a new object called “dummy”.
- We then combined the dummy and df datasets using the .concat() function. The axis=1 argument is for combining by column.
- We repeat steps 1 and 2 for the language variable
- Lastly, we used the .head() function to see the results
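The steps in the list above can be sketched as follows. The three rows are invented for illustration, and the dtype=int argument is an addition here so that the dummies show up as 0/1 instead of True/False in recent versions of pandas.

```python
import pandas as pd

# Invented rows standing in for the cleaned SLID data.
df = pd.DataFrame({
    "sex": ["Male", "Female", "Female"],
    "language": ["English", "French", "Other"],
})

# Steps 1 and 2 for the sex variable: make dummies, then join by column.
dummy = pd.get_dummies(df["sex"], dtype=int)
df = pd.concat([df, dummy], axis=1)

# The same two steps for the language variable.
dummy = pd.get_dummies(df["language"], dtype=int)
df = pd.concat([df, dummy], axis=1)

print(df.head())
```

Each category becomes its own column named after the category, which is where the “Male”, “French”, and “Other” columns used later come from.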

With this, we are ready to move to model development.

**Model Development**

The first thing we need to do is put all of the independent variables in one dataframe and the dependent variable in its own dataframe. Below is the code for this

    X = df[['wages', 'education', 'age', "French", "Other"]]
    y = df['Male']

Notice that we did not use every variable that was available. For the language variables, we only used “French” and “Other”. This is because when you make dummy variables you only need k-1 dummies created. Since the language variable had three categories we only need two dummy variables. Therefore, we excluded “English” because when “French” and “Other” are coded 0 it means that “English” is the characteristic of the example.
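To make the k-1 idea concrete, here is a small sketch with invented values. It uses the drop_first argument of get_dummies, which is a shortcut the post itself did not use. It drops the first category in sorted order, which here happens to be “English”.

```python
import pandas as pd

# Invented language values for illustration.
language = pd.Series(["English", "French", "Other", "English"], name="language")

# drop_first=True keeps k-1 dummies; the dropped category is the baseline.
dummies = pd.get_dummies(language, drop_first=True, dtype=int)
print(dummies)

# A row with French == 0 and Other == 0 can only be English.
is_english = (dummies["French"] == 0) & (dummies["Other"] == 0)
print(is_english)
```

No information is lost by leaving out the third column, since it can always be recovered from the other two.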

In addition, we only took “Male” as our dependent variable because if “Male” is set to 0 it means that the example is female. We now need to create our train and test datasets. The code is below.

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

We created four datasets

- train dataset with the independent variables
- train dataset with the dependent variable
- test dataset with the independent variables
- test dataset with the dependent variable

The split is 70/30 with 70% being used for training and 30% being used for testing. This is the purpose of the “test_size” argument. We used the train_test_split function to do this. We can now run our model and get the results. Below is the code.
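To see what the split produces, here is a small sketch on ten invented rows. The column names mirror the post, but the numbers are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten invented rows so the 70/30 split is easy to check by eye.
X = pd.DataFrame({"wages": np.arange(10.0), "age": np.arange(20, 30)})
y = pd.Series([0, 1] * 5, name="Male")

# test_size=0.3 holds out 30% of the rows; random_state makes it repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Seven rows end up in the training pieces and three in the test pieces, and the rows of X and y stay matched because they are split together.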

Here is what we did:

- We used the Logit() function from statsmodels to create the logistic model. Notice we used only the training data.
- We then used the .fit() function to get the results and stored them in the ‘result’ object.
- Lastly, we printed the results in the ‘result’ object using the .summary() function.

There are some problems with the results. The Pseudo R-square is infinity, which is unusual. Also, you may see some error output about Hessian inversion. For these reasons, we cannot trust the results but will continue for the sake of learning.

The coefficients are clear. Only wages, education, and age are significant. To interpret a coefficient, you exponentiate it with the .exp() function from numpy, which gives an odds ratio. Below are the results.

    np.exp(.08)
    Out[107]: 1.0832870676749586
    np.exp(-0.06)
    Out[108]: 0.9417645335842487
    np.exp(-.01)
    Out[109]: 0.9900498337491681

For the first value, for every one-unit increase in wages the odds that the person is male increase by about 8%. For every one-unit increase in education the odds of the person being male decrease by about 6%. Lastly, for every one-unit increase in age the odds of the person being male decrease by about 1%. Notice that we subtract 1 from the outputs to find the percentage change. Strictly speaking, these are changes in the odds rather than in the probability itself.
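The arithmetic can be done for all three coefficients at once, as in the short sketch below. The coefficient values are the rounded ones used above. One point worth keeping in mind is that exponentiating a logistic regression coefficient gives an odds ratio, so the percentages describe a change in the odds of being male.

```python
import numpy as np

# Rounded coefficients taken from the text above.
coefs = {"wages": 0.08, "education": -0.06, "age": -0.01}

odds_ratios = {}
pct_change = {}
for name, b in coefs.items():
    odds_ratios[name] = np.exp(b)                   # odds ratio per one-unit increase
    pct_change[name] = (odds_ratios[name] - 1) * 100  # percent change in the odds
    print(f"{name}: odds ratio {odds_ratios[name]:.3f}, "
          f"change in odds {pct_change[name]:+.1f}%")
```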

We will now move to model testing

**Model Testing**

To do this we first test our model with the code below

    y_pred = result.predict(X_test)

We made the result object earlier. Now we just use the .predict() function with the X_test data. Next, we need to flag examples that the model believes have a greater than 60% chance of being male. The code is below.

    y_pred_flag = y_pred > .6

This creates a boolean object with True and False as the output. Now we will make our confusion matrix as well as other metrics for classification.
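Here is a tiny sketch of the flagging step using four invented probabilities. It also shows a detail that is easy to miss: the > operator leaves out a value of exactly 0.6, so >= is needed if you want “60% or greater”.

```python
import numpy as np

# Invented predicted probabilities for illustration.
y_pred = np.array([0.35, 0.60, 0.72, 0.59])

# Strictly greater than 0.6, as in the post.
y_pred_flag = y_pred > .6
print(y_pred_flag)

# Greater than or equal to 0.6 also flags the 0.60 case.
y_pred_flag_inclusive = y_pred >= .6
print(y_pred_flag_inclusive)
```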

The results speak for themselves. There are a lot of false positives if you look at the confusion matrix. In addition, precision, recall, and f1 are all low. Hopefully, the coding is clear. The main point is to be sure to use the test set's dependent variable (y_test) with the flag data you made in the previous step.

We will now make the ROC curve. A strong model should have a strong elbow shape, while a weak model will be close to a diagonal straight line.

The first plot is of our data. The second plot is what a really bad model would look like. As you can see there is little difference between the two. Again this is because of all the false positives we have in the model. The actual coding should be clear. fpr is the false positive rate and tpr is the true positive rate. The function is .roc_curve(). Inside go the actual test data and the predictions.
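A sketch of how the confusion matrix and the other metrics can be produced with sklearn is below. The six labels and flags are invented so the counts are easy to verify by hand, and the boolean flags are converted to 0/1 so both inputs have the same type.

```python
import numpy as np
from sklearn import metrics

# Invented test labels and flags for illustration.
y_test = np.array([1, 0, 1, 1, 0, 0])
y_pred_flag = np.array([True, True, True, False, False, True])
y_pred_int = y_pred_flag.astype(int)

# Rows are the actual classes, columns are the predicted classes.
cm = metrics.confusion_matrix(y_test, y_pred_int)
print(cm)

precision = metrics.precision_score(y_test, y_pred_int)
recall = metrics.recall_score(y_test, y_pred_int)

# Precision, recall, and f1 for each class in one table.
print(metrics.classification_report(y_test, y_pred_int))
```

In this toy case the top-right cell holds the two false positives, which is the cell to watch in the post's results.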
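As a rough sketch of the roc_curve call, here is an example with four invented cases. Note that the actual labels go first and the predicted probabilities second, and that the raw probabilities are used instead of the True/False flags.

```python
import numpy as np
from sklearn import metrics

# Invented labels and predicted probabilities for illustration.
y_test = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.4, 0.35, 0.8])

# Actual values first, predicted scores second.
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred)

# Area under the curve: 0.5 is the diagonal line, 1.0 is a perfect model.
auc = metrics.roc_auc_score(y_test, y_pred)
print(fpr, tpr, auc)
```

Passing fpr and tpr to plt.plot() draws the curve, and plotting the points (0, 0) and (1, 1) alongside it gives the diagonal reference line of a model that is no better than chance.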

**Conclusion**

This post provided a demonstration of the use of logistic regression in Python. It is necessary to follow the steps above but keep in mind that this was a demonstration and the results are dubious.

# Working with a Dataframe in Python

In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the tasks we will complete include the following…

- Import data
- Examine data
- Work with strings
- Calculating descriptive statistics

**Import Data**

First, you need data. We will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.

    import pandas as pd
    df = pd.read_csv('FILE LOCATION HERE')

In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to be typed in quotes inside the parentheses. Having completed this we can now examine the data.
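If you want to try pd.read_csv() without a file on disk, a sketch like the one below works. It reads from a short piece of text held in memory. The two rows are only a small illustrative sample in the style of the Titanic file, and io.StringIO stands in for the file location.

```python
import io
import pandas as pd

# A tiny CSV held in memory; stands in for the real file on disk.
csv_text = """PassengerId,Name,Fare
1,"Braund, Mr. Owen Harris",7.25
2,"Cumings, Mrs. John Bradley",71.2833
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)
print(df)
```

The quotes around each name matter here because the names themselves contain commas.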

**Data Examination**

Now we want to get an idea of the size of our dataset and of any problems with missing data. To determine the size we use the .shape attribute as shown below.

    df.shape
    Out[33]: (891, 12)

Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many, we can use .isnull() in combination with .value_counts().

    df['Cabin'].isnull().value_counts()
    Out[36]:
    True     687
    False    204
    Name: Cabin, dtype: int64

The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see that 687 examples are missing a value for “Cabin” while only 204 have one. For a categorical variable, you can also see how many examples are part of each category as shown below.
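If you would rather check every column at once, one common alternative (not used in the post) is to chain .isnull() with .sum() on the whole dataframe. The sketch below uses a few invented rows.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration.
df = pd.DataFrame({
    "Cabin": ["C85", np.nan, np.nan, "C123"],
    "Embarked": ["C", "S", "S", np.nan],
    "Fare": [71.28, 7.25, 7.92, 53.10],
})

# .isnull() marks each cell True/False; .sum() counts the True values per column.
missing = df.isnull().sum()
print(missing)
```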

    df['Embarked'].value_counts()
    Out[39]:
    S    644
    C    168
    Q     77
    Name: Embarked, dtype: int64

This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. Then we will check the size of the dataframe again with the “shape” attribute.

    df = df.dropna(how='any')
    df.shape
    Out[40]: (183, 12)

You can see our dataframe is much smaller, going from 891 examples to 183. We can now move to other operations such as dealing with strings.

**Working with Strings**

What you do with strings really depends on your goals. We are going to look at extraction, subsetting, and determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.

    # raw string for the regex; expand=False returns a Series like the one shown
    df['Name'][0:5].str.extract(r'(\w+)', expand=False)
    Out[44]:
    1       Cumings
    3      Futrelle
    6      McCarthy
    10    Sandstrom
    11      Bonnell
    Name: Name, dtype: object

As you can see we got the last names of the first five examples. We did this by using the following format…

dataframe name[‘Variable Name’].function.function(‘whole word’)

.str is an accessor for dealing with strings in dataframes. The .extract() function does what its name implies: it pulls out the part of each string that matches the pattern inside the parentheses.

If you want, you can even determine how many characters each name has. We will do this with .str and the .len() function on the first five names in the dataframe.

    df['Name'][0:5].str.len()
    Out[64]:
    1     51
    3     44
    6     23
    10    31
    11    24
    Name: Name, dtype: int64
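One thing to be aware of is that the pattern (\w+) simply grabs the first run of letters, which works because the names are stored with the last name first. The sketch below, using two illustrative names, shows where that can go wrong and one possible alternative pattern that takes everything before the first comma. The alternative pattern is a suggestion and not part of the original code.

```python
import pandas as pd

# Two illustrative names in "Last, Title First" form.
names = pd.Series([
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "O'Brien, Mr. Timothy",
])

# First run of word characters; stops at the apostrophe in O'Brien.
first_word = names.str.extract(r'(\w+)', expand=False)
print(first_word)

# Everything before the first comma keeps the full last name.
last_name = names.str.extract(r'^([^,]+),', expand=False)
print(last_name)
```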

Hopefully, the code is becoming easier to read and understand.

**Aggregation**

We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes so we will run all of them at once. Below we are calculating the mean, max, minimum, and standard deviation for the price of a fare on the Titanic.

    df['Fare'].mean()
    Out[77]: 78.68246885245901
    df['Fare'].max()
    Out[78]: 512.32920000000001
    df['Fare'].min()
    Out[79]: 0.0
    df['Fare'].std()
    Out[80]: 76.34784270040574
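Since only the function name changes, the four calls can also be combined into one with the .agg() function. The sketch below uses four invented fare values, so the numbers will differ from the output above.

```python
import pandas as pd

# Invented fares for illustration.
fare = pd.Series([7.25, 71.2833, 0.0, 512.3292], name="Fare")

# .agg() runs each named function and returns the results together.
summary = fare.agg(["mean", "max", "min", "std"])
print(summary)
```

The .describe() function gives a similar overview and also includes the count and quartiles.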

**Conclusion**

This post provided you with some ways in which you can maneuver around a dataframe in Python.