In this video, we will look at a simple example of SVM regression. In this context, regression involves predicting a continuous dependent variable. This is similar to the basic form of regression that is taught in an introductory statistics class.
Tag Archives: python programming
Support Vector Machines Classification with Python VIDEO
Learn how to use support vector machines for classification using Python. This is another great algorithm for data science purposes.
Natural Language Process and WordClouds with Python VIDEO
Random Forest in Python VIDEO
Decision Trees in Python VIDEO
Principal Component Analysis with Python VIDEO

Visualizations with Altair
We are going to take a look at Altair, which is a data visualization library for Python. What is unique about Altair compared to other packages covered on this blog is that it allows for interactions.
The interactions can take place inside Jupyter or they can be exported and loaded onto websites, as we shall see. In the past, making interactions for a website was often taught using a JavaScript library such as d3.js. D3.js works but is cumbersome to work with for the average non-coder. Altair solves this problem, as Python is often seen as easier to work with compared to JavaScript.
Installing Altair
If Altair is not already installed on your computer, you can install it with the following code
pip install altair vega_datasets
OR
conda install -c conda-forge altair vega_datasets
Which one of the lines above you use will depend on the type of Python installation you have.
Goal
We are going to make some simple visualizations using the “Duncan” dataset from the pydataset library using Altair. If you do not have pydataset installed on your computer you can use the code listed above to install it. Simply replace “altair vega_datasets” with “pydataset.” Below is the initial code followed by the output
import pandas as pd
from pydataset import data
df=data("Duncan")
df.head()
In the code above, we load pandas and import “data” from the “pydataset” library. Next, we load the “Duncan” dataset as the object “df”. Lastly, we use the .head() function to take a look at the dataset. You can see in the output above what variables are available.
Our first visualization is a simple bar graph. The code is below followed by the visualization.
import altair as alt
alt.Chart(df).mark_bar().encode(
x= "type",
y = "prestige"
)

In the code above we did the following,
- Line one loads the altair library.
- Line 2 uses several functions together to make the bar graph. .Chart(df) loads the data for the plot. .mark_bar() assigns the geometric shape for the plot, which in this case is bars. Lastly, the .encode() function contains the information for the variables that will be assigned to the x and y axes. In this case we are looking at job type and prestige.
The three dots in the upper right provide options for saving or editing the plot. We will learn more about saving plots later. In addition, Altair follows the grammar of graphics for creating plots. This has been discussed in another post but a summary of the components is below.
- Data
- Aesthetics
- Scale
- Statistical transformation
- Geometric object
- Facets
- Coordinate system
We will not deal with all of these but we have dealt with the following
- Data as .Chart()
- Geometric object as .mark_bar()
- Aesthetics as .encode()
In our second example, we will make a scatterplot. The code and output are below.
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige"
)

The code is mostly the same. We simply use .mark_circle() to indicate the type of geometric object. For .encode() we made sure to use two continuous variables.
In the next plot, we add a categorical variable to the scatterplot by manipulating the color.
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type'
)

The only change is the addition of the “color” argument, which is set to the categorical variable of “type.”
It is also possible to use bubbles to indicate size. In the plot below we add the income variable to the plot using bubbles.
alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
size="income"
)

The latest argument that was added was the “size” argument which was used to map income to the plot.
You can also show plots side by side by piping. The code below makes two plots and saves them as objects. Then you display both by typing the names of the objects separated by the pipe symbol (|), which Altair treats as horizontal concatenation. Below you will find two different plots created through this piping process.
educationPlot=alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
)
incomePlot=alt.Chart(df).mark_circle().encode(
x= "income",
y = "prestige",
color='type',
)
educationPlot | incomePlot

With this code you can make multiple plots. Simply keep adding pipes to make more plots.
Interaction and Saving Plots
It is also possible to make plots interactive. In the code below we add the argument called tooltip. This allows us to add an additional variable called “income” to the chart. When the mouse hovers over a data point the income will display.
However, since we are in a browser right now this will not work unless we save the chart as an html file. The last line of code saves the plot as an html file and renders it using svg. We also remove the three dots in the upper right corner by adding 'actions':False. Below is the code and the plot once the html was loaded to this blog.
interact_plot=alt.Chart(df).mark_circle().encode(
x= "education",
y = "prestige",
color='type',
tooltip=["income"]
)
interact_plot.save('interact_plot.html',embed_options={'renderer':'svg','actions':False})
I’ve made a lot of visuals in the past and never has it been this simple.
Conclusion
Altair is another tool for visualizations. This may be the easiest way to make complex and interactive charts that I have seen. As such, this is a great way to achieve goals if visualizing data is something that needs to be done.

Random Forest Classification with Python
Random forest is a type of machine learning algorithm that builds multiple decision trees, each of which may use different features and a different subsample of the data, until it has made as many trees as you specify. The trees then vote to determine the class of an example. This approach helps to deal with the high variance that is a problem with making only one decision tree.
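The voting step can be sketched in a few lines of plain Python. The votes below are invented for illustration, and `majority_vote` is a hypothetical helper rather than part of any library. Note also that scikit-learn's random forest averages the class probabilities from each tree instead of counting hard votes, so this is only the simplest version of the idea.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for a single patient
votes = ["dead", "censored", "dead", "dead", "censored"]
winner = majority_vote(votes)
print(winner)  # dead
```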
In this post, we will learn how to develop a random forest model in Python. We will use the cancer dataset from the pydataset module to classify whether a person's status is censored or dead based on several independent variables. The steps we need to perform to complete this task are defined below
- Data preparation
- Model development and evaluation
Data Preparation
Below are some initial modules we need to complete all of the tasks for this project.
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
We will now load our dataset “cancer” and drop any rows that contain NA using the .dropna() function.
df = data('cancer')
df=df.dropna()
Next, we need to separate our independent variables from our dependent variable. We will do this by making two datasets. The X dataset will contain all of our independent variables and the y dataset will contain our dependent variable. You can check the documentation for the dataset using the code data('cancer', show_doc=True)
Before we make the y dataset we need to change the numerical values in the status variable to text. Doing this will aid in the interpretation of the results. If you look at the documentation of the dataset you will see that a 1 in the status variable means censored while a 2 means dead. We will change the 1 to censored and the 2 to dead when we make the y dataset. This involves the use of the .replace() function. The code is below.
X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']
We can now proceed to model development.
Model Development and Evaluation
We will first make our train and test datasets. We will use a 70/30 split. Next, we initialize the actual random forest classifier. There are many options that can be set. For our purposes, we will set the number of trees to make to 100. Setting the random_state option is similar to setting the seed for the purpose of reproducibility. Below is the code.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestClassifier(n_estimators=100,random_state=1)
We can now run our model with the .fit() function and test it with the .predict() function. The code is below.
h.fit(x_train,y_train)
y_pred=h.predict(x_test)
We will now print two tables. The first will provide the raw results for the classification using the .crosstab() function. The classification_report function will provide the various metrics used for determining the value of a classification model.
print(pd.crosstab(y_test,y_pred))
print(classification_report(y_test,y_pred))
Our overall accuracy is about 75%. How good this is depends on the context. We are really good at predicting people are dead but have much more trouble with predicting if people are censored.
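To see how a model can have decent overall accuracy while struggling with one class, here is a small sketch that computes accuracy and per-class recall from a crosstab. The counts are hypothetical, chosen only to mirror the pattern described above, and are not the actual output of the model.

```python
# Hypothetical confusion counts: confusion[actual][predicted]
confusion = {
    "censored": {"censored": 5, "dead": 10},
    "dead": {"censored": 3, "dead": 33},
}

# Accuracy: share of all cases on the diagonal of the table
total = sum(sum(row.values()) for row in confusion.values())
correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / total

# Recall: of the cases that truly belong to a class, how many were found
recall = {
    label: confusion[label][label] / sum(confusion[label].values())
    for label in confusion
}

print(round(accuracy, 3))
print({label: round(value, 3) for label, value in recall.items()})
```

With these made-up counts the accuracy is close to 75%, yet only a third of the censored cases are recovered, which is why the classification report is worth reading alongside the accuracy.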
Conclusion
This post provided an example of using random forest in python. Through the use of a forest of trees, it is possible to get much more accurate results when a comparison is made to a single decision tree. This is one of many reasons for the use of random forest in machine learning.

Gradient Boosting Regression in Python
In this post, we will take a look at gradient boosting for regression. Gradient boosting simply makes sequential models that try to explain any examples that had not been explained by previous models. Many consider this approach to give gradient boosting an edge over AdaBoost.
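The idea of each model targeting what earlier models left unexplained can be sketched without any libraries. The snippet below is a minimal illustration, assuming squared-error loss, a single feature, and one-split trees (stumps). The data and the helper names `fit_stump` and `mse` are made up for this example, and the code is not how scikit-learn implements gradient boosting internally.

```python
def fit_stump(xs, residuals):
    """Find the single split on xs that minimises squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def mse(ys, preds):
    """Mean squared error between targets and predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.0, 2.5, 3.0, 8.0, 8.5, 9.0, 15.0, 16.0]

learning_rate = 0.5
preds = [sum(ys) / len(ys)] * len(ys)   # start from the mean of y
start_error = mse(ys, preds)

for _ in range(20):
    residuals = [y - p for y, p in zip(ys, preds)]   # what is still unexplained
    stump = fit_stump(xs, residuals)                 # next model targets the residuals
    preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

end_error = mse(ys, preds)
print(start_error, end_error)
```

The learning rate shrinks each stump's contribution, so more rounds are needed but each step is less likely to overshoot.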
Regression trees are most commonly teamed with boosting. There are some additional hyperparameters that need to be set, which include the following
- number of estimators
- learning rate
- subsample
- max depth
We will deal with each of these when it is appropriate. Our goal in this post is to predict the amount of weight loss in cancer patients based on the independent variables. This is the process we will follow to achieve this.
- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development
Below is some initial code
from sklearn.ensemble import GradientBoostingRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
The data preparation is not that difficult in this situation. We simply need to load the dataset in an object and remove any missing values. Then we separate the independent and dependent variables into separate datasets. The code is below.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
We can now move to creating our baseline model.
Baseline Model
The purpose of the baseline model is to have something to compare our gradient boosting model to. Therefore, all we will do here is create several regression trees. The difference between the regression trees will be the max depth. The max depth limits how many levels of splits the tree can make as it tries to make the values within each node more similar. We will then decide which tree is best based on the mean squared error.
The first thing we need to do is set the arguments for the cross-validation. Cross-validating the results helps to check the accuracy of the results. The rest of the code requires the use of for loops and if statements that will not be explained in this post. Below is the code with the output.
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
You can see that a max depth of 2 had the lowest amount of error. Therefore, our baseline model has a mean squared error of 176. We need to improve on this in order to say that our gradient boosting model is superior.
However, before we create our gradient boosting model, we need to tune the hyperparameters of the algorithm.
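For readers curious about what the cross-validation step is doing with the rows, the sketch below deals shuffled row indices into folds and holds each fold out once. It is a simplified illustration in plain Python. The function `k_fold_indices` is a hypothetical helper, and the exact way scikit-learn's KFold assigns rows to folds differs.

```python
import random

def k_fold_indices(n_rows, n_splits, seed=1):
    """Shuffle row indices, deal them into n_splits folds, and yield train/test pairs."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]
    for test_fold in folds:
        held_out = set(test_fold)
        train = [i for i in indices if i not in held_out]
        yield train, test_fold

# 20 hypothetical rows split into 5 folds
splits = list(k_fold_indices(n_rows=20, n_splits=5))
for train, test in splits:
    print(len(train), len(test))
```

Every row is used for testing exactly once, and the reported score is the mean over the folds.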
Hyperparameter Tuning
Hyperparameter tuning has to do with setting the values of parameters that the algorithm cannot learn on its own. As such, these are constants that you set as the researcher. The problem is that you are not any better at knowing where to set these values than the computer. Therefore, the process that is commonly used is to have the algorithm try several combinations of values until it finds the values that are best for the model. Having said this, there are several hyperparameters we need to tune, and they are as follows.
- number of estimators
- learning rate
- subsample
- max depth
The number of estimators is how many trees to create. The more trees, the more likely the model is to overfit. The learning rate is the weight that each tree has on the final prediction. Subsample is the proportion of the sample to use for each tree. Max depth was explained previously.
What we will do now is make an instance of the GradientBoostingRegressor. Next, we will create our grid with the various values for the hyperparameters. We will then take this grid and place it inside GridSearchCV function so that we can prepare to run our model. There are some arguments that need to be set inside the GridSearchCV function such as estimator, grid, cv, etc. Below is the code.
GBR=GradientBoostingRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,2,4],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBR,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
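It helps to know how much work a grid like this implies before running it. The sketch below expands the same grid with itertools to count the combinations. It only illustrates the size of the search and is not a replacement for GridSearchCV.

```python
from itertools import product

search_grid = {
    "n_estimators": [500, 1000, 2000],
    "learning_rate": [0.001, 0.01, 0.1],
    "max_depth": [1, 2, 4],
    "subsample": [0.5, 0.75, 1],
    "random_state": [1],
}

# Every combination of one value per hyperparameter
names = list(search_grid)
combinations = [dict(zip(names, values)) for values in product(*search_grid.values())]

n_folds = 10
print(len(combinations))            # 81 candidate settings
print(len(combinations) * n_folds)  # 810 model fits during the search
```

With 10-fold cross-validation that is 810 fits, plus one final refit on the full data under GridSearchCV's default settings, which explains why the search can be slow.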
We can now run the code and determine the best combination of hyperparameters and how well the model did based on the mean squared error metric. Below is the code and the output.
search.fit(X,y)
search.best_params_
Out[13]: {'learning_rate': 0.01, 'max_depth': 1, 'n_estimators': 500, 'random_state': 1, 'subsample': 0.5}
search.best_score_
Out[14]: -160.51398257591643
The hyperparameter results speak for themselves. With this tuning we can see that the mean squared error is lower than with the baseline model. We can now move to the final step of taking these hyperparameter settings and see how they do on the dataset. The results should be almost the same.
Gradient Boosting Model Development
Below is the code and the output for the tuned gradient boosting model
GBR2=GradientBoostingRegressor(n_estimators=500,learning_rate=0.01,subsample=.5,max_depth=1,random_state=1)
score=np.mean(cross_val_score(GBR2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[18]: -160.77842893572068
These results were to be expected. The gradient boosting model has a better performance than the baseline regression tree model.
Conclusion
In this post, we looked at how to use gradient boosting to improve a regression tree. By creating multiple models, gradient boosting will often have a better performance than other types of algorithms that rely on only one model.

Gradient Boosting Classification in Python
Gradient boosting is an alternative form of boosting to AdaBoost. Many consider gradient boosting to be a better performer than AdaBoost. One difference between the two algorithms is that gradient boosting uses optimization to weight the estimators. Like AdaBoost, gradient boosting can be used with most algorithms but is commonly associated with decision trees.
In addition, gradient boosting requires several additional hyperparameters such as max depth and subsample. Max depth has to do with how many levels of splits a tree can have. The higher the number, the purer the classification becomes. The downside to this is the risk of overfitting.
Subsampling has to do with the proportion of the sample that is used for each estimator. This can range from a decimal value up to the whole number 1. If the value is set below 1, it becomes stochastic gradient boosting.
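A minimal sketch of what subsampling means in practice is shown below: before each estimator is built, a random fraction of the rows is drawn without replacement. The helper `draw_subsample` and the row indices are hypothetical and only illustrate the idea.

```python
import random

def draw_subsample(row_ids, subsample, rng):
    """Pick a random fraction of the rows, without replacement."""
    n_draw = max(1, int(subsample * len(row_ids)))
    return rng.sample(row_ids, n_draw)

rng = random.Random(1)
row_ids = list(range(100))   # hypothetical row indices

# Each estimator sees a different random 75% of the rows
rows_per_tree = [draw_subsample(row_ids, subsample=0.75, rng=rng) for _ in range(3)]
for rows in rows_per_tree:
    print(len(rows))
```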
This post is focused on classification. To do this, we will use the cancer dataset from the pydataset library. Our goal will be to predict the status of patients (censored or dead) using the available independent variables. The steps we will use are as follows.
- Data preparation
- Baseline decision tree model
- Hyperparameter tuning
- Gradient boosting model development
Below is some initial code.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
The data preparation is simple in this situation. All we need to do is load our dataset, drop missing values, and create our X dataset and y dataset. All this happens in the code below.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']
We will now develop our baseline decision tree model.
Baseline Model
The purpose of the baseline model is to have something to compare our gradient boosting model to. The strength of a model is always relative to some other model, so we need to make at least two, so we can say one is better than the other.
The criteria for better in this situation is accuracy. Therefore, we will make a decision tree model, but we will manipulate the max depth of the tree to create 9 different baseline models. The best accuracy model will be the baseline model.
To achieve this, we need to use a for loop to make python make several decision trees. We also need to set the parameters for the cross validation by calling KFold(). Once this is done, we print the results for the 9 trees. Below is the code and results.
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
It appears that when the max depth is limited to 1 we get the best accuracy at almost 72%. This will be our baseline for comparison. We will now tune the parameters for the gradient boosting algorithm.
Hyperparameter Tuning
There are several hyperparameters we need to tune. The ones we will tune are as follows
- number of estimators
- learning rate
- subsample
- max depth
First, we will create an instance of the gradient boosting classifier. Second, we will create our grid for the search. It is inside this grid that we set several values for each hyperparameter. Then we call GridSearchCV and place the instance of the gradient boosting classifier, the grid, the cross validation values made earlier, and n_jobs all together in one place. Below is the code for this.
GBC=GradientBoostingClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'max_depth':[1,3,5],'subsample':[.5,.75,1],'random_state':[1]}
search=GridSearchCV(estimator=GBC,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)
You can now run your model by calling .fit(). Keep in mind that there are several hyperparameters. This means that it might take some time to run the calculations. It is common to find values for max depth, subsample, and number of estimators first. Then a second run-through is done to find the learning rate. In our example, we are doing everything at once, which is why it takes longer. Below is the code with the output for best parameters and best score.
search.fit(X,y)
search.best_params_
Out[11]:
{'learning_rate': 0.01,
'max_depth': 5,
'n_estimators': 2000,
'random_state': 1,
'subsample': 0.75}
search.best_score_
Out[12]: 0.7425149700598802
You can see what the best hyperparameters are for yourself. In addition, we see that when these parameters were set we got an accuracy of 74%. This is superior to our baseline model. We will now see if we can replicate these numbers when we use them for our Gradient Boosting model.
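Beyond best_params_ and best_score_, a fitted GridSearchCV object also keeps the score of every combination in cv_results_. The sketch below ranks all candidates on a small synthetic dataset so that it runs quickly. The data, the tiny grid, and the `demo_` names are assumptions made for illustration and do not reproduce the cancer results above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in data, not the cancer dataset
X_demo, y_demo = make_classification(n_samples=120, n_features=6, random_state=1)

# A deliberately tiny grid so the search finishes in seconds
demo_grid = {"n_estimators": [20, 40], "max_depth": [1, 2], "random_state": [1]}
demo_cv = KFold(n_splits=3, shuffle=True, random_state=1)
demo_search = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=demo_grid,
    scoring="accuracy",
    cv=demo_cv,
)
demo_search.fit(X_demo, y_demo)

# Rank every combination by mean cross-validated accuracy, not just the winner
results = sorted(
    zip(demo_search.cv_results_["mean_test_score"], demo_search.cv_results_["params"]),
    key=lambda pair: pair[0],
    reverse=True,
)
for mean_score, params in results:
    print(round(mean_score, 3), params)
```

Looking at the full ranking shows whether the best setting wins by a clear margin or whether several combinations perform about the same.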
Gradient Boosting Model
Below is the code and results for the model with the predetermined hyperparameter values.
GBC2=GradientBoostingClassifier(n_estimators=2000,learning_rate=0.01,subsample=.75,max_depth=5,random_state=1)
score=np.mean(cross_val_score(GBC2,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[17]: 0.742279411764706
You can see that the results are similar. This is additional evidence that the gradient boosting model does outperform the baseline decision tree model.
Conclusion
This post provided an example of what gradient boosting classification can do for a model. With its distinct characteristics gradient boosting is generally a better performing boosting algorithm in comparison to AdaBoost.

AdaBoost Regression with Python
This post will share how to use the AdaBoost algorithm for regression in Python. What boosting does is make multiple models in a sequential manner. Each newer model tries to successfully predict what older models struggled with. For regression, the average of the models is used for the predictions. It is most common to use boosting with decision trees, but this approach can be used with any machine learning algorithm that deals with supervised learning.
Boosting is associated with ensemble learning because several models are created that are averaged together. An assumption of boosting is that combining several weak models can make one really strong and accurate model.
For our purposes, we will be using AdaBoost regression to improve the performance of a decision tree in Python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the weight loss of a patient based on several independent variables. The steps of this process are as follows.
- Data preparation
- Regression decision tree baseline model
- Hyperparameter tuning of Adaboost regression model
- AdaBoost regression model development
Below is some initial code
from sklearn.ensemble import AdaBoostRegressor
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
Data Preparation
There is little data preparation for this example. All we need to do is load the data and create the X and y datasets. Below is the code.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','status','meal.cal']]
y=df['wt.loss']
We will now proceed to creating the baseline regression decision tree model.
Baseline Regression Tree Model
The purpose of the baseline model is to compare it to the performance of our model that utilizes AdaBoost. To make this model we need to initiate a K-fold cross-validation. This will help in stabilizing the results. Next, we will create a for loop to create several trees that vary based on their depth. By depth, it is meant how many levels of splits the tree can make to fit the data. More depth often leads to a higher likelihood of overfitting.
Finally, we will then print the results for each tree. The criteria used for judgment is the mean squared error. Below is the code and results
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_regressor=tree.DecisionTreeRegressor(max_depth=depth,random_state=1)
    if tree_regressor.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_regressor,X,y,scoring='neg_mean_squared_error', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 -193.55304528235052
2 -176.27520747356175
3 -209.2846723461564
4 -218.80238479654003
5 -222.4393459885871
6 -249.95330609042858
7 -286.76842138165705
8 -294.0290706405905
9 -287.39016236497804
Looks like a tree with a depth of 2 had the lowest amount of error. We can now move to tuning the hyperparameters for the adaBoost algorithm.
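Because scikit-learn reports this metric as a negated mean squared error, the best score is the one closest to zero. The short sketch below takes the scores printed in the output above, picks the best depth, and converts it to a root mean squared error, which is in the same units as wt.loss.

```python
import math

# Cross-validated scores copied from the output above (negated MSE per depth)
scores = {
    1: -193.55304528235052,
    2: -176.27520747356175,
    3: -209.2846723461564,
    4: -218.80238479654003,
    5: -222.4393459885871,
    6: -249.95330609042858,
    7: -286.76842138165705,
    8: -294.0290706405905,
    9: -287.39016236497804,
}

best_depth = max(scores, key=scores.get)   # highest (closest to zero) is best
best_mse = -scores[best_depth]             # flip the sign to get the MSE
best_rmse = math.sqrt(best_mse)            # typical size of an error

print(best_depth, round(best_mse, 2), round(best_rmse, 2))
```

The root mean squared error of roughly 13 gives a more intuitive sense of how far off the baseline tree's predictions typically are.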
Hyperparameter Tuning
For hyperparameter tuning we need to start by initiating our AdaBoostRegressor() class. Then we need to create our grid. The grid will address two hyperparameters, which are the number of estimators and the learning rate. The number of estimators tells Python how many models to make and the learning rate indicates how much each tree contributes to the overall results. There is one more parameter which is random_state, but this is just for setting the seed and never changes.
After making the grid, we need to use the GridSearchCV function to finish this process. Inside this function, you have to set the estimator, which is AdaBoostRegressor, the parameter grid which we just made, the cross-validation which we made when we created the baseline model, and the n_jobs, which allocates resources for the calculation. Below is the code.
ada=AdaBoostRegressor()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1],'random_state':[1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='neg_mean_squared_error',n_jobs=1,cv=crossvalidation)
Next, we can run the model with the desired grid in place. Below is the code for fitting the model as well as the best parameters and the score to expect when using the best parameters.
search.fit(X,y)
search.best_params_
Out[31]: {'learning_rate': 0.01, 'n_estimators': 500, 'random_state': 1}
search.best_score_
Out[32]: -164.93176650920856
The best mix of hyperparameters is a learning rate of 0.01 and 500 estimators. This mix led to a mean squared error of about 165, which is a little lower than the 176 of our single decision tree. We will see how this works when we run our model with refined hyperparameters.
AdaBoost Regression Model
Below is our model, but this time with the refined hyperparameters.
ada2=AdaBoostRegressor(n_estimators=500,learning_rate=0.001,random_state=1)
score=np.mean(cross_val_score(ada2,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1))
score
Out[36]: -174.52604137201791
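You can see the score is not as good, but it is within reason. Note that the code above sets the learning rate to 0.001 rather than the 0.01 selected by the grid search, which may account for the difference.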
Conclusion
In this post, we explored how to use the AdaBoost algorithm for regression. Employing this algorithm can sometimes strengthen a model, although, as seen here, the gain over a single decision tree may be modest.

AdaBoost Classification in Python
Boosting is a technique in machine learning in which multiple models are developed sequentially. Each new model tries to successfully predict what prior models were unable to. The average is used for regression and the majority vote for classification. For classification, boosting is commonly associated with decision trees. However, boosting can be used with any machine learning algorithm in the supervised learning context.
Since several models are being developed with aggregation, boosting is associated with ensemble learning. Ensemble is just a way of developing more than one model for machine-learning purposes. With boosting, the assumption is that the combination of several weak models can make one really strong and accurate model.
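One way to picture how each new model focuses on what earlier models got wrong is the reweighting step of the textbook binary AdaBoost algorithm, sketched below for a single round. The labels and predictions are invented, and scikit-learn's AdaBoostClassifier differs in its details, so treat this as an illustration of the idea rather than the library's implementation.

```python
import math

# Hypothetical labels and one weak model's predictions (+1 / -1)
y_true = [1, 1, -1, -1, 1, -1, 1, -1]
y_pred = [1, 1, -1, -1, -1, -1, 1, 1]   # positions 4 and 7 are wrong

weights = [1 / len(y_true)] * len(y_true)   # start with equal weights

# Weighted error of the weak model
error = sum(w for w, t, p in zip(weights, y_true, y_pred) if t != p)

# Model weight: more accurate models get a bigger say in the final vote
alpha = 0.5 * math.log((1 - error) / error)

# Raise the weight of misclassified rows, lower the rest, then renormalise
new_weights = [w * math.exp(-alpha * t * p) for w, t, p in zip(weights, y_true, y_pred)]
total = sum(new_weights)
new_weights = [w / total for w in new_weights]

print(round(error, 3), round(alpha, 3))
print([round(w, 3) for w in new_weights])
```

After the update, the two misclassified rows carry half of the total weight, so the next weak model is pushed to get them right.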
For our purposes, we will be using adaboost classification to improve the performance of a decision tree in python. We will use the cancer dataset from the pydataset library. Our goal will be to predict the status of a patient based on several independent variables. The steps of this process are as follows.
- Data preparation
- Decision tree baseline model
- Hyperparameter tuning of Adaboost model
- AdaBoost model development
Below is some initial code
from sklearn.ensemble import AdaBoostClassifier
from sklearn import tree
from sklearn.model_selection import GridSearchCV
import numpy as np
from pydataset import data
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
Data Preparation
Data preparation is minimal in this situation. We will load our data and at the same time drop any NA using the .dropna() function. In addition, we will place the independent variables in a dataframe called X and the dependent variable in a dataset called y. Below is the code.
df=data('cancer').dropna()
X=df[['time','sex','ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['status']
Decision Tree Baseline Model
We will make a decision tree just for the purposes of comparison. First, we will set the parameters for the cross-validation. Then we will use a for loop to run several different decision trees. The difference in the decision trees will be their depth. The depth is how far the tree can go in order to purify the classification. The more depth the more likely your decision tree is to overfit the data. The last thing we will do is print the results. Below is the code with the output
crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)
for depth in range(1,10):
    tree_classifier=tree.DecisionTreeClassifier(max_depth=depth,random_state=1)
    if tree_classifier.fit(X,y).tree_.max_depth<depth:
        break
    score=np.mean(cross_val_score(tree_classifier,X,y,scoring='accuracy', cv=crossvalidation,n_jobs=1))
    print(depth, score)
1 0.71875
2 0.6477941176470589
3 0.6768382352941177
4 0.6698529411764707
5 0.6584558823529412
6 0.6525735294117647
7 0.6283088235294118
8 0.6573529411764706
9 0.6577205882352941
You can see that the most accurate decision tree had a depth of 1. After that there was a general decline in accuracy.
We now can determine if the adaBoost model is better based on whether the accuracy is above 72%. Before we develop the AdaBoost model, we need to tune several hyperparameters in order to develop the most accurate model possible.
Hyperparameter Tuning AdaBoost Model
In order to tune the hyperparameters there are several things that we need to do. First we need to initiate our AdaBoostClassifier with some basic settings. Then we need to create our search grid with the hyperparameters. There are two hyperparameters that we will set and they are number of estimators (n_estimators) and the learning rate.
Number of estimators has to do with how many trees are developed. The learning rate indicates how each tree contributes to the overall results. We have to place in the grid several values for each of these. Once we set the arguments for the AdaBoostClassifier and the search grid we combine all this information into an object called search. This object uses the GridSearchCV function and includes additional arguments for scoring, n_jobs, and for cross-validation. Below is the code for all of this
ada=AdaBoostClassifier()
search_grid={'n_estimators':[500,1000,2000],'learning_rate':[.001,0.01,.1]}
search=GridSearchCV(estimator=ada,param_grid=search_grid,scoring='accuracy',n_jobs=1,cv=crossvalidation)
We can now run the model of hyperparameter tuning and see the results. The code is below.
search.fit(X,y)
search.best_params_
Out[33]: {'learning_rate': 0.01, 'n_estimators': 1000}
search.best_score_
Out[34]: 0.7425149700598802
We can see that if the learning rate is set to 0.01 and the number of estimators to 1000, we can expect an accuracy of 74%. This is superior to our baseline model.
AdaBoost Model
We can now run our AdaBoost classifier based on the recommended hyperparameters. Below is the code.
ada=AdaBoostClassifier(n_estimators=1000,learning_rate=0.01)
score=np.mean(cross_val_score(ada,X,y,scoring='accuracy',cv=crossvalidation,n_jobs=1))
score
Out[36]: 0.7415441176470589
We knew we would get around 74% and that is what we got. That is only about 2 percentage points above the baseline, but depending on the context that can be a substantial difference.
Conclusion
In this post, we looked at how to use boosting for classification. In particular, we used the AdaBoost algorithm. Boosting in general builds many models sequentially to determine the most accurate classification. Doing this will often lead to an improvement in the predictions of a model.

Hyperparameter Tuning in Python
Hyperparameters are settings you must choose yourself when developing a model. Tuning them is often one of the last steps of model development. Choosing an algorithm and determining which variables to include often come before this step.
Algorithms cannot determine hyperparameters themselves, which is why you have to do it. The problem is that the typical person has no idea what an optimal choice for a hyperparameter is. To deal with this, a range of values is often supplied and it is left to Python to determine which combination of hyperparameters is most appropriate.
In this post, we will learn how to set hyperparameters by developing a grid in Python. To do this, we will use the PSID dataset from the pydataset library. Our goal will be to classify who is married and not married based on several independent variables. The steps of this process are as follows.
- Data preparation
- Baseline model (for comparison)
- Grid development
- Revised model
Below is some initial code that includes all the libraries and classes that we need.
import pandas as pd
import numpy as np
from pydataset import data
pd.set_option('display.max_rows', 5000)
pd.set_option('display.max_columns', 5000)
pd.set_option('display.width', 10000)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
Data Preparation
The dataset PSID has several problems that we need to address.
- We need to remove all NAs
- The married variable will be converted to a dummy variable. It will simply be changed to married or not rather than all of the other possible categories.
- The educatn and kids variables have codes that are 98 and 99. These need to be removed because they do not make sense.
Below is the code that deals with all of this
df=data('PSID').dropna()
df.loc[df.married!= 'married', 'married'] = 0
df.loc[df.married== 'married','married'] = 1
df['married'] = df['married'].astype(int)
df['marry']=df.married
df.drop(df.loc[df['kids']>90].index, inplace=True)
df.drop(df.loc[df['educatn']>90].index, inplace=True)
- Line 1 loads the dataset and drops the NAs
- Lines 2-4 create our dummy variable for marriage
- Line 5 creates a new variable called marry to hold the results
- Lines 6-7 drop the values in kids and educatn that are above 90.
Below we create our X and y datasets and then are ready to make our baseline model.
X=df[['age','educatn','hours','kids','earnings']]
y=df['marry']
Baseline Model
The purpose of a baseline model is to see how much better or worse the model performs after hyperparameter tuning. We are using K Nearest Neighbors for our classification. In our example, there are 4 hyperparameters we need to set. They are as follows.
- number of neighbors
- weight of neighbors
- metric for measuring distance
- power parameter for minkowski
Below is the baseline model with the set hyperparameters. The second line shows the accuracy of the model after a k-fold cross-validation that was set to 10.
classifier=KNeighborsClassifier(n_neighbors=5,weights='uniform', metric='minkowski',p=2)
np.mean(cross_val_score(classifier,X,y,cv=10,scoring='accuracy',n_jobs=1))
0.6188104238047426
Our model has an accuracy of about 62%. We will now move to setting up our grid so we can see if tuning the hyperparameters improves the performance
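For readers who have not seen k-fold cross-validation before, the sketch below shows the basic idea of splitting row positions into 10 folds, where each fold takes one turn as the test set. The function kfold_indices is a made-up helper for illustration. It does no shuffling or stratifying, so it is simpler than what scikit-learn does for classifiers behind cv=10.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, test_idx) pairs for plain, unshuffled k-fold CV."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one more row so every row is used once
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(kfold_indices(23, 10))   # 23 rows, 10 folds (toy numbers)
for train_idx, test_idx in folds[:3]:
    print(len(train_idx), test_idx)
```

The accuracy reported by cross_val_score is one score per fold, which is why we wrap it in np.mean.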
Grid Development
The grid allows you to develop scores of models with all of the hyperparameters tuned slightly differently. In the code below, we create our grid object, and then we calculate how many models we will run
grid={'n_neighbors':range(1,13),'weights':['uniform','distance'],'metric':['manhattan','minkowski'],'p':[1,2]}
np.prod([len(grid[element]) for element in grid])
96
You can see we made a simple dictionary that has several values for each hyperparameter
- number of neighbors can be 1 to 12
- weight of neighbors can be uniform or distance
- metric can be manhattan or minkowski
- p can be 1 or 2
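If it helps to see what a grid search actually iterates over, the sketch below expands the same grid by hand with itertools.product. This is only an illustration of the idea and not how GridSearchCV is implemented internally. The variable names combos and n_fits are made up for this example.

```python
from itertools import product

grid = {
    'n_neighbors': range(1, 13),
    'weights': ['uniform', 'distance'],
    'metric': ['manhattan', 'minkowski'],
    'p': [1, 2],
}

# Every combination of one value per hyperparameter
names = list(grid)
combos = [dict(zip(names, values)) for values in product(*grid.values())]

print(len(combos))    # number of candidate models
print(combos[0])      # the first candidate

# With 10-fold cross-validation each candidate is fit 10 times
n_fits = len(combos) * 10
print(n_fits)
```

This is worth keeping in mind when adding values to a grid, since the number of fits grows multiplicatively.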
We will develop 96 models all together. Below is the code to begin tuning the hyperparameters.
search=GridSearchCV(estimator=classifier,param_grid=grid,scoring='accuracy',n_jobs=1,refit=True,cv=10)
search.fit(X,y)
The estimator is the code for the type of algorithm we are using. We set this earlier. The param_grid is our grid. Accuracy is our metric for determining the best model. n_jobs has to do with the amount of resources committed to the process. refit=True retrains the best model on the full dataset once the search is done, and cv is the number of cross-validation folds. The search.fit command runs the model.
The code below provides the output for the results.
print(search.best_params_)
print(search.best_score_)
{'metric': 'manhattan', 'n_neighbors': 11, 'p': 1, 'weights': 'uniform'}
0.6503975265017667
The best_params_ attribute tells us what the most appropriate parameters are. The best_score_ attribute tells us the accuracy of the model with the best parameters. Our model accuracy improves from 62% to 65% by adjusting the hyperparameters. We can confirm this by running our revised model with the updated hyperparameters.
Model Revision
Below is the code for the revised model.
classifier2=KNeighborsClassifier(n_neighbors=11,weights='uniform', metric='manhattan',p=1)
np.mean(cross_val_score(classifier2,X,y,cv=10,scoring='accuracy',n_jobs=1)) #new res
Out[24]: 0.6503909993913031
Exactly as we thought. This is a small improvement, but it can make a big difference in some situations, such as in a data science competition.
Conclusion
Tuning hyperparameters is one of the final pieces of improving a model. With this tool, small, gradual improvements can be seen in a model. It is important to keep this aspect of model development in mind in order to get the best final results.

Random Forest Regression in Python
Random forest is simply the making of dozens if not thousands of decision trees. For classification, the decisions each tree makes about an example are tallied as votes, with the class that receives the most votes winning. For regression, the results of the trees are averaged in order to give the most accurate results.
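The difference between voting and averaging can be sketched in a few lines of plain Python. The two helper functions and the toy tree outputs below are made up for illustration. A real forest also has to grow the trees, which is skipped here.

```python
from collections import Counter
from statistics import mean

def forest_classify(tree_votes):
    """Classification: the class with the most votes wins."""
    return Counter(tree_votes).most_common(1)[0][0]

def forest_regress(tree_predictions):
    """Regression: the trees' numeric predictions are averaged."""
    return mean(tree_predictions)

label = forest_classify(['yes', 'no', 'yes', 'yes', 'no'])   # five trees vote
age = forest_regress([61, 58, 66, 63])                       # four trees predict an age
print(label, age)
```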
In this post, we will use the cancer dataset from the pydataset module to predict the age of people. Below is some initial code.
import pandas as pd
import numpy as np
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
We can load our dataset as df, drop all NAs, and create our dataset that contains the independent variables and a separate dataset that includes the dependent variable of age. The code is below
df = data('cancer')
df=df.dropna()
X=df[['time','status',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
y=df['age']
Next, we need to set up our train and test sets using a 70/30 split. After that, we set up our model using the RandomForestRegressor function. n_estimators is the number of trees we want to create and the random_state argument is for supporting reproducibility. The code is below
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
h=RandomForestRegressor(n_estimators=100,random_state=1)
We can now run our model and test it. Running the model requires the .fit() function and testing involves the .predict() function. The results of the test are found using the mean_squared_error() function.
h.fit(x_train,y_train)
y_pred=h.predict(x_test)
mean_squared_error(y_test,y_pred)
71.75780196078432
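As a reminder of what this number is, the sketch below computes the mean squared error by hand on a few made-up ages. The function name and the toy values are assumptions for illustration only. Taking the square root puts the error back into the units of the target, which for the result above works out to roughly 8.5 years of age.

```python
import math

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

y_true = [60, 70, 55, 65]     # made-up actual ages
y_pred = [62, 66, 58, 65]     # made-up predictions
mse = mean_squared_error_manual(y_true, y_pred)
rmse = math.sqrt(mse)
print(mse, rmse)

# Square root of the MSE reported for the random forest model
reported_rmse = math.sqrt(71.75780196078432)
print(reported_rmse)
```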
The MSE of 71.76 is only useful for model comparison and has little meaning by itself. Another way to assess the model is by determining variable importance. This helps you to determine in a descriptive way the strongest variables for the regression model. The code is below followed by the plot of the variables.
model_ranks=pd.Series(h.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
ax=model_ranks.plot(kind='barh')
As you can see, the strongest predictors of age include calories per meal, weight loss, and time sick. Sex and whether the person is censored or dead make a smaller difference. This makes sense as younger people eat more and probably lose more weight because they are heavier initially when dealing with cancer.
Conclusion
This post provided an example of the use of regression with random forest. By averaging the predictions of many trees, you can often improve the accuracy of your models. This ensemble approach is a strength that single-model algorithms lack.

Bagging Classification with Python
Bootstrap aggregation, aka bagging, is a technique used in machine learning that relies on resampling from the sample and running multiple models on the different samples. The mean or some other value is calculated from the results of each model. For example, if you are using decision trees, bagging would have you run the model several times with several different subsamples to help deal with variance.
Bagging is an excellent tool for algorithms that are considered weaker or more susceptible to variances such as decision trees or KNN. In this post, we will use bagging to develop a model that determines whether or not people voted using the turnout dataset. These results will then be compared to a model that was developed in a traditional way.
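Before using the library version, it can help to see the two halves of bagging, the bootstrap and the aggregation, written out by hand. The sketch below uses a one-dimensional nearest-neighbor rule as the weak model and a tiny made-up dataset. The function names, the seed, and the number of estimators are all assumptions for illustration, not what BaggingClassifier does internally.

```python
import random
from collections import Counter

def one_nn_predict(train, x):
    """Label of the training point closest to x; train is a list of (value, label)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def bagged_predict(train, x, n_estimators=25, seed=0):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap: draw a sample of the same size, with replacement
        sample = [rng.choice(train) for _ in train]
        # Each resampled model casts one vote
        votes.append(one_nn_predict(sample, x))
    # Aggregate: majority vote across all the models
    return Counter(votes).most_common(1)[0][0]

# Toy data: small values belong to class 0, large values to class 1
train = [(x, 0) for x in (1, 2, 3, 4, 5)] + [(x, 1) for x in (11, 12, 13, 14, 15)]
low = bagged_predict(train, 2)
high = bagged_predict(train, 14)
print(low, high)
```

Any single resampled model can be unlucky, but the vote across many of them is much more stable, which is the whole appeal of bagging.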
We will use the turnout dataset available in the pydataset module. Below is some initial code.
from pydataset import data
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import classification_report
We will load our dataset. Then we will separate the independent and dependent variables from each other and create our train and test sets. The code is below.
df=data("turnout")
X=df[['age','educate','income',]]
y=df['vote']
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
We can now prepare to run our model. We need to first set up the bagging function. There are several arguments that need to be set. The max_samples argument determines the largest amount of the dataset to use in resampling. The max_features argument is the max number of features to use in a sample. Lastly, the n_estimators is for determining the number of subsamples to draw. The code is as follows
h=BaggingClassifier(KNeighborsClassifier(n_neighbors=7),max_samples=0.7,max_features=0.7,n_estimators=1000)
Basically, what we told Python was to use up to 70% of the samples, 70% of the features, and make 1,000 different KNN models that use seven neighbors to classify. Now we run the model with the fit function, make a prediction with the predict function, and check the accuracy with the classification_report function.
h.fit(X_train,y_train)
y_pred=h.predict(X_test)
print(classification_report(y_test,y_pred))
This looks okay. Below are the results when you build a traditional model without bagging.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
The improvement is not much. However, this depends on the purpose and scale of your project. A small improvement can mean millions in the right context, such as at large companies like Google that deal with billions of people per day.
Conclusion
This post provides an example of the use of bagging in the context of classification. Bagging provides a way to improve your model through the use of resampling.

K Nearest Neighbor Classification with Python
K Nearest Neighbor uses the idea of proximity to predict class. What this means is that with KNN, Python will look at the K nearest neighbors to determine what an unknown example's class should be. It is your job to determine the K, or number of neighbors, that should be used to determine the unlabeled example's class.
KNN is great for a small dataset. However, it normally does not scale well when the dataset gets larger and larger. As such, unless you have an exceptionally powerful computer KNN is probably not going to do well in a Big Data context.
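The proximity idea is simple enough to write out by hand. Below is a minimal sketch of KNN in plain Python using Euclidean distance. The function knn_predict and the toy points are made up for illustration, and a library implementation is far more efficient than this. Note that every prediction measures the distance to every training row, which is one reason KNN slows down as the dataset grows.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k):
    """Predict the class of `query` from the labels of its k closest rows."""
    # Distance from the query to every training row, paired with that row's label
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    # Keep the labels of the k nearest rows and take a majority vote
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy data: one cluster near (1, 1) labeled 0, another near (7, 7) labeled 1
train_X = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7), (7, 7)]
train_y = [0, 0, 0, 1, 1, 1, 1]

near_first = knn_predict(train_X, train_y, (2, 2), k=3)
near_second = knn_predict(train_X, train_y, (6, 5), k=3)
print(near_first, near_second)
```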
In this post, we will go through an example of the use of KNN with the turnout dataset from the pydataset module. We want to predict whether someone voted or not based on the independent variables. Below is some initial code.
from pydataset import data
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
We now need to load the data and separate the independent variables from the dependent variable by making two datasets.
df=data("turnout")
X=df[['age','educate','income']]
y=df['vote']
Next, we will make our train and test sets with a 70/30 split. The random_state is set to 0. This argument allows us to reproduce our model if we want. After this, we will set the K to 7 and run our model. This means that Python will look at the 7 closest examples to predict the value of an unknown example. Below is the code.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=KNeighborsClassifier(7)
clf.fit(X_train,y_train)
We can now predict with our model and see the results with the classification_report function.
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
The results are shown above. Judging the quality of the model relies more on domain knowledge. What we can say for now is that the model is better at classifying people who vote than people who do not vote.
Conclusion
This post shows you how to work with Python when using KNN. This algorithm is useful for using neighboring examples to predict the class of an unknown example.

Naive Bayes with Python
Naive Bayes is a probabilistic classifier that is often employed when you have more than two classes in which you want to place your data. This algorithm is used in particular when you are dealing with text classification with large datasets and many features.
If you are familiar with statistics, you know that Bayes developed a method of probability that is highly influential today. In short, his system takes conditional probability into account. In the case of naive Bayes, the classifier assumes that the presence of a certain feature in a class is not related to the presence of any other feature. This assumption is why Naive Bayes is naive.
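To show what the independence assumption buys us, here is a small Gaussian naive Bayes sketch in plain Python. Because features are treated as unrelated within a class, the score for a class is just the log of its prior plus one separate log-likelihood per feature. The function names, the class labels "a" and "b", and the toy rows are all made up for illustration and are not taken from the dataset used below.

```python
import math
from statistics import mean, pvariance

def fit_gaussian_nb(rows, labels):
    """Store the prior and the per-feature mean and variance for each class."""
    model = {}
    for label in set(labels):
        members = [r for r, l in zip(rows, labels) if l == label]
        prior = len(members) / len(rows)
        stats = [(mean(col), pvariance(col)) for col in zip(*members)]
        model[label] = (prior, stats)
    return model

def log_gaussian(x, mu, var):
    """Log of the normal density with mean mu and variance var at x."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def predict(model, row):
    scores = {}
    for label, (prior, stats) in model.items():
        # Naive step: per-feature log-likelihoods are simply added together
        scores[label] = math.log(prior) + sum(
            log_gaussian(x, mu, var) for x, (mu, var) in zip(row, stats)
        )
    return max(scores, key=scores.get)

# Toy rows of (age, income) with hypothetical class labels
rows = [(25, 1.0), (30, 1.5), (35, 2.0), (60, 0.4), (65, 0.6), (70, 0.5)]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
model = fit_gaussian_nb(rows, labels)
young = predict(model, (28, 1.4))
old = predict(model, (66, 0.5))
print(young, old)
```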
For our purposes, we will use Naive Bayes to predict the type of insurance a person has in the DoctorAUS dataset in the pydataset module. Below is some initial code.
from pydataset import data
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Next, we will load our dataset DoctorAUS. Then we will separate the independent variables that we will use from the dependent variable of insurance in two different datasets. If you want to know more about the dataset and the variables you can type data("DoctorAUS", show_doc=True).
df=data("DoctorAUS")
X=df[['age','income','sex','illness','actdays','hscore','doctorco','nondocco','hospadmi','hospdays','medecine','prescrib']]
y=df['insurance']
Now, we will create our train and test datasets. We will do a 70/30 split. We will also use Gaussian Naive Bayes as our algorithm. This algorithm assumes the data is normally distributed. There are other algorithms available for Naive Bayes as well. We will also create our model with the .fit function.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
clf=GaussianNB()
clf.fit(X_train,y_train)
Finally, we will predict with our model and run the classification report to determine the success of the model.
y_pred=clf.predict(X_test)
print(classification_report(y_test,y_pred))
You can see that our overall numbers are not that great. This means that the current algorithm is probably not the best choice for classification. Of course, there could be other problems as well that need to be explored.
Conclusion
This post was simply a demonstration of how to conduct an analysis with Naive Bayes using Python. The process is not all that complicated and is similar to other algorithms that are used.

Support Vector Machines Regression with Python
This post will provide an example of how to do regression with support vector machines (SVM). SVM is a complex algorithm that allows for the development of non-linear models. This is particularly useful for messy data that does not have clear boundaries.
The steps that we will use are listed below
- Data preparation
- Model Development
We will use two different estimators in our analysis: LinearSVR, which fits a linear model, and SVR, which uses the rbf kernel by default. The difference between the two has to do with slight changes in how the regression function is calculated.
Data Preparation
We are going to use the OFP dataset available in the pydataset module. This dataset was used previously for classification with SVM on this site. Our plan this time is to predict family income (faminc), which is a continuous variable. Below is some initial code.
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn import model_selection
from statsmodels.tools.eval_measures import mse
We now need to load our dataset and remove any missing values.
df=pd.DataFrame(data('OFP'))
df=df.dropna()
As in the previous post, we need to change the text variables into dummy variables and we also need to scale the data. The code below creates the dummy variables, removes variables that are not needed, and also scales the data.
dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)
dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)
dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)
dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)
dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df = (df - df.min()) / (df.max() - df.min())
df.head()
We now need to set up our datasets. The X dataset will contain the independent variables while the y dataset will contain the dependent variable
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','single','black_person','Male','job','insured']]
y=df['faminc']
We can now move to model development
Model Development
We now need to create our train and test sets for our X and y datasets. We will do a 70/30 split of the data. Below is the code.
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
Next, we will create our two models with the code below.
h1=svm.SVR()
h2=svm.LinearSVR()
We will now run our first model and assess the results. Our metric is the mean squared error. Generally, the lower the number the better. We will use the .fit() function to train the model and the .predict() function to test the model.
The mse was 0.27. This number means nothing on its own and is only useful for comparison. Therefore, the second model will be judged as better or worse depending on whether its mse is lower than 0.27. Below are the results of the second model.
We can see that the mse for our second model is 0.34 which is greater than the mse for the first model. This indicates that the first model is superior based on the current results and parameter settings.
Conclusion
This post provided an example of how to use SVM for regression.

Support Vector Machines Classification with Python
Support vector machines (SVM) is an algorithm used to fit non-linear models. The details are complex but to put it simply SVM tries to create the largest boundaries possible between the various groups it identifies in the sample. The mathematics behind this is complex especially if you are unaware of what a vector is as defined in algebra.
This post will provide an example of SVM using Python broken into the following steps.
- Data preparation
- Model Development
We will use two different kernels in our analysis: the linear kernel and the rbf kernel. The difference in terms of kernels has to do with how the boundaries between the different groups are made.
Data Preparation
We are going to use the OFP dataset available in the pydataset module. We want to predict if someone is single or not. Below is some initial code.
import numpy as np
import pandas as pd
from pydataset import data
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn import model_selection
We now need to load our dataset and remove any missing values.
df=pd.DataFrame(data('OFP'))
df=df.dropna()
df.head()
Looking at the dataset we need to do something with the variables that have text. We will create dummy variables for all except region and hlth. The code is below.
dummy=pd.get_dummies(df['black'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "black_person"})
df=df.drop('no', axis=1)
dummy=pd.get_dummies(df['sex'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"male": "Male"})
df=df.drop('female', axis=1)
dummy=pd.get_dummies(df['employed'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "job"})
df=df.drop('no', axis=1)
dummy=pd.get_dummies(df['maried'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"no": "single"})
df=df.drop('yes', axis=1)
dummy=pd.get_dummies(df['privins'])
df=pd.concat([df,dummy],axis=1)
df=df.rename(index=str, columns={"yes": "insured"})
df=df.drop('no', axis=1)
For each variable, we did the following
- Created a dummy in the dummy dataset
- Combined the dummy variable with our df dataset
- Renamed the dummy variable based on yes or no
- Dropped the other dummy variable from the dataset. Python creates two dummies instead of one.
If you look at the dataset now you will see a lot of variables that are not necessary. Below is the code to remove the information we do not need.
df=df.drop(['black','sex','maried','employed','privins','medicaid','region','hlth'],axis=1)
df.head()
This is much cleaner. Now we need to scale the data. This is because SVM is sensitive to scale. The code for doing this is below.
df = (df - df.min()) / (df.max() - df.min())
df.head()
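The scaling used here is min-max scaling, which squeezes every column into the range 0 to 1 so that no variable dominates just because of its units. The sketch below applies the same formula to a single made-up list of ages. The function name and the values are assumptions for illustration. One caveat is that a column with a single repeated value would cause a division by zero.

```python
def min_max_scale(values):
    """Rescale values so the smallest becomes 0 and the largest becomes 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

ages = [66, 74, 70, 82]          # made-up ages
scaled = min_max_scale(ages)
print(scaled)
```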
We can now create our dataset with the independent variables and a separate dataset with our dependent variable. The code is as follows.
X=df[['ofp','ofnp','opp','opnp','emr','hosp','numchron','adldiff','age','school','faminc','black_person','Male','job','insured']]
y=df['single']
We can now move to model development
Model Development
We need to make our test and train sets first. We will use a 70/30 split.
X_train,X_test,y_train,y_test=model_selection.train_test_split(X,y,test_size=.3,random_state=1)
Now, we need to create the models or the hypothesis we want to test. We will create two hypotheses. The first model is using a linear kernel and the second is one using the rbf kernel. For each of these kernels, there are hyperparameters that need to be set which you will see in the code below.
h1=svm.LinearSVC(C=1)
h2=svm.SVC(kernel='rbf',degree=3,gamma=0.001,C=1.0)
The details about the hyperparameters are beyond the scope of this post. Below are the results for the first model.
The overall accuracy is 73%. The crosstab() function provides a breakdown of the results and the classification_report() function provides other metrics related to classification. In this situation, 0 means not single or married while 1 means single. Below are the results for model 2
You can see the results are similar, with the first model having a slight edge. The second model really struggles with predicting people who are actually single. You can see that the recall in particular is really poor.
Conclusion
This post showed how to use SVM in Python. How this algorithm works can be somewhat confusing. However, it can be powerful if used appropriately.

Linear Discriminant Analysis in Python
Linear discriminant analysis (LDA) is a classification algorithm commonly used in data science. In this post, we will learn how to use LDA with Python. The steps we will take are as follows.
- Data preparation
- Model training and evaluation
Data Preparation
We will be using the bioChemists dataset which comes from the pydataset module. We want to predict whether someone is married or single based on academic output and prestige. Below is some initial code.
import pandas as pd
from pydataset import data
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
Now we will load our data and take a quick look at it using the .head() function.
There are two variables that contain text, so we need to convert these to dummy variables for our analysis. The code is below with the output.
Here is what we did.
- We created the dummy variable by using the .get_dummies() function.
- We saved the output in an object called dummy
- We then combine the dummy and df dataset with the .concat() function
- We repeat this process for the second variable
The output shows that we have our original variables and the dummy variables. However, we do not need all of this information. Therefore, we will create a dataset that has the X variables we will use and a separate dataset that will have our y values. Below is the code.
X=df[['Men','kid5','phd','ment','art']]
y=df['Married']
The X dataset has our five independent variables and the y dataset has our dependent variable, which is married or not. We can now split our data into a train and test set. The code is below.
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=0)
The data was split 70% for training and 30% for testing. We made a train and test set for the independent and dependent variables which meant we made 4 sets altogether. We can now proceed to model development and testing
Model Training and Testing
Below is the code to run our LDA model. We will use the .fit() function for this.
clf=LDA()
clf.fit(X_train,y_train)
We will now use this model to predict using the .predict function
y_pred=clf.predict(X_test)
Now for the results, we will use the classification_report function to get all of the metrics associated with a confusion matrix.
The interpretation of this information is described in another place. For our purposes, we have an accuracy of 71% for our prediction. Below is a visual of our model using the ROC curve.
Here is what we did
- We had to calculate the roc_curve for the model. This is explained in detail elsewhere.
- Next, we plotted our own curve and compared it to a baseline curve, which is the dotted line.
An area under the ROC curve (AUC) of 0.67 is considered fair by many. Our classification model is not that great, but there are worse models out there.
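One way to read the area under the curve is as the chance that a randomly picked positive example gets a higher score from the model than a randomly picked negative one. The sketch below computes it that way on four made-up scores. The function auc_from_scores and the toy values are assumptions for illustration, and this pairwise approach is only practical for small datasets.

```python
def auc_from_scores(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1        # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]                 # made-up labels
scores = [0.1, 0.4, 0.35, 0.8]        # made-up predicted probabilities
auc = auc_from_scores(y_true, scores)
print(auc)
```

A value of 0.5 means the model ranks no better than chance, which is what the baseline line on the plot represents.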
Conclusion
This post went through an example of developing and evaluating a linear discriminant model. To do this you need to prepare the data, train the model, and evaluate.

Factor Analysis in Python
Factor analysis (FA) is a dimensionality reduction technique commonly used in statistics. FA is similar to principal component analysis. The differences are highly technical but include the fact that FA does not have an orthogonal decomposition and that FA assumes there are latent variables influencing the observed variables in the model. For FA the goal is the explanation of the covariance among the observed variables present.
Our purpose here will be to use the bioChemists dataset from the pydataset module and perform an FA that creates two components. This dataset has data on people completing PhDs and their mentors. We will also create a visual of our two-factor solution. Below is some initial code.
import pandas as pd
from pydataset import data
from sklearn.decomposition import FactorAnalysis
import matplotlib.pyplot as plt
We now need to prepare the dataset. The code is below
df = data('bioChemists')
df=df.iloc[1:250]
X=df[['art','kid5','phd','ment']]
In the code above, we did the following
- The first line creates our dataframe called "df" and is made up of the dataset bioChemists
- The second line reduces the df to 249 rows, since iloc[1:250] skips the first row and stops before position 250. This is done for the visual that we will make. To take the whole dataset and graph it would make a giant blob of color that would be hard to interpret.
- The last line pulls the variables we want to use for our analysis. The meaning of these variables can be found by typing data("bioChemists",show_doc=True)
In the code below we need to set the number of factors we want and then run the model.
fact_2c=FactorAnalysis(n_components=2)
X_factor=fact_2c.fit_transform(X)
The first line tells Python how many factors we want. The second line takes this information along with our revised dataset X to create the actual factors that we want. We can now make our visualization.
To make the visualization requires several steps. We want to identify how well the two components separate students who are married from students who are not married. First, we need to make a dictionary that can be used to convert the single or married status to a number. Below is the code.
thisdict = {"Single": 1, "Married": 2}
Now we are ready to make our plot. The code is below. Pay close attention to the ‘c’ argument as it uses our dictionary.
plt.scatter(X_factor[:,0],X_factor[:,1],c=df.mar.map(thisdict),alpha=.8,edgecolors='none')
You can perhaps tell why we created the dictionary now. By mapping the dictionary to the mar variable it automatically changed every single and married entry in the df dataset to a 1 or 2. The c argument needs a number in order to set a color and this is what the dictionary was able to supply it with.
You can see that two factors do not do a good job of separating the people by their marital status. Additional factors may be useful but after two factors it becomes impossible to visualize them.
Conclusion
This post provided an example of factor analysis in Python. Here the focus was primarily on visualization but there are so many other ways in which factor analysis can be deployed.

Analyzing Twitter Data in Python
In this post, we will look at how to analyze text from Twitter. We will do each of the following for tweets that refer to Donald Trump and tweets that refer to Barack Obama.
- Conduct a sentiment analysis
- Create a word cloud
This is a somewhat complex analysis so I am assuming that you are familiar with Python as explaining everything would make the post much too long. In order to achieve our two objectives above we need to do the following.
- Obtain all of the necessary information from your twitter apps account
- Download the tweets & clean
- Perform the analysis
Before we begin, here is a list of modules we will need to load to complete our analysis
import wordcloud
import matplotlib.pyplot as plt
import twython
import re
import numpy
Obtain all Needed Information
From your twitter app account, you need the following information
- App key
- App key secret
- Access token
- Access token secret
All this information needs to be stored in individual objects in Python. Then each individual object needs to be combined into one object. The code is below.
TWITTER_APP_KEY='XXXXXXXXXXXXXXXXXXXXXXXXXX'
TWITTER_APP_KEY_SECRET='XXXXXXXXXXXXXXXXXXX'
TWITTER_ACCESS_TOKEN='XXXXXXXXXXXXXXXXXXXXX'
TWITTER_ACCESS_TOKEN_SECRET='XXXXXXXXXXXXXX'
t=twython.Twython(app_key=TWITTER_APP_KEY,app_secret=TWITTER_APP_KEY_SECRET,oauth_token=TWITTER_ACCESS_TOKEN,oauth_token_secret=TWITTER_ACCESS_TOKEN_SECRET)
In the code above we saved all the information in different objects at first and then combined them. You will of course replace the XXXXXXX with your own information.
Next, we need to create a function that will pull the tweets from Twitter. Below is the code.
def get_tweets(twython_object,query,n):
    count=0
    result_generator=twython_object.cursor(twython_object.search,q=query)
    result_set=[]
    for r in result_generator:
        result_set.append(r['text'])
        count+=1
        if count ==n: break
    return result_set
The function loops through the search results, saves the text of each tweet in a list, and stops once n tweets have been collected. We can now download the tweets.
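The counting loop in get_tweets follows a common pattern: take the first n items from a generator and then stop. Below is a minimal sketch of that pattern using itertools.islice. The generator `fake_search_results` and the helper `take_texts` are hypothetical stand-ins so the example runs without a Twitter account; they are not part of twython.

```python
from itertools import islice

def fake_search_results():
    """Hypothetical stand-in for the Twitter cursor: yields tweet-like dicts forever."""
    i = 0
    while True:
        yield {"text": f"tweet number {i}"}
        i += 1

def take_texts(results, n):
    """Collect the 'text' field of the first n results, then stop."""
    return [r["text"] for r in islice(results, n)]

texts = take_texts(fake_search_results(), 3)
print(texts)
```

Stopping early matters because the result generator can keep producing tweets long after you have as many as you need.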
Downloading Tweets & Clean
Downloading the tweets involves making an empty dictionary that we can save our information in. We need two keys in our dictionary one for Trump and the other for Obama because we are downloading tweets about these two people.
There are also two additional things we need to do. We need to use regular expressions to get rid of punctuation and we also need to lower case all words. All this is done in the code below.
tweets={}
tweets['trump']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower()) for tweet in get_tweets(t,'#trump',1500)]
tweets['obama']=[re.sub(r'[-.#/?!.":;()\']',' ',tweet.lower()) for tweet in get_tweets(t,'#obama',1500)]
The get_tweets function is also used in the code above along with our Twitter app information. We pulled 1500 tweets concerning Obama and 1500 tweets about Trump. We were able to download and clean our tweets at the same time. We can now do our analysis.
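To see what the cleaning step does to a single tweet, here is a small sketch that applies the same regular expression to one made-up tweet. The helper name `clean_tweet` and the sample text are assumptions for illustration only.

```python
import re

# Same character class as in the post: each listed punctuation mark becomes a space.
PUNCTUATION = r'[-.#/?!.":;()\']'

def clean_tweet(tweet):
    """Lower-case a tweet and replace the listed punctuation with spaces."""
    return re.sub(PUNCTUATION, ' ', tweet.lower())

sample = 'Great day! #Trump (really).'  # hypothetical tweet
cleaned = clean_tweet(sample)
print(cleaned.split())
```

Note that the pattern does not include the comma, so a word followed directly by a comma keeps it and would not match an entry in a word list.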
Analysis
To do the sentiment analysis you need dictionaries of positive and negative words. The ones in this post were taken from GitHub. Below is the code for loading them into Python.
positive_words=open('XXXXXXXXXXXX').read().split('\n')
negative_words=open('XXXXXXXXXXXX').read().split('\n')
We now will make a function to calculate the sentiment
def sentiment_score(text,pos_list,neg_list):
    positive_score=0
    negative_score=0
    for w in text.split(' '):
        if w in pos_list:positive_score+=1
        if w in neg_list:negative_score+=1
    return positive_score-negative_score
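Here is a small worked example of the same scoring idea, restated so it runs on its own. The two word sets are tiny hypothetical lists, not the dictionaries used in this post, and they are stored as sets only because membership checks on sets are faster than on long lists.

```python
def sentiment_score(text, pos_words, neg_words):
    """Count positive words minus negative words in a space-separated text."""
    score = 0
    for word in text.split():
        if word in pos_words:
            score += 1
        if word in neg_words:
            score -= 1
    return score

# Hypothetical word lists; the post loads much larger ones from files.
positive_words = {"great", "good", "win"}
negative_words = {"bad", "sad"}

# Two positive words ("great", "win") and one negative word ("sad").
print(sentiment_score("great win but sad news", positive_words, negative_words))
```

A score above zero means more positive than negative words were found, and a score of zero means they balanced out or none were found.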
Now we create an empty dictionary and run the analysis for Trump and then for Obama
tweets_sentiment={}
tweets_sentiment['trump']=[sentiment_score(tweet,positive_words,negative_words) for tweet in tweets['trump']]
tweets_sentiment['obama']=[sentiment_score(tweet,positive_words,negative_words) for tweet in tweets['obama']]
Now we can make visuals of our results with the code below
trump=plt.hist(tweets_sentiment['trump'],5)
obama=plt.hist(tweets_sentiment['obama'],5)
Obama is on the left and Trump is on the right. It seems that Trump tweets are consistently more positive. Below are the means for both.
numpy.mean(tweets_sentiment['trump'])
Out[133]: 0.36363636363636365
numpy.mean(tweets_sentiment['obama'])
Out[134]: 0.2222222222222222
Trump tweets are slightly more positive than Obama tweets. Below is the code for the Trump word cloud
Here is the code for the Obama word cloud
A lot of speculation can be drawn from the word clouds and sentiment analysis. However, the results will change every single time because of the dynamic nature of Twitter. People are always posting tweets, which changes the results.
Conclusion
This post provided an example of how to download and analyze tweets from twitter. It is important to develop a clear idea of what you want to know before attempting this sort of analysis as it is easy to become confused and not accomplish anything.

KMeans Clustering in Python
Kmeans clustering is a technique in which the examples in a dataset are divided into groups through segmentation. The segmentation relies on statistical analysis in which examples within a group are more similar to each other than to examples outside of the group.
The application of this is that it provides the analyst with various groups that have similar characteristics, which can be used to tailor services in various industries such as business or education. In this post, we will look at how to do this using Python. We will use the steps below to complete this process.
- Data preparation
- Determine the number of clusters
- Conduct analysis
Data Preparation
Our data for this example comes from the sat.act dataset available in the pydataset module. Below is some initial code.
import pandas as pd
from pydataset import data
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
import numpy as np
import matplotlib.pyplot as plt
We will now load our dataset and drop any NAs that may be present.
You can see there are six variables that will be used for the clustering. Next, we will turn to determining the number of clusters.
Determine the Number of Clusters
Before you can actually do a kmeans analysis you must specify the number of clusters. This can be tricky as there is no single way to determine this. For our purposes, we will use the elbow method.
The elbow method measures the within-cluster error for each number of clusters. As the number of clusters increases this error decreases. However, at a certain point the return on adding clusters becomes minimal and this is known as the elbow. Below is the code to calculate this.
distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k).fit(df)
    distortions.append(sum(np.min(cdist(df, kmeanModel.cluster_centers_, 'euclidean'), axis=1)) / df.shape[0])
Here is what we did
- We made an empty list called ‘distortions’ where we will save our results.
- In the second line, we told Python the range of clusters we want to consider. Because range(1,10) stops before 10, we consider anywhere from 1 to 9 clusters.
- In lines 3 and 4, we use a for loop to fit a KMeans model to the df object for each number of clusters.
- In line 5, we compute the average distance from each example to its nearest cluster center and save it in the distortions list.
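The distortion value saved in the loop can be sketched in plain Python as the average distance from each point to its nearest center. In the sketch below the points and the centers are made up and the centers are set by hand rather than fitted by KMeans, so it only illustrates the calculation, not the clustering itself.

```python
import math

def distortion(points, centers):
    """Average distance from each point to its nearest center."""
    nearest = [min(math.dist(p, c) for c in centers) for p in points]
    return sum(nearest) / len(points)

# Hypothetical 2-D data with two obvious groups.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]

one_center = [(5, 1)]             # hand-picked center for k = 1
two_centers = [(0, 1), (10, 1)]   # hand-picked centers for k = 2

print(distortion(points, one_center))
print(distortion(points, two_centers))
```

The drop in distortion from one center to two is large here because the data really has two groups. Adding a third center would help far less, which is the kind of flattening the elbow plot is meant to reveal.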
Below is a visual to determine the number of clusters
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
The graph indicates that 3 clusters are sufficient for this dataset. We can now perform the actual kmeans clustering.
KMeans Analysis
The code for the kmeans analysis is as follows
km=KMeans(3,init='k-means++',random_state=3425)
km.fit(df.values)
- We use the KMeans function and tell Python the number of clusters and the type of initialization, and we set the seed with the random_state argument. All this is saved in the object called km.
- The km object has the .fit function used on it with df.values
Next, we will predict with the predict function and look at the first few lines of the modified df with the .head() function.
You can see we created a new variable called predict. This variable contains the kmeans algorithm prediction of which group each example belongs to. We then printed the first five values as an illustration. Below are the descriptive statistics for the three clusters that were produced for the variables in the dataset.
It is clear that the clusters are mainly divided based on the performance on the various tests used. In the last piece of code, gender is used. 1 represents male and 2 represents female.
We will now make a visual of the clusters using two dimensions. First, we need to make a map of the clusters that is saved as a dictionary. Then we will create a new variable in which we take the numerical value of each cluster and convert it to a string in our cluster map dictionary.
clust_map={0:'Weak',1:'Average',2:'Strong'}
df['perf']=df.predict.map(clust_map)
Next, we make a different dictionary to color the points in our graph.
d_color={'Weak':'y','Average':'r','Strong':'g'}
Here is what is happening in the code above.
- We set the ax object to a value.
- A for loop is used to go through every example in clust_map.values so that they are colored according to the color dictionary
- Lastly, a plot is called which lines up the perf and clust values for color.
The groups are clearly separated when looking at them in two dimensions.
Conclusion
Kmeans is a form of unsupervised learning in which there is no dependent variable which you can use to assess the accuracy of the classification or the reduction of error in regression. As such, it can be difficult to know how well the algorithm did with the data. Despite this, kmeans is commonly used in situations in which people are trying to understand the data rather than predict.

Random Forest in Python
This post will provide a demonstration of the use of the random forest algorithm in python. Random forest is similar to decision trees except that instead of one tree a multitude of trees are grown to make predictions. The various trees all vote in terms of how to classify an example and majority vote is normally the winner. Through making many trees the accuracy of the model normally improves.
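The voting idea can be sketched in a few lines of plain Python. The votes below are made up, and the helper `majority_vote` is only an illustration of the concept. As far as I am aware, scikit-learn's random forest combines trees by averaging their predicted class probabilities rather than counting hard votes, so treat this as the textbook picture rather than the library's exact procedure.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical votes from five trees for one patient.
tree_votes = ["dead", "censored", "dead", "dead", "censored"]
print(majority_vote(tree_votes))
```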
The steps are as follows for the use of random forest
- Data preparation
- Model development & evaluation
- Model comparison
- Determine variable importance
Data Preparation
We will use the cancer dataset from the pydataset module. We want to predict if someone is censored or dead in the status variable. The other variables will be used as predictors. Below is some code that contains all of the modules we will use.
import pandas as pd
import sklearn.ensemble as sk
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
import matplotlib.pyplot as plt
We will now load our cancer data into an object called ‘df’. Then we will remove all NAs using the .dropna() function. Below is the code.
df = data('cancer')
df=df.dropna()
We now need to make two datasets. One dataset, called X, will contain all of the predictor variables. Another dataset, called y, will contain the outcome variable. In the y dataset, we need to change the numerical values to a string. This will make interpretation easier as we will not need to look up what the numbers represent. Below is the code.
X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']
Instead of 1 we now have the string “censored” and instead of 2 we now have the string “dead” in the status variable. The final step is to set up our train and test sets. We will do a 70/30 split. We will have a train set for the X and y dataset as well as a test set for the X and y datasets. This means we will have four datasets in all. Below is the code.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
We are now ready to move to model development
Model Development and Evaluation
We now need to create our classifier and fit the data to it. This is done with the following code.
clf=sk.RandomForestClassifier(n_estimators=100)
clf=clf.fit(x_train,y_train)
The clf object has our random forest algorithm. The number of estimators is set to 100. This is the number of trees that will be generated. In the second line of code, we use the .fit function and use the training datasets x and y.
We now will test our model and evaluate it. To do this we will use the .predict() function with the test dataset. Then we will make a confusion matrix followed by common metrics in classification. Below is the code and the output.
You can see that our model is good at predicting who is dead but struggles with predicting who is censored. The metrics are reasonable for dead but terrible for censored.
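To make the idea of being good at one class and poor at another concrete, here is a plain-Python sketch of precision and recall computed per class. The labels are invented for illustration and are not the results from this post. The helper `precision_recall` is also hypothetical, although the metrics module in scikit-learn provides functions for the same quantities.

```python
def precision_recall(y_true, y_pred, label):
    """Precision and recall for one class, computed from paired label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)  # correctly predicted as label
    fp = sum(t != label and p == label for t, p in pairs)  # wrongly predicted as label
    fn = sum(t == label and p != label for t, p in pairs)  # missed cases of label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels, not the post's results.
y_true = ["dead", "dead", "dead", "censored", "censored", "dead"]
y_pred = ["dead", "dead", "dead", "dead", "censored", "dead"]

print(precision_recall(y_true, y_pred, "dead"))
print(precision_recall(y_true, y_pred, "censored"))
```

In this made-up case every dead patient is found, while half of the censored patients are missed, which is the same kind of imbalance described above.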
We will now make a second model for the purpose of comparison
Model Comparison
We will now make a different model for the purpose of comparison. In this model, we will use out of bag samples to determine accuracy, set the minimum split size at 5 examples, and require that each leaf has at least 2 examples. Below is the code and the output.
There was some improvement in classifying people who were censored as well as those who were dead.
Variable Importance
We will now look at which variables were most important in classifying our examples. Below is the code
model_ranks=pd.Series(clf.feature_importances_,index=x_train.columns,name="Importance").sort_values(ascending=True,inplace=False)
ax=model_ranks.plot(kind='barh')
We create an object called model_ranks and we indicate the following.
- Classify the features by importance
- Set index to the columns in the training dataset of x
- Sort the features by importance in ascending order, so that the most important feature appears at the top of the horizontal bar plot
- Make a barplot
Below is the output
You can see that time is the strongest classifier. How long someone has cancer is the strongest predictor of whether they are censored or dead. Next is the number of calories per meal followed by weight loss and age.
Conclusion
Here we learned how to use random forest in Python. This is another tool commonly used in the context of machine learning.

Decision Trees in Python
Decision trees are used in machine learning. They are easy to understand and are able to deal with data that is less than ideal. In addition, because of the pictorial nature of the results decision trees are easy for people to interpret. We are going to use the ‘cancer’ dataset to predict mortality based on several independent variables.
We will follow the steps below for our decision tree analysis
- Data preparation
- Model development
- Model evaluation
Data Preparation
We need to load the following modules in order to complete this analysis.
import pandas as pd
import statsmodels.api as sm
import sklearn
from pydataset import data
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import tree
import matplotlib.pyplot as plt
from io import StringIO  # sklearn.externals.six is no longer available in newer scikit-learn releases
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
The ‘cancer’ dataset comes from the ‘pydataset’ module. You can learn more about the dataset by typing the following
data('cancer', show_doc=True)
This provides all you need to know about our dataset in terms of what each variable is measuring. We need to load our data as ‘df’. In addition, we need to remove rows with missing values and this is done below.
df = data('cancer')
len(df)
Out[58]: 228
df=df.dropna()
len(df)
Out[59]: 167
The initial number of rows in the data set was 228. After removing missing data it dropped to 167. We now need to set up one object with the independent variables and a second object with the dependent variable. While doing this, we need to recode our dependent variable “status” so that the numerical values are replaced with a string. This will help us to interpret our decision tree later. Below is the code.
X=df[['time','age',"sex","ph.ecog",'ph.karno','pat.karno','meal.cal','wt.loss']]
df['status']=df.status.replace(1,'censored')
df['status']=df.status.replace(2,'dead')
y=df['status']
Next, we need to make our train and test sets using the train_test_split function. We want a 70/30 split. The code is below.
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
We are now ready to develop our model.
Model Development
The code for the model is below
clf=tree.DecisionTreeClassifier(min_samples_split=10)
clf=clf.fit(x_train,y_train)
We first make an object called “clf” which calls the DecisionTreeClassifier. Inside the parentheses, we tell Python that a node must contain at least 10 examples before it can be split. The second “clf” object uses the .fit function and calls the training datasets.
We can also make a visual of our decision tree.
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data, filled=True, rounded=True,feature_names=list(x_train.columns.values), special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
Image(graph.create_png())
If we interpret the nodes furthest to the left we get the following
- If a person has had cancer less than 171 days and
- If the person is less than 74.5 years old then
- The person is dead
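Reading a branch of a tree amounts to following a chain of if conditions. The sketch below writes the branch described above as a small function. The function name `leftmost_rule` is made up, the thresholds are taken from the interpretation above, and the rest of the tree is not reproduced.

```python
def leftmost_rule(time_days, age):
    """Return the leaf label if a patient follows the left-most branch, else None."""
    if time_days < 171 and age < 74.5:
        return "dead"
    return None  # other branches of the tree are not reproduced here

print(leftmost_rule(time_days=120, age=60))
print(leftmost_rule(time_days=400, age=60))
```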
If you look closely, every node is classified as ‘dead’, which may indicate a problem with our model. The evaluation metrics are below.
Model Evaluation
We will use the .crosstab function and the metrics classification functions
You can see that the metrics are not that great in general. This may be why everything was classified as ‘dead’. Another reason is that few people were classified as ‘censored’ in the dataset.
Conclusion
Decision trees are another machine learning tool. Python allows you to develop trees rather quickly that can provide insights into how to take action.

Principal Component Analysis in Python
Principal component analysis is a form of dimension reduction commonly used in statistics. By dimension reduction, it is meant to reduce the number of variables without losing too much overall information. This has the practical application of speeding up computational times if you want to run other forms of analysis such as regression but with fewer variables.
Another application of principal component analysis is for data visualization. Sometimes, you may want to reduce many variables to two in order to see subgroups in the data.
Keep in mind that in either situation PCA works better when there are high correlations among the variables. The explanation is complex but has to do with the rotation of the data which helps to separate the overlapping variance.
Prepare the Data
We will be using the pneumon dataset from the pydataset module. We want to try and explain the variance with fewer variables than in the dataset. Below is some initial code.
import pandas as pd
from sklearn.decomposition import PCA
from pydataset import data
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
Next, we will set up our dataframe. We will only take the first 199 examples from the dataset. If we take all (over 3000 examples), the visualization will be a giant blob of dots that cannot be interpreted. We will also drop any missing values. Below is the code.
df = data('pneumon')
df=df.dropna()
df=df.iloc[0:199,]
When doing a PCA, it is important to scale the data because PCA is sensitive to this. The result of the scaling process is an array, which loses the column names. PCA will accept an array, but a dataframe with named columns is easier to work with, so we convert the array back to a dataframe and rename the columns. All this is done in the code below.
scaler = StandardScaler() #instance
df_scaled = scaler.fit_transform(df) #scaled the data
df_scaled= pd.DataFrame(df_scaled) #made the dataframe
df_scaled=df_scaled.rename(index=str, columns={0: "chldage", 1: "hospital",2:"mthage",3:"urban",4:"alcohol",5:"smoke",6:"region",7:"poverty",8:"bweight",9:"race",10:"education",11:"nsibs",12:"wmonth",13:"sfmonth",14:"agepn"}) # renamed columns
Analysis
We are now ready to do our analysis. We first use the PCA function to indicate how many components we want. For our first example, we will have two components. Next, you use the .fit_transform function to fit the model. Below is the code.
pca_2c=PCA(n_components=2)
X_pca_2c=pca_2c.fit_transform(df_scaled)
Now we can see the variance explained by each component and the sum.
pca_2c.explained_variance_ratio_
Out[199]: array([0.18201588, 0.12022734])
pca_2c.explained_variance_ratio_.sum()
Out[200]: 0.30224321247148167
In the first line of code, we can see that the first component explained 18% of the variance and the second explained 12%. This leads to a total of about 30%. Below is a visual of our two-component model; the color represents the race of the respondent. The three different colors represent three different races.
Our two components do a reasonable job of separating the data. Below is the code for making four components. We cannot graph four components since our graph can only handle two but you will see that as we increase the components we also increase the variance explained.
pca_4c=PCA(n_components=4)
X_pca_4c=pca_4c.fit_transform(df_scaled)
pca_4c.explained_variance_ratio_
Out[209]: array([0.18201588, 0.12022734, 0.09290502, 0.08945079])
pca_4c.explained_variance_ratio_.sum()
Out[210]: 0.4845990164486457
With four components we now have almost 50% of the variance explained.
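The totals above are running sums of the individual ratios. As a quick check, this sketch adds up the four ratios reported in the output one component at a time. Only the variable names are new.

```python
from itertools import accumulate

# Explained-variance ratios reported in the post for the first four components.
ratios = [0.18201588, 0.12022734, 0.09290502, 0.08945079]

# Running total after 1, 2, 3 and 4 components.
cumulative = list(accumulate(ratios))
for n, total in enumerate(cumulative, start=1):
    print(f"{n} component(s): {total:.1%} of the variance")
```

Each extra component adds less than the one before it, which is typical because PCA orders components by the variance they explain.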
Conclusion
PCA is for summarising and reducing the number of variables used in an analysis or for the purposes of data visualization. Once this process is complete you can use the results to do further analysis if you desire.

Data Exploration with Python
In this post, we will explore a dataset using Python. The dataset we will use is the Ghouls, Goblins, and Ghost (GGG) dataset available at the Kaggle website. The analysis will not be anything complex; we will simply do the following.
- Data preparation
- Data visualization
- Descriptive statistics
- Regression analysis
Data Preparation
The GGG dataset is fictitious data on the characteristics of spirits. Below are the modules we will use for our analysis.
import pandas as pd
import statsmodels.regression.linear_model as sm
import numpy as np
Once you download the dataset to your computer you need to load it into Python using the pd.read_csv function. Below is the code.
df=pd.read_csv('FILE LOCATION HERE')
We store the data as “df” in the example above. Next, we will take a peek at the first few rows of data to see what we are working with.
Printing the first five rows shows what the data looks like. It appears the first five columns are continuous data and the last two columns are categorical. The ‘id’ variable is useless for our purposes so we will remove it with the code below.
df=df.drop(['id'],axis=1)
The code above uses the drop function to remove the variable ‘id’. This is all saved into the object ‘df’. In other words, we wrote over our original ‘df’.
Data Visualization
We will start with our categorical variables for the data visualization. Below is a table and a graph of the ‘color’ and ‘type’ variables.
First, we make an object called ‘spirits’ using the groupby function to organize the table by the ‘type’ variable.
Below we make a graph of the data above using the .plot function. A professional wouldn’t make this plot but we are just practicing how to code.
We now know how many ghosts, goblins, and ghouls there are in the dataset. We will now do a breakdown of ‘type’ by ‘color’ using the .crosstab function from pandas.
We will now make bar graphs of both of the categorical variables using the .plot function.
We will now turn our attention to the continuous variables. We will simply make histograms and calculate the correlation between them. First the histograms
The code simply subsets the variable you want in the brackets and then calls .plot.hist() to access the histogram function. It appears that all of our data is normally distributed. Now for the correlation.
Using the .corr() function has shown that there are no high correlations among the continuous variables. We will now do an analysis in which we combine the continuous and categorical variables through making boxplots.
The code is repetitive. We use the .boxplot() function and tell Python the column which is continuous and the ‘by’ which is the categorical variable.
Descriptive Stats
We are simply going to calculate the mean and standard deviation of the continuous variables.
df["bone_length"].mean()
Out[65]: 0.43415996604821117
np.std(df["bone_length"])
Out[66]: 0.13265391313941383
df["hair_length"].mean()
Out[67]: 0.5291143100058727
np.std(df["hair_length"])
Out[68]: 0.16967268504935665
df["has_soul"].mean()
Out[69]: 0.47139203219259107
np.std(df["has_soul"])
Out[70]: 0.17589180837106724
The mean is calculated with the .mean() function. Standard deviation is calculated using the .std() function from the numpy package.
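One detail worth knowing is that numpy's std divides by n by default (the population formula), while the pandas .std() method divides by n - 1 (the sample formula), so the two give slightly different answers on the same data. The sketch below shows the difference using the standard library's statistics module on made-up numbers.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data

population_sd = statistics.pstdev(values)  # divides by n, like numpy's default
sample_sd = statistics.stdev(values)       # divides by n - 1, like pandas' default

print(population_sd)
print(sample_sd)
```

With only a few rows the gap is noticeable, while with hundreds of rows it becomes very small.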
Multiple Regression
Our final trick is we want to explain the variable “has_soul” using the other continuous variables that are available. Below is the code
X = df[["bone_length", "rotting_flesh","hair_length"]]
y = df["has_soul"]
model = sm.OLS(y, X).fit()
In the code above we create two new objects. X contains our independent variables and y contains the dependent variable. Then we create an object called model and use the OLS() function. We place the y and X inside the parentheses and we then use the .fit() function as well. Below is the summary of the analysis.
There is obviously a lot of information in the output. The r-square is 0.91 which is surprisingly high given that there were not high correlations in the matrix. The coefficients for the three independent variables are listed and all are significant. The AIC and BIC are for model comparison and do not mean much in isolation. The JB stat indicates that our distribution is not normal. The Durbin-Watson test indicates negative autocorrelation, which is important in time-series analysis.
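One likely reason for the surprisingly high r-square is that OLS() in statsmodels does not add an intercept on its own. Without a constant column (for example one added with the add_constant helper in statsmodels), the model is forced through the origin and the r-square is computed without centering y, which can inflate it a great deal. The plain-Python sketch below illustrates the effect with one predictor and made-up numbers. The function names are hypothetical and the data is not from the GGG dataset.

```python
def r2_through_origin(x, y):
    """Uncentered R-squared for the model y = b*x (no intercept)."""
    b = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
    ss_res = sum((yi - b * xi) ** 2 for xi, yi in zip(x, y))
    return 1 - ss_res / sum(yi * yi for yi in y)

def r2_with_intercept(x, y):
    """Usual centered R-squared for the model y = a + b*x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values on a 0-1 scale that are only weakly related.
x = [0.3, 0.4, 0.5, 0.6, 0.7]
y = [0.5, 0.4, 0.6, 0.45, 0.55]

print(round(r2_through_origin(x, y), 2))
print(round(r2_with_intercept(x, y), 2))
```

If the same thing is happening in the model above, refitting with a constant column would give an r-square that is more in line with the modest correlations seen earlier.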
Conclusion
Data exploration can be an insightful experience. Using Python, we found many different patterns and ways to describe the data.

Working with a Dataframe in Python
In this post, we will learn to do some basic exploration of a dataframe in Python. Some of the tasks we will complete include the following…
- Import data
- Examine data
- Work with strings
- Calculating descriptive statistics
Import Data
First, you need data, therefore, we will use the Titanic dataset, which is readily available on the internet. We will need to use the pd.read_csv() function from the pandas package. This means that we must also import pandas. Below is the code.
import pandas as pd
df=pd.read_csv('FILE LOCATION HERE')
In the code above we imported pandas as pd so we can use the functions within it. Next, we create an object called ‘df’. Inside this object, we used the pd.read_csv() function to read our file into the system. The location of the file needs to be typed in quotes inside the parentheses. Having completed this we can now examine the data.
Data Examination
Now we want to get an idea of the size of our dataset and any problems with missing data. To determine the size we use the .shape function as shown below.
df.shape
Out[33]: (891, 12)
Results indicate that we have 891 rows and 12 columns/variables. You can view the whole dataset by typing the name of the dataframe “df” and pressing enter. If you do this you may notice there are a lot of NaN values in the “Cabin” variable. To determine exactly how many, we can use .isnull() in combination with .value_counts().
df['Cabin'].isnull().value_counts()
Out[36]:
True 687
False 204
Name: Cabin, dtype: int64
The code starts with the name of the dataframe. In the brackets, you put the name of the variable. After that, you put the functions you are using. Keep in mind that the order of the functions matters. You can see we have 687 missing examples and only 204 with a value. For a categorical variable, you can also see how many examples are part of each category as shown below.
df['Embarked'].value_counts()
Out[39]:
S 644
C 168
Q 77
Name: Embarked, dtype: int64
This time we used our ‘Embarked’ variable. However, we need to address missing values before we can continue. To deal with this we will use the .dropna() function on the dataset. Then we will check the size of the dataframe again with the “shape” function.
df=df.dropna(how='any')
df.shape
Out[40]: (183, 12)
You can see our dataframe is much smaller, going from 891 examples to 183. We can now move to other operations such as dealing with strings.
Working with Strings
What you do with strings really depends on your goals. We are going to look at extraction, subsetting, and determining the length. Our first step will be to extract the last name of the first five people. We will do this with the code below.
df['Name'][0:5].str.extract('(\w+)')
Out[44]:
1 Cumings
3 Futrelle
6 McCarthy
10 Sandstrom
11 Bonnell
Name: Name, dtype: object
As you can see we got the last names of the first five examples. We did this by using the following format…
dataframe name[‘Variable Name’].function.function(‘whole word’)
.str is an accessor for dealing with strings in dataframes. The .extract() function does what its name implies: it pulls out the part of each string that matches the pattern. Here (\w+) matches the first run of letters, which is the last name because each name starts with it.
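The same pattern can be tried outside of pandas with the re module. In the sketch below the helper `first_word` is made up for illustration, and the two names are written in the "Last, Title. First" layout that the Name column uses.

```python
import re

# Sample names in the "Last, Title. First" layout used by the Name column.
names = [
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
]

def first_word(text):
    """Return the first run of word characters, or None if there is none."""
    match = re.search(r"(\w+)", text)
    return match.group(1) if match else None

print([first_word(name) for name in names])
```

Because \w+ stops at the first character that is not a letter, digit, or underscore, a surname containing an apostrophe or hyphen would be cut short by this pattern.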
If you want, you can even determine how many letters each name is. We will do this with the .str and .len() function on the first five names in the dataframe.
df['Name'][0:5].str.len()
Out[64]:
1 51
3 44
6 23
10 31
11 24
Name: Name, dtype: int64
Hopefully, the code is becoming easier to read and understand.
Aggregation
We can also calculate some descriptive statistics. We will do this for the “Fare” variable. The code is repetitive in that only the function changes so we will run all of them at once. Below we are calculating the mean, max, minimum, and standard deviation for the price of a fare on the Titanic
df['Fare'].mean()
Out[77]: 78.68246885245901
df['Fare'].max()
Out[78]: 512.32920000000001
df['Fare'].min()
Out[79]: 0.0
df['Fare'].std()
Out[80]: 76.34784270040574
Conclusion
This post provided you with some ways in which you can maneuver around a dataframe in Python.
If Statements in Python VIDEO
Using if statements in Python

Numpy Arrays in Python
In this post, we are going to explore arrays created by the numpy package in Python. Understanding how arrays are created and manipulated is useful when you need to perform complex coding and/or analysis. In particular, we will address the following:
- Creating and exploring arrays
- Math with arrays
- Manipulating arrays
Creating and Exploring an Array
Creating an array is simple. You need to import the numpy package and then use the np.array function to create the array. Below is the code.
import numpy as np
example=np.array([[1,2,3,4,5],[6,7,8,9,10]])
Making an array requires the use of square brackets. If you want multiple dimensions or columns then you must use inner square brackets. In the example above I made an array with two dimensions and each dimension has its own set of brackets.
Also, notice that we imported numpy as np. This is a shorthand so that we do not have to type the word numpy but only np. In addition, we now created an array with ten data points spread in two dimensions.
There are several functions you can use to get an idea of the size of a data set. Below is a list with the function and explanation.
- .ndim = number of dimensions
- .shape = Shows the number of rows and columns
- .size = Counts the number of individual data points
- .dtype.name = Tells you the data structure type
Below is code that uses all four of these functions with our array.
example.ndim
Out[78]: 2
example.shape
Out[79]: (2, 5)
example.size
Out[80]: 10
example.dtype.name
Out[81]: 'int64'
You can see we have 2 dimensions. The .shape function tells us we have 2 dimensions and 5 examples in each one. The .size function tells us we have 10 total examples (5 * 2). Lastly, the .dtype.name function tells us that this is an integer data type.
Math with Arrays
All mathematical operations can be performed on arrays. Below are examples of addition, subtraction, multiplication, and conditionals.
example=np.array([[1,2,3,4,5],[6,7,8,9,10]])
example+2
Out[83]:
array([[ 3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12]])
example-2
Out[84]:
array([[-1,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  8]])
example*2
Out[85]:
array([[ 2,  4,  6,  8, 10],
       [12, 14, 16, 18, 20]])
example<3
Out[86]:
array([[ True,  True, False, False, False],
       [False, False, False, False, False]], dtype=bool)
Each number inside the example array was manipulated as indicated. For example, if we typed example + 2 all the values in the array increased by 2. Lastly, the example < 3 tells python to look inside the array and find all the values in the array that are less than 3.
Manipulating Arrays
There are also several ways you can manipulate or access data inside an array. For example, you can pull a particular element in an array by doing the following.
example[0,0] Out[92]: 1
The information in the brackets tells Python to access the first row (the first inner bracket) and the first number in that row. Recall that Python starts counting from 0. You can also access a range of values using the colon as shown below.
example=np.array([[1,2,3,4,5],[6,7,8,9,10]])
example[:,2:4]
Out[96]:
array([[3, 4],
       [8, 9]])
In this example, the colon before the comma means take all of the rows, which here is both rows. After the comma we have 2:4, which means take the 3rd and 4th values in each row but not the 5th.
It is also possible to turn a multidimensional array into a single dimension with the .ravel() function and also to transpose with the .transpose() function. Below is the code for each.
example=np.array([[1,2,3,4,5],[6,7,8,9,10]])
example.ravel()
Out[97]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])
example.transpose()
Out[98]:
array([[ 1,  6],
       [ 2,  7],
       [ 3,  8],
       [ 4,  9],
       [ 5, 10]])
You can see the .ravel function made a one-dimensional array. The .transpose function swapped the rows and columns, so the array now has five rows with two numbers each.
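One way to check what these two functions did is to compare the shapes before and after. This is a small sketch using the same "example" array.

```python
import numpy as np

example = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

print(example.shape)              # original: 2 rows, 5 columns
print(example.ravel().shape)      # flattened: a single dimension of 10 values
print(example.transpose().shape)  # transposed: 5 rows, 2 columns
```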
Conclusion
We now have a basic understanding of how numpy arrays work using Python. As mentioned before, this is valuable information to understand when wrestling with different data science questions.

Lists in Python
Lists allow you to organize information. In the real world, we make lists all the time to keep track of things. This same concept applies in Python when making a list. A list is a sequence of stored data. By sequence, we mean a data structure that allows multiple items to exist in a single storage unit. By making a list we are explaining to the computer how to store the data in the computer’s memory.
In this post, we learn the following about lists
- How to make a list
- Accessing items in a list
- Looping through a list
- Modifying a list
Making a List
Making a list is not difficult at all. To make one you first create a variable name followed by the equal sign and then place your content inside square brackets. Below is an example of two different lists.
numList=[1,2,3,4,5]
alphaList=['a','b','c','d','e']
print(numList,alphaList)
[1, 2, 3, 4, 5] ['a', 'b', 'c', 'd', 'e']
Above we made two lists, a numeric and a character list. We then printed both. In general, you want your list to have similar items such as all numbers or all characters. This makes it easier to recall what is in them than if you mixed them. However, Python can handle mixed lists as well.
Access a List
Accessing individual items in a list works the same way as for a string. Just employ brackets with the index that you want. Below are some examples.
numList=[1,2,3,4,5]
alphaList=['a','b','c','d','e']
numList[0]
Out[255]: 1
numList[0:3]
Out[256]: [1, 2, 3]
alphaList[0]
Out[257]: 'a'
alphaList[0:3]
Out[258]: ['a', 'b', 'c']
numList[0] gives us the first value in the list. numList[0:3] gives us the first three values. This is repeated with the alphaList as well.
Looping through a List
A list can be looped through as well. Below is a simple example.
for item in numList :
    print(item)
for item in alphaList :
    print(item)
1
2
3
4
5
a
b
c
d
e
By making the two for loops above we are able to print all of the items inside each list.
Modifying a List
There are several functions for modifying lists. Below are a few.
The append() function adds a new item to the end of the list.
numList.append(9)
print(numList)
alphaList.append('h')
print(alphaList)
[1, 2, 3, 4, 5, 9]
['a', 'b', 'c', 'd', 'e', 'h']
You can see our lists now have one new member each at the end.
You can also remove the last member of a list with the pop() function.
numList.pop()
print(numList)
alphaList.pop()
print(alphaList)
[1, 2, 3, 4, 5]
['a', 'b', 'c', 'd', 'e']
By using the pop() function we have returned our lists back to their original size.
Another trick is to merge lists together with the extend() function. For this, we will merge each list with itself. This will cause the list to have duplicates of all of its original values.
numList.extend(numList)
print(numList)
alphaList.extend(alphaList)
print(alphaList)
[1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e']
All the values in each list have been duplicated. Finally, you can sort a list using the sort() function.
numList.sort()
print(numList)
alphaList.sort()
print(alphaList)
[1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd', 'e', 'e']
Now all the numbers and letters are sorted.
Conclusion
There is much more that could be done with lists. However, the purpose here was just to cover some of the basic ways that lists can be used in Python.

Z-Scores and Inferential Stats in Python
In this post, we will look at some ways to calculate some inferential statistics in Python. We are mainly going to focus on z-scores and one/two-tailed tests.
We will begin by importing some needed packages and then we will make some data and plot it. Below is the code and plot
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats
quizscore=np.random.normal(69,9,60)
plt.hist(quizscore,density=True)
We create an object called “quizscore” which contains, as you can tell, quiz scores. In line four in the code above, we used the .random.normal function from the numpy package to create some random data. This function has 3 arguments in it: the first is the mean of the distribution, next is the standard deviation, and the last is the sample size. After this is the code for the histogram which uses the .hist function from the matplotlib.pyplot package.
Z-Scores
Z-Scores are a measure of how far a data point is from the mean. In the code below, we calculate the first five z-scores of our “quizscore” dataset.
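stats.zscore(quizscore)[0:5,]
Out[127]: array([ 0.54362001, 1.56135871, -0.36924596, 0.53182556, -0.06014972])
In the output, you can see that the first number in the dataset is 0.54 standard deviations above the mean. Remember, our mean was set to 69 and our standard deviation to 9, so this means the score for the first quiz was roughly 69 + (0.54 * 9), or about 74. We check this with the code below. The match will not be exact because the z-scores are calculated from the mean and standard deviation of the 60 simulated scores, which differ slightly from 69 and 9.
quizscore[0,]
Out[129]: 72.851341820695538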
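The calculation behind a z-score can be written out by hand as (score - mean) / standard deviation. The sketch below compares that with stats.zscore. It assumes a seeded random generator (np.random.default_rng(0)) so the numbers are repeatable, which means they will not match the output shown in this post.

```python
import numpy as np
import scipy.stats as stats

# Seeded generator so the example is repeatable (an assumption, not the post's data)
rng = np.random.default_rng(0)
quizscore = rng.normal(69, 9, 60)

# z-score by hand: distance from the sample mean in units of the sample std
by_hand = (quizscore - quizscore.mean()) / quizscore.std()
print(np.allclose(by_hand, stats.zscore(quizscore)))

# Going the other way: z * std + mean recovers the original score
recovered = by_hand[0] * quizscore.std() + quizscore.mean()
print(recovered, quizscore[0])
```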
Fairly close. Another question we can answer is: what is the probability that someone got a score that was 1.5 standard deviations or more above the mean on the quiz (a score of about 82)? Below is the code followed by the answer.
1-stats.norm.cdf(1.5)
Out[132]: 0.066807201268858085
In the code above we subtract the result of the .cdf function from 1. Inside the function, we put our z-score of 1.5. The answer is 0.067, which means about 6.7% of the scores have a Z-score of 1.5 or higher, or a quiz score of around 82 or higher.
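As a side note, scipy also provides a survival function, stats.norm.sf, which returns the upper-tail probability directly. The sketch below shows that it agrees with one minus the .cdf result; the variable names are only for illustration.

```python
import scipy.stats as stats

# Upper-tail probability two ways: 1 - cdf, and the survival function
upper_cdf = 1 - stats.norm.cdf(1.5)
upper_sf = stats.norm.sf(1.5)
print(upper_cdf, upper_sf)
```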
We can also determine what is the cutoff point for a certain high score. For us, we want to know what the cutoff for the top 15% of quiz scores. Below is the code.
stats.norm.ppf(0.85)
Out[136]: 1.0364333894937898
(1.03 * quizscore.std()) +quizscore.mean()
Out[137]: 77.748927759179054
In the code above, first we had to convert the percentage to a z-score and this is done with the .ppf function. The 85th percentile is equivalent to a Z-score of about 1.03. Next, we multiplied the z-score by the standard deviation of the quizscore and added the mean of the quizscore to this to get our final answer of 77.74. This means that a score of 77.74 and above is within the top 15%.
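The two steps above can also be done in one call, because .ppf accepts loc (mean) and scale (standard deviation) arguments. This is a sketch using a seeded random generator, so the cutoff it prints will differ from the 77.74 shown in this post.

```python
import numpy as np
import scipy.stats as stats

# Seeded data so the example is repeatable (an assumption, not the post's data)
rng = np.random.default_rng(0)
quizscore = rng.normal(69, 9, 60)

# Two steps: percentile -> z-score -> quiz score
z = stats.norm.ppf(0.85)
by_steps = z * quizscore.std() + quizscore.mean()

# One step: give .ppf the mean and standard deviation directly
direct = stats.norm.ppf(0.85, loc=quizscore.mean(), scale=quizscore.std())
print(by_steps, direct)
```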
One & Two-Tailed Test
A one-tail test is used to compare the sample mean to a population mean. Another way to look at this is to convert a z-score to a p-value. What is happening here is you create a region in your distribution in which scores are considered unusual. For a one-tail test, this is the top 5%. For example, let’s say you want to know the probability that someone got a score of 92 or higher on the quiz. This can be solved with the following code.
quizZ=(92-quizscore.mean())/quizscore.std()
1-stats.norm.cdf(quizZ)
Out[138]: 0.0072370644085374414
Here is what we did
- We create an object called “quizZ” which took our specific score of 92, subtracted the mean of “quizscore”, and divided it by the standard deviation of “quizscore”. This becomes the z-score for the value 92.
- We then pass this z-score to the .cdf function and subtract the result from one
- The output indicates that there is less than a 1% chance that someone got a score of 92 or higher.
In this example our sample was small (only one score) but the concept remains the same.
A two-tailed test is the same concept except that the rejection regions are on both sides of the distribution, with the 5% divided in half. This means that the top 2.5% and bottom 2.5% are now considered unusual scores. As an example, suppose the average score on the quiz is 75 with a standard deviation of 5 and we want to see if our class mean of 67 is different. We can use the following code to answer this.
two_tail=(quizscore.mean()-75)/5
stats.norm.cdf(two_tail)
Out[160]: 0.063688920253590395
You can see that the probability for our class is not unusual. At about 0.064 it is above the 2.5% cutoff for the lower tail (doubling it gives a two-tailed p-value of about 0.13, which is above 5%), indicating no difference. This means that if 75 is the center of our distribution the quiz score of 67 would be within 2 standard deviations of the mean.
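To make the two-tailed logic explicit, the sketch below uses the round numbers from the text (a class mean of 67, a population mean of 75, and a standard deviation of 5) instead of the simulated data, so its output differs slightly from the result above. The variable names are only for illustration.

```python
import scipy.stats as stats

class_mean = 67
population_mean = 75
population_sd = 5

# z-score of the class mean relative to the population
z = (class_mean - population_mean) / population_sd

# Probability in the lower tail only
lower_tail = stats.norm.cdf(z)

# Two-tailed p-value: count both tails by doubling the tail beyond |z|
two_tailed_p = 2 * stats.norm.cdf(-abs(z))
print(z, lower_tail, two_tailed_p)
```

Either comparison gives the same decision: the lower tail against 2.5%, or the doubled value against 5%.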
Conclusion
In this post, we explored some of the ways you can do inferential statistics with Python. Whatever you want to know can be calculated by knowing a few lines of code.
Making Functions in Python VIDEO
Making functions in Python

Working with Strings in Python
It is somewhat difficult to define a string. In some ways, a string is text such as what is found in a book or even in this blog. Strings are made up of characters such as letters and even symbols such as %#&@**. However, the computer does not use these characters but converts them to numbers for processing.
In this post, we will learn some of the basics of working with strings in Python. This post could go on forever in terms of what you can do with text so we will only address the following concepts
- Finding individual characters
- Modifying text
Finding Individual Characters
Finding individual characters in a string is simple. You simply need to type the variable name and after the name use brackets and put the number of the location of the character in the string. Below is an example
example="Educational Research Techniques"
print(example[0])
E
As you can see, we created a variable called “example”. The content of “example” is the string “Educational Research Techniques”. To access the first letter we use the “print” function, type the name of the variable, and inside brackets put the number 0 for the first position in the string. This gives us the letter E. Remember that Python starts from the number 0 and not 1.
If you want to start from the end of the string and go backward you simply use negative numbers instead of positive as in the example below.
example="Educational Research Techniques"
print(example[-1])
s
You can also get a range of characters.
example="Educational Research Techniques"
print(example[0:5])
Educa
The code above is telling Python to take from position 0 to 4 but not including 5.
You can do even more creative things such as pick characters from different spots to create new words
example="Educational Research Techniques"
print(example[0:2] + example[20:25])
Ed Tech
Modifying Strings
It is also possible to modify the text in the string such as making all letters upper or lower case as shown below.
example.upper()
'EDUCATIONAL RESEARCH TECHNIQUES'
example.lower()
'educational research techniques'
The upper() function capitalizes every letter and the lower() function makes every letter lower case.
It is also possible to count the length of a string and swap the case with the len() and swapcase() functions respectively.
len(example)
31
example.swapcase()
'eDUCATIONAL rESEARCH tECHNIQUES'
Python allows you to count how many times a character appears using the count() function. You can also use the replace() function to find characters and change them.
example.count('e')
4
example.replace('e','q')
'Educational Rqsqarch Tqchniquqs'
As you can see, the lower case letter ‘e’ appears four times. In the second line, python replaces all the lower case ‘e’s with the letter ‘q’.
Conclusion
Of course, there is so much more that we could do with strings. However, this post provides an introduction to what is possible.

while statements and Nested for loops in Python
When you are unsure how much data your application may need to process it is probably appropriate to use a while statement. The while statement will keep processing until its condition is no longer true. If you remember, with a traditional for loop the limit is preset by the data structure you are analyzing.
Nested for loops are another concept we will look at. They are useful when one repeated task has to run inside another.
Since it can go on forever, it is possible with a while statement to create an endless loop. This is a loop that never stops processing. This can freeze your program and should naturally be avoided. To avoid this you need to set the environment, state the while statement, and update the condition of the environment. Below is a simple example of this.
- In line 1, we have the environment for the condition which is a variable called “number” set to 0.
- Line 2 is the while statement which states that as long as “number” is less than 10 do the following.
- In line 3 the variable “number” is printed.
- In line 4, after “number” is printed the number 2 is added to the current value
- This takes place until the variable “number” is equal to ten.
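The code these steps describe does not appear in the text above. A minimal reconstruction based only on that description might look like the following.

```python
number = 0               # line 1: the environment for the condition
while number < 10:       # line 2: keep going as long as number is less than 10
    print(number)        # line 3: print the current value
    number = number + 2  # line 4: update the value so the loop can end
```

Without line 4 the condition would never change and the loop would never stop.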
Below is what the output looks like if you ran this
You can see that we start with 0. This is because we set the variable “number” originally to 0. Then the value increases by 2 each time, just as in line 4 of the code above.
Nested for Loops
Just as with functions you can also have nested for loops. This is a loop within a loop. Below is a simple example.
- Lines 1-2 ask for input and you can type whatever you want
- Lines 3-4 are the first for loop, which processes whatever your input is from line 1. The loop prints each letter of this input
- Lines 5-6 are the second for loop, which simply processes whatever you inputted in line 2. This loop prints the input from line 2.
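The code being described is not shown in the text above. The sketch below follows the description, with two assumptions: fixed strings stand in for the two input() calls so that it runs without typing anything, and a list called "order" is added only to record the order in which the letters are printed.

```python
first = "ab"    # stand-in for the first input() call
second = "xyz"  # stand-in for the second input() call

order = []      # added for illustration: records the printing order
for a in first:
    print(a)
    order.append(a)
    for b in second:
        print(b)
        order.append(b)
```

Each letter of the first string is followed by every letter of the second string.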
Here is what the output would look like
You can see that the loops took turns. The first loop ran its first letter and then the second loop ran everything. Then the first loop ran its second letter and the second loop ran everything again. Therefore, nesting for loops affects the order in which the code is run.
Conclusion
This post looked at the use of while statements and nested for loops. while statements are useful when you do not know how long you may need to process data. Nested for loops allow you to run complex looping in which you are trying to do multiple tasks.

for Loops in Python
For loops are valuable when you need your application to do a repetitive task. Once the task is completed there is some sort of output that is returned. Understanding how to create a for loop is a critical step in utilizing the Python language.
Making for loops
Here is the basic syntax for a for loop
for item in data:
    do something
The word “for” indicates a for loop. The word “item” is an iteration variable. An iteration variable is a variable that changes value each time the loop goes through the data. It takes on the current value that is being analyzed for whatever purpose the loop has. You can name the iteration variable anything you want but a general rule is to use names that make sense for the context. Otherwise, nobody else will be able to understand your code.
After the colon is where you find “do something” here you put the command for whatever the loop is supposed to do. Below is an actual example of the use of the for loop.
Here is what happened
- At the top, we have our for loop. The iterator variable is “letter” and we are looping through the data of the string “education”.
- The next line is the action the for loop will perform. Essentially, the loop will pull each letter from the string “education” and insert them one at a time into the phrase “Give me an”. Notice how the word “letter” is at the end of our print statement. This is the iteration variable that changes each time our for loop goes through the string “education”.
- The output is several print statements each containing a different letter from the string “education”
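The code for this example is not shown in the text above. Based on the description, a minimal reconstruction could be:

```python
# "letter" is the iteration variable; it takes each character of the string in turn
for letter in "education":
    print("Give me an", letter)
```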
for loops with Breaks
Breaks are used to provide conditions in which the loop will stop. In the example below, we add some code to our cheer that allows you to enter your own cheer. However, the cheer must be no more than 10 letters, otherwise you get a message that the word is too long. Below is the code
Here is what it does.
- In line 1, you provide a word as indicated by the instructions in the parentheses.
- Line 2 is the for loop. letter is the iteration variable for our word in “Value”
- Line 3 is the if statement. The string in “Value” is checked to make sure it is 10 characters or less.
- In line 4, if “Value” is greater than 10 characters you get the message that the cheer is too long.
- Line 5 is the break which stops the loop from continuing.
- In line 6, if the word is less than 10 characters you get the cheer with each letter.
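The code itself does not appear in the text above. The sketch below follows the six lines as described, with two assumptions: a fixed string stands in for the input() call, and the wording of the message is invented for illustration.

```python
Value = "python"  # stand-in for input(); try a longer word to trigger the break
for letter in Value:
    if len(Value) > 10:
        print("Your cheer is too long")  # hypothetical message text
        break                            # stop the loop right away
    print("Give me an", letter)
```

With a word longer than 10 characters the message prints once and the loop ends on its first pass.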
Below is the output with less than 10 characters
Here is the output with more than 10 characters
Continue and for loop
The continue clause allows you to check the data and only process it based on certain conditions. In the code below, we are going to change our cheer code so that it removes spaces when making the cheer.
The code is mostly the same with a few exceptions
- The if statement looks for blank spaces and these are left out of the cheer.
- The continue clause tells Python to skip the rest of the current pass and keep going with the next letter
- Finally, the cheer is given
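The code for this version is not shown in the text above. A minimal sketch of the idea, assuming a fixed string in place of input() and an extra variable "kept" that exists only to show which letters were cheered:

```python
Value = "g o   t e a m"  # stand-in for input(), with extra blank spaces
kept = ""                # added for illustration: the letters that were cheered
for letter in Value:
    if letter == " ":
        continue         # skip blank spaces and move to the next letter
    print("Give me an", letter)
    kept = kept + letter
print(kept)
```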
Below is what the output looks like if you ran this code
You can see that I put many blank spaces in-between the letters but these do not appear in the output. This is because of the continue clause.
Conclusion
for loops are a basic yet powerful tool of programming. In Python, for loops are used for the same reason as in other languages and that is for completing repetitive tasks. The examples here provide some simple ways in which this can be done.

elif Clause and Nested Statements in Python
This post will provide a brief introduction into the use of the elif clause and nested statements in Python.
elif Clause
The elif clause is used to add an additional set of conditions to an if statement. If you have ever used some sort of menu on a computer in which you had to make several choices it is possible that an elif clause was involved in the code.
The syntax for the elif clause is the same as for the if statement. Below is a simple example that employs the elif clause.
Here is what this code does.
- In lines 1-3, I print three lines of text at the beginning. These are the choices available to the user.
- In line 4, the “pick” variable stores whatever number the user inputs through the “input” function. It must be an integer which is why I used the “int” function
- In line 5 we begin the if statement. If the “pick” variable is set to 1 you can see the printout in line 6.
- In lines 7 and 8 we use the elif clause. The settings are mostly the same as in the if statement in line 5 except the “pick” variable is set to different numbers with different outputs.
- Lastly, in line 11 we have the else clause. If for any reason the person picks something besides 1,2 or 3 this message will appear.
What this code means in simple English is this
- If the pick is 1 then print “dogs are your favorite animal”
- or else if the pick is 2 then print “cats are your favorite animal”
- or else if the pick is 3 then print “rabbits are your favorite animal”
- else print “I do not understand your choice”
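The plain-English steps above can be sketched in code as follows. The code in the post is not shown in the text, so this is a reconstruction with some assumptions: the menu wording is invented, a fixed number stands in for int(input(...)), and the chosen sentence is stored in a variable called "message" before printing.

```python
print("1. Dogs")     # hypothetical menu text
print("2. Cats")
print("3. Rabbits")
pick = 2             # stand-in for int(input(...))

if pick == 1:
    message = "dogs are your favorite animal"
elif pick == 2:
    message = "cats are your favorite animal"
elif pick == 3:
    message = "rabbits are your favorite animal"
else:
    message = "I do not understand your choice"
print(message)
```

Python checks the conditions from top to bottom and runs only the first branch that is true.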
Here is what the code looks like when you run it
As a note, if you type in a letter or string you will get an error message. This is because our code is not sophisticated enough to deal with non-integers at this point.
Nested Statements
Just as functions can be nested, decision statements can be nested inside each other. To put this simply, the conditions set by the first if statement determine whether the second condition is checked at all. This is explained better with an example.
Here is what is happening in the code above.
- In lines 1 and 2 the user has to pick numbers that are saved in the variables “num1” and “num2.”
- Line 3 and 4 are the if statements. Line 3 and line 9 are the outer if statement and line 4-8 are the inner if statement.
- Line 3 shares that “num1” must be between 1 and 10.
- Line 4 shares that “num2” must be between 1 and 10.
- Line 5 is the results of the inner if statement. The results are printed using the “.format” method where the {0} and {1} are the variables “num1” and “num2”. After the comma is what is done to the variables, they are simply added together.
- Line 8 is the else clause. If someone types something different from a number between 1-10 for the second number they will see this message
- Line 9 is the else clause for the outer if statement. This is only seen if a value different from 1-10 is inputted.
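The code is not shown in the text above, so the sketch below is only an approximation of what is described. Fixed numbers stand in for the two input() calls, the message wording is invented, and the line numbers will not match the original exactly.

```python
num1 = 3  # stand-in for int(input(...))
num2 = 4  # stand-in for int(input(...))

if (num1 >= 1) and (num1 <= 10):      # outer if statement
    if (num2 >= 1) and (num2 <= 10):  # inner if statement
        total = num1 + num2
        print("{0} + {1} =".format(num1, num2), total)
    else:
        print("The second number is out of range")  # hypothetical message
else:
    print("The first number is out of range")       # hypothetical message
```

The inner if statement is only reached when the outer condition is true.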
If you run the code here is what it should look like
Conclusion
The elif clause and nested decision statements are additional tools that can be used to empower your applications. These are some of the most basic ideas in using a language such as Python.

Logical Flow in Python
Applications often have to make decisions and to do this they need a set of conditions to help them decide what to do. Python, like all programming languages, has tools that allow the application to execute various actions based on conditions. In this post, we will look at the use of if statements.
If Statement Defined
An if statement is a statement used in Python that determines when an action should happen. At a minimum, an if statement should have two parts to it. The first part is the condition and the second part is the action that Python performs if the condition is true. Below is what this looks like
if SOMETHING IS TRUE:
    DO THIS
This is not the most beautiful code but this is the minimum for an if statement in Python.
If Statement
Below is the actual application of the use of an if statement
number=5
if number ==5:
    print("Correct")
Correct
In the code above we created a variable called number and set its value to 5. Then we created the if statement by setting the condition as “if number equals 5”. The action for this being true was to print the string “Correct”, which Python does at the bottom.
Notice the double equal sign used in the if statement. The double equal sign is used for relational equality while the single equal sign is used for assigning values to variables. Also, notice the colon at the end of the if statement. This must be there in order to complete the code.
There is no limit to the number of tasks an if statement can perform. In the code above, the if statement only printed “Correct”. However, we can put many more print functions or other actions as shown below.
number=5
if number ==5:
    print("Correct")
    print("Done")
Correct
Done
All that is new above is that we have two print statements that were executed because the if statement was true.
Multiple Comparisons
In the code above Python only had to worry about whether number equaled 5. However, we can have Python make multiple comparisons as well. Below is an example.
number=5
if (number>0) and (number<10):
    print("Correct")
    print("Finished")
Correct
Finished
Now for the print functions to execute the number has to be greater than 0 but less than 10. If you change the number to something less than 0 or greater than 10 nothing will happen when you run the code because the conditions were not met. In addition, if you type in a letter like ‘a’ you will get an error message because Python is anticipating a number and not a string.
If Else Statements
The use of an else clause in an if statement allows for an alternative task to be executed if the first conditions in the if statement are not met. This allows the application to do something when the conditions are not met rather than leaving a blank screen. Below we modify our code in the previous example slightly so that something happens because of the else clause in the if statement.
number=15
if (number>0) and (number<10):
    print("Correct")
    print("Finished")
else:
    print("Number is out of range")
Number is out of range
In this code above we did the following
- We set our number to 15
- We created an if statement that searches for a number greater than 0 and less than 10.
- If the conditions in step 2 are true we print the two statements
- If the conditions in step 2 are not met we print the statement after the else clause
- Our number was 15 so Python printed the statement after the else clause
It can get much more complicated and powerful than this but I think this is clear and enough for now.
Conclusion
This post provided an introduction to the if statement in Python. The if statement will execute a command based on one or more conditions. You must be careful with the syntax. Lastly, you can also include an alternative task with the else clause if the conditions are not met.

Making Functions in Python
Efficiency is important when writing code. Why would you type something a second time if you do not have to? One way coders save time, reduce error, and make their code easier to read is through the use of functions.
Functions are a way of storing code so that it can be reused without having to retype or code it again. By saving a piece of reusable code as a function you only have to call the function in order to use it again. This improves efficiency and reliability of your code. Functions simply take an input, rearrange the input however it is supposed to, and provide an output. In this post, we will look at how to make a function using Python.
Simple Function
A function requires the following minimum information
- A name
- Any requirements (arguments) that are needed
- The actual action the function is supposed to do to the input
We will now make a simple function as shown in the code below.
def example():
    print("My first function")
In the code above we have the three pieces of information.
- The name of the function is “example” and you set the name using the “def” command.
- There are no requirements or arguments in this function. This is why the parentheses are empty. This will make more sense later.
- What this function does is use the “print” function to print the string “My first function”
We will now use the function by calling it in the example below.
example()
My first function
As you can see, when we call the function it simply prints the string. This function is not that impressive but it shows you how functions work.
Functions with Arguments
Arguments are found within the parentheses of a function. They are placeholders for information that you must supply in order for the function to work. Below is an example.
def example(info):
    print(info)
Now our “example” function has a required argument called “info”. We must always put something in place of this for the function to run. Below is an example of us calling the “example” function with a string in place of the argument “info”.
example("My second function")
My second function
You can see that the function simply printed what we placed in the parentheses. If we had left the parentheses empty we would have gotten an error message. You can try that yourself.
You can assign a default value to your argument. This is useful if people do not provide their own value. Below we create the same function but with a default value for the argument.
def example(info="You forgot to give a value"):
    print(info)
We will now call it but we will not include the argument
example()
You forgot to give a value
return and print
When creating functions, it is common to have to decide whether to use “return” or “print”. Below are some guidelines
- Print is for people. If a person only needs to see the output without any other execution then print is a good choice.
- Return is appropriate when sending the data back to the caller for additional execution. For example, using one function before using a second function.
If you take any of the examples and use “return” instead of “print” they will still run, but the value is handed back to the caller instead of being displayed, so the choice between “return” and “print” depends on the ultimate purpose of the application.
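The difference is easier to see side by side. In the sketch below the function names "show" and "give_back" are made up for illustration.

```python
def show(info):
    print(info)   # displays the value for a person to read

def give_back(info):
    return info   # hands the value back to the caller

shown = show("hello")          # prints hello, but nothing is handed back
returned = give_back("hello")  # prints nothing, but the value is stored

print(shown)             # a function without return gives back None
print(returned.upper())  # the returned value can be used in further code
```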
Conclusion
Functions play a critical role in making useful applications. This is due to their ability to save time for coders. There are several concepts to keep in mind when developing functions. Understanding these ideas is important for future success.

Common Data Types in Python
All programming languages have a way of storing certain types of information in variables. Certain data types are needed for one situation and different data types for another. It is important to know the differences in the data types otherwise serious problems could arise when developing an application. In this post, we will look at some of the more commonly used data types in Python.
Making Variables
It is first important to understand how to make a variable in Python. It is not that complicated. The format is the following
variable name = data inside the variable
You simply type a name, use the equal sign, and then include the data to be saved in the variable. Below is an example where I save the number 3 inside a variable called “example”
example=3
print(example)
3
The “print” function was used to display the contents of the “example” variable.
Numeric Types
There are two commonly used numeric data types in Python and they are integers and floating point values.
Integers
Integers are simply whole positive or negative numbers. To specifically save a number as an integer you place the number inside the “int” function before saving it as a variable as in the example below.
example=int(3)
print(example)
3
You can check the data type by using the “type” function on your variable. This is shown below.
type(example)
Out[17]: int
The results are “int” which stands for integer.
Floating-Point Types
Floating-point numbers are numbers with decimals. If your number includes a decimal it will automatically be stored as a floating type. If your number is a whole number and you want to save it as a floating type you need to use the “float” function when storing the data. Below are examples of both
#This is an example of a float number
example=3.23
print(example)
3.23
#This is an example of converting a whole number to a floating point
example=float(3)
print(example)
3.0
Floating points can store exponent numbers using scientific notation. Floating point numbers are used because decimals are part of the real world. The downside is that they store many decimal values only approximately, so small rounding errors can appear in calculations.
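As a small illustration of scientific notation, the letter e in a number means "times ten to the power of". The value below is chosen only as an example.

```python
# 1.5e3 means 1.5 times 10 to the power of 3
example = 1.5e3
print(example)
print(type(example))
```

Python stores the result as a floating-point number even though 1500 is a whole number.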
Other Types
We will look at two additional data types and they are boolean and string.
Boolean
A boolean variable only has two possible values which are True or False. This seems useless but it is powerful when it is time to have your application do things based on conditions. You are not really limited to True or False you can also type in mathematical expressions that Python evaluates. Below are some examples.
#Variable set to True
example=True
print(example)
True
#Variable set to True after evaluating an expression
example=1<2
print(example)
True
String
A string is a variable that contains text. The text is always enclosed in quotations and can be numbers, text, or a combination of both.
example="ERT is an awesome blog"
print(example)
ERT is an awesome blog
Conclusion
Programming is essentially about rearranging data for various purposes. Therefore, it only makes sense that there would be different ways to store data. This post provides some common forms in which data can manifest itself while using Python.