A common problem in machine learning is data quality. In other words, if the data is bad, the model will be bad even if it is designed using best practices. Below is a short list of some possible problems with data:

- Sample size is too small: hurts all algorithms
- Sample size is too big: hurts complex algorithms
- Wrong data: hurts all algorithms
- Too many variables: hurts complex algorithms

Naturally, this list is not exhaustive. Any of the situations above can lead to a model that suffers from high bias or high variance. Bias occurs when the model systematically overestimates or underestimates values. This is common in regression when the relationship among the variables is not linear. The straight line fitted by the model works for some observations but is often wrong.

Variance is when the model is too sensitive to the characteristics of the training data. This means that the model develops a complex way to classify or perform regression that does not generalize to other datasets.
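To make the distinction concrete, here is a minimal sketch on made-up data (not the dataset used later in this post). It assumes a curved relationship with noise and compares a straight line, which underfits, with an unrestricted decision tree, which memorizes the training points. The variable names and the choice of `DecisionTreeRegressor` are illustrative assumptions only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic (made-up) data: a curved relationship plus random noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 6, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.3, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

models = {
    # High bias: a straight line cannot follow the curve
    "line": LinearRegression(),
    # High variance: a tree with no depth limit fits the training noise
    "tree": DecisionTreeRegressor(random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    results[name] = (train_mse, test_mse)
    print(name, "train MSE:", train_mse, "test MSE:", test_mse)
```

The line should show a similar, fairly large error on both sets, while the tree should show almost no training error and a clearly larger test error.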

One way to address these problems is cross-validation. Cross-validation involves dividing the training set into several folds. For example, you may divide the data into 10 folds. You train the model on 9 folds and test it on the 10th, repeating the process so that each fold serves as the test set once. You then average the evaluation metric across the ten test folds. This method is commonly called k-fold cross-validation. This process helps to stabilize the results of the final model. We will now look at how to do this using Python.
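The procedure just described can be sketched as an explicit loop. This is only an illustration on made-up data, with hypothetical names such as `X_toy` and `fold_mse`. It spells out by hand the fold-by-fold train, test, and average steps.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Made-up data for illustration only
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

kf = KFold(n_splits=10, shuffle=True, random_state=1)

fold_mse = []
for train_idx, test_idx in kf.split(X_toy):
    # Train on 9 folds
    model = LinearRegression().fit(X_toy[train_idx], y_toy[train_idx])
    # Test on the held-out fold
    preds = model.predict(X_toy[test_idx])
    fold_mse.append(mean_squared_error(y_toy[test_idx], preds))

# Average the metric across the ten test folds
print(len(fold_mse), np.mean(fold_mse), np.std(fold_mse))
```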

**Data Preparation**

We will develop a regression model using the PSID dataset. Our goal will be to predict earnings based on the other variables in the dataset. Below is some initial code.

import pandas as pd

import numpy as np

from pydataset import data

from sklearn.model_selection import train_test_split

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error

from sklearn.model_selection import KFold

from sklearn.model_selection import cross_val_score

We now need to load the PSID dataset. When this is done, there are several other things we need to do.

- We have to drop all NA’s in the dataset
- We also need to convert the “married” variable to a dummy variable.

Below is the code for completing these steps.

df=data('PSID').dropna()

df.loc[df.married!= 'married', 'married'] = 0

df.loc[df.married== 'married','married'] = 1

df['married'] = df['married'].astype(int)

df['marry']=df.married

The code above loads the data while dropping the NAs. We then use the .loc indexer to set everyone who is not married to 0 and everyone who is married to 1. This variable is then converted to an integer using the .astype method. Lastly, we make a new variable called ‘marry’ and store our data there.
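As a side note, the same dummy variable can be built in one step by comparing the column to the string and casting the result. The sketch below uses a small hypothetical frame rather than the PSID data, so the category labels other than 'married' are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame standing in for the 'married' column
toy = pd.DataFrame(
    {"married": ["married", "never married", "divorced", "married"]}
)

# The comparison yields True/False; astype(int) turns that into 1/0
toy["marry"] = (toy["married"] == "married").astype(int)

print(toy)
```

This leaves the original text column untouched, which can be handy when you want to check the recoding afterwards.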

There is one other problem we need to address. The ‘kids’ and ‘educatn’ variables contain values of 98 and 99. In the original survey, these responses meant that the person either did not know or did not want to say how many kids or how much education they had. We will remove these individuals from the sample using the code below.

df.drop(df.loc[df['kids']>90].index, inplace=True)

df.drop(df.loc[df['educatn']>90].index, inplace=True)

The code above tells Python to remove the rows with values greater than 90 in either variable. With this done, we can now make the dataset that contains the independent variables and the dataset that contains the dependent variable.

X=df[['age','educatn','hours','kids','marry']]

y=df['earnings']
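The filtering of the 98 and 99 codes can also be written as a single boolean mask that keeps the valid rows instead of dropping the invalid ones. The sketch below uses a tiny made-up frame (`toy_survey` is a hypothetical name), assuming, as above, that any value over 90 is a non-response code.

```python
import pandas as pd

# Made-up rows; 98 and 99 stand for "don't know" / "refused"
toy_survey = pd.DataFrame(
    {"kids": [0, 2, 98, 1, 99], "educatn": [12, 99, 16, 10, 98]}
)

# Keep a row only if both variables hold a real answer
valid = (toy_survey["kids"] <= 90) & (toy_survey["educatn"] <= 90)
clean = toy_survey[valid].copy()

print(clean)
```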

**Model Development**

We are now going to make several models and use the mean squared error as our way of comparing them. The first model will use all of the data. The second model will use the training data. The third model will use cross-validation. Below is the code for the first model, which uses all of the data.

regression=LinearRegression()

regression.fit(X,y)

first_model=(mean_squared_error(y_true=y,y_pred=regression.predict(X)))

print(first_model)

138544429.96275884

For the second model, we first need to make our train and test sets. Then we will run our model. The code is below.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=.3,random_state=5)

regression.fit(X_train,y_train)

second_model=(mean_squared_error(y_true=y_train,y_pred=regression.predict(X_train)))

print(second_model)

148286805.4129756

You can see that the numbers are somewhat different. This is to be expected when dealing with different sample sizes. Note that both figures are computed on the same data the model was fit on. With cross-validation using the full dataset we get results similar to the first model we developed. This is done through an instance of the KFold class. For KFold we want 10 folds, we want to shuffle the data, and we want to set the seed.

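The other function we need is the cross_val_score function. In this function, we set the type of model, the data we will use, the metric for evaluation, and the characteristics of the type of cross-validation. Because scikit-learn reports the 'neg_mean_squared_error' metric as a negative number, we take the absolute value of the scores before averaging them. Once this is done we print the number of folds along with the mean and standard deviation of the fold results. Below is the code.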

crossvalidation=KFold(n_splits=10,shuffle=True,random_state=1)

scores=cross_val_score(regression,X,y,scoring='neg_mean_squared_error',cv=crossvalidation,n_jobs=1)

print(len(scores),np.mean(np.abs(scores)),np.std(scores))

10 138817648.05153447 35451961.12217143

These numbers are closer to what is expected from the dataset, despite the fact that we didn’t use all of the data at the same time. You can also run cross-validation on the training set for additional comparison.
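A minimal sketch of that comparison is shown here on synthetic data, so the numbers it prints are not PSID results. It cross-validates on the training set only and then checks the fitted model against the held-out test set. All variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic (made-up) regression data
rng = np.random.default_rng(5)
X_syn = rng.normal(size=(300, 4))
y_syn = X_syn @ np.array([3.0, 0.0, -2.0, 1.0]) + rng.normal(scale=1.0, size=300)

X_tr_syn, X_te_syn, y_tr_syn, y_te_syn = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=5
)

model = LinearRegression()
cv = KFold(n_splits=10, shuffle=True, random_state=1)

# Cross-validate on the training set only; scores come back negative
neg_mse = cross_val_score(
    model, X_tr_syn, y_tr_syn, scoring="neg_mean_squared_error", cv=cv
)
cv_mse = -neg_mse.mean()

# Fit on the whole training set and evaluate on the untouched test set
model.fit(X_tr_syn, y_tr_syn)
test_mse = mean_squared_error(y_te_syn, model.predict(X_te_syn))

print(cv_mse, test_mse)
```

If the cross-validated error and the test error are in the same range, that is some evidence the cross-validation estimate is a reasonable guide to performance on new data.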

**Conclusion**

This post provides an example of cross-validation in Python. The use of cross-validation helps to stabilize the results that may come from your model. With increased stability comes increased confidence in your model’s ability to generalize to other datasets.