In this post, we use logistic regression together with the SMOTE oversampling technique to improve our model’s ability to detect fraud. SMOTE (Synthetic Minority Over-sampling Technique) creates synthetic fraud cases by interpolating between actual fraud cases, which balances the number of fraud and non-fraud cases in the training data. We will begin by loading our libraries.
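To make the idea of a synthetic case concrete, here is a minimal sketch of the interpolation step behind SMOTE. It is a simplified illustration and not the imblearn implementation, which also searches for the nearest minority-class neighbors before interpolating. The function name make_synthetic_point and the two sample points are made up for this example.

```python
import numpy as np

def make_synthetic_point(sample, neighbor, rng):
    """Return a random point on the line segment between two minority-class samples."""
    gap = rng.random()  # uniform draw in [0, 1)
    return sample + gap * (neighbor - sample)

rng = np.random.default_rng(0)
fraud_a = np.array([1.0, 4.0])  # made-up fraud case
fraud_b = np.array([3.0, 8.0])  # made-up neighboring fraud case

synthetic = make_synthetic_point(fraud_a, fraud_b, rng)
print(synthetic)  # a point somewhere between fraud_a and fraud_b
```

Because every synthetic point lies between two real minority cases, SMOTE adds new examples in the region where fraud already occurs rather than copying existing rows.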
Libraries
The libraries we are using are below. As we use these libraries, they will be explained.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
Data Preparation
We load our data using the read_csv() function from pandas. The object data_loc stores the location of the data file on the computer. The data used in this example is not available. After loading the data, we use .shape to see how many rows and columns we have. The code and output are below.
df = pd.read_csv(data_loc)
df.shape
(5050, 31)
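You can see that we have 5050 rows and 31 columns. Next, we need to separate the X values from the y value. Because .iloc uses zero-based positions and excludes the end of a slice, 1:30 selects the 2nd through 30th columns (29 features) as the X values, and position 30 selects the 31st and last column as the y value. Both are then converted to NumPy arrays of floats. The code below completes all of this for us.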
X = df.iloc[:, 1:30]
X = np.array(X).astype('float')
y = df.iloc[:, 30]
y=np.array(y).astype('float')
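Since .iloc slicing is easy to get wrong by one column, it can help to check a slice on a small frame first. The sketch below uses a made-up DataFrame with 31 columns named col_0 to col_30. These names are placeholders and not the columns of the actual data.

```python
import numpy as np
import pandas as pd

# Made-up frame with the same number of columns (31) as the data in this post
demo = pd.DataFrame(np.zeros((3, 31)), columns=[f"col_{i}" for i in range(31)])

X_demo = demo.iloc[:, 1:30]  # positions 1 to 29, the end position 30 is excluded
y_demo = demo.iloc[:, 30]    # position 30, the last column

print(X_demo.shape)                   # (3, 29)
print(list(X_demo.columns[[0, -1]]))  # ['col_1', 'col_29']
print(y_demo.name)                    # col_30
```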
In the code below, we are creating our train and test sets. We are going to split our X and y objects so that 70% of the data is for training and 30% of the data is for testing purposes. The function train_test_split() is used for this, with the argument test_size being set to 0.3 for 30% test data and the random_state being set to 0, which is the seed number.
# Split X and y into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
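As a quick check of what this split produces, the sketch below runs the same call on placeholder arrays of zeros with 5050 rows. The arrays are stand-ins for the real data, which is not available. The resulting 1515 test rows match the total support in the classification report further down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows (5050) as the data in this post
X_demo = np.zeros((5050, 29))
y_demo = np.zeros(5050)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=.3, random_state=0)
print(X_tr.shape, X_te.shape)  # (3535, 29) (1515, 29)
```

With so few fraud cases, the number that lands in the test set depends on the seed. The train_test_split() function also accepts a stratify argument that keeps the class proportions the same in both sets, which may be worth considering.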
Pipeline Development
A pipeline is used to chain several actions together sequentially and is similar to piping in R. To do this, we are using the Pipeline() function from the imblearn library. The imblearn library provides tools for imbalanced datasets such as ours, and its Pipeline, unlike the one in scikit-learn, accepts resampling steps such as SMOTE. We start by creating an instance of SMOTE and an instance of logistic regression, because these are the two objects we will pipe one after the other. The model is given max_iter=1000 so that the solver has more iterations to converge than the default of 100.
Next, we create the pipe itself. We create an object called “pipeline” using the Pipeline function. Inside this function is a list of two tuples, each pairing a step name with an object. The first tuple is for SMOTE and uses the resampling object created at the beginning of this cell, and the second holds the logistic regression model. Notice that both tuples are wrapped inside square brackets, which makes them a list. The code for all of this is below.
# Define which resampling method and which ML model to use in the pipeline
resampling = SMOTE()
model = LogisticRegression(max_iter=1000)
# Define the pipeline, tell it to combine SMOTE with the Logistic Regression model
pipeline = Pipeline([('SMOTE', resampling), ('Logistic Regression',model)])
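What we did in this code was tell Python to use SMOTE to create synthetic instances of fraud and then train the model on the resampled data. Nothing has been computed yet. The resampling happens when the pipeline is fitted, and it is applied only to the data passed to .fit(), never to the data passed to .predict().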
Model Development and Performance Metrics
We will now train our logistic regression model on the SMOTE-resampled training data and make predictions. We use the .fit() method of the pipeline object with the training data and then the .predict() method with the test data. The code is below.
# Fit the pipeline on the training set, then obtain predictions for the test set
pipeline.fit(X_train, y_train)
predicted = pipeline.predict(X_test)
Now we run our performance metrics to see how our model did. We will use the classification_report() and confusion_matrix() functions. The classification_report() function tells us the precision, recall, and f1-score for each class. The confusion_matrix() function returns a cross-tabulation of the actual classes (rows) against the predicted classes (columns). Notice that in both of these metrics, we compare the y test values with the predicted values.
# Obtain the results from the classification report and confusion matrix
print('Classification report:\n', classification_report(y_test, predicted))
conf_mat = confusion_matrix(y_true=y_test, y_pred=predicted)
print('Confusion matrix:\n', conf_mat)
Classification report:
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      1505
         1.0       0.82      0.90      0.86        10

    accuracy                           1.00      1515
   macro avg       0.91      0.95      0.93      1515
weighted avg       1.00      1.00      1.00      1515

Confusion matrix:
 [[1503    2]
 [   1    9]]
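The fraud row of the classification report can be recomputed by hand from the confusion matrix, which helps show where each number comes from. The sketch below types in the matrix printed above and uses the standard definitions of precision, recall, and f1-score.

```python
import numpy as np

# Confusion matrix copied from the output above
conf_mat = np.array([[1503, 2],
                     [1, 9]])

# Rows are actual classes and columns are predicted classes
tn, fp, fn, tp = conf_mat.ravel()

precision = tp / (tp + fp)  # 9 / 11: share of flagged cases that were fraud
recall = tp / (tp + fn)     # 9 / 10: share of fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}')
# precision=0.82 recall=0.90 f1=0.86
```

The accuracy of 1.00 in the report is rounded from 1512 correct predictions out of 1515. With only 10 fraud cases in the test set, recall and precision for class 1.0 are far more informative than accuracy.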
Conclusion
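With the help of SMOTE, it is possible to improve the ability of a model to detect fraud when fraud cases are rare. In this example, the pipeline caught 9 of the 10 fraud cases in the test set (recall of 0.90) with 2 false alarms among 1505 legitimate cases (precision of 0.82 for the fraud class). Because SMOTE() was created without a random_state, rerunning the code may give slightly different numbers. As such, SMOTE is a powerful tool that can be useful in the appropriate context.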