Fraud Detection with Python: Sampling

In this post, we will explore how to approach resampling when implementing fraud detection with Python. Fraud datasets are usually heavily imbalanced: legitimate (negative) cases vastly outnumber fraudulent (positive) ones. The problem with this is that a model can score very high accuracy simply by predicting that every case is legitimate, without detecting any fraud at all. Therefore, we must consider how to address the low number of positive instances when conducting a fraud analysis.

We are going to first look at the characteristics of the data as is, then we will use Python to balance our data and compare the original data with the modified data.

Data Preparation of Original Data

Below are the libraries we will use.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE

Pandas and numpy are for loading and handling our dataset. Matplotlib is for data visualization, and the SMOTE class from the imbalanced-learn (imblearn) library will be used to rebalance our data later on.

The data we will use is not available on the web. Therefore, the file path is hidden behind the variable data_loc, which holds the location of the data on my computer. The code is below.

df = pd.read_csv(data_loc)

We will now look at the data using the .info() method.

print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5050 entries, 0 to 5049
Data columns (total 31 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  5050 non-null   int64  
 1   V1          5050 non-null   float64
 2   V2          5050 non-null   float64
 3   V3          5050 non-null   float64
 4   V4          5050 non-null   float64
 5   V5          5050 non-null   float64
 6   V6          5050 non-null   float64
 7   V7          5050 non-null   float64
 8   V8          5050 non-null   float64
 9   V9          5050 non-null   float64
 10  V10         5050 non-null   float64
 11  V11         5050 non-null   float64
 12  V12         5050 non-null   float64
 13  V13         5050 non-null   float64
 14  V14         5050 non-null   float64
 15  V15         5050 non-null   float64
 16  V16         5050 non-null   float64
 17  V17         5050 non-null   float64
 18  V18         5050 non-null   float64
 19  V19         5050 non-null   float64
 20  V20         5050 non-null   float64
 21  V21         5050 non-null   float64
 22  V22         5050 non-null   float64
 23  V23         5050 non-null   float64
 24  V24         5050 non-null   float64
 25  V25         5050 non-null   float64
 26  V26         5050 non-null   float64
 27  V27         5050 non-null   float64
 28  V28         5050 non-null   float64
 29  Amount      5050 non-null   float64
 30  Class       5050 non-null   int64  
dtypes: float64(29), int64(2)
memory usage: 1.2 MB
None

For our purposes, there are 30 variables available from V1 to Class. The first column, Unnamed: 0, appears to be a leftover row index and is not used. We will now look at a breakdown of the Class variable, which tells us if there is fraud or not.

fraud_breakdown = df['Class'].value_counts()
print(fraud_breakdown)
Class
0    5000
1      50
Name: count, dtype: int64

By subsetting for the Class variable and using the .value_counts() method, we can see that there are 5050 rows of data, with 5000 not being fraud and 50 being fraud. This indicates that less than one percent of the cases are instances of fraud, and this is confirmed with the code below.

print(fraud_breakdown / len(df))
Class
0    0.990099
1    0.009901
Name: count, dtype: float64
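
These proportions also show why accuracy alone is misleading on imbalanced data. The short sketch below is not part of the original analysis. It builds hypothetical label lists with the same class counts (5000 and 50) and scores a "model" that always predicts no fraud. The names y_true and y_pred are made up for illustration.

```python
# Hypothetical labels mirroring the class counts above: 5000 legitimate, 50 fraud.
y_true = [0] * 5000 + [1] * 50

# A "model" that ignores its input and always predicts the majority class.
y_pred = [0] * len(y_true)

# Accuracy: share of all cases predicted correctly.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of the actual fraud cases that were flagged as fraud.
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = fraud_caught / sum(y_true)

print(f"accuracy: {accuracy:.4f}")
print(f"fraud recall: {recall:.4f}")
```

The accuracy comes out at about 99 percent even though not a single fraud case is caught, which is why the class balance has to be addressed before a model is trained.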

Next, we will create a visualization of our data.

Data Visualization

In order to create the data visualization, we need to separate the X values, which are all the variables we are using that do not tell us if the case is fraud or not, from the y value, which is the Class variable. We also need to convert them to a numpy array. Below is the code to do this.

X = df.iloc[:, 1:30]
X = np.array(X).astype('float')
y = df.iloc[:, 30]
y = np.array(y).astype('float')

Below is the code for the data scatterplot. The plot will be based on the first two variables of the dataset and colored by the Class variable.

plt.scatter(X[y == 0, 0], X[y == 0, 1], 
            label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X[y == 1, 0], X[y == 1, 1], 
            label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.show()

Here is a breakdown of the code.

1. We use plt.scatter. The first argument, X[y == 0, 0], selects the rows of X where y = 0 and takes the values of the first column. The second argument, X[y == 0, 1], selects the same rows and takes the values of the second column.

2. Next, we set the label, alpha, and linewidth.

3. We repeat this process, but this time we take the values when y = 1 instead of y= 0. We also set the color to red instead of the default blue.

4. Finally, we plot both scatter plots on the same plot with a legend indicating what the color means.

You can see the huge imbalance with just this visual. We will now look at how to correct this imbalance.

SMOTE

Resampling can be performed in several ways. Undersampling involves reducing the number of non-fraud cases until it matches the number of fraud cases. In other words, for our dataset of 5050 rows with 50 fraud cases, we would reduce this to 100 rows of data, 50 of them fraud. One problem with this is that you throw out a lot of data.
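
To make undersampling concrete, here is a minimal sketch in plain Python. It is an illustration only, not the method used later in this post, and the function random_undersample and the toy rows are hypothetical. It assumes fraud is labeled 1 and is the smaller class.

```python
import random

def random_undersample(rows, labels, seed=0):
    """Keep every fraud row plus an equally sized random sample of non-fraud rows."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == 1]
    majority = [i for i, lab in enumerate(labels) if lab == 0]
    # Draw as many non-fraud rows (without replacement) as there are fraud rows.
    kept = minority + rng.sample(majority, len(minority))
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy stand-in data with the same class counts as the post (one made-up feature).
labels = [0] * 5000 + [1] * 50
rows = [[float(i)] for i in range(len(labels))]

X_under, y_under = random_undersample(rows, labels)
print(len(y_under), sum(y_under))
```

The result has 100 rows, 50 of them fraud, which means 4950 legitimate rows were discarded.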

Another approach is oversampling, which involves duplicating your fraud cases until they make up half of your data. For example, since our dataset contains 5050 cases with 50 cases of fraud, we would duplicate our fraud cases until we had 5000 fraud cases for a total dataset size of 10,000. The problem here is that the model sees the same 50 fraud cases repeated many times, which can lead it to overfit to those specific cases.
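
A minimal sketch of oversampling by duplication is shown below, again in plain Python and for illustration only. The function random_oversample and the toy data are hypothetical, and fraud is assumed to be labeled 1.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen fraud rows until both classes are the same size."""
    rng = random.Random(seed)
    minority = [i for i, lab in enumerate(labels) if lab == 1]
    majority = [i for i, lab in enumerate(labels) if lab == 0]
    # Sample fraud rows with replacement to fill the gap between the classes.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    kept = majority + minority + extra
    rng.shuffle(kept)
    return [rows[i] for i in kept], [labels[i] for i in kept]

# Toy stand-in data with the same class counts as the post (one made-up feature).
labels = [0] * 5000 + [1] * 50
rows = [[float(i)] for i in range(len(labels))]

X_over, y_over = random_oversample(rows, labels)
fraud_rows = [tuple(r) for r, lab in zip(X_over, y_over) if lab == 1]
print(len(y_over), sum(y_over), len(set(fraud_rows)))
```

The classes are now balanced at 5000 each, but the count of distinct fraud rows is still 50. No new information about fraud has been added.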

SMOTE, or Synthetic Minority Oversampling Technique, is a variation on oversampling. Instead of duplicating fraud cases, it creates new, synthetic fraud cases that lie between an existing fraud case and one of its nearest fraud neighbors. This works if your fraud cases are similar to each other.

Below, we will generate a dataset using SMOTE.
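
The core idea can be sketched in a few lines of plain Python. This is a simplified illustration and not the imblearn implementation. The function smote_sketch, its parameters, and the toy points are hypothetical, and it assumes at least two minority points.

```python
import math
import random

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority points to base (Euclidean distance), excluding base.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        # Pick a random position on the line segment from base to the neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

# Four made-up fraud cases with two features each.
fraud_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(fraud_points, n_new=10, k=2)
print(len(new_points))
```

Every synthetic point falls on a line between two real fraud cases, so the new cases resemble the existing ones without being exact copies.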

# Define the resampling method
method = SMOTE()

# Create the resampled feature set
X_resampled, y_resampled = method.fit_resample(X, y)

print(pd.Series(y).value_counts())
print(pd.Series(y_resampled).value_counts())
print(X.shape[0])
print(X_resampled.shape[0])

0.0    5000
1.0      50
Name: count, dtype: int64
0.0    5000
1.0    5000
Name: count, dtype: int64
5050
10000

The code creates an instance of SMOTE(). We then create our resampled X and y values using .fit_resample. Next, we print our results. The first output shows the original class counts in y: 5000 non-fraud cases and 50 fraud cases. The next output shows the resampled y values, with 5000 non-fraud and 5000 fraud cases after using SMOTE(). Now the data is balanced. The last two outputs compare the original number of rows in X (5050) with the number of rows after resampling (10000).

Below is the code for the visualization. It is the same as the previous visual, just with the resampled data.

# Plot the resampled data
plt.scatter(X_resampled[y_resampled == 0, 0], 
            X_resampled[y_resampled == 0, 1], 
            label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X_resampled[y_resampled == 1, 0], 
            X_resampled[y_resampled == 1, 1], 
            label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.show()

You can see the difference compared to the first plot. This data is much more balanced, which can help a model learn to detect fraud cases. How you address imbalances depends on the situation, so let’s not assume SMOTE is the best approach every single time.

Conclusion

Fraud detection is critical in different industries to prevent crime and abuse. Python can be used to support this process. Naturally, fraud is unusual when compared to legitimate transactions. This necessitates the use of techniques such as undersampling, oversampling, or SMOTE to balance the data, so that the model learns to detect fraud rather than simply predicting the majority class.