In this post, we will explore how to approach resampling when implementing fraud detection with Python. In fraud detection, there is usually a severe imbalance between non-fraud (negative) and fraud (positive) cases. The problem is that a model can achieve very high accuracy simply by predicting that every case is not fraudulent, while catching no fraud at all. Therefore, we must consider how to address the low number of positive instances when conducting a fraud analysis.
We will first look at the characteristics of the data as is. Then we will use Python to balance our data and compare the original data with the modified data.
Data Preparation of Original Data
Below are the libraries we will use.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from imblearn.over_sampling import SMOTE
Pandas is for loading our dataset, and numpy is for converting it to arrays. Matplotlib is for data visualization, and the SMOTE class from imblearn will be used to rebalance our data later on.
The data we will use is not available on the web. Therefore, the file path is not shown in the code; data_loc is a variable holding the location of the data on my computer. The code is below.
df = pd.read_csv(data_loc)
We will now look at the data using the .info() method.
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5050 entries, 0 to 5049
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5050 non-null int64
1 V1 5050 non-null float64
2 V2 5050 non-null float64
3 V3 5050 non-null float64
4 V4 5050 non-null float64
5 V5 5050 non-null float64
6 V6 5050 non-null float64
7 V7 5050 non-null float64
8 V8 5050 non-null float64
9 V9 5050 non-null float64
10 V10 5050 non-null float64
11 V11 5050 non-null float64
12 V12 5050 non-null float64
13 V13 5050 non-null float64
14 V14 5050 non-null float64
15 V15 5050 non-null float64
16 V16 5050 non-null float64
17 V17 5050 non-null float64
18 V18 5050 non-null float64
19 V19 5050 non-null float64
20 V20 5050 non-null float64
21 V21 5050 non-null float64
22 V22 5050 non-null float64
23 V23 5050 non-null float64
24 V24 5050 non-null float64
25 V25 5050 non-null float64
26 V26 5050 non-null float64
27 V27 5050 non-null float64
28 V28 5050 non-null float64
29 Amount 5050 non-null float64
30 Class 5050 non-null int64
dtypes: float64(29), int64(2)
memory usage: 1.2 MB
None
For our purposes, setting aside the first column (Unnamed: 0), there are 30 variables available from V1 to Class. We will now look at a breakdown of the Class variable, which tells us if there is fraud or not.
fraud_breakdown = df['Class'].value_counts()
print(fraud_breakdown)
Class
0 5000
1 50
Name: count, dtype: int64
By subsetting for the Class variable and using the .value_counts() method, we can see that of the 5050 rows of data, 5000 are not fraud and 50 are fraud. This means that less than one percent of the cases are instances of fraud, which is confirmed with the code below.
print(fraud_breakdown / len(df))
Class
0 0.990099
1 0.009901
Name: count, dtype: float64
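To make the consequence of this imbalance concrete, here is a minimal sketch that uses only the class counts shown above. The "model" is a hypothetical baseline that labels every case as not fraud, and the variable names are illustrative rather than part of the original analysis.

```python
# Class counts taken from the value_counts() output above.
n_legit, n_fraud = 5000, 50

# True labels: 0 = not fraud, 1 = fraud.
y_true = [0] * n_legit + [1] * n_fraud

# Hypothetical baseline: predict "not fraud" for every single case.
y_pred = [0] * len(y_true)

# Accuracy: share of all cases that were labeled correctly.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall on fraud: share of actual fraud cases that were caught.
caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = caught / n_fraud

print(f"accuracy: {accuracy:.4f}")    # accuracy: 0.9901
print(f"fraud recall: {recall:.4f}")  # fraud recall: 0.0000
```

An accuracy of about 99 percent looks impressive, yet this baseline never flags a single fraud case, which is why accuracy alone is a poor guide on data like this.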
Next, we will create a visualization of our data.
Data Visualization
To create the data visualization, we need to separate the X values, which are all the variables that do not tell us if the case is fraud or not, from the y value, which is the Class variable. We also need to convert both to numpy arrays. Below is the code to do this.
X = df.iloc[:, 1:30]
X = np.array(X).astype('float')
y = df.iloc[:, 30]
y = np.array(y).astype('float')
Below is the code for the data scatterplot. The plot will be based on the first two variables of the dataset and colored by the Class variable.
plt.scatter(X[y == 0, 0], X[y == 0, 1],
            label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X[y == 1, 0], X[y == 1, 1],
            label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.show()

Here is a breakdown of the code.
1. We use plt.scatter. The first argument, X[y == 0, 0], selects the rows of X where y = 0 and takes the first column as the horizontal values. The second argument, X[y == 0, 1], selects the same rows and takes the second column as the vertical values.
2. Next, we set the label, alpha, and linewidth.
3. We repeat this process, but this time we take the values when y = 1 instead of y = 0. We also set the color to red instead of the default blue.
4. Finally, we add a legend indicating what each color means and display both scatter plots on the same figure.
You can see the huge imbalance with just this visual. We will now look at how to correct this imbalance.
SMOTE
Resampling can be performed in several ways. Undersampling involves reducing the number of non-fraud cases you are using to match the number of fraud cases. In other words, for our dataset of 5050 rows with 50 fraud cases, we would reduce this to perhaps 100 rows of data, 50 of them fraud cases. One problem with this is that you throw out a lot of data.
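As a rough illustration of random undersampling, the sketch below works on stand-in labels with the same class counts as our data rather than on the real dataset. The variable names are assumptions made for this example, and only the standard library is used.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-in labels with the same counts as our data: 5000 legitimate, 50 fraud.
labels = [0] * 5000 + [1] * 50

# Row positions for each class.
legit_idx = [i for i, label in enumerate(labels) if label == 0]
fraud_idx = [i for i, label in enumerate(labels) if label == 1]

# Keep every fraud row plus an equally sized random sample of legitimate rows.
kept_legit = random.sample(legit_idx, k=len(fraud_idx))
undersampled_idx = sorted(kept_legit + fraud_idx)

undersampled_labels = [labels[i] for i in undersampled_idx]
print(len(undersampled_labels))      # 100
print(undersampled_labels.count(0))  # 50
print(undersampled_labels.count(1))  # 50
```

The classes are now balanced, but 4950 of the 5000 legitimate rows have been discarded.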
Another approach is oversampling, which involves duplicating your fraud cases until they make up half of your data. For example, since our dataset contains 5050 cases with 50 cases of fraud, we would duplicate our fraud cases until we had 5000 fraud cases for a total dataset size of 10,000. The problem here is that the fraud half of the data consists of the same 50 cases repeated many times, which can lead a model to overfit to those few cases.
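Random oversampling can be sketched the same way. Again, this uses stand-in labels with our class counts instead of the real data, and the names are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Stand-in labels with the same counts as our data: 5000 legitimate, 50 fraud.
labels = [0] * 5000 + [1] * 50

legit_idx = [i for i, label in enumerate(labels) if label == 0]
fraud_idx = [i for i, label in enumerate(labels) if label == 1]

# Draw fraud rows with replacement until the fraud class matches the legitimate class.
n_extra = len(legit_idx) - len(fraud_idx)
extra_fraud = random.choices(fraud_idx, k=n_extra)

oversampled_idx = legit_idx + fraud_idx + extra_fraud
oversampled_labels = [labels[i] for i in oversampled_idx]

print(len(oversampled_labels))            # 10000
print(oversampled_labels.count(1))        # 5000
# However many copies are made, only 50 distinct fraud rows exist.
print(len(set(fraud_idx + extra_fraud)))  # 50
```

The last line shows the weakness of this approach: the 5000 fraud rows contain no information beyond the original 50.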
SMOTE, or Synthetic Minority Oversampling Technique, is a variation on oversampling. Instead of duplicating fraud cases, it creates new synthetic ones by interpolating between an existing fraud case and its nearest neighbors among the other fraud cases. This works well if your fraud cases are similar to each other.
Below, we will generate a dataset using SMOTE.
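The core idea can be sketched in a few lines. This is a simplified illustration and not imblearn's implementation: the points are made up, the helper names nearest_neighbors and make_synthetic are hypothetical, and k=2 neighbors is an arbitrary choice. Each synthetic point is placed at a random position on the line segment between a fraud case and one of its nearest fraud neighbors.

```python
import math
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Made-up minority (fraud) points in two dimensions, for illustration only.
fraud_points = [(1.0, 1.0), (1.5, 1.2), (0.8, 1.6), (1.3, 0.7)]

def nearest_neighbors(point, points, k):
    """Return the k points closest to `point`, excluding the point itself."""
    others = [p for p in points if p != point]
    return sorted(others, key=lambda p: math.dist(point, p))[:k]

def make_synthetic(points, n_new, k=2):
    """Hypothetical helper: create n_new points by SMOTE-style interpolation."""
    synthetic = []
    for _ in range(n_new):
        base = random.choice(points)
        neighbor = random.choice(nearest_neighbors(base, points, k))
        gap = random.random()  # position along the segment, in [0, 1)
        new_point = tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        synthetic.append(new_point)
    return synthetic

new_points = make_synthetic(fraud_points, n_new=6)
print(len(new_points))  # 6
```

Because every new point lies between two existing fraud cases, the synthetic cases stay inside the region the real fraud cases already occupy. That is why the technique suits fraud cases that resemble each other.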
# Define the resampling method
method = SMOTE()
# Create the resampled feature set
X_resampled, y_resampled = method.fit_resample(X, y)
print(pd.Series(y).value_counts())
print(pd.Series(y_resampled).value_counts())
print(X.shape[0])
print(X_resampled.shape[0])
0.0 5000
1.0 50
Name: count, dtype: int64
0.0 5000
1.0 5000
Name: count, dtype: int64
5050
10000
The code creates an instance of SMOTE(). We then create our resampled X and y values using .fit_resample(). Next, we print our results. The first output shows the class counts in the original y values: 5000 non-fraud cases and 50 fraud cases. The second output shows the resampled y values, with 5000 cases in each class. Now the data is balanced. The last two outputs compare the number of rows in the original X values (5050) with the number of rows after resampling (10000).
Below is the code for the visualization. It is the same as the previous visual, just with the resampled data.
# Plot the resampled data
plt.scatter(X_resampled[y_resampled == 0, 0],
            X_resampled[y_resampled == 0, 1],
            label="Class #0", alpha=0.5, linewidth=0.15)
plt.scatter(X_resampled[y_resampled == 1, 0],
            X_resampled[y_resampled == 1, 1],
            label="Class #1", alpha=0.5, linewidth=0.15, c='r')
plt.legend()
plt.show()

You can see the difference compared to the first plot. This data is much more balanced, which will help in the detection of fraud cases. How you address imbalances depends on the situation, so let’s not assume SMOTE is the best approach every single time.
Conclusion
Fraud detection is critical in different industries to prevent crime and abuse. Python can be used to support this process. Because fraud is rare compared to legitimate transactions, techniques such as resampling are needed to balance the data so that a model can learn to recognize fraud cases rather than simply predicting the majority class.