Python Fraud Detection: Traditional Approach

Fraud detection today leverages complex algorithms and machine learning approaches. However, this was not always the case. In the past, fraud detection used simple yet highly efficient methods. In this post, we will look at a traditional method of fraud detection that sets threshold values for variables to flag a case as fraud or not.

Load Libraries

We will begin by loading our libraries and data. The data for this demonstration is not available on the web. Below, we load pandas, numpy, and matplotlib, then read the CSV file from the path stored in data_loc.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.read_csv(data_loc)

We will now look at the means of the individual variables.

Examine Means

To determine the cutoff values for our thresholds, we first examine the mean of each variable for fraud and non-fraud cases. After that, we will look at boxplots of the variables we will use. Below is the code and output for the means of the variables by class.

df.groupby('Class').mean() #provides a general threshold for fraud
Out[2]: 
        Unnamed: 0        V1        V2        V3        V4        V5  \
Class                                                                  
0      143084.8702  0.035030  0.011553  0.037444 -0.045760 -0.013825   
1      121384.7000 -4.985211  3.321539 -7.293909  4.827952 -3.326587   

             V6        V7        V8        V9       V10       V11       V12  \
Class                                                                         
0     -0.030885  0.014315 -0.022432 -0.002227  0.001667 -0.004511  0.017434   
1     -1.591882 -5.776541  1.395058 -2.537728 -5.917934  4.020563 -7.032865   

            V13       V14       V15       V16       V17       V18       V19  \
Class                                                                         
0      0.004204  0.006542 -0.026640  0.001190  0.004481 -0.010892 -0.016554   
1     -0.104179 -7.100399 -0.120265 -4.658854 -7.589219 -2.650436  0.894255   

            V20       V21       V22       V23       V24       V25       V26  \
Class                                                                         
0     -0.002896 -0.010583 -0.010206 -0.003305 -0.000918 -0.002613 -0.004651   
1      0.194580  0.703182  0.069065 -0.088374 -0.029425 -0.073336 -0.023377   

            V27       V28      Amount  
Class                                  
0     -0.009584  0.002414   85.843714  
1      0.380072  0.009304  113.469000  

Now, there are many different ways to explore the data to determine which variables to select and what to set the threshold values to. We can look at histograms, descriptive statistics, rely on domain knowledge, etc. For the sake of simplicity, we are selecting variables V1 and V3 for additional analysis. You can have more than two variables if you desire. Below are boxplots of V1 and V3.

#data to plot
V1=df[df['Class'] == 1]['V1']
V3=df[df['Class'] == 1]['V3']
plot_data=[V1,V3]
# Create a basic box plot
plt.boxplot(plot_data,tick_labels=["V1","V3"] )
plt.show()

Here is an explanation of the code.

We create two objects called V1 and V3. Both of these objects subset the data for Class when it equals 1 (which indicates fraud). The V1 object pulls the values of V1 when Class equals 1. The V3 does the same for the V3 variable. In other words, we now have all values of V1 and V3 when fraud is indicated. Next, we store our values in another object called plot_data. We then create our boxplot and label the x-axis.

The box plot for V1 indicates a median value of around -3, while the box plot for V3 indicates a median value of around -5. We will use these values as our thresholds.
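The medians can also be computed directly instead of being read off the plot. Below is a minimal sketch of that idea. Since the original data is not available, it uses a small made-up DataFrame called toy, and the values in it are invented purely for illustration.

```python
import pandas as pd

# Toy stand-in for the real data (all values are invented for illustration)
toy = pd.DataFrame({
    "Class": [0, 0, 0, 1, 1, 1],
    "V1": [0.1, -0.2, 0.3, -2.0, -3.0, -6.0],
    "V3": [0.0, 0.2, -0.1, -4.0, -5.0, -9.0],
})

# Median of each candidate variable among the fraud cases only
fraud_medians = toy.loc[toy["Class"] == 1, ["V1", "V3"]].median()
print(fraud_medians)
```

On the real data, the same call on df would return the exact medians that the boxplots only show approximately.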

Confusion Matrix with Thresholds

We will now set our thresholds and create the confusion matrix. Below is the code and output.

df['flag_as_fraud'] = np.where(np.logical_and(df['V1']<-3, df['V3']<-5), 1, 0)

print(pd.crosstab(df.Class, df.flag_as_fraud, rownames=['Actual Fraud'], colnames=['Flagged Fraud']))
Flagged Fraud     0   1
Actual Fraud           
0              4984  16
1                28  22

Here is an explanation of the code.

1. We create a new column called “flag_as_fraud”. This column uses a 1 when V1 < -3 and V3 < -5. All other instances are flagged as 0.
2. Next, we create our crosstabs comparing Class with flag_as_fraud. Here are the results.

4984 True negatives = It was not flagged as fraud, and it was not actual fraud
22 True positives = It was flagged as fraud, and it was actual fraud
28 False negatives = It was not flagged as fraud, but it was actual fraud
16 False positives = It was flagged as fraud, but it was not actual fraud

Now, whether these results are good or bad depends on the situation. Both false negatives and false positives are a problem, and reducing one usually increases the other. If the context were credit card fraud, false negatives may be worse, as the criminal may get away with fraud. Another way to assess the values is through a classification report.

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

print('Classification report:\n', classification_report(df['Class'],df['flag_as_fraud']))
conf_mat = confusion_matrix(y_true=df['Class'], y_pred=df['flag_as_fraud'])
print('Confusion matrix:\n', conf_mat)

Classification report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00      5000
           1       0.58      0.44      0.50        50

    accuracy                           0.99      5050
   macro avg       0.79      0.72      0.75      5050
weighted avg       0.99      0.99      0.99      5050

Confusion matrix:
 [[4984   16]
 [  28   22]]

The output above provides numbers for us to assess. Precision is the share of cases flagged as fraud that were actual fraud: true positives divided by all flagged cases (true positives plus false positives). Recall is the share of actual fraud cases that were flagged: true positives divided by all actual fraud cases (true positives plus false negatives). The F1-score is the harmonic mean of precision and recall. For the fraud class, precision is 0.58 and recall is 0.44, so our rule struggles with both and needs to be modified.
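To see where the fraud-class numbers in the report come from, they can be reproduced by hand from the four counts in the confusion matrix. The sketch below uses only those counts; the variable names tn, fp, fn, and tp are our own shorthand.

```python
# Counts taken from the confusion matrix above
tn, fp, fn, tp = 4984, 16, 28, 22

precision = tp / (tp + fp)  # share of flagged cases that were actual fraud
recall = tp / (tp + fn)     # share of actual fraud cases that were flagged
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# prints: precision=0.58 recall=0.44 f1=0.50
```

These match the row for class 1 in the classification report.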

Multiple Rules

It is possible to have more than one rule. Below is an example.

df['flag_as_fraud'] = np.where(np.logical_and(df['V1']<-3, df['V3']<-5),1, 0)
df['flag_as_fraud'] = np.where(np.logical_and(df['flag_as_fraud']== 1, df['V7']<-6),1, 0)
df['flag_as_fraud'] = np.where(np.logical_and(df['flag_as_fraud']== 1, df['V9']<0),1, 0)
df['flag_as_fraud'] = np.where(np.logical_and(df['flag_as_fraud']== 1, df['V10']<-4.5),1, 0)

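In this code, we set the initial rule as done previously. Then, for the second rule, we use the result of the previous rule as the first comparison and place the new variable as the second comparison. We then repeat this as many times as necessary. Because each step keeps a 1 only where the previous flag was already 1, a case is flagged only if it satisfies every rule. Below is a verbal explanation of the code above.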

Create “flag_as_fraud” where V1 < -3 and V3 < -5. Then create “flag_as_fraud” where “flag_as_fraud” = 1 and V7 < -6. Then create “flag_as_fraud” where “flag_as_fraud” = 1 and V9 < 0. Then create “flag_as_fraud” where “flag_as_fraud” = 1 and V10 < -4.5.
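Since a case must pass every rule to stay flagged, the chain is equivalent to one condition that joins all of the comparisons with &. Below is a sketch that checks this on a small made-up DataFrame called toy (the values are invented for illustration). The combined form is our own alternative, not something the original code requires.

```python
import numpy as np
import pandas as pd

# Toy rows (invented values) standing in for the real data
toy = pd.DataFrame({
    "V1":  [-4.0, -4.0, -1.0, -5.0],
    "V3":  [-6.0, -6.0, -6.0, -7.0],
    "V7":  [-7.0, -2.0, -7.0, -8.0],
    "V9":  [-1.0, -1.0, -1.0,  0.5],
    "V10": [-5.0, -5.0, -5.0, -6.0],
})

# Chained version: each step keeps only rows that were already flagged
chained = np.where(np.logical_and(toy["V1"] < -3, toy["V3"] < -5), 1, 0)
chained = np.where(np.logical_and(chained == 1, toy["V7"] < -6), 1, 0)
chained = np.where(np.logical_and(chained == 1, toy["V9"] < 0), 1, 0)
chained = np.where(np.logical_and(chained == 1, toy["V10"] < -4.5), 1, 0)

# Combined version: one boolean mask with every rule joined by &
combined = (
    (toy["V1"] < -3)
    & (toy["V3"] < -5)
    & (toy["V7"] < -6)
    & (toy["V9"] < 0)
    & (toy["V10"] < -4.5)
).astype(int)

print(chained.tolist())   # [1, 0, 0, 0]
print(combined.tolist())  # [1, 0, 0, 0]
```

Only the first row passes all five comparisons. The other rows each fail one rule (V7, V1, and V9 respectively), so both versions flag them as 0.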

Below is the classification report and confusion matrix.

print('Classification report:\n', classification_report(df['Class'],df['flag_as_fraud']))
conf_mat = confusion_matrix(y_true=df['Class'], y_pred=df['flag_as_fraud'])
print('Confusion matrix:\n', conf_mat)

Classification report:
               precision    recall  f1-score   support

           0       0.99      1.00      1.00      5000
           1       0.94      0.34      0.50        50

    accuracy                           0.99      5050
   macro avg       0.97      0.67      0.75      5050
weighted avg       0.99      0.99      0.99      5050

Confusion matrix:
 [[4999    1]
 [  33   17]]

Our precision for the fraud class improved from 0.58 to 0.94, which means we did excellent work reducing the number of false positives (from 16 to 1). However, our false negatives increased from 28 to 33, and our recall decreased from 0.44 to 0.34.
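The same trade-off appears when a single threshold is made stricter. The sketch below is an illustration only: the DataFrame toy and the helper count_errors are hypothetical, and the values are invented. It counts false positives and false negatives for the rule V1 < threshold at three cutoffs.

```python
import pandas as pd

# Toy data (invented values): four non-fraud cases and four fraud cases
toy = pd.DataFrame({
    "Class": [0, 0, 0, 0, 1, 1, 1, 1],
    "V1":    [0.5, -1.0, -3.5, -2.5, -2.0, -4.0, -6.0, -8.0],
})

def count_errors(threshold):
    """Return (false_positives, false_negatives) for the rule V1 < threshold."""
    flagged = toy["V1"] < threshold
    actual = toy["Class"] == 1
    fp = int((flagged & ~actual).sum())   # flagged, but not actual fraud
    fn = int((~flagged & actual).sum())   # actual fraud, but not flagged
    return fp, fn

for t in (-2.0, -3.0, -5.0):
    print(t, count_errors(t))
# prints:
# -2.0 (2, 1)
# -3.0 (1, 1)
# -5.0 (0, 2)
```

As the cutoff moves from -2 to -5, false positives fall from 2 to 0 while false negatives rise from 1 to 2, which mirrors what happened when we added more rules.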

Conclusion

The traditional approach is excellent in many circumstances. This approach is easy to understand, which can relieve the anxiety of leaders who need to know what is going on in case there is a problem. Complex algorithms may yield better results, but it is not always clear how they work and what they are doing. With the traditional approach, this is not a problem. However, if high accuracy is needed, sometimes the traditional approach falls short. Which approach to use depends on the context and the needs of the stakeholders.
