Random Forest Model Modification for Fraud Detection

In this post, we will build a fraud detection model and then modify it. Most, if not all, machine learning algorithms have parameters (often called hyperparameters) that can be adjusted, and adjusting them can potentially improve the performance of the model. Each algorithm has its own set of tunable parameters. For our purposes, we will be using the random forest algorithm.

Libraries

Below are the libraries we will use in this post.


from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

RandomForestClassifier() is the class used to create an instance of the random forest algorithm. The train_test_split() function will be used for splitting our data into training and test sets. The confusion_matrix(), classification_report(), and roc_auc_score() functions will be used for assessing our model’s performance. Pandas and numpy are for data preparation. Lastly, matplotlib is imported for plotting, for example to draw a ROC curve alongside the roc_auc_score() value, although this post only reports the numeric score.

Data Preparation

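Below is the data preparation. In this code, we separate the independent variables from the dependent variable. Because iloc is zero-based and excludes the end of a slice, 1:30 selects the columns at positions 1 through 29, and these are used to predict the column at position 30. That last column tells us if the example is fraudulent or not.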

# df is assumed to be a pandas DataFrame that has already been loaded
X = df.iloc[:, 1:30]
X = np.array(X).astype('float')
y = df.iloc[:, 30]
y = np.array(y).astype('float')

For the X object above, we pull the columns at positions 1 through 29. Then we convert the X object to a float array in the next line. We repeat this process for the y object, but we only pull the column at position 30.
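The slicing behavior described above can be checked on a small stand-in frame. This is only a sketch: the toy DataFrame and its column names c0 to c30 are hypothetical and are not the data used in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with 31 columns named c0..c30 (not the post's data)
toy = pd.DataFrame(np.arange(62).reshape(2, 31),
                   columns=[f"c{i}" for i in range(31)])

# iloc is zero-based and the end of the slice is excluded,
# so 1:30 keeps positions 1 through 29
X_toy = toy.iloc[:, 1:30]

# A single integer selects exactly one column, here position 30
y_toy = toy.iloc[:, 30]

print(X_toy.columns[0], X_toy.columns[-1], X_toy.shape)
print(y_toy.name)
```

Printing the first and last column names is a quick way to confirm that the predictors do not accidentally include the target column.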

In the code below, we are now splitting our X and y objects into training and testing data. The training data teaches the algorithm, and you then assess your model by using the testing data.

# Split your data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)

We create four objects to the left of the equal sign, two each for the X object and the y object. To the right of the equal sign, we use the train_test_split() function to divide the X and y objects. The argument test_size tells Python what proportion of the data should be used for the test data. For our example, 30% of the data is set aside for testing purposes. Setting random_state makes the split reproducible.
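With a rare class such as fraud, a purely random split can leave the training and test sets with somewhat different fraud rates. One option, which the code above does not use, is the stratify argument of train_test_split(). The sketch below uses made-up data (1000 rows, 5% positives) to show the idea; the names X_demo and y_demo are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 1000 rows, 5% in the positive class
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1] * 50 + [0] * 950)

# stratify=y_demo asks for the same class proportions in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)   # sizes of the 70% and 30% parts
print(y_tr.mean(), y_te.mean()) # share of positives in each part
```

Whether stratifying is worthwhile depends on how small the rare class is; with only a handful of fraud cases it can make the test metrics noticeably more stable.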

Model Development

In the code below, we are going to create our initial model using random forest.

# Define the model with balanced subsample
model = RandomForestClassifier(class_weight='balanced_subsample', random_state=5)

# Fit your training model to your training set
model.fit(X_train, y_train)

Above, we create an object called “model” that contains an instance of the random forest algorithm. Inside the call, we set the class_weight argument to 'balanced_subsample'. Balancing the class weights is common in fraud detection because the classes are imbalanced, as fraud is highly uncommon. Balanced weights are inversely proportional to the class frequencies, so a misclassified fraud case is penalized more heavily than a misclassified non-fraud case and both classes carry similar total weight. Remember that roughly 95% of our data is not fraudulent, so labeling everything as non-fraud would already be about 95% accurate without the use of a model. The 'balanced_subsample' option differs from 'balanced' in that the weights are computed from the bootstrap sample used to grow each tree rather than from the full training data.
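To make the idea of balanced weights concrete, the sketch below computes them with scikit-learn's compute_class_weight() helper and again by hand. The class counts (2099 non-fraud, 91 fraud) are borrowed from the test-set support reported later in this post purely as an illustration; the model itself computes its weights from the training data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative class counts only: 2099 non-fraud and 91 fraud examples
y_demo = np.array([0] * 2099 + [1] * 91)
classes = np.array([0, 1])

# 'balanced' weights as computed by scikit-learn
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)

# The same formula written out: n_samples / (n_classes * count_per_class)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

For these counts the weights come out to roughly 0.52 for non-fraud and 12.03 for fraud, so an error on a fraud case counts about as much as 23 errors on non-fraud cases.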

After addressing the imbalance, we use the fitted model to obtain the predicted class for each test example along with the predicted probability of each class. These probabilities will be needed to calculate the ROC AUC score.

# Obtain the predicted values and probabilities from the model 
predicted = model.predict(X_test)
probs = model.predict_proba(X_test)
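For readers unsure how predict() and predict_proba() relate, here is a minimal sketch on synthetic data. The data from make_classification() and the names demo_model, demo_probs, and demo_pred are assumptions for illustration, not part of this post's dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical synthetic data with roughly 5% positives
X_demo, y_demo = make_classification(n_samples=500, n_features=6,
                                     weights=[0.95, 0.05], random_state=0)

demo_model = RandomForestClassifier(n_estimators=25, random_state=5)
demo_model.fit(X_demo, y_demo)

demo_probs = demo_model.predict_proba(X_demo)  # one column per class
demo_pred = demo_model.predict(X_demo)         # one label per row

# Columns follow the order of demo_model.classes_, so column 1 is class 1
print(demo_model.classes_, demo_probs.shape)

# The predicted label is the class with the highest probability in each row
same = np.array_equal(demo_pred,
                      demo_model.classes_[demo_probs.argmax(axis=1)])
print(same)
```

This is why the assessment code passes probs[:,1], the probability of the fraud class, to roc_auc_score() rather than the hard predictions.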

Next, we will assess the original model.

Model Assessment

The assessment code in this section reports several metrics. roc_auc_score() calculates the area under the ROC curve, which plots sensitivity (true positive rate) against 1 - specificity (false positive rate) across classification thresholds. The score ranges from 0 to 1, with 0.5 corresponding to random guessing, and the closer the value is to 1, the better. We also print the output of classification_report() (precision, recall, f1-score, and accuracy) and the confusion_matrix(), which creates a crosstab of actual versus predicted classes.
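As a brief aside before the assessment code, the sketch below shows how the ROC AUC relates to the ROC curve itself. The labels and scores are made up for illustration; roc_curve() and auc() are standard scikit-learn functions, but this post's own code only calls roc_auc_score().

```python
import numpy as np
from sklearn.metrics import roc_curve, auc, roc_auc_score

# Hypothetical true labels and predicted scores (not the post's data)
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90])

# One (false positive rate, true positive rate) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, scores)

# Area under those points; this matches roc_auc_score on the same inputs
area = auc(fpr, tpr)
print(area, roc_auc_score(y_true, scores))
```

The fpr and tpr arrays are what one would pass to matplotlib's plt.plot(fpr, tpr) to draw the curve.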

# Print the roc_auc_score, the classification report and confusion matrix
print(roc_auc_score(y_test, probs[:,1]))
print(classification_report(y_test, predicted))
print(confusion_matrix(y_test, predicted))

0.9604599783256286
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2099
         1.0       0.99      0.81      0.89        91

    accuracy                           0.99      2190
   macro avg       0.99      0.91      0.94      2190
weighted avg       0.99      0.99      0.99      2190

[[2098    1]
 [  17   74]]

The ROC AUC is 0.96, which indicates a strong model as the value is close to 1. For the fraud class, precision (0.99) is much stronger than recall (0.81), which means the model is better at avoiding false positives (1) than it is at avoiding false negatives (17). The F1-score is the harmonic mean of precision and recall. Also note that model accuracy is 99%, which is expected in fraud detection because the large non-fraud class dominates, so accuracy alone says little about how well fraud is caught.
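The fraud-class numbers in the report can be reproduced by hand from the confusion matrix printed above, which may help in reading the two outputs together. The sketch below only re-does that arithmetic; the baseline line is an added comparison showing the accuracy of labeling every example as non-fraud.

```python
import numpy as np

# Confusion matrix from the first model: rows are actual, columns predicted
cm = np.array([[2098, 1],
               [17, 74]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted frauds, how many were fraud
recall = tp / (tp + fn)      # of the actual frauds, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

# Accuracy of a "model" that always predicts non-fraud
baseline = (tn + fp) / cm.sum()

print(round(float(precision), 2), round(float(recall), 2), round(float(f1), 2))
print(round(float(accuracy), 4), round(float(baseline), 4))
```

The rounded precision, recall, and F1 agree with the 0.99, 0.81, and 0.89 in the report, and the baseline of about 0.96 shows why 99% accuracy is less impressive than it first appears.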

Model Adjustment

The initial model looks rather good, but there is always a question as to whether we can improve the model. In the code below, we make the following modifications to our model.

bootstrap=True: Each tree is built on a sample of the training data drawn with replacement, so each tree is not developed from identical data. This is also the default setting.
class_weight: Previously, the weights were balanced automatically. The new setting manually assigns a weight of 1 to class 0 and a weight of 12 to class 1, which tells the RandomForestClassifier model to penalize misclassifications of class 1 twelve times more heavily than misclassifications of class 0.
criterion='entropy': Entropy is a measure of the impurity of each node and is used to evaluate candidate splits. The less mixture within a node (fraud and non-fraud), the lower the entropy and the higher the purity.
max_depth: How deep each tree is allowed to go. If this is not set, a tree keeps splitting until its leaves are pure or another stopping rule, such as min_samples_leaf, is reached.
min_samples_leaf: The minimum number of examples that must remain in each leaf. A split is only considered if it leaves at least this many examples in both branches.
n_estimators: The number of trees to develop.
n_jobs: The number of CPU cores used in parallel; -1 uses all available cores.
random_state: Sets the seed so that results are reproducible.
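To give a feel for the entropy criterion in the list above, here is a small sketch that computes the entropy of a single node from its class counts. The function name node_entropy and the example counts are hypothetical; scikit-learn does this internally when evaluating splits.

```python
import math

def node_entropy(n_fraud, n_legit):
    """Entropy (in bits) of a node holding the given class counts."""
    total = n_fraud + n_legit
    entropy = 0.0
    for count in (n_fraud, n_legit):
        if count > 0:  # an empty class contributes nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

# A perfectly mixed node, a mostly pure node, and a pure node
print(node_entropy(50, 50))
print(node_entropy(10, 90))
print(node_entropy(0, 100))
```

An evenly mixed node has the maximum entropy of 1 bit, and a node containing only one class has an entropy of 0, which is why lower entropy means higher purity.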

In the model code for this section, only the model definition changes. The fitting and prediction steps are a repeat of before.

# Change the model options
model = RandomForestClassifier(
    bootstrap=True,
    class_weight={0: 1, 1: 12},
    criterion='entropy',
    # Change depth of model
    max_depth=10,
    # Change the number of samples in leaf nodes
    min_samples_leaf=10,
    # Change the number of trees to use
    n_estimators=20,
    n_jobs=-1,
    random_state=5,
)

# Fit your training model to your training set
model.fit(X_train, y_train)

# Obtain the predicted values and probabilities from the model 
predicted = model.predict(X_test)
probs = model.predict_proba(X_test)

We will now assess this model.

2nd Assessment

Below is the code for the second assessment of the model. This code is the same as before.

# Print the roc_auc_score, the classification report and confusion matrix
print(roc_auc_score(y_test, probs[:,1]))
print(classification_report(y_test, predicted))
print(confusion_matrix(y_test, predicted))

0.9575150909119465
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2099
         1.0       0.94      0.84      0.88        91

    accuracy                           0.99      2190
   macro avg       0.97      0.92      0.94      2190
weighted avg       0.99      0.99      0.99      2190

[[2094    5]
 [  15   76]]

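The adjusted model is not a clear improvement. The ROC AUC dropped slightly, from 0.960 to 0.958, and precision for the fraud class fell from 0.99 to 0.94, while recall rose from 0.81 to 0.84. In other words, we were able to decrease the number of false negatives (from 17 to 15) by increasing the number of false positives (from 1 to 5). Whether this is better depends on the context and on deciding if false negatives or false positives are more detrimental.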
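One way to reason about that trade-off is to attach a cost to each type of error and compare the two confusion matrices. The sketch below does this with made-up costs; the function total_cost and the cost values are hypothetical and would need to come from the business context.

```python
import numpy as np

# Confusion matrices from the two assessments above
first = np.array([[2098, 1], [17, 74]])
second = np.array([[2094, 5], [15, 76]])

def total_cost(cm, cost_fp, cost_fn):
    """Total cost of errors given a cost per false positive and false negative."""
    tn, fp, fn, tp = cm.ravel()
    return int(fp * cost_fp + fn * cost_fn)

# Hypothetical cost settings: (cost per false positive, cost per false negative)
for cost_fp, cost_fn in [(1, 1), (1, 10)]:
    print(cost_fp, cost_fn,
          total_cost(first, cost_fp, cost_fn),
          total_cost(second, cost_fp, cost_fn))
```

With equal costs the first model comes out ahead (18 versus 20), while with a missed fraud costing ten times a false alarm the second model is cheaper (171 versus 155). Since the second model trades 4 extra false positives for 2 fewer false negatives, it wins whenever a false negative costs more than twice a false positive.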

Conclusion

In this post, we learned not only how to create a model and assess it, but also how to modify the model in hopes of improving it. Adjusting parameters does not guarantee a better model, but comparing the results before and after each change shows which trade-offs you are making when detecting fraud.