Tag Archives: python

Ensemble Methods for Fraud Detection

Ensemble methods combine multiple algorithms to make predictions. Instead of relying on only random forest or logistic regression, you can use both, and the output of each model counts as a “vote” toward the final prediction. This is one way to combine the strengths of various models to make stronger predictions.

Libraries

Below are the libraries that we are using. We are using three different algorithms for our ensemble (random forest, logistic regression, and decision trees). New here is the VotingClassifier class, which is used to create our ensemble model. The other functions have been used and explained previously.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.model_selection import GridSearchCV
import pandas as pd
import numpy as np
df = pd.read_csv("C:/Users/dthom/Documents/python/fraud/chapter_2/chapter_2/creditcard_sampledata_2.csv")

We will now proceed to the data preparation.

Data Prep

The data preparation is simple. We will separate the independent variables from the dependent variable. The X object holds all of the independent variables, while the y object holds our dependent variable, fraud or no fraud. Once everything is separated, we will create our train and test sets using the train_test_split() function. 70% of our data will be used for training, and 30% will be used for testing.

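# Independent variables: columns 1 through 29
X = df.iloc[:, 1:30]
X = np.array(X).astype('float')
# Dependent variable (fraud or no fraud): column 30
y = df.iloc[:, 30]
y = np.array(y).astype('float')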

# Split your data into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.3, random_state=0)
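
Because fraud cases are rare, a plain random split can leave the test set with a somewhat different fraud rate than the training set. Here is a minimal sketch of how the stratify argument of train_test_split() keeps the class proportions the same in both sets. It assumes a small synthetic dataset (X_demo and y_demo are made up for illustration, not the credit card data), and the split above does not use this option.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data: 1000 rows, 5% labeled as fraud
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 4))
y_demo = np.array([1.0] * 50 + [0.0] * 950)

# stratify=y_demo asks for the same fraud rate in the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(train_rate, test_rate)
```

Both printed rates should sit at or very near 5%, which makes the test metrics easier to compare with what the model saw in training.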

The next step will involve creating our initial ensemble model.

Model Development

We will use three different classifiers in our ensemble model. The classifiers are logistic regression, random forest, and decision tree. In the code below, each classifier is called, and we also set the various parameters of each classifier to appropriate initial values.

# Define the three classifiers to use in the ensemble
clf1 = LogisticRegression(class_weight={0: 1, 1: 15}, max_iter=1000, random_state=5)
clf2 = RandomForestClassifier(class_weight={0: 1, 1: 12}, criterion='gini', max_depth=8,
                              max_features='log2', min_samples_leaf=10, n_estimators=30,
                              n_jobs=-1, random_state=5)
clf3 = DecisionTreeClassifier(random_state=5, class_weight="balanced")
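
The class_weight settings above make mistakes on the fraud class cost more than mistakes on the non-fraud class. As a rough sketch of what class_weight="balanced" works out to, the example below uses hypothetical class counts (950 non-fraud, 50 fraud; not the real data) and compares a hand calculation with scikit-learn's compute_class_weight() helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 950 non-fraud (0) and 50 fraud (1)
y_demo = np.array([0] * 950 + [1] * 50)

# "balanced" weights are n_samples / (n_classes * count_of_each_class)
manual = len(y_demo) / (2 * np.bincount(y_demo))
from_sklearn = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)

print(manual)
print(from_sklearn)
```

With these counts the fraud class gets a weight of 10 and the non-fraud class a weight of about 0.53, so the rarer the class, the larger its weight.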

We will now combine all of our models into a single model using VotingClassifier. Each estimator is passed as a pair: a name in quotes, followed by the classifier object. The “voting” parameter is set to “hard.” With hard voting, each model gets one vote per case and the simple majority wins. For example, if logistic regression and random forest predict fraud but the decision tree does not, the case is classified as fraud by a two-to-one majority.

Once we create our combined model, we also fit our data and make predictions. This will allow us to determine the strength of our model.

# Combine the classifiers in the ensemble model
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)],
                                  voting='hard')  # each model casts one vote per case
ensemble_model.fit(X_train, y_train)
# Hard voting returns class labels only; predict_proba is not available
predicted = ensemble_model.predict(X_test)
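
To make the idea of hard voting concrete, here is a small sketch of the majority rule written by hand with NumPy. The votes array is hypothetical (made-up predictions for five transactions), and this is only an illustration of the rule, not how VotingClassifier is implemented internally.

```python
import numpy as np

# Hypothetical class predictions from three models for five transactions
# (rows = models, columns = transactions; 1 = fraud, 0 = no fraud)
votes = np.array([
    [1, 0, 1, 0, 1],  # logistic regression
    [1, 0, 0, 0, 1],  # random forest
    [0, 0, 1, 1, 1],  # decision tree
])

# With three voters and two classes, a case is fraud when at least two models say so
majority = (votes.sum(axis=0) >= 2).astype(int)
print(majority)
```

The fourth transaction is flagged by the decision tree alone, so it is outvoted and classified as no fraud.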

Next, we will assess the initial results.

Model Assessment

In the code below, we use the classification_report() function and confusion_matrix() to see our results.

print(classification_report(y_test, predicted))
print(confusion_matrix(y_test, predicted))

              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      2099
         1.0       0.89      0.86      0.87        91

    accuracy                           0.99      2190
   macro avg       0.94      0.93      0.93      2190
weighted avg       0.99      0.99      0.99      2190

[[2089   10]
 [  13   78]]
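
The precision, recall, and f1-score reported for the fraud class can be recomputed from the confusion matrix. A short sketch, assuming the usual scikit-learn layout where rows are actual classes and columns are predicted classes:

```python
import numpy as np

# Confusion matrix from the output above: rows = actual, columns = predicted
cm = np.array([[2089, 10],
               [13, 78]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted fraud that really was fraud
recall = tp / (tp + fn)     # share of actual fraud that was caught
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The 10 false positives lower precision and the 13 missed fraud cases lower recall, which is how the report arrives at 0.89, 0.86, and 0.87 for class 1.0.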

The strength of this model depends on its goals and how it compares to other models. For practice, we will modify the model below.

Model Modification

We will not make any changes to the individual models. Instead, we will make some adjustments to the ensemble model. In the code below, we change the voting to “soft,” which means the prediction is based on the averaged class probabilities from the models rather than a majority vote. The weights are set so that the probabilities from the second model (random forest) count four times as much as those from each of the other models. Lastly, the flatten_transform argument only matters under soft voting: it controls the shape of the array returned by transform() and does not change the predictions. Below is the code.

# Define the ensemble model with soft voting and per-model weights
ensemble_model = VotingClassifier(estimators=[('lr', clf1), ('rf', clf2), ('dt', clf3)],
                                  voting='soft',
                                  weights=[1, 4, 1],
                                  flatten_transform=True)

We will now fit the model to our data and make predictions with it.

ensemble_model.fit(X_train, y_train)
predicted = ensemble_model.predict(X_test)  # class labels; predict_proba is also available with soft voting
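
As a worked example of soft voting with weights of [1, 4, 1], the sketch below computes the weighted average by hand for one transaction. The three probabilities are hypothetical and chosen so that the weighted result differs from what a simple majority vote would give.

```python
import numpy as np

# Hypothetical fraud probabilities from each model for one transaction
proba_fraud = np.array([0.40, 0.70, 0.00])  # lr, rf, dt
weights = np.array([1, 4, 1])

# Soft voting averages the class probabilities, weighted per model
weighted_fraud = np.average(proba_fraud, weights=weights)

# With two classes, the predicted class is fraud when its averaged probability exceeds 0.5
predicted_class = int(weighted_fraud > 0.5)
print(weighted_fraud, predicted_class)
```

Here only the random forest leans toward fraud, so a hard vote would say no fraud. Because its probability carries four of the six weight units, the weighted average is (0.40 + 4 × 0.70 + 0.00) / 6 ≈ 0.53 and the case is classified as fraud.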

Next, we assess the model.

Model Assessment

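The model is mostly the same, with an improvement in precision for the fraud class from 0.89 to 0.94. In other words, false positives were reduced, from 10 to 5, while recall stayed at 0.86 because the same 13 fraud cases are still missed.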

print(classification_report(y_test, predicted))
print(confusion_matrix(y_test, predicted))

              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      2099
         1.0       0.94      0.86      0.90        91

    accuracy                           0.99      2190
   macro avg       0.97      0.93      0.95      2190
weighted avg       0.99      0.99      0.99      2190

[[2094    5]
 [  13   78]]
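
Since soft voting produces probabilities, the roc_auc_score() function imported earlier could also be used to assess the ensemble. The sketch below is only an illustration under stated assumptions: it runs on synthetic data from make_classification() rather than the credit card file, and the classifiers use simplified settings rather than the tuned parameters above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the credit card data (about 5% positives)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

soft_model = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000, random_state=5)),
        ("rf", RandomForestClassifier(n_estimators=30, random_state=5)),
        ("dt", DecisionTreeClassifier(random_state=5)),
    ],
    voting="soft",
    weights=[1, 4, 1],
)
soft_model.fit(X_tr, y_tr)

# Column 1 of predict_proba holds the weighted-average probability of class 1
fraud_proba = soft_model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, fraud_proba)
print(auc)
```

Unlike the classification report, the AUC does not depend on a single cutoff, so it can complement the precision and recall figures when comparing ensembles.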

Conclusion

In this post, we saw how models can work together to make stronger, more robust predictions. Ensemble methods are a powerful way to improve fraud detection, and you now know ways to modify the model.

T-Test

ANOVA

Pairwise Comparison No Adjustment

Pairwise Comparison with Adjustment

Conclusion
