(Day 232) Creating a script for the technical part of the KB project

Ivan Ivanov · August 20, 2024

Hello :) Today is Day 232!

A quick summary of today:

  • today I met with my lab mate to prepare our ppt

We got an email from the competition organisers that the deadline for an updated ppt is this Sunday evening. So with the help of our lab professor (Prof. Minju Park) we created a plan for our updated ppt and my part is to create a script and provide detailed info about the technical parts of the project.

Below is the script I have so far

Architecture

image

For each number I have a short description, but it is in Korean.

  • We use Docker to containerize everything so that it can easily be used across different systems
  • Mage is used as an orchestrator that allows the training and real-time pipelines to be developed and run. Once developed, Mage’s interface allows for the pipelines to be run by both technical engineers and non-technical analysts.
  • We use neo4j as our graph database
  • We developed and use models based on a few libraries - PyG, scikit-learn, catboost, and xgboost
  • We use mlflow for model tracking and registry
  • To imitate real-time data, we set up a script that sends transactions through Kafka (a short sketch of this producer follows the list below). Kafka ensures the efficient processing of real-time data
  • For real-time inference, we run a pipeline that receives transaction data from Kafka in real time, loads models saved in mlflow, processes the transactions by adding predictions from the loaded models, and uploads them to neo4j
  • For insights and analysis, we connect neo4j to Grafana to create a real-time dashboard that can update every second, or at an interval chosen by the user
  • Finally, using the models saved in mlflow, we provide a model dictionary that helps to better understand the decisions of the developed models
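For reference, here is a minimal sketch of what that pseudo-streaming producer script could look like. The broker address, topic name, and CSV file name are placeholders for this example, not our exact configuration.

```python
# Minimal sketch of a pseudo-streaming producer (assumed broker, topic, and file names).
import json
import time

import pandas as pd
from kafka import KafkaProducer  # kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v, default=str).encode("utf-8"),
)

# The held-out portion of the Kaggle data, saved as CSV (hypothetical file name).
stream_df = pd.read_csv("stream_transactions.csv")

for _, row in stream_df.iterrows():
    producer.send("transactions", value=row.to_dict())
    time.sleep(1)  # one transaction per second, to imitate real-time data

producer.flush()
```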

Data collection

First, we had to find realistic credit card data to detect fraudulent transactions with. Kaggle is an online platform that provides datasets for data science and machine learning, and allows data scientists around the world to use them to solve various problems. There were many options on Kaggle, but first we needed to understand the specific requirements for detecting fraudulent transactions.

This involves (Seon - used by Revolut, Wise, 2024):

  • Information about the user - personal information, location, and their credit card number
  • Information about the merchant - their name and location
  • Information about the transaction itself - amount, time and date, and the transaction category

Database

The data comes in tabular format from Kaggle, but because fraud detection is inherently a graph problem (Li et al., 2024, Acevedo-Viloria et al., 2021), we wanted to transform it into a graph for better visualisation, analysis, and later model development. We chose the most widely used open-source graph database - Neo4j. Neo4j is made specifically for graph data, so queries are very fast, and as you can see from the picture, we can do ad-hoc analysis of the data as a graph and get quick insights when needed.

image
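To give an idea of how the tabular rows become a graph, here is a rough sketch of inserting a single transaction with the official Python driver. The node labels, relationship types, and credentials are assumptions for the example, not necessarily our exact schema.

```python
# Minimal sketch of writing one transaction into Neo4j (assumed schema and credentials).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MERGE (c:Client {cc_num: $cc_num})
MERGE (m:Merchant {name: $merchant})
CREATE (t:Transaction {trans_num: $trans_num, amt: $amt, category: $category,
                       trans_date_trans_time: $trans_date_trans_time})
MERGE (c)-[:MADE]->(t)
MERGE (t)-[:PAID_TO]->(m)
"""

def insert_transaction(tx_data: dict) -> None:
    # Each call creates the transaction node and links it to its client and merchant.
    with driver.session() as session:
        session.run(CYPHER, **tx_data)

insert_transaction({
    "cc_num": 5289285402893489,
    "merchant": "fraud_Zboncak LLC",
    "trans_num": "cbd1cd064824086c5854ec227a58ea8a",
    "amt": 8.77,
    "category": "food_dining",
    "trans_date_trans_time": "2020-06-21 18:25:53",
})
```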

Data preprocessing

Fraud transaction data tends to be imbalanced, because most of the transactions that happen on a daily basis are not fraudulent. The not-fraud to fraud ratio in our dataset reflects reality: 99.4% of the transactions are not fraud and 0.6% are fraud.

The Kaggle dataset comes with 1,852,393 transactions. We use 70% of the data to create our models and insert into our database. The remaining 30% we save for our pseudo-streaming pipeline, where transactions are sent through our real-time data processor Kafka every second.

image

Of the data used for the models, we split the data 80/20 into train/test.

When performing feature selection, we considered every feature and its implications for AI ethics, so we excluded features based on name, gender, or job.

The selected variables are circled in red. Before using the features, extra features like day of week and hour of day were created, and categorical features were turned into dummies.
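That preprocessing step is just a few pandas calls; a minimal sketch, assuming the Kaggle column names (the file name is hypothetical):

```python
# Minimal sketch of the extra-feature step (assumes the Kaggle column names).
import pandas as pd

df = pd.read_csv("fraud_train.csv")  # hypothetical file name

# Derive day of week and hour of day from the transaction timestamp.
ts = pd.to_datetime(df["trans_date_trans_time"])
df["day_of_week"] = ts.dt.dayofweek
df["hour_of_day"] = ts.dt.hour

# Turn categorical features (e.g. the transaction category) into dummies.
df = pd.get_dummies(df, columns=["category"])
```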

To overcome the imbalance in our data we considered the three most common methods - oversampling the minority class, downsampling the majority class, and a mixture of the two. However, from our model development process, we found that downsampling the majority class fits our models best. This leaves around 1,600 not fraud and 1,600 fraud transactions to train the models on.
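Continuing the sketch above, downsampling the majority class is roughly this (not our exact pipeline code):

```python
# Minimal sketch of downsampling the not-fraud class to the size of the fraud class.
fraud = df[df["is_fraud"] == 1]
not_fraud = df[df["is_fraud"] == 0].sample(n=len(fraud), random_state=42)

# Combine and shuffle into the balanced training set described above.
balanced_df = pd.concat([fraud, not_fraud]).sample(frac=1, random_state=42)
```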

As we will see in a second, our main model is a variation of a Graph Neural Network, and we use lat, long, merch_lat, and merch_long only for that model, because it is more powerful and can learn from those values without preprocessing.

image

Model information

Because fraud detection is inherently a graph problem, we wanted to construct a Graph Neural Network as our main model. A Graph Convolutional Network (GCN) is great for detecting fraud in transactions because it can handle large amounts of data efficiently - this is important because millions of transactions occur every second and we need a system that can handle this volume. In addition, GCNs understand and analyse complex connections between different accounts and transactions. A GCN looks at how transactions are related to each other and can spot suspicious patterns that might be hard to see with other methods. It updates each node’s features by aggregating and transforming features from its neighbours, and repeats this process across multiple layers to capture information from nodes farther away, improving its ability to predict whether a transaction is fraud or not.
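To make that concrete, a two-layer GCN in PyG can be as small as the sketch below; this is a generic example, while our actual model is a variation with its own architecture and hyperparameters.

```python
# Generic two-layer GCN sketch in PyTorch Geometric (not our exact model).
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class FraudGCN(torch.nn.Module):
    def __init__(self, num_features: int, hidden: int = 64):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, 2)  # two classes: fraud / not fraud

    def forward(self, x, edge_index):
        # Each GCNConv layer aggregates and transforms features from a node's neighbours.
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.5, training=self.training)
        return self.conv2(x, edge_index)  # class scores per node (transaction)
```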

We wanted to give extra help to our decision-making system, so we looked into more traditional machine learning models like Logistic Regression, XGBoost, RandomForest, CatBoost and LGBMClassifier. Among those, XGBoost and CatBoost ended up performing the best on our test dataset. We looked into these non-graph models because, while the GCN focuses on relationships between transactions, CatBoost and XGBoost enhance the system by analysing individual transactions with high accuracy. By providing a system with multiple models, we ensure that more types of fraud are effectively detected, and make our fraud detection system more robust and reliable.

For evaluation metrics, we chose Recall (TP / (TP + FN)) as the main one as it measures how well the system identifies all the fraudulent transactions, which is crucial for minimising missed fraud cases.
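For reference, training the two tree-based models and checking recall looks roughly like this; a sketch with default hyperparameters, assuming X_train, X_test, y_train, y_test come from the 80/20 split described earlier.

```python
# Minimal sketch: train the tree-based models and evaluate recall.
from catboost import CatBoostClassifier
from sklearn.metrics import recall_score
from xgboost import XGBClassifier

xgb = XGBClassifier().fit(X_train, y_train)
cat = CatBoostClassifier(verbose=0).fit(X_train, y_train)

for name, model in [("xgboost", xgb), ("catboost", cat)]:
    preds = model.predict(X_test)
    print(name, "recall:", recall_score(y_test, preds))  # TP / (TP + FN)
```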

image

Next, let’s quickly see how easy it is to train a model.

Model training pipeline

I created a video using CapCut. Link to YouTube

If our models are outdated, or if we detect data drift or some other anomaly that signifies our models need to be updated, we can easily do that in Mage - our orchestrator.

As seen in the clip, training a model of choice in our orchestrator takes only a few clicks. This allows both engineers and non-engineers to develop hypotheses, train a model and evaluate their ideas. A person just needs to open Mage, go to pipelines, select the training pipeline of choice and click run.

After training a model, if an analyst wants to see more in-depth information related to feature importance, confusion matrix, etc - they can go to the model dictionary.

Model dictionary

I created a video for this too. Link to YouTube

The model dictionary automatically picks up newly trained models and includes information related to model metrics, confusion matrices, feature importance and SHAP values. CatBoost and XGBoost models have SHAP values on their pages, as SHAP values help explain the predictions of machine learning models by attributing the contribution of each feature to the prediction. As for explaining our Graph Convolutional Network, neural networks are known for being black-box models, and users cannot directly see how they make decisions. However, the graph neural network library that we use (PyG) provides explainability features which, just like for a non-neural-network model, can show us the feature importance.
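For the tree models, computing SHAP values takes only a few lines with the shap library; a sketch assuming the xgb model and X_test from the earlier training sketch (the exact plots on the dictionary pages may differ).

```python
# Minimal sketch of computing SHAP values for a tree-based model.
import shap

explainer = shap.TreeExplainer(xgb)          # works for XGBoost and CatBoost models
shap_values = explainer.shap_values(X_test)  # per-feature contribution to each prediction

shap.summary_plot(shap_values, X_test)       # overall feature-importance view
```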

Real-time pipeline

image

Just like with the training pipelines in our orchestrator - Mage, we can run the real-time pipeline with a click of a button, and it will run indefinitely. We use Kafka which is an industry standard when it comes to real-time data processing.

In the pipeline, models from our model registry are loaded and used for inference. Real-time transaction data comes in, it is put through the pipeline and saved in neo4j.

Below is an example. A transaction with all of its data goes into the pipeline, and comes out with the same transaction data plus predictions whether it is fraud or not fraud according to our models.

Data that goes in: {'trans_date_trans_time': '2020-06-21 18:25:53', 'cc_num': 5289285402893489, 'merchant': 'fraud_Zboncak LLC', 'category': 'food_dining', 'amt': 8.77, 'first': 'Amanda', 'last': 'Adams', 'gender': 'F', 'street': '08580 Jeremy Falls', 'city': 'Bay City', 'state': 'OR', 'zip': 97107, 'lat': 45.5197, 'long': -123.8761, 'city_pop': 1530, 'job': 'Colour technologist', 'dob': '1986-11-24', 'trans_num': 'cbd1cd064824086c5854ec227a58ea8a', 'unix_time': 1371839153, 'merch_lat': 45.544353, 'merch_long': -124.396705}

Data that goes out: {'trans_date_trans_time': '2020-06-21 18:25:53', 'cc_num': 5289285402893489, 'merchant': 'fraud_Zboncak LLC', 'category': 'food_dining', 'amt': 8.77, 'first': 'Amanda', 'last': 'Adams', 'gender': 'F', 'street': '08580 Jeremy Falls', 'city': 'Bay City', 'state': 'OR', 'zip': 97107, 'lat': 45.5197, 'long': -123.8761, 'city_pop': 1530, 'job': 'Colour technologist', 'dob': '1986-11-24', 'trans_num': 'cbd1cd064824086c5854ec227a58ea8a', 'unix_time': 1371839153, 'merch_lat': 45.544353, 'merch_long': -124.396705, 'is_fraud_gcn': 0, 'is_fraud_xgboost': 0, 'is_fraud_catboost': 0}

Thanks to Kafka, we are ensured that real-time transaction data is received on time and in our pipeline we append 3 values: a prediction from a GCN model, an XGBoost model, and a CatBoost model.

The last step in the real-time pipeline is sending the data to our graph database (neo4j). Mage and Kafka allow for real-time transaction processing with minimum lag between transaction occurrence and the transaction being sent to our database.
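Putting the pieces together, a stripped-down version of the real-time pipeline could look like the sketch below. The topic name, registered model names, and Neo4j write query are assumptions for illustration (and the GCN prediction is omitted for brevity); in our project these steps live as blocks inside Mage.

```python
# Minimal sketch of the real-time inference loop (assumed topic, model names, and schema).
import json

import mlflow.pyfunc
import pandas as pd
from kafka import KafkaConsumer
from neo4j import GraphDatabase

# Load models from the mlflow model registry (hypothetical registered names/versions).
xgb_model = mlflow.pyfunc.load_model("models:/fraud_xgboost/1")
cat_model = mlflow.pyfunc.load_model("models:/fraud_catboost/1")

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
)

for message in consumer:
    tx = message.value
    features = pd.DataFrame([tx])  # feature preprocessing omitted for brevity

    # Append one prediction per model, as in the example above.
    tx["is_fraud_xgboost"] = int(xgb_model.predict(features)[0])
    tx["is_fraud_catboost"] = int(cat_model.predict(features)[0])

    # Save the enriched transaction to Neo4j (simplified single-node write).
    with driver.session() as session:
        session.run("CREATE (t:Transaction) SET t += $props", props=tx)
```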

Visualisation

TODO: create a video showing off Grafana

After a transaction has passed through our real-time pipeline, it is stored in neo4j. To better understand our data, and to be able to create dashboards to share with upper management, we use Grafana. Grafana is one of the top dashboard options as it has integrations with many widely used databases, including neo4j. Grafana allows for both real-time and static dashboards, which is perfect for our case. In the video sample, we can see an example dashboard with the latest transaction sample, a geomap showing fraud transactions, and other information that can be useful to an analyst.

Thanks to Grafana’s great accessibility, both engineers and business analysts can go to Grafana at any time, query the database and immediately create their own visualisations and share them with team members or other departments without any previous setup.
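As an example of the kind of query an analyst might put behind a Grafana panel (the property names assume the schema from the pipeline sketch above), here it is run directly with the Python driver:

```python
# Example query: flagged transactions per category (assumed property names).
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

QUERY = """
MATCH (t:Transaction)
WHERE t.is_fraud_xgboost = 1 OR t.is_fraud_catboost = 1
RETURN t.category AS category, count(t) AS flagged
ORDER BY flagged DESC
"""

with driver.session() as session:
    for record in session.run(QUERY):
        print(record["category"], record["flagged"])
```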


Tomorrow we continue!

That is all for today!

See you tomorrow :)