(Day 204) Transaction data EDA + MLflow & MinIO Docker setup

Ivan Ivanov · July 23, 2024

Hello :) Today is Day 204!

A quick summary of today:

  • some EDA on the KB AI competition data
  • setting up MLflow and MinIO

First up: some basic cleaning and EDA on my part of the data for the Kookmin Bank project.

These are the variables assigned to me: trans_date_trans_time, cc_num, merchant, category, amt, first, last, gender, street, city, state

image

No missing/null values.
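For reference, here is a minimal pandas sketch of that check (the CSV file name is a placeholder, and the is_fraud label column used in the later sketches is an assumption about the shared dataset):

```python
import pandas as pd

# Placeholder file name; load only the columns assigned to me plus the fraud label.
cols = ["trans_date_trans_time", "cc_num", "merchant", "category", "amt",
        "first", "last", "gender", "street", "city", "state", "is_fraud"]
df = pd.read_csv("transactions.csv", usecols=cols)

print(df.isna().sum())   # per-column null counts (all zeros here)
print(df.dtypes)         # quick sanity check of the parsed types
```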

Box plot of log(amount) in Not Fraud vs Fraud

image

Distribution of Not Fraud vs Fraud

image
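A rough sketch of how the two plots above can be reproduced with pandas/matplotlib (again assuming the df and is_fraud column from the sketch above; the plots in this post may have been made differently):

```python
import numpy as np
import matplotlib.pyplot as plt

# Box plot of log(amount), split into Not Fraud (0) vs Fraud (1).
df["log_amt"] = np.log1p(df["amt"])              # log1p sidesteps log(0)
df.boxplot(column="log_amt", by="is_fraud")
plt.suptitle("log(amount) by fraud label")

# Bar chart of the Not Fraud vs Fraud counts.
plt.figure()
df["is_fraud"].value_counts().plot(kind="bar")
plt.title("Not Fraud vs Fraud")
plt.show()
```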

Other graphs

image image image image image image

Interesting: the state of Delaware has only Fraud transactions (of course this is not real data, but interesting nonetheless).
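One way to surface that kind of per-state pattern is a quick group-by on the same df (a sketch, not necessarily the exact code behind the plot):

```python
# Fraud rate per state; a rate of 1.0 means every transaction in that state is labelled fraud.
state_fraud = (
    df.groupby("state")["is_fraud"]
      .agg(transactions="count", fraud_rate="mean")
      .sort_values("fraud_rate", ascending=False)
)
print(state_fraud.head(10))   # Delaware (DE) shows up at the top with fraud_rate == 1.0
```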

We should also do some basic transformation and cleanup before the raw data goes to the DB. In my columns, an example case is that all merchant names start with fraud_, so removing that prefix would be fine, just for a bit more clarity.
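A quick pandas sketch of that cleanup step (the merchant name in the comment is just an illustration):

```python
# Strip the literal "fraud_" prefix from every merchant name,
# e.g. "fraud_Some Store" -> "Some Store".
df["merchant"] = df["merchant"].str.replace(r"^fraud_", "", regex=True)
```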

On another note: MLflow and MinIO.

I found this website (in Korean but it can be translated) that provides an easy “plug-n-play” Dockerfile and docker-compose services code for setting up MLflow, with MinIO as the artifact store and a backend store. I have never used MinIO before, but from a quick online search (before using it) it seemed to offer a UI similar to any cloud provider’s storage, and it is S3-API compatible, so it seems that it is scalable too.

Following the guide, I set up the Dockerfile and the following services: mlflow-backend-store

image

mlflow-artifact-store

image

mlflow-server

image

I ran it and it works fine. This was the first time I saw the MinIO UI.

image

It is empty for now, but it reminds me of AWS S3 and GCS.

From the setup, there is already a bucket created as well:

image
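To double-check the wiring from the client side, a run can be logged against this setup from Python. A minimal sketch, assuming the MLflow server is on localhost:5000 and MinIO on localhost:9000, with placeholder credentials (all of these should be replaced with the values from the docker-compose services):

```python
import os
import mlflow

# Point the client at the MLflow server and the MinIO (S3-compatible) artifact store.
# Endpoint, port, and credentials below are placeholder assumptions, not values from the guide.
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://localhost:9000"
os.environ["AWS_ACCESS_KEY_ID"] = "minio_user"
os.environ["AWS_SECRET_ACCESS_KEY"] = "minio_password"
mlflow.set_tracking_uri("http://localhost:5000")

with mlflow.start_run(run_name="smoke-test"):
    mlflow.log_param("demo_param", 1)
    mlflow.log_metric("demo_metric", 0.5)
    # Any artifact logged here should land in the bucket that the setup created in MinIO.
    with open("hello.txt", "w") as f:
        f.write("hello mlflow + minio")
    mlflow.log_artifact("hello.txt")
```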

Another thing, related to Kafka: for some reason the setup I had on this project was using a schema-registry image which was ~1.5 GB and taking up a lot of my Docker disk space, and it stopped working, so I replaced the services in my docker-compose file with the same ones from my transaction-stream-data-pipeline project. At the moment my Docker containers are:

image

That is all for today!

See you tomorrow :)