Hello :) Today is Day 204!
A quick summary of today:
- some EDA on the KB AI competition data
- setting up MLflow and MinIO
First, some basic cleaning and EDA on my part of the data for the KB Kookmin Bank project.
These are the variables assigned to me: trans_date_trans_time, cc_num, merchant, category, amt, first, last, gender, street, city, state
No missing/null values.
Box plot of log(amount) in Not Fraud vs Fraud
Distribution of Not Fraud vs Fraud
Other graphs
Interesting ~ in the state of Delaware there are only Fraud transactions (of course this is not real data, but interesting nonetheless)
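Most of this was just pandas + matplotlib. Here is roughly what the null check and the log-amount box plot look like ~ just a sketch, with an assumed file name, and assuming the fraud label column is called is_fraud (it is not one of my assigned columns):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Assumed file name; the label column "is_fraud" is also an assumption,
# since it is not one of my assigned columns.
df = pd.read_csv("transactions.csv")

# My assigned columns ~ confirm there are no missing/null values
cols = ["trans_date_trans_time", "cc_num", "merchant", "category", "amt",
        "first", "last", "gender", "street", "city", "state"]
print(df[cols].isna().sum())

# Box plot of log(amount) for Not Fraud vs Fraud (log1p handles small amounts)
df["log_amt"] = np.log1p(df["amt"])
df.boxplot(column="log_amt", by="is_fraud")
plt.suptitle("")  # drop the automatic "Boxplot grouped by ..." title
plt.title("log(amt): Not Fraud vs Fraud")
plt.show()
```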
We should also do some basic transformation and cleanup before the raw data goes to the DB. In my columns, one example case is that every merchant name starts with fraud_, so stripping that prefix would be fine, just for a bit more clarity (a quick sketch below).
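Something along these lines ~ again just a sketch, with an assumed file name:

```python
import pandas as pd

df = pd.read_csv("transactions.csv")  # assumed file name

# Every merchant name starts with "fraud_" in the raw data; strip the prefix
df["merchant"] = df["merchant"].str.removeprefix("fraud_")  # pandas >= 1.4

# Parse the timestamp column while we are at it
df["trans_date_trans_time"] = pd.to_datetime(df["trans_date_trans_time"])
```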
On another note ~ MLflow and MinIO
I found this website (in Korean, but it can be translated) that provides an easy “plug-n-play” Dockerfile and docker-compose services for setting up MLflow, with a backend store and MinIO as the artifact store. I had never used MinIO before, but from a quick online search (before using it) it looked like a UI similar to any cloud provider’s storage; it is compatible with the AWS S3 API and seems to scale well too.
Following the guide I set up the Dockerfile and these services:
- mlflow-backend-store
- mlflow-artifact-store
- mlflow-server
I ran it and it works fine; a rough sketch of what such a compose file looks like is below.
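The service names match what I set up, but the images, ports, credentials, and the choice of PostgreSQL as the backend store here are illustrative, not a copy of the guide's file:

```yaml
services:
  mlflow-backend-store:        # DB that stores runs, params, metrics
    image: postgres:14
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
      POSTGRES_DB: mlflow

  mlflow-artifact-store:       # MinIO, S3-compatible storage for artifacts
    image: minio/minio
    ports:
      - "9000:9000"            # S3 API
      - "9001:9001"            # web console
    environment:
      MINIO_ROOT_USER: minio
      MINIO_ROOT_PASSWORD: miniostorage
    command: server /data --console-address ":9001"

  mlflow-server:
    build: .                   # Dockerfile that pip-installs mlflow, boto3, psycopg2-binary
    ports:
      - "5000:5000"
    environment:
      AWS_ACCESS_KEY_ID: minio
      AWS_SECRET_ACCESS_KEY: miniostorage
      MLFLOW_S3_ENDPOINT_URL: http://mlflow-artifact-store:9000
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:mlflow@mlflow-backend-store/mlflow
      --default-artifact-root s3://mlflow/
      --host 0.0.0.0
    depends_on:
      - mlflow-backend-store
      - mlflow-artifact-store
```

The split is the usual MLflow one: the backend store (a database) keeps run metadata, while the artifact store (MinIO here) keeps the actual files like models and plots. I assume the pre-created bucket mentioned below comes from a small init step in the guide's compose file, e.g. a minio/mc container running `mc mb`.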
This was the first time I saw the MinIO UI ~ it is empty for now, but it reminds me of AWS S3 and GCS.
From the setup, a bucket has already been created as well:
Another thing, related to Kafka: for some reason the setup I had on this project used a schema-registry image that was ~1.5 GB, taking up a lot of my Docker disk space, and it stopped working, so I replaced the services in my docker-compose file with the same ones from my transaction-stream-data-pipeline project. At the moment my Docker containers are:
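For reference, a schema-registry-free, single-broker Kafka service in KRaft mode can look something like this ~ just a sketch, the image tag and settings are illustrative rather than a copy of my compose file:

```yaml
services:
  kafka:
    image: bitnami/kafka:3.7   # single image, no ZooKeeper, no schema-registry
    container_name: kafka
    ports:
      - "9092:9092"
    environment:
      # KRaft mode: this one node acts as both broker and controller
      - KAFKA_CFG_NODE_ID=0
      - KAFKA_CFG_PROCESS_ROLES=controller,broker
      - KAFKA_CFG_CONTROLLER_QUORUM_VOTERS=0@kafka:9093
      - KAFKA_CFG_LISTENERS=PLAINTEXT://:9092,CONTROLLER://:9093
      - KAFKA_CFG_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092
      - KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
      - KAFKA_CFG_CONTROLLER_LISTENER_NAMES=CONTROLLER
```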
That is all for today!
See you tomorrow :)