Hello :) Today is Day 170!
A quick summary of today:
- decided to give mage.ai another go and now use it alongside GCP
Well… I obviously did not suffer enough with the countless problems I experienced when first learning about mage in DataTalksClub’s MLOps zoomcamp, so I decided to do a cool-looking data engineering project [youtube]. The caveat is that I will use GCP, and I just hope I do not incur any major costs. I checked, and I have 21 days left on plenty of free credits.
Getting to the project
It uses the infamous NYC taxi dataset.
Using Lucid, I learned a bit about dimensional data modelling.
Then, using Python, I did some basic preprocessing on the raw data to split it into the 8 tables from that model. I actually put everything so far on my GitHub, and plan on writing a nice README once everything is finished. Even though I started it after work today, I did not finish it because of, once again, mage problems.
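To give an idea of what the preprocessing looks like, here is a minimal sketch of carving one dimension and the fact table out of the raw data with pandas. The file name is a placeholder, column names follow the public NYC yellow taxi schema, and my notebook builds all 8 tables, not just these two:

```python
import pandas as pd

# File name is a placeholder; columns follow the public NYC yellow taxi schema.
df = pd.read_csv(
    'yellow_tripdata_2021-01.csv',
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
)
df = df.drop_duplicates().reset_index(drop=True)
df['trip_id'] = df.index

# One example dimension: the pickup datetime broken into useful parts.
datetime_dim = df[['tpep_pickup_datetime']].drop_duplicates().reset_index(drop=True)
datetime_dim['pickup_hour'] = datetime_dim['tpep_pickup_datetime'].dt.hour
datetime_dim['pickup_weekday'] = datetime_dim['tpep_pickup_datetime'].dt.weekday
datetime_dim['datetime_id'] = datetime_dim.index

# The fact table keeps the measures plus a foreign key into each dimension.
fact_table = df.merge(
    datetime_dim[['tpep_pickup_datetime', 'datetime_id']],
    on='tpep_pickup_datetime',
)[['trip_id', 'datetime_id', 'passenger_count',
   'trip_distance', 'fare_amount', 'total_amount']]
```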
Before I get to the mage problems, a bit about GCP.
I set up a Cloud Storage bucket (similar to S3 in AWS) and uploaded the raw data.
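You can do the upload through the console UI, or with a few lines of the google-cloud-storage client; the bucket and file names below are made up for illustration:

```python
from google.cloud import storage  # pip install google-cloud-storage

# Bucket and file names are placeholders; credentials come from the environment
# (e.g. GOOGLE_APPLICATION_CREDENTIALS pointing at a service-account key).
client = storage.Client()
bucket = client.bucket('my-taxi-project-raw-data')
blob = bucket.blob('raw/yellow_tripdata_2021-01.csv')
blob.upload_from_filename('yellow_tripdata_2021-01.csv')
```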
Then I set up a VM with the right access permissions, which gave me a nice SSH-in-browser to install Python and run mage. There are a lot of IP addresses on screen, so I am afraid to share pictures today.
Using that VM, I installed Python, pip, and mage, and started a project.
I added a block to import the data, and a transformer block that basically follows the Jupyter notebook in my repo. However, whenever I run the transformer block that produces the 8 tables above, the kernel restarts and never seems to come back. I restarted mage through the terminal, and the same thing happened.
Turns out mage bugs out when I convert the 8 datasets to dicts; when I just return the 8 DataFrames as a tuple, it is all fine.
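For anyone hitting the same thing, here is a minimal sketch of the transformer shape that works for me. Only two of the 8 tables are shown, and the decorator boilerplate is mage's standard block template:

```python
import pandas as pd

if 'transformer' not in globals():
    from mage_ai.data_preparation.decorators import transformer


@transformer
def transform(df: pd.DataFrame, *args, **kwargs):
    # Build the tables exactly as in the notebook (only two shown here).
    datetime_dim = df[['tpep_pickup_datetime']].drop_duplicates().reset_index(drop=True)
    datetime_dim['datetime_id'] = datetime_dim.index

    fact_table = df.merge(datetime_dim, on='tpep_pickup_datetime')

    # Returning plain DataFrames in a tuple works; wrapping them in dicts
    # (e.g. datetime_dim.to_dict()) is what kept killing the kernel for me.
    return datetime_dim, fact_table
```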
Then I created a user and credentials so I can connect to BigQuery and push data to it.
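The exporter block is roughly mage's stock BigQuery template (exact import paths vary a bit between mage versions); the credentials live in the project's io_config.yaml, and the table ID below is a placeholder:

```python
from os import path

from pandas import DataFrame

from mage_ai.io.bigquery import BigQuery
from mage_ai.io.config import ConfigFileLoader
from mage_ai.settings.repo import get_repo_path

if 'data_exporter' not in globals():
    from mage_ai.data_preparation.decorators import data_exporter


@data_exporter
def export_to_bigquery(df: DataFrame, **kwargs) -> None:
    # The service-account key is referenced from io_config.yaml;
    # 'your-project.your_dataset.your_table' is a placeholder.
    table_id = 'your-project.your_dataset.your_table'
    config_path = path.join(get_repo_path(), 'io_config.yaml')

    BigQuery.with_config(ConfigFileLoader(config_path, 'default')).export(
        df,
        table_id,
        if_exists='replace',  # overwrite on re-runs
    )
```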
The final mage pipeline looks like this: load the raw data from Cloud Storage → transform it into the 8 tables → export each one to BigQuery.
And ~ yay~~~ it works! This took a bit of debugging as well, due to uninstalled packages and putting access keys and secrets in the right place and format.
FYI, this is the project stack:
- Python (pandas) for the preprocessing
- Google Cloud Storage for the raw data
- a Compute Engine VM running mage for orchestration
- BigQuery as the data warehouse
- Looker for the dashboard
So for the final step ~ using Looker for some kind of visualisation dashboard.
This is the final dashboard (unsurprisingly, Looker is very similar to Power BI). Looks awesome. Can’t wait to apply what I learned to data of my choosing.
Lastly, I deleted everything from GCP, but I know what I need next time. ^^
Just a quick mention - today I kept reviewing the intro I wrote yesterday, and added a table which I can share.
I have shared summaries/notes for all of these papers over the last few weeks, and I was taught that tables like that (and even more detailed ones) are a good idea in a paper.
That is all for today!
See you tomorrow :)