Hello :) Today is Day 160!
A quick summary of today:
- simple data engineering pipeline with prefect
- tons of trouble learning about orchestration with mage.ai
After yesterday’s journey with prefect, the YouTube algorithm recommended me another prefect tutorial - this time on creating data pipelines. So I decided to give it a go.
What is data engineering?
- data scientists can do data engineering, but usually in specific cases where the two roles cannot, or do not need to, be separate
- data engineers build databases, build lots of data pipelines, and manage infrastructure (they also care about cost and security)
What are data pipelines?
- ETL(ELT)/batch pipelines that move data from A to B
- databases, APIs, files
- streaming pipelines - as data comes in, we consume that data and send it wherever it needs to go
- message queues, polled data
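To make the batch idea concrete, here is a tiny ETL sketch (the function names and file paths are made up, not from the tutorial):

```python
import pandas as pd

def extract(path: str) -> pd.DataFrame:
    # read raw data from a file (could also be an API or a database)
    return pd.read_csv(path)

def transform(df: pd.DataFrame) -> pd.DataFrame:
    # drop incomplete rows and normalise column names
    df = df.dropna()
    df.columns = [c.lower().strip() for c in df.columns]
    return df

def load(df: pd.DataFrame, out_path: str) -> None:
    # write the cleaned data to its destination (here just a parquet file)
    df.to_parquet(out_path, index=False)

if __name__ == "__main__":
    load(transform(extract("raw_data.csv")), "clean_data.parquet")
```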
The main github repo used is here.
After some basic setup, we run ‘pipeline’ in the terminal, which runs the main.py file:
In prefect we get a flow run
and logged outputs from the petstore url
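For reference, here is a minimal sketch of what a main.py like that could look like - I am assuming the public Swagger petstore endpoint here; the actual code in the repo will differ:

```python
import requests
from prefect import flow, task, get_run_logger

@task
def fetch_pets(status: str = "available") -> list:
    # call the petstore API and return the JSON payload
    url = "https://petstore.swagger.io/v2/pet/findByStatus"
    response = requests.get(url, params={"status": status}, timeout=30)
    response.raise_for_status()
    return response.json()

@flow
def pipeline():
    logger = get_run_logger()
    pets = fetch_pets()
    # log a small summary so it shows up in the prefect UI
    logger.info("Fetched %d pets from the petstore API", len(pets))

if __name__ == "__main__":
    pipeline()
```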
Now for a flow with a few more tasks~
1st task: retrieve from API
2nd task: clean data
3rd task: insert to postgres db
final flow:
Success!
We can also see the data loaded into postgres:
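A rough sketch of how those three tasks can fit together in prefect (the connection string, table name and columns are placeholders, not the ones from the video):

```python
import pandas as pd
import requests
from prefect import flow, task
from sqlalchemy import create_engine

@task(retries=3)
def fetch_data(url: str) -> list:
    # 1st task: retrieve raw records from the API
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

@task
def clean_data(records: list) -> pd.DataFrame:
    # 2nd task: turn the records into a dataframe and drop incomplete rows
    df = pd.DataFrame(records)
    return df.dropna(subset=["id", "name"])

@task
def load_to_postgres(df: pd.DataFrame, table: str) -> None:
    # 3rd task: append the cleaned rows into a postgres table
    engine = create_engine("postgresql://user:password@localhost:5432/petstore")
    df.to_sql(table, engine, if_exists="append", index=False)

@flow
def etl_flow():
    raw = fetch_data("https://petstore.swagger.io/v2/pet/findByStatus?status=available")
    cleaned = clean_data(raw)
    load_to_postgres(cleaned, "pets")

if __name__ == "__main__":
    etl_flow()
```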
Beege (the teacher) also showed how to write simple tests for tasks and how to avoid a common prefect error that appears when task functions are executed outside a prefect flow.
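I don’t remember the exact test from the video, but the usual trick is to call the task’s .fn() attribute, which runs the undecorated function without needing a flow context. A sketch using the clean_data task from the snippet above (assuming it lives in main.py):

```python
# test_tasks.py - run with pytest
from main import clean_data  # hypothetical import, adjust to your module layout

def test_clean_data_drops_incomplete_rows():
    records = [
        {"id": 1, "name": "doggie"},
        {"id": None, "name": "ghost pet"},  # should be dropped
    ]
    # .fn() calls the plain python function behind the @task decorator,
    # so we avoid the "task run outside of a flow" error in the test
    df = clean_data.fn(records)
    assert len(df) == 1
    assert df.iloc[0]["name"] == "doggie"
```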
Prefect is amazing, but the MLOps zoomcamp course works with Mage.ai this year, and after giving up on it last time due to errors I could not fix, I decided to give it another go today. And… omg, so many errors and problems - I almost gave up on it several times. The videos were recorded 2 weeks ago, and yet so much is different in my UI (and other students’ UIs) that it is mind-boggling.
There was not much content, but the constant errors… I ended up investing upwards of 12 hours of bug fixing for about 30 minutes of video content.
Orchestration with Mage.ai from MLOps zoomcamp module 3
The content of the intro to orchestration is:
I have to do 3.5 tomorrow (hopefully). I just want to say a huge thank you to the QnA bot on the MLOps zoomcamp slack channel, which at least pointed me in the right direction to solve my issues. I am a bit sad that there are so many issues (almost every video), because it will drive people away from this amazing course. Below I will just share my successes - if you are really curious about my problems, they are all on the datatalks slack channel haha. I started from the beginning.
3.1 Data preparation
I created the above pipeline, which reads NY taxi data, does a bit of preprocessing, and then outputs X, X_train, X_val, y, y_train, y_val, dv. By the way, each block in a Mage pipeline is a separate piece of code.
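Roughly, the preprocessing boils down to something like this - a sketch from memory of the zoomcamp-style preprocessing, not the exact Mage block (the column names assume the green taxi dataset, and the train/val split here is a simple random split):

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split

def preprocess(df: pd.DataFrame):
    # trip duration in minutes, keep only trips between 1 and 60 minutes
    df["duration"] = (df.lpep_dropoff_datetime - df.lpep_pickup_datetime).dt.total_seconds() / 60
    df = df[(df.duration >= 1) & (df.duration <= 60)].copy()

    # pickup/dropoff location IDs are treated as categorical features
    categorical = ["PULocationID", "DOLocationID"]
    df[categorical] = df[categorical].astype(str)

    # one-hot encode the categorical features with a DictVectorizer
    dv = DictVectorizer()
    X = dv.fit_transform(df[categorical].to_dict(orient="records"))
    y = df["duration"].values

    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
    return X, X_train, X_val, y, y_train, y_val, dv
```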
3.2 Training
First, I created a pipeline to train a linear regression and a lasso model. It takes the data from the 3.1 pipeline above, loads the models, does a hyperparameter search, and finally trains the two models.
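In essence, the training block does something like this - just the idea, since the hyperparameter search in the video was more involved (I am assuming the X_train/X_val/y_train/y_val outputs from 3.1):

```python
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error

def train_linear_models(X_train, y_train, X_val, y_val):
    results = {}

    # plain linear regression as a baseline
    lr = LinearRegression()
    lr.fit(X_train, y_train)
    results["linear_regression"] = mean_squared_error(y_val, lr.predict(X_val)) ** 0.5

    # small search over the lasso regularisation strength
    best_rmse, best_model = float("inf"), None
    for alpha in [0.0001, 0.001, 0.01, 0.1, 1.0]:
        lasso = Lasso(alpha=alpha)
        lasso.fit(X_train, y_train)
        rmse = mean_squared_error(y_val, lasso.predict(X_val)) ** 0.5
        if rmse < best_rmse:
            best_rmse, best_model = rmse, lasso
    results["lasso"] = best_rmse

    return lr, best_model, results
```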
Second, I had to create a pipeline for an xgboost model.
The last (pink) bit is connected to creating visualisations.
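For the xgboost pipeline, the core training step looks roughly like this (the params below are made-up defaults, not the ones from the course’s hyperparameter search):

```python
import xgboost as xgb

def train_xgboost(X_train, y_train, X_val, y_val):
    # xgboost works on its own DMatrix format
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)

    params = {
        "learning_rate": 0.1,
        "max_depth": 6,
        "objective": "reg:squarederror",
        "seed": 42,
    }
    booster = xgb.train(
        params=params,
        dtrain=train,
        num_boost_round=100,
        evals=[(valid, "validation")],
        early_stopping_rounds=10,
    )
    return booster
```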
3.3 Observability
We can create any kind of bar/flow/line/custom charts, and the 3 above show SHAP values from the xgboost model.
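Outside of Mage, computing those SHAP values yourself would look roughly like this (a sketch, assuming the booster and dv from the training pipeline; a sparse X_val needs to be densified first):

```python
import shap

# TreeExplainer works directly on a trained xgboost booster
explainer = shap.TreeExplainer(booster)

# SHAP values per feature for the validation set
X_dense = X_val.toarray()
shap_values = explainer.shap_values(X_dense)

# summary plot, similar in spirit to the Mage charts
shap.summary_plot(shap_values, X_dense, feature_names=dv.get_feature_names_out())
```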
3.4 Triggering
Here I created an automatic trigger to train the xgboost model when new data is detected.
I also created this predict pipeline:
There is a nice way to set up a basic interface for inference.
We can also set up an API and do inference through that as well:
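Calling that API from outside is then just an HTTP request. A sketch from memory - the trigger URL and the payload format are placeholders, use whatever Mage shows you when you create the API trigger for the predict pipeline:

```python
import requests

# placeholder: paste the URL Mage gives you for the pipeline's API trigger
TRIGGER_URL = "http://localhost:6789/api/pipeline_schedules/<id>/pipeline_runs/<token>"

# hypothetical runtime variables - whatever the predict pipeline expects as input
payload = {
    "pipeline_run": {
        "variables": {
            "PULocationID": "43",
            "DOLocationID": "151",
            "trip_distance": 3.5,
        }
    }
}

response = requests.post(TRIGGER_URL, json=payload, timeout=30)
print(response.status_code, response.json())
```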
I have very strong feelings towards Mage.ai, but for now I will roll with it because the course itself is great. Otherwise, I will try to use prefect for some side project (I have to think about that a bit).
That is all for today!
See you tomorrow :)