(Day 135) Going deeper into MLOps

Ivan Ivanov · May 15, 2024

Hello :) Today is Day 135!

A quick summary of today:

  • covered module 2 of the mlops-zoomcamp by DataTalks.Club, about experiment tracking and model management
  • cut 2 more videos for the Scottish dataset project

Firstly, about using mlflow for MLOps

Maybe this is because I am starting to learn about MLOps for the 1st time and I don’t know other tools, but WOW mlflow is amazing. Below are my notes from the module 2 lectures. Full code on my github repo.

First, some important concepts

  • ML experiment: the process of building an ML model
  • experiment run: each trial in an ML experiment
  • run artifact: any file that is associated with an ML run
  • experiment metadata: metadata associated with the experiment (name, tags, etc.)

What’s experiment tracking?

  • the process of keeping track of all the relevant info from an ML experiment (could include source code, environment, data, model, hyperparams, metrics, other - these can vary)

Why is experiment tracking important?

  • reproducibility
  • organization
  • optimization

Why can’t we just use an Excel spreadsheet?

  • error prone
  • no standard format
  • visibility and collaboration

MLflow enters the stage


We can spin up mlflow with: mlflow ui --backend-store-uri sqlite:///mlflow.db

And the sqlite db will keep the run metadata inside of it (artifacts go to the local filesystem by default). To create tracking, we log runs from our training code.

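Roughly, the pattern looks like this. This is a minimal sketch with stand-in data; the experiment name, tag values, and Lasso model are my own placeholders, not the exact course code:

    import mlflow
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # point MLflow at the sqlite backend and pick an experiment
    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    mlflow.set_experiment("my-experiment")

    # stand-in data instead of the course's real dataset
    X, y = make_regression(n_samples=1_000, noise=10.0, random_state=42)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

    with mlflow.start_run():
        mlflow.set_tag("developer", "ivan")

        alpha = 0.01
        mlflow.log_param("alpha", alpha)

        model = Lasso(alpha=alpha)
        model.fit(X_train, y_train)

        rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
        mlflow.log_metric("rmse", rmse)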

With the above we can log a run, its params, metrics, and more info. The run then shows up in the UI.

If we run another model with alpha=0.1, we can compare the two runs in the UI.

Next, doing hyperparam optimization using xgboost and hyperopt: we need to define an objective function and a search space, and then run the optimization.
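A condensed sketch of that flow, reusing the train/validation split from the earlier snippet; the search-space bounds here are illustrative, not necessarily the ones from the course:

    import mlflow
    import xgboost as xgb
    from hyperopt import STATUS_OK, Trials, fmin, hp, tpe
    from hyperopt.pyll import scope
    from sklearn.metrics import mean_squared_error

    # reuse X_train / X_val / y_train / y_val from the earlier snippet
    train = xgb.DMatrix(X_train, label=y_train)
    valid = xgb.DMatrix(X_val, label=y_val)

    def objective(params):
        # one hyperopt trial == one mlflow run
        with mlflow.start_run():
            mlflow.set_tag("model", "xgboost")
            mlflow.log_params(params)
            booster = xgb.train(
                params=params,
                dtrain=train,
                num_boost_round=100,
                evals=[(valid, "validation")],
                early_stopping_rounds=10,
            )
            rmse = mean_squared_error(y_val, booster.predict(valid)) ** 0.5
            mlflow.log_metric("rmse", rmse)
        return {"loss": rmse, "status": STATUS_OK}

    search_space = {
        "max_depth": scope.int(hp.quniform("max_depth", 4, 100, 1)),
        "learning_rate": hp.loguniform("learning_rate", -3, 0),
        "min_child_weight": hp.loguniform("min_child_weight", -1, 3),
        "objective": "reg:squarederror",
        "seed": 42,
    }

    best = fmin(fn=objective, space=search_space, algo=tpe.suggest,
                max_evals=50, trials=Trials())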

After this ran (for around 30 mins), in MLflow's UI we can visualise all the runs that use the xgboost model.

It also shows which hyperparam combinations lead to which results.

For example, in a scatter plot of rmse against min_child_weight we can see somewhat of a pattern: smaller min_child_weight tends to result in smaller rmse.

We can also view a contour plot that shows two hyperparams and their combined effect on the rmse.

How do we go about model selection?

A naive way would be to choose the model with the lowest rmse. After clicking on it, we can check more info about it and make further judgements based on run duration, data used, etc.

Instead of writing out line by line what to log, we can use autolog and re-run the model with the best params from the hyperparam search.

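A small sketch of enabling autologging; the best_params values below are placeholders, not the actual best combination from my search:

    best_params = {
        "max_depth": 30,  # placeholder values
        "learning_rate": 0.09,
        "min_child_weight": 1.06,
        "objective": "reg:squarederror",
        "seed": 42,
    }

    # params, metrics and the model are then captured without explicit log calls
    mlflow.xgboost.autolog()

    with mlflow.start_run():
        booster = xgb.train(
            params=best_params,
            dtrain=train,
            num_boost_round=100,
            evals=[(valid, "validation")],
            early_stopping_rounds=10,
        )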

With autolog we get a lot more info: the model metrics are all logged, and the 'Artifacts' tab has much more content as well, like model info, dependencies, and instructions on how to run the model.

Next, I learned about model management


What’s wrong with using a folder system?

  • error prone
  • no versioning
  • no model lineage

To begin with, we can use mlflow.log_artifact to save the model file into the run's artifacts tab.
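A sketch of that, assuming model is the trained model from earlier; the file and folder names are placeholders:

    import os
    import pickle

    os.makedirs("models", exist_ok=True)

    with mlflow.start_run():
        # save the trained model locally, then attach the file to the run
        with open("models/lin_reg.bin", "wb") as f_out:
            pickle.dump(model, f_out)
        mlflow.log_artifact(local_path="models/lin_reg.bin",
                            artifact_path="models_pickle")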

And when we run the code again, the bin file shows up under the run's artifacts.

Instead of just getting the raw bin file, we can get better logging by storing the model in MLflow's own model format.
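A one-line sketch; the artifact path name is a placeholder:

    with mlflow.start_run():
        # log the booster in MLflow's model format
        # (adds the MLmodel file, environment info and a loader)
        mlflow.xgboost.log_model(booster, artifact_path="models_mlflow")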

Which results in a proper model entry in the artifacts tab, with the model format, its dependencies, and snippets showing how to load it.

We can also log the DictVectorizer used in the data preprocessing step, and get it from mlflow

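A sketch of that, assuming dv is the fitted DictVectorizer; the file names are placeholders:

    import pickle

    with mlflow.start_run():
        # dv: the fitted DictVectorizer from preprocessing
        with open("models/preprocessor.b", "wb") as f_out:
            pickle.dump(dv, f_out)
        mlflow.log_artifact("models/preprocessor.b", artifact_path="preprocessor")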

For using the saved model for predictions, mlflow provides sample code in the artifacts tab once a model is logged.

If we use the sample pandas/pyfunc code to load the model, we can check out the loaded model. And since it is an xgboost model, we can also load it with mlflow's xgboost flavor and then use it as a normal Booster.
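A sketch of both ways of loading; RUN_ID and the artifact path are placeholders:

    RUN_ID = "..."  # placeholder: the run id from the tracking UI
    logged_model = f"runs:/{RUN_ID}/models_mlflow"

    # generic python_function flavor: can predict on a pandas DataFrame
    pyfunc_model = mlflow.pyfunc.load_model(logged_model)

    # native xgboost flavor: gives back a regular Booster
    xgb_model = mlflow.xgboost.load_model(logged_model)
    y_pred = xgb_model.predict(valid)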

Next, model registry

Imagine a scenario where a DS sends me a developed model and asks me to deploy it. My task is to take it and run it in prod. But before I do, I should ask: what is different from what is currently running, what is the purpose, and which dependencies (and versions) does the model's environment need?

Such an email is usually not very informative, so I might start looking through my emails for the previous prod version. If I am lucky, someone who was in my place before me, or a DS, has put the info about the current prod version into an experiment tracking tool (like mlflow).


All of the model runs above live in the tracking server. The model registry, in contrast, lists the models that are in different stages, e.g. production and archive. To do actual deployment, we still need some CI/CD. When analysing which model is best for production, we can consider model metrics, size, run time (complexity), and the data used. To promote a model to the model registry, we can just click the 'Register model' button in a run's artifacts tab.

We can then go to the Models tab in mlflow and see the registered models (I ran and registered a 2nd model).

In the newest versions of mlflow, the old staging/production/archived stages are deprecated in favour of something more flexible, so I assigned custom 'champion' and 'challenger' aliases instead.

Now the deployment engineer (assuming we agree on the meaning of the aliases) can decide what to do. All of the above actions can be done through code as well, no need to use the UI (I learned some of the functions from the mlflow client docs and used them in the github notebook). A sketch of the same workflow in code is below.
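This is a minimal sketch using MlflowClient; the registered model name, run id, and alias choices are my placeholders:

    import mlflow
    from mlflow import MlflowClient

    mlflow.set_tracking_uri("sqlite:///mlflow.db")
    client = MlflowClient()

    RUN_ID = "..."  # placeholder: the run that logged the model

    # register the logged model as a new version under one registered name
    result = mlflow.register_model(
        model_uri=f"runs:/{RUN_ID}/models_mlflow",
        name="my-regressor",
    )

    # assign an alias instead of the deprecated stages
    client.set_registered_model_alias(
        name="my-regressor", alias="champion", version=result.version
    )

    # later, load the model by its alias
    champion = mlflow.pyfunc.load_model("models:/my-regressor@champion")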

Next, I learned about mlflow in practice

The module walks through three scenarios: (1) a single data scientist working on a competition model, (2) a cross-functional team with one data scientist, and (3) multiple data scientists working on multiple models.

1 - no need to use mlflow, saving locally is fine

2 - running mlflow locally would be enough. Using the model registry would be a good idea to manage the lifecycle of the models, but it is not clear whether we can just run it locally or whether we need a host.

3 - MLflow is needed, with a remote tracking server. The model registry is also important because different people work on different tasks. (A sketch of pointing a client at such a server is below.)


For the 3rd scenario I wanted to follow along and create an AWS instance, but I don’t have an account and for some reason my card keeps getting rejected. So I guess I will have to do it some other time 😕
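Even without the instance, the client side is just a different tracking URI; a minimal sketch with a placeholder hostname:

    import mlflow

    # placeholder hostname; would be the EC2 instance running the tracking server
    mlflow.set_tracking_uri("http://ec2-XX-XX-XX-XX.compute.amazonaws.com:5000")
    print(mlflow.search_experiments())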


There is also a website that shows a comparison between mlflow and paid/free alternatives.

Secondly, about the Scottish dataset

Today I finished the last 2 videos from my collaborator


And now the total audio length in our dataset, across all 10 videos, is 770 seconds. I need to find a better way to store our audio files and the dataset itself; it all lives on my Google Drive at the moment.

I also added some other metadata for each clip


I also checked the summary statistics (describe) of the length_seconds column.

Tomorrow and Friday are AWS Summit Seoul 2024, and there are a lot of panels each day (2 days in total). I will share my experience and the panels I attended.

That is all for today!

See you tomorrow :)