(Day 134) Finished CS109 + Scottish dataset project + Started MLOps zoomcamp by DataTalks club

Ivan Ivanov · May 14, 2024

Hello :) Today is Day 134!

A quick summary of today:

  • covered the last 2 lectures of Stanford’s CS109
  • processed 4 more videos for the Scottish accent dataset
  • started MLOps zoomcamp by DataTalks club
  • watched a nice video comparing the roles of data scientist vs AI engineer

Firstly, on lecture 27: advanced probability and lecture 28: future of probability from Stanford’s CS109

What a course! Sad it is over. Chris Piech - what an extraordinary professor!

Both lectures were more or less an overview of the course and professor Piech’s hopes for the future of probability and his hopes for his students.

Lecture 27

image

Around 2013, autograd was introduced for the first time and this allowed for backprop to be automated, instead of manual calculations. This was one of the reasons for deep learning to explode.

The professor also talked a bit about diffusion models

image

#1 remove the noise

#2 predict where the noise is

And a bit about language models (at the time of recording ChatGPT had just come out). And he introduced a basic version of RNN/LSTM.

image

Lecture 28

Besides an overview of what was covered throughout the previous 27 lectures, he talked about potential usage of AI in education and how it can help provide feedback to students, and in medicine.

Secondly, about the Scottish dataset project

Today I cut videos ‘Scottish phrases 5-8’ (4 videos). And now the total time of the audio clips we have in the dataset is 10 minutes :party: We are improving slowly.

Thirdly, started MLOps zoomcamp by DataTalks club

Last night was the opening day for the MLOps camp that I signed up for. It is one (if not the) most popular open courses for MLOps, so I was excited to start it. Well today I covered module 1 and its homework. Below are notes from the intro lecture + module 1 + the homework.

Intro lecture

The course will cover the following topics

image image

Module 1

I think the coolest thing was learning about the different MLOps maturity levels (introduced by Microsoft)

Level 0: no MLOps

  • no automation
  • all code in jupyter
  • good for experiments

Level 1: DevOps, No MLOps

there are experienced devs helping the data scientists some automation releases are automeated unit & integration tests CI/CD ops metrics no experiment tracking no reproducability data scientists separated from engineers

Level 2: Automated training

  • training pipeline
  • experiment tracking
  • model registry
  • low friction deployment
  • data scientists work with engineers
  • good level if we have 2-3 ML cases

Level 3: Automated deployment

  • easy to deploy models
  • pipeline is: data prep, model traning, model deployment
  • A/B tests between models
  • some model monitoring

Level 4: Full MLOps automation

  • automatic training
  • automatic retraining
  • automatic deployment
  • A/B tests
  • approaching a zero-downtime system

Not all orgs need to be on level 4. Level 3 is still fine because we can have a human making that final decision whether a model goes live. So we need to judge what level is best for a particular project.

The homework

I uploaded all my work to my github.

The homework followed the module 1 materia on creating a linear regression model that predicts the taxi travel time from point A to point B. It involves some data preprocessing, model creation and exporting the model with pickle.

Q1: loading the data and how many columns there are

image

Q2: What’s the standard deviation of the trips duration in January?

image

Q3: What fraction of the records left after you dropped the outliers?

image

Q4: What’s the dimensionality of the matrix (number of columns) after using DictVectorizer?

image

Q5: What’s the RMSE on train?

image

We can interpret this as being 7.6 minutes wrong on average in our ride duration predictions.

Q6: What’s the RMSE on validation?

After loading the val dataset (data from February 2023, train is January 2023), I got the result

image

Overall, pretty satisfied, and excited for the future modules and homeworks.

Finally, the comparison between a data scientist and an AI engineer

This IBM video was in my recommended on youtube so I clicked on it to see. And the below pic is the final explanation.

Some abbreviations: FM: foundation model, FE: feature eng, CV: cross-val, HPT: hyperparam tuning, PEFT: param efficiant finetuning. Of course the below is not static, DS might work on prescriptive cases, and AI eng can work with structured data.

image

That is all for today!

See you tomorrow :)