Hello :) Today is Day 359!
A quick summary of today:
- applying analytical patterns day 2
- streamed - looking back at books/courses I read/did over the past year for this blog
- covered a 1hr course on MLOps on databricks by Maria Vechtomova
Applying analytical patterns day 2 (from Zach Wilson’s bootcamp)
My notes:
Lab code:
-- applying analytical patterns lab 2
-- find how many of the people that go to the signup page end up signing up
with deduped_events as (
select
user_id,
url,
event_time,
date(event_time) as event_date
from events
where user_id is not null
group by user_id, url, event_time, date(event_time)
order by user_id, event_time
),
selfjoined as (
select
d1.user_id,
d1.url,
d2.url as destination_url,
d1.event_time,
d2.event_time
from deduped_events d1
join deduped_events d2
on d1.user_id = d2.user_id
and d1.event_date = d2.event_date
-- count the visit if it happens on the same day
and d2.event_time > d1.event_time
),
userlevel as (
select
user_id,
url,
count(1) as number_of_hits,
sum(case when destination_url = '/api/v1/user' then 1 else 0 end) as converted
from selfjoined
group by user_id, url
)
select
url,
sum(number_of_hits) as num_hits,
sum(converted) as num_converted,
cast(sum(converted) as real) / count(number_of_hits) as pct_converted
from userlevel
group by url
having sum(number_of_hits) > 500; -- filter random url hits
-- look at website events and what type of device is the most common to hist the website
create table device_hits_dashboard as
with events_augmented as (
select coalesce(d.os_type, '(overall)') as os_type,
coalesce(d.device_type, '(overall)') as device_type,
coalesce(d.browser_type, '(overall)') as browser_type,
url,
user_id
from events e
join devices d on e.device_id = d.device_id
)
select
case
when grouping(os_type) = 0
and grouping(device_type) = 0
and grouping(browser_type) = 0
then 'os_type__device_type__browser'
when grouping(browser_type) = 0 then 'browser_type'
when grouping(device_type) = 0 then 'device_type'
when grouping(os_type) = 0 then 'os_type'
end as aggregation_level,
coalesce(os_type, '(overall)') as os_type,
coalesce(device_type, '(overall)') as device_type,
coalesce(browser_type, '(overall)') as browser_type,
count(1) as number_of_hits
from events_augmented
group by grouping sets (
(browser_type, device_type, os_type),
(browser_type),
(os_type),
(device_type)
)
order by count(1) desc;
-- example usage (it's pre-aggregated)
select * from device_hits_dashboard
where aggregation_level = 'os_type__device_type__browser';
Streamed
I went over 358 days worth of posts and collected a list of books and courses
I will update it with the books/courses I read/do until the end of day 366 and maybe add books I will definitely read after the blog ends. I plan to post each part on linkedin on 2nd of Jan, or maybe do like a top 5/10 of books/courses.
## Book
Mathematics for Machine learning
Deep Learning by MIT
Build a LLM from scratch
An Introduction to Statistical Learning
Foundations and Linear Models part from Probabilistic Machine Learning: An Introduction
What are embeddings? (Vicki Boykis)
Dive into Deep Learning
Grokking Machine Learning
Outlier Detection in Python
A Primer For The Mathematics Of Financial Engineering
Design a Machine Learning System (From Scratch)
Multivariate Statistical Methods: A Primer
Effective Data Science Infrastructure
MLOps blog: https://ml-ops.org/
Fundamentals of Data Engineering
Introducing MLOps
Deep Learning at Scale
Machine Learning Algorithms in Depth
Databricks Certified Associate Developer for Apache Spark Using Python
Streaming Databases
The Little Book of Deep Learning
The Elements of Statistical Learning
Deep Learning (Bishop and Bishop)
The Matrix Calculus You Need For Deep Learning (arxiv paper)
Graph Algorithms for Data Science
Before Machine Learning book series
The 100 page ML book
MLE (Andriy Burkov)
Financial Data Engineering
Machine Learning for Financial Risk Management with Python
Optimization Algorithms
Financial calculus (Baxter and Rennie)
Data Analysis with Python and PySpark
LLM Engineer's Handbook
scikit-learn User Guide
Designing Machine Learning Systems
Designing Data-Intensive Applications
Data Quality Fundamentals
Apache Iceberg: The Definitive Guide
Learning Spark 2nd ed.
Data Pipelines Pocket Reference
Stream Processing with Apache Flink
The 100 page LM book
The Little Book of ML Metrics
Effective Machine Learning Teams
Data Engineering with dbt
The AI Engineer's Guide to Surviving the EU AI Act
## Course
caltech - ML CS 156 by Prof. Abu-Mostafa: https://youtu.be/mbyG85GZ0PI?list=PLD63A284B7615313A
Statistics with Python in Codecademy: https://www.codecademy.com/learn/paths/master-statistics-with-python
Microsoft Beginner ML course: https://github.com/microsoft/ML-For-Beginners
ML course on Kaggle Learn: https://www.kaggle.com/learn
Machine Learning specialization by Andrew Ng: https://www.coursera.org/specializations/machine-learning-introduction
DeepLearning.AI TensorFlow Certfication Specialization course: https://www.coursera.org/learn/introduction-tensorflow?specialization=tensorflow-in-practice
Andrej Karpathy's NN series: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&ab_channel=AndrejKarpathy
DL Specialization on Coursera: https://www.coursera.org/specializations/deep-learning
NLP Specialization on Coursera: https://www.coursera.org/specializations/natural-language-processing
TensorFlow: Data and Deployment Specialization: https://www.coursera.org/specializations/tensorflow-data-and-deployment
Tensorflow Advanced techniques specialization: https://www.coursera.org/specializations/tensorflow-advanced-techniques
IBM's DL with PyTorch: https://www.coursera.org/learn/deep-neural-networks-with-pytorch
CS231n Winter 2016: https://youtu.be/NfnWJUyUJYU?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC
Aladdin Persson's YT videos: https://www.youtube.com/@AladdinPersson/playlists
AI 504 by Prof. Choi: https://www.youtube.com/watch?v=TG3OJQjN-l4&list=PLLENHvsRRLjAmAjc8mV0f9C6i8Gh308SS&ab_channel=EdwardChoi
GAN specialization on Coursera: https://www.coursera.org/specializations/generative-adversarial-networks-gans
AI503 Math for AI from KAIST: https://alinlab.kaist.ac.kr/ai503_2021.html
Started Stanford University's cs231n: https://cs231n.github.io/
Becoming a backprop ninja by Andrej Karpathy (deserves a separate mention): https://youtu.be/q8SA3rM6ckI
Stanford's CS224N: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/
11-711 Advanced NLP by CMU: https://phontron.com/class/anlp2024/
ACL 2023 talk on retrieval-based LMs: https://acl2023-retrieval-lm.github.io/
Local Retrieval Augmented Generation (RAG) from Scratch: https://youtu.be/qN_2fnOPY-M
XCS224W ML with Graphs by Stanford University: https://online.stanford.edu/courses/xcs224w-machine-learning-graphs
GCP's Digital Leader Learning Path: https://www.cloudskillsboost.google/paths/9
Open Source Models with Hugging Face: https://www.deeplearning.ai/short-courses/open-source-models-hugging-face/
AI for Global Goals OxML Summer School series: https://www.oxfordml.school/
Stanford's CS246: Mining Massive Datasets: https://web.stanford.edu/class/cs246/
Stanford's CS109 with Prof. Chris Piech: https://www.youtube.com/playlist?list=PLoROMvodv4rOpr_A7B9SriE_iZmkanvUg
MLOps Zoomcamp by DataTalksClub: https://github.com/DataTalksClub/mlops-zoomcamp
LLM Zoomcamp by DataTalksClub: https://github.com/DataTalksClub/llm-zoomcamp
Darshil Parmar's DE projects: https://www.youtube.com/@DarshilParmar/playlists
Data Eng Zoomcamp by DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp
Stock Market Analysis Zoomcamp: https://github.com/DataTalksClub/stock-markets-analytics-zoomcamp
EvidentlyAI's observability course: https://www.evidentlyai.com/ml-observability-course
Intro to LangGraph by LangChain: https://academy.langchain.com/courses/intro-to-langgraph
Neo4j's free certification courses: https://graphacademy.neo4j.com/categories/
DeepLearning.AI Data Engineering Professional Certificate
The Complete Mathematics of Neural Networks and Deep Learning: https://youtu.be/Ixl3nykKG9M
Credit Risk Modelling in Python: https://www.udemy.com/course/credit-risk-modeling-in-python/
dbt fundamentals: https://learn.getdbt.com/courses/dbt-fundamentals
scikit-learn's MOOC: https://inria.github.io/scikit-learn-mooc/
Zach Wilson's free DE boot camp: https://techcreator.io/
Free streaming courses on Confluent: https://training.confluent.io/channeldetail/apache-kafka-fundamentals-and-accreditation
Apache Iceberg/Lakehouse course on Dremio University: https://university.dremio.com/
Airflow courses by Astronomer: https://academy.astronomer.io/
MadeWithML's MLOps course: https://madewithml.com/
## People
Andrew Ng
Andrej Karpathy
Andriy Burkov
Aladdin Persson
Alexey Grigorev
Vincent D. Warmerdam (probabl)
Zach Wilson
MLOps course on LinkedIn learning by Maria Vechtomova
I saw that Maria Vechtomova posted a ‘free’ course on MLOps on linkedin learning so I signed up for a 30-day free trial on the website and decided to check it out ~
Learning objectives:
- explain main components and principles required to deploy machine learning models to production on Databricks
- identify how to use experiment tracking system, model registry, feature engineering, model/feature serving, and other features required to deploy ML applications
- articulate how to package your Python code using best practices and deploy your project using Databricks Asset Bundles.
- review how to monitor your ML applications
1. MLOps components and principles
Interesting trend:
DS became popular and soon after MLOps. Not surprising, I guess. Companies wanted to build models and then soon after they realised they need a more consistent way of developing, delivering, and maintaining them.
The minimal set of tools for MLOps:
- version control
- CI/CD
- orchestration
- model registry
- container registry
- compute
- serving
- monitoring
- data version control
- other (feature store, labeling tool, exp. tracking, responsible AI, vector db, llm evaluation, llm frameworks, model hub)
There are many different options for each of the above
- reuse the tools available in your organization
- only add tools if needed
- the more tools you have, the more complex migrations you’ll have to go through
Principles
Real business impact is involved if ML model predictions are unavailable or of bad quality. The goal of MLOps is to minimise the risk of things going wrong
Four main principles:
- documentation (business goals and KPIs, data definition, ML system architecture, the choice of ML models)
- code quality (review processes, code quality checks, unit and integration tests)
- traceability and reproducability (for a model this is the ease of finding: corresponding code and commits on git, the infrastructure used for training and serving, ML model artifacts, what data wasa used to train the model; reproducability: the same model with the same data and code can be generated, rollback is easy)
- monitoring and alerting (infrastructure health and costs, application health, response time and response codes, model eval and business metrics, data and model drift)
The first two are related to people and processes, while the last two are related to tools.
Databricks for MLOps
It offers:
- orchestration
- model registry
- compute
- serving
- vector search
- exp. tracking
- prompt engineering
- feature engineering
- monitoring
2. MLflow
Offered a managed version by databricks
I could not follow the code exactly as I do not have a proper databricks subscription. I will create one for her course at the end of Jan.
But that did not stop me from running the code locally with a local mlflow server 😆.
(I cloned the repo and started running files locally). But I faced some issues with pyspark as the env was done all with uv and I kept getting RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead.
I guess I could try to do the whole thing locally ~ I just continued watching at this point. She even says that some parts don’t work well in vs code so it’s better to run them in databricks ~ 😆 So as most of the time ~ I will focus on the ideas rather than the exact implementation/code
The code for all of it is on github
3. Feature engineering
Databricks feature engineering concepts
FeatureTable
: a table stored as a Delta table in Unity catalog that contains ML model featuresFeatureFunction
: a class that specifies a Python UDF in Unity catalogFeatureLookup
: a class that specifies how to look up features in the feature tableTrainingSet
: a class that defines how to construct a train set using a feature table, FeatureFunction, and FeatureLookupFeatureSpec
: feature specification defined as a list containing instances of FeatureLookup and FeatureFunction
Feature engineering client functionality
- create, update, and delete feature tables and feature specifications
- create a training dataset
- log an MLflow model packaged with information on how a training set was built
4. Feature and Model serving
When to use feature serving
- predictions are precomputed and need to be accessed at the moment of request
- a simple function that does not require a model artifact must be computed at the moment of request
When to use model serving
- there is a model artifact
- optionally, some features must be retrieved at the moment of request
5. Deploy apps using Databricks Asset Bundles
Ways to deploy a databricks job:
- terraform
- databricks APIs
- databricks asset bundles (DABs) (recommended way as it manages dependencies and files required to be manually handles if using the first two options)
DAB is deployed by defining a pipeline in a yaml file where we essentially define a job, its schedule, tags, where to run it (compute), and its tasks + there are extra things we can specify ~
This is how it looks like on Maria’s databricks:
6. Monitoring using Inference Tables and Lakehouse Monitoring
What must be monitored
- application and infrastructure health
- application performance and costs
- model and data drift
- business value
- for some use cases, fairness and bias
Monitoring can be done again in the databricks.yml
file as a job.
With Lakehouse monitoring we can
- monitor data drift
- monitor model drift (if the GT is provided)
Lakehouse Monitoring components
- profile metrics table (Delta table with summary stats for each slicing window)
- drift metrics table (Delta table with with stats and different drift detection metrics)
- customisable dashboard
Overall ~ awesome 🔥
That is all for today!
Merry Christmas!
See you tomorrow :)