(Day 359) Creating a list of books and courses I covered in the last 358 days

Ivan Ivanov · December 25, 2024

Hello :) Today is Day 359!

A quick summary of today:

applying analytical patterns day 2
streamed - looking back at books/courses I read/did over the past year for this blog
covered a 1hr course on MLOps on databricks by Maria Vechtomova

Applying analytical patterns day 2 (from Zach Wilson’s bootcamp)

My notes:

Applying analytical patterns 2 P1

Lab code:

-- applying analytical patterns lab 2
-- find how many of the people that go to the signup page end up signing up

with deduped_events as (
    select
        user_id,
        url,
        event_time,
        date(event_time) as event_date
    from events
    where user_id is not null
    group by user_id, url, event_time, date(event_time)
    order by user_id, event_time
),
    selfjoined as (
        select
            d1.user_id,
            d1.url,
            d2.url as destination_url,
            d1.event_time,
            d2.event_time
        from deduped_events d1
            join deduped_events d2
                on d1.user_id = d2.user_id
                and d1.event_date = d2.event_date
                -- count the visit if it happens on the same day
                and d2.event_time > d1.event_time
    ),
    userlevel as (
        select
            user_id,
            url,
            count(1) as number_of_hits,
            sum(case when destination_url = '/api/v1/user' then 1 else 0 end) as converted
        from selfjoined
        group by user_id, url
    )
select
    url,
    sum(number_of_hits) as num_hits,
    sum(converted) as num_converted,
    cast(sum(converted) as real) / count(number_of_hits) as pct_converted
from userlevel
group by url
having sum(number_of_hits) > 500; -- filter random url hits

-- look at website events and what type of device is the most common to hist the website

create table device_hits_dashboard as
with events_augmented as (
    select coalesce(d.os_type, '(overall)') as os_type,
           coalesce(d.device_type, '(overall)') as device_type,
           coalesce(d.browser_type, '(overall)') as browser_type,
           url,
           user_id
    from events e
        join devices d on e.device_id = d.device_id
)

select
       case
           when grouping(os_type) = 0
               and grouping(device_type) = 0
               and grouping(browser_type) = 0
               then 'os_type__device_type__browser'
           when grouping(browser_type) = 0 then 'browser_type'
           when grouping(device_type) = 0 then 'device_type'
           when grouping(os_type) = 0 then 'os_type'
       end as aggregation_level,
       coalesce(os_type, '(overall)') as os_type,
       coalesce(device_type, '(overall)') as device_type,
       coalesce(browser_type, '(overall)') as browser_type,
       count(1) as number_of_hits
from events_augmented
group by grouping sets (
        (browser_type, device_type, os_type),
        (browser_type),
        (os_type),
        (device_type)
    )
order by count(1) desc;

-- example usage (it's pre-aggregated)
select * from device_hits_dashboard
where aggregation_level = 'os_type__device_type__browser';

Streamed

I went over 358 days worth of posts and collected a list of books and courses

I will update it with the books/courses I read/do until the end of day 366 and maybe add books I will definitely read after the blog ends. I plan to post each part on linkedin on 2nd of Jan, or maybe do like a top 5/10 of books/courses.

## Book

Mathematics for Machine learning
Deep Learning by MIT 
Build a LLM from scratch
An Introduction to Statistical Learning
Foundations and Linear Models part from Probabilistic Machine Learning: An Introduction
What are embeddings? (Vicki Boykis)
Dive into Deep Learning
Grokking Machine Learning
Outlier Detection in Python
A Primer For The Mathematics Of Financial Engineering
Design a Machine Learning System (From Scratch)
Multivariate Statistical Methods: A Primer
Effective Data Science Infrastructure
MLOps blog: https://ml-ops.org/
Fundamentals of Data Engineering
Introducing MLOps
Deep Learning at Scale
Machine Learning Algorithms in Depth
Databricks Certified Associate Developer for Apache Spark Using Python
Streaming Databases
The Little Book of Deep Learning
The Elements of Statistical Learning
Deep Learning (Bishop and Bishop)
The Matrix Calculus You Need For Deep Learning (arxiv paper)
Graph Algorithms for Data Science
Before Machine Learning book series
The 100 page ML book
MLE (Andriy Burkov)
Financial Data Engineering
Machine Learning for Financial Risk Management with Python
Optimization Algorithms
Financial calculus (Baxter and Rennie)
Data Analysis with Python and PySpark
LLM Engineer's Handbook
scikit-learn User Guide
Designing Machine Learning Systems
Designing Data-Intensive Applications
Data Quality Fundamentals
Apache Iceberg: The Definitive Guide
Learning Spark 2nd ed.
Data Pipelines Pocket Reference
Stream Processing with Apache Flink
The 100 page LM book
The Little Book of ML Metrics
Effective Machine Learning Teams
Data Engineering with dbt
The AI Engineer's Guide to Surviving the EU AI Act

## Course

caltech - ML CS 156 by Prof. Abu-Mostafa: https://youtu.be/mbyG85GZ0PI?list=PLD63A284B7615313A
Statistics with Python in Codecademy: https://www.codecademy.com/learn/paths/master-statistics-with-python
Microsoft Beginner ML course: https://github.com/microsoft/ML-For-Beginners
ML course on Kaggle Learn: https://www.kaggle.com/learn
Machine Learning specialization by Andrew Ng: https://www.coursera.org/specializations/machine-learning-introduction
DeepLearning.AI TensorFlow Certfication Specialization course: https://www.coursera.org/learn/introduction-tensorflow?specialization=tensorflow-in-practice
Andrej Karpathy's NN series: https://www.youtube.com/watch?v=VMj-3S1tku0&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&ab_channel=AndrejKarpathy
DL Specialization on Coursera: https://www.coursera.org/specializations/deep-learning
NLP Specialization on Coursera: https://www.coursera.org/specializations/natural-language-processing
TensorFlow: Data and Deployment Specialization: https://www.coursera.org/specializations/tensorflow-data-and-deployment
Tensorflow Advanced techniques specialization: https://www.coursera.org/specializations/tensorflow-advanced-techniques
IBM's DL with PyTorch: https://www.coursera.org/learn/deep-neural-networks-with-pytorch
CS231n Winter 2016: https://youtu.be/NfnWJUyUJYU?list=PLkt2uSq6rBVctENoVBg1TpCC7OQi31AlC
Aladdin Persson's YT videos: https://www.youtube.com/@AladdinPersson/playlists
AI 504 by Prof. Choi: https://www.youtube.com/watch?v=TG3OJQjN-l4&list=PLLENHvsRRLjAmAjc8mV0f9C6i8Gh308SS&ab_channel=EdwardChoi
GAN specialization on Coursera: https://www.coursera.org/specializations/generative-adversarial-networks-gans
AI503 Math for AI from KAIST: https://alinlab.kaist.ac.kr/ai503_2021.html
Started Stanford University's cs231n: https://cs231n.github.io/
Becoming a backprop ninja by Andrej Karpathy (deserves a separate mention): https://youtu.be/q8SA3rM6ckI
Stanford's CS224N: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1234/
11-711 Advanced NLP by CMU: https://phontron.com/class/anlp2024/
ACL 2023 talk on retrieval-based LMs: https://acl2023-retrieval-lm.github.io/
Local Retrieval Augmented Generation (RAG) from Scratch: https://youtu.be/qN_2fnOPY-M
XCS224W ML with Graphs by Stanford University: https://online.stanford.edu/courses/xcs224w-machine-learning-graphs
GCP's Digital Leader Learning Path: https://www.cloudskillsboost.google/paths/9
Open Source Models with Hugging Face: https://www.deeplearning.ai/short-courses/open-source-models-hugging-face/
AI for Global Goals OxML Summer School series: https://www.oxfordml.school/
Stanford's CS246: Mining Massive Datasets: https://web.stanford.edu/class/cs246/
Stanford's CS109 with Prof. Chris Piech: https://www.youtube.com/playlist?list=PLoROMvodv4rOpr_A7B9SriE_iZmkanvUg
MLOps Zoomcamp by DataTalksClub: https://github.com/DataTalksClub/mlops-zoomcamp
LLM Zoomcamp by DataTalksClub: https://github.com/DataTalksClub/llm-zoomcamp
Darshil Parmar's DE projects: https://www.youtube.com/@DarshilParmar/playlists
Data Eng Zoomcamp by DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp
Stock Market Analysis Zoomcamp: https://github.com/DataTalksClub/stock-markets-analytics-zoomcamp
EvidentlyAI's observability course: https://www.evidentlyai.com/ml-observability-course
Intro to LangGraph by LangChain: https://academy.langchain.com/courses/intro-to-langgraph
Neo4j's free certification courses: https://graphacademy.neo4j.com/categories/
DeepLearning.AI Data Engineering Professional Certificate
The Complete Mathematics of Neural Networks and Deep Learning: https://youtu.be/Ixl3nykKG9M
 Credit Risk Modelling in Python: https://www.udemy.com/course/credit-risk-modeling-in-python/
dbt fundamentals: https://learn.getdbt.com/courses/dbt-fundamentals
scikit-learn's MOOC: https://inria.github.io/scikit-learn-mooc/
Zach Wilson's free DE boot camp: https://techcreator.io/
Free streaming courses on Confluent: https://training.confluent.io/channeldetail/apache-kafka-fundamentals-and-accreditation
Apache Iceberg/Lakehouse course on Dremio University: https://university.dremio.com/
Airflow courses by Astronomer: https://academy.astronomer.io/
MadeWithML's MLOps course: https://madewithml.com/

## People

Andrew Ng
Andrej Karpathy
Andriy Burkov
Aladdin Persson
Alexey Grigorev
Vincent D. Warmerdam (probabl)
Zach Wilson

MLOps course on LinkedIn learning by Maria Vechtomova

I saw that Maria Vechtomova posted a ‘free’ course on MLOps on linkedin learning so I signed up for a 30-day free trial on the website and decided to check it out ~

Learning objectives:

explain main components and principles required to deploy machine learning models to production on Databricks
identify how to use experiment tracking system, model registry, feature engineering, model/feature serving, and other features required to deploy ML applications
articulate how to package your Python code using best practices and deploy your project using Databricks Asset Bundles.
review how to monitor your ML applications

1. MLOps components and principles

Interesting trend:

DS became popular and soon after MLOps. Not surprising, I guess. Companies wanted to build models and then soon after they realised they need a more consistent way of developing, delivering, and maintaining them.

The minimal set of tools for MLOps:

version control
CI/CD
orchestration
model registry
container registry
compute
serving
monitoring
data version control
other (feature store, labeling tool, exp. tracking, responsible AI, vector db, llm evaluation, llm frameworks, model hub)

There are many different options for each of the above

reuse the tools available in your organization
only add tools if needed
the more tools you have, the more complex migrations you’ll have to go through

Principles

Real business impact is involved if ML model predictions are unavailable or of bad quality. The goal of MLOps is to minimise the risk of things going wrong

Four main principles:

documentation (business goals and KPIs, data definition, ML system architecture, the choice of ML models)
code quality (review processes, code quality checks, unit and integration tests)
traceability and reproducability (for a model this is the ease of finding: corresponding code and commits on git, the infrastructure used for training and serving, ML model artifacts, what data wasa used to train the model; reproducability: the same model with the same data and code can be generated, rollback is easy)
monitoring and alerting (infrastructure health and costs, application health, response time and response codes, model eval and business metrics, data and model drift)

The first two are related to people and processes, while the last two are related to tools.

Databricks for MLOps

It offers:

orchestration
model registry
compute
serving
vector search
exp. tracking
prompt engineering
feature engineering
monitoring

2. MLflow

Offered a managed version by databricks

I could not follow the code exactly as I do not have a proper databricks subscription. I will create one for her course at the end of Jan.

But that did not stop me from running the code locally with a local mlflow server 😆.

(I cloned the repo and started running files locally). But I faced some issues with pyspark as the env was done all with uv and I kept getting RuntimeError: Only remote Spark sessions using Databricks Connect are supported. Use DatabricksSession.builder to create a remote Spark session instead. I guess I could try to do the whole thing locally ~ I just continued watching at this point. She even says that some parts don’t work well in vs code so it’s better to run them in databricks ~ 😆 So as most of the time ~ I will focus on the ideas rather than the exact implementation/code

The code for all of it is on github

3. Feature engineering

Databricks feature engineering concepts

FeatureTable: a table stored as a Delta table in Unity catalog that contains ML model features
FeatureFunction: a class that specifies a Python UDF in Unity catalog
FeatureLookup: a class that specifies how to look up features in the feature table
TrainingSet: a class that defines how to construct a train set using a feature table, FeatureFunction, and FeatureLookup
FeatureSpec: feature specification defined as a list containing instances of FeatureLookup and FeatureFunction

Feature engineering client functionality

create, update, and delete feature tables and feature specifications
create a training dataset
log an MLflow model packaged with information on how a training set was built

4. Feature and Model serving

When to use feature serving

predictions are precomputed and need to be accessed at the moment of request
a simple function that does not require a model artifact must be computed at the moment of request

When to use model serving

there is a model artifact
optionally, some features must be retrieved at the moment of request

5. Deploy apps using Databricks Asset Bundles

Ways to deploy a databricks job:

terraform
databricks APIs
databricks asset bundles (DABs) (recommended way as it manages dependencies and files required to be manually handles if using the first two options)

DAB is deployed by defining a pipeline in a yaml file where we essentially define a job, its schedule, tags, where to run it (compute), and its tasks + there are extra things we can specify ~

This is how it looks like on Maria’s databricks:

6. Monitoring using Inference Tables and Lakehouse Monitoring

What must be monitored

application and infrastructure health
application performance and costs
model and data drift
business value
for some use cases, fairness and bias

Monitoring can be done again in the databricks.yml file as a job.

With Lakehouse monitoring we can

monitor data drift
monitor model drift (if the GT is provided)

Lakehouse Monitoring components

profile metrics table (Delta table with summary stats for each slicing window)
drift metrics table (Delta table with with stats and different drift detection metrics)
customisable dashboard

Overall ~ awesome 🔥

That is all for today!

Merry Christmas!

See you tomorrow :)