(Day 321) Slowly changing dimensions

Ivan Ivanov · November 17, 2024

Hello :) Today is Day 321!

A quick summary of today:

  • learned about slowly changing dimensions
  • finished Chip Huyen’s book

Today I mostly hung out in Zach Wilson's Discord and chatted with people about the homework and lectures.

Today, just a lecture about SCDs was released, and here are my notes from it:

(notes: Slowly changing dims and idempotency, day 2, parts 1-3)
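
Since the lecture was about slowly changing dimensions, here is a minimal sketch of what a Type 2 SCD table for the actors data could look like in Postgres. This is just my own illustration of the idea from the lecture topic, not code from it; the table and column names are ones I made up:

create table actors_scd (
    actorid text,
    quality_class text,   -- 'star' / 'good' / 'average' / 'bad'
    is_active boolean,
    start_year integer,   -- first year this version of the row is valid
    end_year integer,     -- last year this version of the row is valid
    is_current boolean,   -- true only for the latest version per actor
    primary key (actorid, start_year)
);

Instead of overwriting quality_class or is_active when they change, you close the current row (fill in end_year, flip is_current) and insert a new one, so the whole history stays queryable.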

And I started doing the homework.

Here is my solution for 1 and 2 (not sure if it's correct, but it looks good 😆):

  1. DDL for actors table: Create a DDL for an actors table with the following fields:
    • films: An array of struct with the following fields:
      • film: The name of the film.
      • votes: The number of votes the film received.
      • rating: The rating of the film.
      • filmid: A unique identifier for each film.
    • quality_class: This field represents an actor’s performance quality, determined by the average rating of their movies in their most recent year. It’s categorized as follows:
      • star: Average rating > 8.
      • good: Average rating > 7 and ≤ 8.
      • average: Average rating > 6 and ≤ 7.
      • bad: Average rating ≤ 6.
    • is_active: A BOOLEAN field that indicates whether an actor is currently active in the film industry (i.e., making films this year).
  2. Cumulative table generation query: Write a query that populates the actors table one year at a time.
create type films as (
    film text,
    votes integer,
    rating real,
    filmid text
);

create type quality_class as enum ('star', 'good', 'average', 'bad');

create table actors (
    actorid text,
    actor text,
    films films[],
    quality_class quality_class,
    current_year integer,
    is_active boolean,
    primary key (actorid, current_year)
);

-- select min(year) from actor_films; -- 1970

DO $$
DECLARE
    y_year INT; -- Previous year
    t_year INT; -- Target year
BEGIN
    FOR y_year IN 1969..2020 LOOP  -- start at 1969 so the first iteration joins against an empty 'yesterday' (data starts in 1970)
        t_year := y_year + 1;
        insert into actors
        with yesterday as (
            select *
            from actors
            where current_year = y_year
        ),
        today_aggregated as (
            select
                actorid,
                max(actor) as actor,
                array_agg(row(film, votes, rating, filmid)::films) as films,
                max(year) as year,
                avg(rating) as average_rating
            from actor_films
            where year = t_year
            group by actorid
        )
        select
            coalesce(t.actorid, y.actorid) as actorid,
            coalesce(t.actor, y.actor) as actor,
            case
                when y.films is null then t.films
                when t.films is not null then y.films || t.films
                else y.films
            end as films,
            case
                when t.films is not null then
                    case when t.average_rating > 8 then 'star'
                        when t.average_rating > 7 then 'good'
                        when t.average_rating > 6 then 'average'
                        else 'bad'
                    end::quality_class
                else y.quality_class
            end as quality_class,
            coalesce(t.year, y.current_year + 1) as current_year,
            case when t.films is not null then true else false end as is_active
        from today_aggregated t
        full outer join yesterday y
            on t.actorid = y.actorid;
    END LOOP;
END $$;
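
A quick sanity check I ran afterwards, just eyeballing one actor's cumulative history (the name is only a placeholder; use any actor that appears in actor_films):

select current_year, quality_class, is_active, cardinality(films) as num_films
from actors
where actor = 'Tom Hanks'  -- placeholder, any actor from the dataset works
order by current_year;

One thing I noticed: the DO block above isn't idempotent as written, since rerunning it trips over the (actorid, current_year) primary key. Truncating actors first (or adding "on conflict do nothing" to the insert) makes reruns safe, which ties back nicely to the idempotency part of the lecture.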

Designing ML Systems, Chapter 10: Infrastructure and Tooling for MLOps

(image: the different layers of infrastructure for ML)

Storage and Compute

Storage Layer:

The storage layer is where data is collected and stored, ranging from simple HDDs or SSDs to cloud-based solutions like Amazon S3 and Snowflake. While companies used to manage their own storage, the last decade has seen a shift to commoditized, cloud-based storage. The affordability of cloud storage enables companies to retain vast amounts of data without prohibitive costs.

Compute Layer: The compute layer powers ML workloads by providing the resources to process data. It can range from a single CPU/GPU core to cloud-managed instances like AWS EC2 or GCP Compute Engine. Compute units are characterized by memory (e.g., GB) and speed (e.g., FLOPS). Utilization, or how effectively compute resources are used, depends on memory bandwidth and data input speeds.

Benchmarks like MLPerf help evaluate compute performance by measuring tasks such as training a ResNet-50 model or generating predictions using BERT. For simplicity, people often focus on the number of cores and memory when selecting instances.

Cloud vs. On-Premises: Cloud computing’s elasticity allows companies to scale workloads dynamically, reducing operational overhead. This flexibility suits bursty ML workloads, where experiments demand high compute power temporarily. However, cloud costs are significant, often prompting mature companies to repatriate workloads to private data centers to cut expenses. For example, Dropbox saved $75M by optimizing its infrastructure.

Multicloud Strategy: To avoid vendor lock-in, many organizations adopt multicloud strategies, spreading workloads across providers like AWS, GCP, and Azure. This approach leverages cost-effective services but introduces complexity, often resulting from acquisitions or independent decision-making within organizations.

Cloud and compute strategies should balance cost, scalability, and operational complexity based on a company’s size and workload patterns.

Development Environment

The development (dev) environment is where ML engineers write code, run experiments, and interact with production systems. It comprises tools like IDEs, versioning systems, and CI/CD pipelines. Despite its importance, many companies underinvest in their dev environments, handling development processes in ad hoc ways. Improvements in the dev environment directly enhance engineering productivity.

Key Components of a Dev Environment

Integrated Development Environment (IDE):

IDEs include applications like VS Code or browser-based tools like AWS Cloud9. Notebooks, such as Jupyter and Google Colab, are favored by data scientists for their stateful execution, which supports exploratory data analysis and iterative experiments. While powerful, notebooks can complicate reproducibility due to out-of-order cell execution.

Versioning:

Version control is critical for managing code, data, and model artifacts in ML projects.

Tools include Git (code), DVC (data), and Weights & Biases or MLflow (experiment tracking). Emerging platforms like Claypot AI aim to centralize ML workflow versioning.

CI/CD Pipelines:

Continuous Integration and Continuous Deployment (CI/CD) ensure code changes are tested before moving to production.

Common tools include GitHub Actions and CircleCI.

Standardization of Dev Environments

Standardizing dev environments minimizes compatibility issues and boosts productivity.

  • ensure consistency in tools, packages, and virtual environments (e.g. conda)
  • maintain strict package versioning in requirements.txt to prevent dependency conflicts
  • use the same programming language version across the team
  • consider cloud dev environments for standardized setups and easier IT support

Cloud environments offer added benefits:

  • cost-efficiency with auto-shutdown features (like GitHub Codespaces)
  • remote access through SSH
  • enhanced security by centralizing sensitive data
  • reduced gap between dev and production environments

Transitioning from Dev to Production: Containers

Containers streamline deployment by ensuring consistent environments. To manage containers across multiple hosts, Kubernetes (K8s) provides scalable deployments with auto-scaling and high availability.

Workflows

(image: a simple ML workflow DAG)

After a workflow is defined, the tasks in this workflow are scheduled and orchestrated.

ML Platform

Machine learning (ML) platforms are transforming how companies manage and deploy ML solutions. A manager from a major streaming company shared a firsthand account of how his ML platform team was formed. Initially focused on recommender systems, they developed tools like feature management and model monitoring. Eventually, these tools were repurposed for broader ML applications, leading to the creation of a dedicated ML platform team. This mirrors a broader industry trend: companies are increasingly leveraging shared ML infrastructure to unify workflows across diverse applications.

Key ML Platform Components

Model deployment

Deployment tools make ML models accessible for predictions—either online (real-time) or batch (precomputed). The market offers various options, from cloud-native tools like AWS SageMaker and GCP Vertex AI to open-source alternatives like MLflow Models and Seldon.

Model store

Beyond simply storing models, a robust model store tracks a wide range of artifacts: model definitions, parameters, dependencies, and training data. This ensures reproducibility, simplifies debugging, and supports long-term maintenance. MLflow is a popular open-source solution, though its artifact management capabilities are evolving.

Feature store

  • feature management: centralized sharing and discovery of features across teams
  • feature computation: efficient computation and storage of features
  • feature consistency: ensuring consistent feature definitions between training and production

Build vs. Buy Considerations

Build versus buy decisions depend on factors such as:

  • company stage: early-stage companies often buy vendor solutions to focus on core offerings, while mature companies may build in-house solutions as vendor costs rise
  • competitive focus: companies prioritize in-house development for areas tied to their competitive advantage, while others prefer vendor solutions, especially in non-tech sectors. Tech companies often favor modular, customizable services
  • tool maturity: if mature vendor tools are unavailable, teams may need to build their own solutions, potentially leveraging open-source options

Chapter 11: The Human Side of Machine Learning

User Experience

Ensuring user experience consistency is crucial when integrating machine learning (ML) into applications or websites. Users expect predictable interfaces; for example, moving a frequently used button in a browser can lead to confusion. However, ML predictions are inherently probabilistic, meaning they can vary over time or contexts, potentially disrupting user expectations. A case study from Booking.com illustrates this challenge: the platform used ML to recommend filters to users but found that inconsistent suggestions frustrated users. To address this, they implemented rules to maintain consistency, such as retaining previously applied filters while allowing new recommendations when the user changes contexts. This highlights the balance between consistency and accuracy in ML systems.

Another challenge with ML systems is handling “mostly correct” predictions. Large language models like GPT-3 can generate outputs for various tasks with minimal training, such as producing React code for a website. While these predictions can accelerate workflows, they are often imperfect. Users familiar with the domain can make corrections, but non-experts may struggle. To mitigate this, developers can provide multiple predictions for the same input and render them in user-friendly ways, such as displaying visualized web pages for evaluation. This approach, known as “human-in-the-loop” AI, allows users to select or refine outputs, ensuring a better overall experience.

Lastly, handling latency issues in ML systems is critical for smooth user experiences. Some queries may take longer for models to process, especially in sequential tasks like language or time-series models. To address this, many companies implement backup systems that use simpler models, heuristics, or cached predictions to deliver quicker responses when the primary model is delayed. Alternatively, predictive models can determine which system to use based on the expected processing time. This approach balances the speed–accuracy trade-off, ensuring timely responses while maintaining acceptable prediction quality.

By addressing these challenges—ensuring consistency, managing imperfect predictions, and mitigating latency—developers can leverage ML to enhance user experiences effectively.

Team structure

Cross-functional Teams Collaboration

Subject matter experts (SMEs) like doctors, lawyers, and farmers play a vital role in the lifecycle of ML systems, beyond the initial data labeling phase. Their expertise is critical in problem formulation, feature engineering, error analysis, and system evaluation. However, collaboration between SMEs and ML engineers poses challenges, such as translating nuanced domain knowledge into code or versioning and enabling SMEs to contribute without needing engineering skills. To address this, companies are developing no-code/low-code platforms to empower SMEs to participate in tasks like dataset creation and quality assurance without extensive technical knowledge.

End-to-End Data Scientists

In ML projects, operational expertise is as important as ML expertise, creating the need for effective collaboration between teams or for data scientists to own the entire process. Two approaches are common:

  1. separate Ops and ML teams: the ML team develops models, and an Ops team handles deployment and maintenance. This simplifies hiring but introduces drawbacks like communication overhead, debugging difficulties, and limited visibility across the entire workflow.
  2. data scientists owning the process: data scientists manage both modeling and production, fostering accountability but increasing their workload with operational tasks. They risk becoming “grumpy unicorns,” burdened with tasks like managing infrastructure, which detracts from their core focus.

A balance can be struck by creating tools that abstract complex infrastructure tasks, allowing data scientists to manage projects end-to-end without deep expertise in areas like containerization or distributed systems. Companies like Netflix and Stitch Fix have demonstrated the effectiveness of tools that automate infrastructure management, enabling data scientists to focus on data and modeling while leveraging robust operational capabilities.

Ultimately, successful ML systems require seamless collaboration between diverse expertise areas, supported by tools that simplify complex workflows, empowering teams to work efficiently and effectively.

Responsible AI

  • discovering sources of bias
    • biases can originate in training data, where unrepresentative datasets marginalize underrepresented groups.
    • human annotation can introduce subjective biases, compromising label quality.
    • feature engineering may rely on variables correlated with sensitive attributes, causing disparate impacts.
    • model objectives and evaluation often prioritize majority groups or overlook subgroup performance, leading to unfair outcomes.
  • recognizing data-driven limitations
    • ML systems often fail to consider real-world socioeconomic and cultural factors.
    • addressing these limitations requires collaboration across disciplines to integrate the experiences of those impacted by the systems.
  • managing trade-offs
    • privacy versus accuracy: techniques like differential privacy can reduce accuracy, disproportionately affecting minority groups.
    • compactness versus fairness: model compression techniques can harm underrepresented groups more, with varying impacts depending on the approach used.
    • trade-offs require careful evaluation to minimize unintended consequences and ensure fairness.
  • acting early
    • addressing ethical issues early in the development lifecycle saves costs and prevents long-term risks.
    • late-stage fixes for bias or ethical concerns are significantly more expensive and challenging.
  • using model cards for transparency
    • model cards document critical details about an AI system, including its design, intended uses, and limitations.
    • automated tools can generate and update model cards, reducing manual effort and ensuring up-to-date transparency.
  • establishing bias mitigation processes
    • create internal tools for bias detection and mitigation, drawing inspiration from organizations like Google and IBM.
    • consider using third-party audits to ensure independent evaluations.

Responsible AI development emphasizes proactive bias mitigation, thoughtful trade-offs, and transparent communication to maximize societal benefits while minimizing harm.


That is all for today!

See you tomorrow :)