(Day 360) New book - Architecting Data and Machine Learning Platforms + some bootcamp homework

Ivan Ivanov · December 26, 2024

Hello :) Today is Day 360!

A quick summary of today:

  • read chapters 1 and 2 of Architecting Data and Machine Learning Platforms
  • did Q2 (grouping sets) and Q3 (window functions) from Zach Wilson’s applying analytical patterns homework

Architecting Data and Machine Learning Platforms - Chapter 1 - Modernizing Your Data Platform: An Introductory Overview

The data lifecycle

The journey

  • collect
  • store
  • process/transform
  • analyse/visualise
  • activate (decisions can be made)


Limitations of traditional approaches

Traditional data ecosystems often consist of isolated, task-specific solutions, leading to data silos that hinder insight generation and efficient collaboration across business units. Custom-built solutions exacerbate challenges in scalability, business continuity, disaster recovery, privacy, and data auditing. A unified data platform, particularly a cloud-based one, addresses these limitations by enabling consistent, scalable, and reliable analytics and machine learning across organizational data. Leveraging managed services minimizes infrastructure maintenance, allowing organizations to focus on business objectives.

Antipattern: breaking down silos through ETL


Organizations often use ETL tools to address data silos by transferring and transforming data into centralized analytics stores (data warehouses or data lakes). However, relying on ETL poses significant challenges:

  • data quality issues: ETL processes may extract incorrect data because the pipeline creators often lack a full understanding of the source data
  • latency: ETL introduces delays, making data stale, though streaming ETL mitigates this issue
  • bottlenecks: ETL creation and maintenance require skilled data engineers, slowing down scalability
  • high maintenance costs: routine troubleshooting, infrastructure updates, and schema changes increase operational overhead
  • data gaps: errors often lead to downtime and incomplete records
  • governance issues: redundant and divergent ETL processes result in inconsistent data
  • inefficiency and environmental impact: always-on infrastructure increases costs and carbon footprint

Antipattern: centralisation of control

Organizations attempting to address data silos by centralizing control under IT face significant challenges:

  • IT limitations: managing diverse technologies across multiple storage systems (on-premises and cloud) strains IT capabilities and resources. Defining consistent security, governance, and auditing across systems is complex. Access requests grow with the number of source and target systems, creating scalability issues.

  • analytics challenges: accessing the right data becomes difficult and costly, often leading to unnecessary ETL tasks. Pre-prepared data may lack recent updates or comprehensive details, limiting analytic flexibility. Data governance can restrict data access, reducing analytic agility

  • business limitations: ensuring reliable and quality data access is a struggle. Organizations must balance data quality with the extent of data provided. IT’s limited understanding of business needs can slow innovation, leading to bottlenecks and delays

Antipattern: data marts and hadoop


To address challenges with siloed, centrally managed systems, organizations adopted two problematic approaches: data marts and ungoverned data lakes.

  1. Data marts:
    • extracted subsets of enterprise data into structured models for specific workloads (e.g., retail, banking)
    • enabled business users to derive insights (e.g., revenue trends, user behavior)
    • relied on costly, on-premises technologies with scalability limits, requiring expensive upgrades for larger datasets
    • struggled with infrastructure scaling, talent shortages, and rising costs
  2. Data lakes via hadoop:
    • introduced distributed processing (horizontal scaling) on low-cost servers, making advanced analytics affordable
    • served as central repositories for raw data, enabling large-scale analytics and data science
    • required extensive knowledge of hadoop ecosystems (e.g., hive, spark, presto) for access, adding complexity
    • became ungoverned, leading to poor data quality and low user adoption due to skills gaps and unclear data structures

Both approaches struggled to deliver consistent ROI due to high costs, maintenance challenges, and difficulties in managing data quality and accessibility.

Creating a unified analytics platform

Cloud instead of on-premises

  • reduce cost by taking advantage of new pricing models (pay-per-use model)
  • speed up innovation by taking advantage of best-of-breed technologies
  • scale on-premises resources using a “bursting” approach
  • plan for business continuity and disaster recovery by storing data in multiple zones and regions
  • manage disaster recovery automatically using fully managed services
  • centralized governance and access management
  • increased productivity and reduced operational costs
  • greater data sharing across the organization
  • extended access by different personas
  • reduced latency of accessing data

Drawbacks of data marts and data lakes

  • management challenges: difficult to manage and expensive to scale, especially legacy systems like Teradata and Netezza
  • lack of modern capabilities: require additional work to integrate AI/ML or real-time features
  • limited access: analysts lack direct access to raw data and data owners, restricting hypothesis testing and deeper insights

  • storage and cost issues: unpredictable data production makes storage planning difficult, and provisioning for peak computational demand is expensive
  • maintenance overhead: engineers spend significant time on upkeep, reducing time for feature development, with high total cost of ownership often yielding negative ROI
  • governance problems: inconsistent security models create silos, limiting data and model sharing across teams
  • skill gap: business users lack programming expertise to leverage data lakes, while technical users remain disconnected from business objectives, leading to missed opportunities for insights and growth

Lakehouses


They offer:

  • decoupling of storage and compute that enables inexpensive, virtually unlimited, and seamlessly scalable storage; stateless, resilient compute; ACID-compliant storage operations; and a logical rather than physical db storage model
  • data governance
  • support for data analysis via native integration with BI tools
  • native support for the multiversion approach typical of a data lake
  • data storage and management via open formats like Parquet and Iceberg (see the sketch after this list)
  • support for different data types, structured or unstructured
  • streaming capabilities with the ability to handle real-time analysis of the data
  • enablement of a diverse set of applications, ranging from BI to ML
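
To make the open-format point a little more concrete, here is a minimal sketch in Spark SQL (the catalog, schema, and table names are made up, and it assumes an Iceberg catalog named lakehouse is already configured): the table is just Parquet data files plus Iceberg metadata sitting on object storage, yet it supports ACID merges and snapshot-based time travel.

-- hypothetical Iceberg table on object storage (Spark SQL; names are illustrative)
create table lakehouse.sales.orders (
    order_id    bigint,
    customer_id bigint,
    amount      decimal(10, 2),
    order_ts    timestamp
)
using iceberg
partitioned by (days(order_ts)); -- hidden partitioning via a transform

-- ACID upsert directly against the open-format table
merge into lakehouse.sales.orders as t
using lakehouse.staging.order_updates as s
    on t.order_id = s.order_id
when matched then update set *
when not matched then insert *;

-- query an older snapshot of the same table (time travel)
select * from lakehouse.sales.orders timestamp as of '2024-12-01 00:00:00';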

Data mesh


Offers:

  • domain responsibility
  • on-demand solutions
  • organisational alignment

Challenges: managing complex architectures that span multiple environments requires careful planning and execution

Hybrid cloud

Reasons why hybrid is necessary

  • data residency regulations
  • legacy investments
  • transition
  • want to meet specific needs
  • benefit from combining different clouds

Challenges

  • governance
  • access control
  • workload interoperability
  • data movement
  • skill sets
  • economics

Cloud for AI

Given the rise of AI, it is important to ensure that data platforms are future proof and capable of supporting AI use cases.

AI requires vast data to train models, with more data leading to better performance. Large-scale AI models, such as image classification and language models, need high-quality data and custom hardware (GPUs, TPUs) to process it. Cloud platforms provide the compute power and resources necessary for significant AI breakthroughs.

Cloud makes AI more accessible through:

  • ML APIs (pre-built models)
  • customisable ML models (AutoML)
  • simpler ML (e.g., BigQuery enables ML model training using SQL; see the sketch after this list)
  • real-time processing architecture
  • ML lifecycle tools
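
As a quick illustration of the “ML with plain SQL” idea, here is a minimal BigQuery ML sketch; the dataset, table, and column names are hypothetical.

-- train a logistic regression model entirely in SQL (hypothetical dataset and columns)
create or replace model demo_dataset.churn_model
options (
    model_type = 'logistic_reg',
    input_label_cols = ['churned']
) as
select
    tenure_months,
    monthly_spend,
    support_tickets,
    churned
from demo_dataset.customer_features;

-- score new rows with the trained model
select *
from ml.predict(
    model demo_dataset.churn_model,
    (select tenure_months, monthly_spend, support_tickets
     from demo_dataset.new_customers)
);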

Core principles

The authors suggest 5 key design principles for a data analytics stack:

  • deliver serverless analytics, not infrastructure
  • embed end-to-end ML
  • empower analytics across the entire data lifecycle
  • enable open source software tech
  • build for growth

Nobel Minds 2024

I saw that the 2024 Nobel laureates had a talk and that Geoffrey Hinton and Demis Hassabis attended (not all laureates attended) and decided to watch. Hinton kept talking about being careful with AI and bad actors and that with AI the rich can keep getting richer - I was expecting him to name-drop OpenAI or even Sam Altman 😆 but he didn’t. Even though AI is the most popular topic, I am glad the non-AI laureates got to talk about their own research as well - economics and genetics.

Applying analytical patterns homework Q2 and Q3

Q2 - grouping sets

- A query that uses `GROUPING SETS` to do efficient aggregations of `game_details` data
  - Aggregate this dataset along the following dimensions
    - player and team
      - Answer questions like who scored the most points playing for one team?
    - player and season
      - Answer questions like who scored the most points in one season?
    - team
      - Answer questions like which team has won the most games?

My answer (so far)

create table hw_2_table as
with game_data as (
    select
        gd.player_name as player_name,
        gd.team_abbreviation as team_abbreviation,
        g.season as season,
        gd.game_id as game_id,
        gd.pts as player_pts,
        case
            when (gd.team_id = g.home_team_id and home_team_wins = 1)
                     or (gd.team_id <> g.home_team_id and home_team_wins = 0) then 1 else 0
        end as won_game
    from game_details gd
        left join games g
            on gd.game_id = g.game_id
), grouped_data as (
    select
        -- dimensions rolled up by a grouping set come back as NULL, so label them '(overall)'
        coalesce(cast(player_name as text), '(overall)') as player_name,
        coalesce(cast(team_abbreviation as text), '(overall)') as team_abbreviation,
        coalesce(cast(season as text), '(overall)') as season,
        sum(player_pts) as player_total_pts,
        count(distinct (case when won_game = 1 then game_id end)) as team_total_wins,
        -- grouping(...) returns 0 only when every listed column is part of the current grouping set
        case
            when grouping(player_name, team_abbreviation) = 0 then 'player__team'
            when grouping(player_name, season) = 0 then 'player__season'
            when grouping(team_abbreviation) = 0 then 'team'
        end as agg_level
    from game_data
    group by grouping sets (
            (player_name, team_abbreviation),
            (player_name, season),
            (team_abbreviation)
        )
)
select * from grouped_data;

-- Who scored the most points playing for one team? -- Giannis Antetokounmpo 15591 points with MIL
select
    player_name,
    team_abbreviation,
    player_total_pts
from hw_2_table
where agg_level = 'player__team'
order by player_total_pts desc
limit 1;
-- Who scored the most points in one season? -- James Harden 3247 points in 2018
select
    player_name,
    player_total_pts,
    season
from hw_2_table
where agg_level = 'player__season'
order by player_total_pts desc
limit 1;
-- Which team has won the most games? -- GSW 445 games
select
    team_abbreviation,
    team_total_wins
from hw_2_table
where agg_level = 'team'
order by team_total_wins desc
limit 1;

Q3 - window functions

The question:

- A query that uses window functions on `game_details` to find out the following things:
  - What is the most games a team has won in a 90 game stretch? 
  - How many games in a row did LeBron James score over 10 points a game?

My answer (so far)

-- What is the most games a team has won in a 90 game stretch? -- 79
with game_data as (
    select
        gd.player_name as player_name,
        gd.team_abbreviation as team_abbreviation,
        g.season as season,
        gd.game_id as game_id,
        g.game_date_est as game_date_est,
        gd.pts as player_pts,
        case
            when (gd.team_id = g.home_team_id and home_team_wins = 1)
                     or (gd.team_id <> g.home_team_id and home_team_wins = 0) then 1 else 0
        end as won_game
    from game_details gd
        left join games g
            on gd.game_id = g.game_id
), wins_last_90 as (
    select
        team_abbreviation,
        -- rolling window of 90 consecutive games per team (deduplicated to one row per team per game)
        sum(won_game) over (partition by team_abbreviation order by game_date_est, game_id
                            rows between current row and 89 following) as wins_last_90_games
    from (select distinct team_abbreviation, game_id, game_date_est, won_game from game_data) as team_games
)
select
    team_abbreviation,
    max(wins_last_90_games) as wins_last_90_games
from wins_last_90
group by team_abbreviation
order by wins_last_90_games desc
limit 1;

-- How many games in a row did LeBron James score over 10 points a game? -- 62
with game_data as (
    select
        gd.player_name as player_name,
        g.game_date_est as game_date_est,
        case
            when gd.pts > 10 then 1 else 0
        end as scored_over_10
    from game_details gd
        left join games g
            on gd.game_id = g.game_id
    where gd.player_name = 'LeBron James'
), prev_game as (
    select
        *,
        -- compare each game with the previous one to detect where a streak starts or ends
        lag(scored_over_10, 1) over (order by game_date_est) as prev_scored_over_10
    from game_data
), with_streak as (
    select
        *,
        -- gaps and islands: a running count of changes gives every streak its own identifier
        sum(case when prev_scored_over_10 = scored_over_10 then 0 else 1 end) over (order by game_date_est) as streak_identifier
    from prev_game
), with_final as (
    select
        *,
        sum(scored_over_10) over (partition by streak_identifier order by game_date_est) as max_consecutive_scored_over_10
    from with_streak
    order by game_date_est
)
select
    player_name,
    max(max_consecutive_scored_over_10) as consecutive_scored_over_10
from with_final
group by player_name
order by consecutive_scored_over_10 desc;

Architecting Data and Machine Learning Platforms - Chapter 2 - Strategic Steps to Innovate with Data


Step 1: Strategy and Planning

Aligning data strategy with business goals, identifying stakeholders, and establishing a change management process are crucial for success.

  • business-driven goals: the focus should move beyond cost savings to how data can drive key business decisions and transformations. It is important to identify the decisions that will be made using the data and the metrics by which you can know the transformation has been successful
  • stakeholder identification: engaging with both IT and business unit leaders ensures alignment of incentives and long-term support for the project
  • change management: effectively cascading and communicating the approach throughout the organization is crucial to gain buy-in from end users

Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach

Migrating to the cloud offers significant cost benefits through automation, right-sizing resources, and leveraging serverless capabilities.

  • cost savings: cloud providers leverage economies of scale and automation to offer lower infrastructure costs
  • right-sizing and autoscaling: scaling resources based on demand and usage eliminates overprovisioning and optimizes costs
  • serverless workloads: further cost optimization can be achieved by using serverless computing for specific tasks

Step 3: Break Down Silos

Balancing decentralization and data accessibility is crucial - centralizing data storage while maintaining domain-specific ownership and leveraging cloud IAM for access control fosters collaboration and insights.

  • centralized storage, decentralized ownership: storing data in a central location like Azure Blob Storage, with each department managing their data in their own bucket, strikes a balance between control and accessibility
  • cloud IAM for access control: replacing legacy authentication mechanisms with cloud IAM ensures standardized and auditable access across the organization
  • semantic layer: a semantic layer, implemented through tools like Looker or Collibra, ensures consistent terminology and KPI calculations, further breaking down analytics silos. Data analytics teams should be encouraged to build one (so that vocabularies are standardized and KPIs are computed once and reused everywhere else) and to apply governance through it

Step 4: Make Decisions in Context Faster

The value of business decisions decreases with latency and distance, so a key goal for modernising a data platform should be to enable contextual decision-making capabilities such as geographic information systems (GIS), streaming, and machine learning (ML) without making copies of the data.

  • batch to stream: as data volumes grow exponentially, transitioning from batch to stream processing reduces latency, spreads computational load, and improves timeliness
  • contextual info: automating decision-making enhances user experience, particularly for occasional users. Embedding real-time, location-based visualizations into applications is a best practice
  • cost management: streaming can be costly, but combining batch and stream processing using tools like Apache Flink or managed services can reduce complexity and cost. Micro-batching is a viable alternative when near-real-time insights suffice (see the sketch after this list)
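
To give a feel for what the shift from nightly batch to streaming looks like, here is a rough Flink SQL sketch (the topic, broker, and field names are made up): it reads an event stream from Kafka and keeps per-minute aggregates continuously up to date instead of recomputing them in a batch job.

-- hypothetical order events read from a Kafka topic (connector options are illustrative)
create table orders (
    order_id   string,
    store_id   string,
    amount     decimal(10, 2),
    order_time timestamp(3),
    watermark for order_time as order_time - interval '30' second
) with (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'broker:9092',
    'format' = 'json',
    'scan.startup.mode' = 'latest-offset'
);

-- one-minute tumbling-window revenue per store, updated continuously as events arrive
select
    store_id,
    window_start,
    window_end,
    sum(amount) as revenue
from table(
    tumble(table orders, descriptor(order_time), interval '1' minutes)
)
group by store_id, window_start, window_end;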

Step 5: Leapfrog with Packaged AI Solutions

  • use AI for predictive analysis
  • understanding and generating unstructured data
  • more personalised user experiences
  • use already available packaged solutions

Step 6: Operationalize AI-Driven Workflows

The final stage of a data platform strategy is to automate workflows for solving business problems autonomously. This involves determining the desired level of automation and fostering a data-driven culture to achieve the organization’s goals.


  • identifying the right balance of automation and assistance: organizations must experiment to find the right mix of automation and human involvement, which varies across industries
  • building a data culture: a data culture requires more than analytics platforms—it involves encouraging data-driven experiments, enhancing data literacy, rewarding innovation, and training employees on data tools like dashboards and analytics
  • populating a data science team

Step 7: Product Management for Data

We should treat data as a product to maximize its value and impact on the organization. To treat data as a product, apply product management principles when conceiving and building data products. What product management principles? (1) Have a product strategy, (2) be customer-centric, (3) do lightweight product discovery, and (4) focus on finding market fit.

The authors recommend 10 data practices for this:


  1. Understand and maintain a map of data flows
    • maintain a high-level model of data flows for discoverability
    • include multiple levels of granularity, showing data sources, transformations, and usage
    • document data attributes (e.g., PII, quality assurances) and production use cases
  2. Identify key metrics
    • align on key metrics (e.g., business KPIs, SLAs, engagement, satisfaction)
    • define targets for these metrics and track performance consistently
  3. Agreed criteria, committed roadmap, and visionary backlog
    • establish prioritization criteria for transparency
    • maintain a committed roadmap based on validated ROI
    • use a backlog to capture long-term vision and unscheduled priorities
  4. Build for the customers you have
    • understand user needs (e.g., tools, technical expertise, data formats)
    • serve data via APIs, dashboards, warehouses, or feature stores, depending on the audience
  5. Don’t shift the burden of change management
    • minimize downstream disruption during schema or format changes
    • ensure backward compatibility or provide migration support
    • integrate security and compliance measures proactively
  6. Interview customers to discover data needs
    • regularly engage with customers to identify problems and validate needs
    • avoid building speculative features; prioritize validated use cases
  7. Whiteboard and prototype extensively
    • collaborate with users during the design phase
    • validate ideas using minimum viable prototypes before production
    • scope requirements and expected ROI before committing to development
  8. Build only what will be used immediately
    • prioritize delivering essential features with immediate impact
    • capture future needs in a backlog and build iteratively
  9. Standardize common entities and KPIs
    • create enriched, standardized datasets for shared, high-impact use cases
    • limit standardization to critical datasets to avoid over-engineering
  10. Provide self-service capabilities
    • balance standardization with flexibility
    • enable teams to customize datasets for domain-specific use cases
    • adopt principles from data mesh architectures to promote self-sufficiency

That is all for today!

See you tomorrow :)