Hello :) Today is Day 360!
A quick summary of today:
- read chapters 1 and 2 of Architecting Data and Machine Learning Platforms
- did Q2(grouping sets) and Q3(window functions) from Zach Wilson’s applying analytical patterns homework
Architecting Data and Machine Learning Platforms - Chapter 1 - Modernizing Your Data Platform: An Introductory Overview
The data lifecycle
The journey
- collect
- store
- process/transform
- analyse/visualise
- activate (decisions can be made)
Limitations of traditional approaches
Traditional data ecosystems often consist of isolated, task-specific solutions, leading to data silos that hinder insight generation and efficient collaboration across business units. Custom-built solutions exacerbate challenges in scalability, business continuity, disaster recovery, privacy, and data auditing. A unified data platform, particularly a cloud-based one, addresses these limitations by enabling consistent, scalable, and reliable analytics and machine learning across organizational data. Leveraging managed services minimizes infrastructure maintenance, allowing organizations to focus on business objectives.
Antipattern: breaking down silos through ETL
Organizations often use ETL tools to address data silos by transferring and transforming data into centralized analytics stores (data warehouses or data lakes). However, relying on ETL poses significant challenges:
- data quality issues: etl processes may extract incorrect data due to insufficient understanding by the tool’s creators
- latency: etl introduces delays, making data stale, though streaming etl mitigates this issue
- bottlenecks: etl creation and maintenance require skilled data engineers, slowing down scalability
- high maintenance costs: routine troubleshooting, infrastructure updates, and schema changes increase operational overhead
- data gaps: errors often lead to downtime and incomplete records
- governance issues: redundant and divergent etl processes result in inconsistent data
- inefficiency and environmental impact: always-on infrastructure increases costs and carbon footprint
Antipattern: centralisation of control
Organizations attempting to address data silos by centralizing control under IT face significant challenges:
- IT limitations: managing diverse technologies across multiple storage systems (on-premises and cloud) strains IT capabilities and resources. Defining consistent security, governance, and auditing across systems is complex. Access requests grow with the number of source and target systems, creating scalability issues
- analytics challenges: accessing the right data becomes difficult and costly, often leading to unnecessary ETL tasks. Pre-prepared data may lack recent updates or comprehensive details, limiting analytic flexibility. Data governance can restrict data access, reducing analytic agility
- business limitations: ensuring reliable and quality data access is a struggle. Organizations must balance data quality with the extent of data provided. IT’s limited understanding of business needs can slow innovation, leading to bottlenecks and delays
Antipattern: data marts and hadoop
To address challenges with siloed, centrally managed systems, organizations adopted two problematic approaches: data marts and ungoverned data lakes.
- Data marts:
- extracted subsets of enterprise data into structured models for specific workloads (e.g., retail, banking)
- enabled business users to derive insights (e.g., revenue trends, user behavior)
- relied on costly, on-premises technologies with scalability limits, requiring expensive upgrades for larger datasets
- struggled with infrastructure scaling, talent shortages, and rising costs
- Data lakes via hadoop:
- introduced distributed processing (horizontal scaling) on low-cost servers, making advanced analytics affordable
- served as central repositories for raw data, enabling large-scale analytics and data science
- required extensive knowledge of hadoop ecosystems (e.g., hive, spark, presto) for access, adding complexity
- became ungoverned, leading to poor data quality and low user adoption due to skills gaps and unclear data structures
Both approaches struggled to deliver consistent ROI due to high costs, maintenance challenges, and difficulties in managing data quality and accessibility.
Creating a unified analytics platform
Cloud instead of on-premises
- reduce cost by taking advantage of new pricing models (pay-per-use model)
- speed up innovation by taking advantage of best-of-breed technologies
- scale on-premises resources using a “bursting” approach
- plan for business continuity and disaster recovery by storing data in multiple zones and regions
- manage disaster recovery automatically using fully managed services
- centralized governance and access management
- increased productivity and reduced operational costs
- greater data sharing across the organization
- extended access by different personas
- reduced latency of accessing data
Drawbacks of data marts and data lakes
- management challenges: difficult to manage and expensive to scale, especially legacy systems like Teradata and Netezza
- lack of modern capabilities: require additional work to integrate AI/ML or real-time features
- limited access: analysts lack direct access to raw data and data owners, restricting hypothesis testing and deeper insights
- storage and cost issues: unpredictable data production makes storage planning difficult, and provisioning for peak computational demand is expensive
- maintenance overhead: engineers spend significant time on upkeep, reducing time for feature development, with high total cost of ownership often yielding negative ROI
- governance problems: inconsistent security models create silos, limiting data and model sharing across teams
- skill gap: business users lack programming expertise to leverage data lakes, while technical users remain disconnected from business objectives, leading to missed opportunities for insights and growth
Lakehouses
They offer:
- decoupling of storage and compute, which enables inexpensive, virtually unlimited, and seamlessly scalable storage; stateless, resilient compute; ACID-compliant storage operations; and a logical, rather than physical, database storage model
- data governance
- support for data analysis via the native integration with BI tools
- native support for the multiversion approach typical of data lakes
- data storage and management via open formats like Parquet and Iceberg (see the sketch after this list)
- support for different data types, whether structured or unstructured
- streaming capabilities with the ability to handle real-time analysis of the data
- enablement of a diverse set of applications, ranging from BI to ML
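To make the open-format bullet above concrete, here is a minimal Spark SQL sketch of an Iceberg table; the catalog, schema, table, and snapshot id are all hypothetical, and it assumes a Spark session already configured with the Iceberg extensions:
-- create an Iceberg table in an assumed catalog (all names here are hypothetical)
create table demo.analytics.game_events (
  game_id bigint,
  team string,
  pts int,
  event_time timestamp
)
using iceberg
partitioned by (days(event_time));
-- ACID-compliant write: the insert becomes a new, isolated table snapshot
insert into demo.analytics.game_events
values (1, 'MIL', 2, timestamp '2024-01-01 19:05:00');
-- multiversioning: read an earlier snapshot of the same table (time travel)
select * from demo.analytics.game_events version as of 1234567890; -- hypothetical snapshot id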
Data mesh
Offers:
- domain responsibility
- on-demand solutions
- organisational alignment
Challenges: managing complex architectures that span multiple environments requires careful planning and execution
Hybrid cloud
Reasons why hybrid is necessary
- data residency regulations
- legacy investments
- transition
- want to meet specific needs
- benefit from combining different clouds
Challenges
- governance
- access control
- workload interoperability
- data movement
- skill sets
- economics
Cloud for AI
Given the rise of AI, it is important to ensure that data platforms are future-proof and capable of supporting AI use cases.
AI requires vast data to train models, with more data leading to better performance. Large-scale AI models, such as image classification and language models, need high-quality data and custom hardware (GPUs, TPUs) to process it. Cloud platforms provide the compute power and resources necessary for significant AI breakthroughs.
Cloud makes AI more accessible through:
- ML APIs (pre-built models)
- customisable ML models (AutoML)
- simpler ML (e.g., BigQuery enables ML model training using SQL; see the sketch after this list)
- real-time processing architecture
- ML lifecycle tools
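As a rough illustration of the SQL-based ML point, here is a minimal BigQuery ML sketch; the project, dataset, tables, and columns are all hypothetical and the model choice is only for illustration:
-- train a logistic regression model directly in SQL (all names are hypothetical)
create or replace model `my_project.demo.churn_model`
options (model_type = 'logistic_reg', input_label_cols = ['churned']) as
select
  tenure_months,
  monthly_spend,
  support_tickets,
  churned
from `my_project.demo.customers`;
-- score new rows with the trained model
select *
from ML.PREDICT(model `my_project.demo.churn_model`,
  (select tenure_months, monthly_spend, support_tickets
   from `my_project.demo.new_customers`));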
Core principles
The authors suggest 5 key design principles for a data analytics stack:
- deliver serverless analytics, not infrastructure
- embed end-to-end ML
- empower analytics across the entire data lifecycle
- enable open source software tech
- build for growth
Nobel Minds 2024
I saw that the 2024 Nobel laureates had a talk and that Geoffrey Hinton and Demis Hassabis attended (not all laureates did), so I decided to watch. Hinton kept talking about being careful with AI and bad actors, and about how AI lets the rich keep getting richer - I was expecting him to name-drop OpenAI or even Sam Altman 😆 but he didn’t. Even though AI is the most popular topic, I am glad the non-AI laureates got to talk about their own research as well - economics and genetics.
Applying analytical patterns homework Q2 and Q3
Q2 - grouping sets
- A query that uses `GROUPING SETS` to do efficient aggregations of `game_details` data
- Aggregate this dataset along the following dimensions
- player and team
- Answer questions like who scored the most points playing for one team?
- player and season
- Answer questions like who scored the most points in one season?
- team
- Answer questions like which team has won the most games?
My answer (so far)
create table hw_2_table as
with game_data as (
select
gd.player_name as player_name,
gd.team_abbreviation as team_abbreviation,
g.season as season,
gd.game_id as game_id,
gd.pts as player_pts,
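-- a player's row counts as a win when their team (home or away) won that game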
case
when (gd.team_id = g.home_team_id and home_team_wins = 1)
or (gd.team_id <> g.home_team_id and home_team_wins = 0) then 1 else 0
end as won_game
from game_details gd
left join games g
on gd.game_id = g.game_id
), grouped_data as (
select
coalesce(cast(player_name as text), '(overall)') as player_name,
coalesce(cast(team_abbreviation as text), '(overall)') as team_abbreviation,
coalesce(cast(season as text), '(overall)') as season,
sum(player_pts) as player_total_pts,
count(distinct (case when won_game = 1 then game_id end)) as team_total_wins,
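-- grouping(a, b) returns 0 only when both columns are part of the current grouping set, so each branch below labels one of the three sets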
case
when grouping(player_name, team_abbreviation) = 0 then 'player__team'
when grouping(player_name, season) = 0 then 'player__season'
when grouping(team_abbreviation) = 0 then 'team'
end as agg_level
from game_data
group by grouping sets (
(player_name, team_abbreviation),
(player_name, season),
(team_abbreviation)
)
)
select * from grouped_data;
-- Who scored the most points playing for one team? -- Giannis Antetokounmpo 15591 points with MIL
select
player_name,
team_abbreviation,
player_total_pts
from hw_2_table
where agg_level = 'player__team'
order by player_total_pts desc
limit 1;
-- Who scored the most points in one season? -- James Harden 3247 points in 2018
select
player_name,
player_total_pts,
season
from hw_2_table
where agg_level = 'player__season'
order by player_total_pts desc
limit 1;
-- Which team has won the most games? -- GSW 445 games
select
team_abbreviation,
team_total_wins
from hw_2_table
where agg_level = 'team'
order by team_total_wins desc
limit 1;
Q3 - window functions
The question:
- A query that uses window functions on `game_details` to find out the following things:
- What is the most games a team has won in a 90 game stretch?
- How many games in a row did LeBron James score over 10 points a game?
My answer (so far)
-- What is the most games a team has won in a 90 game stretch? -- 79
with game_data as (
select
gd.player_name as player_name,
gd.team_abbreviation as team_abbreviation,
g.season as season,
gd.game_id as game_id,
g.game_date_est as game_date_est,
gd.pts as player_pts,
case
when (gd.team_id = g.home_team_id and home_team_wins = 1)
or (gd.team_id <> g.home_team_id and home_team_wins = 0) then 1 else 0
end as won_game
from game_details gd
left join games g
on gd.game_id = g.game_id
), wins_last_90 as (
select
team_abbreviation,
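-- rolling sum of wins over the current row and the 89 rows that follow, ordered by game date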
sum(won_game) over (order by game_date_est rows between current row and 89 following) as wins_last_90_games
from game_data
)
select
team_abbreviation,
max(wins_last_90_games) as wins_last_90_games
from wins_last_90
group by team_abbreviation
order by wins_last_90_games desc
limit 1;
-- How many games in a row did LeBron James score over 10 points a game? -- 62
with game_data as (
select
gd.player_name as player_name,
g.game_date_est as game_date_est,
case
when gd.pts > 10 then 1 else 0
end as scored_over_10
from game_details gd
left join games g
on gd.game_id = g.game_id
where gd.player_name = 'LeBron James'
), prev_game as (
select
*,
lag(scored_over_10, 1) over (order by game_date_est) as prev_scored_over_10
from game_data
), with_streak as (
select
*,
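-- start a new streak id whenever the over-10 flag flips compared to the previous game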
sum(case when prev_scored_over_10 = scored_over_10 then 0 else 1 end) over (order by game_date_est) as streak_identifier
from prev_game
), with_final as (
select
*,
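-- running count of over-10 games within the current streak; its maximum is the longest streak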
sum(scored_over_10) over (partition by streak_identifier order by game_date_est) as max_consecutive_scored_over_10
from with_streak
order by game_date_est
)
select
player_name,
max(max_consecutive_scored_over_10) as consecutive_scored_over_10
from with_final
group by player_name
order by consecutive_scored_over_10 desc;
Architecting Data and Machine Learning Platforms - Chapter 2 - Strategic Steps to Innovate with Data
Step 1: Strategy and Planning
Aligning data strategy with business goals, identifying stakeholders, and establishing a change management process are crucial for success.
- business-driven goals: moving beyond cost-saving, the focus should be on how data can drive key business decisions and transformations. It is important to identify the decisions that will be made using the data and the metrics by which you can know the transformation has been successful
- stakeholder identification: engaging both with IT and business unit leaders ensures alignment of incentives and long-term support for the project
- change management: effectively cascading and communicating the approach throughout the organization is crucial to gain buy-in from end users
Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach
Migrating to the cloud offers significant cost benefits through automation, right-sizing resources, and leveraging serverless capabilities.
- cost savings: cloud providers leverage economies of scale and automation to offer lower infrastructure costs
- right-sizing and autoscaling: scaling resources based on demand and usage eliminates overprovisioning and optimizes costs
- serverless workloads: further cost optimization can be achieved by using serverless computing for specific tasks
Step 3: Break Down Silos
Balancing decentralization and data accessibility is crucial - centralizing data storage while maintaining domain-specific ownership and leveraging cloud IAM for access control fosters collaboration and insights.
- centralized storage, decentralized ownership: storing data in a central location like Azure Blob Storage, with each department managing their data in their own bucket, strikes a balance between control and accessibility
- cloud IAM for access control: replacing legacy authentication mechanisms with cloud IAM ensures standardized and auditable access across the organization
- semantic layer: a semantic layer, implemented through tools like Looker or Collibra, ensures consistent terminology and KPI calculations, further breaking down analytics silos. Data analytics teams should be encouraged to build the semantic layer (so vocabularies are standardized and KPIs are computed once and reused everywhere) and to apply governance through it (a minimal SQL sketch follows this list)
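A very rough sketch of the "compute a KPI once and reuse it" idea, using a plain SQL view; the table and column names are hypothetical, and real semantic-layer tools add modeling and governance on top of this:
-- define the KPI once in a shared, governed view (all names are hypothetical)
create view analytics.monthly_active_users as
select
  date_trunc('month', event_time) as month,
  count(distinct user_id) as monthly_active_users
from analytics.user_events
group by 1;
-- every team reuses the same definition instead of re-deriving it
select month, monthly_active_users
from analytics.monthly_active_users
where month >= date '2024-01-01';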
Step 4: Make Decisions in Context Faster
The value of business decisions decreases with latency and distance, so a key goal for modernising a data platform should be to enable contextual decision-making capabilities such as geographic information systems (GIS), streaming, and machine learning (ML) without making copies of the data
- batch to stream: data volumes grow exponentially, transitioning from batch to stream processing reduces latency, spreads computational load, and improves timeliness
- contextual info: automating decision-making enhances user experience, particularly for occasional users. Embedding real-time, location-based visualizations into applications is a best practice
- cost management: streaming can be costly, but combining batch and stream processing using tools like Apache Flink or managed services can reduce complexity and cost. Micro-batching is a viable alternative when near-real-time insights suffice (a small SQL sketch follows this list)
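For the micro-batching alternative, here is a minimal SQL sketch that aggregates the last few minutes of events into one-minute buckets; the table and columns are hypothetical, and a streaming engine such as Flink or a managed service would maintain such windows continuously instead of re-running a query:
-- count orders per one-minute bucket over the last five minutes (all names are hypothetical)
select
  date_trunc('minute', event_time) as minute_bucket,
  count(*) as order_count
from raw.order_events
where event_time >= now() - interval '5 minutes'
group by 1
order by 1;
-- running this on a short schedule approximates near-real-time results (micro-batching)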
Step 5: Leapfrog with Packaged AI Solutions
- use AI for predictive analysis
- understanding and generating unstructured data
- more personalised user experiences
- use already available packaged solutions
Step 6: Operationalize AI-Driven Workflows
The final stage of a data platform strategy is to automate workflows for solving business problems autonomously. This involves determining the desired level of automation and fostering a data-driven culture to achieve the organization’s goals
- identifying the right balance of automation and assistance: organizations must experiment to find the right mix of automation and human involvement, which varies across industries
- building a data culture: a data culture requires more than analytics platforms—it involves encouraging data-driven experiments, enhancing data literacy, rewarding innovation, and training employees on data tools like dashboards and analytics
- populating a data science team
Step 7: Product Management for Data
We should treat data as a product to maximize its value and impact on the organization. To treat data as a product, apply product management principles when conceiving and building data products. What product management principles? (1) Have a product strategy, (2) be customer-centric, (3) do lightweight product discovery, and (4) focus on finding market fit.
The authors recommend 10 data practices for this:
- Understand and maintain a map of data flows
- maintain a high-level model of data flows for discoverability
- include multiple levels of granularity, showing data sources, transformations, and usage
- document data attributes (e.g., PII, quality assurances) and production use cases
- Identify key metrics
- align on key metrics (e.g., business KPIs, SLAs, engagement, satisfaction)
- define targets for these metrics and track performance consistently
- Agreed criteria, committed roadmap, and visionary backlog
- establish prioritization criteria for transparency
- maintain a committed roadmap based on validated ROI
- use a backlog to capture long-term vision and unscheduled priorities
- Build for the customers you have
- understand user needs (e.g., tools, technical expertise, data formats)
- serve data via APIs, dashboards, warehouses, or feature stores, depending on the audience
- Don’t shift the burden of change management
- minimize downstream disruption during schema or format changes
- ensure backward compatibility or provide migration support
- integrate security and compliance measures proactively
- Interview customers to discover data needs
- regularly engage with customers to identify problems and validate needs
- avoid building speculative features; prioritize validated use cases
- Whiteboard and prototype extensively
- collaborate with users during the design phase
- validate ideas using minimum viable prototypes before production
- scope requirements and expected ROI before committing to development
- Build only what will be used immediately
- prioritize delivering essential features with immediate impact
- capture future needs in a backlog and build iteratively
- Standardize common entities and KPIs
- create enriched, standardized datasets for shared, high-impact use cases
- limit standardization to critical datasets to avoid over-engineering
- Provide self-service capabilities
- balance standardization with flexibility
- enable teams to customize datasets for domain-specific use cases
- adopt principles from data mesh architectures to promote self-sufficiency
That is all for today!
See you tomorrow :)