(Day 356) Applying analytical patterns - new content from Zach Wilson's bootcamp

Ivan Ivanov · December 22, 2024

Hello :) Today is Day 356!

A quick summary of today:

  • finished the DE with dbt book - overall opinion - amazing!
  • covered lecture 1 + lab 1 from the analytical pattern content of Zach Wilson’s bootcamp (+ homework Q1 😆)

DE with dbt - Chapter 14 - Enhancing Software Quality

Refactoring and evolving models

The core of refactoring is to make many small changes that keep the project working, test after each small change, and in the end replace the initial code with better code that keeps working as desired.

Dealing with technical debt

Technical debt refers to the future cost of transforming a working but imperfect solution into a cleaner, more maintainable one. It is normal for initial solutions to be suboptimal, but teams must manage technical debt to maintain effectiveness and avoid stagnation. Like financial debt, technical debt should be used judiciously and paid back over time to keep it manageable.

The best approach is to refactor code during development, aiming for a ‘good enough’ version that is sustainable rather than perfect. Refactoring should be part of delivering functionality, not treated as extra work. Small, continuous improvements can help manage technical debt without halting new development. Only in extreme cases should a team pause new work to address excessive debt.

Tools for refactoring in dbt projects

Publishing dependable datasets

  • provide clear and consistent naming for columns and tables (a small sketch follows after this list)
  • use a consistent naming convention
  • use correct and predictable data types
  • decide on the set of default records that you want to use and consistently apply them
  • right-size your data marts
  • once you publish a fact or dimension, refrain from renaming it, renaming the columns inside, or changing the data types
  • when you want/need to change your API, do so in a controlled and communicated way
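
To make the naming and data type points a bit more concrete for myself, here is a tiny sketch of a staging model that renames everything to one convention and casts to explicit types - the source table and column names are made up:

-- Hypothetical staging model: one naming convention, explicit and predictable data types
select
    cast(id as varchar)         as customer_code,
    cast(cust_nm as varchar)    as customer_name,
    cast(signup_dt as date)     as signup_date,
    cast(ltv as number(18, 2))  as lifetime_value_eur
from source_system_customers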

What shape should you use for your data mart?

Different uses of the data call for different ways to deliver the dataset for optimal usability.

  • As an example, for most AI/ML workloads, it is preferable to provide a wide table with all the interesting data cleaned up and pre-joined at the desired granularity, while most BI tools expect data models provided in Kimball style, that is, in the shape of a star schema

  • Do not think that all cases should use the same data mart; instead, look at what could be the best solution for each use case

  • As an example, if you have to power a REST API, it is better to provide tables that are cut along the data that the API is serving; here you could also try the hybrid tables from Snowflake Unistore’s new functionality

  • When you are looking into operational activities and you want to analyze error situations, or when processes do not complete as expected or are left halfway, it might be useful to build a few composite facts using full outer joins so that, besides the usual matching data from the normal (inner) join, you can also observe the data that is not matching on either side (see the sketch after this list)

  • If you recall our discussion of the Unified Star Schema from Chapter 3, you can leverage the UNION ALL operator to build wide tables with many FKs, built by following the relations in the data, to report on seemingly disparate objects that instead share some common FK in their chain of relations

  • In most cases, you will have at least some data mart prepared to power reporting with BI tools, providing a Kimball star schema
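
Here is a quick sketch of the full outer join idea from the list above, with made-up order and invoice facts, so that the rows that do not match on either side become visible:

-- Hypothetical composite fact: orders vs invoices, keeping the non-matching rows from both sides
select
    coalesce(o.order_code, i.order_code) as order_code,
    o.order_amount,
    i.invoice_amount,
    case
        when i.order_code is null then 'ORDER_WITHOUT_INVOICE'
        when o.order_code is null then 'INVOICE_WITHOUT_ORDER'
        else 'MATCHED'
    end as match_state
from fact_order o
full outer join fact_invoice i
    on i.order_code = o.order_code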

Self-completing dimensions

The self-completing dimension ensures all foreign keys in fact tables have corresponding entries in the dimension table, preventing missing data issues during joins. It handles orphaned keys by adding default rows to the dimension rather than altering fact tables, ensuring data completeness and integrity. This approach is efficient, flexible, and aligns with modern data practices by shifting validation to the BI delivery layer. It simplifies testing, reduces manual intervention, and maintains clean, consistent data models.

This chapter was mostly about refactoring, and also about understanding the self-completing dimension macro. All of the code is on my repo.

My blog’s markdown has a rendering issue with some of the Jinja special characters, so if the macro below looks slightly off, the repo has the exact version.

{%- macro self_completing_dimension(
    dim_rel,
    dim_key_column,
    dim_default_key_value = '-1',
    rel_columns_to_exclude = [],
    fact_defs = []
) -%}
/* ** Usage notes **
 * - The primary key has to be the first field in the underlying reference for the dimension
 */

{% do rel_columns_to_exclude.append(dim_key_column) -%}

WITH
dim_base as (
    SELECT
          {{ dim_key_column }}
        , d.* EXCLUDE( {{ rel_columns_to_exclude | join(', ') }} )
    FROM {{ dim_rel }} as d
),

fact_key_list as (
    {% if fact_defs | length > 0 %}   -- If a FACT reference is passed, then check for orphans and add them in the dimension

        {#- each fact_defs entry is assumed to be a dict like {'model': <fact relation>, 'key': <FK column name>} -#}
        {%- for fact_model_key in fact_defs %}
        select distinct {{ fact_model_key['key'] }} as FOREIGN_KEY
        FROM {{ fact_model_key['model'] }}
        WHERE {{ fact_model_key['key'] }} is not null
            {% if not loop.last %} union {% endif %}
        {%- endfor -%}

    {%- else %}          -- If NO FACT reference is passed, the list of fact keys is just empty.
    select null as FOREIGN_KEY WHERE false
    {%- endif %}
),
missing_keys as (
    SELECT fkl.FOREIGN_KEY
    from fact_key_list fkl
    left outer join dim_base on dim_base.{{ dim_key_column }} = fkl.FOREIGN_KEY
    where dim_base.{{ dim_key_column }} is null
),
default_record as (
    SELECT *
    FROM dim_base
    WHERE {{ dim_key_column }} = '{{ dim_default_key_value }}'
    LIMIT 1
),
dim_missing_entries as (
    SELECT
        mk.FOREIGN_KEY,
        dr.* EXCLUDE( {{ dim_key_column }} )
    FROM missing_keys as mk
    cross join default_record dr   -- copy the single default record for every missing key
),

dim as (
    SELECT * FROM dim_base
    UNION ALL
    SELECT * FROM dim_missing_entries
)

SELECT * FROM dim

{% endmacro %}
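
Using it from a dbt model is then just one macro call. Below is a minimal sketch - the model, column, and fact names are made up, and I am assuming each fact_defs entry is a dict with a 'model' relation and a 'key' column, matching the loop above (the repo has the exact version):

{{ self_completing_dimension(
    dim_rel = ref('dim_customer'),
    dim_key_column = 'CUSTOMER_KEY',
    dim_default_key_value = '-1',
    rel_columns_to_exclude = ['LOAD_TS'],
    fact_defs = [ {'model': ref('fact_orders'), 'key': 'CUSTOMER_KEY'} ]
) }}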

DE with dbt - Chapter 15 - Patterns for Frequent Use Cases

Ingestion patterns

  • SQL INSERT: the standard SQL INSERT command; it works well for inserting a few rows, but it does not scale (dbt uses the SQL INSERT command to load the rows from the CSV files in the seed folder)
  • EXTERNAL TABLE: in Snowflake, it is possible to define external tables that point to a folder in a container on an external storage service and the table will parse the files according to the provided file format and expression for the fields

A very flexible way to access external data, structured or semi-structured. Snowflake can exploit the schema information stored in semi-structured files, making the creation very simple. The major downside of using external tables is that every time you query the table and do not hit the cache, all the files that back it must be loaded, parsed, and eventually filtered to answer your query. Some improvement can be achieved by adding partitioning based on the file/path name, but partitions help only if you query with a filter on a partitioning column.
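
A minimal sketch of what such an external table could look like - the stage, path, and columns are hypothetical, and the partition column is derived from the file path:

-- Hypothetical external table over CSV files in an external stage
create or replace external table ext_orders (
    order_date date as (to_date(split_part(metadata$filename, '/', 3), 'YYYY-MM-DD')),
    order_id   number as (value:c1::number),
    amount     number(10, 2) as (value:c3::number(10, 2))
)
partition by (order_date)
with location = @landing_stage/orders/
file_format = (type = csv skip_header = 1)
auto_refresh = false;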

  • COPY INTO: the most important command for loading data in Snowflake, because all high-performance ingestion methods rely on ingesting files through this command in one way or another

The main advantage of using COPY INTO is that files are parsed only once. A further advantage is that at the end of the command, the data is in a normal Snowflake table and every query that you issue on it takes advantage of micro-partitions and the associated metadata to reduce the data that’s queried. Another advantage is that you could transform the data while loading it, using almost all normal SQL expressions and functions.
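
A small COPY INTO sketch - stage, target table, and columns are hypothetical - including a light transformation applied while loading:

-- Hypothetical load: files are parsed once and the data lands in a normal Snowflake table
copy into raw_orders (order_id, customer_code, amount, loaded_at)
from (
    select
        $1::number,
        $2::varchar,
        $3::number(10, 2),
        current_timestamp()   -- extra column computed while loading
    from @landing_stage/orders/
)
file_format = (type = csv skip_header = 1);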

History patterns

Storing history with deletions – full load

  • Detect deletions by comparing a full export from the source with active entities in the history table (a sketch follows after this list)
  • Effective but costly for large fact tables due to the need for full exports
  • Alternative: use change data capture (CDC) systems to track deletions
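
A small sketch of the full-load comparison - the tables and columns are made up, and I am assuming an insert-only history with an is_deleted flag:

-- Hypothetical: customer_history is the insert-only history, stg_customer_full is today's full export
insert into customer_history (customer_code, customer_name, is_deleted, load_ts)
select h.customer_code, h.customer_name, true, current_timestamp()
from (
    select *
    from customer_history
    qualify row_number() over (partition by customer_code order by load_ts desc) = 1
) h
left join stg_customer_full s on s.customer_code = h.customer_code
where s.customer_code is null   -- present in history, missing from the full export
  and not h.is_deleted;         -- only close entities that are still active

The deletion-list pattern from the next section is the same idea, just cheaper: instead of anti-joining against a full export, you join against the (much smaller) list of deleted keys.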

Storing history with deletions – deletion list

  • Avoids full exports by leveraging deletion lists or database features like triggers to track changes
  • Scenarios: provided full table payload with deletion info or only the keys of deleted entities
  • Efficient for large tables, reducing expensive joins and export sizes

Storing history with multiple versions in the input

  • Allows multiple versions in one load by comparing the current version in history with ordered input versions (see the sketch after this list)
  • Relies on stable sorting criteria (e.g., timestamps) and high watermark columns to manage version order
  • Handles late arrivals by adding duplicates if necessary for correctness while ensuring accurate history
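
Roughly how I picture this, with made-up tables - the input can hold several versions per key, and the high watermark from the history keeps only the newer ones:

-- Hypothetical: stg_orders may carry several versions per order_code in a single load
with ranked_input as (
    select
        order_code,
        order_status,
        updated_at,
        row_number() over (partition by order_code order by updated_at) as version_rank
    from stg_orders
),
current_hist as (
    select order_code, max(updated_at) as high_watermark
    from order_history
    group by order_code
)
select r.order_code, r.order_status, r.updated_at
from ranked_input r
left join current_hist h on h.order_code = r.order_code
where h.high_watermark is null          -- brand new entity
   or r.updated_at > h.high_watermark   -- only versions newer than what is already stored
order by r.order_code, r.version_rank;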

Storing history with PII and GDPR compliance

  • Stores anonymized surrogates (SPII) of PII in history tables to ensure GDPR compliance (a sketch follows after this list)
  • Maintains a PII mapping table to convert SPII back to PII when necessary
  • Implementation varies between full and incremental exports, with hashing applied during staging
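
A tiny sketch of the hashing idea, with made-up names: the history keeps only the hashed surrogate (SPII), and a separate mapping table allows going back to the PII when that is permitted:

-- Hypothetical staging: replace the email (PII) with a hashed surrogate (SPII)
select
    customer_code,
    sha2(lower(trim(email)), 256) as email_spii,   -- this is what ends up in the history table
    signup_date
from src_customer;

-- Hypothetical mapping table from the surrogate back to the PII
create table if not exists map_email_spii as
select distinct
    sha2(lower(trim(email)), 256) as email_spii,
    email
from src_customer;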

History and schema evolution

  • Manages schema changes by freezing old tables and creating new ones for updated schemas
  • Uses REF models to merge old and new data, maintaining compatibility and completeness (see the sketch after this list)
  • Adapts static models for non-breaking changes and dynamically unites histories for breaking changes
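
A sketch of such a REF model uniting the two histories - the table names are made up, the old history is frozen, and the new schema adds a column:

-- Hypothetical REF model merging the frozen old history with the new one
select
    customer_code,
    customer_name,
    cast(null as varchar) as loyalty_tier,   -- column added by the new schema, absent in the old table
    load_ts
from customer_history_v1_frozen

union all

select
    customer_code,
    customer_name,
    loyalty_tier,
    load_ts
from customer_history_v2;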

Overall thoughts about the book ~ amazing!!! It is a source to refer back to when doing a project or thinking about concepts like SCDs - how they might be implemented and what issues one might face.


Applying analytical patterns - new content from Zach Wilson’s bootcamp

My notes from the 1st lecture (2nd lecture will be uploaded later ~):

Applying analytical patterns P1 Applying analytical patterns P2 Applying analytical patterns P3

My code from lab 1:

CREATE TABLE users_growth_accounting (
     user_id TEXT,
     first_active_date DATE,
     last_active_date DATE,
     daily_active_state TEXT,
     weekly_active_state TEXT,
     dates_active DATE[],
     date DATE,
     PRIMARY KEY (user_id, date)
 );

-- growth accounting
INSERT INTO users_growth_accounting
WITH yesterday AS (
    SELECT * FROM users_growth_accounting
    WHERE date = DATE('2023-01-30')
),
     today AS (
         SELECT
            CAST(user_id AS TEXT) as user_id,
            DATE_TRUNC('day', event_time::timestamp) as today_date,
            COUNT(1)
         FROM events
         WHERE DATE_TRUNC('day', event_time::timestamp) = DATE('2023-01-31')
         AND user_id IS NOT NULL
         GROUP BY user_id, DATE_TRUNC('day', event_time::timestamp)
     )

         SELECT COALESCE(t.user_id, y.user_id) as user_id,
                COALESCE(y.first_active_date, t.today_date) AS first_active_date, -- new user
                COALESCE(t.today_date, y.last_active_date) AS last_active_date,
                CASE
                    WHEN y.user_id IS NULL THEN 'New'
                    WHEN y.last_active_date = t.today_date - Interval '1 day' THEN 'Retained'
                    WHEN y.last_active_date < t.today_date - Interval '1 day' THEN 'Resurrected'
                    WHEN t.today_date IS NULL AND y.last_active_date = y.date THEN 'Churned'
                    ELSE 'Stale'
                END as daily_active_state,
                CASE
                    WHEN y.user_id IS NULL THEN 'New'
                    WHEN y.last_active_date < t.today_date - Interval '7 day' THEN 'Resurrected'
                    WHEN t.today_date IS NULL AND y.last_active_date = y.date - interval '7 day' THEN 'Churned'
                    WHEN COALESCE(t.today_date, y.last_active_date) + INTERVAL '7 day' >= y.date THEN 'Retained'
                    ELSE 'Stale'
                END as weekly_active_state,
                COALESCE(y.dates_active,
                         ARRAY []::DATE[])
                    || CASE
                           WHEN
                               t.user_id IS NOT NULL
                               THEN ARRAY [t.today_date]
                           ELSE ARRAY []::DATE[]
                    END AS date_list,
                COALESCE(t.today_date, y.date + Interval '1 day') as date
         FROM today t
                  FULL OUTER JOIN yesterday y
                                  ON t.user_id = y.user_id

-- example usage
SELECT date, daily_active_state, COUNT(1)
FROM users_growth_accounting
GROUP BY date, daily_active_state;

-- look at retention analysis for cohorts (those J-curves from the lecture)
SELECT
    EXTRACT(DOW FROM first_active_date) AS dow,
    date - first_active_date as days_since_first_signup,
    COUNT(CASE WHEN daily_active_state in ('Retained', 'Resurrected', 'New') THEN 1 END) AS number_active,
    CAST(COUNT(CASE WHEN daily_active_state in ('Retained', 'Resurrected', 'New') THEN 1 END) AS REAL)/COUNT(1) AS pct_active,
    COUNT(1)
FROM users_growth_accounting
GROUP BY EXTRACT(DOW FROM first_active_date), date - first_active_date;

With the knowledge from this lab 1 (and also from week 1, where we learned about cumulative table design), I completed Q1 of this week’s homework:

Q1:

  • A query that does state change tracking for players
    • A player entering the league should be New
    • A player leaving the league should be Retired
    • A player staying in the league should be Continued Playing
    • A player that comes out of retirement should be Returned from Retirement
    • A player that stays out of the league should be Stayed Retired

create type season_state as enum (
    'New',
    'Retired',
    'Continued Playing',
    'Returned from Retirement',
    'Stayed Retired'
);

create table players_state_tracking (
    player_name text,
    first_active_season integer,
    last_active_season integer,
    season_state season_state,
    current_season integer,
    primary key (player_name, current_season)
 );
insert into players_state_tracking
with yesterday as (
    select
        player_name,
        first_active_season,
        last_active_season,
        current_season
    from players_state_tracking
    where current_season = 1996
),
     today as (
         select
            player_name,
            current_season
         from players
         where current_season = 1997
     )
         select coalesce(t.player_name, y.player_name) as player_name,
                coalesce(y.first_active_season, t.current_season) as first_active_season,
                coalesce(t.current_season, y.last_active_season) as last_active_season,
                case
                    when y.player_name is null then 'New'
                    when y.last_active_season = t.current_season - 1 then 'Continued Playing'
                    when y.last_active_season < t.current_season -1 then 'Returned from Retirement'
                    when t.current_season is null and y.last_active_season = y.current_season then 'Retired'
                    else 'Stayed Retired'::season_state
                end as season_state,
                coalesce(t.current_season, y.current_season + 1) as current_season
         from today t
                  full outer join yesterday y
                                  on t.player_name = y.player_name;

That is all for today!

See you tomorrow :)