Hello :) Today is Day 351!
A quick summary of today:
- this book is amazing!!! learned a lot about using dbt for SCD
- my repo for the code from today
Some of the Jinja got mangled by my blog's formatting when I first posted this, so the snippets below are the cleaned-up versions.
Data Engineering with dbt - Chapter 1/2 - The Basics of SQL to Transform Data/Setting up Snowflake and dbt cloud
The book starts by introducing SQL, its usage, and some key terms like tables, queries, DDL, etc.
Then I created a Snowflake account so that I could follow the book's examples.
To set up users, a warehouse, and a database, I ran the queries below:
/**
* Chapter 1 - Section 3: Set up initial users, roles and database in Snowflake
*/
-- ** Create and grant your first role
USE ROLE USERADMIN;
CREATE ROLE DBT_EXECUTOR_ROLE
COMMENT = 'Role for the users running DBT models';
--DROP ROLE DBT_EXECUTOR_ROLE; -- if you want to drop and redo ;)
GRANT ROLE DBT_EXECUTOR_ROLE TO USER divakaivan;
SHOW GRANTS TO USER divakaivan;
-- ** Grant the privileges to create a database
USE ROLE SYSADMIN;
GRANT CREATE DATABASE ON ACCOUNT
TO ROLE DBT_EXECUTOR_ROLE;
GRANT USAGE ON WAREHOUSE COMPUTE_WH
TO ROLE DBT_EXECUTOR_ROLE;
SHOW GRANTS TO ROLE DBT_EXECUTOR_ROLE;
--SHOW GRANTS TO ROLE SYSADMIN;
-- ** Create your first database
USE ROLE DBT_EXECUTOR_ROLE;
CREATE DATABASE DATA_ENG_DBT;
SHOW Databases;
-- configuring the default warehouse
USE ROLE SYSADMIN;
ALTER WAREHOUSE "COMPUTE_WH" SET
WAREHOUSE_SIZE = 'XSMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
COMMENT = 'Default Warehouse';
-- ** Create a user for the dbt application
CREATE USER IF NOT EXISTS DBT_EXECUTOR
COMMENT = 'User running DBT commands'
PASSWORD = 'pick_a_password'
DEFAULT_WAREHOUSE = 'COMPUTE_WH'
DEFAULT_ROLE = 'DBT_EXECUTOR_ROLE';
-- assign the executor role to the dbt user
GRANT ROLE DBT_EXECUTOR_ROLE TO USER DBT_EXECUTOR;
-- Switch back to the operational role we should always be working with
USE ROLE DBT_EXECUTOR_ROLE;
Next - query syntax
The full code is here but it’s basically just basic SQL query examples.
Some examples:
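The repo has the actual queries; here is just a representative one of my own (not the book's) against the same TPCH sample data used further below - select some columns, filter, sort, limit:
SELECT
    O_ORDERKEY,
    O_CUSTKEY,
    O_ORDERDATE,
    O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_TOTALPRICE > 400000        -- filter with a predicate
ORDER BY O_TOTALPRICE DESC         -- sort the result
LIMIT 10;                          -- keep only the first 10 rows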
Next - some JOIN clauses
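Again, a representative example of mine rather than the book's exact one - joining ORDERS to CUSTOMER on the customer key:
SELECT
    C.C_NAME,
    O.O_ORDERKEY,
    O.O_ORDERDATE,
    O.O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" AS O
JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" AS C
    ON C.C_CUSTKEY = O.O_CUSTKEY   -- inner join on the customer key
ORDER BY O.O_TOTALPRICE DESC
LIMIT 10;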
Finally - some window functions
SELECT
O_ORDERKEY,
O_CUSTKEY,
O_ORDERDATE,
O_TOTALPRICE,
avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg,
sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total,
sum(O_TOTALPRICE) over(partition by
DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total,
O_TOTALPRICE / daily_avg * 100 as avg_pct,
O_TOTALPRICE / daily_total * 100 as day_pct,
O_TOTALPRICE / monthly_total * 100 as month_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
QUALIFY row_number() over(partition by O_ORDERDATE
order by O_TOTALPRICE DESC) <= 5
order by O_ORDERDATE, O_TOTALPRICE desc;
Setting up dbt cloud and connecting it to Snowflake
Faced some issues here with providing the exact credentials, but thankfully the dbt docs had my back.
Chapter 3 - Data Modeling for Data Engineering
Data modeling
The building blocks of a data model:
- entity: a concept that we want to analyze, a thing of interest that we want to collect data about, such as a car, an order, or a page visit
- attribute: a property of an entity for which we want to store values, such as a car plate, a customer, or the total amount of a transaction
- relationship: a connection between two entities that are related to each other and captures how they are related, such as the ownership relation between a car and its owner, or the purchasing relation between a customer and a product
Data modeling is all about conveying information so that those that work with the modeled data can better understand it and each other.
Data modeling is generally used in two main contexts:
- to describe what data should be generated or collected by an application or in a project
In this case, we build the data model before having the data.
- to describe some data that already exists
In this case, we reverse-engineer the data model from the available data.
A good practice is to explicitly design this new data model before coding the transformations, drawing at least one model that represents the business concepts that are the goal of the integration and will form the refined targets of our transformations. This model can be at a high level of abstraction, describing the business and only the main entities, or at a greater level of detail, already representing the technical design.
Conceptual, logical, and physical data models
- conceptual data model: this is the most abstract model, defining what will be in the domain of the project, providing the general scope
- logical data model: provides much greater detail, defining what the data will look like
- physical data model: the most detailed model, describing exactly how the data will be stored in the database
Entity-Relationship modeling
- entity: represented as a box with the name of the entity inside
If attributes are shown, the entity name is at the top and often in bold or bigger and visually separated from the attributes
- attribute: represented by the attribute’s name inside the box of the entity it belongs, with one name per line
- relationship: represented by a line joining the entities involved in the relationship and optionally the name of the relationship close to the line
Cardinality
- 1-to-1
- 1-to-many
- many-to-many
Time perspective
The time perspective significantly influences E-R models, though it is often an implicit and underexplained factor. E-R models capture constraints or business rules of a system, and the system’s time perspective affects entity semantics, applicable rules, and the model’s structure.
For operational applications, relationships like Driver-Taxi are often viewed as one-to-one at a specific time (e.g. a taxi needs a driver, and a driver can drive one taxi at a time). However, both entities make sense independently when inactive (a taxi not in use or a driver on a break). Time-based entities (shifts or contracts) are sometimes embedded to track relationships over time.
In data engineering projects, a lifetime perspective is typically adopted, showing relationships over the lifespan of entities. For example, over time, a driver may drive multiple taxis, and a taxi may have multiple drivers, resulting in a many-to-many relationship. Such relationships require an intermediate table (e.g., Work Period) to store the periods when changes occur.
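A minimal sketch of how that could look as tables - my own illustration with made-up names, where the Work Period bridge table resolves the many-to-many over time:
CREATE TABLE DRIVER (
    DRIVER_CODE TEXT PRIMARY KEY,
    DRIVER_NAME TEXT
);
CREATE TABLE TAXI (
    TAXI_PLATE TEXT PRIMARY KEY,
    TAXI_MODEL TEXT
);
-- one row per period during which a given driver drove a given taxi
CREATE TABLE WORK_PERIOD (
    DRIVER_CODE TEXT REFERENCES DRIVER (DRIVER_CODE),
    TAXI_PLATE  TEXT REFERENCES TAXI (TAXI_PLATE),
    VALID_FROM  DATE,
    VALID_TO    DATE            -- null while the period is still open
);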
Modeling use cases and patterns
Header-detail use case
As an example, it is used in invoices and orders, and almost anywhere else where there is one document with multiple lines of detail inside. The common document info goes into the header entity with the detail lines represented by one or more weak entities depending on the header. A weak entity is an entity that does not have any meaning without the header entity, such as an order line without an order header.
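In SQL terms the pattern could be sketched like this (table and column names are mine, just for illustration):
CREATE TABLE ORDER_HEADER (
    ORDER_ID     NUMBER PRIMARY KEY,
    CUSTOMER_ID  NUMBER,
    ORDER_DATE   DATE
);
-- weak entity: an order line has no meaning without its order header
CREATE TABLE ORDER_LINE (
    ORDER_ID     NUMBER REFERENCES ORDER_HEADER (ORDER_ID),
    LINE_NUMBER  NUMBER,
    PRODUCT_ID   NUMBER,
    QUANTITY     NUMBER,
    LINE_AMOUNT  NUMBER(38,2),
    PRIMARY KEY (ORDER_ID, LINE_NUMBER)   -- its identity depends on the header key
);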
Hierarchical relationships
Hierarchical relationships are very common and generally are a special case of one-to-many relationships. They represent data structures that look like a tree.
The common pattern is that one instance depends on another higher/bigger instance of the same entity, unless it is itself one of the instances at the highest/biggest level.
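The usual way to store such a tree is a self-referencing foreign key plus a recursive query to walk it - a sketch of my own, not from the book:
CREATE TABLE CATEGORY (
    CATEGORY_ID        NUMBER PRIMARY KEY,
    CATEGORY_NAME      TEXT,
    PARENT_CATEGORY_ID NUMBER REFERENCES CATEGORY (CATEGORY_ID)  -- null for the top level
);
-- walk the tree top-down with a recursive CTE
WITH RECURSIVE category_tree AS (
    SELECT CATEGORY_ID, CATEGORY_NAME, 1 AS DEPTH
    FROM CATEGORY
    WHERE PARENT_CATEGORY_ID IS NULL
    UNION ALL
    SELECT c.CATEGORY_ID, c.CATEGORY_NAME, t.DEPTH + 1
    FROM CATEGORY c
    JOIN category_tree t ON c.PARENT_CATEGORY_ID = t.CATEGORY_ID
)
SELECT * FROM category_tree;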
Forecasts and actuals
Budgeting and forecasting involve predicting figures and comparing them with actuals. A one-to-many relationship exists, with one actual linked to multiple forecasts. Reporting models often generalize Figures into subclasses (actuals and forecasts) and use a single table to simplify comparisons. Actuals are copied for each forecast using joins, relying on reliable forecast dates and business logic.
Common problems in data models
Fan trap
The fan trap is a very common problem that can happen every time you have a join in a one-to-many relationship. It is not a problem of the relationship, but of how you might use it.
The fan trap problem causes the calculations done on measures joined from the one side of the one-to-many relationship to be wrong.
This is only a problem if you use a measure that is on the one side of a one-to-many relationship, when grouping/working at the granularity of the entities on the many side. This happens because the join will duplicate the measures to match the cardinality of the many side.
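A quick illustration, reusing the hypothetical ORDER_HEADER / ORDER_LINE tables sketched earlier and pretending the header also carries an ORDER_TOTAL measure:
-- fan trap: ORDER_TOTAL is repeated once per matching order line,
-- so it gets summed as many times as the order has lines
SELECT
    h.CUSTOMER_ID,
    SUM(h.ORDER_TOTAL) AS INFLATED_TOTAL
FROM ORDER_HEADER h
JOIN ORDER_LINE   l ON l.ORDER_ID = h.ORDER_ID
GROUP BY h.CUSTOMER_ID;

-- safe version: aggregate the header measure at its own granularity,
-- without dragging in the many side at all
SELECT
    h.CUSTOMER_ID,
    SUM(h.ORDER_TOTAL) AS CORRECT_TOTAL
FROM ORDER_HEADER h
GROUP BY h.CUSTOMER_ID;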
Chasm trap
A chasm trap is an unmanaged and often unintentional many-to-many relationship. Because it is unmanaged, it becomes an unintended partial Cartesian product: the rows on one side will be cross-joined with the occurrences of the same key on the other side, de facto making a per-key Cartesian product.
Chasm trap and a correct many-to-many relationship model
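To make the difference concrete (made-up tables again): joining two fact-like tables directly on a shared key cross-joins their rows per key, while a managed many-to-many goes through an explicit mapping table:
-- chasm trap: every order of a customer gets paired with every delivery
-- of that same customer - a per-key Cartesian product
SELECT o.ORDER_ID, d.DELIVERY_ID
FROM ORDERS     o
JOIN DELIVERIES d ON d.CUSTOMER_ID = o.CUSTOMER_ID;

-- managed many-to-many: a mapping table states which deliveries
-- belong to which orders, so only the real pairs survive
SELECT o.ORDER_ID, d.DELIVERY_ID
FROM ORDERS             o
JOIN ORDER_DELIVERY_MAP m ON m.ORDER_ID    = o.ORDER_ID
JOIN DELIVERIES         d ON d.DELIVERY_ID = m.DELIVERY_ID;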
Modeling styles and architectures
Kimball method or dimensional modeling or star schema
Limitations:
- measures in dimensions: while Kimball advises against including measures in dimensions to avoid fan traps, practical needs like storing important metrics (e.g., price in a product dimension) often necessitate this, requiring caution to handle fan traps correctly
- no fact-to-fact connections: Kimball schemas restrict facts to only connect with dimensions, which limits analyses involving multiple facts (e.g. orders and deliveries). Alternative approaches like USS or custom SQL are needed for such scenarios
- limited enterprise view: the Kimball approach focuses on reporting specific business processes rather than providing a comprehensive enterprise data view, which may not meet all reporting needs
- business rules dependency: the loading and storage processes are tightly integrated with business rules, as conformed dimensions and facts serve both as transformation goals and primary data storage
Unified Star Schema
The USS (Unified Star Schema) is an extension of the Kimball star schema that addresses some of its limitations, such as the fan trap and the inability to query more than one fact at a time without the risk of a chasm trap.
Inmon design style
The Inmon design style, created by Bill Inmon, emphasizes building an enterprise-wide, integrated, and interoperable data warehouse as a single source of truth for an organization. Its core idea is to model business concepts (entities and processes) and normalize the data into Third Normal Form (3NF) to minimize redundancy and capture complex relationships.
Advantages
- resilience: adaptable to changes in business processes or source systems.
- comprehensive: captures rich concepts and relationships for broad reporting capabilities
Drawbacks
- complexity: requires highly skilled modelers and significant management commitment
- auditability: original data is hard to trace due to heavy transformations
- effort-intensive: maintenance and updates for new reporting needs are resource-heavy
Data vault
DV architecture:
Data mesh
Data mesh is a modern data architecture approach aimed at solving data ownership issues by decentralizing data management. Instead of a centralized data platform, it organizes data into domain-specific platforms, each owned and maintained by business units familiar with the data’s semantics and use cases. Inspired by microservice architecture and DevOps practices, data mesh emphasizes ‘data products’ — well-defined interfaces, applications, or data marts enabling seamless interaction across domains.
While it aligns business and IT goals, improves data quality, and empowers teams with end-to-end data ownership, it introduces complexity and requires advanced skills in managing distributed systems and APIs. Adoption is challenging in less tech-savvy industries, but insights like treating data as products and assigning data ownership to business units can still be applied. Tools like dbt facilitate collaboration and documentation in this framework.
The Pragmatic Data Platform - PDP
PDP combines the best aspects of various data modeling styles and architectures to create a modern, efficient data platform while minimizing complexity and expertise requirements.
PDP adopts key principles from Data Vault (DV) and Inmon, such as:
- separating ingestion from business rules
- using insert-only ingestion of full history
- organizing data around business concepts and clear business key (BK) definitions
However, it avoids DV modeling due to its complexity, accepting a trade-off of reduced resilience for simplicity. Data delivery is typically through data marts with star schemas but can also include wide, denormalized tables for specific use cases like ML/AI.
PDP emphasizes practical solutions, balancing advanced techniques (like using hashes) with accessibility to a broader audience. This pragmatic approach leverages the full capabilities of dbt while simplifying implementation.
Chapter 4 - Analytics Engineering as the New Core of Data Engineering
The data life cycle
- data creation
- data movement and storage
- data transformation
- access to reports
- data write back
Understanding the modern data stack
Traditional
- ETL
- limitations of OLTP systems
- rigid transformation logic
- data gaps
- siloed workflows
- DQ issues
Modern
- analytical data warehouses in the cloud, such as BigQuery, Redshift, and Snowflake, are more and more common and becoming an affordable option for most data teams
- data movement/integration tools, such as Stitch and Fivetran, make it very easy to move the data from traditional databases and common business platforms to the cloud data warehouse
- modern BI tools start to provide improved abilities for self-service reporting that power users can exploit
The roles in the modern data stack and the analytics engineer
- data engineer: the ‘traditional’ DE, armed with scripting/coding and infrastructure knowledge, is still very much valuable, especially if not all your sources can be covered by modern data-loading tools. Even in that case, complex scheduling, security, and a myriad of other topics might require the expertise and focus of a data engineer
- analytics engineer: a new figure that is firmly centered around knowing and transforming raw data into useful information. They are less concerned with data movement pipeline development and are instead focused on making the data available, usable, and trusted
- data analyst: this is an improved version of the traditional data analyst, as now they can understand and even help with the construction of data models and certainly have an unprecedented ability to investigate a wider array of datasets
DataOps – software engineering best practices for data
- version control
- quality assurance
- modularity of the code base
- dev vs prod environments
- designing for maintainability
Chapter 5 - Transforming Data with dbt
Layers of a modern data platform
The implementation effort and cost over time in a warehousing project
The project
The book will guide me through building a simple stock tracking platform.
The initial data model and glossary
This sample project is about being able to track the value of investments spread across multiple portfolios and analyze them in the ways we prefer. We will start simple, just tracking the current value of the portfolio, and then add more functionalities along the next sections and chapters.
The following conceptual data model illustrates the main entities and their relations, as you could gather from the initial explanation of the use case.
- Position: a position tracks the amount of a specific security held in a portfolio and comes with a few measures besides the amount of stock owned, such as the cost of the security held in the position
- Portfolio: a portfolio is a container of positions, usually associated with a trading or custody account at a broker
- Security: a security is an investment instrument that can be bought and sold on a specific market. Securities exist in multiple types, including bonds, stock, options, and many others
- Quote: a security quote is a set of values, such as a bid and an ask price, that describe the market offering for the stock at a moment in time or a set of values that describe the market offering for the stock during an interval of time:
- Daily quote: a daily quote is a set of values, such as open, close, max, and min prices, that describe the market offering for the stock during a trading day.
- Calendar: the calendar represents the time dimension at the desired resolution, often a day, providing useful precalculated attributes for all the items in the calendar
Setup
- my dbt project repo
- snowflake db
USE ROLE DBT_EXECUTOR_ROLE;
CREATE DATABASE PORTFOLIO_TRACKING
COMMENT = 'DB for the portfolio tracking project';
- set up dbt cloud
Defining data sources and providing reference data
Set up Snowflake:
USE ROLE DBT_EXECUTOR_ROLE;
-- ** 1 ** Create the schema for the source data
CREATE SCHEMA PORTFOLIO_TRACKING.SOURCE_DATA;
-- ** 2 ** Create the landing table for ABC_BANK data
CREATE OR REPLACE TABLE PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_POSITION (
accountID TEXT,
symbol TEXT,
description TEXT,
exchange TEXT,
report_date DATE,
quantity NUMBER(38,0),
cost_base NUMBER(38,5),
position_value NUMBER(38,5),
currency TEXT
);
-- ** 3 ** Create the file format to load the data for ABC_BANK
CREATE FILE FORMAT "PORTFOLIO_TRACKING"."SOURCE_DATA".ABC_BANK_CSV_FILE_FORMAT
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N')
;
Next, I loaded the given CSV file into the landing table manually.
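For reference, the same load could also be scripted with a stage and COPY INTO instead of going through the UI - a rough sketch with a made-up stage and file name:
-- hypothetical internal stage and file name; I actually loaded the CSV via the web UI
CREATE STAGE IF NOT EXISTS PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_STAGE;
-- PUT runs from a local client such as SnowSQL, not from the web worksheet:
-- PUT file:///path/to/ABC_Bank_POSITION.csv @PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_STAGE;
COPY INTO PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_POSITION
FROM @PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_STAGE/ABC_Bank_POSITION.csv
FILE_FORMAT = (FORMAT_NAME = 'PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_CSV_FILE_FORMAT');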
Next, creating a source for the table we just loaded:
version: 2

sources:
  - name: abc_bank
    database: PORTFOLIO_TRACKING
    schema: SOURCE_DATA
    tables:
      - name: ABC_BANK_POSITION
And ~
It works ~ the database, schema, table names, connections are good :)
Writing a basic model
SELECT
*,
POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT,
ROUND(UNREALIZED_PROFIT / COST_BASE, 5)*100 as UNREALIZED_PROFIT_PCT
FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}
And it appeared in Snowflake after dbt run:
Finally the dbt_project.yml is updated to include the different layers of the architecture:
models:
  portfolio_tracking:
    +materialized: view
    staging: # Data storage layer
      +materialized: view
      +schema: STAGING
    refined: # Refined data layer
      +materialized: view
      +schema: REFINED
    marts: # Data delivery layer
      portfolio: # one specific data mart
        +materialized: table
        +schema: MART_PORTFOLIO

snapshots:
  portfolio_tracking:
    +target_schema: SNAPSHOTS
Generating views or tables for the different layers is mostly a performance choice, as they are functionally equivalent. Tables are slower to create and take up space on storage, but they are much quicker to query.
Therefore, it makes sense to materialize models that are queried multiple times during the ELT as tables so that the calculations are performed only once, as well as for models queried by end users in interactive mode so that the query executes in the shortest time possible.
We configured the refined layer to be generated as tables; that is a sensible choice when we do not have a data mart layer yet and the refined layer is also our delivery layer.
The delivery layer is usually materialized as tables, but if a BI tool that stores the data from your data mart layer internally is used, such as Power BI, it is a sensible choice to generate the layer as views, because each view will be queried only once to load the data into the tool.
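On the same note, if a single model ever needs a materialization different from its folder default, it can be overridden in the model's own file - a minimal sketch using the source we already declared:
{{ config(materialized='table') }}

SELECT * FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}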
Some tests
version: 2

sources:
  - name: abc_bank
    database: PORTFOLIO_TRACKING
    schema: SOURCE_DATA
    tables:
      - name: ABC_BANK_POSITION
        description: The landing table holding the data imported from the CSV extracted by ABC Bank.
        columns:
          - name: ACCOUNTID
            description: The code for the account at the ABC Bank.
            tests:
              - not_null
          - name: SYMBOL
            description: The symbol of the security for the ABC Bank.
            tests:
              - unique
              - not_null
          - name: DESCRIPTION
            description: The name of the security at the ABC Bank.
            tests:
              - not_null
          - name: EXCHANGE
            description: The short name or code of the exchange where the security is traded.
            tests:
              - not_null
          - name: REPORT_DATE
            description: >
              The date of the portfolio report extracted by ABC Bank.
              We consider this position to be effective from this date forward, until a change is seen.
            tests:
              - not_null
          - name: QUANTITY
            description: The number of securities we hold in the portfolio at ABC Bank at the Report Date.
            tests:
              - not_null
          - name: COST_BASE
            description: The cost we paid attributable to the securities we hold in the portfolio at ABC Bank at the Report Date.
            tests:
              - not_null
          - name: POSITION_VALUE
            description: The value of the securities we hold in the portfolio at ABC Bank according to market quotation on the Report Date.
            tests:
              - not_null
          - name: CURRENCY
            description: The currency for the monetary amounts of the position.
            tests:
              - not_null
Added a test for the 2 new columns in the POSITION_ABC_BANK table:
version: 2

models:
  - name: POSITION_ABC_BANK
    description: The positions we have in the ABC Bank portfolio.
    columns:
      - name: UNREALIZED_PROFIT
        description: The unrealized profit on the position.
        tests:
          - not_null
      - name: UNREALIZED_PROFIT_PCT
        description: >
          The unrealized profit percentage on the position.
        tests:
          - not_null
Docs
And of course finally, we can see the amazing docs generated
Chapter 6 - Writing maintainable code
Writing code for humans
The other most important things that we can do to keep complexity in check are
- reduce the need to re-read, applying the Least Surprise Principle (LSP)
- make the code that you write easy to re-read/understand
- keep the models that we write focused and limited in dimension
- keep the dependencies ordered and going only in one direction, from one layer to the next in the architecture
- when you see an improvement, refactor your models to keep them current
Creating the architectural layers
Creating the Staging layer
Goals and contents of the staging models: to adapt external data to how we want to see and use it in our project. This includes preparing it for change tracking and for storage in the most usable way for our future needs.
- pick the desired columns, usually all of them
- rename the columns to follow our conventions and use our project-designated names
- manage data types, performing conversions to desired data types
- apply “hard rules” with non-destructive transformations, such as time-zone conversions
- add the default record but only for dimensions; it is not needed for facts/events
- add the desired/available metadata, based on the input specifics
- make keys and change tracking explicit by adding hash-based keys and diff fields
A general pattern for your staging models
WITH
src_data as ( … ),
default_record as ( … ),
with_default_record as(
SELECT * FROM src_data
UNION ALL
SELECT * FROM default_record
),
hashed as ( … )
SELECT * FROM hashed
This is the final first staging model after following the above goals:
WITH
src_data as (
SELECT
ACCOUNTID as ACCOUNT_CODE -- TEXT
, SYMBOL as SECURITY_CODE -- TEXT
, DESCRIPTION as SECURITY_NAME -- TEXT
, EXCHANGE as EXCHANGE_CODE -- TEXT
, REPORT_DATE as REPORT_DATE -- DATE
, QUANTITY as QUANTITY -- NUMBER
, COST_BASE as COST_BASE -- NUMBER
, POSITION_VALUE as POSITION_VALUE -- NUMBER
, CURRENCY as CURRENCY_CODE -- TEXT
, 'SOURCE_DATA.ABC_BANK_POSITION' as RECORD_SOURCE
FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}
),
hashed as (
SELECT
{{ dbt_utils.surrogate_key(['ACCOUNT_CODE', 'SECURITY_CODE']) }} as POSITION_HKEY
, {{ dbt_utils.surrogate_key(['ACCOUNT_CODE', 'SECURITY_CODE',
'SECURITY_NAME', 'EXCHANGE_CODE', 'REPORT_DATE',
'QUANTITY', 'COST_BASE', 'POSITION_VALUE', 'CURRENCY_CODE'
]) }} as POSITION_HDIFF
, *
, '{{ run_started_at }}'::timestamp as LOAD_TS_UTC
FROM src_data
)
SELECT * FROM hashed
Metadata examples include: load_ts, record_source, extraction_time, creation_time, effective_from
HASH_KEY: This is a single field with all the columns that make up the primary key of the entity chained together;
HASH_DIFF: This is a single field with all the fields that we want to test for changes chained together.
The HASH_DIFF field is a purely technical field that can be considered a version identifier, but be aware that it is perfectly normal for an entity to go back to the same set of values it had at some point in the past. Therefore, it might very well be that we have two equal hash diff values separated in the history table by one or more different ones.
Goals and contents of the refined layer
The goal of the refined layer is clear – take the ‘adapted’ data served from the staging layer and by applying business rules, including combining multiple sources, produce ‘refined’ data that can be used in the data marts to provide useful information for the platform users.
The fact is that ‘business rule’ is a very general term and it can mean something as simple as a sum or product, or it can involve the use of multiple models and quite complicated logic.
Pretty much every operation that changes or combines data to produce other data is a business rule
Types of models in this refined layer:
- intermediate transformation (TR): a catch-all type for models that you create for intermediate calculations that do not represent a proper business concept or that cannot exist by themselves, in the sense that you will not expect other developers to use such a model without knowing the specific caveats
- refined model (REF): the type for models that represent a recognizable business concept, at any level, with the level needing to be clear from the name if it is not for general use. We expect that anyone needing this concept at that level should reuse this model and not build their own
- mapping model (MAP): the type for models that support conversion from one reference system to another, by providing the mapping for keys in one system to the equivalent key in the other system
- Master Data Dimension (MDD) model: this is the type for models that implement dimensions provided directly from the master data management system
- aggregated (AGG) model: this is the type for models that contain aggregated or summarized data
- pivoted (PIVOT) model: This is the type for models that contain pivoted data. This is a specialized model type that you can use if you happen to deal with pivoted data. It is listed here to make clear that you can create new model types according to your needs. The goal should always be to provide clear communication of what the model does
Creating the first data mart
The general plan is to have one data mart for each domain that will use the platform’s data so that we can provide the best dataset for each of them, using the right names, the correct depth and granularity of the data, and making available only the concepts that matter for that domain.
Saving history is crucial
When working with dbt, we have two main avenues to pick up and save the changes to the instances of our entities:
- using the standard functionality of dbt snapshots: this allows us to easily capture changes as a Slowly Changing Dimension of type 2 (SCD2) without writing any code for that purpose, as the logic is implemented by dbt itself using merge or delete/insert statements. If we get reliable full exports from the source, the snapshot can also capture deletions.
This works very well for normal-size datasets (up to millions of rows), and it's the simplest way to go, but it comes with its own cons, so we generally prefer to store changes using insert-only history.
One of the major cons of snapshots is that they are global objects, so you only have one copy and one version active for all environments. Another one is that you cannot preview or test the SELECT query that feeds the data to the snapshot.
The second option is:
- using incremental models to save the new or changed data from each run, applying set-based operations to capture changes in insert-only mode. This is the most effective way to manage and store even huge amounts of data, and it can be tailored to work in all cases, even when you have multiple sources for a single table, such as a near-real-time feed and a periodic control feed.
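We have not built one yet at this point in the book, but the general shape of such an insert-only incremental model looks roughly like this (my sketch; the staging model name and LOAD_TS_UTC column are placeholders):
{{ config(materialized='incremental') }}

SELECT *
FROM {{ ref('STG_SOME_ENTITY') }}  -- placeholder staging model

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already stored
WHERE LOAD_TS_UTC > (SELECT max(LOAD_TS_UTC) FROM {{ this }})
{% endif %}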
Saving history using snapshots
A snapshot is a dbt-managed table that will store the data from a SELECT statement in the form of a slowly changing dimension.
A snapshot is created by the special snapshot block that you create in a .sql file in dbt.
Snapshots are not normal models, in the sense that they are not run with the dbt run command, but they must be run with the specific dbt snapshot command or with the catch-all dbt build command, and they are stored in a snapshots folder.
It is important to note that a snapshot table is a global object, shared by all environments that use the same database, including all developer environments. This poses substantial limitations for teamwork because when you change a snapshot, all the environments sharing the snapshot need to be updated at the same time.
The general layout of a snapshot definition:
{% snapshot snapshot_name %}
{{ config(…) }}
select … from {{ source(…) }} / {{ ref(…) }}
{% endsnapshot %}
Central to the operations of a snapshot are the following concepts:
- identity, or how to distinguish between two different instances of one entity versus two versions of the same instance. Identity is expressed by a unicity constraint that is often called a primary key, natural key, or a similar name.
- a strategy to identify changes – that is, how you decide whether two rows with the same key are the same row or two different versions of that specific instance of the entity. There are two strategies:
- the timestamp strategy, which is simpler and quicker when we have a field like updated_at or a monotonically increasing ID field – in such cases we can just compare the current ID/updated_at to decide
- the check strategy, which is the one that you use when you do not have a good high-watermark indicator, and instead you must look at the data to check whether each row presented by the SELECT statement is the same or a different instance from the most recent one stored in the snapshot under the specified unique key. With this strategy, you must configure the columns to be checked to decide whether two rows are the same instance or not, using the check_cols configuration. You do not have to list all columns from SELECT, but only the ones you want to compare; in any case, all columns from SELECT will be stored in the snapshot.
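For contrast with the check strategy used later in the chapter, a snapshot using the timestamp strategy would look roughly like this (names are placeholders of mine):
{% snapshot SNSH_SOME_ENTITY %}
{{
    config(
      unique_key='SOME_ENTITY_ID',
      strategy='timestamp',
      updated_at='UPDATED_AT'
    )
}}

select * from {{ source('some_source', 'SOME_ENTITY') }}

{% endsnapshot %}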
The intended usage of snapshots is as starting point of your ETL process to capture and store the source data, so that the general ETL would follow the following pattern:
- run the snapshots command to capture the source data – dbt snapshot
- run the models to read the data out of the snapshots and calculate the desired transformations – dbt run
- run the tests to verify the results – dbt test
Running a snapshot will cause the following:
- the creation of a snapshot table from the result of the SELECT statement, which happens the first time that you run the snapshot
- the snapshot table to be updated, with the changes coming from the SELECT statement relating to the contents of the snapshot table, which happens any other time that you run the snapshot after the first
In data processing, a snapshot captures data at a specific point in time. When working with snapshots, we have the option to either read the source data unchanged and apply transformations from the STG model after the snapshot, or we can choose to include the transformations directly in the snapshot. The first option requires re-running the transformations every time we access the data, which can be inefficient. The second option, which involves incorporating the transformations into the snapshot, ensures that the transformations are applied once, and the resulting data is stored in the correct format, including any additional data calculated (like dates extracted from filenames).
However, putting the transformations from the STG model directly into the snapshot can make it more challenging to build and maintain the ELT process. This is because we lose the ability to preview the results of your SELECT query inside the snapshot block, which can make development more cumbersome. This lack of previewing results can create tension with the goal of maintaining easy-to-manage models, especially when the STG layer handles most of the data transformation logic.
One way to address this issue is by declaring the STG model as an ‘ephemeral model’. An ephemeral model doesn’t create any persistent tables or views but instead defines a named query. This allows us to preview the results of the query before including it in the snapshot, making it easier to maintain the ELT process while still ensuring that the snapshot runs efficiently with the necessary transformations.
Layout of a snapshot table
The snapshot table will contain the following metadata fields in addition to the fields produced by the SELECT statement:
- dbt_valid_from
- dbt_valid_to
- dbt_scd_id: a unique key generated for each row in the snapshot. This is internal to dbt
- dbt_updated_at: the timestamp when the row was inserted. This is internal to dbt
Here are some snapshot best practices:
- snapshot source data as the first thing in your ETL, declaring a source as appropriate, and make the rest of the ETL depend only on the snapshot
- include all the source columns in your snapshot, as you cannot go back in time and backfill information that you do not have, and even if you have it, the backfilling of data is a complicated and time-consuming task
- do not add any business logic in the query that feeds data to the snapshot; use only hard rules that do not change the data. If you apply soft business rules and they will change, most probably you will not be able to go back to the original data and apply the new business rule, and even if it’s possible, it would be a complex and slow manual process
- avoid joins in your SELECT query, as it is not always easy to establish when the result of a join is a new version, and even more difficult to absorb the changes into two tables. Snapshot the two tables to be joined independently, and then do the join in the downstream models where you apply the business logic
Multiple ways to take snapshots into use
- a snapshot as a first step: taking a snapshot as the first thing in the flow is what dbt suggests, but it means either preserving the source names by using the trivial SELECT * FROM {{ source(…) }} query or writing a query that is not simple to preview/debug
- a snapshot after the STG model: taking the snapshot out of an STG model solves the maintainability issue, but it is a bit clumsy if you use a normal model, as you first need to run only the STG models for the snapshots, take the snapshot, and then run all the models except the snapshot ones. The clumsiness is removed if you use ephemeral models for the STG models, as you do not need to run them beforehand
Creating a snapshot in the project
The below goes into a .sql file in the snapshots folder:
{% snapshot SNSH_ABC_BANK_POSITION %}
{{
    config(
      unique_key= 'POSITION_HKEY',
      strategy='check',
      check_cols=['POSITION_HDIFF'],
      invalidate_hard_deletes=True,
    )
}}

select * from {{ ref('STG_ABC_BANK_POSITION') }}

{% endsnapshot %}
We use the POSITION_HKEY, which was created in STG_ABC_BANK_POSITION. Here is the data lineage:
Also, before executing dbt snapshot, I dropped the existing STG_ABC_BANK_POSITION view, as otherwise, in the future, people on the project might think it's still connected to the STG model and even use it. And I added {{ config(materialized='ephemeral') }} to the top of the model file to fix the issue of not being able to preview or test that query in isolation - we change the model to ephemeral materialization and use it to feed the content of the STG view to the snapshot.
After running dbt snapshot, in Snowflake we can see:
It’s so cool seeing this outside of Zach Wilson’s bootcamp where I learned about SCD just 3 weeks ago.
Connecting the REF layer with the snapshot
The REF_POSITION_ABC_BANK model that we wrote before is still reading data from the STG model that we made ephemeral. It still works, but it’s now time to make it read from the snapshot so that our ELT will keep working even if the external source is not available.
This is the current one:
SELECT
*,
POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT,
ROUND(UNREALIZED_PROFIT / COST_BASE, 5) * 100 as UNREALIZED_PROFIT_PCT
FROM {{ ref('STG_ABC_BANK_POSITION') }}
If we just change the ref to the snapshot, we would read everything out of the history – that is, all the versions that you encountered in the history of the snapshot which is not what we want.
To report on the current portfolio, we want to read only the active positions.
To achieve this, we need to select from the snapshot only the rows that have not been closed – that is, where the DBT_VALID_TO field is null – as the old versions are the ones with a timestamp in this column.
It should be:
WITH
current_from_snapshot as (
SELECT *
FROM {{ ref('SNSH_ABC_BANK_POSITION') }}
WHERE DBT_VALID_TO is null
)
SELECT
*
, POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT
, ROUND(UNREALIZED_PROFIT / COST_BASE, 5)*100
as UNREALIZED_PROFIT_PCT
FROM current_from_snapshot
All is good:
🥳
Here is the lineage at the end of this chapter:
Chapter 7 - Working with Dimensional Data
The book defines dimensional data as descriptive data that provides human-readable information for an entity, such as the name or country of a customer or account.
Adding dimensional data
Loading the data of the first dimension
Creating and loading a CSV as a seed
We need to configure them in the project yml:
seeds:
  +schema: seed_data
Then we can run dbt seed to load the data.
We can add a column using a post-hook:
seeds:
  +schema: seed_data
  portfolio_tracking:
    ABC_Bank_SECURITY_INFO:
      +post-hook:
        - "UPDATE {{ this }} SET LOAD_TS = '{{ run_started_at }}' WHERE LOAD_TS is null"
      +column_types:
        LOAD_TS: TIMESTAMP
We also add the column name LOAD_TS (with no values) to the seed CSV.
LOAD_TS then becomes the timestamp at which dbt seed was run
(the time zone is 9hrs behind - Europe)
Building the STG model for the first dimension
Creating an STG model for the security dimension
{{ config(materialized='ephemeral') }}
WITH
src_data as (
SELECT
SECURITY_CODE as SECURITY_CODE -- TEXT
, SECURITY_NAME as SECURITY_NAME -- TEXT
, SECTOR as SECTOR_NAME -- TEXT
, INDUSTRY as INDUSTRY_NAME -- TEXT
, COUNTRY as COUNTRY_CODE -- TEXT
, EXCHANGE as EXCHANGE_CODE -- TEXT
, LOAD_TS as LOAD_TS -- TIMESTAMP_NTZ
, 'SEED.ABC_Bank_SECURITY_INFO' as RECORD_SOURCE
FROM {{ source('seeds', 'ABC_Bank_SECURITY_INFO') }}
),
default_record as (
SELECT
'-1' as SECURITY_CODE
, 'Missing' as SECURITY_NAME
, 'Missing' as SECTOR_NAME
, 'Missing' as INDUSTRY_NAME
, '-1' as COUNTRY_CODE
, '-1' as EXCHANGE_CODE
, '2020-01-01' as LOAD_TS_UTC
, 'Missing' as RECORD_SOURCE
),
with_default_record as(
SELECT * FROM src_data
UNION ALL
SELECT * FROM default_record
),
hashed as (
SELECT
concat_ws('|', SECURITY_CODE) as SECURITY_HKEY
, concat_ws('|', SECURITY_CODE, SECURITY_NAME, SECTOR_NAME,
INDUSTRY_NAME, COUNTRY_CODE, EXCHANGE_CODE ) as SECURITY_HDIFF
, * EXCLUDE LOAD_TS
, LOAD_TS as LOAD_TS_UTC
FROM with_default_record
)
SELECT * FROM hashed
It is always important to think about what’s included in the HDIFF key.
We have the default records to help in case there is missing info or incomplete data.
Saving history for the dimensional data
Created SNSH_ABC_BANK_SECURITY_INFO.sql in the snapshots folder.
{% snapshot SNSH_ABC_BANK_SECURITY_INFO %}
{{
    config(
      unique_key= 'SECURITY_HKEY',
      strategy='check',
      check_cols=['SECURITY_HDIFF'],
    )
}}

select * from {{ ref('STG_ABC_BANK_SECURITY_INFO') }}

{% endsnapshot %}
The invalidate_hard_deletes parameter is removed, as generally for dimensions, it is more important to keep old entries than to capture deletions, as these old entries are referenced by old facts. Eventually, we might be interested to know whether an entry is still active or not. In that case, we would enable the parameter to track hard deletions, but then keep both the active and deleted rows in the REF model derived from the snapshot.
Building the REF layer with the dimensional data
Created REF_ABC_BANK_SECURITY_INFO.sql in the models/refined folder.
WITH
current_from_snapshot as (
SELECT * EXCLUDE (DBT_SCD_ID, DBT_UPDATED_AT, DBT_VALID_FROM, DBT_VALID_TO)
FROM {{ ref('SNSH_ABC_BANK_SECURITY_INFO') }}
WHERE DBT_VALID_TO is null
)
SELECT *
FROM current_from_snapshot
Adding the dimensional data to the data mart
Created DIM_SECURITY.sql in the models/marts/portfolio folder.
SELECT * FROM {{ ref('REF_ABC_BANK_SECURITY_INFO') }}
Running dbt build after all this:
That is all for today!
See you tomorrow :)