Hello :) Today is Day 351!
A quick summary of today:
- this book is amazing!!! learned a lot about using dbt for SCD
- my repo for the code from today
Some of the Jinja got mangled by my blog's formatting when I first posted this, so the snippets below are the cleaned-up versions.
Data Engineering with dbt - Chapter 1/2 - The Basics of SQL to Transform Data/Setting up Snowflake and dbt cloud
The book starts by introducing SQL, its usage, and some key terms like tables, queries, DDL, etc.
Then I created a Snowflake account so that I could follow the book's examples.
To set up users, a warehouse, and a database, I ran the queries below:
/**
* Chapter 1 - Section 3: Set up initial users, roles and database in Snowflake
*/
-- ** Create and grant your first role
USE ROLE USERADMIN;
CREATE ROLE DBT_EXECUTOR_ROLE
COMMENT = 'Role for the users running DBT models';
--DROP ROLE DBT_EXECUTOR_ROLE; -- if you want to drop and redo ;)
GRANT ROLE DBT_EXECUTOR_ROLE TO USER divakaivan;
SHOW GRANTS TO USER divakaivan;
-- ** Grant the privileges to create a database
USE ROLE SYSADMIN;
GRANT CREATE DATABASE ON ACCOUNT
TO ROLE DBT_EXECUTOR_ROLE;
GRANT USAGE ON WAREHOUSE COMPUTE_WH
TO ROLE DBT_EXECUTOR_ROLE;
SHOW GRANTS TO ROLE DBT_EXECUTOR_ROLE;
--SHOW GRANTS TO ROLE SYSADMIN;
-- ** Create your first database
USE ROLE DBT_EXECUTOR_ROLE;
CREATE DATABASE DATA_ENG_DBT;
SHOW Databases;
-- configuring the default warehouse
USE ROLE SYSADMIN;
ALTER WAREHOUSE "COMPUTE_WH" SET
WAREHOUSE_SIZE = 'XSMALL'
AUTO_SUSPEND = 60
AUTO_RESUME = TRUE
COMMENT = 'Default Warehouse';
-- ** Create a user for the dbt application
CREATE USER IF NOT EXISTS DBT_EXECUTOR
COMMENT = 'User running DBT commands'
PASSWORD = 'pick_a_password'
DEFAULT_WAREHOUSE = 'COMPUTE_WH'
DEFAULT_ROLE = 'DBT_EXECUTOR_ROLE';
-- assign the executor role to the dbt user
GRANT ROLE DBT_EXECUTOR_ROLE TO USER DBT_EXECUTOR;
-- Switch back to the operational role we should always be working with
USE ROLE DBT_EXECUTOR_ROLE;
Next - query syntax
The full code is here but it’s basically just basic SQL query examples.
Some examples:
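The repo has the actual queries; here is just a representative one of my own (not the book's) against the same TPCH sample data used further below - select some columns, filter, sort, limit:
SELECT
    O_ORDERKEY,
    O_CUSTKEY,
    O_ORDERDATE,
    O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
WHERE O_TOTALPRICE > 400000        -- filter with a predicate
ORDER BY O_TOTALPRICE DESC         -- sort the result
LIMIT 10;                          -- keep only the first 10 rows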
Next - some JOIN clauses
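Again, a representative example of mine rather than the book's exact one - joining ORDERS to CUSTOMER on the customer key:
SELECT
    C.C_NAME,
    O.O_ORDERKEY,
    O.O_ORDERDATE,
    O.O_TOTALPRICE
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS" AS O
JOIN "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."CUSTOMER" AS C
    ON C.C_CUSTKEY = O.O_CUSTKEY   -- inner join on the customer key
ORDER BY O.O_TOTALPRICE DESC
LIMIT 10;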
Finally - some window functions
SELECT
O_ORDERKEY,
O_CUSTKEY,
O_ORDERDATE,
O_TOTALPRICE,
avg(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_avg,
sum(O_TOTALPRICE) over(partition by O_ORDERDATE) as daily_total,
sum(O_TOTALPRICE) over(partition by
DATE_TRUNC(MONTH, O_ORDERDATE)) as monthly_total,
O_TOTALPRICE / daily_avg * 100 as avg_pct,
O_TOTALPRICE / daily_total * 100 as day_pct,
O_TOTALPRICE / monthly_total * 100 as month_pct
FROM "SNOWFLAKE_SAMPLE_DATA"."TPCH_SF1"."ORDERS"
QUALIFY row_number() over(partition by O_ORDERDATE
order by O_TOTALPRICE DESC) <= 5
order by O_ORDERDATE, O_TOTALPRICE desc;
Setting up dbt cloud and connecting it to Snowflake
Faced some issues here with providing the exact credentials, but thankfully the dbt docs had my back.
Chapter 3 - Data Modeling for Data Engineering
Data modeling
The building blocks of a data model:
- entity: a concept that we want to analyze, a thing of interest that we want to collect data about, such as a car, an order, or a page visit
- attribute: a property of an entity for which we want to store values, such as a car plate, a customer, or the total amount of a transaction
- relationship: a connection between two entities that are related to each other and captures how they are related, such as the ownership relation between a car and its owner, or the purchasing relation between a customer and a product
Data modeling is all about conveying information so that those that work with the modeled data can better understand it and each other.
Data modeling is generally used in two main contexts:
- to describe what data should be generated or collected by an application or in a project
In this case, we build the data model before having the data.
- to describe some data that already exists
In this case, we reverse-engineer the data model from the available data.
A good practice is to explicitly design this new data model before coding the transformations, drawing at least one model that represents the business concepts that are the goal of the integration and will form the refined targets of our transformations. This model can be at a high level of abstraction, describing the business and only the main entities, or at a greater level of detail, already representing the technical design.
Conceptual, logical, and physical data models
- conceptual data model: this is the most abstract model, defining what will be in the domain of the project, providing the general scope
- logical data model: provides much greater detail, defining what the data will look like
- physical data model: the most detailed model, describing exactly how the data will be stored in the database
Entity-Relationship modeling
- entity: represented as a box with the name of the entity inside
If attributes are shown, the entity name is at the top and often in bold or bigger and visually separated from the attributes
- attribute: represented by the attribute’s name inside the box of the entity it belongs, with one name per line
- relationship: represented by a line joining the entities involved in the relationship and optionally the name of the relationship close to the line
Cardinality
- 1-to-1
- 1-to-many
- many-to-many
Time perspective
The time perspective significantly influences E-R models, though it is often an implicit and underexplained factor. E-R models capture constraints or business rules of a system, and the system’s time perspective affects entity semantics, applicable rules, and the model’s structure.
For operational applications, relationships like Driver-Taxi are often viewed as one-to-one at a specific time (e.g. a taxi needs a driver, and a driver can drive one taxi at a time). However, both entities make sense independently when inactive (a taxi not in use or a driver on a break). Time-based entities (shifts or contracts) are sometimes embedded to track relationships over time.
In data engineering projects, a lifetime perspective is typically adopted, showing relationships over the lifespan of entities. For example, over time, a driver may drive multiple taxis, and a taxi may have multiple drivers, resulting in a many-to-many relationship. Such relationships require an intermediate table (e.g., Work Period) to store the periods when changes occur.
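A minimal sketch of how that could look as tables - my own illustration with made-up names, where the Work Period bridge table resolves the many-to-many over time:
CREATE TABLE DRIVER (
    DRIVER_CODE TEXT PRIMARY KEY,
    DRIVER_NAME TEXT
);
CREATE TABLE TAXI (
    TAXI_PLATE TEXT PRIMARY KEY,
    TAXI_MODEL TEXT
);
-- one row per period during which a given driver drove a given taxi
CREATE TABLE WORK_PERIOD (
    DRIVER_CODE TEXT REFERENCES DRIVER (DRIVER_CODE),
    TAXI_PLATE  TEXT REFERENCES TAXI (TAXI_PLATE),
    VALID_FROM  DATE,
    VALID_TO    DATE            -- null while the period is still open
);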
Modeling use cases and patterns
Header-detail use case
As an example, it is used in invoices and orders, and almost anywhere else where there is one document with multiple lines of detail inside. The common document info goes into the header entity with the detail lines represented by one or more weak entities depending on the header. A weak entity is an entity that does not have any meaning without the header entity, such as an order line without an order header.
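In SQL terms the pattern could be sketched like this (table and column names are mine, just for illustration):
CREATE TABLE ORDER_HEADER (
    ORDER_ID     NUMBER PRIMARY KEY,
    CUSTOMER_ID  NUMBER,
    ORDER_DATE   DATE
);
-- weak entity: an order line has no meaning without its order header
CREATE TABLE ORDER_LINE (
    ORDER_ID     NUMBER REFERENCES ORDER_HEADER (ORDER_ID),
    LINE_NUMBER  NUMBER,
    PRODUCT_ID   NUMBER,
    QUANTITY     NUMBER,
    LINE_AMOUNT  NUMBER(38,2),
    PRIMARY KEY (ORDER_ID, LINE_NUMBER)   -- its identity depends on the header key
);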
Hierarchical relationships
Hierarchical relationships are very common and generally are a special case of one-to-many relationships. They represent data structures that look like a tree.
The common pattern is that one instance depends on another higher/bigger instance of the same entity, unless it is itself one of the instances at the highest/biggest level.
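The usual way to store such a tree is a self-referencing foreign key plus a recursive query to walk it - a sketch of my own, not from the book:
CREATE TABLE CATEGORY (
    CATEGORY_ID        NUMBER PRIMARY KEY,
    CATEGORY_NAME      TEXT,
    PARENT_CATEGORY_ID NUMBER REFERENCES CATEGORY (CATEGORY_ID)  -- null for the top level
);
-- walk the tree top-down with a recursive CTE
WITH RECURSIVE category_tree AS (
    SELECT CATEGORY_ID, CATEGORY_NAME, 1 AS DEPTH
    FROM CATEGORY
    WHERE PARENT_CATEGORY_ID IS NULL
    UNION ALL
    SELECT c.CATEGORY_ID, c.CATEGORY_NAME, t.DEPTH + 1
    FROM CATEGORY c
    JOIN category_tree t ON c.PARENT_CATEGORY_ID = t.CATEGORY_ID
)
SELECT * FROM category_tree;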
Forecasts and actuals
Budgeting and forecasting involve predicting figures and comparing them with actuals. A one-to-many relationship exists, with one actual linked to multiple forecasts. Reporting models often generalize Figures into subclasses (actuals and forecasts) and use a single table to simplify comparisons. Actuals are copied for each forecast using joins, relying on reliable forecast dates and business logic.
Common problems in data models
Fan trap
The fan trap is a very common problem that can happen every time you have a join in a one-to-many relationship. It is not a problem of the relationship, but of how you might use it.
The fan trap problem causes the calculations done on measures joined from the one side of the one-to-many relationship to be wrong.
This is only a problem if you use a measure that is on the one side of a one-to-many relationship, when grouping/working at the granularity of the entities on the many side. This happens because the join will duplicate the measures to match the cardinality of the many side.
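A quick illustration, reusing the hypothetical ORDER_HEADER / ORDER_LINE tables sketched earlier and pretending the header also carries an ORDER_TOTAL measure:
-- fan trap: ORDER_TOTAL is repeated once per matching order line,
-- so it gets summed as many times as the order has lines
SELECT
    h.CUSTOMER_ID,
    SUM(h.ORDER_TOTAL) AS INFLATED_TOTAL
FROM ORDER_HEADER h
JOIN ORDER_LINE   l ON l.ORDER_ID = h.ORDER_ID
GROUP BY h.CUSTOMER_ID;

-- safe version: aggregate the header measure at its own granularity,
-- without dragging in the many side at all
SELECT
    h.CUSTOMER_ID,
    SUM(h.ORDER_TOTAL) AS CORRECT_TOTAL
FROM ORDER_HEADER h
GROUP BY h.CUSTOMER_ID;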
Chasm trap
A chasm trap is an unmanaged and often unintentional many-to-many relationship. Because it is unmanaged, it becomes an unintended partial Cartesian product: the rows on one side will be cross-joined with the occurrences of the same key on the other side, de facto making a per-key Cartesian product.
Chasm trap and a correct many-to-many relationship model
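To make the difference concrete (made-up tables again): joining two fact-like tables directly on a shared key cross-joins their rows per key, while a managed many-to-many goes through an explicit mapping table:
-- chasm trap: every order of a customer gets paired with every delivery
-- of that same customer - a per-key Cartesian product
SELECT o.ORDER_ID, d.DELIVERY_ID
FROM ORDERS     o
JOIN DELIVERIES d ON d.CUSTOMER_ID = o.CUSTOMER_ID;

-- managed many-to-many: a mapping table states which deliveries
-- belong to which orders, so only the real pairs survive
SELECT o.ORDER_ID, d.DELIVERY_ID
FROM ORDERS             o
JOIN ORDER_DELIVERY_MAP m ON m.ORDER_ID    = o.ORDER_ID
JOIN DELIVERIES         d ON d.DELIVERY_ID = m.DELIVERY_ID;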
Modeling styles and architectures
Kimball method or dimensional modeling or star schema
Limitations:
- measures in dimensions: while Kimball advises against including measures in dimensions to avoid fan traps, practical needs like storing important metrics (e.g., price in a product dimension) often necessitate this, requiring caution to handle fan traps correctly
- no fact-to-fact connections: Kimball schemas restrict facts to only connect with dimensions, which limits analyses involving multiple facts (e.g. orders and deliveries). Alternative approaches like USS or custom SQL are needed for such scenarios
- limited enterprise view: the Kimball approach focuses on reporting specific business processes rather than providing a comprehensive enterprise data view, which may not meet all reporting needs
- business rules dependency: the loading and storage processes are tightly integrated with business rules, as conformed dimensions and facts serve both as transformation goals and primary data storage
Unified Star Schema
The USS (Unified Star Schema) is an extension of the Kimball star schema that addresses some of its limitations, such as the fan trap and the inability to query more than one fact at a time without the risk of a chasm trap.
Inmon design style
The Inmon design style, created by Bill Inmon, emphasizes building an enterprise-wide, integrated, and interoperable data warehouse as a single source of truth for an organization. Its core idea is to model business concepts (entities and processes) and normalize the data into Third Normal Form (3NF) to minimize redundancy and capture complex relationships.
Advantages
- resilience: adaptable to changes in business processes or source systems.
- comprehensive: captures rich concepts and relationships for broad reporting capabilities
Drawbacks
- complexity: requires highly skilled modelers and significant management commitment
- auditability: original data is hard to trace due to heavy transformations
- effort-intensive: maintenance and updates for new reporting needs are resource-heavy
Data vault
DV architecture:
Data mesh
Data mesh is a modern data architecture approach aimed at solving data ownership issues by decentralizing data management. Instead of a centralized data platform, it organizes data into domain-specific platforms, each owned and maintained by business units familiar with the data’s semantics and use cases. Inspired by microservice architecture and DevOps practices, data mesh emphasizes ‘data products’ — well-defined interfaces, applications, or data marts enabling seamless interaction across domains.
While it aligns business and IT goals, improves data quality, and empowers teams with end-to-end data ownership, it introduces complexity and requires advanced skills in managing distributed systems and APIs. Adoption is challenging in less tech-savvy industries, but insights like treating data as products and assigning data ownership to business units can still be applied. Tools like dbt facilitate collaboration and documentation in this framework.
The Pragmatic Data Platform - PDP
PDP combines the best aspects of various data modeling styles and architectures to create a modern, efficient data platform while minimizing complexity and expertise requirements.
PDP adopts key principles from Data Vault (DV) and Inmon, such as:
- separating ingestion from business rules
- using insert-only ingestion of full history
- organizing data around business concepts and clear business key (BK) definitions
However, it avoids DV modeling due to its complexity, accepting a trade-off of reduced resilience for simplicity. Data delivery is typically through data marts with star schemas but can also include wide, denormalized tables for specific use cases like ML/AI.
PDP emphasizes practical solutions, balancing advanced techniques (like using hashes) with accessibility to a broader audience. This pragmatic approach leverages the full capabilities of dbt while simplifying implementation.
Chapter 4 - Analytics Engineering as the New Core of Data Engineering
The data life cycle
- data creation
- data movement and storage
- data transformation
- access to reports
- data write back
Understanding the modern data stack
Traditional
- ETL
- limitations of OLTP systems
- rigid transformation logic
- data gaps
- siloed workflows
- DQ issues
Modern
- analytical data warehouses in the cloud, such as BigQuery, Redshift, and Snowflake, are more and more common and becoming an affordable option for most data teams
- data movement/integration tools, such as Stitch and Fivetran, make it very easy to move the data from traditional databases and common business platforms to the cloud data warehouse
- modern BI tools start to provide improved abilities for self-service reporting that power users can exploit
The roles in the modern data stack and the analytics engineer
- data engineer: the ‘traditional’ DE, armed with scripting/coding and infrastructure knowledge, is still very much valuable, especially if not all your sources can be covered by modern data-loading tools. Even in that case, complex scheduling, security, and a myriad of other topics might require the expertise and focus of a data engineer
- analytics engineer: a new figure that is firmly centered around knowing and transforming raw data into useful information. They are less concerned with data movement pipeline development and are instead focused on making the data available, usable, and trusted
- data analyst: this is an improved version of the traditional data analyst, as now they can understand and even help with the construction of data models and certainly have an unprecedented ability to investigate a wider array of datasets
DataOps – software engineering best practices for data
- version control
- quality assurance
- modularity of the code base
- dev vs prod environments
- designing for maintainability
Chapter 5 - Transforming Data with dbt
Layers of a modern data platform
The implementation effort and cost over time in a warehousing project
The project
The book will guide me through building a simple stock tracking platform.
The initial data model and glossary
This sample project is about being able to track the value of investments spread across multiple portfolios and analyze them in the ways we prefer. We will start simple, just tracking the current value of the portfolio, and then add more functionalities along the next sections and chapters.
The following conceptual data model illustrates the main entities and their relations, as you could gather from the initial explanation of the use case.
- Position: a position tracks the amount of a specific security held in a portfolio and comes with a few measures besides the amount of stock owned, such as the cost of the security held in the position
- Portfolio: a portfolio is a container of positions, usually associated with a trading or custody account at a broker
- Security: a security is an investment instrument that can be bought and sold on a specific market. Securities exist in multiple types, including bonds, stock, options, and many others
- Quote: a security quote is a set of values, such as a bid and an ask price, that describe the market offering for the stock at a moment in time or a set of values that describe the market offering for the stock during an interval of time:
- Daily quote: a daily quote is a set of values, such as open, close, max, and min prices, that describe the market offering for the stock during a trading day.
- Calendar: the calendar represents the time dimension at the desired resolution, often a day, providing useful precalculated attributes for all the items in the calendar
Setup
- my dbt project repo
- snowflake db
USE ROLE DBT_EXECUTOR_ROLE;
CREATE DATABASE PORTFOLIO_TRACKING
COMMENT = 'DB for the portfolio tracking project';
- set up dbt cloud
Defining data sources and providing reference data
Set up Snowflake:
USE ROLE DBT_EXECUTOR_ROLE;
-- ** 1 ** Create the schema for the source data
CREATE SCHEMA PORTFOLIO_TRACKING.SOURCE_DATA;
-- ** 2 ** Create the landing table for ABC_BANK data
CREATE OR REPLACE TABLE PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_POSITION (
accountID TEXT,
symbol TEXT,
description TEXT,
exchange TEXT,
report_date DATE,
quantity NUMBER(38,0),
cost_base NUMBER(38,5),
position_value NUMBER(38,5),
currency TEXT
);
-- ** 3 ** Create the file format to load the data for ABC_BANK
CREATE FILE FORMAT "PORTFOLIO_TRACKING"."SOURCE_DATA".ABC_BANK_CSV_FILE_FORMAT
TYPE = 'CSV'
COMPRESSION = 'AUTO'
FIELD_DELIMITER = ','
RECORD_DELIMITER = '\n'
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
TRIM_SPACE = FALSE
ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
ESCAPE = 'NONE'
ESCAPE_UNENCLOSED_FIELD = '\134'
DATE_FORMAT = 'AUTO'
TIMESTAMP_FORMAT = 'AUTO'
NULL_IF = ('\\N')
;
Next, I loaded the given CSV file into the landing table manually.
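For reference, the same load could also be scripted with a stage and COPY INTO instead of going through the UI - a rough sketch with a made-up stage and file name:
-- hypothetical internal stage and file name; I actually loaded the CSV via the web UI
CREATE STAGE IF NOT EXISTS PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_STAGE;
-- PUT runs from a local client such as SnowSQL, not from the web worksheet:
-- PUT file:///path/to/ABC_Bank_POSITION.csv @PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_STAGE;
COPY INTO PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_POSITION
FROM @PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_STAGE/ABC_Bank_POSITION.csv
FILE_FORMAT = (FORMAT_NAME = 'PORTFOLIO_TRACKING.SOURCE_DATA.ABC_BANK_CSV_FILE_FORMAT');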
Next, creating a source for the table we just loaded:
version: 2

sources:
  - name: abc_bank
    database: PORTFOLIO_TRACKING
    schema: SOURCE_DATA
    tables:
      - name: ABC_BANK_POSITION
And ~
It works ~ the database, schema, table names, connections are good :)
Writing a basic model
SELECT
*,
POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT,
ROUND(UNREALIZED_PROFIT / COST_BASE, 5)*100 as UNREALIZED_PROFIT_PCT
FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}
And it appeared in Snowflake after dbt run:
Finally the dbt_project.yml is updated to include the different layers of the architecture:
models:
  portfolio_tracking:
    +materialized: view
    staging: # Data storage layer
      +materialized: view
      +schema: STAGING
    refined: # Refined data layer
      +materialized: view
      +schema: REFINED
    marts: # Data delivery layer
      portfolio: # one specific data mart
        +materialized: table
        +schema: MART_PORTFOLIO

snapshots:
  portfolio_tracking:
    +target_schema: SNAPSHOTS
Generating views or tables for the different layers is mostly a performance choice, as they are functionally equivalent. Tables are slower to create and take up space on storage, but they are much quicker to query.
Therefore, it makes sense to materialize models that are queried multiple times during the ELT as tables so that the calculations are performed only once, as well as for models queried by end users in interactive mode so that the query executes in the shortest time possible.
We configured the refined layer to be generated as tables; that is a sensible choice when we do not have a data mart layer yet and the refined layer is also our delivery layer.
The delivery layer is usually materialized as tables, but if a BI tool that stores the data from your data mart layer internally is used, such as Power BI, it is a sensible choice to generate the layer as views, because each view will be queried only once to load the data into the tool.
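On the same note, if a single model ever needs a materialization different from its folder default, it can be overridden in the model's own file - a minimal sketch using the source we already declared:
{{ config(materialized='table') }}

SELECT * FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}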
Some tests
version: 2

sources:
  - name: abc_bank
    database: PORTFOLIO_TRACKING
    schema: SOURCE_DATA
    tables:
      - name: ABC_BANK_POSITION
        description: The landing table holding the data imported from the CSV extracted by ABC Bank.
        columns:
          - name: ACCOUNTID
            description: The code for the account at the ABC Bank.
            tests:
              - not_null
          - name: SYMBOL
            description: The symbol of the security for the ABC Bank.
            tests:
              - unique
              - not_null
          - name: DESCRIPTION
            description: The name of the security at the ABC Bank.
            tests:
              - not_null
          - name: EXCHANGE
            description: The short name or code of the exchange where the security is traded.
            tests:
              - not_null
          - name: REPORT_DATE
            description: >
              The date of the portfolio report extracted by ABC Bank.
              We consider this position to be effective from this date forward, until a change is seen.
            tests:
              - not_null
          - name: QUANTITY
            description: The number of securities we hold in the portfolio at ABC Bank at the Report Date.
            tests:
              - not_null
          - name: COST_BASE
            description: The cost we paid attributable to the securities we hold in the portfolio at ABC Bank at the Report Date.
            tests:
              - not_null
          - name: POSITION_VALUE
            description: The value of the securities we hold in the portfolio at ABC Bank according to market quotation on the Report Date.
            tests:
              - not_null
          - name: CURRENCY
            description: The currency for the monetary amounts of the position.
            tests:
              - not_null
Added a test for the 2 new columns in the POSITION_ABC_BANK table:
version: 2

models:
  - name: POSITION_ABC_BANK
    description: The positions we have in the ABC Bank portfolio.
    columns:
      - name: UNREALIZED_PROFIT
        description: The unrealized profit on the position.
        tests:
          - not_null
      - name: UNREALIZED_PROFIT_PCT
        description: >
          The unrealized profit percentage on the position.
        tests:
          - not_null
Docs
And of course finally, we can see the amazing docs generated
Chapter 6 - Writing maintainable code
Writing code for humans
The other most important things that we can do to keep complexity in check are
- reduce the need to re-read, applying the Least Surprise Principle (LSP)
- make the code that you write easy to re-read/understand
- keep the models that we write focused and limited in dimension
- keep the dependencies ordered and going only in one direction, from one layer to the next in the architecture
- when you see an improvement, refactor your models to keep them current
Creating the architectural layers
Creating the Staging layer
Goals and contents of the staging models: to adapt external data to how we want to see and use it in our project. This includes preparing it for change tracking and for storage in the most usable way for our future needs.
- pick the desired columns, usually all of them
- rename the columns to follow our conventions and use our project-designated names
- manage data types, performing conversions to desired data types
- apply “hard rules” with non-destructive transformations, such as time-zone conversions
- add the default record but only for dimensions; it is not needed for facts/events
- add the desired/available metadata, based on the input specifics
- make keys and change tracking explicit by adding hash-based keys and diff fields
A general pattern for your staging models
WITH
src_data as ( … ),
default_record as ( … ),
with_default_record as(
SELECT * FROM src_data
UNION ALL
SELECT * FROM default_record
),
hashed as ( … )
SELECT * FROM hashed
This is the final first staging model after following the above goals:
WITH
src_data as (
SELECT
ACCOUNTID as ACCOUNT_CODE -- TEXT
, SYMBOL as SECURITY_CODE -- TEXT
, DESCRIPTION as SECURITY_NAME -- TEXT
, EXCHANGE as EXCHANGE_CODE -- TEXT
, REPORT_DATE as REPORT_DATE -- DATE
, QUANTITY as QUANTITY -- NUMBER
, COST_BASE as COST_BASE -- NUMBER
, POSITION_VALUE as POSITION_VALUE -- NUMBER
, CURRENCY as CURRENCY_CODE -- TEXT
, 'SOURCE_DATA.ABC_BANK_POSITION' as RECORD_SOURCE
FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}
),
hashed as (
SELECT
{{ dbt_utils.surrogate_key(['ACCOUNT_CODE', 'SECURITY_CODE']) }} as POSITION_HKEY
, {{ dbt_utils.surrogate_key(['ACCOUNT_CODE', 'SECURITY_CODE',
'SECURITY_NAME', 'EXCHANGE_CODE', 'REPORT_DATE',
'QUANTITY', 'COST_BASE', 'POSITION_VALUE', 'CURRENCY_CODE'
]) }} as POSITION_HDIFF
, *
, '{{ run_started_at }}'::timestamp as LOAD_TS_UTC
FROM src_data
)
SELECT * FROM hashed
Metadata examples include: load_ts, record_source, extraction_time, creation_time, effective_from
HASH_KEY: This is a single field with all the columns that make up the primary key of the entity chained together;
HASH_DIFF: This is a single field with all the fields that we want to test for changes chained together.
The HASH_DIFF field is a purely technical field that can be considered a version identifier, but be aware that it is perfectly normal for an entity to go back to the same set of values it had at some point in the past. Therefore, it might very well be that we have two equal hash diff values separated in the history table by one or more different ones.
Goals and contents of the refined layer
The goal of the refined layer is clear – take the ‘adapted’ data served from the staging layer and by applying business rules, including combining multiple sources, produce ‘refined’ data that can be used in the data marts to provide useful information for the platform users.
The fact is that ‘business rule’ is a very general term and it can mean something as simple as a sum or product, or it can involve the use of multiple models and quite complicated logic.
Pretty much every operation that changes or combines data to produce other data is a business rule
Types of models in this refined layer:
- intermediate transformation (TR): a catch-all type for models that you create for intermediate calculations that do not represent a proper business concept or that cannot exist by themselves, in the sense that you will not expect other developers to use such a model without knowing the specific caveats
- refined model (REF): the type for models that represent a recognizable business concept, at any level, with the level needing to be clear from the name if it is not for general use. We expect that anyone needing this concept at that level should reuse this model and not build their own
- mapping model (MAP): the type for models that support conversion from one reference system to another, by providing the mapping for keys in one system to the equivalent key in the other system
- Master Data Dimension (MDD) model: this is the type for models that implement dimensions provided directly from the master data management system
- aggregated (AGG) model: this is the type for models that contain aggregated or summarized data
- pivoted (PIVOT) model: This is the type for models that contain pivoted data. This is a specialized model type that you can use if you happen to deal with pivoted data. It is listed here to make clear that you can create new model types according to your needs. The goal should always be to provide clear communication of what the model does
Creating the first data mart
The general plan is to have one data mart for each domain that will use the platform’s data so that we can provide the best dataset for each of them, using the right names, the correct depth and granularity of the data, and making available only the concepts that matter for that domain.
Saving history is crucial
When working with dbt, we have two main avenues to pick up and save the changes to the instances of our entities:
- using the standard functionality of dbt snapshots: this allows us to easily capture changes as a Slowly Changing Dimension of type 2 (SCD2) without writing any code for that purpose, as the logic is implemented by dbt itself using merge or delete/insert statements. If we get reliable full exports from the source, the snapshot can also capture deletions.
This works very well for normal-size datasets (up to millions of rows), and it's the simplest way to go, but it comes with its own cons, so we generally prefer to store changes using insert-only history.
One of the major cons of snapshots is that they are global objects, so you only have one copy and one version active for all environments. Another one is that you cannot preview or test the SELECT query that feeds the data to the snapshot.
The second option is:
- using incremental models to save the new or changed data from each run, applying set-based operations to capture changes in insert-only mode. This is the most effective way to manage and store even huge amounts of data, and it can be tailored to work in all cases, even when you have multiple sources for a single table, such as a near-real-time feed and a periodic control feed.
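We have not built one yet at this point in the book, but the general shape of such an insert-only incremental model looks roughly like this (my sketch; the staging model name and LOAD_TS_UTC column are placeholders):
{{ config(materialized='incremental') }}

SELECT *
FROM {{ ref('STG_SOME_ENTITY') }}  -- placeholder staging model

{% if is_incremental() %}
-- on incremental runs, only pick up rows newer than what is already stored
WHERE LOAD_TS_UTC > (SELECT max(LOAD_TS_UTC) FROM {{ this }})
{% endif %}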
Saving history using snapshots
A snapshot is a dbt-managed table that will store the data from a SELECT statement in the form of a slowly changing dimension.
A snapshot is created by the special snapshot block that you create in a .sql file in dbt.
Snapshots are not normal models, in the sense that they are not run with the dbt run command, but they must be run with the specific dbt snapshot command or with the catch-all dbt build command, and they are stored in a snapshots folder.
It is important to note that a snapshot table is a global object, shared by all environments that use the same database, including all developer environments. This poses substantial limitations for teamwork because when you change a snapshot, all the environments sharing the snapshot need to be updated at the same time.
The general layout of a snapshot definition:
{% snapshot snapshot_name %}
{{ config(…) }}
select … from {{ source(…) }} / {{ ref(…) }}
{% endsnapshot %}
Central to the operations of a snapshot are the following concepts:
- identity, or how to distinguish between two different instances of one entity versus two versions of the same instance. Identity is expressed by a unicity constraint that is often called a primary key, natural key, or a similar name.
- a strategy to identify changes – that is, how you decide whether two rows with the same key are the same row or two different versions of that specific instance of the entity. There are two strategies:
- the timestamp strategy, which is simpler and quicker when we have a field like updated_at or a monotonically increasing ID field – in such cases we can just compare the current ID/updated_at to decide
- the check strategy, which is the one that you use when you do not have a good high-watermark indicator, and instead you must look at the data to check whether each row presented by the SELECT statement is the same or a different instance from the most recent one stored in the snapshot under the specified unique key. With this strategy, you must configure the columns to be checked to decide whether two rows are the same instance or not, using the check_cols configuration. You do not have to list all columns from SELECT, but only the ones you want to compare; in any case, all columns from SELECT will be stored in the snapshot.
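For contrast with the check strategy used later in the chapter, a snapshot using the timestamp strategy would look roughly like this (names are placeholders of mine):
{% snapshot SNSH_SOME_ENTITY %}
{{
    config(
      unique_key='SOME_ENTITY_ID',
      strategy='timestamp',
      updated_at='UPDATED_AT'
    )
}}

select * from {{ source('some_source', 'SOME_ENTITY') }}

{% endsnapshot %}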
The intended usage of snapshots is as starting point of your ETL process to capture and store the source data, so that the general ETL would follow the following pattern:
- run the snapshots command to capture the source data – dbt snapshot
- run the models to read the data out of the snapshots and calculate the desired transformations – dbt run
- run the tests to verify the results – dbt test
Running a snapshot will cause the following:
- the creation of a snapshot table from the result of the SELECT statement, which happens the first time that you run the snapshot
- the snapshot table to be updated, with the changes coming from the SELECT statement relating to the contents of the snapshot table, which happens any other time that you run the snapshot after the first
In data processing, a snapshot captures data at a specific point in time. When working with snapshots, we have the option to either read the source data unchanged and apply transformations from the STG model after the snapshot, or we can choose to include the transformations directly in the snapshot. The first option requires re-running the transformations every time we access the data, which can be inefficient. The second option, which involves incorporating the transformations into the snapshot, ensures that the transformations are applied once, and the resulting data is stored in the correct format, including any additional data calculated (like dates extracted from filenames).
However, putting the transformations from the STG model directly into the snapshot can make it more challenging to build and maintain the ELT process. This is because we lose the ability to preview the results of your SELECT query inside the snapshot block, which can make development more cumbersome. This lack of previewing results can create tension with the goal of maintaining easy-to-manage models, especially when the STG layer handles most of the data transformation logic.
One way to address this issue is by declaring the STG model as an ‘ephemeral model’. An ephemeral model doesn’t create any persistent tables or views but instead defines a named query. This allows us to preview the results of the query before including it in the snapshot, making it easier to maintain the ELT process while still ensuring that the snapshot runs efficiently with the necessary transformations.
Layout of a snapshot table
The snapshot table will contain the following metadata fields in addition to the fields produced by the SELECT statement:
- dbt_valid_from
- dbt_valid_to
- dbt_scd_id: a unique key generated for each row in the snapshot. This is internal to dbt
- dbt_updated_at: the timestamp when the row was inserted. This is internal to dbt
Here are some snapshot best practices:
- snapshot source data as the first thing in your ETL, declaring a source as appropriate, and make the rest of the ETL depend only on the snapshot
- include all the source columns in your snapshot, as you cannot go back in time and backfill information that you do not have, and even if you have it, the backfilling of data is a complicated and time-consuming task
- do not add any business logic in the query that feeds data to the snapshot; use only hard rules that do not change the data. If you apply soft business rules and they will change, most probably you will not be able to go back to the original data and apply the new business rule, and even if it’s possible, it would be a complex and slow manual process
- avoid joins in your SELECT query, as it is not always easy to establish when the result of a join is a new version, and even more difficult to absorb the changes into two tables. Snapshot the two tables to be joined independently, and then do the join in the downstream models where you apply the business logic
Multiple ways to take snapshots into use
- a snapshot as a first step: taking a snapshot as the first thing in the flow is what dbt suggests, but it means either preserving the source names by using the trivial SELECT * FROM {{ source(…) }} query or writing a query that is not simple to preview/debug
- a snapshot after the STG model: taking the snapshot out of an STG model solves the maintainability issue, but it is a bit clumsy if you use a normal model, as you first need to run only the STG models for the snapshots, take the snapshot, and then run all the models except the snapshot ones. The clumsiness is removed if you use ephemeral models for the STG models, as you do not need to run them beforehand
Creating a snapshot in the project
The below goes into a .sql file in the snapshots folder:
{% snapshot SNSH_ABC_BANK_POSITION %}
{{
    config(
      unique_key= 'POSITION_HKEY',
      strategy='check',
      check_cols=['POSITION_HDIFF'],
      invalidate_hard_deletes=True,
    )
}}

select * from {{ ref('STG_ABC_BANK_POSITION') }}

{% endsnapshot %}
We use the POSITION_HKEY, which was created in STG_ABC_BANK_POSITION. Here is the data lineage:
Also, before executing dbt snapshot, I dropped the existing STG_ABC_BANK_POSITION view, as otherwise, in the future, people on the project might think it's still connected to the STG model and even use it. And I added {{ config(materialized='ephemeral') }} to the top of the model file to fix the issue of not being able to preview or test that query in isolation - we change the model to ephemeral materialization and use it to feed the content of the STG view to the snapshot.
After running dbt snapshot, in Snowflake we can see:
It’s so cool seeing this outside of Zach Wilson’s bootcamp where I learned about SCD just 3 weeks ago.
Connecting the REF layer with the snapshot
The REF_POSITION_ABC_BANK model that we wrote before is still reading data from the STG model that we made ephemeral. It still works, but it’s now time to make it read from the snapshot so that our ELT will keep working even if the external source is not available.
This is the current one:
SELECT
*,
POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT,
ROUND(UNREALIZED_PROFIT / COST_BASE, 5) * 100 as UNREALIZED_PROFIT_PCT
FROM {{ ref('STG_ABC_BANK_POSITION') }}
If we just change the ref to the snapshot, we would read everything out of the history – that is, all the versions that you encountered in the history of the snapshot which is not what we want.
To report on the current portfolio, we want to read only the active positions.
To achieve this, we need to select from the snapshot only the rows that have not been closed – that is, where the DBT_VALID_TO field is null – as the old versions are the ones with a timestamp in this column.
It should be:
WITH
current_from_snapshot as (
SELECT *
FROM {{ ref('SNSH_ABC_BANK_POSITION') }}
WHERE DBT_VALID_TO is null
)
SELECT
*
, POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT
, ROUND(UNREALIZED_PROFIT / COST_BASE, 5)*100
as UNREALIZED_PROFIT_PCT
FROM current_from_snapshot
All is good:
🥳
Here is the lineage at the end of this chapter:
Chapter 7 - Working with Dimensional Data
The book defines dimensional data as descriptive data that provides human-readable information for an entity, such as the name or country of a customer or account.
Adding dimensional data
Loading the data of the first dimension
Creating and loading a CSV as a seed
We need to configure them in the project yml:
seeds:
  +schema: seed_data
Then we can run dbt seed to load the data.
We can add a column using a post-hook:
seeds:
  +schema: seed_data
  portfolio_tracking:
    ABC_Bank_SECURITY_INFO:
      +post-hook:
        - "UPDATE {{ this }} SET LOAD_TS = '{{ run_started_at }}' WHERE LOAD_TS is null"
      +column_types:
        LOAD_TS: TIMESTAMP
We also add the column name LOAD_TS (with no values) to the seed CSV.
LOAD_TS then becomes the timestamp at which dbt seed was run
(the time zone is 9hrs behind - Europe)
Building the STG model for the first dimension
Creating an STG model for the security dimension
{{ config(materialized='ephemeral') }}
WITH
src_data as (
SELECT
SECURITY_CODE as SECURITY_CODE -- TEXT
, SECURITY_NAME as SECURITY_NAME -- TEXT
, SECTOR as SECTOR_NAME -- TEXT
, INDUSTRY as INDUSTRY_NAME -- TEXT
, COUNTRY as COUNTRY_CODE -- TEXT
, EXCHANGE as EXCHANGE_CODE -- TEXT
, LOAD_TS as LOAD_TS -- TIMESTAMP_NTZ
, 'SEED.ABC_Bank_SECURITY_INFO' as RECORD_SOURCE
FROM {{ source('seeds', 'ABC_Bank_SECURITY_INFO') }}
),
default_record as (
SELECT
'-1' as SECURITY_CODE
, 'Missing' as SECURITY_NAME
, 'Missing' as SECTOR_NAME
, 'Missing' as INDUSTRY_NAME
, '-1' as COUNTRY_CODE
, '-1' as EXCHANGE_CODE
, '2020-01-01' as LOAD_TS_UTC
, 'Missing' as RECORD_SOURCE
),
with_default_record as(
SELECT * FROM src_data
UNION ALL
SELECT * FROM default_record
),
hashed as (
SELECT
concat_ws('|', SECURITY_CODE) as SECURITY_HKEY
, concat_ws('|', SECURITY_CODE, SECURITY_NAME, SECTOR_NAME,
INDUSTRY_NAME, COUNTRY_CODE, EXCHANGE_CODE ) as SECURITY_HDIFF
, * EXCLUDE LOAD_TS
, LOAD_TS as LOAD_TS_UTC
FROM with_default_record
)
SELECT * FROM hashed
It is always important to think about what’s included in the HDIFF key.
We have the default records to help in case there is missing info or incomplete data.
Saving history for the dimensional data
Created SNSH_ABC_BANK_SECURITY_INFO.sql in the snapshots folder.
{% snapshot SNSH_ABC_BANK_SECURITY_INFO %}
{{
    config(
      unique_key= 'SECURITY_HKEY',
      strategy='check',
      check_cols=['SECURITY_HDIFF'],
    )
}}

select * from {{ ref('STG_ABC_BANK_SECURITY_INFO') }}

{% endsnapshot %}
The invalidate_hard_deletes parameter is removed, as generally for dimensions, it is more important to keep old entries than to capture deletions, as these old entries are referenced by old facts. Eventually, we might be interested to know whether an entry is still active or not. In that case, we would enable the parameter to track hard deletions, but then keep both the active and deleted rows in the REF model derived from the snapshot.
Building the REF layer with the dimensional data
Created REF_ABC_BANK_SECURITY_INFO.sql in the models/refined folder.
WITH
current_from_snapshot as (
SELECT * EXCLUDE (DBT_SCD_ID, DBT_UPDATED_AT, DBT_VALID_FROM, DBT_VALID_TO)
FROM {{ ref('SNSH_ABC_BANK_SECURITY_INFO') }}
WHERE DBT_VALID_TO is null
)
SELECT *
FROM current_from_snapshot
Adding the dimensional data to the data mart
Created DIM_SECURITY.sql in the models/marts/portfolio folder.
SELECT * FROM {{ ref('REF_ABC_BANK_SECURITY_INFO') }}
Running dbt build after all this:
That is all for today!
See you tomorrow :)