Hello :) Today is Day 352!
A quick summary of today:
- did some of the SCD type 2 exercises in the DE with dbt book
- learned more about dbt tests
DE with dbt - Chapter 7 exercises
At the end of the chapter there are some optional exercises to add extra dimensions to the model ~ so I did. It was great practice: I got to implement SCD type 2 by myself in dbt and had to think about whether a dimension needs SCD type 2 at all and, if so, which values to track ~
This is the data lineage at the end:
DE with dbt - Chapter 8 - Delivering Consistency in Your Data
Keeping consistency by reusing code – macros
Using macros, we can apply the DRY and KISS software engineering principles: we reduce code duplication and make complex tasks simple, allowing general solutions such as patterns to be widely adopted, because the complexity can be hidden away from most coders.
How to write a macro
In the macros folder, we can write something like:
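For example, here is a minimal macro of my own (not from the book, and the name and dev-limit idea are just my illustration) that shows the general shape of a macro file:

{# macros/limit_in_dev.sql – illustrative sketch, not the book's example #}
{% macro limit_in_dev(row_count = 100) %}
    {# Only limit the rows when we are running against the dev target #}
    {% if target.name == 'dev' %}
        LIMIT {{ row_count }}
    {% endif %}
{% endmacro %}

It can then be called at the end of a model's SQL as {{ limit_in_dev() }}.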
The macro is run before the query is sent to the database: dbt executes the logic in the macro to produce the final SQL text, and that text is what gets sent to the database.
Refactoring the ‘current’ CTE into a macro
Currently it is:
WITH
current_from_snapshot as (
    SELECT *
    FROM {{ ref('SNSH_ABC_BANK_POSITION') }}
    WHERE DBT_VALID_TO is null
)
SELECT
    *
    , POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT
    , ROUND(UNREALIZED_PROFIT / COST_BASE, 5) * 100 as UNREALIZED_PROFIT_PCT
FROM current_from_snapshot
To make this into a macro, I created a current_from_snapshot.sql file in the macros folder.
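I am not copying the file verbatim here; a minimal version of the macro, assuming the snapshot relation is passed in as a parameter (the parameter name snsh_ref is my choice), looks roughly like this:

{% macro current_from_snapshot(snsh_ref) %}
    SELECT *
    FROM {{ snsh_ref }}
    WHERE DBT_VALID_TO is null
{% endmacro %}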
And in the files where we use the current_from_snapshot CTE pattern (like in the SQL snippet above), we replace the CTE body with a call to the macro:
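Roughly like this (same query as above, with the CTE body replaced by the macro call):

WITH
current_from_snapshot as (
    {{ current_from_snapshot(snsh_ref = ref('SNSH_ABC_BANK_POSITION')) }}
)
SELECT
    *
    , POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT
    , ROUND(UNREALIZED_PROFIT / COST_BASE, 5) * 100 as UNREALIZED_PROFIT_PCT
FROM current_from_snapshot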
Fixing data loaded from our CSV file
Because of formatting, the report date in one of the tables is messed up (ignore the first 3 lines for now):
We can create a macro:
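I am not reproducing the book's macro verbatim; the idea is a small macro that wraps the broken column in a fixing expression. A sketch, where the macro name and the century-fix logic are my reconstruction of what was wrong with the CSV dates, could be (Snowflake SQL):

{% macro to_21st_century_date(column_name) %}
    {# If the year was read as a two-digit year (e.g. 0021), shift the date into the 21st century #}
    CASE
        WHEN year({{ column_name }}) < 100
            THEN DATEADD(year, 2000, {{ column_name }})
        ELSE {{ column_name }}
    END
{% endmacro %}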
And apply it to the column we want to fix:
` as REPORT_DATE`
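With the hypothetical macro name from above, the full column expression in the model would look something like:

, {{ to_21st_century_date('REPORT_DATE') }} as REPORT_DATE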
And then when we run dbt build on that table and its downstream models, we can see it got fixed. In the snapshot we also see 3 new rows where the REPORT_DATE is correct, and the 3 previously incorrect REPORT_DATE rows now have a value in DBT_VALID_TO, because they were updated and REPORT_DATE is one of the columns we keep track of.
Building on the shoulders of giants – dbt packages
dbt_utils package
expression_is_true
: this is a model-level test that asserts that the SQL predicate (an expression evaluating to true or false) that you provide is true for all the records in the table
not_empty_string
: a column-level test that verifies that there are no empty strings (‘’) in a column, after optionally trimming the content of the column
Example:
- name: column_name
  tests:
    - not_null
    - dbt_utils.not_empty_string
mutually_exclusive_ranges
: a model-level macro that tests that the rows create correct date ranges when partitioned by a column, allowing you to verify that ranges do not overlap, have or don’t have gaps, and that the range size is greater than zero
get_column_values
: this macro loads the values from a specific column of a model as a Python list, without having to deal with Agate tables
get_query_results_as_dict
: this macro loads the results from a SQL query as a Python dictionary so that you do not have to deal with Agate tables
date_spine
: generates the SQL code to generate a continuous sequence of dates in an interval, according to a desired granularity (day, month, etc.)
union_relations
: allows you to unite two or more relations (tables or views), aligning fields with the same name and adding null columns for the missing ones, making it simple to UNION tables that are similar but do not have the same columns in the same order
Codegen package
- the package provides a macro to generate the YAML for a source file, importing the relations from a schema in the selected database (first command in the sketch below)
- another macro can be used to generate the YAML to describe and test a model (second command in the sketch below)
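For reference, the two codegen macros involved are generate_source and generate_model_yaml; invoked as run-operations they look something like this (the schema, database, and model names are placeholders from my project, not from the book text):

dbt run-operation generate_source --args '{"schema_name": "SOURCE_DATA", "database_name": "ABC_BANK", "generate_columns": true}'
dbt run-operation generate_model_yaml --args '{"model_names": ["POSITION_ABC_BANK"]}'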
dbt_expectations package
This package extends the testing abilities of dbt by defining many test macros, inspired by the Great Expectations data testing package for Python
The available tests go from simple ones such as expect_table_row_count_to_equal_other_table
to more complex ones that can test the values of your data, such as expect_column_values_to_be_within_n_moving_stdevs
The dbt_external_tables package
This package allows you to create external tables as sources in a database-independent way, by configuring the tables that you want in YAML
The metrics package
This package, created by dbt Labs, uses the metrics definition introduced in dbt v1.0.0 to produce data tables, providing data according to the metrics definition.
This package allows you to materialize the metrics that you have defined and also take advantage of secondary calculations, such as period over period, period to date, rolling, and so on
dbt_vault package
This is the first dbt package to help build a data platform according to the Data Vault 2.0 (DV2) principles and is produced by Data Vault
datavault4dbt package
This is a package to build a data platform according to the DV2 principles
dbt Model Usage package
This dbt package provides tests to let you know whether your models are still relevant to your users. These tests scan your database’s query logs to check whether users are still selecting from the tables that dbt produces. Test failures can let you know when it might be time to retire unused models.
dbt_profiler package
This package provides macros to profile the content of your relations, providing many measures for each column, such as the proportion of not null, min, max, and the number of distinct values.
dbt_artifacts package
This dbt package builds a data mart to analyze the executions and usage of dbt components, such as models, tests, and exposures, ultimately describing the dbt project where you install the package
DE with dbt - Chapter 9 - Delivering Reliability in Your Data
Testing to provide reliability
Tests are going to be your sleep-well-at-night superpower.
Types of tests
- singular test: used in a single case; generally only a few of these tests are used to cover the rare cases not covered by the available generic tests
- generic tests: built by writing (or reusing) a parametrized SQL query and passing it the required arguments; you will get most tests from packages, as in general you will be testing over and over the same kind of assumptions that other people also test in other projects
Defining a generic test
To define a generic test, or to override one of the out-of-the-box ones, you can write a test Jinja block, containing the SQL for your test in a SQL file.
Test blocks are very similar to macros, as generic tests used to be macros with a name starting with test_, and they still work pretty much the same as macros – defining a function that takes arguments and whose result is evaluated as a test result.
The SQL for generic tests must be written to return the rows that fail the test, using the same logic as singular tests
Generic tests are applied using YAML files and must always define at least two arguments, model and column_name, which dbt uses to automatically pass to each test invocation the reference to the resource and, when present, the name of the column on which the test is applied in the YAML file.
Since dbt v1.0.0, we can place these generic tests in either the macros folder or tests/generic
This is a generic test pattern to follow:
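Something like this (my own minimal example, not the book's):

{% test not_negative(model, column_name) %}
    -- A generic test must return the rows that FAIL the test:
    -- here, any row where the tested column is negative
    SELECT *
    FROM {{ model }}
    WHERE {{ column_name }} < 0
{% endtest %}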
Writing a generic test
In the project, we use hash values as keys and to detect when the content of a row has changed. The probability of a hash collision – that is, two different sets of values producing the same hash – is really minuscule, so it is very safe to use hashes for such purposes.
We will write a generic test that allows us to check whether we get unlucky and have a collision, so we can manage it and re-hash the table to avoid missing even a single row of data.
In the macros/tests folder, I created a no_hash_collisions.sql file with:
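I am paraphrasing from memory rather than copying the file, but the logic is: for each hash key, count the distinct combinations of the hashed fields and return the keys that map to more than one combination (a collision). Assuming hashed_fields is passed as a comma-separated list of column names, the test looks roughly like this:

{% test no_hash_collisions(model, column_name, hashed_fields) %}
    WITH all_tuples as (
        SELECT DISTINCT {{ column_name }} as hash_key, {{ hashed_fields }}
        FROM {{ model }}
    ),
    validation_errors as (
        -- a hash key appearing with more than one distinct tuple is a collision
        SELECT hash_key
        FROM all_tuples
        GROUP BY hash_key
        HAVING count(*) > 1
    )
    SELECT * FROM validation_errors
{% endtest %}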
Then in the snapshots folder, I created a snapshots.yml file for the tests:
version: 2
snapshots:
  - name: SNSH_ABC_BANK_POSITION
    columns:
      - name: POSITION_HKEY
        tests:
          - no_hash_collisions:
              hashed_fields: ""
(the hashed_fields param takes the same fields we use to create the hash key in the model itself, so we pass them all here)
The model and column_name parameters are managed automatically by dbt, so we do not need to pass them to the test, but we need to provide a value for any other parameter of the generic test that does not have a default value, such as hashed_fields
Finally, we can run the test with dbt test -s SNSH_ABC_BANK_POSITION
Test configuration pattern in YAML
- <test_name>:
    <argument_name>: <argument_value>
    config:
      where: <string>
      severity: error | warn
      error_if: <string>
      warn_if: <string>
      fail_calc: <string>
      store_failures: true | false
      limit: <integer>
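As a made-up instance of this pattern (the where filter and thresholds are just placeholders, not from the book):

- not_null:
    config:
      where: "REPORT_DATE >= '2020-01-01'"
      severity: warn
      warn_if: ">0"
      error_if: ">100"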
Testing the right things in the right places
What do we test?
- The Primary Key (PK) and the Business Key (BK): Getting the PK/BK right is essential to be able to identify an entity and store their history of changes. They enable us to join facts and dimensions, and the BK is essential to be able to identify the same entity across different systems
- Foreign Keys (FKs): FKs are the essential element of relational systems, as they store the relations between tables and entities by pointing to the PK/BK of another table that completes the information. In the case of a many-to-many relationship, the FK of each side is stored in the mapping table that defines the relation, instead of being directly inside a table
- Business rules (BRs): BRs come in many flavors and are the spice that gives most of the value to our data platform. It is impossible to make a complete list of BR types, as they can range from calculations to ways of connecting entities to ways of formatting data
- Measures: Measures are numeric types of data that can be used for calculations. Some measures are extremely important, as they are needed for calculations, and their absence can make data almost useless, such as a sales order without the amount sold
- Categories: Categories are textual data that is not purely descriptive, such as a name or any free text, but is used to group or somehow organize data, such as a type of product, the industry of a customer, or any value that our users use to slice reports but that is not a full-fledged dimension
- Descriptive data: Everything else can pretty much fall under the descriptive data category, which contains things such as names, freeform text, and generally, all kinds of textual data that does not fall into the other category types
Where to test what?
The golden rules are:
- test your expectations about inputs on sources or as soon as possible
- test your expectations about outputs when you deliver them
- in the intermediate transformations of the ELT pipelines, test the important things that might create a lot of problems or that you do not totally trust to always work as planned, such as BRs or complex joins
Tests on the sources (input layer):
- test PKs and BKs to be not null and unique
- test FKs to check that mandatory one-to-many relationships are not null
- test relationships to check that they don’t have an orphan FK
- test the source freshness
Tests on the storage layer – the STG and HIST models:
- test that the PK and BK (that you could not test on sources) are not null and unique
- test that HKEY and HDIFF have no hash collision
- create all the tests that you would have done in the source layer but that were impractical because they needed some data transformation
Tests on the refined layer:
- test the results of the BRs that calculate values
- test the PK/FK after risky or complex joins (to implement the BRs)
Tests on the delivery layer:
- test that the dimensions are well-formed
- test that facts are well-formed
- test the values in facts and dimensions
- test that well-known expected output is produced correctly
- run statistical, quantitative tests to ensure that data is flowing and in a reasonable range
Testing our models to ensure good quality
Testing the uniqueness of composite keys
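The usual tool here is the unique_combination_of_columns test from dbt_utils. Applied to a hypothetical fact (the model and column names below are mine, not the book's), the YAML looks like:

version: 2
models:
  - name: FACT_POSITION
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - ACCOUNT_CODE
            - SECURITY_CODE
            - REPORT_DATE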
Exercise for myself
This is not in the book - but I will make an exercise for myself to add tests to all models, because at the moment only 2 have tests on them.
DE with dbt - Chapter 10 - Agile Development
Agile development and collaboration
Manifesto for Agile Software Development:
We are uncovering better ways of developing software
by doing it and helping others do it.
Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right,
we value the items on the left more.
Applying agile to data engineering
- agile is not improvising. Having a plan is important, as is the understanding that any plan is not set in stone and will keep changing
- pick a direction in collaboration with the people who know the domain (aka business) and the ones who know the technology
- an important corollary to the previous point is to improve the detail of your goal while you go
- a general goal is to keep your plan as flexible as possible, taking decisions (especially the ones that are difficult to change) at the last responsible moment so that the decision can be taken using as much knowledge and experience as possible
- verify the biggest questions at the beginning so that the risk of a big change of direction decreases as you move forward
- try to cut the work into small work increments, often referred to as stories, that can be completed with little risk of having to stop them halfway
- don’t try boiling the ocean or getting lost in too many different things. Try to focus the team on the most important topics of the moment so that they can collaborate on the solution
- start completing and stop starting – focus on closing open work before you start more work
- agility is not only about flexibility but also about avoiding waste and removing things that are outdated or of little or uncertain value
Building reports in an agile way
S1 – designing a light data model for the data mart
- keep the model simple
S2 – designing a light data model for the REF layer
- define REF model from sources
- dimension enrichment
- connecting to a fact a snowflaked dimension
- the application of master data
- definition of the composite fact
S3.x – developing the pipeline for the XYZ table with dbt models
Once the S1 and S2 stories have been completed, or are at least at an advanced stage of completion, you will have a list of tables that you need to build in the refined and delivery layers.
S4 – an acceptance test of the data produced in the data mart
Two approaches:
- automated acceptance/integration tests
- manual acceptance test
S5 – development and verification of the report in the BI application
This is the final story, where you will use the data mart verified in S4 to build the report using the BI application.
The starting point is the data model for the data mart designed in the S1 story, which was developed with the BI tool needs in mind and will, therefore, provide the correct model for your BI tool, whether it is a star schema or a wide table.
DE with dbt - Chapter 11 - Team Collaboration
Core collaboration practices
- common code ownership
- clean code and technical excellence
- trust, but challenge
- an open door policy
- pairing
- code walkthroughs
- PRs and code reviews
Working with Git in dbt Cloud
- using the cloud’s git functionality
- proper branch naming conventions
- adopting frequent releases
DE with dbt - Chapter 12 - Deployment, Execution, and Documentation Automation
Creating our deployment automation
This part was about creating dev (QA) and prod environments in the ‘Deploy’ page of dbt Cloud and executing jobs on either ~
I needed to run dbt seed because one of the dimensions comes directly from a seed CSV that did not exist yet in the target schema, so I was getting an error when running only dbt build.
So I made the job run like this:
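I am not pasting the job settings here, but the commands boil down to running the seed step before the build:

dbt seed
dbt build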
Everything in the staging folder is materialised as a view, so I need to load the CSV directly so that downstream tasks can run. Nice little error that I faced only because I did not do SCD type 2 on all dimensions. 🥳
Advanced automation – hooks and run-operations
Hooks
The available hooks are as follows:
- pre-hook: the SQL will be run before building the dbt node it is attached to
- post-hook: the SQL will be run after building the dbt node it is attached to
- on-run-start: the SQL will be run at the beginning of the dbt command
- on-run-end: the SQL will be run at the end of the dbt command
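As a small illustration (not from the book; the grant statement, role name, and the referenced model are placeholders), a post-hook can be attached to a model through its config block:

{{ config(
    materialized = 'table',
    post_hook = ["GRANT SELECT ON {{ this }} TO ROLE REPORTING_ROLE"]
) }}
SELECT * FROM {{ ref('STG_ABC_BANK_POSITION') }}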
While all hooks can access the context available in the model, the on-run-end hook has some extra context variables:
- the schemas variable contains the list of schemas that contain models that have been built by dbt in the current run
- the database_schemas variable is like the schemas variable but contains tuples (database, schema) in place of schema names
- the results variable contains a list with a result object for each node that has been built by dbt in the current run. This corresponds to the data that will populate the result’s JSON artifact
Run-operations
Allows you to invoke a macro from the command line or as part of a job
Some examples:
- data ingestion: using macros to configure external tables, or to ingest data from files residing in data lakes
- migrations: perform the evolutionary changes to your database objects that are not handled by dbt, such as dropping unused tables or managing individual fields
- post-run operations: once transformations are done, there are extra things that we might want to do, such as managing grants, writing audit information, profiling tables, producing extra documentation, and so on
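Invoking one from the command line looks like this (the macro name and arguments are hypothetical; the general form is dbt run-operation <macro> --args '<yaml/json dict>'):

dbt run-operation drop_unused_table --args '{"schema_name": "REF_LAYER", "table_name": "OLD_REF_TABLE"}'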
Table migrations
dbt manages most of the DDL and DML commands needed to create and maintain the schemas, tables, and views, but in some cases we might want to do things that dbt does not do for us
The most common things:
- deleting tables and views that are not in use anymore
- evolving incremental tables
- performing bulk operations on data
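A sketch of such a migration macro (again hypothetical, not the book's code), meant to be called with the dbt run-operation command shown earlier, could be:

{% macro drop_unused_table(schema_name, table_name) %}
    {# Build and run the DROP statement, then log what was done #}
    {% set sql %}
        DROP TABLE IF EXISTS {{ target.database }}.{{ schema_name }}.{{ table_name }}
    {% endset %}
    {% do run_query(sql) %}
    {{ log("Dropped table " ~ schema_name ~ "." ~ table_name, info=True) }}
{% endmacro %}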
That is all for today!
See you tomorrow :)