Hello :) Today is Day 352!
A quick summary of today:
- did some of the SCD type 2 exercises in the DE with dbt book
- learned more about dbt tests
DE with dbt - Chapter 7 exercises
At the end of the chapter there are some optional exercises to add extra dimensions to the model ~ so I did. It was great practice: I got to implement SCD type 2 by myself in dbt and had to think about whether a dimension needs SCD type 2 at all and, if so, which values to track ~
This is the data lineage at the end:
DE with dbt - Chapter 8 - Delivering Consistency in Your Data
Keeping consistency by reusing code – macros
Using macros, we can apply the DRY and KISS software engineering principles: we reduce code duplication and make complex tasks simple, allowing general solutions such as patterns to be widely adopted, because the complexity can be hidden away from most coders.
How to write a macro
In the macros folder, we can write something like:
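For example, here is a minimal macro of my own (not from the book, and the name and dev-limit idea are just my illustration) that shows the general shape of a macro file:

{# macros/limit_in_dev.sql – illustrative sketch, not the book's example #}
{% macro limit_in_dev(row_count = 100) %}
    {# Only limit the rows when we are running against the dev target #}
    {% if target.name == 'dev' %}
        LIMIT {{ row_count }}
    {% endif %}
{% endmacro %}

It can then be called at the end of a model's SQL as {{ limit_in_dev() }}.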
The macro is run before the query is sent to the database: dbt executes the logic in the macro to produce the final SQL text, and that text is what gets sent to the database.
Refactoring the ‘current’ CTE into a macro
Currently it is:
WITH
current_from_snapshot as (
    SELECT *
    FROM {{ ref('SNSH_ABC_BANK_POSITION') }}
    WHERE DBT_VALID_TO is null
)
SELECT
    *
    , POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT
    , ROUND(UNREALIZED_PROFIT / COST_BASE, 5) * 100 as UNREALIZED_PROFIT_PCT
FROM current_from_snapshot
To make this into a macro, I created a current_from_snapshot.sql file in the macros folder.
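I am not copying the file verbatim here; a minimal version of the macro, assuming the snapshot relation is passed in as a parameter (the parameter name snsh_ref is my choice), looks roughly like this:

{% macro current_from_snapshot(snsh_ref) %}
    SELECT *
    FROM {{ snsh_ref }}
    WHERE DBT_VALID_TO is null
{% endmacro %}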
And in the files where we use the current_from_snapshot CTE pattern (like in the SQL snippet above), we replace the CTE body with a call to the macro:
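Roughly like this (same query as above, with the CTE body replaced by the macro call):

WITH
current_from_snapshot as (
    {{ current_from_snapshot(snsh_ref = ref('SNSH_ABC_BANK_POSITION')) }}
)
SELECT
    *
    , POSITION_VALUE - COST_BASE as UNREALIZED_PROFIT
    , ROUND(UNREALIZED_PROFIT / COST_BASE, 5) * 100 as UNREALIZED_PROFIT_PCT
FROM current_from_snapshot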
Fixing data loaded from our CSV file
Because of formatting, the report date in one of the tables is messed up (ignore the first 3 lines for now):
We can create a macro:
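I am not reproducing the book's macro verbatim; the idea is a small macro that wraps the broken column in a fixing expression. A sketch, where the macro name and the century-fix logic are my reconstruction of what was wrong with the CSV dates, could be (Snowflake SQL):

{% macro to_21st_century_date(column_name) %}
    {# If the year was read as a two-digit year (e.g. 0021), shift the date into the 21st century #}
    CASE
        WHEN year({{ column_name }}) < 100
            THEN DATEADD(year, 2000, {{ column_name }})
        ELSE {{ column_name }}
    END
{% endmacro %}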
And apply it to the column we want to fix:
` as REPORT_DATE`
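With the hypothetical macro name from above, the full column expression in the model would look something like:

, {{ to_21st_century_date('REPORT_DATE') }} as REPORT_DATE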
And then when we run dbt build on that table and its downstream models, we can see it got fixed. In the snapshot we also see 3 new rows where the REPORT_DATE is correct, and the 3 previously incorrect REPORT_DATE rows now have a value in DBT_VALID_TO, because they were updated and REPORT_DATE is one of the columns we keep track of.
Building on the shoulders of giants – dbt packages
dbt_utils package
expression_is_true
: this is a model-level test that asserts that the SQL predicate (an expression evaluating to true or false) that you provide is true for all the records in the table
not_empty_string
: a column-level test that verifies that there are no empty strings (‘’) in a column, after optionally trimming the content of the column
Example:
- name: column_name
  tests:
    - not_null
    - dbt_utils.not_empty_string
mutually_exclusive_ranges
: a model-level macro that tests that the rows create correct date ranges when partitioned by a column, allowing you to verify that ranges do not overlap, have or don’t have gaps, and that the range size is greater than zero
get_column_values
: this macro loads the values from a specific column of a model as a Python list, without having to deal with Agate tables
get_query_results_as_dict
: this macro loads the results from a SQL query as a Python dictionary so that you do not have to deal with Agate tables
date_spine
: generates the SQL code to generate a continuous sequence of dates in an interval, according to a desired granularity (day, month, etc.)
union_relations
: allows you to unite two or more relations (tables or views), aligning fields with the same name and adding null columns for the missing ones, making it simple to UNION tables that are similar but do not have the same columns in the same order
Codegen package
- the package provides a macro to generate the YAML for a source file, importing the relations from a schema in the selected database (first command in the sketch below)
- another macro can be used to generate the YAML to describe and test a model (second command in the sketch below)
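For reference, the two codegen macros involved are generate_source and generate_model_yaml; invoked as run-operations they look something like this (the schema, database, and model names are placeholders from my project, not from the book text):

dbt run-operation generate_source --args '{"schema_name": "SOURCE_DATA", "database_name": "ABC_BANK", "generate_columns": true}'
dbt run-operation generate_model_yaml --args '{"model_names": ["POSITION_ABC_BANK"]}'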
dbt_expectations package
This package extends the testing abilities of dbt by defining many test macros, inspired by the Great Expectations data testing package for Python
The available tests go from simple ones such as expect_table_row_count_to_equal_other_table
to more complex ones that can test the values of your data, such as expect_column_values_to_be_within_n_moving_stdevs
The dbt_external_tables package
This package allows you to create external tables as sources in a database-independent way, by configuring the tables that you want in YAML
The metrics package
This package, created by dbt Labs, uses the metrics definition introduced in dbt v1.0.0 to produce data tables, providing data according to the metrics definition.
This package allows you to materialize the metrics that you have defined and also take advantage of secondary calculations, such as period over period, period to date, rolling, and so on
dbt_vault package
This is the first dbt package to help build a data platform according to the Data Vault 2.0 (DV2) principles and is produced by Data Vault
datavault4dbt package
This is a package to build a data platform according to the DV2 principles
dbt Model Usage package
This dbt package provides tests to let you know whether your models are still relevant to your users. These tests scan your database’s query logs to check whether users are still selecting from the tables that dbt produces. Test failures can let you know when it might be time to retire unused models.
dbt_profiler package
This package provides macros to profile the content of your relations, providing many measures for each column, such as the proportion of not null, min, max, and the number of distinct values.
dbt_artifacts package
This dbt package builds a data mart to analyze the executions and usage of dbt components, such as models, tests, and exposures, ultimately describing the dbt project where you install the package
DE with dbt - Chapter 9 - Delivering Reliability in Your Data
Testing to provide reliability
Tests are going to be your sleep-well-at-night superpower.
Types of tests
- singular test: used in a single case; generally only a few of these tests are used to cover the rare cases not covered by the available generic tests
- generic tests: built by writing (or reusing) a parametrized SQL query and passing it the required arguments; you will get most tests from packages, as in general you will be testing over and over the same kind of assumptions that other people also test in other projects
Defining a generic test
To define a generic test, or to override one of the out-of-the-box ones, you can write a test Jinja block, containing the SQL for your test in a SQL file.
Test blocks are very similar to macros, as generic tests used to be macros with a name starting with test_, and they still work pretty much the same as macros – defining a function that takes arguments and whose result is evaluated as a test result.
The SQL for generic tests must be written to return the rows that fail the test, using the same logic as singular tests
Generic tests are applied using YAML files and must always define at least two arguments, model and column_name, which dbt uses to automatically pass to each test invocation the reference to the resource and, when present, the name of the column on which the test is applied in the YAML file.
Since dbt v1.0.0, we can place these generic tests in either the macros folder or tests/generic
This is a generic test pattern to follow:
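Something like this (my own minimal example, not the book's):

{% test not_negative(model, column_name) %}
    -- A generic test must return the rows that FAIL the test:
    -- here, any row where the tested column is negative
    SELECT *
    FROM {{ model }}
    WHERE {{ column_name }} < 0
{% endtest %}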
Writing a generic test
In the project, we use hash values as keys and to detect when the content of a row has changed. The probability of a hash collision – that is, two different sets of values producing the same hash – is really minuscule, so it is very safe to use hashes for such purposes.
We will write a generic test that allows us to check whether we get unlucky and have a collision, so we can manage it and re-hash the table to avoid missing even a single row of data.
In the macros/tests folder, I created a no_hash_collisions.sql file with:
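I am paraphrasing from memory rather than copying the file, but the logic is: for each hash key, count the distinct combinations of the hashed fields and return the keys that map to more than one combination (a collision). Assuming hashed_fields is passed as a comma-separated list of column names, the test looks roughly like this:

{% test no_hash_collisions(model, column_name, hashed_fields) %}
    WITH all_tuples as (
        SELECT DISTINCT {{ column_name }} as hash_key, {{ hashed_fields }}
        FROM {{ model }}
    ),
    validation_errors as (
        -- a hash key appearing with more than one distinct tuple is a collision
        SELECT hash_key
        FROM all_tuples
        GROUP BY hash_key
        HAVING count(*) > 1
    )
    SELECT * FROM validation_errors
{% endtest %}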
Then in the snapshots folder, I created a snapshots.yml file for the tests:
version: 2
snapshots:
  - name: SNSH_ABC_BANK_POSITION
    columns:
      - name: POSITION_HKEY
        tests:
          - no_hash_collisions:
              hashed_fields: ""
(the hashed_fields param takes the same fields we use to create the hash key in the model itself, so we pass them all here)
The model and column_name parameters are managed automatically by dbt, so we do not need to pass them to the test, but we need to provide a value for any other parameter of the generic test that does not have a default value, such as hashed_fields
Finally, we can run the test with dbt test -s SNSH_ABC_BANK_POSITION
Test configuration pattern in YAML
- <test_name>:
    <argument_name>: <argument_value>
    config:
      where: <string>
      severity: error | warn
      error_if: <string>
      warn_if: <string>
      fail_calc: <string>
      store_failures: true | false
      limit: <integer>
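As a made-up instance of this pattern (the where filter and thresholds are just placeholders, not from the book):

- not_null:
    config:
      where: "REPORT_DATE >= '2020-01-01'"
      severity: warn
      warn_if: ">0"
      error_if: ">100"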
Testing the right things in the right places
What do we test?
- The Primary Key (PK) and the Business Key (BK): Getting the PK/BK right is essential to be able to identify an entity and store their history of changes. They enable us to join facts and dimensions, and the BK is essential to be able to identify the same entity across different systems
- Foreign Keys (FKs): FKs are the essential element of relational systems, as they store the relations between tables and entities by pointing to the PK/BK of another table that completes the information. In the case of a many-to-many relationship, the FK of each side is stored in the mapping table that defines the relation, instead of being directly inside a table
- Business rules (BRs): BRs come in many flavors and are the spice that gives most of the value to our data platform. It is impossible to make a complete list of BR types, as they can range from calculations to ways of connecting entities to ways of formatting data
- Measures: Measures are numeric types of data that can be used for calculations. Some measures are extremely important, as they are needed for calculations, and their absence can make data almost useless, such as a sales order without the amount sold
- Categories: Categories are textual data that is not purely descriptive, such as a name or any free text, but is used to group or somehow organize data, such as a type of product, the industry of a customer, or any value that our users use to slice reports but that is not a full-fledged dimension
- Descriptive data: Everything else can pretty much fall under the descriptive data category, which contains things such as names, freeform text, and generally, all kinds of textual data that does not fall into the other category types
Where to test what?
The golden rules are:
- test your expectations about inputs on sources or as soon as possible
- test your expectations about outputs when you deliver them
- in the intermediate transformations of the ELT pipelines, test the important things that might create a lot of problems or that you do not totally trust to always work as planned, such as BRs or complex joins
Tests on the sources (input layer):
- test PKs and BKs to be not null and unique
- test FKs to check that mandatory one-to-many relationships are not null
- test relationships to check that they don’t have an orphan FK
- test the source freshness
Tests on the storage layer – the STG and HIST models:
- test that the PK and BK (that you could not test on sources) are not null and unique
- test that HKEY and HDIFF have no hash collision
- create all the tests that you would have done in the source layer but that were impractical because they needed some data transformation
Tests on the refined layer:
- test the results of the BRs that calculate values
- test the PK/FK after risky or complex joins (to implement the BRs)
Tests on the delivery layer:
- test that the dimensions are well-formed
- test that facts are well-formed
- test the values in facts and dimensions
- test that well-known expected output is produced correctly
- run statistical, quantitative tests to ensure that data is flowing and in a reasonable range
Testing our models to ensure good quality
Testing the uniqueness of composite keys
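The usual tool here is the unique_combination_of_columns test from dbt_utils. Applied to a hypothetical fact (the model and column names below are mine, not the book's), the YAML looks like:

version: 2
models:
  - name: FACT_POSITION
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - ACCOUNT_CODE
            - SECURITY_CODE
            - REPORT_DATE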
Exercise for myself
This is not in the book - but I will make an exercise for myself to add tests to all models, because at the moment only 2 have tests on them.
DE with dbt - Chapter 10 - Agile Development
Agile development and collaboration
Manifesto for Agile Software Development:
We are uncovering better ways of developing software
by doing it and helping others do it.
Through this work we have come to value:
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
That is, while there is value in the items on the right,
we value the items on the left more.
Applying agile to data engineering
- agile is not improvising. Having a plan is important, as is the understanding that any plan is not set in stone and will keep changing
- pick a direction in collaboration with the people who know the domain (aka business) and the ones who know the technology
- an important corollary to the previous point is to improve the detail of your goal while you go
- a general goal is to keep your plan as flexible as possible, taking decisions (especially the ones that are difficult to change) at the last responsible moment so that the decision can be taken using as much knowledge and experience as possible
- verify the biggest questions at the beginning so that the risk of a big change of direction decreases as you move forward
- try to cut the work into small work increments, often referred to as stories, that can be completed with little risk of having to stop them halfway
- don’t try boiling the ocean or getting lost in too many different things. Try to focus the team on the most important topics of the moment so that they can collaborate on the solution
- start completing and stop starting – focus on closing open work before you start more work
- agility is not only about flexibility but also about avoiding waste and removing things that are outdated or of little or uncertain value
Building reports in an agile way
S1 – designing a light data model for the data mart
- keep the model simple
S2 – designing a light data model for the REF layer
- define REF model from sources
- dimension enrichment
- connecting to a fact a snowflaked dimension
- the application of master data
- definition of the composite fact
S3.x – developing the pipeline for the XYZ table with dbt models
Once the S1 and S2 stories have been completed, or are at least at an advanced stage of completion, you will have a list of tables that you need to build in the refined and delivery layers.
S4 – an acceptance test of the data produced in the data mart
Two approaches:
- automated acceptance/integration tests
- manual acceptance test
S5 – development and verification of the report in the BI application
This is the final story, where you will use the data mart verified in S4 to build the report using the BI application.
The starting point is the data model for the data mart designed in the S1 story, which was developed with the BI tool needs in mind and will, therefore, provide the correct model for your BI tool, whether it is a star schema or a wide table.
DE with dbt - Chapter 11 - Team Collaboration
Core collaboration practices
- common code ownership
- clean code and technical excellence
- trust, but challenge
- an open door policy
- pairing
- code walkthroughs
- PRs and code reviews
Working with Git in dbt Cloud
- using the cloud’s git functionality
- proper branch naming conventions
- adopting frequent releases
DE with dbt - Chapter 12 - Deployment, Execution, and Documentation Automation
Creating our deployment automation
This part was about creating dev (QA) and prod environments in the ‘Deploy’ page of dbt Cloud and executing jobs on either ~
I needed to run dbt seed because one of the dimensions comes directly from a seed CSV that did not exist yet in the target schema, so I was getting an error when running only dbt build.
So I made the job run like this:
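I am not pasting the job settings here, but the commands boil down to running the seed step before the build:

dbt seed
dbt build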
Everything in the staging folder is materialised as a view, so I need to load the CSV directly so that downstream tasks can run. Nice little error that I faced only because I did not do SCD type 2 on all dimensions. 🥳
Advanced automation – hooks and run-operations
Hooks
The available hooks are as follows:
- pre-hook: the SQL will be run before building the dbt node it is attached to
- post-hook: the SQL will be run after building the dbt node it is attached to
- on-run-start: the SQL will be run at the beginning of the dbt command
- on-run-end: the SQL will be run at the end of the dbt command
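As a small illustration (not from the book; the grant statement, role name, and the referenced model are placeholders), a post-hook can be attached to a model through its config block:

{{ config(
    materialized = 'table',
    post_hook = ["GRANT SELECT ON {{ this }} TO ROLE REPORTING_ROLE"]
) }}
SELECT * FROM {{ ref('STG_ABC_BANK_POSITION') }}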
While all hooks can access the context available in the model, the on-run-end hook has some extra context variables:
- the schemas variable contains the list of schemas that contain models that have been built by dbt in the current run
- the database_schemas variable is like the schemas variable but contains tuples (database, schema) in place of schema names
- the results variable contains a list with a result object for each node that has been built by dbt in the current run. This corresponds to the data that will populate the result’s JSON artifact
Run-operations
Allows you to invoke a macro from the command line or as part of a job
Some examples:
- data ingestion: using macros to configure external tables, or to ingest data from files residing in data lakes
- migrations: perform the evolutionary changes to your database objects that are not handled by dbt, such as dropping unused tables or managing individual fields
- post-run operations: once transformations are done, there are extra things that we might want to do, such as managing grants, writing audit information, profiling tables, producing extra documentation, and so on
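Invoking one from the command line looks like this (the macro name and arguments are hypothetical; the general form is dbt run-operation <macro> --args '<yaml/json dict>'):

dbt run-operation drop_unused_table --args '{"schema_name": "REF_LAYER", "table_name": "OLD_REF_TABLE"}'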
Table migrations
dbt manages most of the DDL and DML commands needed to create and maintain the schemas, tables, and views, but in some cases we might want to do things that dbt does not do for us
The most common things:
- deleting tables and views that are not in use anymore
- evolving incremental tables
- performing bulk operations on data
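A sketch of such a migration macro (again hypothetical, not the book's code), meant to be called with the dbt run-operation command shown earlier, could be:

{% macro drop_unused_table(schema_name, table_name) %}
    {# Build and run the DROP statement, then log what was done #}
    {% set sql %}
        DROP TABLE IF EXISTS {{ target.database }}.{{ schema_name }}.{{ table_name }}
    {% endset %}
    {% do run_query(sql) %}
    {{ log("Dropped table " ~ schema_name ~ "." ~ table_name, info=True) }}
{% endmacro %}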
That is all for today!
See you tomorrow :)