(Day 333) Learning about Data Quality

Ivan Ivanov · November 29, 2024

Hello :) Today is Day 333!

A quick summary of today:

  • watched some pydata talks
  • started reading the DQ Fundamentals book
  • some more SQL practice

Mojtaba Farmanbar - Uncertainty quantification: How much can you trust your machine learning model?

Mojtaba Farmanbar discussed uncertainty quantification in ML models and how to make models more trustworthy for human interaction. He talked about the importance of quantifying uncertainty in situations where the model’s decisions are crucial, such as financial crime detection and medical diagnosis. Data uncertainty can arise from issues like noisy data, disagreement between labellers, natural variability in data collection, or overlapping feature spaces. Model uncertainty can be due to limited data, model architecture choices, or assumptions made within the model.

Two methods for measuring uncertainty:

  • intrinsic methods, like Bayesian approaches and quantile regression, inherently provide uncertainty quantification
  • post-hoc methods, like calibration and conformal prediction, are applied after training and don’t restrict model choice

He also mentions calibration: a well-calibrated model that outputs a score of 0.8 for the positive class implies an 80% chance of the prediction being correct.

He introduces conformal prediction as a framework to transform uncertainty into a rigorous form. Instead of point estimates, conformal prediction provides prediction sets (for classification) or prediction intervals (for regression). This approach offers a coverage guarantee, ensuring that the prediction set or interval contains the true outcome with a specified probability. It’s also model-agnostic and distribution-free.
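
Since calibration and conformal prediction come up a lot, here is a minimal sketch of split conformal prediction for classification. This is my own toy example with synthetic data and a scikit-learn logistic regression, not the speaker's code.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic 3-class problem, split into train / calibration / test.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=5, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_cal, X_test, y_cal, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Nonconformity score on the calibration set: 1 - predicted probability of the true class.
cal_probs = model.predict_proba(X_cal)
scores = 1 - cal_probs[np.arange(len(y_cal)), y_cal]

# Threshold chosen so the prediction sets cover the truth ~90% of the time.
alpha = 0.1
q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

# Prediction set = every class whose probability clears the threshold.
test_probs = model.predict_proba(X_test)
prediction_sets = test_probs >= 1 - q  # boolean matrix, one row per test point

coverage = prediction_sets[np.arange(len(y_test)), y_test].mean()
print(f"Empirical coverage: {coverage:.2f}")  # should land near 0.90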

Vincent D. Warmerdam - Run a benchmark they said. It will be fun they said

Inspired by scikit-learn discussions about the “table vectorizer,” Vincent sought to evaluate its universality across task types (classification, regression, time series).

Findings:

  • default ensemble models (e.g. CatBoost, LightGBM, XGBoost) showed similar accuracy but CatBoost had longer training times
  • memory usage was comparable across models

Hyperparameter optimisation insights:

  • avoid overfitting to noise or random seeds by leveraging visual tools like HiPlot
  • prefer random search over grid search for efficiency (a quick sketch follows this list)
  • use cross-validation carefully, as fold choices can skew performance perceptions
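
As a concrete (made-up) example of the random-search point, here is a minimal scikit-learn sketch; the model and parameter ranges are my own choices, not Vincent's benchmark setup.

from scipy.stats import loguniform, randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_distributions={
        "learning_rate": loguniform(1e-3, 3e-1),  # sample on a log scale
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 6),
    },
    n_iter=20,    # 20 random draws instead of an exhaustive grid
    cv=5,         # fold choice matters, so keep it explicit
    random_state=0,
    n_jobs=-1,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))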

Lessons learned:

  • focus on real-world problems, not just benchmarks
  • prioritise data quality over excessive reliance on metrics
  • embrace noise and use robust visualisation and resampling techniques

Data Quality Fundamentals - Chapter 1 - Why Data Quality Deserves Attention—Now

What is data quality?

The book defines it as the health of data at any stage in its life cycle. Data quality can be impacted at any stage of the data pipeline: before ingestion, in production, or even during analysis.

DQ might not be as popular with companies as ML, analytics, or DS, so most of the time it falls on the shoulders of data analysts or engineers to handle.

Data quality issues (or, data downtime) are practically unavoidable given the rate of growth and data consumption of most companies. But by understanding how we define data quality, it becomes much easier to measure data downtime and prevent it from causing issues downstream.

Framing the current moment

DQ has become a more popular topic since 2019, and teams have started applying DevOps practices in the data world, culminating in the term DataOps.

Understanding the ‘rise of data downtime’

Data downtime refers to periods of time where data is missing, inaccurate, or otherwise erroneous, and it manifests in stale dashboards, inaccurate reports, and even poor decision making.

Migration to the cloud

In many ways, the cloud makes data easier to manage, more accessible to a wider variety of users, and far faster to process.

Not long after data warehouses moved to the cloud, so too did data lakes (a place to transform and store unstructured data), giving data teams even greater flexibility when it comes to managing their data assets. As companies and their data moved to the cloud, analytics-based decision making (and the need for high-quality data) became a greater priority for businesses.

More data sources

Nowadays, companies use anywhere from dozens to hundreds of internal and external data sources to produce analytics and ML models. Any one of these sources can change in unexpected ways and without notice, compromising the data the company uses to make decisions.

Increasingly complex data pipelines

Data pipelines have become increasingly complex, with multiple stages of processing and nontrivial dependencies between various data assets, as a result of more advanced (and disparate) tooling, more data sources, and increasing diligence afforded to data by executive leadership.

More specialised data teams

As companies increasingly rely on data to drive smart decision making, they are hiring more and more data analysts, data scientists, and data engineers to build and maintain the data pipelines, analytics, and ML models that power their services and products, as well as their business operations. With all of these different users touching the data, miscommunication or insufficient coordination is inevitable and will cause these complex systems to break as changes are made.

Decentralised data teams

For instance, a 200-person company may support a team of 3 data engineers and 10 data analysts, with analysts distributed across functional teams to better support the needs of the business. These analysts will report to operational teams or centralized data teams but own specific data sets and reporting functions. Multiple domains will generate and leverage data, leading to the inevitability that data sets used by multiple teams become duplicated, go missing, or go stale over time.

  1. Data mesh:

The data mesh is a decentralized, domain-oriented data architecture inspired by microservices. It treats data as a product, with each domain managing its own pipelines while adhering to universal governance standards. Organizations like Intuit and JPMorgan Chase have adopted this paradigm to improve data ownership, visibility, and quality. Success depends on reliable data through rigorous testing, monitoring, and observability.


  2. Streaming Data:

Streaming data enables real-time insights by processing continuous data flows, unlike traditional batch data processing. This shift increases the speed of insights but introduces new challenges for ensuring data quality in motion.

  3. Rise of the data lakehouse:

Data lakehouses combine the structured data capabilities of warehouses with the unstructured flexibility of data lakes. This model supports growing business needs and simplifies complex pipelines. Technologies like Redshift Spectrum and the Databricks Lakehouse blur the lines between lakes and warehouses, enabling a hybrid approach but adding complexity to data quality management.

Data Quality Fundamentals - Chapter 2 - Assembling the Building Blocks of a Reliable Data System

Understanding the difference between Operational and Analytical data

Operational data

Data produced operationally—that is, by the day-to-day ongoing operations at your organization. Inventory snapshots at moments in time, customer impressions, and transaction records are all examples of operational data.

Analytical data

Data used analytically. This is to say, it’s the type of data behind data-driven business decisions. Marketing churn, clickthrough rates, and impressions by global region are all examples of analytical data categories.

What makes them different?


Almost always, operational data appears upstream from analytical data in your data pipelines. This is because analytical data can and often does contain aggregations or augmentations of operational datastores.

Data warehouses vs. Data lakes

Generally, data warehouses store data in a structured (row-column) format. Such data is highly transformed (the result of defined preprocessing procedures) and present only because it has a determinate reason for being there—at least, in theory.

Data lakes by contrast store anything—structured data, semi-structured data, unstructured data. Unlike warehouses, data lakes need not have highly specified procedures through which data enters—you can dump any format you like into a lake and directly access it. The result is a system that is typically higher volume and often more complex in terms of its governance and data.

Data warehouses

Data warehouses require ‘schema on write’ access, meaning we set the structure of the data at the instant it enters the warehouse. Further transformations of this data must make its new structure explicit at every step.

Common ones are Amazon Redshift, Google BigQuery, and Snowflake.

Data lakes

Data lakes also rely on high-quality analytical data to deliver optimal results. Unlike data warehouses, data lake architectures permit ‘schema on read’ access. This means we infer the structure of the data when we’re ready to use it.

Very early data lakes were built primarily on Apache Hadoop MapReduce and HDFS, leveraging Apache Hive to query their data with a SQL engine. In the early 2010s, Apache Spark made data lakes much more tenable, providing a generalized framework for distributed computations across large data sets in the data lake.
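
A rough PySpark sketch of schema on read (the path and fields are hypothetical): the structure is inferred, or declared, only when the data is read, not when it lands in the lake.

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Option 1: let Spark infer the schema from raw JSON dumped into the lake.
events_inferred = spark.read.json("s3://my-lake/raw/events/")

# Option 2: declare the schema explicitly at read time, which catches type drift earlier.
event_schema = StructType([
    StructField("event_id", StringType()),
    StructField("amount", DoubleType()),
    StructField("created_at", TimestampType()),
])
events_typed = spark.read.schema(event_schema).json("s3://my-lake/raw/events/")

events_typed.printSchema()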

Some common features of data lakes include:

  • decoupled storage and compute
  • customization and interoperability
  • largely built on open source technologies
  • ability to handle unstructured or weakly structured data
  • support for sophisticated non-SQL programming models

Data lakehouses?

Data lakehouses first came onto the scene when cloud warehouse providers began adding features that offer lake-style benefits, such as Redshift Spectrum or Delta Lake. Similarly, data lakes have been adding technologies that offer warehouse-style features, such as SQL functionality and schema.

Syncing between warehouses and lakes

Different data warehouses and lakes are bridged by a data integration layer. Data integration tools, such as AWS Glue, Fivetran, and Matillion, collect data from disparate sources, unify it, and transform it for downstream destinations. A classic use case of data integration would be to collect lake data and load it in a structured format into one’s data warehouse.

Extract-transform-load, or ETL, is one well-known process within data integration. ETL generally describes integration steps where data is first extracted from one or more data stores, transformed into a new structure or format, and finally loaded into a destination data store.
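
A toy ETL step in Python might look like the following; the file path, columns, and connection string are all invented for illustration.

import pandas as pd
from sqlalchemy import create_engine

# Extract: pull raw data out of a source data store.
raw = pd.read_csv("s3://my-bucket/raw/orders.csv")

# Transform: coerce types, drop duplicates, derive a new column.
raw["order_ts"] = pd.to_datetime(raw["order_ts"], utc=True)
raw = raw.drop_duplicates(subset="order_id")
raw["order_value_usd"] = raw["quantity"] * raw["unit_price_usd"]

# Load: write the transformed data into the destination warehouse table.
engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")
raw.to_sql("orders_clean", engine, if_exists="replace", index=False)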

Collecting DQ metrics

What are DQ metrics?

In assessing whether your data is down, we might build a list of questions:

  • is the data up-to-date?
  • is the data complete?
  • are fields within expected ranges?
  • is the null rate higher or lower than it should be?
  • has the schema changed?

This list is a good place to start.
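
A rough pandas sketch of turning these questions into automated checks (the table, columns, and thresholds are invented, not the book's):

import pandas as pd

df = pd.read_parquet("orders_snapshot.parquet")  # hypothetical extract
now_utc = pd.Timestamp.now(tz="UTC")             # assumes updated_at is tz-aware UTC

checks = {
    # Freshness: has anything arrived in the last 24 hours?
    "is_up_to_date": (now_utc - df["updated_at"].max()) < pd.Timedelta("24h"),
    # Completeness: do we have roughly the volume we expect?
    "is_complete": len(df) > 10_000,
    # Field ranges: are values where they should be?
    "amount_in_range": df["amount"].between(0, 100_000).all(),
    # Null rate: is it below an agreed threshold?
    "null_rate_ok": df["customer_id"].isna().mean() < 0.01,
    # Schema: are the columns we rely on still there?
    "schema_ok": {"order_id", "customer_id", "amount", "updated_at"}.issubset(df.columns),
}

failed = [name for name, ok in checks.items() if not ok]
print("FAILED:", failed if failed else "none")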

How to pull DQ metrics

  1. scalability challenges: managing large datasets involves batching, query optimization, deduplication, schema normalization, and scalable storage. A dedicated data pipeline is essential for long-term operation

  2. monitoring across the data stack: true data observability extends beyond warehouses to include data lakes, dashboards, and ML models. Integrated solutions help ensure reliability across all assets

  3. pulling metrics from Snowflake:
    • step 1: inventory mapping: use SQL queries to list and analyze database schemas, tables, and metadata for tracking changes over time. Include views and external tables for completeness
    • step 2: monitor data freshness and volume: track row counts, data size, and timestamps of the latest updates to identify anomalies or missing data (a query sketch follows this list)
    • step 3: query history: analyze query logs to trace table usage, data lineage, and optimize costs. Include logs for data load operations to understand data movement and freshness
    • step 4: health checks: evaluate key metrics like completeness, distinctness, and distribution of both string and numeric fields. Identify issues such as missing data, duplicates, or format mismatches
  4. practical insights:
    • automate monitoring and alert systems for fast issue resolution
    • centralize data quality insights in a user-friendly interface for root cause analysis
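
A rough sketch of step 2 using the Snowflake Python connector; the credentials and database are placeholders, and INFORMATION_SCHEMA.TABLES is the metadata view being queried.

import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="********",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
)

FRESHNESS_AND_VOLUME = """
    SELECT table_schema,
           table_name,
           row_count,
           bytes,
           last_altered          -- proxy for when the table last changed
    FROM information_schema.tables
    WHERE table_type = 'BASE TABLE'
    ORDER BY last_altered DESC;
"""

cur = conn.cursor()
try:
    for schema, table, rows, size, last_altered in cur.execute(FRESHNESS_AND_VOLUME):
        print(f"{schema}.{table}: {rows} rows, {size} bytes, last altered {last_altered}")
finally:
    cur.close()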

Using query logs to understand DQ in the warehouse

Questions we can answer (a sample query sketch follows the list):

  • who is accessing this data?
  • where does it come from upstream? Where is it going downstream?
  • how often, on average, is this particular transformation executed?
  • how many rows are affected?
  • when was this table last queried?
  • was that update part of a regular cadence, or does it break a pattern?
  • what’s the load on this warehouse as a function of the time of day?
  • is this query taking progressively longer than it was two months ago?
  • who (or what bot) has access to this resource that shouldn’t?
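
For instance, the "when was this table last queried, and by whom?" question maps to a query against Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view. The table name is a placeholder and conn is the connection from the sketch above.

LAST_QUERIED = """
    SELECT user_name,
           start_time,
           query_type,
           total_elapsed_time    -- in milliseconds
    FROM snowflake.account_usage.query_history
    WHERE query_text ILIKE '%ORDERS_CLEAN%'   -- the table we care about
    ORDER BY start_time DESC
    LIMIT 20;
"""

cur = conn.cursor()
for user, started, qtype, elapsed_ms in cur.execute(LAST_QUERIED):
    print(f"{started} {user} {qtype} ({elapsed_ms} ms)")
cur.close()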

Using query logs to understand DQ in the lake

Example questions we can answer:

  • which job pipeline or user was responsible for this object’s creation?
  • what schema is the object using or relying on?

Designing a data catalog

Data catalogs, like library catalogs, inventory metadata to assess data accessibility, health, and location. Tools like Alation, Collibra, and Informatica enhance discoverability, collaboration, and compliance through machine learning and automation. Acting as a single source of truth, data catalogs track data lineage, ensure trust, and manage personally identifiable information across pipelines.

Key questions addressed include data relevance, representation, and usability. While early cataloging relied on spreadsheets, manual methods fail to scale with vast, evolving datasets. Automation is essential for handling unstructured, dynamic data suited for machine learning pipelines. Modern catalogs automate metadata discovery and organization, overcoming the limitations of manual cataloging and catering to diverse data producer and consumer perspectives.

Building a data catalog

Building a data catalog requires collaboration with stakeholders to prioritise important data, assign ownership, and document metadata. While manual methods like spreadsheets were traditionally used, automation through SQL parsers and open-source databases is now essential. Cataloging tools like Amundsen and Apache Atlas can visualize metadata, while advanced catalogs focus on automated discovery and active metadata for self-service use.

Modern data management emphasises data discovery over static catalogs, offering real-time insights into data health, lineage, and usage across domains. This approach aligns with data mesh principles, assigning domain-specific accountability. Key features include:

  • self-service discovery and automation for easier access and reduced silos
  • scalability to adapt to growing, unstructured datasets
  • data lineage for tracking dependencies and troubleshooting pipeline issues

By integrating automated monitoring and observability platforms, teams ensure reliable, dynamic data discovery and quickly address pipeline issues, fostering trust and informed decision-making.

Example:

| Table name | Dashboard/report | Last updated | Owner | Notes |
| --- | --- | --- | --- | --- |
| LIOR_GOOD_TABLE_3.csv | Exec Forecasting V3 (Looker) | March 3, 2022 | Lior Gavish (lior@internet.org) | Lior’s table; used for executive financial forecasts, i.e., ARR |
| MEETINGS_DOWNTIME_2022.csv | Report 1234 (Tableau) | February 2, 2022 | Barr Moses (barr@internet.org) | Outages occurred during in-person meetings |
| DONT_USE_4_MV.csv | Dashboard Yikes (Chartio) | October 30, 2021 | Molly Vorwerck (molly@internet.org) | Who knows? |
| RYANS_DATA.csv | Marketing Model (Looker) | March 3, 2022 | Ryan Kearns (ryan@internet.org) | For demand generation models to inform ad spend across social channels |

DQ Fundamentals - Chapter 3 - Collecting, Cleaning, Transforming, and Testing Data

Collecting data

The entrypoint is where raw, unprocessed data enters the pipeline. It is critical because it shapes the pipeline’s downstream processes. Data at this stage is often noisy and heterogeneous, originating from application logs, API responses, or sensors.

Data Sources

  • application log data
    • records events within software, often with timestamps and log levels (info, warn, error)
    • structure is variable, typically text-based, and chosen by developers
    • used for diagnostics (tracking errors) or auditing (recording actions)
  • api responses
    • data exchanged between programs, often in structured formats like json
    • includes response codes (e.g. 200 OK, 500 error)
    • purpose depends on context, such as monitoring errors or using response content
  • sensor data
    • collected from iot devices or research equipment, often noisy or prone to silent failure
    • requires downstream cleaning and anomaly detection
    • used in tasks like machine learning or real-time alerts, where high volume and low latency matter
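
As a toy illustration of how heterogeneous these raw inputs are before any cleaning (the log format, API payload, and sensor reading are all invented):

import json
import re

# Application log line: semi-structured text in whatever format the developers chose.
log_line = "2024-11-29T10:15:03Z ERROR payment-service failed to charge card id=4821"
match = re.match(r"(?P<ts>\S+) (?P<level>\w+) (?P<service>\S+) (?P<message>.+)", log_line)
log_record = match.groupdict()

# API response: structured JSON plus a status code that needs checking.
api_body = '{"status": 200, "data": {"user_id": 17, "plan": "pro"}}'
api_record = json.loads(api_body)
if api_record["status"] != 200:
    raise RuntimeError("upstream API error")

# Sensor reading: numeric, noisy, and prone to silent failure (e.g. missing values).
sensor_reading = {"device_id": "therm-042", "temperature_c": None}

print(log_record["level"], api_record["data"]["plan"], sensor_reading["temperature_c"])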

The next step is cleaning and preparing the data for processing.

Cleaning data

Common ways (a pandas sketch of a few of these follows the list):

  • outlier removal
  • assessing features
  • normalisation
  • data reconstruction (interpolation, extrapolation, labeling)
  • time zone conversion
  • type coercion
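
A small pandas sketch of a few of these steps on invented data:

import pandas as pd

df = pd.DataFrame({
    "reading": ["10.5", "11.2", None, "9.8", "240.0"],  # strings, a gap, and an outlier
    "measured_at": ["2024-11-29 08:00", "2024-11-29 09:00", "2024-11-29 10:00",
                    "2024-11-29 11:00", "2024-11-29 12:00"],
})

# Type coercion: strings become floats, unparseable values become NaN.
df["reading"] = pd.to_numeric(df["reading"], errors="coerce")

# Time zone conversion: localise the naive timestamps, then convert to UTC.
df["measured_at"] = (pd.to_datetime(df["measured_at"])
                       .dt.tz_localize("Europe/London")
                       .dt.tz_convert("UTC"))

# Data reconstruction: interpolate the missing reading from its neighbours.
df["reading"] = df["reading"].interpolate()

# Outlier removal: drop readings outside the plausible range.
df = df[df["reading"].between(0, 100)]

print(df)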

Batch vs. Stream Processing

Batch processing collects and processes data in large chunks over a set period, making it cost-effective and ideal for handling large datasets. Stream processing, in contrast, processes data in near real time, enabling timely insights for applications like fraud detection or ride-sharing.

  • batch processing: traditional method, cost-efficient, used for periodic tasks (e.g. Hadoop, BigQuery)
  • stream processing: supports real-time insights, suitable for dynamic use cases (e.g. Kafka, Spark, Flink)

Stream processing has gained popularity as real-time needs increase, though batch remains widely used for non-urgent data tasks.

DQ for streaming

Stream processing prioritizes speed over volume, often compromising data quality compared to batch processing. Traditional testing methods (unit, functional, integration) struggle to scale for real-time, unpredictable data, necessitating new approaches to ensure reliable streaming systems.

  • AWS kinesis:
    • serverless, scalable, and integrates well with the AWS stack (e.g. S3, Redshift)
    • cost-effective and supports multiple SDKs, but has fixed 7-day data retention


  • Kafka:
    • open source, customizable, high throughput, low latency (e.g. ~2ms)
    • requires more expertise but allows better configurability

Kinesis suits smaller teams for quick implementation, while Kafka offers flexibility for advanced needs. Stream processing systems need robust data normalization and quality frameworks to handle real-time challenges effectively.
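
A minimal kafka-python sketch of a quality check applied in flight: consume raw events, forward the valid ones, and quarantine the rest. Topic names, fields, thresholds, and the broker address are made up.

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

REQUIRED_FIELDS = {"event_id", "user_id", "amount"}

for message in consumer:
    event = message.value
    # In-flight checks: required fields present and amount within a sane range.
    if REQUIRED_FIELDS.issubset(event) and 0 <= event["amount"] <= 100_000:
        producer.send("clean-events", event)
    else:
        producer.send("quarantined-events", event)  # keep bad records for inspection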

Normalising data

  • handling heterogeneous data sources
  • schema checking and type coercion
  • syntactic vs semantic ambiguity in data

Running analytical data transformations

Analytical data transformations, often part of ETL or ELT workflows, involve preparing data for analytical use.

  • reasons for transformations:
    • schema alignment, data cleaning (deduplication, filtering, etc.)
    • unit/currency standardization and encryption for compliance
    • governance audits and quality checks

The transform step is critical for ensuring data quality and usability in downstream systems.

Alerting and testing

Most common tests:

  • null values
  • data volume/any data at all
  • check data distribution
  • uniqueness
  • known invariants

To run data quality tests, you need to do two simple things (a small sketch follows the list):

  • load the transformed data into a temporary staging table/data set
  • run tests to ensure that the data in the staging table falls within the thresholds demanded of production (i.e., you need to answer “yes” to the question: is this what reliable data looks like?)
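
A small sketch of that staging pattern with SQLAlchemy; the table names, connection string, and checks are invented.

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://user:password@warehouse-host:5432/analytics")

# Each query returns 0 when the staging table looks healthy.
TESTS = {
    "no_null_ids":   "SELECT COUNT(*) FROM staging_orders WHERE order_id IS NULL",
    "has_rows":      "SELECT CASE WHEN COUNT(*) > 0 THEN 0 ELSE 1 END FROM staging_orders",
    "unique_ids":    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM staging_orders",
    "valid_amounts": "SELECT COUNT(*) FROM staging_orders WHERE amount < 0",
}

failures = {}
with engine.connect() as conn:
    for name, sql in TESTS.items():
        bad = conn.execute(text(sql)).scalar()
        if bad != 0:
            failures[name] = bad

if failures:
    raise RuntimeError(f"Staging checks failed, not promoting to production: {failures}")
print("All checks passed; safe to swap staging into production.")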

Managing DQ with Airflow

Apache Airflow improves data quality in pipelines by orchestrating workflows and monitoring for issues like deteriorating queries or buggy code. Tools like SLAs, circuit breakers, and SQL check operators help prevent and manage data downtime.

  • SLAs: set maximum task durations and trigger alerts or callbacks when tasks exceed thresholds
import datetime
import time

import pendulum
from airflow.decorators import dag, task


# Called when any SLA in the DAG is missed; receives the DAG, the tasks that
# missed their SLA, and the blocking task instances.
def sla_callback(dag, task_list, blocking_task_list, slas, blocking_tis):
    print(
        "The callback arguments are: ",
        {
            "dag": dag,
            "task_list": task_list,
            "blocking_task_list": blocking_task_list,
            "slas": slas,
            "blocking_tis": blocking_tis,
        },
    )


@dag(
    schedule_interval="*/2 * * * *",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    sla_miss_callback=sla_callback,
    default_args={"email": "email@example.com"},
)
def example_sla_dag():
    # This task's SLA is 10 seconds but it sleeps for 20, so the SLA is missed
    # and sla_callback fires (Airflow can also email the default_args address).
    @task(sla=datetime.timedelta(seconds=10))
    def sleep_20():
        """Sleep for 20 seconds"""
        time.sleep(20)

    # No SLA on this task; it simply runs downstream of sleep_20.
    @task
    def sleep_30():
        """Sleep for 30 seconds"""
        time.sleep(30)

    sleep_20() >> sleep_30()


dag = example_sla_dag()
  • circuit breakers: pause pipelines if data fails quality checks (e.g., freshness, volume, schema) to prevent downstream issues and reduce backfilling costs
  • sql check operators: validate pipeline data against expectations using SQL queries; can halt pipelines for critical data failures (a small example follows this list)
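
A small example of a SQL check operator, assuming the common SQL provider is installed; the connection id and table are placeholders. The task fails (halting downstream tasks) if the query returns any zero/false/null value.

import pendulum
from airflow.decorators import dag
from airflow.providers.common.sql.operators.sql import SQLCheckOperator

@dag(
    schedule_interval="@daily",
    start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
    catchup=False,
)
def orders_quality_dag():
    check = SQLCheckOperator(
        task_id="orders_not_empty_and_no_null_ids",
        conn_id="warehouse_conn",
        sql="""
            SELECT COUNT(*) > 0,                                           -- any data at all?
                   SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) = 0   -- no null ids
            FROM orders_clean
        """,
    )

quality_dag = orders_quality_dag()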

Good use of these can ensure data reliability while minimizing unnecessary disruptions.

More SQL on interviewquery.com


Did 10 more medium Qs ^^


That is all for today!

See you tomorrow :)