Hello :) Today is Day 355!
A quick summary of today:
- finished the preview of The AI Engineer’s Guide to Surviving the EU AI Act
- learned about keeping history in dbt
The AI Engineer’s Guide to Surviving the EU AI Act - Chapter 5 - MLOps for High-Risk AI Systems
AI Act Engineering
The EU AI Act regulates AI systems in the EU to ensure trustworthiness by mitigating risks and safeguarding fundamental rights. Articles 9–15 provide requirements for high-risk AI systems but lack technical implementation guidance. To address this, a quality model for safety-critical AI systems has been proposed, integrating EU AI Act requirements with attributes like ethical integrity, human oversight, and fairness.
Key insights:
- Mapping Act requirements to quality attributes:
- facilitates operationalization of regulatory requirements
- integrates compliance into existing practices and aligns with international standards
- EU AI Act engineering:
- a multidisciplinary field combining AI, engineering, data science, law, and ethics
- focuses on risk assessment, transparency, monitoring, auditing, documentation, privacy, and human oversight
- Integration with CRISP-ML(Q):
- aligns AI Act requirements with the six CRISP-ML(Q) phases: business and data understanding, data prep, model engineering, model evaluation, model deployment, and monitoring
- ensures continuous compliance by addressing risks iteratively throughout the ML lifecycle
MLOps and ML Engineering Practices for Achieving Compliance
- establishing a risk management system for the entire lifecycle of the high-risk AI system
- ensuring robust data governance by validating and testing datasets to guarantee their relevance, representativeness, and accuracy for the intended purpose
- creating comprehensive technical documentation to demonstrate compliance and provide authorities with the necessary information for assessment
- designing the high-risk AI system for automatic event recording to identify national-level risks and substantial modifications throughout its lifecycle
- providing detailed instructions for downstream deployers to ensure compliance
- incorporating human oversight capabilities into the design of the high-risk AI system for downstream deployers
- ensuring that the high-risk AI system meets appropriate levels of accuracy, robustness, and cybersecurity
- establishing a quality management system to guarantee compliance
Risk management system
Checklist for compliance with Article 9 of the EU AI Act (risk management system):
** Pre-development
Define risk assessment methodologies and tools to identify critical risks based on potential harms (safety, privacy, bias).
Create a risk management plan aligned with EU AI Act requirements.
Set up a centralized risk register.
** Business and Data Understanding
Conduct initial risk identification workshop
Document potential AI risks in the risk register
Perform ethical impact assessment.
Align project goals with organizational values and EU AI Act requirements.
** Data Preparation
Assess data quality, completeness, and representativeness.
Ensure datasets are free from discriminatory bias.
Identify and document data-related risks (e.g., bias, privacy). Assess risks related to data drift and fairness metrics.
Implement data versioning and lineage tracking.
Design and implement data validation tests. Implement data quality processes (e.g., completeness, correctness).
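To make the data validation item above a bit more concrete, here is a tiny, hypothetical singular dbt test (SQL/dbt, since that is the stack in the second half of this post). The model and column names are made up, and a singular test passes when the query returns zero rows.
-- Hypothetical completeness/correctness check on an invented staging model:
-- the test fails if any required field is missing or out of its valid range.
SELECT *
FROM {{ ref('STG_LOAN_APPLICATION') }}
WHERE APPLICANT_ID IS NULL
   OR REQUESTED_AMOUNT IS NULL
   OR REQUESTED_AMOUNT <= 0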
** Modeling
Conduct model vulnerability assessment.
Validate the model architecture for explainability and robustness.
Implement version control for model code and artifacts.
Design and implement unit tests for model components.
Document model architecture and design decisions.
Implement integration tests for the full ML pipeline.
Simulate edge cases and extreme scenarios during model design.
** Evaluation
Execute comprehensive test suites (unit, integration, system).
Perform adversarial testing and document results.
Evaluate performance against business goals, ethical goals, and the intended purpose of the system (value-alignment metrics, e.g., fairness, explainability, robustness, safety, and accountability).
Conduct A/B testing in controlled environments.
Update risk register based on evaluation results.
Engage domain experts to validate evaluation metrics.
** Deployment
Conduct pre-deployment security audit.
Implement a continuous monitoring and alerting system.
Set up automated regression testing.
Establish feedback mechanisms from end-users and stakeholders.
Ensure that risk mitigation measures are operational (e.g., rollback mechanisms).
Automate the deployment pipeline to include quality gate checks.
** Monitoring and Maintenance
Implement automated monitoring for data drift, concept drift, and model performance.
Set up alerts for predefined risk thresholds.
Conduct regular model and system internal audits.
Perform periodic chaos engineering tests.
Regularly update risk register based on operational insights.
Schedule periodic reviews of ethical guidelines and value alignment.
MLOps and AI Engineering Checklist for Compliance with EU AI Act, Article 10. Data and Data Governance
** Business and Data Understanding
Establish a data governance framework aligned with EU AI Act requirements.
(Organizational) Define roles and responsibilities for data management.
Create data quality metrics and standards documentation.
Document all data sources and their credibility.
Verify data licensing and usage rights.
Assess data protection requirements.
Validate data source reliability.
Profile existing datasets.
Document known data quality issues.
Assess data completeness requirements.
Evaluate data representativeness.
Check for potential biases.
** Data Preparation
Specify data validation requirements (e.g., schema validation, data type checks, range checks, functional dependency validation, consistency checks, etc.).
Implement data quality checks (e.g., completeness, accuracy, consistency, etc.).
Set up data validation pipelines.
Implement data cleaning procedures.
Implement data lineage tracking.
Validate processed data quality.
Document transformation rules.
Validate transformed data.
Implement data anonymization where required (see the sketch at the end of this subsection).
Set up data versioning.
Implement privacy-preserving measures in data.
Set up secure data storage.
Implement access controls.
Document security measures.
Set up audit logging.
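For the anonymization item above, a minimal, hypothetical sketch in the same dbt/SQL terms: a model that pseudonymizes direct identifiers with a salted hash before exposing the data downstream. Model, column, and variable names are made up, SHA2 assumes a Snowflake-like warehouse, and real anonymization needs a proper privacy review rather than just hashing.
-- Hypothetical model: replace direct identifiers with salted hashes.
-- The salt comes from a dbt variable so it is not hard-coded in the model.
SELECT
      SHA2(EMAIL         || '{{ var("pii_salt", "dev-only-salt") }}', 256) AS EMAIL_HASH
    , SHA2(CUSTOMER_NAME || '{{ var("pii_salt", "dev-only-salt") }}', 256) AS CUSTOMER_NAME_HASH
    , COUNTRY_CODE      -- non-identifying attributes pass through unchanged
    , SIGNUP_DATE
FROM {{ ref('STG_CUSTOMER_PII') }}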
** Modeling
Validate training data quality.
Document data splitting methodology.
Implement cross-validation strategy.
Track data versions used for training.
Track model versions.
Track model lineage.
Monitor data drift.
Implement reproducibility controls.
Track experiment results.
Implement model validation procedures.
Test model robustness.
Validate model fairness.
** Evaluation
Evaluate model performance.
Assess model fairness and check for biases.
Validate model robustness.
Conduct risk assessment.
Validate security measures.
** Deployment
Define SLAs (Service Level Agreements) for data quality (completeness, accuracy, timeliness, consistency).
Set up quality gates.
Set up alerts.
Set up incident response.
Implement production monitoring.
Implement rollback procedures.
** Monitoring and Maintenance
Monitor data quality metrics.
Track model performance (e.g. processing or inference time, error rates).
Monitor fairness metrics.
Track system health.
Track business critical metrics (e.g., revenue-impacting metrics, customer-facing metrics, regulatory requirements).
Implement update procedures.
Track changes and updates.
Maintain (internal) audit trail.
MLOps and AI Engineering Checklist for Compliance with EU AI Act, Article 11. Technical Documentation and Article 12. Record-Keeping
** Business & Data Understanding Phase
Document intended purpose and use cases (Document business objectives and constraints. Specify expected performance metrics)
Establish record-keeping infrastructure (Define data collection scope and methods. Setup metadata tracking systems. Define logging requirements)
Document data sources and specifications (Data provenance records, data quality criteria, privacy and security requirements, data governance procedures)
** Model Training and Operationalization Phase
Document model development environment (Hardware specifications, software dependencies, development tools and versions)
Document training methodology (Model architecture specifications, hyperparameter settings, training algorithms and procedures, feature engineering processes)
Implement pipeline logging (Training infrastructure logs, resource utilization tracking, pipeline execution records)
Implement training logs (Training runs metadata, model performance metrics, data versioning records, feature extraction logs)
** Model Deployment and Serving Phase
Document deployment procedures (Testing protocols, validation methods, release criteria)
Document deployment architecture (Infrastructure requirements, scaling procedures, security measures)
Implement deployment logging (Deployment event logs, configuration changes, version control records, access control logs)
Document operational procedures (Maintenance protocols, update procedures, emergency response plans)
Document serving infrastructure (API specifications, performance requirements, resource allocations, scaling policies)
Implement prediction logging (Inference requests and responses, latency metrics, error logs, user feedback records)
Document monitoring strategy (Monitoring metrics, alert thresholds, response procedures, maintenance schedules)
Document monitoring system (Alert thresholds, performance metrics, system health checks)
MLOps and AI Engineering Checklist for Compliance with EU AI Act, Article 13. Transparency and Provision of Information to Deployers and Article 14. Human Oversight
** Business and Data Understanding
User Requirement Gathering: Involve users in defining AI system requirements through interviews, surveys, and workshops.
Expectation Alignment: Ensure user expectations, including stakeholders’ fairness expectations, are documented and understood.
Transparency Requirements: Define what information needs to be transparent to users based on their needs.
User Personas and Goals: Create user personas to tailor transparency efforts effectively.
Explainability Goals: Establish interpretability objectives that align with user needs and business goals.
Documentation Standards: Set standards for documenting requirements, data sources, and decisions.
Requirement Documentation: Use templates to document business and data understanding comprehensively.
Scope Definition: Clearly define the AI system’s intended use cases and limitations.
Use-Case Documentation: Develop detailed descriptions of appropriate and inappropriate use cases.
** Data Preparation
Data Relevance Validation: Involve users in validating the relevance and quality of the data collected. Complete the data bias assessment. Log the dataset distributions.
Fairness-enhancing preprocessing applied (if necessary).
Data Catalog Accessibility: Provide users with access to data catalogs and descriptions.
Feature Explainability: Ensure that data features are understandable and meaningful to users.
Feature Selection Documentation: Document the rationale behind feature selection.
Data Source Documentation: Record all data sources, collection methods, and preprocessing steps.
Data Versioning: Implement data version control to track changes over time.
Contextual Data Labeling: Label and tag data to indicate the context in which it is appropriate.
** Modeling
Model Selection Input: Seek user input when selecting models to ensure they meet user needs.
Model Architecture Disclosure: Provide users with understandable information about model structures.
Use of Interpretable Models: Prefer inherently interpretable models where possible.
Explainability Techniques: Implement techniques to explain complex models if used.
Model Documentation: Record model architectures, parameters, and training processes.
Model Cards: Create model cards summarizing key information about each model.
Model Applicability Testing: Validate that models are appropriate for the intended contexts.
Stress Testing: Perform stress tests to assess model performance in various scenarios.
Explainability outputs validated (with SHAP, LIME, or similar tools).
Mechanisms for human override implemented.
** Evaluation
User Testing Sessions: Involve users in testing and evaluating model outputs.
Evaluation Results Sharing: Provide users with accessible reports on evaluation outcomes.
Explain Evaluation Metrics: Use understandable metrics and explain their significance to users.
Evaluation Procedures Documentation: Record all evaluation methods and results thoroughly.
Experiment Tracking: Use tools to track experiments and results systematically.
Contextual Performance Analysis: Evaluate model performance across different contexts and document findings.
** Deployment
User Control Options: Implement features that allow users to control AI interactions (e.g., opt-in/opt-out).
AI Interaction Disclosure: Clearly inform users when they are interacting with an AI system.
Information Accessibility: Provide easy access to information about how the AI system works.
Decision Explanations: Offer explanations for AI decisions in a user-friendly manner.
Deployment Documentation: Record deployment configurations, environments, and versions.
AI System Indicators: Use visual cues or labels to indicate AI-generated content or decisions.
Usage Limitations Display: Clearly display the limitations and appropriate contexts of the AI system.
Alerting systems for fairness and drift configured.
** Monitoring and Maintenance
Feedback Mechanisms: Provide channels for users to submit feedback and report issues.
Update Notifications: Inform users about updates, changes, or issues with the AI system.
Continuous Explainability: Ensure explanations remain accurate and relevant over time, and implement real-time monitoring pipelines to support this.
Maintenance Logs: Keep detailed logs of maintenance activities, updates, and system changes.
Documentation Updates: Regularly update documentation to reflect the system’s current state.
Context Drift Monitoring: Monitor for changes in the operating environment that may affect appropriateness. Conduct regular fairness audits.
Misuse Alerts: Set up alerts for potential misuse or operation outside intended contexts.
Feedback loops are established for continuous learning.
MLOps and AI Engineering Checklist for Compliance with EU AI Act, Article 15: Accuracy, Robustness, and Cybersecurity
** Business and Data Understanding
Identify potential data quality issues through data profiling and initial analysis.
Document known data limitations and risks that could impact faultlessness.
Ensure training data covers diverse scenarios, including potential edge cases.
Identify potential robustness requirements based on system objectives and risks (for example, the system must resist common adversarial attacks such as feature perturbation or injection of adversarial examples, or achieve at least 90% accuracy when tested with synthetic adversarial inputs).
** Data Preparation
Validate datasets for accuracy, duplicates, outliers, and completeness using data validation tools (e.g., Great Expectations); see the sketch at the end of this subsection.
Maintain a clear lineage of all preprocessing steps for traceability.
Perform data augmentation to enrich the dataset with diverse and extreme scenarios.
Test data against adversarial examples to ensure robustness.
Encrypt data in transit and at rest using AES-256 encryption.
Apply access controls to ensure secure handling of sensitive data.
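For the duplicate check mentioned above, an equally small, hypothetical dbt singular test (names made up; dbt’s built-in unique test or a Great Expectations expectation would do the same job):
-- Hypothetical test: fail if any natural key appears more than once in the dataset
SELECT POSITION_ID, COUNT(*) AS n_rows
FROM {{ ref('STG_POSITION') }}
GROUP BY POSITION_ID
HAVING COUNT(*) > 1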
** Model Development
Define and track performance metrics that cover a wide range of evaluation needs, from functional correctness to robustness and cybersecurity, ensuring a comprehensive assessment of AI systems for compliance with EU AI Act Article 15 (e.g., precision, recall, F1-score, Mean Absolute Error (MAE), adversarial robustness score, adversarial attack success rate, system uptime, and throughput).
Implement unit and integration tests for model components.
Use exception handling in code to manage unexpected inputs or processing errors.
Conduct code reviews and apply static code analysis tools (e.g., SonarQube).
Train models with adversarial training techniques to improve robustness.
Evaluate models under stress testing to determine performance limits.
Use interpretable models or apply explainability tools (e.g., SHAP, LIME) to explain decisions.
Use metadata management tools (e.g., MLflow, Neptune.ai) to track model lineage.
Sign model artifacts to ensure their integrity using secure hash algorithms (e.g., SHA-256).
Protect model training environments from unauthorized access.
Conduct security assessments and penetration testing during development.
Apply defenses against adversarial attacks (e.g., gradient masking).
** Model Evaluation
Validate model outputs against ground truth data.
Test the model’s accuracy using holdout datasets and cross-validation techniques.
Conduct regression testing to ensure the absence of bugs in updated models.
Evaluate model performance against adversarial inputs and out-of-distribution scenarios.
** Deployment
Use CI/CD pipelines to automate deployment testing and validation.
Monitor real-time performance to ensure consistency in production.
Use redundancy mechanisms to mitigate failures in production.
Deploy redundant models or ensembles for fault tolerance.
Secure deployment pipelines using role-based access controls (RBAC) and encrypted communications.
Monitor for cybersecurity threats (using tools like AWS GuardDuty).
Develop and periodically test incident response plans for security breaches.
** Monitoring and Maintenance
Monitor edge-case behavior and retrain models as new data becomes available.
Periodically reassess robustness to handle evolving conditions.
Perform regular security audits to ensure the integrity of deployed systems.
Encrypt backups and secure access to sensitive data and systems.
Continuously monitor for potential attacks and implement mitigations.
Use intrusion detection systems to detect unauthorized access or anomalies.
Wow ~ this chapter is so detailed about the articles in the Act. Definitely a book to reference if I end up doing work related to the Act/the EU.
DE with dbt - Chapter 13 - Moving Beyond the Basics
Building for modularity
A quick recap of the Pragmatic Data Platform approach:
- storage layer: adapt incoming data to how we want to use it without changing its semantics and store all the data: the good, the bad, and the ugly
- refined layer: apply master data and implement business rules
- delivery layer: organize the data to be easy to use, depending on the user
Modularity in the storage layer
The storage layer is designed for predictability and modularity, leveraging interchangeable components for flexibility and scalability. Key components include:
- seed: loads small reference data from csv files, either directly via ref() or indirectly using source(). Naming conventions like “SEED_xxx” clarify its function
- source definition: references external data, grouping inputs for easier management and auditability while maintaining table name consistency
- snapshot: tracks changes in input entities with periodic snapshots. Limitations include shared resources across environments and dependency on full exports for deletion tracking
- STG model: integrates components via CTEs (see the sketch after this list). Subcomponents include:
- adaptation: prepares incoming data by fixing data types, renaming, and applying corrections
- default record: adds default values for orphan foreign keys
- hashed: calculates keys (HKEY, HDIFF) and metadata columns (e.g., LOAD_TS)
- HIST model: an insert-only strategy for tracking entity changes, creating a slowly changing dimension type 2 table. This high-performance default stores both current and historical states effectively. Delete-aware variants handle deletions seamlessly
- PII model: separates Personally Identifiable Information (PII) for compliance and controlled access
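To make the STG structure above concrete, here is a rough, hypothetical sketch of such a model. The source, columns, and key names are made up to echo the ABC Bank example used later in this post, and dbt_utils.generate_surrogate_key stands in for the book’s own hashing macros:
-- Hypothetical STG model with the three subcomponents described above
WITH

src_data AS (   -- adaptation: rename columns, fix data types, apply hard rules
    SELECT
          ACCOUNTID                         AS ACCOUNT_CODE
        , SYMBOL                            AS SECURITY_CODE
        , CAST(QUANTITY  AS INTEGER)        AS QUANTITY
        , CAST(COST_BASE AS DECIMAL(18, 2)) AS COST_BASE
        , 'SOURCE_DATA.ABC_BANK_POSITION'   AS RECORD_SOURCE
    FROM {{ source('abc_bank', 'ABC_BANK_POSITION') }}
),

default_record AS (   -- default record: fallback row for orphan foreign keys
    SELECT
          '-1'                AS ACCOUNT_CODE
        , '-1'                AS SECURITY_CODE
        , 0                   AS QUANTITY
        , 0                   AS COST_BASE
        , 'System.DefaultKey' AS RECORD_SOURCE
),

with_default_record AS (
    SELECT * FROM src_data
    UNION ALL
    SELECT * FROM default_record
),

hashed AS (   -- hashed: calculate HKEY / HDIFF and metadata columns
    SELECT
          *
        , {{ dbt_utils.generate_surrogate_key(['ACCOUNT_CODE', 'SECURITY_CODE']) }} AS POSITION_HKEY
        , {{ dbt_utils.generate_surrogate_key(['ACCOUNT_CODE', 'SECURITY_CODE', 'QUANTITY', 'COST_BASE']) }} AS POSITION_HDIFF
        , '{{ run_started_at }}' AS LOAD_TS_UTC   -- load timestamp metadata (cast as needed)
    FROM with_default_record
)

SELECT * FROM hashed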
Modularity in the refined layer
The refined layer does not have the same level of predictability as the storage layer, but the general playbook is pretty clear and goes through these stages:
- refined individual system: the initial process is to compose and transform the data from each individual source system, including the use of locally specific business rules, up to an abstraction level comparable with the other systems
- master data mapping: applying master data transformations to convert the codes from one system to make them interoperable with a wider range of systems
- master data refined layer: the final process is to compose and transform the results of the master data mapping to produce a set of business concepts, expressed with master data codes, that allow us to calculate company-wide business metrics and apply company business rules
Progressive vs direct MD conversion example:
There are many different transformations that can be applied, but here is a general overview:
- REF model: the REF model’s role is to represent a refined entity in a given context, something you expect to be generally reusable in that context because it represents a concept that is useful and fairly complete there
- AGG model: the AGG model’s role is to represent an aggregated refined model
- MDD model: the MDD model’s role is to represent a dimension coming directly from master data, therefore a kind of ‘authoritative’ list of entities
- MAP model: the MAP model’s role is to represent a mapping table that allows code conversions between one system/context and another
- TR model: the TR model’s role is to represent an intermediate transformation
- PIVOT model: the PIVOT model’s role is to obtain models by pivoting other models
Modularity in the delivery layer
When we move to the delivery layer, we have one or more data marts. The role of each data mart is to deliver a dataset related to a specific subject, such as marketing, finance, or operations, at the level of a domain that often matches an organizational boundary, such as a plant, a country, a region, or the whole corporation.
Common models found here:
- FACT model: the FACT model’s role is to represent a Kimball-style fact
- DIM model: the DIM model’s role is to represent a Kimball-style dimension or hierarchy
- REPORT model: the REPORT model’s role is to represent a table that can power a report
- FILTER model: this is a specialized model whose role is to make it extremely clear how one or more facts are filtered based on requirements that exist only in the specific data mart where the filter model lives
Managing identity
- managing identity means choosing a key/code used to identify one instance of an entity
Types of keys:
- surrogate keys
- natural key or business key
- primary key
- foreign key
Main uses of keys
- tracking and applying changes
- extracting the current and historical version
- joining data
Master Data management
Master Data refers to the core descriptive data of an organization, ensuring consistent identification of the same concept across different units. It often overlaps with organizational dimensions like customer, product, or employee.
Data for Master Data management
- Master Data Dimensions (MDDs): this is the dimension that contains the ‘gold records’ of the business concept. Often, there is not a full dimension but only a list of master data codes, maybe with a name or description for human use
- Mapping tables (MAPs): this is a table that allows us to convert the codes from one domain into another – for example, converting Italian product codes into EMEA codes
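A small, hypothetical illustration of how a MAP table gets used (model and column names are made up): a refined model converting Italian product codes into their EMEA master data codes with a simple join.
-- Hypothetical refined model: attach the EMEA master data code to Italian sales rows
SELECT
      s.PRODUCT_CODE_IT
    , m.PRODUCT_CODE_EMEA        -- master data code from the mapping table
    , s.QUANTITY
    , s.NET_AMOUNT
FROM {{ ref('REF_SALES_IT') }} AS s
LEFT JOIN {{ ref('MAP_PRODUCT_IT_TO_EMEA') }} AS m
       ON s.PRODUCT_CODE_IT = m.PRODUCT_CODE_IT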
Saving history at scale
Before, we used dbt’s snapshot feature to track changes to the data. From now on the HIST pattern will be used instead: HIST models are simple, efficient, flexible, and resilient, and they are the best solution whether you wish to store the changes, audit the data you ingest, or just keep the current data.
Main advantages of this storage pattern:
- HIST models save version changes using insert-only operations and offer the simplest and most scalable way to save your data while keeping all the versions it has gone through (insert is the most efficient operation on modern cloud databases, while the updates and deletes used by snapshots perform poorly)
The columnar format is quite good at compressing data, especially for columns with low variability, so our versions will be stored efficiently
- HIST tables implement idempotent loads, which allow us to run and re-run our dbt project safely
- HIST tables store the changes in a common and flexible way as slowly changing dimensions, allowing us to start by using only the current versions and move to using all the changes if we need them. There’s no need to reload the data or change how we load/store it if we want to use the stored changes
- even the simplest implementation of the save_history macro provides full auditability of the changes if we receive one change per key per load
Understanding the save_history macro
The normal use of the macro is to create HIST models that read data from our STG models and save it.
Example usage for one of the staging tables:
{{ save_history(
    input_rel = ref('STG_ABC_BANK_POSITION'),
    key_column = 'POSITION_HKEY',
    diff_column = 'POSITION_HDIFF',
) }}
Here is the outline of the macro:
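The book shows the actual code; what follows is only my rough sketch of its shape, reconstructed from the explanation below (the parameter handling, join details, and placement of the config call are assumptions).
{% macro save_history(input_rel, key_column, diff_column) %}

{{ config(materialized='incremental') }}   {# no unique_key: dbt appends the returned rows #}

{% if is_incremental() %}   {# the HIST table already exists #}

WITH
current_from_history AS (
    -- most recent stored version for each key (see the current_from_history macro below)
    {{ current_from_history(history_rel = this, key_column = key_column) }}
),
load_from_input AS (
    -- keep only the input rows whose HDIFF does not match the current stored version
    SELECT i.*
    FROM {{ input_rel }} AS i
    LEFT JOIN current_from_history AS curr
           ON i.{{ key_column }}  = curr.{{ key_column }}
          AND i.{{ diff_column }} = curr.{{ diff_column }}
    WHERE curr.{{ key_column }} IS NULL
)

{% else %}   {# first run: the HIST table does not exist yet, load everything #}

WITH
load_from_input AS (
    SELECT * FROM {{ input_rel }}
)

{% endif %}

SELECT * FROM load_from_input

{% endmacro %}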
By not providing a unique_key parameter to the config, we instruct dbt to append all the rows returned by our SQL code to the incremental model.
The macro’s design is based on dbt’s is_incremental() macro to run the right code depending on whether we already have the underlying table or are creating it.
A crucial step of this load pattern is to be sure that, in both the input and the current_from_history, we only have one row for each key. If we have multiple rows, one HDIFF can never match two different ones, so the load will be incorrect. The current_from_history macro satisfies this requirement by design, so we need to ensure that the input is unique by HKEY.
The core of this pattern is to determine what rows from the input need to be inserted into the HIST table. This job is performed by the two versions of the load_from_input CTE – one for the incremental runs and one for the non-incremental ones.
I think this way of adding history uses a different method compared to what I learned in Zach Wilson’s bootcamp about implementing SCD type 2. The code there was quite manual and not designed to be reused across different pipelines, whereas this macro can be reused across pipelines to create history tables.
The current_from_history macro
The heavy lifting is done by the QUALIFY clause, where we partition the data by the desired key and order it by the desired column, in descending order, so that the first row that we keep is the most recent.
Maybe the only interesting parameter is qualify_function: an optional string that designates the window function used to filter the HIST table.
The default value is row_number, which means that the most recent row is returned, based on the key used to partition and the column used to order.
When we use a key with a coarser grain than the data stored in the HIST table, we can use other functions to control what data is retrieved.
As an example, by using the rank() function together with a MONTH(LOAD_TS_UTC) order column, we can retrieve all the changes that were made in the month with the most recent data by each key.
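Again as a rough sketch rather than the book’s actual code (parameter names and defaults beyond qualify_function are assumptions based on the description above), the core of the macro is essentially one QUALIFY-filtered SELECT:
{% macro current_from_history(history_rel, key_column,
                              selection_expr = '*',
                              load_ts_column = 'LOAD_TS_UTC',
                              qualify_function = 'row_number') %}

SELECT {{ selection_expr }}
FROM {{ history_rel }}
QUALIFY {{ qualify_function }}() OVER (
            PARTITION BY {{ key_column }}
            ORDER BY {{ load_ts_column }} DESC
        ) = 1

{% endmacro %}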
(wow this macro is amazing)
That is all for today!
See you tomorrow :)