(Day 354) Reading more about the EU AI act

Ivan Ivanov · December 20, 2024

Hello :) Today is Day 354!

A quick summary of today:

  • more of The AI Engineer’s Guide to Surviving the EU AI Act book
  • PyFlink homework

The AI Engineer’s Guide to Surviving the EU AI Act - Chapter 3 - Data and AI Governance and MLOps

Why Data and AI Governance are crucial in the EU AI Act era

Proper data governance for AI is crucial given some of the disasters that have already happened (Amazon’s AI recruiter discriminating against women, IBM Watson suggesting incorrect cancer treatments, etc.). Comprehensive bias testing across diverse populations is essential, especially in healthcare and education applications. Transparency and explainability in AI systems are not just technical challenges but legal and ethical requirements. Ongoing monitoring and auditing of AI systems in production are necessary to catch and address issues early.

To comply with the AI Act’s requirements, organizations have a few things to think about:

  • need robust data and AI governance frameworks in place to meet strict requirements related to data quality, documentation, transparency, human oversight, and risk management
  • effective data and AI governance are essential for identifying, assessing, and mitigating the risks associated with AI systems
  • strong data and AI governance promote responsible and ethical AI development by ensuring data quality, preventing bias, and maintaining human oversight, in line with the AI Act’s aim to foster the development of trustworthy AI that respects EU values and principles
  • well-designed governance enables organizations to innovate in AI while safeguarding privacy and other fundamental rights

Overview of Data Governance

It includes a data management system that covers:

  • sourcing
  • storage
  • labeling
  • access control
  • quality

Data Governance Definition

Data governance is a data management concept focused on ensuring high data quality throughout its lifecycle, implementing data controls supporting business objectives, and enabling organization-wide use. Key focus areas include data availability, usability, consistency, integrity, security, and standards compliance.

Data Engineering Lifecycle and Data Governance

[image: data engineering lifecycle and data governance]

Data Products and Data Contracts

Data product canvas:

[image: data product canvas]

Data products consist of 8 building blocks:

  1. Domain Name
  2. Data Product Name
  3. Consumer and Use Case
  4. Data Contract (more on this in a moment)
  5. Documentation of data sources
  6. Data Product Architecture, which includes the transformation pipelines for data lineage
  7. Ubiquitous Language to establish a shared understanding of all definitions
  8. Classification of the data product (source-aligned, aggregate, or consumer-aligned)

Key components of data contracts (a code sketch follows the list):

  • data schema
  • quality standards
  • delivery schedule
  • SLAs
  • security, privacy, and governance
  • change management
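
Since a data contract is ultimately just a structured agreement, here is a minimal sketch of one as a Python dict. All field names and values are illustrative, not from the book:

# Illustrative data contract for a hypothetical "web_events" data product.
web_events_contract = {
    "schema": {
        "url": "string",
        "ip": "string",
        "event_time": "timestamp (ISO 8601, UTC)",
    },
    "quality_standards": {
        "null_rate_max": 0.01,        # at most 1% nulls per column
        "duplicate_rate_max": 0.001,
    },
    "delivery_schedule": "streaming, continuous",
    "slas": {"freshness_minutes": 15, "availability": "99.9%"},
    "security_privacy_governance": {
        "pii_fields": ["ip"],
        "access": "internal, role-based",
    },
    "change_management": "breaking schema changes require 30-day notice",
}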

Integrating Data Governance into MLOps

Integrating data governance into the MLOps Stack Canvas ensures machine learning systems are compliant, trustworthy, and sustainable. By embedding governance across each component of the MLOps lifecycle, organizations can manage data responsibly and adhere to regulations, enhancing the overall value of ML projects.

Value Proposition
Incorporate compliance and data governance goals, using metrics like policy compliance rate, regulatory adherence, and cost of poor data quality.

Data Sources & Versioning

  • Catalog data sources by sensitivity and regulatory needs
  • Implement lineage tracking and sustainable documentation practices
  • Use tools like DVC, Apache Atlas, or Delta Lake for versioning and lineage

Data Analysis & Experimentation

  • Enforce access controls and maintain audit trails
  • Ensure compliance with tools like MLflow or Weights & Biases (see the sketch after this list)
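
As one hedged example of what an audit trail could look like in practice, MLflow runs can be tagged with governance metadata. The tag names here are my own invention, not an MLflow or EU AI Act standard:

import mlflow

# Attach governance metadata to a run so experiments stay auditable.
# Tag names are illustrative.
with mlflow.start_run(run_name="churn-model-experiment"):
    mlflow.set_tags({
        "governance.data_source": "crm_export_v3",
        "governance.risk_level": "limited",
        "governance.approved_by": "data-governance-board",
    })
    mlflow.log_param("train_rows", 120_000)
    mlflow.log_metric("auc", 0.87)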

Feature Store & Engineering

  • Secure and govern features with metadata and access policies
  • Track feature lineage and ensure compliance with tools like Feast or AWS SageMaker feature store

DevOps Foundations

  • Embed governance checks in CI/CD pipelines
  • Automate data quality validations and anomaly detection (a sketch follows this list)
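
Here is a hedged sketch of what an automated data quality gate in a CI pipeline could look like, using plain pandas; the thresholds and column names are invented for the example:

import pandas as pd

def validate_batch(df: pd.DataFrame) -> None:
    """Fail the CI step if the batch violates basic quality rules."""
    # Required columns must be present (names are illustrative)
    required = {"ip", "host", "event_time"}
    missing = required - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # No more than 1% null values in any required column
    null_rates = df[list(required)].isna().mean()
    assert (null_rates <= 0.01).all(), f"null rates too high:\n{null_rates}"

    # Timestamps must parse; errors='raise' surfaces bad rows immediately
    pd.to_datetime(df["event_time"], errors="raise")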

Continuous Integration & Deployment

  • Monitor data drift and automate bias detection using tools like Evidently AI or Fiddler AI (see the sketch below)
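
For the data drift part, a minimal sketch using Evidently’s Report API (module paths follow the 0.4-style API and may differ in other versions; the file paths are placeholders):

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_df: data the model was trained/validated on
# current_df: fresh production data to compare against it
reference_df = pd.read_parquet("reference.parquet")  # placeholder paths
current_df = pd.read_parquet("current.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("data_drift_report.html")  # artifact for review/CI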

Model Registry & Versioning

  • Maintain governance metadata in model registries and establish versioning strategies aligned with data policies.

Model Deployment

  • Deploy models with staged practices to ensure compliance, and monitor input/output data for privacy adherence.

Prediction Serving

  • Protect sensitive data with encryption and implement data access controls
  • Monitor prediction quality, fairness, and compliance

System Monitoring
  • Track metrics like data quality and model drift
  • Use tools like Prometheus and Grafana for real-time alerts (a sketch follows)
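
A small sketch of exporting such metrics with the official prometheus_client library (the metric names are illustrative); Grafana can then chart and alert on whatever Prometheus scrapes:

from prometheus_client import Gauge, start_http_server

# Expose governance/quality metrics on :8000/metrics for Prometheus to scrape.
null_rate = Gauge("data_null_rate", "Share of null values in the latest batch")
drift_score = Gauge("model_drift_score", "Latest data drift score")

start_http_server(8000)
# In a real pipeline these would be updated per batch; values here are dummies.
null_rate.set(0.004)
drift_score.set(0.12)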

Metadata Management
  • Centralize metadata to ensure lineage, provenance, and compliance tracking
  • Leverage platforms like Alation or Apache Atlas for comprehensive management

Overview of AI Governance

AI Governance Defined

The objective of AI governance is to establish and maintain a framework that ensures AI systems are developed, deployed, and used responsibly, ethically, and in alignment with organizational goals and values. AI governance addresses AI algorithms, decision-making based on AI predictions, security, and data privacy. Within an enterprise, AI governance is an ecosystem of people, processes, and policies.

Core Principles of AI Governance

The core principles for AI governance:

  • human-centred design and oversight
  • privacy and data security
  • safety, security, and dependability
  • ethical and responsible practices
  • transparency and explainability
  • accountability, redress, and remedy
  • fairness and inclusiveness
  • reproducibility
  • robustness
  • innovation and competition

AI System Lifecycle and AI Governance

  • business and data understanding
  • data preparation
  • model engineering
  • model evaluation
  • model deployment
  • model monitoring and maintenance

The integration of AI Governance and MLOps

Incorporating AI governance into the MLOps Stack Canvas ensures ethical, legal, and responsible AI practices throughout the machine learning lifecycle. By addressing risks such as bias, fairness, privacy, and security, this approach enhances compliance, promotes cross-functional collaboration, and ensures higher-quality, trustworthy models.

Value Proposition:

  • align AI projects with ethical principles and regulations (e.g., the EU AI Act, GDPR)
  • perform risk classifications and ethical impact assessments
  • identify affected stakeholders, including vulnerable groups

Data Sources & Versioning:

  • ensure data sourcing follows ethical guidelines
  • assess data quality and bias
  • conduct data ethics reviews

Data Analysis & Experiment Management:

  • use tools for transparency and accountability (e.g., fairness metrics, XAI techniques; see the SHAP sketch after this list)
  • document experiments for reproducibility and audit trails
  • implement ethical reviews for design and results
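
As one concrete (and hedged) example of an XAI technique, SHAP values can document why a model made its predictions; the model and data below are placeholders standing in for a real experiment:

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Placeholder model and data for the example.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# shap.Explainer dispatches to TreeExplainer for this model; the resulting
# per-feature contributions can be archived with the experiment as part of
# the audit trail.
explainer = shap.Explainer(model)
explanation = explainer(X.iloc[:100])
print(explanation.values.shape)  # (rows, features[, classes])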

Feature Store & Engineering:

  • detect and mitigate feature bias
  • use privacy-preserving techniques
  • ensure feature documentation for governance

CI/CT/CD Pipelines:

  • add ethical checkpoints and compliance automation
  • test for robustness and bias using tools like AI Fairness 360 (see the sketch after this list)
  • incorporate human oversight and gradual rollouts
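
A hedged sketch of a bias check with AI Fairness 360; the protected attribute, groups, and toy data are all invented for the example:

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy frame: 'sex' is the protected attribute, 'hired' the binary label.
df = pd.DataFrame({
    "sex":   [0, 0, 1, 1, 1, 0, 1, 0],
    "hired": [0, 1, 1, 1, 0, 0, 1, 0],
})
dataset = BinaryLabelDataset(
    df=df, label_names=["hired"], protected_attribute_names=["sex"]
)

metric = BinaryLabelDatasetMetric(
    dataset,
    unprivileged_groups=[{"sex": 0}],
    privileged_groups=[{"sex": 1}],
)
# Disparate impact below ~0.8 is a common red flag (the "four-fifths rule").
print("disparate impact:", metric.disparate_impact())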

Model Registry & Deployment:

  • document ethical metadata and approvals
  • monitor models for ethical concerns post-deployment
  • enable user feedback and explainability

Monitoring & Metadata Management:

  • continuously track fairness, bias, and data drift
  • maintain comprehensive governance documentation with traceability
  • add regulatory compliance metadata and decision traceability

AI Compliance vs. AI Governance vs. AI Risk Management

AI Compliance focuses on adhering to legal, regulatory, and policy standards, ensuring ethical behavior, and promoting transparency through rigorous processes, metrics, and stakeholder engagement.

AI Governance establishes a framework for ethical and responsible AI development, emphasizing transparency, stakeholder engagement, and balancing innovation with ethical considerations.

AI Risk Management identifies and mitigates risks associated with AI systems, ensuring robust risk assessment, risk mitigation strategies, and promoting a culture of risk awareness and proactive risk management.

Well-known examples of AI risks

  • algorithmic bias
  • privacy violations and social manipulation
  • deepfakes
  • autonomous weapons
  • AI-enabled cyberattacks
  • lack of transparency
  • environmental impact
  • safety risks in critical systems
  • intellectual property issues
  • autonomous vehicle accidents

The NIST AI Risk Management Framework (AI RMF) proposes four core functions that form the foundation for managing risks associated with artificial intelligence systems: govern, map, measure, and manage.

[image: the four core functions of the NIST AI RMF]

Emerging trends in data and AI governance:

  • focus on AI ethics, trust, and enhanced data privacy
  • adoption of AI in data management and proactive compliance automation
  • focus on data lineage and provenance in AI
  • shift left data and AI governance
  • decentralized data governance

Chapter 4 - Tailoring MLOps for Different Risk Levels

Creating AI System Landscape

To navigate the EU AI Act, one should first get an overview of existing AI systems (or, broadly speaking, AI use cases) and assess whether these systems fall under the legislation or whether no action is needed. It is also helpful to distinguish whether you plan to put these AI use cases into production or whether they are just experiments or proofs of concept. Whenever an AI use case is developed purely as an internal research project, it can be safely excluded from the compliance obligations.

Inventory Your AI Systems

Start by cataloging all AI systems used or developed within your organization by following these steps:

  • identify AI applications across all departments
  • include both custom-developed and third-party AI solutions
  • document key details like department, purpose, data used, deployment status, and risk category.

Include:

  • basic information (system name, department, purpose and functionality)
  • technical details (data sources, deployment status, vendor information)
  • risk assessment (risk level and justification)
  • additional notes (integration with other systems, responsible team/individual, date of last review)

Example:

[image: example AI system inventory table]
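
Since the book’s example is an image, here is a stand-in sketch of one inventory entry as a Python dataclass; the field names are my own, derived from the bullet points above:

from dataclasses import dataclass
from datetime import date

@dataclass
class AISystemRecord:  # Python 3.10+ type syntax
    # Basic information
    system_name: str
    department: str
    purpose: str
    # Technical details
    data_sources: list[str]
    deployment_status: str  # e.g. "production", "PoC"
    vendor: str | None      # None for custom-built systems
    # Risk assessment
    risk_level: str         # "prohibited" / "high" / "limited" / "low"
    risk_justification: str
    # Additional notes
    responsible_team: str
    last_review: date

resume_screener = AISystemRecord(
    system_name="CV pre-screening model",
    department="HR",
    purpose="rank incoming job applications",
    data_sources=["ATS database"],
    deployment_status="production",
    vendor=None,
    risk_level="high",  # employment use cases are high-risk under the Act
    risk_justification="Annex III: employment and workers management",
    responsible_team="People Analytics",
    last_review=date(2024, 12, 1),
)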

Application of the EU AI Act

The process of determining the EU AI Act’s relevance to AI Systems and Risk Classification

[image: decision flowchart for AI Act applicability and risk classification]

Prohibited AI Practices

The AI Act aims to protect fundamental rights, democracy, and the rule of law by banning AI applications that are incompatible with EU values and rights. In general, the AI Act prohibits AI systems from:

  • using subliminal, manipulative, or deceptive techniques to distort behavior and impair informed decision-making, causing significant harm
  • exploiting vulnerabilities related to age, disability, or socio-economic circumstances to distort behavior, causing significant harm
  • utilizing biometric categorization systems to infer sensitive attributes such as race, political opinions, religious beliefs, and sexual orientation
  • conducting social scoring, i.e., evaluating individuals based on social behavior or personal traits, which can lead to discriminatory treatment
  • creating or expanding facial recognition databases through untargeted scraping of facial images from the internet or CCTV footage
  • inferring emotions in workplaces or educational institutions, except for medical or safety reasons

Determining AI Act Obligation

To determine the EU AI Act obligations, there are two fundamental steps in AI systems assessment:

  1. Determine whether the AI Act applies to any AI systems from the inventory above: Do they meet the definition of “AI system”? Are they in the scope of the AI Act? Then conduct a risk classification for each AI system: Is it high-risk or not? Do transparency obligations apply?
  2. From the inventory above, determine your organization’s role in relation to AI systems. Is your organization an AI system provider or deployer?

Framework for Classification of AI Systems by Risk Levels

High-Risk AI Systems: these are subject to the strictest obligations under the Act. Key areas include:

  • biometric identification
  • critical infrastructure
  • education and employment
  • law enforcement and justice
  • migration and border control

Limited-Risk AI Systems: these systems require transparency measures but are not as heavily regulated as high-risk systems (Article 50). Examples include chatbots, emotion recognition systems, and deepfake generation tools. A questionnaire helps determine compliance.

Low-Risk AI Systems: pose minimal societal risks. Examples include spam filters, recommendation engines, and basic chatbots. No specific obligations are imposed under the Act unless their context or usage escalates risks.
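
To make the flow concrete, here is a toy sketch of the classification logic described above. The categories and checks are heavily simplified and purely illustrative; real classification requires legal review:

# Simplified, illustrative risk triage -- not legal advice.
HIGH_RISK_AREAS = {
    "biometric identification", "critical infrastructure",
    "education", "employment", "law enforcement", "justice",
    "migration", "border control",
}
TRANSPARENCY_TRIGGERS = {"chatbot", "emotion recognition", "deepfake generation"}

def classify(use_case_area: str, prohibited: bool) -> str:
    if prohibited:
        return "prohibited - may not be placed on the EU market"
    if use_case_area in HIGH_RISK_AREAS:
        return "high-risk - the strictest obligations apply"
    if use_case_area in TRANSPARENCY_TRIGGERS:
        return "limited-risk - transparency obligations (Article 50)"
    return "low-risk - no specific obligations under the Act"

print(classify("employment", prohibited=False))  # high-risk
print(classify("chatbot", prohibited=False))     # limited-risk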

Planning Data Governance, AI Governance, and MLOps for Compliance with the EU AI Act

[image: data governance, AI governance, and MLOps planning across AI project stages]

Engineering and MLOps tasks for EU AI Act compliance grow more detailed and intensive as AI projects progress from ideation to maturity. Robust infrastructure, monitoring, and governance are essential to support compliance and system reliability. Key points include:

  • MLOps and compliance integration: MLOps processes like data pipelines and model monitoring directly support EU AI Act requirements
  • data governance: increasingly critical in later stages for ensuring data quality and regulatory compliance
  • human oversight: evolving sophistication to align with EU AI Act mandates for high-risk AI systems
  • risk management: progresses from high-level to continuous, automated assessments
  • automation and scalability: emphasis grows as systems mature
  • MVP stage: focus shifts to audit readiness and conformity assessments for high-risk AI systems

Emerging Roles in Organizations for EU AI Act Compliance

  • the AI ethics specialist
  • the AI internal auditor
  • the legal AI compliance officer or AI regulatory affairs specialist
  • data and AI governance specialist
  • AI risk management specialist
  • AI security specialist

I decided to do the homework today ~

Task:

Create a Flink job that sessionizes the input data by IP address and host; use a 5-minute gap.

import os
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
from pyflink.table.expressions import lit, col
from pyflink.table.window import Session


def create_aggregated_events_ip_host_sink_postgres(t_env):
    # JDBC sink: one row per (ip, host) session with its start time and hit count
    table_name = 'processed_events_aggregated_hw'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            event_hour TIMESTAMP(3),
            ip VARCHAR,
            host VARCHAR,
            num_hits BIGINT
        ) WITH (
            'connector' = 'jdbc',
            'url' = '{os.environ.get("POSTGRES_URL")}',
            'table-name' = '{table_name}',
            'username' = '{os.environ.get("POSTGRES_USER", "postgres")}',
            'password' = '{os.environ.get("POSTGRES_PASSWORD", "postgres")}',
            'driver' = 'org.postgresql.Driver'
        );
    """
    t_env.execute_sql(sink_ddl)
    return table_name

def create_processed_events_source_kafka(t_env):
    # Kafka source table; event_time strings are parsed into a timestamp
    # column with a 15-second watermark so event-time windows can be used
    kafka_key = os.environ.get("KAFKA_WEB_TRAFFIC_KEY", "")
    kafka_secret = os.environ.get("KAFKA_WEB_TRAFFIC_SECRET", "")
    table_name = "process_events_kafka_hw"
    # doubled single quotes escape the literal T and Z inside the SQL string
    pattern = "yyyy-MM-dd''T''HH:mm:ss.SSS''Z''"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            url VARCHAR,
            referrer VARCHAR,
            user_agent VARCHAR,
            host VARCHAR,
            ip VARCHAR,
            headers VARCHAR,
            event_time VARCHAR,
            window_timestamp AS TO_TIMESTAMP(event_time, '{pattern}'),
            WATERMARK FOR window_timestamp AS window_timestamp - INTERVAL '15' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = '{os.environ.get('KAFKA_URL')}',
            'topic' = '{os.environ.get('KAFKA_TOPIC')}',
            'properties.group.id' = '{os.environ.get('KAFKA_GROUP')}',
            'properties.security.protocol' = 'SASL_SSL',
            'properties.sasl.mechanism' = 'PLAIN',
            'properties.sasl.jaas.config' = 'org.apache.flink.kafka.shaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{kafka_key}\" password=\"{kafka_secret}\";',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
    """
    t_env.execute_sql(source_ddl)
    return table_name


def log_aggregation():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)  # interval is in milliseconds; 10 s is saner than the original 10 ms
    env.set_parallelism(3)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        # Create Kafka table
        source_table = create_processed_events_source_kafka(t_env)
        print('running hw job')
        aggregated_table = create_aggregated_events_ip_host_sink_postgres(t_env)
        # Sessionize by (ip, host): events on the same key that arrive within
        # 5 minutes of each other (in event time) fall into one session window
        t_env.from_path(source_table) \
            .window(Session.with_gap(lit(5).minutes).on(col("window_timestamp")).alias("w")) \
            .group_by(
                col("w"),
                col("host"),
                col("ip")
            ) \
            .select(
                col("w").start.alias("event_hour"),  # session start timestamp
                col("ip"),
                col("host"),
                col("host").count.alias("num_hits")  # events in the session
            ) \
            .execute_insert(aggregated_table).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()
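
For comparison, I believe the same sessionization can be written with Flink SQL’s (older) session group windows instead of the Table API; a sketch, reusing the source_table and aggregated_table names from the job above (this syntax may be deprecated on newer Flink versions in favor of table-valued session windows):

# Sketch: SESSION group windows in Flink SQL, equivalent in spirit to the
# Table API pipeline above (would run inside log_aggregation's try block)
t_env.execute_sql(f"""
    INSERT INTO {aggregated_table}
    SELECT
        SESSION_START(window_timestamp, INTERVAL '5' MINUTE) AS event_hour,
        ip,
        host,
        COUNT(*) AS num_hits
    FROM {source_table}
    GROUP BY
        SESSION(window_timestamp, INTERVAL '5' MINUTE),
        ip,
        host
""").wait()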

The thing is that the data every student gets will be different because we run the job at different times, but I think my code is the same as some other students’ from what I’ve seen discussed on the community Discord server ~ We’ll see when the submission opens 😆


That is all for today!

See you tomorrow :)