(Day 354) Reading more about the EU AI act

Ivan Ivanov · December 20, 2024

Hello :) Today is Day 354!

A quick summary of today:

  • more of The AI Engineer’s Guide to Surviving the EU AI Act book
  • PyFlink homework

The AI Engineer’s Guide to Surviving the EU AI Act - Chapter 3 - Data and AI Governance and MLOps

Why Data and AI Governance are crucial in the EU AI Act era

Proper data governance for AI is crucial given some of the disasters that have already happened (Amazon’s AI recruiter discriminating against women, IBM Watson suggesting incorrect cancer treatments, etc.). Comprehensive bias testing across diverse populations is essential, especially in healthcare and education applications. Transparency and explainability in AI systems are not just technical challenges but legal and ethical requirements. Ongoing monitoring and auditing of AI systems in production are necessary to catch and address issues early.

To comply with the AI Act’s requirements, organizations have a few things to think about:

  • need robust data and AI governance frameworks in place to meet strict requirements related to data quality, documentation, transparency, human oversight, and risk management
  • effective data and AI governance are essential for identifying, assessing, and mitigating the risks associated with AI systems
  • strong data and AI governance promote responsible and ethical AI development by ensuring data quality, preventing bias, and maintaining human oversight, in line with the AI Act’s aim to foster the development of trustworthy AI that respects EU values and principles
  • well-designed governance enables organizations to innovate in AI while safeguarding privacy and other fundamental rights

Overview of Data Governance

It includes a data management system that covers:

  • sourcing
  • storage
  • labeling
  • access control
  • quality

Data Governance Definition

Data governance is a data management concept focused on ensuring high data quality throughout its lifecycle, implementing data controls supporting business objectives, and enabling organization-wide use. Key focus areas include data availability, usability, consistency, integrity, security, and standards compliance.

Data Engineering Lifecycle and Data Governance

[image: data engineering lifecycle and data governance]

Data Products and Data Contracts

Data product canvas:

[image: data product canvas]

Data products consist of 8 building blocks:

  1. Domain Name
  2. Data Product Name
  3. Consumer and Use Case
  4. Data Contract (more on this in a moment)
  5. Documentation of data sources
  6. Data Product Architecture, which includes the transformation pipelines for data lineage
  7. Ubiquitous Language to establish a shared understanding of all definitions
  8. Classification of the data product (source-aligned, aggregate, or consumer-aligned)

Key components of data contracts (a code sketch follows the list):

  • data schema
  • quality standards
  • delivery schedule
  • SLAs
  • security, privacy, and governance
  • change management
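
Since a data contract is ultimately just a structured agreement, here is a minimal sketch of one as a Python dict. All field names and values are illustrative, not from the book:

# Illustrative data contract for a hypothetical "web_events" data product.
web_events_contract = {
    "schema": {
        "url": "string",
        "ip": "string",
        "event_time": "timestamp (ISO 8601, UTC)",
    },
    "quality_standards": {
        "null_rate_max": 0.01,        # at most 1% nulls per column
        "duplicate_rate_max": 0.001,
    },
    "delivery_schedule": "streaming, continuous",
    "slas": {"freshness_minutes": 15, "availability": "99.9%"},
    "security_privacy_governance": {
        "pii_fields": ["ip"],
        "access": "internal, role-based",
    },
    "change_management": "breaking schema changes require 30-day notice",
}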

Integrating Data Governance into MLOps

Integrating data governance into the MLOps Stack Canvas ensures machine learning systems are compliant, trustworthy, and sustainable. By embedding governance across each component of the MLOps lifecycle, organizations can manage data responsibly and adhere to regulations, enhancing the overall value of ML projects.

Value Proposition
Incorporate compliance and data governance goals, using metrics like policy compliance rate, regulatory adherence, and cost of poor data quality.

Data Sources & Versioning

  • Catalog data sources by sensitivity and regulatory needs
  • Implement lineage tracking and sustainable documentation practices
  • Use tools like DVC, Apache Atlas, or Delta Lake for versioning and lineage

Data Analysis & Experimentation

  • Enforce access controls and maintain audit trails
  • Ensure compliance with tools like MLflow or Weights & Biases (see the sketch after this list)
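
As one hedged example of what an audit trail could look like in practice, MLflow runs can be tagged with governance metadata. The tag names here are my own invention, not an MLflow or EU AI Act standard:

import mlflow

# Attach governance metadata to a run so experiments stay auditable.
# Tag names are illustrative.
with mlflow.start_run(run_name="churn-model-experiment"):
    mlflow.set_tags({
        "governance.data_source": "crm_export_v3",
        "governance.risk_level": "limited",
        "governance.approved_by": "data-governance-board",
    })
    mlflow.log_param("train_rows", 120_000)
    mlflow.log_metric("auc", 0.87)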

Feature Store & Engineering

  • Secure and govern features with metadata and access policies
  • Track feature lineage and ensure compliance with tools like Feast or AWS SageMaker feature store

DevOps Foundations

  • Embed governance checks in CI/CD pipelines
  • Automate data quality validations and anomaly detection (a sketch follows this list)
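
Here is a hedged sketch of what an automated data quality gate in a CI pipeline could look like, using plain pandas; the thresholds and column names are invented for the example:

import pandas as pd

def validate_batch(df: pd.DataFrame) -> None:
    """Fail the CI step if the batch violates basic quality rules."""
    # Required columns must be present (names are illustrative)
    required = {"ip", "host", "event_time"}
    missing = required - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # No more than 1% null values in any required column
    null_rates = df[list(required)].isna().mean()
    assert (null_rates <= 0.01).all(), f"null rates too high:\n{null_rates}"

    # Timestamps must parse; errors='raise' surfaces bad rows immediately
    pd.to_datetime(df["event_time"], errors="raise")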

Continuous Integration & Deployment

  • Monitor data drift and automate bias detection using tools like Evidently AI or Fiddler AI (see the sketch below)
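
For the data drift part, a minimal sketch using Evidently’s Report API (module paths follow the 0.4-style API and may differ in other versions; the file paths are placeholders):

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# reference_df: data the model was trained/validated on
# current_df: fresh production data to compare against it
reference_df = pd.read_parquet("reference.parquet")  # placeholder paths
current_df = pd.read_parquet("current.parquet")

report = Report(metrics=[DataDriftPreset()])
report.run(reference_data=reference_df, current_data=current_df)
report.save_html("data_drift_report.html")  # artifact for review/CI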

Model Registry & Versioning

  • Maintain governance metadata in model registries and establish versioning strategies aligned with data policies.

Model Deployment

  • Deploy models with staged practices to ensure compliance, and monitor input/output data for privacy adherence.

Prediction Serving

  • Protect sensitive data with encryption and implement data access controls
  • Monitor prediction quality, fairness, and compliance

System Monitoring
  • Track metrics like data quality and model drift
  • Use tools like Prometheus and Grafana for real-time alerts (a sketch follows)
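
A small sketch of exporting such metrics with the official prometheus_client library (the metric names are illustrative); Grafana can then chart and alert on whatever Prometheus scrapes:

from prometheus_client import Gauge, start_http_server

# Expose governance/quality metrics on :8000/metrics for Prometheus to scrape.
null_rate = Gauge("data_null_rate", "Share of null values in the latest batch")
drift_score = Gauge("model_drift_score", "Latest data drift score")

start_http_server(8000)
# In a real pipeline these would be updated per batch; values here are dummies.
null_rate.set(0.004)
drift_score.set(0.12)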

Metadata Management
  • Centralize metadata to ensure lineage, provenance, and compliance tracking
  • Leverage platforms like Alation or Apache Atlas for comprehensive management

Overview of AI Governance

AI Governance Defined

The objective of AI governance is to establish and maintain a framework that ensures AI systems are developed, deployed, and used responsibly, ethically, and in alignment with organizational goals and values. AI governance addresses AI algorithms, decision-making based on AI predictions, security, and data privacy. Within an enterprise, AI governance is an ecosystem of people, processes, and policies.

Core Principles of AI Governance

The core principles for AI governance:

  • human-centred design and oversight
  • privacy and data security
  • safety, security, and dependability
  • ethical and responsible practices
  • transparency and explainability
  • accountability, redress, and remedy
  • fairness and inclusiveness
  • reproducibility
  • robustness
  • innovation and competition

AI System Lifecycle and AI Governance

  • business and data understanding
  • data preparation
  • model engineering
  • model evaluation
  • model deployment
  • model monitoring and maintenance

The integration of AI Governance and MLOps

Incorporating AI governance into the MLOps Stack Canvas ensures ethical, legal, and responsible AI practices throughout the machine learning lifecycle. By addressing risks such as bias, fairness, privacy, and security, this approach enhances compliance, promotes cross-functional collaboration, and ensures higher-quality, trustworthy models.

Value Proposition:

  • align AI projects with ethical principles and regulations (e.g., the EU AI Act, GDPR)
  • perform risk classifications and ethical impact assessments
  • identify affected stakeholders, including vulnerable groups

Data Sources & Versioning:

  • ensure data sourcing follows ethical guidelines
  • assess data quality and bias
  • conduct data ethics reviews

Data Analysis & Experiment Management:

  • use tools for transparency and accountability (e.g., fairness metrics, XAI techniques; see the SHAP sketch after this list)
  • document experiments for reproducibility and audit trails
  • implement ethical reviews for design and results
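
As one concrete (and hedged) example of an XAI technique, SHAP values can document why a model made its predictions; the model and data below are placeholders standing in for a real experiment:

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

# Placeholder model and data for the example.
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# shap.Explainer dispatches to TreeExplainer for this model; the resulting
# per-feature contributions can be archived with the experiment as part of
# the audit trail.
explainer = shap.Explainer(model)
explanation = explainer(X.iloc[:100])
print(explanation.values.shape)  # (rows, features[, classes])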

Feature Store & Engineering:

  • detect and mitigate feature bias
  • use privacy-preserving techniques
  • ensure feature documentation for governance

CI/CT/CD Pipelines:

  • add ethical checkpoints and compliance automation
  • test for robustness and bias using tools like AI Fairness 360 (see the sketch after this list)
  • incorporate human oversight and gradual rollouts
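
A hedged sketch of a bias check with AI Fairness 360; the protected attribute, groups, and toy data are all invented for the example:

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy frame: 'sex' is the protected attribute, 'hired' the binary label.
df = pd.DataFrame({
    "sex":   [0, 0, 1, 1, 1, 0, 1, 0],
    "hired": [0, 1, 1, 1, 0, 0, 1, 0],
})
dataset = BinaryLabelDataset(
    df=df, label_names=["hired"], protected_attribute_names=["sex"]
)

metric = BinaryLabelDatasetMetric(
    dataset,
    unprivileged_groups=[{"sex": 0}],
    privileged_groups=[{"sex": 1}],
)
# Disparate impact below ~0.8 is a common red flag (the "four-fifths rule").
print("disparate impact:", metric.disparate_impact())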

Model Registry & Deployment:

  • document ethical metadata and approvals
  • monitor models for ethical concerns post-deployment
  • enable user feedback and explainability

Monitoring & Metadata Management:

  • continuously track fairness, bias, and data drift
  • maintain comprehensive governance documentation with traceability
  • add regulatory compliance metadata and decision traceability

AI Compliance vs. AI Governance vs. AI Risk Management

AI Compliance focuses on adhering to legal, regulatory, and policy standards, ensuring ethical behavior, and promoting transparency through rigorous processes, metrics, and stakeholder engagement.

AI Governance establishes a framework for ethical and responsible AI development, emphasizing transparency, stakeholder engagement, and balancing innovation with ethical considerations.

AI Risk Management identifies and mitigates risks associated with AI systems, ensuring robust risk assessment, risk mitigation strategies, and promoting a culture of risk awareness and proactive risk management.

Well-known examples of AI risks

  • algorithmic bias
  • privacy violations and social manipulation
  • deepfakes
  • autonomous weapons
  • AI-enabled cyberattacks
  • lack of transparency
  • environmental impact
  • safety risks in critical systems
  • intellectual property issues
  • autonomous vehicle accidents

The NIST AI Risk Management Framework (AI RMF) proposes four core functions that form the foundation for managing risks associated with artificial intelligence systems: govern, map, measure, and manage.

[image: the four core functions of the NIST AI RMF]

Emerging trends in data and AI governance:

  • focus on AI ethics, trust, and enhanced data privacy
  • adoption of AI in data management and proactive compliance automation
  • focus on data lineage and provenance in AI
  • shift left data and AI governance
  • decentralized data governance

Chapter 4 - Tailoring MLOps for Different Risk Levels

Creating AI System Landscape

To navigate the EU AI Act, one should first get an overview of existing AI systems (or, broadly speaking, AI use cases) and assess whether these systems fall under the legislation or whether no action is needed. It is also helpful to distinguish whether you plan to put these AI use cases into production or whether they are just experiments or proofs of concept. Whenever an AI use case is developed purely as an internal research project, it can be safely excluded from the compliance obligations.

Inventory Your AI Systems

Start by cataloging all AI systems used or developed within your organization by following these steps:

  • identify AI applications across all departments
  • include both custom-developed and third-party AI solutions
  • document key details like department, purpose, data used, deployment status, and risk category.

Include:

  • basic information (system name, department, purpose and functionality)
  • technical details (data sources, deployment status, vendor information)
  • risk assessment (risk level and justification)
  • additional notes (integration with other systems, responsible team/individual, date of last review)

Example:

[image: example AI system inventory table]
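
Since the book’s example is an image, here is a stand-in sketch of one inventory entry as a Python dataclass; the field names are my own, derived from the bullet points above:

from dataclasses import dataclass
from datetime import date

@dataclass
class AISystemRecord:  # Python 3.10+ type syntax
    # Basic information
    system_name: str
    department: str
    purpose: str
    # Technical details
    data_sources: list[str]
    deployment_status: str  # e.g. "production", "PoC"
    vendor: str | None      # None for custom-built systems
    # Risk assessment
    risk_level: str         # "prohibited" / "high" / "limited" / "low"
    risk_justification: str
    # Additional notes
    responsible_team: str
    last_review: date

resume_screener = AISystemRecord(
    system_name="CV pre-screening model",
    department="HR",
    purpose="rank incoming job applications",
    data_sources=["ATS database"],
    deployment_status="production",
    vendor=None,
    risk_level="high",  # employment use cases are high-risk under the Act
    risk_justification="Annex III: employment and workers management",
    responsible_team="People Analytics",
    last_review=date(2024, 12, 1),
)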

Application of the EU AI Act

The process of determining the EU AI Act’s relevance to AI Systems and Risk Classification

[image: decision flowchart for AI Act applicability and risk classification]

Prohibited AI Practices

The AI Act aims to protect fundamental rights, democracy, and the rule of law by banning AI applications that are incompatible with EU values and rights. In general, the AI Act prohibits AI systems from:

  • using subliminal, manipulative, or deceptive techniques to distort behavior and impair informed decision-making, causing significant harm
  • exploiting vulnerabilities related to age, disability, or socio-economic circumstances to distort behavior, causing significant harm
  • utilizing biometric categorization systems to infer sensitive attributes such as race, political opinions, religious beliefs, and sexual orientation
  • conducting social scoring, i.e., evaluating individuals based on social behavior or personal traits, which can lead to discriminatory treatment
  • creating or expanding facial recognition databases through untargeted scraping of facial images from the internet or CCTV footage
  • inferring emotions in workplaces or educational institutions, except for medical or safety reasons

Determining AI Act Obligation

To determine the EU AI Act obligations, there are two fundamental steps in AI systems assessment:

  1. Determine whether the AI Act applies to any AI systems from the inventory above: Do they meet the definition of “AI system”? Are they in the scope of the AI Act? Then conduct a risk classification for each AI system: Is it high-risk or not? Do transparency obligations apply?
  2. From the inventory above, determine your organization’s role in relation to AI systems. Is your organization an AI system provider or deployer?

Framework for Classification of AI Systems by Risk Levels

High-Risk AI Systems: these are subject to the strictest obligations under the Act. Key areas include:

  • biometric identification
  • critical infrastructure
  • education and employment
  • law enforcement and justice
  • migration and border control

Limited-Risk AI Systems: these systems require transparency measures but are not as heavily regulated as high-risk systems (Article 50). Examples include chatbots, emotion recognition systems, and deepfake generation tools. A questionnaire helps determine compliance.

Low-Risk AI Systems: pose minimal societal risks. Examples include spam filters, recommendation engines, and basic chatbots. No specific obligations are imposed under the Act unless their context or usage escalates risks.
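
To make the flow concrete, here is a toy sketch of the classification logic described above. The categories and checks are heavily simplified and purely illustrative; real classification requires legal review:

# Simplified, illustrative risk triage -- not legal advice.
HIGH_RISK_AREAS = {
    "biometric identification", "critical infrastructure",
    "education", "employment", "law enforcement", "justice",
    "migration", "border control",
}
TRANSPARENCY_TRIGGERS = {"chatbot", "emotion recognition", "deepfake generation"}

def classify(use_case_area: str, prohibited: bool) -> str:
    if prohibited:
        return "prohibited - may not be placed on the EU market"
    if use_case_area in HIGH_RISK_AREAS:
        return "high-risk - the strictest obligations apply"
    if use_case_area in TRANSPARENCY_TRIGGERS:
        return "limited-risk - transparency obligations (Article 50)"
    return "low-risk - no specific obligations under the Act"

print(classify("employment", prohibited=False))  # high-risk
print(classify("chatbot", prohibited=False))     # limited-risk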

Planning Data Governance, AI Governance, and MLOps for Compliance with the EU AI Act

[image: data governance, AI governance, and MLOps planning across AI project stages]

Engineering and MLOps tasks for EU AI Act compliance grow more detailed and intensive as AI projects progress from ideation to maturity. Robust infrastructure, monitoring, and governance are essential to support compliance and system reliability. Key points include:

  • MLOps and compliance integration: MLOps processes like data pipelines and model monitoring directly support EU AI Act requirements
  • data governance: increasingly critical in later stages for ensuring data quality and regulatory compliance
  • human oversight: evolving sophistication to align with EU AI Act mandates for high-risk AI systems
  • risk management: progresses from high-level to continuous, automated assessments
  • automation and scalability: emphasis grows as systems mature
  • MVP stage: focus shifts to audit readiness and conformity assessments for high-risk AI systems

Emerging Roles in Organizations for EU AI Act Compliance

  • the AI ethics specialist
  • the AI internal auditor
  • the legal AI compliance officer or AI regulatory affairs specialist
  • data and AI governance specialist
  • AI risk management specialist
  • AI security specialist

I decided to do the homework today ~

Task:

Create a Flink job that sessionizes the input data by IP address and host; use a 5-minute gap.

import os
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment
from pyflink.table.expressions import lit, col
from pyflink.table.window import Session


def create_aggregated_events_ip_host_sink_postgres(t_env):
    # JDBC sink: one row per (ip, host) session with its start time and hit count
    table_name = 'processed_events_aggregated_hw'
    sink_ddl = f"""
        CREATE TABLE {table_name} (
            event_hour TIMESTAMP(3),
            ip VARCHAR,
            host VARCHAR,
            num_hits BIGINT
        ) WITH (
            'connector' = 'jdbc',
            'url' = '{os.environ.get("POSTGRES_URL")}',
            'table-name' = '{table_name}',
            'username' = '{os.environ.get("POSTGRES_USER", "postgres")}',
            'password' = '{os.environ.get("POSTGRES_PASSWORD", "postgres")}',
            'driver' = 'org.postgresql.Driver'
        );
    """
    t_env.execute_sql(sink_ddl)
    return table_name

def create_processed_events_source_kafka(t_env):
    # Kafka source table; event_time strings are parsed into a timestamp
    # column with a 15-second watermark so event-time windows can be used
    kafka_key = os.environ.get("KAFKA_WEB_TRAFFIC_KEY", "")
    kafka_secret = os.environ.get("KAFKA_WEB_TRAFFIC_SECRET", "")
    table_name = "process_events_kafka_hw"
    # doubled single quotes escape the literal T and Z inside the SQL string
    pattern = "yyyy-MM-dd''T''HH:mm:ss.SSS''Z''"
    source_ddl = f"""
        CREATE TABLE {table_name} (
            url VARCHAR,
            referrer VARCHAR,
            user_agent VARCHAR,
            host VARCHAR,
            ip VARCHAR,
            headers VARCHAR,
            event_time VARCHAR,
            window_timestamp AS TO_TIMESTAMP(event_time, '{pattern}'),
            WATERMARK FOR window_timestamp AS window_timestamp - INTERVAL '15' SECOND
        ) WITH (
            'connector' = 'kafka',
            'properties.bootstrap.servers' = '{os.environ.get('KAFKA_URL')}',
            'topic' = '{os.environ.get('KAFKA_TOPIC')}',
            'properties.group.id' = '{os.environ.get('KAFKA_GROUP')}',
            'properties.security.protocol' = 'SASL_SSL',
            'properties.sasl.mechanism' = 'PLAIN',
            'properties.sasl.jaas.config' = 'org.apache.flink.kafka.shaded.org.apache.kafka.common.security.plain.PlainLoginModule required username=\"{kafka_key}\" password=\"{kafka_secret}\";',
            'scan.startup.mode' = 'latest-offset',
            'properties.auto.offset.reset' = 'latest',
            'format' = 'json'
        );
    """
    t_env.execute_sql(source_ddl)
    return table_name


def log_aggregation():
    # Set up the execution environment
    env = StreamExecutionEnvironment.get_execution_environment()
    env.enable_checkpointing(10 * 1000)  # interval is in milliseconds; 10 s is saner than the original 10 ms
    env.set_parallelism(3)

    # Set up the table environment
    settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
    t_env = StreamTableEnvironment.create(env, environment_settings=settings)

    try:
        # Create Kafka table
        source_table = create_processed_events_source_kafka(t_env)
        print('running hw job')
        aggregated_table = create_aggregated_events_ip_host_sink_postgres(t_env)
        # Sessionize by (ip, host): events on the same key that arrive within
        # 5 minutes of each other (in event time) fall into one session window
        t_env.from_path(source_table) \
            .window(Session.with_gap(lit(5).minutes).on(col("window_timestamp")).alias("w")) \
            .group_by(
                col("w"),
                col("host"),
                col("ip")
            ) \
            .select(
                col("w").start.alias("event_hour"),  # session start timestamp
                col("ip"),
                col("host"),
                col("host").count.alias("num_hits")  # events in the session
            ) \
            .execute_insert(aggregated_table).wait()

    except Exception as e:
        print("Writing records from Kafka to JDBC failed:", str(e))


if __name__ == '__main__':
    log_aggregation()
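
For comparison, I believe the same sessionization can be written with Flink SQL’s (older) session group windows instead of the Table API; a sketch, reusing the source_table and aggregated_table names from the job above (this syntax may be deprecated on newer Flink versions in favor of table-valued session windows):

# Sketch: SESSION group windows in Flink SQL, equivalent in spirit to the
# Table API pipeline above (would run inside log_aggregation's try block)
t_env.execute_sql(f"""
    INSERT INTO {aggregated_table}
    SELECT
        SESSION_START(window_timestamp, INTERVAL '5' MINUTE) AS event_hour,
        ip,
        host,
        COUNT(*) AS num_hits
    FROM {source_table}
    GROUP BY
        SESSION(window_timestamp, INTERVAL '5' MINUTE),
        ip,
        host
""").wait()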

The thing is that the data every student gets will be different because we run the job at different times, but I think my code is the same as some other students’ from what I’ve seen discussed on the community Discord server ~ We’ll see when the submission opens 😆


That is all for today!

See you tomorrow :)