(Day 289) Reading more of the Fin. DE book + stream

Ivan Ivanov · October 16, 2024

Hello :) Today is Day 289!

A quick summary of today:

  • read more of the financial data engineering book
  • streamed and learned about permutation importance in sklearn

Financial Data Eng Ch. 6 - Overview of the FDE lifecycle

image

Below are the criteria for building a financial data engineering stack

Criterion 1: open source vs. commercial software

The main advantages of commercial software are:

  • vendor accountability
  • product integrations
  • enterprise-grade features
  • user-friendly experience

The disadvantages of commercial software are:

  • cost
  • bulky products
  • vendor lock-in
  • lack of customisation

The main advantages of open source are:

  • cost
  • customisation
  • community
  • transparency

The main disadvantages of open source are:

  • support
  • documentation
  • complexity
  • compatibility
  • security
  • confidentiality

Criterion 2: ease of use vs. performance

This criterion is about the tradeoff between ease of use and performance in software and hardware. Complex technologies, such as low-level programming languages (e.g. C) or specialized hardware (e.g. GPUs, FPGAs), provide significant performance advantages but are harder to learn, implement, and manage. On the other hand, simpler, user-friendly technologies (e.g. Python, CPUs) are easier to use but may offer less optimal performance.

In financial markets, especially in high-frequency trading, low latency is critical, as even millisecond delays can result in large financial losses. Trading firms invest in optimized systems, such as Direct Market Access (DMA) and colocation with exchanges, to reduce latency. FPGAs are often used to gain speed advantages in processing orders, but these setups can raise fairness concerns due to differential access to market data.

Criterion 3: cloud vs. on premises

Criterion 4: public vs. private vs. hybrid cloud

Criterion 5: single vs. multi-cloud

Criterion 6: monolithic vs. modular codebase

Chapter 7 - Data ingestion layer

Data transmission and arrival processes

Data ingestion is a process wherein data gets transmitted from a given source, travels over a network, and arrives at its destination. Understanding the details of such a process is critical for designing a reliable financial data infrastructure that meets different business and technical requirements.

Data transmission protocols

Data transmission protocols enable communication between systems over a network. These protocols form essential layers in data infrastructure, particularly for industries like finance that rely heavily on data transfers.

Protocols are organized in layered models, with the most popular being the seven-layer OSI model and the simpler four-layer TCP/IP model. The TCP/IP model includes:

  • Application Layer: contains protocols like HTTP, SMTP, and DNS for web, email, and file transfers
  • Transport Layer: responsible for reliable data transfer using TCP (connection-oriented) and UDP (faster but less reliable). It also includes TLS for secure transmission
  • Network Layer: handles data routing using IP addresses, with IPsec providing encryption
  • Network Access Layer: manages data transmission over physical hardware (e.g. Ethernet, WiFi)

Financial communication standards, like EBICS, leverage these protocols for secure and reliable transactions, utilizing TCP/IP and TLS encryption. Understanding these protocols is crucial for designing secure and efficient financial data infrastructures.
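
To make the transport-layer pieces a bit more concrete, here is a tiny Python sketch that opens a TCP connection and wraps it in TLS before sending an application-layer (HTTP) request. The host is just a placeholder, nothing finance-specific.

import socket
import ssl
# Plain TCP connection (transport layer), wrapped in TLS so the
# application-layer payload travels encrypted.
hostname = "example.com"  # placeholder host
context = ssl.create_default_context()
with socket.create_connection((hostname, 443)) as tcp_sock:
    with context.wrap_socket(tcp_sock, server_hostname=hostname) as tls_sock:
        print("negotiated:", tls_sock.version())
        tls_sock.sendall(b"GET / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n")
        print(tls_sock.recv(200))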

image

Data arrival processes (DAP)

Data may arrive at the ingestion layer with different temporal and structural patterns.

  • scheduled DAP
  • event-driven DAP
  • homogeneous DAP (the data has pre-determined consistent properties)
  • heterogeneous DAP (data may possess variable attributes such as extension, format, type, content, schema)
  • single-item DAP (data is ingested either on a record-at-a-time or file-at-a-time basis)
  • bulk DAP

Data ingestion formats

  • general-purpose formats (csv, tsv, txt, json, xml, etc)
  • big data formats (parquet, avro, orc)
  • in-memory formats (apache arrow, RDD by Spark)
  • standardised financial formats: the general-purpose formats above can be used, but if market participants each use different ones, interoperability issues are inevitable. To that end, several standard formats have been created (or at least attempted) - FpML, FIX, IFX, MDDL, FEDI, OFX, XBRL, ISO 8583, ISO 15022, ISO 20022, SWIFT messages
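
A quick sketch of how the general-purpose, big data, and in-memory formats above look in practice, using pandas and pyarrow on a made-up price table:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
# A tiny made-up price table round-tripped through three of the formats above.
df = pd.DataFrame({
    "symbol": ["A", "B"],
    "date": pd.to_datetime(["2022-01-03", "2022-01-03"]),
    "close": [101.2, 55.7],
})
df.to_csv("prices.csv", index=False)           # general-purpose text format
df.to_parquet("prices.parquet")                # columnar big data format (pyarrow under the hood)
table = pa.Table.from_pandas(df)               # in-memory Arrow representation
pq.write_table(table, "prices_arrow.parquet")  # Arrow table persisted as Parquet
print(table.schema)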

Data ingestion technologies

Below are 6 of the most common ingestion technologies used in financial markets

  • financial APIs
  • financial data feeds
  • secure file transfer
  • cloud access
  • web access
  • specialised financial software
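
For the financial API case, ingestion often boils down to an authenticated HTTP call returning JSON. The endpoint and key below are completely made up, just to show the shape of it:

import requests
# Hypothetical endpoint and API key, purely for illustration -- real financial
# APIs differ in URL structure, authentication, and response schema.
BASE_URL = "https://api.example-market-data.com/v1/prices"
params = {"symbol": "AAPL", "start": "2022-01-01", "end": "2022-02-01"}
headers = {"Authorization": "Bearer <YOUR_API_KEY>"}
response = requests.get(BASE_URL, params=params, headers=headers, timeout=10)
response.raise_for_status()    # fail loudly on HTTP errors
prices = response.json()       # most such APIs return JSON payloads
print(len(prices), "records ingested")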

Data ingestion best practices

  • meet business requirements
  • design for change
  • enforce data governance (data validation, logging and reporting, lineage and visibility, security)
  • perform benchmarking and stress testing

Chapter 8 - Data storage layer

Principles of data storage system (DSS) design

  1. meet business requirements
  2. data modelling (conceptual, logical, and physical plan)
  3. transactional guarantees (ensure data consistency and reliability - ACID/BASE; see the small sketch after this list)
  4. consistency tradeoffs (the CAP theorem - consistency, availability, partition tolerance - involves balancing between ensuring all nodes have the latest data (consistency) and keeping the system operational during network issues (availability))
  5. scalability
  6. security
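
Here is the small sketch promised in point 3: a minimal illustration of an ACID (atomic) transaction using SQLite, where a transfer either commits fully or not at all.

import sqlite3
# Both legs of the transfer commit together or not at all (atomicity);
# the CHECK constraint keeps the data consistent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance REAL NOT NULL CHECK (balance >= 0))")
conn.execute("INSERT INTO accounts VALUES ('alice', 100.0), ('bob', 50.0)")
conn.commit()
try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 200 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 200 WHERE id = 'bob'")
except sqlite3.IntegrityError:
    print("transfer rejected, no partial update")
print(dict(conn.execute("SELECT id, balance FROM accounts")))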

Data storage modelling

  • SQL vs. noSQL
  • primary (secure and permanent repository for data storage) vs. secondary (may not maintain the same level of data consistency, but is generally intended for use cases where this discrepancy does not cause significant issues)
  • operational (OLTP) vs. analytical (OLAP)
  • native vs. non-native (e.g. storing graph data as a graph in neo4j vs. in postgres)
  • multi-model vs. polyglot persistence (e.g. using postgres and its extensions to store different types of data in a single database, vs. using neo4j for graphs, mongodb for content and posts, and redis for caching frequently accessed data)

Data storage models

The data lake model

A data lake is a centralized repository that stores massive amounts of raw data, allowing flexible storage of both structured and unstructured data. It supports simple data ingestion, data variety, and integration while enabling cost-effective archiving and analysis. However, data lakes require proper governance, including security, privacy, and data cataloging to prevent turning into disorganized “data swamps.” They are widely used in financial institutions for cloud migration, data aggregation, compliance, and handling unstructured data.

The relational model

The most popular and highly trusted DSSs are built on the relational DSM, which organizes data in rows and columns, forming a table or relation.

image

By design, relational databases offer the strongest ACID transactional guarantee among all DSSs.

  • analytical querying: one of the strongest arguments in favor of using SQL databases is their advanced querying capabilities
  • schema enforcement: relational dbs require and enforce a data schema
  • data modelling is common with relational dbs
  • normalisation is the most popular SQL data modelling technique

The document model

JSON and XML are popular document formats.

Document dbs offer:

  • schema flexibility
  • document modelling
  • horizontal scalability
  • ACID transactions
  • performance

The time series model

Time series data, consisting of sequential measurements over time, is widely used in financial markets for analyzing trends, risks, and predicting prices. To efficiently manage this growing data, specialized time series databases (TSDBs) have emerged, offering features like fast querying, time-based aggregations, and data immutability. Native TSDBs like kdb+ and InfluxDB are optimized for high performance, especially in time-critical applications like high-frequency trading, while general-purpose databases like PostgreSQL and MongoDB offer time series extensions to handle such data.
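
The kind of time-based aggregation TSDBs are optimised for, sketched with pandas on made-up minute-level prices:

import numpy as np
import pandas as pd
# One trading day of made-up minute prices, aggregated into 5-minute OHLC bars.
idx = pd.date_range("2022-01-03 09:30", periods=390, freq="min")
prices = pd.Series(100 + np.random.randn(390).cumsum(), index=idx, name="price")
bars = prices.resample("5min").ohlc()   # open/high/low/close per 5-minute bucket
print(bars.head())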

The message broker model

image

Message brokers are crucial for event-driven, distributed, and streaming architectures, facilitating the exchange of messages between applications. They decouple data production and consumption, enabling independent scaling of producer and consumer applications based on load. With fault tolerance, message brokers ensure that failures in producers or consumers don’t impact operations. Key elements include topics, which organize messages similarly to tables in SQL, and message schemas, which define message structure to prevent complexity. Proper governance, including topic management and schema validation, ensures smooth communication and scalability in large systems.

Technical implementations should:

  • cover business needs
  • ensure performance
  • have an appropriate delivery mode (at most once, at least once, and exactly once)
  • have message persistence
  • be scalable
  • may offer message prioritisation
  • offer message ordering
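
A minimal producer/consumer sketch with the kafka-python client, assuming a broker is running on localhost and using a made-up 'trades' topic. The point is the decoupling: each side only knows the broker and the topic.

from kafka import KafkaProducer, KafkaConsumer
import json
# Assumes a Kafka broker at localhost:9092; the topic name is made up.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("trades", {"symbol": "A", "qty": 100, "price": 101.2})
producer.flush()
# The consumer could live in a completely separate application.
consumer = KafkaConsumer(
    "trades",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,   # stop iterating when no new messages arrive
)
for message in consumer:
    print(message.topic, message.value)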

The graph model

Graphs can offer:

  • centrality analysis
  • search and pathfinding
  • community detection
  • contagion analysis (how a shock impacting one or a set of nodes propagates throughout the rest of the network; see the sketch after this list)
  • link prediction
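
And the sketch mentioned above: a toy interbank network in networkx, with centrality, pathfinding, and a naive contagion pass.

import networkx as nx
# Toy interbank exposure network: nodes are banks, edges are lending links.
G = nx.Graph()
G.add_edges_from([("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")])
print(nx.degree_centrality(G))         # centrality analysis
print(nx.shortest_path(G, "A", "E"))   # search and pathfinding
# Naive contagion: a shock at bank A spreads to all neighbours each round.
shocked = {"A"}
for _ in range(2):
    shocked |= {nbr for node in shocked for nbr in G.neighbors(node)}
print(shocked)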

Graph data modelling:

image

The warehouse model

A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions.

Bill Inmon

A data warehouse is a copy of transaction data specifically structured for query and analysis.

Ralph Kimball

image

Why data warehouses?

  • consistent structure for data
  • advanced analytics
  • scalability
  • subject-oriented
  • integrated
  • nonvolatile
  • time-variant

Data modelling with data warehouses

  • the subject modeling approach of Bill Inmon, vs.
  • the dimensional modeling approach of Ralph Kimball

In the Inmon approach, data is sourced from various operational systems across the organization and integrated into a centralized data warehouse. Subsequently, based on the specific needs of individual departments or business units, dedicated data marts are derived from the data warehouse. In this model, the data warehouse and the data marts are physically separate entities, each with its own storage and characteristics. Data within the data marts is considered consistent and reliable, as it originates from the centralized source of truth—the data warehouse itself. Users from different departments typically employ specialized tools to access and analyze the data within their respective data marts

image

The Kimball approach takes a different perspective. Instead of isolating the data warehouse from the data marts, Kimball suggested a user requirement-driven approach that defines the data marts in advance and integrates them within the data warehouse itself

image

Dimension modelling

Two main types

image

Examples are Google BigQuery, Amazon Redshift, Snowflake
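
I assume the two types in the figure are the star and snowflake schemas; here is a minimal star-schema sketch (one fact table plus two dimension tables) in SQLite, just to make the idea concrete.

import sqlite3
# Minimal star schema: a fact table referencing two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_security (security_id INTEGER PRIMARY KEY, symbol TEXT, sector TEXT);
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, date TEXT, quarter TEXT);
CREATE TABLE fact_trades  (
    trade_id    INTEGER PRIMARY KEY,
    security_id INTEGER REFERENCES dim_security(security_id),
    date_id     INTEGER REFERENCES dim_date(date_id),
    quantity    INTEGER,
    price       REAL
);
INSERT INTO dim_security VALUES (1, 'A', 'Tech'), (2, 'B', 'Energy');
INSERT INTO dim_date VALUES (1, '2022-01-03', '2022Q1');
INSERT INTO fact_trades VALUES (1, 1, 1, 100, 101.2), (2, 2, 1, 50, 55.7);
""")
# Analytical queries join the fact table with whichever dimensions they need.
query = """
SELECT s.sector, d.quarter, SUM(t.quantity * t.price) AS notional
FROM fact_trades t
JOIN dim_security s ON s.security_id = t.security_id
JOIN dim_date d ON d.date_id = t.date_id
GROUP BY s.sector, d.quarter;
"""
print(conn.execute(query).fetchall())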

The blockchain model

Blockchain is a data structure used to store data as a chain of linked blocks, offering immutability and tamper resistance. Its primary application is in creating digital ledgers, such as those for cryptocurrencies like Bitcoin and Ethereum. Blockchain ensures data integrity through cryptographic hashes and distributed ledger technology (DLT), where decentralized nodes validate and maintain the ledger. However, blockchain has limitations, such as slow data querying and performance issues with increasing nodes. Blockchain databases, like BigchainDB and Amazon QLDB, address these by combining traditional database performance with blockchain’s immutability and security.
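
The linked-blocks idea in a few lines of Python (a toy hash chain, nothing like a real blockchain implementation):

import hashlib
import json
# Each block stores the hash of the previous block, so changing any earlier
# block breaks every hash that follows (tamper evidence).
def block_hash(block):
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
chain = [{"index": 0, "data": "genesis", "prev_hash": "0" * 64}]
for i, data in enumerate(["tx: A pays B 10", "tx: B pays C 4"], start=1):
    chain.append({"index": i, "data": data, "prev_hash": block_hash(chain[-1])})
# Verify the chain: every block must reference the hash of its predecessor.
ok = all(chain[i]["prev_hash"] == block_hash(chain[i - 1]) for i in range(1, len(chain)))
print("chain valid:", ok)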

image

Chapter 9 - Data transformation and delivery layer

Data querying

  • time series queries
SELECT time_column, attribute_1, attribute_2 
FROM financial_entity_table 
WHERE entity_name = 'A' 
AND date BETWEEN '2022-01-01' AND '2022-02-01'
  • cross-section queries
SELECT entity_name, attribute_1, attribute_2 
FROM financial_entity_table 
WHERE entity_name in ('A', 'B', 'C', 'D') 
AND date = '2022-02-01'
  • panel queries
SELECT time_column, entity_name, attribute_1, attribute_2 
FROM financial_entity_table 
WHERE entity_name in ('A', 'B', 'C', 'D') 
AND date BETWEEN '2022-01-01' AND '2022-02-01'
  • analytical queries
SELECT stock_symbol, date, MAX(price)
FROM price_table
GROUP BY stock_symbol, date

Recently, cloud data warehouse solutions such as BigQuery and Snowflake have introduced features enabling users to create and interact with machine learning models using advanced analytical SQL queries.

-- Create a linear regression model
CREATE MODEL `mydataset.my_model`
OPTIONS(MODEL_TYPE='LINEAR_REG') AS
SELECT
  feature_column1,
  feature_column2,
  label
FROM
  `mydataset.my_training_table_data`;

Calling the model would look like:

SELECT *
FROM
  ML.PREDICT(MODEL `mydataset.my_model`, (
    SELECT
      feature_column1,
      feature_column2
    FROM
      `mydataset.my_test_table_data`
  ));

Query optimisation

  • database-side query optimisations (indexing, clustering, partitioning)
  • user-side query optimisations (use indexes; for composite indexes, filter by the leftmost columns; avoid SELECT *; use anchored patterns in LIKE; break large queries into smaller, incremental batches to handle failures efficiently; filter data before joins to improve performance)
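
A tiny illustration of the indexing point, using SQLite's EXPLAIN QUERY PLAN to compare the plan before and after adding an index (the table mirrors the price_table used earlier):

import sqlite3
# Compare query plans before and after adding an index on the filtered column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE price_table (stock_symbol TEXT, date TEXT, price REAL)")
query = "SELECT date, price FROM price_table WHERE stock_symbol = 'A'"
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())   # full table scan
conn.execute("CREATE INDEX idx_symbol ON price_table (stock_symbol)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())   # uses idx_symbol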

image

Data transformation

Transformation operations

  • format conversion
  • data cleaning
  • data adjustments
  • data standardisation
  • data filtering
  • feature engineering
  • advanced analytical computations

Transformation patterns

  • batch vs. streaming transformations

image

  • memory-based vs. disk-based transformations

image

  • full vs. incremental (CDC) data transformations
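
A rough sketch of the incremental idea with pandas: keep a watermark of the last processed timestamp and only transform rows that arrived after it (data and column names are made up).

import pandas as pd
# Only rows newer than the stored watermark get transformed on this run.
source = pd.DataFrame({
    "trade_id": [1, 2, 3, 4],
    "ingested_at": pd.to_datetime(["2022-02-01 09:00", "2022-02-01 09:05",
                                   "2022-02-01 09:10", "2022-02-01 09:15"]),
    "price": [101.2, 101.5, 101.1, 101.8],
})
last_watermark = pd.Timestamp("2022-02-01 09:05")           # persisted from the previous run
increment = source[source["ingested_at"] > last_watermark]  # only the new rows
transformed = increment.assign(price_cents=increment["price"] * 100)
new_watermark = source["ingested_at"].max()                 # store for the next run
print(transformed)
print("next watermark:", new_watermark)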

Data delivery

Financial data engineers must determine who the ultimate data consumers are and understand their specific needs, and then create the appropriate mechanisms to deliver the data.

Data consumers include any user, team, or system that uses data from a company’s infrastructure. They may range from compliance officers and marketing teams, who specify data needs and rely on data engineers, to analysts and machine learning specialists more involved in data processes. Roles within data governance include data owners (e.g., finance directors managing financial data), data stewards (ensuring data quality), and data custodians (protecting data). Data contracts formalize expectations between producers and consumers, detailing requirements like data type, format, and SLAs.

Data delivery mechanisms vary, including direct database/file access via interfaces, programmatic access through APIs or libraries, reports, dashboards, and email. Effective delivery design requires helping users search and find data, with practices like specifying data locations in contracts and using data catalogs as central search engines.

Chapter 10 - The monitoring layer

Data monitoring is a continuous task requiring close collaboration between financial data engineers and business teams. It enables financial institutions to operate securely, compliantly, and efficiently in a rapidly changing environment. Monitoring is needed for:

  • operational continuity and efficiency
  • compliance
  • risk management

When designing this layer, the first question to ask is which components of the financial data infrastructure we need to monitor.

Metrics, Events, Logs, and Traces

  • metrics: quantitative measurements that provide information about a particular aspect of a system, process, variable, or activity; metrics are usually observed over a specific time frame and are often enriched with metadata
  • events: structured data emitted by an application during runtime
  • logs: semi-structured data emitted by an application during runtime, providing greater flexibility in terms of content, structure, and triggering mechanisms compared to events (log types: debug, info, warning, error, critical)
  • traces: represent a more advanced form of system behavior records, capturing comprehensive details about each step taken by a process as it progresses through one or more systems (types: business operation traces, application traces, security incident traces)
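
The five log levels from the list above, emitted with Python's standard logging module (messages are made up):

import logging
logging.basicConfig(level=logging.DEBUG, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("ingestion")
log.debug("parsed 1024 records from prices.csv")
log.info("batch loaded into the staging table")
log.warning("3 records had missing settlement dates")
log.error("could not reach the market data feed, retrying")
log.critical("ingestion pipeline halted, manual intervention required")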

Data quality monitoring

Data quality monitoring in financial institutions focuses on ensuring data consistently meets high standards, covering aspects like errors, duplicates, biases, completeness, and timeliness. The process begins with defining Data Quality Metrics (DQMs), tailored to client expectations, business needs, and technical standards. Common metrics include error, duplicate, and missing value ratios. Data profiling, which analyzes a dataset’s structure and relationships, is a key monitoring method. Tools like Ataccama ONE, Great Expectations, and Talend Data Fabric help automate issue detection. Monitoring is a continuous process, adapting to evolving data types and business requirements.
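
The three common DQMs mentioned above, computed on a tiny made-up table with pandas:

import numpy as np
import pandas as pd
# Error, duplicate, and missing value ratios on a small sample table.
df = pd.DataFrame({
    "trade_id": [1, 2, 2, 4, 5],
    "price": [101.2, -3.0, -3.0, np.nan, 99.9],   # a negative price counts as an error here
})
n = len(df)
error_ratio = (df["price"] < 0).sum() / n
duplicate_ratio = df.duplicated().sum() / n
missing_ratio = df["price"].isna().sum() / n
print(f"errors={error_ratio:.0%} duplicates={duplicate_ratio:.0%} missing={missing_ratio:.0%}")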

Example of a data profile report:

image

Performance monitoring

Here are the key metrics for performance monitoring in financial data infrastructure:

  • General Metrics:
    • RAM usage
    • CPU usage
    • Storage usage
    • Execution time
    • Requests per second (RPS)
    • Ingress/egress bytes (bytes/sec)
    • Uptime/downtime
    • API response time
    • Concurrent users
  • Database-Specific Metrics:
    • Read/write operations
    • Active connections
    • Connection pool usage
    • Query count
    • Transaction runtime
    • Replication lag
    • Records read
    • Blocks read
    • Bytes scanned
    • Number of output rows
    • Scan time, sort time, aggregation time
    • Peak memory usage
    • Query plans
  • Financial Market-Specific Metrics:
    • Trade execution latency
    • Trade execution error rate
    • Trade settlement time
    • Market data latency
    • Algorithmic trading performance
    • Risk exposure calculation time
    • Customer transaction processing time
  • Incident Response Metrics:
    • Mean Time to Detect (MTTD)
    • Mean Time Between Failures (MTBF)
    • Mean Time to Recovery (MTTR)
    • Mean Time to Failure (MTTF)
    • Mean Time to Acknowledge (MTTA)

Cost monitoring

Cost monitoring involves tracking financial data infrastructure expenses to ensure they remain within acceptable limits and to detect patterns of excessive usage. With cloud services’ widespread use, managing costs is critical due to potential misinterpretations of pricing, particularly in serverless computing (like AWS Lambda). Misunderstandings about execution patterns, memory allocation, and data transfer fees can lead to unforeseen costs.

Key practices include setting budgets with alerts, adopting FinOps practices for financial accountability, and applying a FinOps Maturity Model (FMM) to improve cost management through collaboration between engineering, finance, and business teams.

Business and Analytical monitoring

  • advanced monitoring: observes not only raw data but also contextual and analytical insights for strategic decision-making
  • risk monitoring in financial institutions:
    • market risk: losses from market price movements
    • credit risk: losses from borrower defaults
    • operational risk: losses from internal process failures or external events
    • data engineering challenges arise in collecting, aggregating, and normalizing risk data for regulatory compliance (e.g. Basel III)
  • ML monitoring:
    • concept drift: model no longer accurately reflects the original problem
    • data drift: changes in input data distribution over time
    • model risk: financial models may produce inconsistent results due to changing market conditions
  • fraud and illegal activity monitoring:
    • money laundering, terrorism financing, fraud, corruption: monitoring for illegal transactions
    • market manipulation: activities such as pump and dump, spoofing, insider trading, and wash trading
  • investment monitoring:
    • monitoring key performance indicators like Sharpe, Sortino, and Treynor ratios to evaluate investment portfolio performance
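
A quick sketch of the Sharpe ratio from the last bullet: mean excess return divided by its standard deviation, annualised for daily data (the return series is made up).

import numpy as np
daily_returns = np.array([0.001, -0.002, 0.0015, 0.003, -0.001, 0.002])
risk_free_daily = 0.00005
excess = daily_returns - risk_free_daily
sharpe = excess.mean() / excess.std(ddof=1) * np.sqrt(252)   # ~252 trading days per year
print(round(sharpe, 2))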

Data observability

Observability elevates monitoring to a new level, allowing software and data engineering teams to handle issues proactively and gain deep insights into the internal state and behavior of a given software application or infrastructure

A system is observable if we can:

  • understand the inner workings of our application
  • understand any system state the application may have gotten itself into, even new ones we have never seen before and couldn’t have predicted
  • understand the inner workings and system state solely by observing and interrogating with external tools
  • understand the internal state without shipping any new custom code to handle it (because that implies you needed prior knowledge to explain it)

A data observability system can bring:

  • higher data quality
  • operational efficiency (reduced TTD and TTR)
  • facilitated communication and trust between different team members
  • gains in client trust by being able to detect and understand the issues they might face
  • enabling complex data ingestion and transformation while maintaining reliability and visibility into the system
  • regulatory compliance

Stream

Read two interesting examples from the sklearn docs:

Super interesting!

And also found more for tomorrow. Again from the ‘inspection’ part of the sklearn docs
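
Roughly what the permutation importance examples boil down to (my own minimal version, not the exact code from the docs):

from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
# Fit any model, then measure how much the validation score drops when each
# feature is shuffled -- that drop is the feature's permutation importance.
X, y = load_diabetes(return_X_y=True, as_frame=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_train, y_train)
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
for name, mean_imp in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:10s} {mean_imp:.3f}")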


That is all for today!

See you tomorrow :)