Hello :) Today is Day 289!
A quick summary of today:
- read more of the financial data engineering book
- streamed and learned about permutation importance in sklearn
Financial Data Eng Ch. 6 - Overview of the FDE lifecycle
Below are the criteria for building a financial data engineering stack
Criterion 1: open source vs. commercial software
The main advantages of commercial software are:
- vendor accountability
- product integrations
- enterprise-grade features
- user-friendly experience
The disadvantages of commercial software are:
- cost
- bulky products
- vendor lock-in
- lack of customisation
The main advantages of open source are:
- cost
- customisation
- community
- transparency
The main disadvantages of open source are:
- support
- documentation
- complexity
- compatibility
- security
- confidentiality
Criterion 2: ease of use vs. performance
This section discusses the tradeoff between ease of use and performance in software and hardware technology. Complex technologies, like low-level programming languages (e.g. C) or specialized hardware (e.g. GPUs, FPGAs), provide significant performance advantages but are harder to learn, implement, and manage. On the other hand, simpler, user-friendly technologies (e.g. Python, general-purpose CPUs) are easier to use but may offer less optimal performance.
In financial markets, especially in high-frequency trading, low latency is critical, as even millisecond delays can result in large financial losses. Trading firms invest in optimized systems, such as Direct Market Access (DMA) and colocation with exchanges, to reduce latency. FPGAs are often used to gain speed advantages in processing orders, but these setups can raise fairness concerns due to differential access to market data.
Criterion 3: cloud vs. on premises
Criterion 4: public vs. private vs. hybrid cloud
Criterion 5: single vs. multi-cloud
Criterion 6: monolithic vs. modular codebase
Chapter 7 - Data ingestion layer
Data transmission and arrival processes
Data ingestion is a process wherein data gets transmitted from a given source, travels over a network, and arrives at its destination. Understanding the details of such a process is critical for designing a reliable financial data infrastructure that meets different business and technical requirements.
Data transmission protocols
Data transmission protocols enable communication between systems over a network. These protocols form essential layers in data infrastructure, particularly for industries like finance that rely heavily on data transfers.
Protocols are organized in layered models, with the most popular being the seven-layer OSI model and the simpler four-layer TCP/IP model. The TCP/IP model includes:
- Application Layer: contains protocols like HTTP, SMTP, and DNS for web traffic, email, and domain name resolution
- Transport Layer: responsible for reliable data transfer using TCP (connection-oriented) and UDP (faster but less reliable). It also includes TLS for secure transmission
- Network Layer: handles data routing using IP addresses, with IPsec providing encryption
- Network Access Layer: manages data transmission over physical hardware (e.g. Ethernet, Wi-Fi)
Financial communication standards, like EBICS, leverage these protocols for secure and reliable transactions, utilizing TCP/IP and TLS encryption. Understanding these protocols is crucial for designing secure and efficient financial data infrastructures.
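To make the TCP + TLS combination mentioned above concrete, here is a minimal Python sketch of my own (standard library only; the host is just a placeholder, not a real financial endpoint) that opens a TCP connection and wraps it in TLS:

```python
import socket
import ssl

HOST = "example.com"  # placeholder host; a real data feed or API endpoint would go here

# TCP (transport layer) connection, then TLS wrapped around it for encrypted transmission.
context = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as tcp_sock:
    with context.wrap_socket(tcp_sock, server_hostname=HOST) as tls_sock:
        print(tls_sock.version())                  # e.g. "TLSv1.3"
        print(tls_sock.getpeercert()["subject"])   # certificate presented by the server
```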
Data arrival processes (DAP)
Data may arrive at the ingestion layer with different temporal and structural patterns.
- scheduled DAP
- event-driven DAP
- homogeneous DAP (the data has pre-determined consistent properties)
- heterogeneous DAP (data may possess variable attributes such as extension, format, type, content, schema)
- single-item DAP (data is ingested either on a record-at-a-time or file-at-a-time basis)
- bulk DAP
Data ingestion formats
- general-purpose formats (csv, tsv, txt, json, xml, etc)
- big data formats (parquet, avro, orc)
- in-memory formats (apache arrow, RDD by Spark)
- standardised financial formats: the general-purpose and big data formats above can be used, but interoperability issues are inevitable when market participants use different ones. To address this, a number of standardised financial formats have been developed (or at least attempted): FpML, FIX, IFX, MDDL, FEDI, OFX, XBRL, ISO 8583, ISO 15022, ISO 20022, and SWIFT messages
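As a rough sketch of the general-purpose vs. big data formats above (my own toy example with pandas; the Parquet call needs pyarrow installed):

```python
import pandas as pd

# Toy tick data standing in for ingested records.
ticks = pd.DataFrame({
    "symbol": ["A", "A", "B"],
    "price": [10.1, 10.2, 20.5],
    "ts": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-01"]),
})

ticks.to_csv("ticks.csv", index=False)          # general-purpose, row-oriented text
ticks.to_json("ticks.json", orient="records")   # general-purpose, nested-friendly
ticks.to_parquet("ticks.parquet")               # columnar big data format (via pyarrow)
```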
Data ingestion technologies
Below are six of the most common ingestion technologies used in financial markets:
- financial APIs
- financial data feeds
- secure file transfer
- cloud access
- web access
- specialised financial software
Data ingestion best practices
- meet business requirements
- design for change
- enforce data governance (data validation, logging and reporting, lineage and visibility, security)
- perform benchmarking and stress testing
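A tiny sketch of what "data validation, logging and reporting" can look like at ingestion time (my own toy checks with pandas and the standard logging module, not the book's approach):

```python
import logging
import pandas as pd

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def validate_batch(df: pd.DataFrame) -> bool:
    """Run a few governance checks on an incoming batch and log the outcome."""
    issues = []
    if df["price"].isna().any():
        issues.append("missing prices")
    if (df["price"] <= 0).any():
        issues.append("non-positive prices")
    if df.duplicated(subset=["symbol", "ts"]).any():
        issues.append("duplicate (symbol, ts) records")

    if issues:
        log.warning("batch rejected: %s", ", ".join(issues))
        return False
    log.info("batch accepted: %d records", len(df))
    return True
```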
Chapter 8 - Data storage layer
Principles of data storage system (DSS) design
- meet business requirements
- data modelling (conceptual, logical, and physical plan)
- transactional guarantees (ensure data consistency and reliability: ACID/BASE; a minimal sketch follows this list)
- consistency tradeoffs (CAP theorem: consistency, availability, partition tolerance; it involves balancing between ensuring all nodes have the latest data (consistency) and keeping the system operational during network issues (availability))
- scalability
- security
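As a minimal illustration of the transactional guarantee point above (a sketch with Python's built-in sqlite3, not a production setup): either both legs of a transfer are applied, or neither is.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])

try:
    with conn:  # atomic transaction: commits on success, rolls back on any error
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure neither update is visible, keeping the data consistent

print(conn.execute("SELECT * FROM accounts").fetchall())  # [(1, 70.0), (2, 80.0)]
```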
Data storage modelling
- SQL vs. noSQL
- primary (a secure and permanent repository for data storage) vs. secondary (may not maintain the same level of data consistency, but is generally intended for use cases where this discrepancy does not cause significant issues)
- operational (OLTP) vs. analytical (OLAP)
- native vs. non-native (e.g. storing graph data as a graph in Neo4j vs. in Postgres)
- multi-model vs. polyglot persistence (e.g. using Neo4j for graphs, MongoDB for content and posts, and Redis for caching frequently accessed data, vs. using Postgres and its variants to store the different types of data)
Data storage models
The data lake model
A data lake is a centralized repository that stores massive amounts of raw data, allowing flexible storage of both structured and unstructured data. It supports simple data ingestion, data variety, and integration while enabling cost-effective archiving and analysis. However, data lakes require proper governance, including security, privacy, and data cataloging, to prevent them from turning into disorganized “data swamps.” They are widely used in financial institutions for cloud migration, data aggregation, compliance, and handling unstructured data.
The relational model
Many of the most popular and highly trusted DSSs are built on the relational data storage model, which organizes data in rows and columns, forming a table (also called a relation).
By design, relational databases offer the strongest ACID transactional guarantee among all DSSs.
- analytical querying: one of the strongest arguments in favor of using SQL databases is their advanced querying capabilities
- schema enforcement: relational dbs require and enforce a data schema
- data modelling is common with relational dbs
- normalisation is the most popular SQL data modelling technique
The document model
JSON and XML are popular document formats
Document dbs offer:
- schema flexibility
- document modelling
- horizontal scalability
- ACID transactions
- performance
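A quick sketch of the schema flexibility idea: two "documents" in the same collection can carry completely different fields without any upfront schema (plain Python/JSON here as a stand-in for a document store such as MongoDB):

```python
import json

# Two documents with different shapes: no fixed schema is required up front.
docs = [
    {"_id": 1, "type": "stock", "symbol": "A", "exchange": "XYZ"},
    {"_id": 2, "type": "bond", "issuer": "B Corp", "coupon": 3.5,
     "payments": [{"date": "2022-06-30", "amount": 17.5}]},  # nested structure
]

print(json.dumps(docs, indent=2))
```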
The time series model
Time series data, consisting of sequential measurements over time, is widely used in financial markets for analyzing trends, risks, and predicting prices. To efficiently manage this growing data, specialized time series databases (TSDBs) have emerged, offering features like fast querying, time-based aggregations, and data immutability. Native TSDBs like kdb+ and InfluxDB are optimized for high performance, especially in time-critical applications like high-frequency trading, while general-purpose databases like PostgreSQL and MongoDB offer time series extensions to handle such data.
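A small sketch of the kind of time-based aggregation a TSDB is optimised for, approximated here with pandas resampling on synthetic minute-level prices (my own toy data, not kdb+/InfluxDB syntax):

```python
import numpy as np
import pandas as pd

# One trading day of synthetic minute-level prices for a single instrument.
idx = pd.date_range("2022-01-03 09:30", periods=390, freq="min")
prices = pd.Series(100 + np.random.randn(390).cumsum() * 0.05, index=idx)

# Time-based aggregation: 15-minute OHLC bars, a typical TSDB-style query.
bars = prices.resample("15min").ohlc()
print(bars.head())
```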
The message broker model
Message brokers are crucial for event-driven, distributed, and streaming architectures, facilitating the exchange of messages between applications. They decouple data production and consumption, enabling independent scaling of producer and consumer applications based on load. With fault tolerance, message brokers ensure that failures in producers or consumers don’t impact operations. Key elements include topics, which organize messages similarly to tables in SQL, and message schemas, which define message structure to prevent complexity. Proper governance, including topic management and schema validation, ensures smooth communication and scalability in large systems.
Technical implementations should:
- cover business needs
- ensure performance
- have an appropriate delivery mode (at most once, at least once, and exactly once)
- have message persistence
- be scalable
- may offer message prioritisation
- offer message ordering
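For the topic/producer/consumer vocabulary above, here is a heavily simplified sketch assuming a local Kafka broker and the kafka-python client (the topic name and broker address are placeholders):

```python
import json
from kafka import KafkaConsumer, KafkaProducer  # assumes kafka-python is installed

TOPIC = "trades"  # placeholder topic name

# Producer side: publish a message to the topic.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send(TOPIC, {"symbol": "A", "price": 10.1, "qty": 100})
producer.flush()

# Consumer side: read messages from the same topic, decoupled from the producer.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # {'symbol': 'A', 'price': 10.1, 'qty': 100}
    break
```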
The graph model
Graphs can offer:
- centrality analysis
- search and pathfinding
- community detection
- contagion analysis (how a shock impacting one node or a set of nodes propagates throughout the rest of the network)
- link prediction
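A toy sketch of two of these analyses (centrality and pathfinding) on a made-up interbank exposure network, using networkx:

```python
import networkx as nx

# Made-up interbank exposure network: an edge A -> B means A has an exposure to B.
g = nx.DiGraph()
g.add_edges_from([
    ("BankA", "BankB"), ("BankA", "BankC"),
    ("BankB", "BankD"), ("BankC", "BankD"),
    ("BankD", "BankE"),
])

# Centrality analysis: which nodes are the most connected?
print(nx.degree_centrality(g))

# Search and pathfinding: one route a shock at BankA could take to reach BankE.
print(nx.shortest_path(g, "BankA", "BankE"))  # e.g. ['BankA', 'BankB', 'BankD', 'BankE']
```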
Graph data modelling:
The warehouse model
"A data warehouse is a subject-oriented, integrated, nonvolatile, and time-variant collection of data in support of management’s decisions." (Bill Inmon)
"A data warehouse is a copy of transaction data specifically structured for query and analysis." (Ralph Kimball)
Why data warehouses?
- consistent structure for data
- advanced analytics
- scalability
- subject-oriented
- integrated
- nonvolatile
- time-variant
Data modelling with data warehouses
- the subject modeling approach of Bill Inmon, vs.
- the dimensional modeling approach of Ralph Kimball
In the Inmon approach, data is sourced from various operational systems across the organization and integrated into a centralized data warehouse. Subsequently, based on the specific needs of individual departments or business units, dedicated data marts are derived from the data warehouse. In this model, the data warehouse and the data marts are physically separate entities, each with its own storage and characteristics. Data within the data marts is considered consistent and reliable, as it originates from the centralized source of truth—the data warehouse itself. Users from different departments typically employ specialized tools to access and analyze the data within their respective data marts
The Kimball approach takes a different perspective. Instead of isolating the data warehouse from the data marts, Kimball suggested a user requirement-driven approach that defines the data marts in advance and integrates them within the data warehouse itself
Dimensional modelling
Two main types
Examples of data warehouse platforms include Google BigQuery, Amazon Redshift, and Snowflake
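As a rough illustration of Kimball-style dimensional modelling (my own toy example, not the book's): a central fact table of trades surrounded by dimension tables, written here as SQLite DDL from Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes used for slicing and filtering.
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, date TEXT, quarter TEXT);
CREATE TABLE dim_security (security_id INTEGER PRIMARY KEY, symbol TEXT, sector TEXT);
CREATE TABLE dim_client   (client_id INTEGER PRIMARY KEY, name TEXT, country TEXT);

-- Fact table: one row per trade, with measures plus foreign keys to the dimensions.
CREATE TABLE fact_trades (
    trade_id    INTEGER PRIMARY KEY,
    date_id     INTEGER REFERENCES dim_date(date_id),
    security_id INTEGER REFERENCES dim_security(security_id),
    client_id   INTEGER REFERENCES dim_client(client_id),
    quantity    INTEGER,
    gross_value REAL
);
""")
```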
The blockchain model
Blockchain is a data structure used to store data as a chain of linked blocks, offering immutability and tamper resistance. Its primary application is in creating digital ledgers, such as those for cryptocurrencies like Bitcoin and Ethereum. Blockchain ensures data integrity through cryptographic hashes and distributed ledger technology (DLT), where decentralized nodes validate and maintain the ledger. However, blockchain has limitations, such as slow data querying and performance issues with increasing nodes. Blockchain databases, like BigchainDB and Amazon QLDB, address these by combining traditional database performance with blockchain’s immutability and security.
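A minimal sketch of the "chain of linked blocks" idea (pure Python, not a real blockchain): each block stores the hash of the previous one, so tampering with any block breaks every hash after it.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Hash a block's contents (which include the previous block's hash)."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

# Build a tiny chain of transaction blocks.
chain = []
prev_hash = "0" * 64  # genesis placeholder
for tx in [{"from": "A", "to": "B", "amount": 10}, {"from": "B", "to": "C", "amount": 4}]:
    block = {"tx": tx, "prev_hash": prev_hash}
    prev_hash = block_hash(block)
    chain.append(block)

# Verify integrity: recompute each hash and compare with what the next block stored.
for i in range(1, len(chain)):
    assert chain[i]["prev_hash"] == block_hash(chain[i - 1]), "chain tampered with"
print("chain is consistent")
```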
Chapter 9 - Data transformation and delivery layer
Data querying
- time series queries
SELECT time_column, attribute_1, attribute_2
FROM financial_entity_table
WHERE entity_name = 'A'
AND date BETWEEN '2022-01-01' AND '2022-02-01'
- cross-section queries
SELECT entity_name, attribute_1, attribute_2
FROM financial_entity_table
WHERE entity_name in ('A', 'B', 'C', 'D')
AND date = '2022-02-01'
- panel queries
SELECT time_column, entity_name, attribute_1, attribute_2
FROM financial_entity_table
WHERE entity_name in ('A', 'B', 'C', 'D')
AND date BETWEEN '2022-01-01' AND '2022-02-01'
- analytical queries
SELECT stock_symbol, date, MAX(price)
FROM price_table
GROUP BY stock_symbol, date
Recently, cloud data warehouse solutions such as BigQuery and Snowflake have introduced features enabling users to create and interact with machine learning models using advanced analytical SQL queries.
-- Create a linear regression model
CREATE MODEL `mydataset.my_model`
OPTIONS(MODEL_TYPE='LINEAR_REG') AS
SELECT
feature_column1,
feature_column2,
label
FROM
`mydataset.my_training_table_data`;
Calling the model would look like:
SELECT *
FROM
ML.PREDICT(MODEL `mydataset.my_model`, (
SELECT
feature_column1,
feature_column2
FROM
`mydataset.my_test_table_data`
));
Query optimisation
- database-side query optimisations (indexing, clustering, partitioning)
- user-side query optimisations (use indexes; for composite indexes, filter by the leftmost columns; avoid SELECT *; use anchored patterns in LIKE; break large queries into smaller, incremental batches to handle failures efficiently; filter data before joins to improve performance)
Data transformation
Transformation operations
- format conversion
- data cleaning
- data adjustments
- data standardisation
- data filtering
- feature engineering
- advanced analytical computations
Transformation patterns
- batch vs. streaming transformations
- memory-based vs. disk-based transformations
- full vs. incremental (CDC) data transformations
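A tiny sketch of the full vs. incremental distinction (my own toy example with pandas): the incremental run only transforms rows newer than a stored watermark.

```python
import pandas as pd

trades = pd.DataFrame({
    "ts": pd.to_datetime(["2022-01-01", "2022-01-02", "2022-01-03"]),
    "price": [10.0, 10.5, 11.0],
    "qty": [100, 200, 150],
})

def transform(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    out["notional"] = out["price"] * out["qty"]  # the transformation step
    return out

# Full transformation: reprocess everything on each run.
full = transform(trades)

# Incremental transformation: only process rows newer than the last watermark.
watermark = pd.Timestamp("2022-01-02")
incremental = transform(trades[trades["ts"] > watermark])
print(len(full), len(incremental))  # 3 1
```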
Data delivery
Financial data engineers must determine who the ultimate data consumers are and understand their specific needs, and then create the appropriate mechanisms to deliver the data.
Data consumers include any user, team, or system that uses data from a company’s infrastructure. They may range from compliance officers and marketing teams, who specify data needs and rely on data engineers, to analysts and machine learning specialists more involved in data processes. Roles within data governance include data owners (e.g., finance directors managing financial data), data stewards (ensuring data quality), and data custodians (protecting data). Data contracts formalize expectations between producers and consumers, detailing requirements like data type, format, and SLAs.
Data delivery mechanisms vary, including direct database/file access via interfaces, programmatic access through APIs or libraries, reports, dashboards, and email. Effective delivery design requires helping users search and find data, with practices like specifying data locations in contracts and using data catalogs as central search engines.
Chapter 10 - The monitoring layer
Data monitoring is a continuous task requiring close collaboration between financial data engineers and business teams. It enables financial institutions to operate securely, compliantly, and efficiently in a rapidly changing environment. Monitoring is needed for:
- operational continuity and efficiency
- compliance
- risk management
When designing this layer, the first question to ask is which components of the financial data infrastructure need to be monitored.
Metrics, Events, Logs, and Traces
- metrics: quantitative measurements that provide information about a particular aspect of a system, process, variable, or activity; metrics are usually observed over a specific time frame and are often enriched with metadata
- events: structured data emitted by an application during runtime
- logs: semi-structured data emitted by an application during runtime, providing greater flexibility in terms of content, structure, and triggering mechanisms compared to events (log types: debug, info, warning, error, critical)
- traces: represent a more advanced form of system behavior records, capturing comprehensive details about each step taken by a process as it progresses through one or more systems (types: business operation traces, application traces, security incident traces)
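For the log types listed above (debug, info, warning, error, critical), a minimal Python logging sketch with made-up messages:

```python
import logging

logging.basicConfig(level=logging.DEBUG,
                    format="%(asctime)s %(levelname)s %(name)s: %(message)s")
log = logging.getLogger("pricing-service")

log.debug("loaded 1,203 instruments from cache")        # debug: developer-level detail
log.info("daily price ingestion started")               # info: normal operation
log.warning("vendor feed latency above 500 ms")         # warning: something to watch
log.error("failed to parse record, skipping it")        # error: a failed operation
log.critical("no market data received for 5 minutes")   # critical: service at risk
```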
Data quality monitoring
Data quality monitoring in financial institutions focuses on ensuring data consistently meets high standards, covering aspects like errors, duplicates, biases, completeness, and timeliness. The process begins with defining Data Quality Metrics (DQMs), tailored to client expectations, business needs, and technical standards. Common metrics include error, duplicate, and missing value ratios. Data profiling, which analyzes a dataset’s structure and relationships, is a key monitoring method. Tools like Ataccama ONE, Great Expectations, and Talend Data Fabric help automate issue detection. Monitoring is a continuous process, adapting to evolving data types and business requirements.
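A small sketch of computing the DQMs mentioned above (error, duplicate, and missing value ratios) on a toy table with pandas; real setups would lean on tools like Great Expectations instead:

```python
import pandas as pd

df = pd.DataFrame({
    "symbol": ["A", "A", "B", "B", None],
    "price": [10.1, 10.1, -5.0, 20.5, 21.0],  # a negative price counts as an error here
    "ts": pd.to_datetime(["2022-01-01", "2022-01-01", "2022-01-01",
                          "2022-01-01", "2022-01-02"]),
})

dqms = {
    "missing_value_ratio": df.isna().any(axis=1).mean(),
    "duplicate_ratio": df.duplicated().mean(),
    "error_ratio": (df["price"] <= 0).mean(),  # toy business rule: prices must be positive
}
print(dqms)  # {'missing_value_ratio': 0.2, 'duplicate_ratio': 0.2, 'error_ratio': 0.2}
```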
Example of a data profile report:
Performance monitoring
Here are the key metrics for performance monitoring in financial data infrastructure:
- General Metrics:
- RAM usage
- CPU usage
- Storage usage
- Execution time
- Requests per second (RPS)
- Ingress/egress bytes (bytes/sec)
- Uptime/downtime
- API response time
- Concurrent users
- Database-Specific Metrics:
- Read/write operations
- Active connections
- Connection pool usage
- Query count
- Transaction runtime
- Replication lag
- Records read
- Blocks read
- Bytes scanned
- Number of output rows
- Scan time, sort time, aggregation time
- Peak memory usage
- Query plans
- Financial Market-Specific Metrics:
- Trade execution latency
- Trade execution error rate
- Trade settlement time
- Market data latency
- Algorithmic trading performance
- Risk exposure calculation time
- Customer transaction processing time
- Incident Response Metrics:
- Mean Time to Detect (MTTD)
- Mean Time Between Failures (MTBF)
- Mean Time to Recovery (MTTR)
- Mean Time to Failure (MTTF)
- Mean Time to Acknowledge (MTTA)
Cost monitoring
Cost monitoring involves tracking financial data infrastructure expenses to ensure they remain within acceptable limits and to detect patterns of excessive usage. With cloud services’ widespread use, managing costs is critical due to potential misinterpretations of pricing, particularly in serverless computing (like AWS Lambda). Misunderstandings about execution patterns, memory allocation, and data transfer fees can lead to unforeseen costs.
Key practices include setting budgets with alerts, adopting FinOps practices for financial accountability, and applying a FinOps Maturity Model (FMM) to improve cost management through collaboration between engineering, finance, and business teams.
Business and Analytical monitoring
- advanced monitoring: observes not only raw data but also contextual and analytical insights for strategic decision-making
- risk monitoring in financial institutions:
- market risk: losses from market price movements
- credit risk: losses from borrower defaults
- operational risk: losses from internal process failures or external events
- data engineering challenges arise in collecting, aggregating, and normalizing risk data for regulatory compliance (e.g. Basel III)
- ML monitoring:
- concept drift: the relationship between model inputs and the target changes over time, so the model no longer accurately reflects the original problem
- data drift: changes in input data distribution over time
- model risk: financial models may produce inconsistent results due to changing market conditions
- fraud and illegal activity monitoring:
- money laundering, terrorism financing, fraud, corruption: monitoring for illegal transactions
- market manipulation: activities such as pump and dump, spoofing, insider trading, and wash trading
- investment monitoring:
- monitoring key performance indicators like Sharpe, Sortino, and Treynor ratios to evaluate investment portfolio performance
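As a quick refresher on the first of those KPIs, here is a sketch of an annualised Sharpe ratio on synthetic daily returns (my own toy numbers, with a zero risk-free rate for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)
daily_returns = rng.normal(loc=0.0005, scale=0.01, size=252)  # synthetic daily returns

# Sharpe ratio = mean excess return / volatility, annualised with sqrt(252) trading days.
risk_free = 0.0
excess = daily_returns - risk_free
sharpe = excess.mean() / excess.std(ddof=1) * np.sqrt(252)
print(round(sharpe, 2))
```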
Data observability
Observability elevates monitoring to a new level, allowing software and data engineering teams to handle issues proactively and gain deep insights into the internal state and behavior of a given software application or infrastructure
A system is observable if we can:
- understand the inner workings of our application
- understand any system state the application may have gotten itself into, even new ones we have never seen before and couldn’t have predicted
- understand the inner workings and system state solely by observing and interrogating it with external tools
- understand the internal state without shipping any new custom code to handle it (because that implies you needed prior knowledge to explain it)
A data observability system can bring:
- higher data quality
- operational efficiency (reduced TTD and TTR)
- facilitated communication and trust between different team members
- gains in client trust by being able to detect and understand the issues they might face
- enabling complex data ingestion and transformation while maintaining reliability and visibility into the system
- regulatory compliance
Stream
Read two interesting articles(?) or examples (?) from the sklearn docs:
- Permutation Importance vs Random Forest Feature Importance (MDI)
- Permutation Importance with Multicollinear or Correlated Features
Super interesting!
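For reference, the core API behind both examples is sklearn.inspection.permutation_importance; here is a minimal sketch on a toy dataset (my own quick example, not the docs' exact code):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: the drop in score when each feature is shuffled on held-out data.
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1][:5]:
    print(f"{X.columns[i]:<25} "
          f"{result.importances_mean[i]:.3f} +/- {result.importances_std[i]:.3f}")
```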
And also found more for tomorrow. Again from the ‘inspection’ part of the sklearn docs
That is all for today!
See you tomorrow :)