Hello :) Today is Day 293!
A quick summary of today:
- finished ML for Financial Risk management
- read some more sklearn docs for fun
ML for fin. risk management Chapter 7 - Liquidity Modeling
Liquidity risk has long been neglected in financial modeling, despite its crucial role in systemic financial risk, as demonstrated during the 2007–2008 mortgage crisis. Liquidity risk arises from incomplete markets and asymmetric information, leading to issues like moral hazard and adverse selection. Including liquidity in financial models enhances accuracy, as it impacts both asset returns and uncertainty levels. Liquidity is a multifaceted concept, with four key dimensions:
- tightness (transaction cost)
- immediacy (speed of trading)
- depth (presence of buyers and sellers)
- resiliency (ability to recover from price imbalances)
Modeling liquidity is challenging due to its multidimensionality, and different liquidity measures are used to capture these aspects.
Liquidity Measures
Volume-Based Liquidity Measures
Large orders are fulfilled when the market has depth, meaning a deep market can accommodate substantial orders. This provides insight into the market’s health, as a lack of depth leads to order imbalances and discontinuities. Volume-based liquidity measures can help differentiate between liquid and illiquid assets based on market depth. Additionally, these measures are closely related to the bid-ask spread: a wide spread indicates low volume, while a narrow spread suggests higher volume.
To properly represent the depth dimension of liquidity, the following volume-based measures are used:
Liquidity ratio
Measures how much trading volume is needed to cause a 1% change in the price of an asset. A higher ratio indicates greater liquidity, meaning that a large volume of trades can occur with only a small change in price; conversely, if a small volume of trades causes a price change, the asset is considered illiquid. This ratio primarily focuses on the relationship between trading volume and price changes, rather than the time it takes to execute trades or the costs involved in doing so.
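A minimal sketch of how this could be computed with pandas (the exact definition in the book may differ; `price` and `volume` are hypothetical daily series):

```python
import pandas as pd

def liquidity_ratio(price: pd.Series, volume: pd.Series) -> float:
    """Traded value per unit of absolute % price change.
    Higher values -> more volume is needed to move the price -> more liquid."""
    traded_value = (price * volume).sum()
    abs_pct_change = price.pct_change().abs().sum()
    return traded_value / abs_pct_change
```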
Hui-Heubel ratio
The Hui-Heubel ratio is a measure of market depth, specifically designed to assess the liquidity of individual stocks rather than entire portfolios. It compares the percentage change in a stock’s price (based on its maximum and minimum prices over a set period) to the volume of trades relative to the stock’s market capitalization. This ratio helps evaluate how much trading activity is needed to move the stock’s price. One key feature of the Hui-Heubel ratio is its focus on individual stocks, making it versatile for liquidity analysis. Although it can use the bid-ask spread instead of price changes, low volatility in the bid-ask spread may lead to an underestimation of liquidity.
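A hedged sketch over a fixed window, say five trading days (`shares_outstanding` is a hypothetical constant for the period; the book's exact scaling may differ):

```python
import pandas as pd

def hui_heubel_ratio(price: pd.Series, volume: pd.Series, shares_outstanding: float) -> float:
    """Price swing relative to turnover over the window;
    lower values suggest a deeper, more liquid stock."""
    price_swing = (price.max() - price.min()) / price.min()
    turnover = (price * volume).sum() / (shares_outstanding * price.mean())
    return price_swing / turnover
```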
Turnover ratio
The turnover ratio is a commonly used proxy for liquidity, reflecting how frequently a stock is traded. It measures the ratio of trading volume (the number of shares traded) to the number of shares outstanding over a specific period. A high turnover ratio suggests a high level of liquidity, as it indicates frequent trading of the stock. The inclusion of shares outstanding in the calculation provides a more refined measure of liquidity, as it accounts for the stock’s availability in the market. Essentially, the turnover ratio captures both the trading activity and the stock’s market presence.
Transaction Cost–Based Liquidity Measures
These measures focus on the costs involved in executing trades, which include both explicit costs (like taxes and brokerage fees) and implicit costs (such as the bid-ask spread and execution timing). These costs are crucial in determining market tightness (the ability to trade with low costs) and immediacy (the speed of executing trades). High transaction costs discourage trading, leading to a fragmented and shallow market, while low costs promote a more active, centralized market. The bid-ask spread is a commonly used proxy for transaction costs and liquidity, indicating how easily an asset can be converted to cash. Methods like quoted, effective, and realized spreads are used to calculate this spread, though actual trade conditions may sometimes differ from observed spreads.
Percentage quoted and effective bid-ask spreads
Percentage quoted and effective bid-ask spreads are key measures of transaction costs in trading.
- Percentage quoted spread: this measures the cost of completing a trade by calculating the difference between the ask and bid prices of a stock, scaled as a percentage. It shows how much more a buyer must pay (ask price) than what a seller would receive (bid price), indicating the cost of executing a trade.
- Percentage effective spread: this spread measures the difference between the actual trading price and the midpoint of the bid-ask prices (considered the true value of the stock). It accounts for trades that occur inside or outside the quoted prices, offering a more accurate reflection of trading costs. The percentage effective half spread is often used to calculate this spread on a percentage basis.
Both spreads provide insight into trading costs, with the effective spread giving a more realistic measure when trades deviate from quoted prices.
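As a hedged sketch of both definitions (treating the midpoint as the reference "true" price; the function and variable names are my own):

```python
def pct_quoted_spread(bid: float, ask: float) -> float:
    """Percentage quoted spread: the ask-bid gap relative to the midpoint, in percent."""
    mid = (ask + bid) / 2
    return 100 * (ask - bid) / mid


def pct_effective_spread(trade_price: float, bid: float, ask: float) -> float:
    """Percentage effective spread: twice the distance of the actual trade price from the midpoint."""
    mid = (ask + bid) / 2
    return 100 * 2 * abs(trade_price - mid) / mid
```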
Roll’s spread estimate
This estimate is a method to infer the bid-ask spread using only price data. It assumes that price changes caused by the spread create negative autocorrelation, as prices move between the bid and ask. By calculating the covariance of consecutive price changes, Roll’s model estimates the spread. The larger the negative covariance, the wider the spread. While useful, it doesn’t account for information-driven price changes and works best in markets where transaction costs dominate price movements.
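A minimal sketch of Roll's estimator from a daily price series (only meaningful when the serial covariance of price changes is negative, as noted above):

```python
import numpy as np
import pandas as pd

def roll_spread(price: pd.Series) -> float:
    """Roll's implied spread: 2 * sqrt(-cov(dP_t, dP_{t-1}))."""
    dp = price.diff().dropna()
    cov = np.cov(dp.iloc[1:].to_numpy(), dp.iloc[:-1].to_numpy())[0, 1]
    return 2 * np.sqrt(-cov) if cov < 0 else float("nan")
```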
The Corwin-Schultz spread
This is a method for estimating the bid-ask spread using only the daily high and low prices of a stock. The core idea is that the highest price on a given day is typically driven by buyers, while the lowest price is driven by sellers, so the price difference reflects both the stock’s volatility and the bid-ask spread.
The approach works by comparing price ranges over two consecutive days. The sum of the daily price ranges for two days reflects both the spread and the volatility over that period, while the price range for the two-day period captures only the volatility and one bid-ask spread. This method offers an easy and intuitive way to estimate the spread without needing more detailed trade data.
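A hedged sketch of the two-day estimator (this follows the commonly cited Corwin-Schultz formulas; negative estimates are often floored at zero in practice):

```python
import numpy as np

def corwin_schultz_spread(high_1: float, low_1: float, high_2: float, low_2: float) -> float:
    """Spread estimate from the high/low prices of two consecutive days."""
    beta = np.log(high_1 / low_1) ** 2 + np.log(high_2 / low_2) ** 2   # sum of one-day ranges
    gamma = np.log(max(high_1, high_2) / min(low_1, low_2)) ** 2       # two-day range
    denom = 3 - 2 * np.sqrt(2)
    alpha = (np.sqrt(2 * beta) - np.sqrt(beta)) / denom - np.sqrt(gamma / denom)
    return 2 * (np.exp(alpha) - 1) / (1 + np.exp(alpha))
```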
Price Impact–Based Liquidity Measures
Such measures assess how sensitive asset prices are to changes in trading volume and turnover ratios. Resiliency, in this context, refers to the market’s ability to respond to new orders and correct imbalances. If a market effectively adjusts its prices in response to changes in volume or turnover, it is considered resilient. Therefore, a high price adjustment in relation to changes in volume or turnover indicates greater market resiliency.
Amihud illiquidity
The Amihud measure captures the sensitivity of returns to trading volume; more concretely, it tells us how much the absolute return moves per dollar of trading volume. This measure is well known by researchers and practitioners, and it has two advantages over many other liquidity measures. First, it has a simple construction that uses the absolute value of the daily return-to-volume ratio to capture price impact. Second, it has a strong positive relation with expected stock returns.
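A minimal sketch (`returns` and `dollar_volume` are hypothetical daily series):

```python
import pandas as pd

def amihud_illiq(returns: pd.Series, dollar_volume: pd.Series) -> float:
    """Average absolute return per unit of dollar trading volume; larger = less liquid."""
    return (returns.abs() / dollar_volume).mean()
```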
The price impact ratio
The price impact ratio, known as Return-to-Turnover (RtoTR), was developed by Florackis, Gregoriou, and Kostakis to address shortcomings in the Amihud illiquidity ratio. The authors identified two main drawbacks of Amihud’s measure: it is not comparable across stocks with varying market capitalizations, and it does not consider the investor’s holding horizon.
To overcome these limitations, the RtoTR replaces the volume ratio in Amihud’s model with the turnover ratio, which captures trading frequency. This adjustment allows the new measure to better reflect the liquidity of stocks by considering how actively they are traded. The RtoTR retains the other components of Amihud’s illiquidity measure, providing a more comprehensive view of liquidity dynamics.
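The sketch mirrors the Amihud snippet above, only swapping dollar volume for the turnover ratio (again, hypothetical daily series):

```python
import pandas as pd

def price_impact_ratio(returns: pd.Series, turnover: pd.Series) -> float:
    """Return-to-Turnover (RtoTR): average absolute return per unit of turnover."""
    return (returns.abs() / turnover).mean()
```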
Coefficient of elasticity of trading
CET is a liquidity measure proposed to remedy the shortcomings of time-related liquidity measures, such as the number of trades and orders per unit of time, which are adopted to assess the extent to which market immediacy affects the liquidity level.
Market immediacy and CET go hand in hand: CET measures the price elasticity of trading volume, and if trading volume is responsive (i.e., elastic) to price changes, that amounts to a greater level of market immediacy.
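A hedged sketch of the elasticity, taken here as the percentage change in volume per percentage change in price:

```python
import pandas as pd

def coefficient_of_elasticity_of_trading(volume: pd.Series, price: pd.Series) -> pd.Series:
    """% change in trading volume divided by % change in price, per period."""
    return volume.pct_change() / price.pct_change()
```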
Market Impact-Based Liquidity Measures
Price movements of the market as a whole don’t carry the same information as those of individual stocks, so the two need to be separated in order to capture stock-specific price changes accurately.
To achieve this, the Capital Asset Pricing Model (CAPM) is used to separate systematic risk from unsystematic risk. Systematic risk is represented by the slope coefficient in CAPM, while unsystematic risk is linked to individual stocks when market risk is eliminated.
The daily return of a stock is modeled with CAPM, and the idiosyncratic (unsystematic) component is obtained as the regression residual. The squared residuals are then regressed on the daily percentage change in trading volume (plus a residual term), and the resulting coefficient indicates the stock’s liquidity level, i.e., how strongly its idiosyncratic risk responds to changes in volume.
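A hedged two-step sketch with statsmodels (this is my reading of the procedure, not code from the book; the three Series are assumed to share the same daily index):

```python
import pandas as pd
import statsmodels.api as sm

def market_impact_coef(stock_ret: pd.Series, market_ret: pd.Series, volume: pd.Series) -> float:
    """(1) CAPM regression to strip out systematic risk,
    (2) regress squared residuals on the daily % change in trading volume."""
    capm = sm.OLS(stock_ret, sm.add_constant(market_ret)).fit()
    resid_sq = capm.resid ** 2                    # idiosyncratic (unsystematic) component
    dvol = volume.pct_change().fillna(0)
    impact = sm.OLS(resid_sq, sm.add_constant(dvol)).fit()
    return float(impact.params.iloc[-1])          # slope: sensitivity to volume changes
```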
Gaussian Mixture Model
Liquidity in a high-volatility period cannot be modeled in the same way as in a low-volatility period; similarly, depending on the depth of the market, different aspects of liquidity come to the fore. A GMM gives us a tool to model these different regimes.
Long story short, if a market is experiencing a boom period, which coincides with high volatility, volume and transaction cost–based measures would be good choices, and if a market ends up with price discontinuity, price-based measures would be the optimal choice.
After applying GMM, we can use PCA for dimensionality reduction. In the context of GMM-PCA, loading analysis serves as a critical tool for interpreting the underlying structure of the data. GMM-PCA combines the benefits of both GMM, which captures the probabilistic distribution of the data, and PCA, which reduces dimensionality while retaining variance. Through loading analysis, you can identify which liquidity measures, such as turnover ratio, percent quoted bid ask, and others, are most influential in shaping the principal components derived from the GMM-PCA. This understanding is essential for distinguishing between different liquidity states, allowing for more informed decision-making and deeper insights into how various liquidity ratios interact within different market conditions. Ultimately, loading analysis enhances the interpretability of complex financial data, facilitating targeted strategies in liquidity management and risk assessment.
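A hedged sketch of one way to combine the two steps with scikit-learn (the book's exact pipeline may differ; `liq_df` is a hypothetical DataFrame whose columns are the liquidity measures):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def gmm_pca_loadings(liq_df: pd.DataFrame, n_states: int = 3, n_components: int = 2):
    X = StandardScaler().fit_transform(liq_df)
    # GMM assigns each observation to a probabilistic liquidity state
    states = GaussianMixture(n_components=n_states, random_state=0).fit_predict(X)
    # PCA compresses the measures; the loadings show which measures drive each component
    pca = PCA(n_components=n_components).fit(X)
    loadings = pd.DataFrame(pca.components_.T, index=liq_df.columns,
                            columns=[f"PC{i + 1}" for i in range(n_components)])
    return states, loadings
```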
Gaussian Mixture Copula Model
Given the complexity and sophistication of financial markets, it is not possible to suggest one-size-fits-all risk models. Thus, financial institutions develop their own models for credit, liquidity, market, and operational risks so that they can manage the risks they face more efficiently and realistically. However, one of the biggest challenges that these financial institutions come across is the correlation, also known as joint distribution, of the risk.
The task of modeling ‘extreme market phases’ leads us to the concept of a joint distribution, which allows us to model multiple risks with a single distribution. A model that disregards the interaction of risks is destined for failure. In this respect, an intuitive yet simple approach is proposed: copulas.
A copula is a function that maps the marginal distributions of individual risks to a multivariate distribution, resulting in a joint distribution of many standard uniform random variables. If we are working with a known distribution, such as the normal distribution, it is easy to model the joint distribution of two variables as a bivariate normal. The challenge, however, is defining the correlation structure between these variables, and this is the point at which copulas come in.
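To make the idea concrete, here is a hedged sketch of a plain Gaussian copula (not the full Gaussian mixture copula model from the chapter): correlated normals are turned into uniforms, which are then pushed through arbitrary marginals. The value of `rho` and the choice of marginals are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

rho = 0.7
cov = np.array([[1.0, rho], [rho, 1.0]])

# 1) correlated normals, 2) map to uniforms (this is the copula), 3) apply chosen marginals
z = np.random.default_rng(0).multivariate_normal(mean=[0.0, 0.0], cov=cov, size=10_000)
u = stats.norm.cdf(z)                                # uniform margins that keep the dependence
risk_a = stats.t(df=5).ppf(u[:, 0])                  # heavy-tailed marginal for one risk
risk_b = stats.norm(loc=0, scale=0.02).ppf(u[:, 1])  # normal marginal for another risk
```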
Chapter 8 - Modelling Operational Risk
Operational risk is an ambiguous category: it arises from varied sources and can lead to significant financial losses for institutions. Operational risk includes both direct and indirect losses caused by failures in internal processes, people, systems, or external events.
Direct losses can result from legal liabilities, theft, compliance issues, or business interruptions, while indirect losses are linked to opportunity costs due to poor decision-making. Financial institutions often allocate funds to cover such losses, but determining the appropriate amount is challenging—overfunding leads to idle resources, while underfunding creates liquidity risks.
This chapter will focus on fraud risk, one of the most pervasive forms of operational risk. Fraud is intentional deception for gain, and it can be either internal (committed by insiders) or external (by third parties). Several factors increase the likelihood of fraud, including globalization (which complicates operations), poor risk management (which fosters misleading reporting and unauthorized trading), and economic pressures (which tempt individuals and institutions into illegal activities).
Fraud can cause financial losses and damage to an institution’s reputation, as demonstrated by high-profile cases like Barings Bank and Enron. The chapter concludes by introducing a machine learning (ML) model to detect fraud, employing both supervised and unsupervised learning techniques to improve detection and prevention efforts.
(The dataset used here is the dataset I used in the KB fraud transaction project)
Supervised Learning Modeling for Fraud Examination
It shows example models with XGBoost, a random forest classifier, logistic regression, and a decision tree; a rough sketch of that workflow is below.
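This is only a minimal, hedged sketch of the kind of comparison the chapter runs, on synthetic stand-in data rather than the fraud dataset (the model settings are assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, confusion_matrix
from xgboost import XGBClassifier

# imbalanced stand-in data (about 3% positives), in place of the real transactions
X, y = make_classification(n_samples=5000, weights=[0.97], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "dtree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(random_state=0),
    "xgb": XGBClassifier(eval_metric="logloss"),
}

for name, model in models.items():
    pred = model.fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, pred).ravel()
    print(name, round(f1_score(y_test, pred), 3), (tn, fp, fn, tp))
```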
Cost-Based Fraud Examination
In fraud, a false negative is significantly more costly than a false positive. So we can come up with a cost matrix like:
| | Positive Prediction | Negative Prediction |
|---|---|---|
| Actual Positive | 2 | Transaction amount |
| Actual Negative | 2 | 0 |
Using this we can calculate the cost of each model (and this would be called a cost-sensitive classifier since we take cost into account)
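A hedged helper that turns this cost matrix into a single number per model (the admin cost of 2 matches the matrix above; `amount` is the per-transaction amount):

```python
import numpy as np

def total_cost(y_true, y_pred, amount, admin_cost=2.0):
    """Missed fraud (FN) costs the transaction amount; any flagged transaction
    (TP or FP) costs the administrative cost; true negatives cost nothing."""
    y_true, y_pred, amount = map(np.asarray, (y_true, y_pred, amount))
    fn_cost = ((y_true == 1) & (y_pred == 0)) * amount
    flag_cost = (y_pred == 1) * admin_cost
    return float((fn_cost + flag_cost).sum())
```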
Below is a table made from the confusion matrices of the 4 created models.
| Model | Total Cost | F1 Score |
|---|---|---|
| Logistic Regression | 5995 | 0.79 |
| Decision Tree | 5351 | 0.96 |
| Random Forest | 5413 | 0.87 |
| XGBoost | 5371 | 0.97 |
Saving Score
There are different metrics that can be used in cost improvement, and saving score is one of them.
The cost formula is:
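(A hedged reconstruction, following the standard cost-sensitive classification definition; $\hat{y}_i$ is the predicted label and the $C$ terms come from the cost matrix above.)

$$\text{Cost}(f(S)) = \sum_{i=1}^{N} \Big[ y_i \big( \hat{y}_i\, C_{TP_i} + (1 - \hat{y}_i)\, C_{FN_i} \big) + (1 - y_i) \big( \hat{y}_i\, C_{FP_i} + (1 - \hat{y}_i)\, C_{TN_i} \big) \Big]$$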
The saving formula is:
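(Again a hedged reconstruction: the saving score compares the model's cost with the cost of the best trivial classifier, e.g., predicting every transaction as legitimate or every transaction as fraud.)

$$\text{Savings}(f(S)) = \frac{\text{Cost}_{base}(S) - \text{Cost}(f(S))}{\text{Cost}_{base}(S)}$$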
| Model | Saving Score | F1 Score |
|---|---|---|
| Logistic Regression | -0.5602 | 0.0000 |
| Decision Tree | 0.6557 | 0.7383 |
| Random Forest | 0.4789 | 0.7068 |
In this case we can see that the Decision Tree model is the best
Cost-Sensitive Modelling
This type of modelling focuses on how varying costs of misclassification impact fraud detection, particularly through saving scores. Traditional fraud models often overlook the variability in these costs, assuming uniform costs for all classifications, which is not accurate in real-world scenarios
| Model | Saving Score | F1 Score |
|---|---|---|
| Logistic Regression | -0.5906 | 0.0000 |
| Decision Tree | 0.8414 | 0.3281 |
| Random Forest | 0.8913 | 0.4012 |
The best and the worst saving scores are obtained in random forest and logistic regression, respectively. This confirms two important facts: first, it implies that random forest has a low number of inaccurate observations, and second, that those inaccurate observations are less costly. To be precise, modeling fraud with random forest generates a very low number of false negatives, which is the denominator of the saving score formula
Bayesian Minimum Risk
Bayesian decision theory can also be used to model fraud while taking cost sensitivity into account. The Bayesian minimum risk method rests on a decision process that weights the different costs (or losses) by the corresponding probabilities: if a transaction is predicted to be fraud, the overall risk is the expected cost of that decision, and if it is predicted to be legitimate, the overall risk is the expected cost of letting it through. Here, admin is the administrative cost of flagging a transaction and amt is the transaction amount, and a transaction is labeled as fraud whenever the risk of the fraud decision is lower than the risk of the legitimate decision.
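A hedged write-up of those expressions, using the cost matrix from earlier (the notation may differ from the book):

$$R(\text{fraud} \mid x) = admin \cdot P(\text{fraud} \mid x) + admin \cdot P(\text{legit} \mid x) = admin$$

$$R(\text{legit} \mid x) = amt \cdot P(\text{fraud} \mid x) + 0 \cdot P(\text{legit} \mid x)$$

The transaction is labeled as fraud if $R(\text{fraud} \mid x) \le R(\text{legit} \mid x)$, i.e., if $P(\text{fraud} \mid x) \ge admin / amt$.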
This is the results table when the Bayesian Minimum Risk model is applied:
| Model | Saving Score | F1 Score |
|---|---|---|
| Logistic Regression | 0.8064 | 0.1709 |
| Decision Tree | 0.7343 | 0.6381 |
| Random Forest | 0.9624 | 0.4367 |
Here is a plot comparing the base model, cost-sensitive method, and Bayesian min. risk method:
Unsupervised Learning Modeling for Fraud Examination
The most prominent advantage of this method over the supervised model is that there is no need to apply a sampling procedure to fix the imbalanced-data problem. Unsupervised models, by their nature, do not require any prior knowledge about the data
Self-Organizing Map (SOM)
A Self-Organizing Map (SOM) is an unsupervised learning model that excels at clustering and visualizing high-dimensional data. Unlike supervised models, SOMs do not require labeled data, making them particularly useful for fraud examination where labeled examples may be scarce. SOMs project data onto a lower-dimensional grid while preserving the topological relationships, allowing for easier visualization of patterns. This ability to cluster similar data points helps identify anomalies that may indicate fraud. Additionally, SOMs can adapt to new data without retraining, making them effective for continuous monitoring of transactions. SOMs are valuable for detecting hidden patterns in transaction data without the challenges of imbalanced datasets.
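A minimal sketch with the minisom package (grid size, sigma, and learning rate are assumptions, and `X` is stand-in data in place of scaled transaction features):

```python
import numpy as np
from minisom import MiniSom

X = np.random.default_rng(0).random((1000, 8))   # stand-in for scaled transaction features

som = MiniSom(x=10, y=10, input_len=X.shape[1], sigma=1.0, learning_rate=0.5)
som.random_weights_init(X)
som.train_random(X, num_iteration=5000)

# transactions that map to sparsely populated or distant cells are anomaly candidates
winning_cells = np.array([som.winner(row) for row in X])
```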
Autoencoders
Autoencoders are unsupervised deep learning models designed to transform inputs into outputs through a hidden layer, comprising two main components: an encoder and a decoder. The encoder extracts features from the input data, while the decoder reconstructs the original input. The primary goal of an autoencoder is not merely to copy the input but to learn a lower-dimensional representation that captures the essential characteristics of the data.
Types of Autoencoders:
- undercomplete: this basic type has a hidden layer with fewer neurons than the input data, aiming to capture latent attributes by minimizing the reconstruction loss. They face a bias-variance trade-off, balancing accurate reconstruction with a low-dimensional representation
- sparse: these address the bias-variance trade-off by imposing sparsity on the reconstruction error. Regularization techniques, such as L_1 regularization or Kullback-Leibler (KL) divergence, are used to encourage the model to represent the data in a sparse manner, effectively focusing on the most relevant features
- denoising: instead of minimizing the standard reconstruction error, denoising autoencoders add noise to the input data and learn to reconstruct the original data from the corrupted input. This approach helps the model generalize better by focusing on the underlying structure of the data, making it more robust to noise
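Here is a hedged Keras sketch of the undercomplete type described above, trained on stand-in data (layer sizes, epochs, and the reconstruction-error threshold are assumptions):

```python
import numpy as np
from tensorflow.keras import layers, Model

X = np.random.rand(1000, 30).astype("float32")   # stand-in for scaled transaction features

inputs = layers.Input(shape=(X.shape[1],))
encoded = layers.Dense(8, activation="relu")(inputs)             # bottleneck (undercomplete)
decoded = layers.Dense(X.shape[1], activation="sigmoid")(encoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=10, batch_size=64, verbose=0)       # learn to reconstruct the input

# large reconstruction errors mark candidate anomalies (potential fraud)
recon_error = np.mean((autoencoder.predict(X, verbose=0) - X) ** 2, axis=1)
flags = recon_error > np.quantile(recon_error, 0.99)
```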
Chapter 9 - A Corporate Governance Risk Measure: Stock Price Crash
The link between corporate governance and risk measurement has been established via stock price crash risk, which refers to the risk of a large negative firm-specific stock return
Detecting the determinants of stock price crashes is crucial for identifying the root causes of corporate governance issues, enabling companies to address problematic managerial areas and enhance performance and reputation. Improved governance reduces the risk of stock price crashes and can increase total revenue. Stock price crashes signal the strengths and weaknesses of corporate governance, defined by fairness, transparency, and accountability.
Fairness ensures equal treatment of shareholders, transparency involves timely communication about company events, and accountability relates to establishing a code of conduct that presents an accurate company position to shareholders. Agency costs arise from conflicts of interest between shareholders and management, where managers may prioritize their own power and wealth over shareholder value.
When managers withhold bad news, stock price crashes can occur once accumulated negative information is disclosed, leading to sudden declines in stock prices, known as crash risk. Opaque financial reporting exacerbates this issue by preventing timely corrective actions. Overall, a strong link exists between corporate governance and stock price crashes, highlighting the importance of effective governance practices. The chapter will first explore stock price measures and then discuss how these measures can be applied to detect crashes.
Stock Price Crash Measures
Key measures discussed include Down-to-Up Volatility (DUVOL), Negative Coefficient of Skewness (NCSKEW), and the CRASH measure
- DUVOL: this calculates crash risk based on the standard deviation of firm-specific weekly returns, distinguishing between “down” weeks (returns below the annual mean) and “up” weeks (returns above the annual mean). Higher DUVOL values indicate a greater risk of a crash
- NCSKEW: this is derived from the negative third moment of daily returns, normalized by the standard deviation raised to the third power. Similar to DUVOL, higher NCSKEW values suggest an increased crash risk
- CRASH: this measure is based on the distance of firm-specific weekly returns from the mean, assigning a value of 1 if returns fall more than 3.09 (or 3.2) standard deviations below the mean, indicating a potential crash
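A hedged sketch of how the first two measures could be computed from a year of firm-specific weekly returns (these follow the commonly used definitions, which may differ in detail from the book's):

```python
import numpy as np

def ncskew(w: np.ndarray) -> float:
    """Negative coefficient of skewness of firm-specific weekly returns."""
    n = len(w)
    num = -(n * (n - 1) ** 1.5) * np.sum(w ** 3)
    den = (n - 1) * (n - 2) * np.sum(w ** 2) ** 1.5
    return num / den

def duvol(w: np.ndarray) -> float:
    """Down-to-up volatility: log ratio of 'down-week' to 'up-week' return variation."""
    down, up = w[w < w.mean()], w[w >= w.mean()]
    num = (len(up) - 1) * np.sum(down ** 2)
    den = (len(down) - 1) * np.sum(up ** 2)
    return np.log(num / den)
```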
These measures provide different perspectives on crash risk, paving the way for the introduction of machine learning-based approaches to enhance stock price crash detection
Minimum Covariance Determinant
MCD is a method proposed to detect anomalies in elliptically symmetric, unimodal datasets. Anomalies in stock returns are detected using the MCD estimator, and the resulting flag becomes the dependent variable in the logistic panel regression by which we explore the root causes of crash risk.
The MCD estimator provides a robust and consistent method in detecting outliers. This is important because outliers may have a huge effect on the multivariate analysis.
The strength of the MCD comes in the form of its explainability, adjustability, low computational time requirement, and robustness:
- explainability is the extent to which the algorithm behind a model can be explained. MCD assumes the data to be elliptically distributed and the outliers are computed by the Mahalanobis distance metric
- adjustability stresses the importance of having a data-dependent model that is allowed to calibrate itself on a consistent basis so that structural change can be captured
- low computational time refers to the fast calculation of the covariance matrix: instead of using the entire sample, MCD uses a half sample that contains no outliers, so that outlying observations do not skew the MCD location or shape estimates
- robustness: using a half sample in MCD also ensures robustness, because it implies that the model is consistent under contamination
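A hedged scikit-learn sketch of MCD-based anomaly flagging (the quantile threshold and the stand-in data are assumptions; `EllipticEnvelope` would be a closely related alternative):

```python
import numpy as np
from sklearn.covariance import MinCovDet

X = np.random.default_rng(0).normal(size=(500, 3))   # stand-in for firm-specific returns/features

mcd = MinCovDet(random_state=0).fit(X)               # fitted on the "cleanest" half of the sample
dist = mcd.mahalanobis(X)                            # robust Mahalanobis distances

threshold = np.quantile(dist, 0.975)
crash_flag = (dist > threshold).astype(int)          # becomes the dependent variable downstream
```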
Variables used in analysis
Independent variables used in the analysis of stock price crash
| Variable | Explanation |
|---|---|
| Size (log_size) | Logarithm of total assets owned by the company. |
| Receivables (rect) | Accounts receivable/debtors. |
| Property, plant, and equipment (ppegt) | Total property, plant, and equipment. |
| Average turnover (dturn) | The average monthly turnover ratio in year t minus the average monthly turnover ratio in year t - 1. The turnover ratio is the monthly trading volume divided by the total number of shares outstanding. |
| NCSKEW (ncskew) | The negative coefficient of skewness of firm-specific weekly returns in a year, calculated as the negative of the third moment of firm-specific weekly returns divided by the cubed standard deviation. |
| Firm-specific return (residuals) | The average of firm-specific weekly returns in a year. |
| Return on asset (RoA) | The returns on assets in a year, defined as the ratio of net income to total assets. |
| Standard deviation (annual_std) | The standard deviation of firm-specific weekly returns in a year. |
Variables used for firm-specific sentiment
| Variable | Explanation |
|---|---|
| Price-to-earnings ratio (p/e) | Market value per share divided by earnings per share. |
| Turnover rate (turnover_rate) | Total number of shares traded divided by the average number of shares outstanding. |
| Equity share (equity_share) | Represents common stock. |
| Closed-end fund discount (cefd) | Refers to assets that raise a fixed amount of capital through an initial public offering. |
| Leverage (leverage) | The ratio of the sum of long-term debt and current liabilities to total assets. |
| Buying and selling volume (bsi) | Buying volume is the number of shares associated with buying trades, and selling volume is for selling trades. |
Reading some more scikit-learn docs on my own (no stream)
- Cross decomposition
- Compare cross decomposition methods
- PCR vs PLS
- Gaussian processes
- Elliptic Envelope
There are so many little gems in the sklearn docs.
I have some other docs that I will continue reading on tomorrow’s stream.
That is all for today!
See you tomorrow :)