Hello :) Today is Day 292!
A quick summary of today:
- read a few more chapters of the Financial Risk Management book
- read a longer post on linear models and coefficient estimation from sklearn’s docs
ML for Financial Risk Management Chapter 4 - Machine Learning-Based Volatility Prediction
Volatility is the backbone of finance: it not only provides an information signal to investors, but it is also an input to various financial models. What makes volatility so important? The answer lies in uncertainty, which is the defining characteristic of financial models.
Volatility used as a proxy of risk is among the most important variables in many fields, including asset pricing and risk management.
Modeling volatility amounts to modeling uncertainty so that we better understand and approach it, enabling us to have good enough approximations of the real world. To gauge the extent to which a proposed model accounts for the real-world situation, we need to calculate the return volatility, also known as realized volatility. Realized volatility is the square root of the realized variance, which is the sum of squared returns. Realized volatility is used to assess the performance of a volatility prediction method. Here is the formula for return volatility:

$$\hat{\sigma} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(r_i - \mu)^2}$$

where $r_i$ is the return, $\mu$ is the mean return, and $n$ is the number of observations.
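Since realized volatility is the benchmark every model below is evaluated against, here is a minimal sketch of the proxy the chapter uses later on (assuming `ret` is a pandas Series of daily S&P 500 returns; the 5-day window matches the code further down):

```python
import numpy as np
import pandas as pd

# Minimal sketch: a 5-day rolling standard deviation of returns serves as the
# realized-volatility series that the volatility models are compared against.
# `ret` is assumed to be a pandas Series of daily returns.
realized_vol = ret.rolling(5).std()
```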
ARCH model
One of the early attempts to model volatility was proposed by Engle (1982) and is known as the ARCH model. The ARCH model is a univariate model based on historical asset returns. The ARCH(p) model has the following form:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k r_{t-k}^2$$

where the mean model is:

$$r_t = \sigma_t \epsilon_t$$

where $\epsilon_t$ is assumed to be normally distributed. In this parametric model, we need to satisfy some assumptions to have strictly positive variance. In this respect, the following conditions should hold:
- $\omega > 0$, and
- $\alpha_k \geq 0$
All of these equations tell us that ARCH is a univariate and nonlinear model in which volatility is estimated with the square of past returns. One of the most distinctive features of ARCH is that it has the property of time-varying conditional variance, so that ARCH is able to model the phenomenon known as volatility clustering; that is, large changes tend to be followed by large changes of either sign, and small changes tend to be followed by small changes. Hence, once an important announcement is made to the market, it might result in huge volatility.
Here is S&P500’s volatility clustering:
Despite its appealing features, such as simplicity, nonlinearity, and ease of adjustment for forecasting, the ARCH model has certain drawbacks:
- equal response to positive and negative shocks
- strong assumptions such as restrictions on parameters
- possible misprediction due to slow adjustments to large movements
To run ARCH, we can use the arch library:
from arch import arch_model

arch = arch_model(ret, mean='zero', vol='ARCH', p=1).fit(disp='off')  # ret is the S&P 500 return series
print(arch.summary())
Zero Mean - ARCH Model Results
==============================================================================
Dep. Variable: Adj Close R-squared: 0.000
Mean Model: Zero Mean Adj. R-squared: 0.000
Vol Model: ARCH Log-Likelihood: -4061.27
Distribution: Normal AIC: 8126.54
Method: Maximum Likelihood BIC: 8138.50
No. Observations: 2913
Date: Sat, Oct 19 2024 Df Residuals: 2913
Time: 05:28:00 Df Model: 0
Volatility Model
========================================================================
coef std err t P>|t| 95.0% Conf. Int.
------------------------------------------------------------------------
omega 0.7014 5.002e-02 14.023 1.129e-44 [ 0.603, 0.799]
alpha[1] 0.3909 7.011e-02 5.575 2.474e-08 [ 0.253, 0.528]
========================================================================
Covariance estimator: robust
To find a better model, we can compare a few specifications with different lags. For that we will use the Bayesian Information Criterion (BIC) as the model selection method to choose the lag order. The reason BIC is used is that, as long as we have a large enough sample, it is a reliable tool for model selection.
bic_arch = []
for p in range(1, 5):
    arch = arch_model(ret, mean='zero', vol='ARCH', p=p)\
        .fit(disp='off')
    bic_arch.append(arch.bic)
    if arch.bic == np.min(bic_arch):
        best_param = p
arch = arch_model(ret, mean='zero', vol='ARCH', p=best_param)\
    .fit(disp='off')
print(arch.summary())
forecast = arch.forecast(start=split_date[0])
forecast_arch = forecast
Result:
Zero Mean - ARCH Model Results
==============================================================================
Dep. Variable: Adj Close R-squared: 0.000
Mean Model: Zero Mean Adj. R-squared: 0.000
Vol Model: ARCH Log-Likelihood: -3709.51
Distribution: Normal AIC: 7429.03
Method: Maximum Likelihood BIC: 7458.91
No. Observations: 2913
Date: Sat, Oct 19 2024 Df Residuals: 2913
Time: 05:28:08 Df Model: 0
Volatility Model
==========================================================================
coef std err t P>|t| 95.0% Conf. Int.
--------------------------------------------------------------------------
omega 0.2794 2.580e-02 10.829 2.498e-27 [ 0.229, 0.330]
alpha[1] 0.1519 3.458e-02 4.394 1.111e-05 [8.417e-02, 0.220]
alpha[2] 0.2329 3.614e-02 6.444 1.164e-10 [ 0.162, 0.304]
alpha[3] 0.1917 3.702e-02 5.178 2.241e-07 [ 0.119, 0.264]
alpha[4] 0.1923 4.149e-02 4.634 3.586e-06 [ 0.111, 0.274]
==========================================================================
Covariance estimator: robust
Get the RMSE:
from sklearn.metrics import mean_squared_error as mse  # `mse` used below

rmse_arch = np.sqrt(mse(realized_vol[-n:] / 100,
                        np.sqrt(forecast_arch
                                .variance.iloc[-len(split_date):] / 100)))
print('The RMSE value of ARCH model is {:.4f}'.format(rmse_arch))
# The RMSE value of ARCH model is 0.0896
Then plot the realized volatility along with the predictions:
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(forecast_arch.variance.iloc[-len(split_date):] / 100,
label='Volatility Prediction-ARCH')
plt.title('Volatility Prediction with ARCH', fontsize=12)
plt.legend()
plt.show()
GARCH Model
To tackle the drawbacks of the ARCH model, researchers developed GARCH.
The GARCH model is an extension of the ARCH model that incorporates lagged conditional variance. In other words, ARCH is improved by adding lagged conditional variances, which turns it into an autoregressive moving average model for the conditional variance, with p lagged squared returns and q lagged conditional variances. GARCH(p, q) can be formulated as:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p} \alpha_k r_{t-k}^2 + \sum_{k=1}^{q} \beta_k \sigma_{t-k}^2$$

where $\omega$, $\beta$, and $\alpha$ are parameters to be estimated and p and q are the maximum lags in the model. To have a consistent GARCH process, the following conditions should hold:
- $\omega > 0$
- $\beta \geq 0$
- $\alpha \geq 0$
- $\beta + \alpha < 1$
The ARCH model is unable to capture the influence of historical innovations. However, as a more parsimonious model, the GARCH model can account for the change in historical innovations because it can be expressed as an infinite-order ARCH.
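To make the recursion concrete, here is a minimal, illustrative sketch of the GARCH(1,1) conditional variance computed by hand (the parameter values are placeholders roughly in line with the fitted output below; the arch library does this estimation for us):

```python
import numpy as np

def garch_11_variance(returns, omega=0.04, alpha=0.17, beta=0.79):
    """Recursively compute sigma_t^2 = omega + alpha * r_{t-1}^2 + beta * sigma_{t-1}^2.
    Parameter values here are illustrative placeholders, not estimates."""
    r = np.asarray(returns, dtype=float)
    sigma2 = np.zeros_like(r)
    sigma2[0] = np.var(r)  # initialize with the unconditional variance
    for t in range(1, len(r)):
        sigma2[t] = omega + alpha * r[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2
```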
To run GARCH we can again use the arch library:
garch = arch_model(ret, mean='zero', vol='GARCH', p=1, o=0, q=1)\
.fit(disp='off')
print(garch.summary())
Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable: Adj Close R-squared: 0.000
Mean Model: Zero Mean Adj. R-squared: 0.000
Vol Model: GARCH Log-Likelihood: -3654.98
Distribution: Normal AIC: 7315.97
Method: Maximum Likelihood BIC: 7333.90
No. Observations: 2913
Date: Sat, Oct 19 2024 Df Residuals: 2913
Time: 05:39:36 Df Model: 0
Volatility Model
============================================================================
coef std err t P>|t| 95.0% Conf. Int.
----------------------------------------------------------------------------
omega 0.0393 8.429e-03 4.664 3.100e-06 [2.279e-02,5.583e-02]
alpha[1] 0.1745 2.278e-02 7.661 1.845e-14 [ 0.130, 0.219]
beta[1] 0.7892 2.273e-02 34.729 2.853e-264 [ 0.745, 0.834]
============================================================================
Covariance estimator: robust
We can again search for more optimal parameters:
bic_garch = []
for p in range(1, 5):
    for q in range(1, 5):
        garch = arch_model(ret, mean='zero', vol='GARCH', p=p, o=0, q=q)\
            .fit(disp='off')
        bic_garch.append(garch.bic)
        if garch.bic == np.min(bic_garch):
            best_param = p, q
garch = arch_model(ret, mean='zero', vol='GARCH',
                   p=best_param[0], o=0, q=best_param[1])\
    .fit(disp='off')
print(garch.summary())
forecast = garch.forecast(start=split_date[0])
forecast_garch = forecast
Zero Mean - GARCH Model Results
==============================================================================
Dep. Variable: Adj Close R-squared: 0.000
Mean Model: Zero Mean Adj. R-squared: 0.000
Vol Model: GARCH Log-Likelihood: -3654.98
Distribution: Normal AIC: 7315.97
Method: Maximum Likelihood BIC: 7333.90
No. Observations: 2913
Date: Sat, Oct 19 2024 Df Residuals: 2913
Time: 05:41:32 Df Model: 0
Volatility Model
============================================================================
coef std err t P>|t| 95.0% Conf. Int.
----------------------------------------------------------------------------
omega 0.0393 8.429e-03 4.664 3.100e-06 [2.279e-02,5.583e-02]
alpha[1] 0.1745 2.278e-02 7.661 1.845e-14 [ 0.130, 0.219]
beta[1] 0.7892 2.273e-02 34.729 2.853e-264 [ 0.745, 0.834]
============================================================================
Covariance estimator: robust
rmse_garch = np.sqrt(mse(realized_vol[-n:] / 100,
                         np.sqrt(forecast_garch
                                 .variance.iloc[-len(split_date):] / 100)))
print('The RMSE value of GARCH model is {:.4f}'.format(rmse_garch))
# 0.0878
plt.figure(figsize=(10,6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(forecast_garch.variance.iloc[-len(split_date):] / 100,
label='Volatility Prediction-GARCH')
plt.title('Volatility Prediction with GARCH', fontsize=12)
plt.legend()
plt.show()
The volatility of returns is well fitted by the GARCH model, partly because it captures volatility clustering and partly because GARCH does not assume that the returns are independent, which allows it to account for the leptokurtic (fat-tailed) property of returns. However, despite these useful properties and its intuitiveness, GARCH is not able to model the asymmetric response to shocks.
To remedy this issue, the GJR-GARCH model was proposed by Glosten, Jagannathan, and Runkle (1993).
GJR-GARCH
The GJR-GARCH model performs well in modeling the asymmetric effects of announcements, in the sense that bad news has a larger impact than good news. In other words, in the presence of asymmetry, the distribution of losses has a fatter tail than the distribution of gains.
The equation of this model includes an additional parameter, $\gamma$, which controls for the asymmetry of the announcements:

$$\sigma_t^2 = \omega + \sum_{k=1}^{p}\left(\alpha_k + \gamma_k \mathbb{1}(r_{t-k} < 0)\right) r_{t-k}^2 + \sum_{k=1}^{q} \beta_k \sigma_{t-k}^2$$

where $\mathbb{1}(\cdot)$ is an indicator that equals 1 for negative past returns, and:
- $\gamma = 0$: the response to a past shock is the same regardless of its sign
- $\gamma > 0$: the response to a past negative shock is stronger than to a positive one
- $\gamma < 0$: the response to a past positive shock is stronger than to a negative one
Here is how to run it:
bic_gjr_garch = []
for p in range(1, 5):
for q in range(1, 5):
gjrgarch = arch_model(ret, mean='zero', p=p, o=1, q=q)\
.fit(disp='off')
bic_gjr_garch.append(gjrgarch.bic)
if gjrgarch.bic == np.min(bic_gjr_garch):
best_param = p, q
gjrgarch = arch_model(ret,mean='zero', p=best_param[0], o=1,
q=best_param[1]).fit(disp='off')
print(gjrgarch.summary())
forecast = gjrgarch.forecast(start=split_date[0])
forecast_gjrgarch = forecast
Model summary:
Zero Mean - GJR-GARCH Model Results
==============================================================================
Dep. Variable: Adj Close R-squared: 0.000
Mean Model: Zero Mean Adj. R-squared: 0.000
Vol Model: GJR-GARCH Log-Likelihood: -3591.15
Distribution: Normal AIC: 7190.30
Method: Maximum Likelihood BIC: 7214.21
No. Observations: 2913
Date: Sat, Oct 19 2024 Df Residuals: 2913
Time: 05:47:22 Df Model: 0
Volatility Model
=============================================================================
coef std err t P>|t| 95.0% Conf. Int.
-----------------------------------------------------------------------------
omega 0.0430 7.754e-03 5.547 2.908e-08 [2.781e-02,5.821e-02]
alpha[1] 0.0390 3.069e-02 1.270 0.204 [-2.116e-02,9.913e-02]
gamma[1] 0.2798 4.799e-02 5.831 5.501e-09 [ 0.186, 0.374]
beta[1] 0.7907 2.700e-02 29.286 1.572e-188 [ 0.738, 0.844]
=============================================================================
Covariance estimator: robust
The RMSE value of the GJR-GARCH model is 0.0882.
Volatility prediction plot:
EGARCH
Together with the GJR-GARCH model, the EGARCH model, proposed by Nelson (1991), is another tool for controlling for the effect of asymmetric announcements.
The main difference in the EGARCH equation is that the logarithm of the variance is taken on the left-hand side of the equation. This captures the leverage effect, meaning that there is a negative correlation between past asset returns and volatility.
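My notes skip the EGARCH code, but it follows the same fitting pattern as the models above; here is a minimal sketch using the arch library (the p, o, q values are illustrative rather than BIC-selected, and the variables reuse those defined earlier):

```python
# Sketch: EGARCH fit and one-step forecasts, mirroring the ARCH/GARCH code above.
egarch = arch_model(ret, mean='zero', vol='EGARCH', p=1, o=1, q=1)\
    .fit(disp='off')
print(egarch.summary())
forecast_egarch = egarch.forecast(start=split_date[0])
rmse_egarch = np.sqrt(mse(realized_vol[-n:] / 100,
                          np.sqrt(forecast_egarch
                                  .variance.iloc[-len(split_date):] / 100)))
print('The RMSE value of EGARCH model is {:.4f}'.format(rmse_egarch))
```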
Here is a comparison table for the above 4 models:
| Model | RMSE |
|---|---|
| ARCH | 0.0896 |
| GARCH | 0.0878 |
| GJR-GARCH | 0.0882 |
| EGARCH | 0.0904 |
The models above are all classical volatility models; next come ML-based approaches.
Support Vector Regression: GARCH
Support Vector Machine (SVM) is a supervised learning algorithm used for both classification and regression tasks. The primary goal of SVM is to identify the optimal hyperplane that separates two classes by maximizing the margin between the closest points (support vectors) of each class.
In the context of regression, the aim is to find the hyperplane that minimizes error while maximizing margin, leading to a method known as Support Vector Regression (SVR). When applied to the GARCH model, this combination results in the SVR-GARCH model, leveraging the strengths of both approaches for improved predictions.
Three types of kernel functions are commonly used with SVMs: polynomial, Gaussian, and exponential.
The following code shows the preparations before running SVR-GARCH in Python. The most crucial step here is to obtain the independent variables, which are the realized volatility and the squared historical returns:
from sklearn.svm import SVR
from scipy.stats import uniform as sp_rand
from sklearn.model_selection import RandomizedSearchCV
realized_vol = ret.rolling(5).std()
realized_vol = pd.DataFrame(realized_vol)
realized_vol.reset_index(drop=True, inplace=True)
returns_svm = ret ** 2
returns_svm = returns_svm.reset_index()
del returns_svm['Date']
X = pd.concat([realized_vol, returns_svm], axis=1, ignore_index=True)
X = X[4:].copy()
X = X.reset_index()
X.drop('index', axis=1, inplace=True)
realized_vol = realized_vol.dropna().reset_index()
realized_vol.drop('index', axis=1, inplace=True)
# create for each kind of kernel
svr_poly = SVR(kernel='poly', degree=2)
svr_lin = SVR(kernel='linear')
svr_rbf = SVR(kernel='rbf')
SVR-GARCH-Linear
para_grid = {'gamma': sp_rand(),
'C': sp_rand(),
'epsilon': sp_rand()}
clf = RandomizedSearchCV(svr_lin, para_grid)
clf.fit(X.iloc[:-n].values,
realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
predict_svr_lin = clf.predict(X.iloc[-n:])
predict_svr_lin = pd.DataFrame(predict_svr_lin)
predict_svr_lin.index = ret.iloc[-n:].index
rmse_svr = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
predict_svr_lin / 100))
print('The RMSE value of SVR with Linear Kernel is {:.6f}'.format(rmse_svr))
# The RMSE value of SVR with Linear Kernel is 0.000505
realized_vol.index = ret.iloc[4:].index
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(predict_svr_lin / 100, label='Volatility Prediction-SVR-GARCH')
plt.title('Volatility Prediction with SVR-GARCH (Linear)', fontsize=12)
plt.legend()
plt.show()
The linear kernel works quite well (when the relationship between the features and the target is approximately linear).
Next are the radial basis function (RBF) and polynomial kernels. The former uses elliptical curves around the observations, and the latter, unlike the first two, focuses on the combinations of samples.
SVR-GARCH RBF
The RMSE value of SVR with RBF Kernel is 0.000455
SVR-GARCH Polynomial
The RMSE value of SVR with Polynomial Kernel is 0.002386
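My notes skip the code for these two kernels, but it follows the same pattern as the linear-kernel block above; here is a sketch reusing `svr_rbf`, `svr_poly`, `para_grid`, `X`, `realized_vol`, and `n` defined earlier:

```python
# Sketch: same RandomizedSearchCV pipeline as the linear kernel, swapping in
# the RBF and polynomial SVRs created earlier.
for name, svr in [('RBF', svr_rbf), ('Polynomial', svr_poly)]:
    clf = RandomizedSearchCV(svr, para_grid)
    clf.fit(X.iloc[:-n].values,
            realized_vol.iloc[1:-(n-1)].values.reshape(-1,))
    predictions = pd.DataFrame(clf.predict(X.iloc[-n:]),
                               index=ret.iloc[-n:].index)
    rmse = np.sqrt(mse(realized_vol.iloc[-n:] / 100, predictions / 100))
    print('The RMSE value of SVR with {} Kernel is {:.6f}'.format(name, rmse))
```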
Neural Networks
We can use sklearn’s MLPRegressor
from sklearn.neural_network import MLPRegressor
NN_vol = MLPRegressor(learning_rate_init=0.001, random_state=1)
para_grid_NN = {'hidden_layer_sizes': [(100, 50), (50, 50), (10, 100)],
'max_iter': [500, 1000],
'alpha': [0.00005, 0.0005 ]}
clf = RandomizedSearchCV(NN_vol, para_grid_NN)
clf.fit(X.iloc[:-n].values,
realized_vol.iloc[1:-(n-1)].values.reshape(-1, ))
NN_predictions = clf.predict(X.iloc[-n:])
NN_predictions = pd.DataFrame(NN_predictions)
NN_predictions.index = ret.iloc[-n:].index
rmse_NN = np.sqrt(mse(realized_vol.iloc[-n:] / 100,
NN_predictions / 100))
print('The RMSE value of NN is {:.6f}'.format(rmse_NN))
# The RMSE value of NN is 0.000647
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100, label='Realized Volatility')
plt.plot(NN_predictions / 100, label='Volatility Prediction-NN')
plt.title('Volatility Prediction with Neural Network', fontsize=12)
plt.legend()
plt.show()
Result:
Despite its reasonable performance, we can play with the number of hidden layers and neurons to generate a deeper model.
We can use TF:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
model = keras.Sequential(
[layers.Dense(256, activation="relu"),
layers.Dense(128, activation="relu"),
layers.Dense(1, activation="linear"),])
model.compile(loss='mse', optimizer='rmsprop')
epochs_trial = np.arange(100, 400, 4)
batch_trial = np.arange(100, 400, 4)
DL_pred = []
DL_RMSE = []
for i, j, k in zip(range(4), epochs_trial, batch_trial):
    model.fit(X.iloc[:-n].values,
              realized_vol.iloc[1:-(n-1)].values.reshape(-1,),
              batch_size=k, epochs=j, verbose=False)
    DL_predict = model.predict(np.asarray(X.iloc[-n:]))
    DL_RMSE.append(np.sqrt(mse(realized_vol.iloc[-n:] / 100,
                               DL_predict.flatten() / 100)))
    DL_pred.append(DL_predict)
    print('DL_RMSE_{}:{:.6f}'.format(i + 1, DL_RMSE[i]))
# DL_RMSE_1:0.000701
# DL_RMSE_2:0.000611
# DL_RMSE_3:0.000807
# DL_RMSE_4:0.000655
DL_predict = pd.DataFrame(DL_pred[DL_RMSE.index(min(DL_RMSE))])
DL_predict.index = ret.iloc[-n:].index
plt.figure(figsize=(10, 6))
plt.plot(realized_vol / 100,label='Realized Volatility')
plt.plot(DL_predict / 100,label='Volatility Prediction-DL')
plt.title('Volatility Prediction with Deep Learning', fontsize=12)
plt.legend()
plt.show()
All the code from this chapter is here
Chapter 5. Modeling Market Risk
A measure of risk driven by historical data assumes the future will follow the pattern of the past. You need to understand the limitations of that assumption. More importantly, you need to model scenarios in which that pattern breaks down.
Miles Kennedy
Risk in finance is complex and varies by source, making it essential to differentiate risks and apply appropriate tools for each. The main types of risk are market, credit, liquidity, and operational. Market risk, for example, is tied to changes in financial indicators like inflation and interest rates, and tools like Value at Risk (VaR) and Expected Shortfall (ES) are used to manage it.
Traditional models, limited by assumptions and complexity, struggle with modern financial systems. Machine learning (ML) models, on the other hand, identify patterns and relationships in data without predefined theories, offering a fresh approach to both prediction and theory discovery. Below, first traditional statistical models are explored, and then more modern ML models.
Value at Risk (VaR)
The VaR model emerged from a request made by a J.P. Morgan executive who wanted a summary report showing the possible losses and the risks that J.P. Morgan was exposed to on a given day. This report would inform executives about the risks assumed by the institution in an aggregated manner. The method by which this market risk is computed became known as VaR. The report was the starting point of VaR, which has since become so widespread that not only do institutions prefer using VaR, but its adoption has also become required by regulators.
The adoption of VaR dates back to the 1990s, and despite numerous extensions and newly proposed models, it is still in use.
What makes it so appealing? The answer comes from Kevin Dowd:
The VaR figure has two important characteristics. The first is that it provides a common consistent measure of risk across different positions and risk factors. It enables us to measure the risk associated with a fixed-income position, say, in a way that is comparable to and consistent with a measure of the risk associated with equity positions. VaR provides us with a common risk yardstick, and this yardstick makes it possible for institutions to manage their risks in new ways that were not possible before. The other characteristic of VaR is that it takes account of the correlations between different risk factors. If two risks offset each other, the VaR allows for this offset and tells us that the overall risk is fairly low.
In fact, VaR addresses one of the most common questions an investor has: what is the maximum expected loss of my investment?
VaR provides a very intuitive and practical answer to this question. In this regard, it is used to measure the worst expected loss for a company over a given period and at a pre-defined confidence level. Suppose that the daily VaR of an investment is $1 million at a 95% confidence level. This would read as there being a 5% chance that the investor might incur a loss greater than $1 million in a day.
In summary, there are some important points in VaR analysis that need to be highlighted:
- VaR needs an estimation of the probability of loss
- VaR concentrates on the potential losses. We are not talking about actual or realized losses; rather, VaR is a kind of loss projection
VaR has three key ingredients:
- standard deviation that defines the level of loss
- fixed time horizon over which risk is assessed
- Confidence interval
VaR can be measured via three different approaches:
- variance-covariance VaR
- historical simulation VaR
- Monte Carlo VaR
Variance-Covariance Method
The variance-covariance method is also known as the parametric method because observations are assumed to be normally distributed. Since returns are deemed to follow a normal distribution, this parametric assumption makes the variance-covariance method easy to apply.
As in all VaR approaches, we can either work with a single asset or a portfolio. However, working with a portfolio requires careful treatment in the sense that correlation structure and portfolio variance need to be estimated. At this point, correlation comes into the picture, and historical data is used to calculate correlation, mean, and standard deviation. When augmenting this with an ML-based approach, correlation structure will be the main focus.
Suppose that we have a portfolio consisting of a single asset, with a mean return of zero and a standard deviation of 1. If the holding period is 1, the corresponding VaR can be computed from the value of the asset, the Z-value of the chosen confidence level, and the standard deviation:

$$\text{VaR} = V \times Z_{\alpha} \times \sigma$$

Hence, the normality assumption makes things easier, but it is a strong assumption: there is no guarantee that asset returns are normally distributed, and in practice most asset returns do not follow a normal distribution. Moreover, because of the normality assumption, potential risk in the tails might not be captured. Therefore, the normality assumption comes with a cost.
Here is how to apply the variance-covariance VaR in Python (norm comes from scipy.stats):
from scipy.stats import norm

stocks_returns = (np.log(stocks) - np.log(stocks.shift(1))).dropna()
stocks_returns_mean = stocks_returns.mean()
weights = np.random.random(len(stocks_returns.columns))
weights /= np.sum(weights)
cov_var = stocks_returns.cov()
port_std = np.sqrt(weights.T.dot(cov_var).dot(weights))
initial_investment = 1e6
conf_level = 0.95

def VaR_parametric(initial_investment, conf_level):
    alpha = norm.ppf(1 - conf_level, stocks_returns_mean, port_std)
    for i, j in zip(stocks.columns, range(len(stocks.columns))):
        VaR_param = (initial_investment - initial_investment *
                     (1 + alpha))[j]
        print("Parametric VaR result for {} is {} "
              .format(i, VaR_param))
    VaR_param = (initial_investment - initial_investment * (1 + alpha))
    print('--' * 25)
    return VaR_param

VaR_param = VaR_parametric(initial_investment, conf_level)
# Parametric VaR result for IBM is 42606.16125893139
# Parametric VaR result for MSFT is 41024.50194348814
# Parametric VaR result for INTC is 43109.25240851776
var_horizon = []
time_horizon = 30
for j in range(len(stocks_returns.columns)):
    for i in range(1, time_horizon + 1):
        var_horizon.append(VaR_param[j] * np.sqrt(i))
plt.plot(var_horizon[:time_horizon], "o",
c='blue', marker='*', label='IBM')
plt.plot(var_horizon[time_horizon:time_horizon + 30], "o",
c='green', marker='o', label='MSFT')
plt.plot(var_horizon[time_horizon + 30:time_horizon + 60], "o",
c='red', marker='v', label='INTC')
plt.xlabel("Days")
plt.ylabel("USD")
plt.title("VaR over 30-day period")
plt.legend()
plt.show()
VaR changes depending on the time horizon, in the sense that holding assets for a longer period makes an investor more susceptible to risk.
The pros and cons of the variance-covariance method are as follows:
Pros
- easy to calculate
- does not require a large number of samples
Cons
- assumes observations are normally distributed
- does not work well with nonlinear structures
- requires the computation of the covariance matrix
So, even though assuming normality sounds appealing, it may not be the best way to estimate VaR, especially in the case where the asset returns do not have a normal distribution. Luckily, there is another method that does not have a normality assumption, namely the historical simulation VaR model.
The Historical Simulation Method
This is an empirical method: instead of using a parametric approach, we find the percentile, which is the equivalent of the Z-table value in the variance-covariance method. Suppose that the confidence interval is 95%; the 5th percentile of the historical returns will be used in lieu of the Z-table value, and all we need to do is multiply this percentile by the initial investment.
The following are the steps taken in the historical simulation VaR:
- Obtain the asset returns of the portfolio (or individual asset)
- Find the corresponding return percentile based on confidence interval
- Multiply this percentile by initial investment
In code:
def VaR_historical(initial_investment, conf_level):
    Hist_percentile95 = []
    for i, j in zip(stocks_returns.columns,
                    range(len(stocks_returns.columns))):
        Hist_percentile95.append(np.percentile(stocks_returns.loc[:, i], 5))
        print("Based on historical values 95% of {}'s return is {:.4f}"
              .format(i, Hist_percentile95[j]))
        VaR_historical = (initial_investment - initial_investment *
                          (1 + Hist_percentile95[j]))
        print("Historical VaR result for {} is {:.2f} "
              .format(i, VaR_historical))
        print('--' * 35)

VaR_historical(initial_investment, conf_level)
Based on historical values 95% of IBM's return is -0.0371
Historical VaR result for IBM is 37081.53
----------------------------------------------------------------------
Based on historical values 95% of MSFT's return is -0.0426
Historical VaR result for MSFT is 42583.68
----------------------------------------------------------------------
Based on historical values 95% of INTC's return is -0.0425
Historical VaR result for INTC is 42485.39
----------------------------------------------------------------------
The historical simulation VaR method implicitly assumes that historical price changes have a similar pattern, i.e., that there is no structural break. The pros and cons of this method are as follows:
Pros
- no distributional assumption
- works well with nonlinear structures
- easy to calculate
Cons
- requires a large sample
- needs high computing power
The Monte Carlo Simulation VaR
Monte Carlo is a computerized mathematical method used to make an estimation in cases where there is no closed-form solution, so it is a highly efficient tool for numerical approximation. Monte Carlo relies on repeated random samples from a given distribution.
From the application standpoint, Monte Carlo is very similar to the historical simulation VaR, but it does not use historical observations. Rather, it generates random samples from a given distribution. Monte Carlo helps decision makers by providing links between possible outcomes and probabilities, which makes it an efficient and applicable tool in finance.
So, in a nutshell, a Monte Carlo simulation does nothing but generate random samples and calculate their mean. Computationally, it follows these steps (a minimal VaR sketch follows the list):
- Define the domain
- Generate random numbers
- Iterate and aggregate the result
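My notes don't include the chapter's Monte Carlo code, so here is a minimal sketch of a Monte Carlo VaR under a normality assumption, reusing `stocks_returns`, `initial_investment`, and `conf_level` from above (the number of simulations is an arbitrary choice):

```python
# Sketch: simulate returns from a normal distribution fitted to each stock's
# history, then read VaR off the simulated return distribution.
sim_num = 10000

def MC_VaR(initial_investment, conf_level):
    for col in stocks_returns.columns:
        mu, sigma = stocks_returns[col].mean(), stocks_returns[col].std()
        sim_returns = np.random.normal(mu, sigma, sim_num)
        cutoff = np.percentile(sim_returns, 100 * (1 - conf_level))
        var_mc = initial_investment - initial_investment * (1 + cutoff)
        print("Monte Carlo VaR result for {} is {:.2f}".format(col, var_mc))

MC_VaR(initial_investment, conf_level)
```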
Denoising
The financial market is filled with both noise (random information) and signal (valuable information). Noise traders act on random behavior, while informed traders use insider information or signals to make rational decisions. The key challenge is distinguishing between noise and signal, as failing to do so can lead to poor profit gains and risk assessment. Investors must be cautious, as only the signal provides useful insights for making informed decisions. The problem lies in how to differentiate and use this information effectively.
The Marchenko–Pastur theorem helps us obtain a homogeneous covariance matrix: it allows us to extract the signal from the noise using the eigenvalues of the covariance matrix.
Eigenvalues and eigenvectors have special meanings in this context. The eigenvectors of the covariance matrix represent its directions of variance, while the corresponding eigenvalues give their magnitudes. Specifically, the eigenvector associated with the largest eigenvalue points in the direction of largest variance, and the size of that variance equals the eigenvalue. Because of noise in the data, some eigenvalues can be thought of as random, so it makes sense to detect and filter out these eigenvalues in order to retain only the signal.
Below is the PDF of the Marchenko–Pastur distribution fit to the data.
The distribution fits the data well, and we are able to differentiate the noise from the signal; we can now refer to the data from which the noise has been filtered out as denoised.
We can plug the denoised covariance matrix into the VaR model, which gives us the denoised VaR estimation. Denoising the covariance matrix is nothing but taking unnecessary information (noise) out of the data, so we can then make use of the signal from the market, focusing our attention on the important events only.
Denoising the covariance matrix includes the following stages:
- Calculate the eigenvalues and eigenvectors based on the correlation matrix
- Fit a kernel density estimate to the observed eigenvalues
- Fit the Marchenko–Pastur distribution to the kernel density estimate
- Find the maximum theoretical eigenvalue implied by the Marchenko–Pastur distribution
- Replace the eigenvalues attributed to noise (those below the theoretical maximum) with their average
- Use the new eigenvalues together with the original eigenvectors to calculate the denoised correlation matrix
- Calculate the denoised covariance matrix from the denoised correlation matrix
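As a rough illustration of the first and the threshold-related steps, here is a minimal sketch of separating signal from noise eigenvalues with the Marchenko–Pastur upper bound (using `stocks_returns` from above; this simplified version assumes unit variance and skips the kernel-density fitting):

```python
# Sketch: eigenvalues of the empirical correlation matrix vs. the
# Marchenko-Pastur upper bound lambda_plus = (1 + sqrt(N/T))^2.
corr = stocks_returns.corr().values
eigenvalues = np.linalg.eigvalsh(corr)   # eigenvalues in ascending order
T, N = stocks_returns.shape              # T observations, N assets
lambda_plus = (1 + np.sqrt(N / T)) ** 2
signal = eigenvalues[eigenvalues > lambda_plus]
noise = eigenvalues[eigenvalues <= lambda_plus]
print('Signal eigenvalues:', signal)
print('Noise eigenvalues (to be replaced by their average):', noise)
```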
The difference between the traditionally applied VaR and the denoised VaR is even more pronounced in a crisis period. During a crisis period, correlation among assets becomes higher, which is sometimes referred to as correlation breakdown
Despite its appeal and ease of use, VaR is not a coherent risk measure; coherence requires satisfying certain conditions, or axioms.
Here are the four axioms of a coherent risk measure:
- Monotonicity: If one portfolio always produces a higher return than another, its risk should be lower. If Portfolio A consistently performs better than Portfolio B, the risk of Portfolio A should be lower than or equal to that of Portfolio B.
- Sub-additivity: The risk of a combined portfolio should not exceed the sum of the risks of the individual portfolios. Diversification should reduce, or at least not increase, overall risk.
- Positive homogeneity: Scaling a portfolio by a positive constant scales the risk proportionally. For example, doubling a portfolio's size will double its risk.
- Translation invariance: Adding a risk-free asset to a portfolio reduces the risk by the amount of that asset. If you add a certain amount of risk-free cash, the portfolio's risk decreases by that same amount.
VaR is not a coherent risk measure (it can violate sub-additivity), but Expected Shortfall is.
Expected Shortfall
Unlike VaR, ES focuses on the tail of the distribution. More specifically, ES enables us to take into account unexpected risks in the market. However, this doesn’t mean that ES and VaR are two entirely different concepts. Rather, they are related—that is, it is possible to express ES using VaR.
ES can also be computed based on historical observations. As with the historical simulation VaR method, the parametric assumption can be relaxed. To do that, the return (or loss) corresponding to the 95% confidence level is found first, and then the mean of the observations beyond that threshold gives us the result.
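Here is a minimal sketch of that calculation, mirroring the historical VaR function above (reusing `stocks_returns` and `initial_investment`; this is not the book's exact code):

```python
# Sketch: historical Expected Shortfall as the mean loss beyond the 95% VaR threshold.
def ES_historical(initial_investment, conf_level=0.95):
    for col in stocks_returns.columns:
        cutoff = np.percentile(stocks_returns[col], 100 * (1 - conf_level))
        tail = stocks_returns[col][stocks_returns[col] <= cutoff]
        es = initial_investment - initial_investment * (1 + tail.mean())
        print("Historical ES result for {} is {:.2f}".format(col, es))

ES_historical(initial_investment)
```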
Liquidity-Augmented Expected Shortfall
ES provides us with a coherent risk measure to gauge market risk. However, though we differentiate financial risks as market, credit, liquidity, and operational risk, that does not necessarily mean these risks are entirely unrelated to one another. Rather, they are, to some extent, correlated. That is, once a financial crisis hits the market, market risk surges along with the drawdown on lines of credit, which in turn increases liquidity risk.
Ignoring the liquidity dimension of risk may result in underestimating market risk. Therefore, augmenting ES with liquidity risk may yield a more accurate and reliable estimate. Sounds good, but what is a good proxy for liquidity risk?
Bid-ask spread measures are commonly used for modeling liquidity. In short, the bid-ask spread is the difference between the highest price a buyer is willing to pay (the bid) and the lowest price a seller is willing to accept (the ask), so it gives us a tool to measure transaction cost.
To the extent that the bid-ask spread is a good indicator of transaction cost, it is also a good proxy for liquidity, since transaction cost is one of the components of liquidity. Spreads can be defined in various ways depending on their focus. Here are the bid-ask spreads used to incorporate liquidity risk into the ES model:
Effective Cost
A buyer-initiated trade occurs when a trade is executed at a price above the quoted mid price; a seller-initiated trade occurs when a trade is executed at a price below the quoted mid price. We can then describe the effective cost as follows:

$$\text{Effective cost} = \begin{cases} \dfrac{P_t - P_{\text{mid}}}{P_{\text{mid}}} & \text{for buyer-initiated trades} \\ \dfrac{P_{\text{mid}} - P_t}{P_{\text{mid}}} & \text{for seller-initiated trades} \end{cases}$$

where $P_t$ is the transaction price and $P_{\text{mid}}$ is the quoted mid price.
Now we need to find a way to incorporate these bid-ask spreads into the ES model so that we are able to account for the liquidity risk as well as market risk. We will employ two different methods to accomplish this task:
- take the cross-sectional mean of the bid-ask spread (a row-wise averaging of the bid-ask spread); using this method, we are able to generate a measure for market-wide liquidity
- use PCA
In PCA, the number of components matters, and we need to decide how to trade off the number of components against the information lost. Here are some possible cutoffs for our case:
- greater than 80% explained variance
- more than one eigenvalue
- the point at which the scree plot gets flatter
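For the first cutoff, a minimal sketch (here `spreads` is a hypothetical DataFrame holding the bid-ask spread measures):

```python
from sklearn.decomposition import PCA

# Sketch: pick the smallest number of components explaining at least 80% of variance.
pca = PCA()
pca.fit(spreads)  # `spreads` is a hypothetical DataFrame of bid-ask spread measures
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.80)) + 1
print('Components explaining >= 80% of variance:', n_components)
```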
All detailed code from this chapter is here
Chapter 6 - Credit Risk Estimation
Although market risk is much better researched, the larger part of banks’ economic capital is generally used for credit risk. The sophistication of traditional standard methods of measurement, analysis, and management of credit risk might, therefore, not be in line with its significance.
Uwe Wehrspohn (2002)
Credit risk and its goal can be defined in a more formal way (BCBS and BIS 2000):
Credit risk is most simply defined as the potential that a bank borrower or counterparty will fail to meet its obligations in accordance with agreed terms. The goal of credit risk management is to maximise a bank’s risk-adjusted rate of return by maintaining credit risk exposure within acceptable parameters.
Estimating the Credit Risk
Credit risk has three defining characteristics:
- exposure: refers to a party that may possibly default or suffer an adverse change in its ability to perform
- likelihood: the likelihood that this party will default on its obligations
- recovery rate: how much can be retrieved if a default takes place
The BCBS put forth the global financial credit management standards, which are known as the Basel Accord. There are currently three Basel Accords. The most distinctive rule set by Basel I in 1988 was the requirement to hold capital equating to at least 8% of risk-weighted assets.
Basel I includes the very first capital measurement system, which was created following the onset of the Latin American debt crisis. In Basel I, assets are classified as follows:
- 0% for risk-free assets
- 20% for loans to other banks
- 50% for residential mortgages
- 100% for corporate debt
In 1999, Basel II issued a revision to Basel I based on three main pillars:
- minimum capital requirements, which sought to develop and expand the standardized rules set out in the 1988 Accord
- supervisory review of an institution’s capital adequacy and internal assessment process
- effective use of disclosure as a lever to strengthen market discipline and encourage sound banking practices
The last accord, Basel III in 2010, became inevitable as the 2007–2008 mortgage crisis deepened. It introduced a new set of measures to further strengthen liquidity and address poor governance practices. For instance, equity requirements were introduced to prevent a serial failure in the financial system, known as the domino effect, during times of financial turbulence and crisis. Accordingly, Basel III requires banks to maintain certain minimum financial ratios.
Basel II suggests banks implement either a standardized approach or an internal ratings–based (IRB) approach to estimate the credit risk.
The key parameters of the IRB approach are the exposure at default (EAD), the loss given default (LGD), and the probability of default (PD), which combine into:
Expected Loss = EAD × LGD × PD
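For example (with made-up numbers), a loan with EAD = $1,000,000, LGD = 40%, and PD = 2% would carry an expected loss of $1,000,000 × 0.40 × 0.02 = $8,000.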
The most important and challenging part of estimating credit risk is modeling the probability of default, which is what this chapter addresses.
Risk Bucketing
Risk bucketing is nothing but grouping borrowers with similar creditworthiness. The behind-the-scenes story of risk bucketing is to obtain homogenous groups or clusters so that we can better estimate the credit risk. Treating different risky borrowers equally may result in poor predictions because the model cannot capture entirely different characteristics of the data at once. Thus, by dividing the borrowers into different groups based on riskiness, risk bucketing enables us to make accurate predictions.
A common way to do this is through K-means clustering, and there are multiple methods to determine the right k (a small sketch follows the list below):
- the elbow method
- the Silhouette score
- Calinski-Harabasz (CH)
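A minimal sketch of the first two methods, assuming `scaled_credit` is the standardized feature matrix built in the next code block:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Sketch: inertia (for the elbow method) and silhouette score over a range of k,
# using `scaled_credit`, the standardized features created in the next block.
for k in range(2, 10):
    km = KMeans(n_clusters=k, random_state=42).fit(scaled_credit)
    print('k={}: inertia={:.1f}, silhouette={:.3f}'.format(
        k, km.inertia_, silhouette_score(scaled_credit, km.labels_)))
```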
Probability of Default Estimation with Logistic Regression
For this, the German Credit dataset from Kaggle is used.
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import numpy as np
scaler = StandardScaler()
scaled_credit = scaler.fit_transform(numerical_credit)
kmeans = KMeans(n_clusters=2) # 2 found after using the above mentioned methods
clusters = kmeans.fit_predict(scaled_credit)
We can see the 2 clusters plotted for different features
Next, the dataset is prepared by adding a cluster target (0 or 1) to the dataset:
clusters, counts = np.unique(kmeans.labels_, return_counts=True)
cluster_dict = {}
for i in range(len(clusters)):
    cluster_dict[i] = scaled_credit[np.where(kmeans.labels_ == i)]
credit['clusters'] = pd.DataFrame(kmeans.labels_)
df_scaled = pd.DataFrame(scaled_credit)
df_scaled['clusters'] = credit['clusters']
df_scaled['Risk'] = credit['Risk']
df_scaled.columns = ['Age', 'Job', 'Credit amount',
'Duration', 'Clusters', 'Risk']
df_scaled[df_scaled.Clusters == 0]['Risk'].value_counts()
# Risk count
# good 139
# bad 110
df_scaled[df_scaled.Clusters == 1]['Risk'].value_counts()
# Risk count
# good 561
# bad 190
The distribution across risk levels is imbalanced in cluster 1 (561 good vs. 190 bad), whereas the frequencies of good and bad risk levels in cluster 0 are more balanced (139 vs. 110).
Before dealing with the imbalance, split the data:
from sklearn.model_selection import train_test_split
df_scaled['Risk'] = df_scaled['Risk'].replace({'good': 1, 'bad': 0})
X = df_scaled.drop('Risk', axis=1)
y = df_scaled.loc[:, ['Risk', 'Clusters']]
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.2,
random_state=42)
first_cluster_train = X_train[X_train.Clusters == 0].iloc[:, :-1]
second_cluster_train = X_train[X_train.Clusters == 1].iloc[:, :-1]
To deal with the imbalance, a method combining SMOTE and Edited Nearest Neighbours (ENN) is used: SMOTEENN.
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from imblearn.combine import SMOTEENN
import warnings
warnings.filterwarnings('ignore')
X_train1 = first_cluster_train
y_train1 = y_train[y_train.Clusters == 0]['Risk']
smote = SMOTEENN(random_state = 2)
X_train1, y_train1 = smote.fit_resample(X_train1, y_train1.ravel())
First, a model for the 1st cluster:
logit = sm.Logit(y_train1, X_train1)
logit_fit1 = logit.fit()
print(logit_fit1.summary())
Result from logit_fit1
Optimization terminated successfully.
Current function value: 0.646649
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: y No. Observations: 61
Model: Logit Df Residuals: 57
Method: MLE Df Model: 3
Date: Sat, 19 Oct 2024 Pseudo R-squ.: 0.06545
Time: 09:54:50 Log-Likelihood: -39.446
converged: True LL-Null: -42.208
Covariance Type: nonrobust LLR p-value: 0.1371
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Age 0.5274 0.338 1.561 0.118 -0.135 1.190
Job -0.1123 0.376 -0.299 0.765 -0.849 0.624
Credit amount -0.2877 0.247 -1.165 0.244 -0.772 0.196
Duration 0.5352 0.306 1.751 0.080 -0.064 1.134
=================================================================================
The AUC-ROC for the 1st cluster model is:
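The ROC curve itself isn't reproduced in these notes; here is a minimal sketch of how it could be generated, assuming `X_test1`/`y_test1` are the cluster-0 test split built analogously to the training split (they appear later in the SVC section):

```python
# Sketch: ROC AUC for the first-cluster logit model; X_test1/y_test1 are assumed
# to be the cluster-0 test split, constructed analogously to the training split.
pred_prob1 = logit_fit1.predict(X_test1)
fpr1, tpr1, _ = roc_curve(y_test1, pred_prob1)
print('The ROC AUC of logit for first cluster is {:.4f}'.format(
    roc_auc_score(y_test1, pred_prob1)))
plt.plot(fpr1, tpr1, label='Logit (cluster 0)')
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()
```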
The 2nd model summary, from logit_fit2:
Optimization terminated successfully.
Current function value: 0.591220
Iterations 5
Logit Regression Results
==============================================================================
Dep. Variable: Risk No. Observations: 590
Model: Logit Df Residuals: 586
Method: MLE Df Model: 3
Date: Sat, 19 Oct 2024 Pseudo R-squ.: -0.04282
Time: 09:56:08 Log-Likelihood: -348.82
converged: True LL-Null: -334.50
Covariance Type: nonrobust LLR p-value: 1.000
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
Age 0.4019 0.101 3.992 0.000 0.205 0.599
Job -0.0639 0.098 -0.652 0.515 -0.256 0.128
Credit amount -0.6215 0.187 -3.322 0.001 -0.988 -0.255
Duration -0.9486 0.165 -5.760 0.000 -1.271 -0.626
=================================================================================
And its AUC-ROC:
These particular models for both clusters are not that good, as their ROC curves go along the diagonals of their respective plots.
Below, different methods are compared with the one above.
Probability of Default Estimation with Support Vector Machines
from sklearn.svm import SVC
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
import time
param_svc = {'gamma': [1e-6, 1e-2],
'C':[0.001,.09,1,5,10],
'kernel':('linear','rbf')}
svc = SVC(class_weight='balanced')
halve_SVC = HalvingRandomSearchCV(svc, param_svc,
scoring = 'roc_auc', n_jobs=-1)
halve_SVC.fit(X_train1, y_train1)
print('Best hyperparameters for first cluster in SVC {} with {}'.
format(halve_SVC.best_score_, halve_SVC.best_params_))
# Best hyperparameters for first cluster in SVC 0.6233333333333333 with {'kernel': 'linear', 'gamma': 1e-06, 'C': 10}
y_pred_SVC1 = halve_SVC.predict(X_test1)
print('The ROC AUC score of SVC for first cluster is {:.4f}'.
format(roc_auc_score(y_test1, y_pred_SVC1)))
# The ROC AUC score of SVC for first cluster is 0.4079
halve_SVC.fit(X_train2, y_train2)
print('Best hyperparameters for second cluster in SVC {} with {}'.
format(halve_SVC.best_score_, halve_SVC.best_params_))
# Best hyperparameters for second cluster in SVC 0.6388292598206392 with {'kernel': 'rbf', 'gamma': 0.01, 'C': 0.001}
y_pred_SVC2 = halve_SVC.predict(X_test2)
print('The ROC AUC score of SVC for second cluster is {:.4f}'.
      format(roc_auc_score(y_test2, y_pred_SVC2)))
# The ROC AUC score of SVC for second cluster is 0.5000
The AUC performance criterion indicates that the predictive performance of SVC is slightly below that of logistic regression.
Probability of Default Estimation with Random Forest
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(random_state=42)
param_rfc = {'n_estimators': [100, 300],
'criterion' :['gini', 'entropy'],
'max_features': ['auto', 'sqrt', 'log2'],
'max_depth' : [3, 4, 5, 6],
'min_samples_split':[5, 10]}
halve_RF = HalvingRandomSearchCV(rfc, param_rfc,
scoring = 'roc_auc', n_jobs=-1)
halve_RF.fit(X_train1, y_train1)
print('Best hyperparameters for first cluster in RF {} with {}'.
format(halve_RF.best_score_, halve_RF.best_params_))
# Best hyperparameters for first cluster in RF 0.9133333333333333 with {'n_estimators': 100, 'min_samples_split': 5, 'max_features': 'sqrt', 'max_depth': 3, 'criterion': 'entropy'}
y_pred_RF1 = halve_RF.predict(X_test1)
print('The ROC AUC score of RF for first cluster is {:.4f}'.
format(roc_auc_score(y_test1, y_pred_RF1)))
# The ROC AUC score of RF for first cluster is 0.5829
halve_RF.fit(X_train2, y_train2)
print('Best hyperparameters for second cluster in RF {} with {}'.
format(halve_RF.best_score_, halve_RF.best_params_))
# Best hyperparameters for second cluster in RF 0.6535338960049029 with {'n_estimators': 100, 'min_samples_split': 10, 'max_features': 'sqrt', 'max_depth': 4, 'criterion': 'entropy'}
y_pred_RF2 = halve_RF.predict(X_test2)
print('The ROC AUC score of RF for second cluster is {:.4f}'.
      format(roc_auc_score(y_test2, y_pred_RF2)))
# The ROC AUC score of RF for second cluster is 0.5125
The chapter does not go deep into building a model with good performance; it is just an overview of the above, plus a neural net and a Keras model. Here is the full code
Common pitfalls in the interpretation of coefficients of linear models and Failure of Machine Learning to infer causal effects
I was saving this to read on stream, but I wanted to read it casually on my own today. Definitely a resource that is awesome for beginners and something I wish I had read earlier in my learning journey haha
That is all for today!
See you tomorrow :)