Hello :) Today is Day 297!
A quick summary of today:
- company review LLM
- finished the credit risk modelling course
- streamed
- fixed my pyspark setup for some cool new tasks
Company review LLM
Previously, when I talked about the movie reviewer project and its data - well, it was actually related to this. The data is here, and today I started working on it.
Company review data is here, after I preprocessed it. The data came in an arbitrary format from the professor, so I had to do some basic preprocessing and formatting to get it into the format I need.
We will employ different PEFT strategies, but today I started with a simple LoRA fine-tuning of llama-3.2-3b-instruct. The results… bad. I did not even need to run them through the evaluation metrics because just by looking at them - the generated text does not match the desired output, and is completely unrelated to it.
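For context, here is roughly what a LoRA setup looks like with Hugging Face's peft library - a minimal sketch, not my exact config (the model id, hyperparameters and target modules below are illustrative placeholders):
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
model_id = "meta-llama/Llama-3.2-3B-Instruct"  # assumed HF model id
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
# placeholder hyperparameters - not the exact values from my run
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                 # rank of the low-rank adapter matrices
    lora_alpha=32,        # scaling factor applied to the adapter updates
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable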
Next, I tried to run the llama-3.1-8b models in Colab's free T4, but the training said it would take 6-7 hours, and towards the end I got a message that I had run out of free hours… And the more confusing and frustrating part is that Kaggle is giving me some weird torch errors that I believe come from unsloth and Kaggle. I might need to ask the professor for resources - either credits for Colab GPUs or use of his 2nd GPU PC.
I also fine-tuned a gpt-4o-mini model and the results for that seem fine by eye-balling them; however, I need to run them through the eval metrics code to get proper results for my report.
Finishing the Credit Risk Modelling in Python course on Udemy
LGD and EAD models
For these models we need data for accounts that have been written off, and where a sufficient amount of time has passed so that there can be some recoveries.
Recovery rate
LGD - if a borrower defaults, it is the proportion of the total exposure that the lender loses once the default has occurred. An established approach in practice is to model the proportion of the total exposure that can be recovered by the lender once a default has occurred. This proportion is called the recovery rate (recoveries / funded amount). From there, the proportion that can't be recovered (LGD) can be calculated easily, because it equals 1 - recovery_rate for each exposure.
Sometimes due to calculation and database specifications, we might end up with values < 0 and/or > 1. In such cases, replace the ones below 0 with 0, and the ones above 1 with 1.
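In pandas, the whole recovery rate / LGD construction is basically this (toy data and hypothetical column names, just to illustrate the idea):
import pandas as pd
# toy loan-level data; the real dataset has different columns and values
df = pd.DataFrame({'funded_amnt': [10000, 5000, 8000],
                   'recoveries': [2500, 0, 8500]})
df['recovery_rate'] = df['recoveries'] / df['funded_amnt']
# cap at the [0, 1] range as described above
df['recovery_rate'] = df['recovery_rate'].clip(lower=0, upper=1)
df['lgd'] = 1 - df['recovery_rate']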
Credit conversion factor (CCF)
EAD is the total value that a lender is exposed to when a borrower defaults. Therefore, it is the maximum that a bank may lose when a borrower defaults on a loan. On many occasions, the bank has decided to grant an amount of money but has not disbursed the whole amount. Also, the borrower might have already repaid a significant amount of the loan at the time of default. Moreover, for revolving facilities such as credit cards, but also for some consumer loans, the borrower may be able to spend again what they had already repaid, up to a certain limit called the credit limit. So the borrower may only have defaulted on a proportion of the original funded amount. That proportion is going to be the dependent variable for the EAD model. Often, this proportion is called the credit conversion factor, and EAD = total funded amount x credit conversion factor. Higher CCF -> higher risk, and vice versa.
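Constructing the dependent variable would look something like this (again toy numbers; total_rec_prncp is a hypothetical column for the principal repaid before default, the real dataset may define things slightly differently):
import pandas as pd
# toy data; column names are stand-ins for whatever the dataset actually uses
df = pd.DataFrame({'funded_amnt': [10000, 5000],
                   'total_rec_prncp': [4000, 500]})  # principal repaid before default
# CCF: the proportion of the funded amount still outstanding at default
df['ccf'] = (df['funded_amnt'] - df['total_rec_prncp']) / df['funded_amnt']
df['ead'] = df['funded_amnt'] * df['ccf']  # EAD = funded amount x CCF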
Both of the above are in the range [0, 1], and both are described by a Beta distribution, so a beta regression would be the natural model. But because no major Python package offers this type of regression natively, we use another approach.
For recovery rate and estimating LGD, we can use a 2-stage approach (a rough code sketch of this, together with the CCF model, comes after the next paragraph):
- is the recovery rate equal to 0 or greater than 0? (binary question - logistic regression)
- if the recovery rate is greater than 0, how much exactly is it? (linear regression - make sure the predictions stay between 0 and 1; we can use the correlation coefficient to evaluate the model, and also check that the residuals follow a normal distribution)
For CCF, the course suggests just using a multiple linear regression model (here as well, we can use the correlation coefficient as an evaluation metric, and check that the residuals follow a normal distribution).
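A rough sklearn sketch of both ideas - the 2-stage recovery-rate model and the plain linear regression for CCF - on made-up data (in practice X and the targets come from the preprocessed loan data, not random numbers):
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))                                    # stand-in features
recovery_rate = rng.beta(0.5, 2, 500) * (rng.random(500) > 0.4)  # many exact zeros
ccf = rng.beta(2, 2, 500)                                        # stand-in CCF target
# stage 1: is the recovery rate 0 or greater than 0?
stage1 = LogisticRegression(max_iter=1000)
stage1.fit(X, (recovery_rate > 0).astype(int))
# stage 2: given a positive recovery rate, how large is it?
positive = recovery_rate > 0
stage2 = LinearRegression()
stage2.fit(X[positive], recovery_rate[positive])
# combined prediction, clipped back into [0, 1], then LGD = 1 - recovery rate
rr_pred = np.clip(stage1.predict(X) * stage2.predict(X), 0, 1)
lgd_pred = 1 - rr_pred
# CCF: a single multiple linear regression, also clipped to [0, 1]
ccf_model = LinearRegression().fit(X, ccf)
ccf_pred = np.clip(ccf_model.predict(X), 0, 1)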
After we have all three, we can calculate Expected Loss = PD x LGD x EAD for our customer base or on a portfolio level.
We can then calculate the ratio of EL to total funded amount, and that number can feed into our decision on whether to employ a more aggressive or a more conservative lending strategy.
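As a toy numeric example of that last step (completely made-up numbers):
import numpy as np
pd_ = np.array([0.05, 0.10, 0.02])     # probability of default per loan
lgd = np.array([0.60, 0.45, 0.80])     # loss given default
ead = np.array([8000, 3000, 12000])    # exposure at default
funded = np.array([10000, 5000, 15000])
el = pd_ * lgd * ead                   # expected loss per loan
el_ratio = el.sum() / funded.sum()     # portfolio EL as a share of total funded amount
print(round(el_ratio, 4))              # ~0.0189 here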
Overall
Wow, such a great course and intro to the topic of credit risk modelling. I am not the exact target audience for this course, as I am already familiar with the majority of the concepts, but it was a nice refresher and reminded me of my 1st year at Lloyds Banking Group (which I cherish deeply)
Stream
Completed this Neo4j course on stream - Build a Neo4j-backed Chatbot using Python
It was not bad. After the stream ended I was looking around and found this - GraphRAG Python Package: Accelerating GenAI With Knowledge Graphs
This is Neo4j's native GraphRAG Python package, which reflects my belief that what LangChain offers is fast-changing: something you learn today might be similar, but different, in a month because the API has changed or the specific function is now provided natively by another package. Anyway ~ this blog post is something to read, maybe on the next stream (Monday) or by myself - I will see.
I want to try to convert the pandas code and models from the Udemy course to PySpark - kind of like what part of my role was back during my 1st year at Lloyds Banking Group. I decided to start doing it, but of course I faced PySpark issues. I spent ~1hr fixing them, but finally got it done. I will follow along the notebooks that do it in pandas and translate that to PySpark ~ but that is for tomorrow/the next few days.
That is all for today!
See you tomorrow :)
P.S. I had the post ready to publish, but I was playing around with translating the pandas code to PySpark and got stuck on doing OHE in PySpark.
So far, I have:
from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml import Pipeline
# map the string grades to numeric indices, then one-hot encode the indices
indexer = StringIndexer(inputCol='grade', outputCol='grade_index')
encoder = OneHotEncoder(inputCol='grade_index', outputCol='grade_ohe', dropLast=True)
# fit both stages as a single pipeline and apply it to the DataFrame
pipeline = Pipeline(stages=[indexer, encoder])
pipeline_model = pipeline.fit(df)
df = pipeline_model.transform(df)
df.select('grade', 'grade_index', 'grade_ohe').show()
Which returns:
+-----+-----------+-------------+
|grade|grade_index| grade_ohe|
+-----+-----------+-------------+
| B| 0.0|(6,[0],[1.0])|
| C| 1.0|(6,[1],[1.0])|
| C| 1.0|(6,[1],[1.0])|
| C| 1.0|(6,[1],[1.0])|
| B| 0.0|(6,[0],[1.0])|
| A| 3.0|(6,[3],[1.0])|
| C| 1.0|(6,[1],[1.0])|
| E| 4.0|(6,[4],[1.0])|
| F| 5.0|(6,[5],[1.0])|
And apparently, for Spark ML models the features need to be put into a single vector column, which is why the output looks like that. I need to think about whether this is OK for me, and I will continue tomorrow.
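If I keep going this way, I think the usual pattern is to assemble all model inputs into one features vector column with VectorAssembler before fitting a Spark ML estimator - a rough sketch on top of the pipeline above (loan_amnt is just a placeholder numeric column, not necessarily in my data):
from pyspark.ml.feature import VectorAssembler
# combine the one-hot encoded grade with any numeric columns into one vector
assembler = VectorAssembler(inputCols=['grade_ohe', 'loan_amnt'], outputCol='features')
df = assembler.transform(df)
df.select('grade', 'features').show(5, truncate=False)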