(Day 298) From Pandas to PySpark

Ivan Ivanov · October 25, 2024

Hello :) Today is Day 298!

A quick summary of today:

  • starting to translate pandas data transformations to pyspark

In my 1st year at Lloyds Banking Group I was part of a project that built an innovative web app for the risk reporting world, and part of my role in the team was translating code from SAS to Python (pandas). Beyond SAS to Python, though, we had to switch from pandas to PySpark, so I had to learn how to translate pandas functions into PySpark, and I also helped other colleagues translate their code directly from SAS to PySpark.

Doing the Credit Modelling with Python course on Udemy this past week reminded me of this great time in my professional life, and I decided to try to convert the pandas code from the course to PySpark. The reason we switched from pandas to PySpark back then (2019-2020) was that our data was millions and millions and millions of rows, so PySpark was a much better choice. Now… the dataset used in the course is not that big, so I most likely won’t see any benefit from switching, but I think it is still a good exercise to translate the models.

Disclaimer

I cannot share the pandas code from the course because the course author has not made it public anywhere, so I do not want to share it either. However, I will share the PySpark notebooks/scripts as I write them.

Here is my project structure

[image: project structure]

The pandas folder contains the code from the course, and the pyspark folder is my pyspark code.

For now, rather than uploading it to GitHub, I will keep it local and just share my code through Colab. Here is what I have so far.

Where I stopped today was figuring out how to do WoE (Weight of Evidence) and IV (Information Value) calculations with PySpark. However, I think I might do just the data transformation and model translation, rather than the WoE and IV data analysis as well. To keep going and get to the modelling part, I might treat the output of the WoE and IV analysis as a given and just translate those transformations.

I could not do much more today because of my Friday class. But I invite you to check the colab notebook for the pyspark code ^^ It reminded me of my great time in my risk analyst job.


That is all for today!

See you tomorrow :)