(Day 294) Reading more of sklearn's docs

Ivan Ivanov · October 21, 2024

Hello :) Today is Day 294!

A quick summary of today:

  • read the XGBoost and LightGBM papers
  • discovered more cool little modules from sklearn’s docs

Read two papers

I decided to read the above papers as they were referenced here (Ensembles: Gradient boosting, random forests, bagging, voting, stacking) in the scikit-learn docs page that introduces the various ensemble methods. I read the XGBoost one just because of XGBoost’s popularity, and LightGBM was cited as motivation for creating sklearn’s HistGradientBoostingClassifier (and Regressor) models.

Computing with sklearn

I will give a short summary of this as the ideas of more efficient computation in sklearn are something I have not explored before.

Several factors affecting prediction latency are explored:

  • number of features: more features lead to increased memory consumption and processing time, resulting in slower prediction times
  • input data representation and sparsity: sparse data representation can significantly speed up predictions for sparse datasets by reducing the number of operations. However, dense representation can benefit from optimized BLAS operations, making it faster for datasets with low sparsity. The docs recommend using sparse formats if the sparsity ratio is greater than 90% (a quick way to check this is sketched after this list)
  • model complexity: increasing complexity often leads to increased latency. For linear models, the impact on latency is minimal, but for non-linear models like SVM and ensemble methods, complexity significantly affects latency
  • feature extraction: in real-world applications, feature extraction can be much more time-consuming than the prediction itself
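
As a rough illustration of the sparsity point above, here is a minimal sketch (my own toy example, not from the docs) of computing the sparsity ratio and switching to scipy's CSR format once it crosses the 90% threshold:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
X = rng.random((1000, 50))
X[X < 0.95] = 0.0  # zero out most entries to simulate a sparse dataset

# fraction of entries that are exactly zero
sparsity_ratio = 1.0 - np.count_nonzero(X) / X.size
print(f"sparsity ratio: {sparsity_ratio:.2f}")

# the docs suggest a sparse format once the data is mostly zeros
if sparsity_ratio > 0.90:
    X = sparse.csr_matrix(X)
```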

Here are some tips and tricks for performance optimization:

  • using optimized linear algebra libraries (BLAS/LAPACK): optimized libraries like Atlas, OpenBLAS, MKL, and Apple Accelerate can lead to significant speedups for models relying heavily on linear algebra operations
  • limiting working memory: by limiting temporary memory usage, you can avoid memory exhaustion during calculations (see the sketch after this list)
  • model compression: controlling model sparsity can reduce memory usage and latency, especially when combined with sparse input data
  • model reshaping: selecting only the relevant features used by the model can reduce memory overhead and processing time
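
Here is a small hedged sketch of two of these tips: capping temporary working memory with sklearn's config_context, and compressing a linear model with .sparsify(). The data and model choice are just placeholders.

```python
import numpy as np
import sklearn
from sklearn.metrics import pairwise_distances_argmin_min
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X = rng.random((2000, 100))
y = rng.integers(0, 2, size=2000)

# limiting working memory: chunked operations such as pairwise distance
# reductions keep their temporary arrays under ~64 MB inside this context
with sklearn.config_context(working_memory=64):
    closest, _ = pairwise_distances_argmin_min(X[:10], X)

# model compression: an L1-heavy penalty leaves mostly-zero coefficients,
# which .sparsify() then stores as a scipy sparse matrix to cut memory use
clf = SGDClassifier(penalty="elasticnet", l1_ratio=0.9, random_state=0).fit(X, y)
clf.sparsify()
print(clf.predict(X[:5]))
```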

Isotonic Regression

I feel like I should sometimes provide some info on the paper I read, because just pasting the link feels kind of lazy, especially when most of what I studied during the day is something I am already familiar with, have written about multiple times on this blog, and the original post already explains very well and concisely.

The isotonic regression algorithm finds a non-decreasing approximation of a function while minimizing the mean squared error on the training data. The benefit of such a non-parametric model is that it does not assume any shape for the target function besides monotonicity.
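
As a tiny sketch (again my own toy example, not from the docs), fitting sklearn's IsotonicRegression to noisy but roughly increasing data looks like this:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
x = np.arange(50)
y = np.log1p(x) + rng.normal(scale=0.3, size=50)  # noisy but roughly increasing target

# fit the non-decreasing approximation that minimizes squared error on (x, y)
iso = IsotonicRegression(increasing=True)
y_fit = iso.fit_transform(x, y)
```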


Stream

Read through the guides below, in particular going more in-depth on the plethora of feature selection techniques offered by scikit-learn.
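
To give a flavour of what those guides cover, here is a brief hedged example combining two common scikit-learn selectors; the dataset and thresholds are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, VarianceThreshold, f_classif
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

selector = make_pipeline(
    VarianceThreshold(threshold=0.0),        # drop constant features
    SelectKBest(score_func=f_classif, k=2),  # keep the 2 best by ANOVA F-score
)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)  # (150, 2)
```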


That is all for today!

See you tomorrow :)