Hello :) Today is Day 229!
A quick summary of today:
- watched a few amazing videos from the probabl channel
I have been subscribed to this channel for a while. It is created by some of the creators and devs of scikit-learn, and the video titles always looked interesting, so today I finally gave it a shot. Tldr: the videos are awesome! Everything is explained so well and in a simple way!
GridSearch vs RandomSearch
Param search can get slow. For instance, if we search over two params and use 5 values for each, that results in a total of 5x5 = 25 fits. We usually do cross-validation as well, so with cv=5 that becomes 25x5 = 125 fits.
- One idea to speed things up is to add refit=False. This is useful if we are just interested in finding the best params: once they are found, the pipeline is not fit on them one more time, so we save a fit.
- Another idea is adding n_jobs for multiprocessing and better CPU utilisation, so that multiple fits can run at the same time since they do not depend on each other.
- A powerful option is to add caching, by passing memory="cache_name" when initialising our scikit-learn pipeline. This works better for GridSearch, where we go over fixed values, so a param's values can be cached and then reused. It will not work so well in RandomSearch because there... well, it is random: the param values for each param of interest are sampled from a distribution, so on every fit the value is different and caching will not really give us any benefits. (See the sketch right after this list.)
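A minimal sketch combining these three options (the pipeline steps, param grid and "cache_dir" name are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# cache repeated transformer fits; "cache_dir" is just an illustrative folder name
pipe = Pipeline(
    [("pca", PCA()), ("clf", LogisticRegression(max_iter=1000))],
    memory="cache_dir",
)

search = GridSearchCV(
    pipe,
    param_grid={"pca__n_components": [5, 10, 15], "clf__C": [0.1, 1, 10]},
    cv=5,
    n_jobs=-1,    # run the fits in parallel
    refit=False,  # skip the final refit, we only want the best params
)
search.fit(X, y)
print(search.best_params_)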
To better utilise the cache when using RandomSearch, we can have a compromise:

import numpy as np

param_distribution = {
    "C": np.logspace(0.001, 2, 100),  # random sampling over a fixed list of 100 values
    "n_components": [10, 20, 50, 100]  # we will hit the cache a few times with this
}
Another thing when comparing the two: in a grid we are picking pre-determined values for a param, say 1, 3 and 6. But what if the best value is 1.5? GridSearch will not be able to find that by itself unless more values are provided (but we have to be careful, as we might run into grid explosion), whereas a RandomSearch might.
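As a rough sketch of that difference (the estimator and ranges here are my own, not from the video), a random search can sample C from a continuous distribution instead of only trying pre-determined points:

from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)

# the grid can only ever return 1, 3 or 6 for C
grid = GridSearchCV(SVC(), param_grid={"C": [1, 3, 6]}, cv=5)
# the random search samples C from a continuous range, so e.g. ~1.5 is reachable
rand = RandomizedSearchCV(SVC(), param_distributions={"C": loguniform(0.1, 10)},
                          n_iter=25, cv=5, random_state=0)

grid.fit(X, y)
rand.fit(X, y)
print(grid.best_params_, rand.best_params_)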
Building pipelines using scikit-learn’s API [1, 2]
I have used their API for pipelines but did not know it was so cool and powerful. Example:
Created using:
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline, make_union
from sklearn.preprocessing import OneHotEncoder, Binarizer
from sklearn.impute import SimpleImputer
from skrub import SelectCols
from sklearn.ensemble import HistGradientBoostingClassifier
feat_pipe = make_union(
    # one-hot encode pclass
    make_pipeline(
        SelectCols(["pclass"]),
        OneHotEncoder(sparse_output=False)
    ),
    # one-hot encode sex
    make_pipeline(
        SelectCols(["sex"]),
        OneHotEncoder(sparse_output=False)
    ),
    # impute missing ages with a constant, then add two "older than" flags
    make_pipeline(
        SelectCols(["age"]),
        SimpleImputer(fill_value=19, strategy="constant"),
        make_union(
            Binarizer(threshold=18),
            Binarizer(threshold=12),
        )
    ),
    # also pass fare and age through untouched
    SelectCols(["fare", "age"])
)

pipe = make_pipeline(
    feat_pipe,
    HistGradientBoostingClassifier()
)
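The column names (pclass, sex, age, fare) suggest the Titanic dataset, so a usage sketch might look like this (assuming a DataFrame with those columns and a survived target; fetch_openml is just one way to get such data):

from sklearn.datasets import fetch_openml

# hypothetical usage of the pipe above on Titanic-style data
df = fetch_openml("titanic", version=1, as_frame=True).frame
X, y = df[["pclass", "sex", "age", "fare"]], df["survived"]

pipe.fit(X, y)
print(pipe.score(X, y))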
Scaling datasets
To show the different scalers, a simple classification dataset was used.
Decision Boundary without scaling:
Decision Boundary using the StandardScaler:
MinMax Scaler:
Quantile Scaler:
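A rough sketch of the kind of comparison being made (the data, the artificial scale mismatch and the KNN model are my own illustration; "Quantile Scaler" presumably refers to QuantileTransformer):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, QuantileTransformer, StandardScaler

X, y = make_classification(n_samples=500, n_features=2, n_informative=2,
                           n_redundant=0, random_state=0)
X[:, 0] *= 100  # put the two features on very different scales

# same scale-sensitive classifier, different scalers
for scaler in [None, StandardScaler(), MinMaxScaler(), QuantileTransformer(n_quantiles=100)]:
    steps = ([scaler] if scaler is not None else []) + [KNeighborsClassifier()]
    model = make_pipeline(*steps)
    score = cross_val_score(model, X, y, cv=5).mean()
    print("no scaling" if scaler is None else type(scaler).__name__, round(score, 3))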
Boosting [1,2,3]
Gradient Boosting is an ensemble technique where each model (typically a decision tree) corrects the errors of the previous models. Each model uses the previous models' errors to learn. Example:
In gradient boosting, the algorithm repeatedly adds trees to correct errors made by the previous trees.
The process of finding the best split point in decision trees can be computationally expensive, especially with large datasets.
HistGradientBoostingRegressor simplifies this process by grouping continuous features into discrete bins (histograms). This reduces the number of possible split points, making the split-finding process faster and more efficient.
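A small sketch of dropping in the histogram-based estimator where a regular GradientBoostingRegressor would go (dataset and sizes are made up for illustration):

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, HistGradientBoostingRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=20_000, n_features=20, noise=10, random_state=0)

# the histogram version bins each feature into at most 255 buckets (max_bins),
# so finding splits is much cheaper than scanning every unique value
for model in [GradientBoostingRegressor(), HistGradientBoostingRegressor()]:
    print(type(model).__name__, round(cross_val_score(model, X, y, cv=3).mean(), 3))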
Better boosting using the monotonic param
The chart compares smokers vs non-smokers and shows the predicted probability of being alive after 10 years; the picture is the output of a boosted tree. We see something weird - these jumps. For example, after age 70 there is a weird spike in probability, but that does not make sense: we expect the probability of being alive at 80 (regardless of smoking or not) to be lower than at 60 and 70. This is just an artefact of how the data is.
We can overcome this issue using the monotonic_cst param:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
pipe = make_pipeline(
    ColumnTransformer([
        ('smoke', OneHotEncoder(), ['smoker']),
        ('age', FunctionTransformer(), ['age'])  # FunctionTransformer() just passes age through
    ]),
    # per column: 0 = no constraint, 1 = monotonically increasing, -1 = monotonically decreasing
    HistGradientBoostingClassifier(monotonic_cst=[0, 0, 1])
)
The 3 numbers correspond to the 3 columns fed to the model: smoker results in 2 columns as it is one-hot encoded, and the 3rd column is age. The param makes sure the prediction is monotonic in that 3rd column. With this setting, the graph becomes:
And now the graph represents the logical thing - the probability of being alive after 10 years goes down with age, regardless of being a smoker or not.
Benchmarking boosted trees against overfitting
This was about how different params can influence a boosting model's performance. The Y axis shows the difference between train and test performance (lower is better), and the X axis shows iterations.
We start with the below chart with 4000 data samples (from simulated data)
As we increase the samples to 10000, we see another model performing better:
And increasing to 50000:
So when we are creating a model, we need to be aware that parameters might only work well for a certain amount of data. Note! In this case the data is simulated, so it all comes from the same distribution; in real life we might have something different and also face data drift. Nevertheless, it is a caution to be aware of the difference the amount of data can make in a model's performance.
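I don't have the exact params from the video, but a sketch of this kind of benchmark could look like this (the sample sizes and learning rates are made up):

from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import train_test_split

# train/test gap (lower = less overfitting) for a few learning rates at different data sizes
for n_samples in [4_000, 10_000, 50_000]:
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    for lr in [0.01, 0.1, 0.5]:
        model = HistGradientBoostingClassifier(learning_rate=lr, random_state=0)
        model.fit(X_train, y_train)
        gap = model.score(X_train, y_train) - model.score(X_test, y_test)
        print(n_samples, lr, round(gap, 3))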
Features in Histogram Gradient Boosting Trees
The video instructor keeps saying that the examples on the scikit-learn website are very good, so I decided to check one related to the HGBT model. The instructor keeps using it in their examples, so maybe it is a very good model. Apparently it is indeed.
- HGBT is efficient, particularly on large datasets, due to its histogram-based approach to gradient boosting
- it requires minimal hyperparameter tuning, making it accessible for users without deep expertise in machine learning
- handling of missing data: the model can handle missing values natively, which simplifies preprocessing
Talking about missing data ~
This video talks about scikit-learn's native ability to handle missing values. Quite a few of the library's estimators (HGBT in particular) can handle NaN values natively, and the instructor encourages us to leave them in - maybe our model can learn something extra from them.
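A tiny sketch of this with HGBT (NaNs injected into made-up data):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1_000, n_features=10, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out ~10% of the values

# no imputation step needed: the model handles the missing values at each split
print(cross_val_score(HistGradientBoostingClassifier(), X, y, cv=5).mean())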
PCA used for outlier detection?
PCA’s goal is to reduce the dimensionality of our data while keeping as much information as possible. It uses a transformation matrix to convert our high-dimensional data into a low-dimensional representation. But what if we use the inverse of that transformation matrix to go from the low-dimensional representation back to a high-dimensional representation again? Because we are trying to reduce dimensions, we might lose some information, so the original high-dimensional data and the reconstructed high-dimensional data might not be the same.
But what we can take from that is: if a data point ‘survived’ the transformations, that means it is likely a normal point. If a point did ‘not survive,’ it can be thought of as an outlier.
What does ‘surviving’ mean? If the difference between the original data point and the reconstructed data point is above some threshold, then the point could be an outlier. This is because PCA tries to preserve the most common information when doing its transformation, and if that data point’s information was not able to be preserved (i.e., it had some different kind of information), then it might be an outlier.
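A minimal sketch of this idea (the data, number of components and threshold are all made up for illustration):

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=500, n_features=10, random_state=0)

pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))

# reconstruction error per sample = how much information did not 'survive' the round trip
errors = np.linalg.norm(X - X_reconstructed, axis=1)
threshold = np.quantile(errors, 0.99)  # e.g. flag the worst 1%
print("potential outliers:", (errors > threshold).sum())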
Using a KernelDensityEstimator
We can use a histogram to try to represent our data distribution:
But this representation may vary depending on the number of bins we choose. Also, it over- and undershoots in many places. This issue may be fixed with more data, but then there is the curse of dimensionality - if we have many dims, we won't have that much data per dimension to work with. Instead of using bins and a histogram, we can use a kernel.
We have a few different options:
Where is this useful?
- outlier detection
- fraud detection
If we have a pipeline where a new data point comes in, the point first goes through a trained kernel density estimator to determine whether it is an outlier, and depending on our design, if it is an outlier it won't go through the model at all.
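A sketch of that kind of gate using scikit-learn's KernelDensity (the data and the 1% threshold are arbitrary choices of mine):

import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1_000, 2))  # stand-in for "normal" historical data

kde = KernelDensity(kernel="gaussian", bandwidth=0.5).fit(X_train)
threshold = np.quantile(kde.score_samples(X_train), 0.01)  # bottom 1% of train log-density

def is_outlier(point):
    # points whose log-density falls below the threshold are treated as outliers
    return kde.score_samples(point.reshape(1, -1))[0] < threshold

print(is_outlier(np.array([0.0, 0.0])))  # dense region -> False
print(is_outlier(np.array([8.0, 8.0])))  # far from the data -> True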
Using sample weights for better models
The above is a dataset where we have fit a Ridge regression model. As we can see, there is more density in the lower parts compared to the top curvy part, and the fitted line reflects that. But what if, for our use case, we are particularly interested in predicting the high values at the top of the Y axis? Well, we can add sample weights to our Ridge model.
Each weight is associated with a data point and affects how much the loss on that point contributes to the overall loss. One way to add sample weights to the Ridge model is to use the y values themselves as the weights (this is one of the options). This way the model will pay more attention to the high values, and we can see this in the result:
And while the top values are fit better, we pay the price for some of the lower values. So it is not a free lunch ~ 😆
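For reference, a minimal sketch of the weighting trick (the data is made up; using y as the weight assumes y is non-negative, hence the clipping):

import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(500, 1))
y = np.exp(X.ravel() / 3) + rng.normal(scale=0.3, size=500)  # mostly low values, curvy high end

plain = Ridge().fit(X, y)
# weight each sample by its own (clipped, non-negative) y value:
# high-y points now contribute more to the loss, so the line chases them
weighted = Ridge().fit(X, y, sample_weight=np.clip(y, 0, None))

print(plain.coef_, weighted.coef_)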
These are most of the videos on the channel. However, there is a 'Live' section with ~1h livestreams, so I need to go over those as well ^^ Thank you, Vincent (the instructor from probabl)
That is all for today!
See you tomorrow :)