(Day 5) Kaggle's 'Learn' courses

Ivan Ivanov · January 6, 2024

Hello :) Today is Day 5!

A quick summary of today:

  • found Kaggle's extremely helpful 'Learn' courses for ML


The content of these courses is summarized below.

Intro to Machine Learning

  • model creation, training, and prediction
  • model validation
  • model selection: underfitting and overfitting
  • building a first RandomForest model
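
Here is a minimal sketch of that create → train → validate → predict loop. The CSV file and column names are hypothetical stand-ins:

```python
# Minimal train/validate loop, as in the Intro course.
# "train.csv", "Price", "LotArea", "YearBuilt" are hypothetical names.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

home_data = pd.read_csv("train.csv")
y = home_data["Price"]
X = home_data[["LotArea", "YearBuilt"]]

# Hold out a validation set so under/overfitting shows up in the score
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)

preds = model.predict(val_X)
print("Validation MAE:", mean_absolute_error(val_y, preds))
```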

Intermediate Machine Learning

  • dealing with missing data
    • deleting the column, or imputation
  • handling categorical data
    • deleting the column, ordinal encoding, one-hot encoding
  • pipelines (a sketch follows this list)
    • define the preprocessor
    • define the model
    • create the pipeline

  • cross-validation
  • XGBoost
  • data leakage (target leakage and train-test contamination)
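
Here is a rough sketch of that preprocessor → model → pipeline pattern, with cross-validation scoring the whole pipeline. The file and column names are hypothetical:

```python
# Preprocessor -> model -> pipeline, scored with cross-validation.
# "train.csv" and the column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv("train.csv")
y = data["Price"]
X = data[["LotArea", "YearBuilt", "Neighborhood"]]

# Define the preprocessor: impute missing numbers,
# impute + one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["LotArea", "YearBuilt"]),
    ("cat", Pipeline(steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Neighborhood"]),
])

# Define the model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Create the pipeline
pipeline = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])

# Cross-validation re-fits the preprocessing inside each fold,
# which also guards against train-test contamination
scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("Mean MAE:", scores.mean())
```

Swapping the model step for XGBoost's XGBRegressor gives the gradient-boosting variant the course covers.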

Data visualization

Using seaborn

  • line plots, bar graphs, heat maps, scatterplots
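
Each of those is basically a one-liner in seaborn. A toy example with made-up data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})

plt.figure(); sns.lineplot(data=df, x="x", y="y")     # line plot
plt.figure(); sns.barplot(data=df, x="x", y="y")      # bar graph
plt.figure(); sns.heatmap(df.corr(), annot=True)      # heat map
plt.figure(); sns.scatterplot(data=df, x="x", y="y")  # scatterplot
plt.show()
```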

Feature engineering

  • Mutual information - a lot like correlation in that it measures a relationship between two quantities, but with the advantage that it can detect any kind of relationship, while correlation only detects linear ones. Note that scikit-learn's MI functions expect discrete features to be integer-typed.

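    A minimal sketch with synthetic data, assuming scikit-learn's mutual_info_regression; the quadratic feature has almost zero linear correlation with the target, but MI still finds it:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "quadratic": rng.normal(size=200),  # related to y, but not linearly
    "noise": rng.normal(size=200),      # unrelated to y
})
y = X["quadratic"] ** 2 + 0.1 * rng.normal(size=200)

# Higher score = stronger (possibly non-linear) relationship with y
mi = mutual_info_regression(X, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```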

  • Clustering with K-means
  • PCA - its primary goal is to reduce the number of features (or dimensions) in a dataset while preserving as much of the original variability as possible (I need to look into this more)
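
    A minimal sketch with synthetic data, assuming scikit-learn's PCA; features should be standardized first, since PCA is sensitive to scale:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

# Standardize, then project 4 features down to 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Fraction of the original variance each component preserves
print(pca.explained_variance_ratio_)
```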

  • Target encoding with MEstimateEncoder

For the target encoding above, smoothing works better than a plain mean encoding. The idea is to blend the in-category average with the overall average: rare categories get less weight on their category average, while unseen categories just get the overall average. With smoothing parameter m, a category seen n times gets roughly `encoding = (n * category_mean + m * overall_mean) / (n + m)`.

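A minimal sketch, assuming the category_encoders package and toy data; the encoder is fit on a separate split so the target does not leak into training:

```python
import pandas as pd
from category_encoders import MEstimateEncoder

# "Zipcode" stands in for a high-cardinality categorical feature
df = pd.DataFrame({
    "Zipcode": ["a", "a", "a", "b", "b", "c", "d", "d"],
    "Price":   [10, 12, 11, 20, 22, 30, 25, 27],
})

# Fit the encoder on one split, apply it to the other (avoids target leakage)
encode_split = df.sample(frac=0.5, random_state=0)
train_split = df.drop(encode_split.index)

encoder = MEstimateEncoder(cols=["Zipcode"], m=5.0)  # m controls smoothing
encoder.fit(encode_split[["Zipcode"]], encode_split["Price"])
print(encoder.transform(train_split[["Zipcode"]]))
```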

When choosing a value for m, consider how noisy you expect the categories to be: the noisier the category means, the larger m should be.

Use cases for target encoding:

  • High-cardinality features: a feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.
  • Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature’s true informativeness.

Today I learned about PCA, pipelines, and target encoding, but I have a feeling I need to look into them a bit more.

That is all for today!

See you tomorrow :)

Original post in Korean