(Day 5) Kaggle's 'Learn' courses

Ivan Ivanov · January 6, 2024

Hello :) Today is Day 5!

A quick summary of today:

  • found Kaggle's extremely helpful 'Learn' courses for ML


The content of these courses is summarized below.

Intro to Machine Learning

  • model creation, training, and prediction
  • model validation
  • model selection: underfitting and overfitting
  • building a first RandomForest model
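
Here is a minimal sketch of that create → train → validate → predict loop. The CSV file and column names are hypothetical stand-ins:

```python
# Minimal train/validate loop, as in the Intro course.
# "train.csv", "Price", "LotArea", "YearBuilt" are hypothetical names.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

home_data = pd.read_csv("train.csv")
y = home_data["Price"]
X = home_data[["LotArea", "YearBuilt"]]

# Hold out a validation set so under/overfitting shows up in the score
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)

model = RandomForestRegressor(random_state=1)
model.fit(train_X, train_y)

preds = model.predict(val_X)
print("Validation MAE:", mean_absolute_error(val_y, preds))
```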

Intermediate Machine Learning

  • dealing with missing data
    • deleting the column, or imputation
  • handling categorical data
    • deleting the column, ordinal encoding, one-hot encoding
  • pipelines (a sketch follows this list)
    • define the preprocessor
    • define the model
    • create the pipeline

  • cross-validation
  • XGBoost
  • data leakage (target leakage and train-test contamination)
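
Here is a rough sketch of that preprocessor → model → pipeline pattern, with cross-validation scoring the whole pipeline. The file and column names are hypothetical:

```python
# Preprocessor -> model -> pipeline, scored with cross-validation.
# "train.csv" and the column names are hypothetical.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv("train.csv")
y = data["Price"]
X = data[["LotArea", "YearBuilt", "Neighborhood"]]

# Define the preprocessor: impute missing numbers,
# impute + one-hot encode the categorical column
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="median"), ["LotArea", "YearBuilt"]),
    ("cat", Pipeline(steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Neighborhood"]),
])

# Define the model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Create the pipeline
pipeline = Pipeline(steps=[("preprocess", preprocessor), ("model", model)])

# Cross-validation re-fits the preprocessing inside each fold,
# which also guards against train-test contamination
scores = -1 * cross_val_score(pipeline, X, y, cv=5,
                              scoring="neg_mean_absolute_error")
print("Mean MAE:", scores.mean())
```

Swapping the model step for XGBoost's XGBRegressor gives the gradient-boosting variant the course covers.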

Data visualization

Using seaborn

  • line plots, bar graphs, heat maps, scatterplots
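
Each of those is basically a one-liner in seaborn. A toy example with made-up data:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({"x": range(10), "y": [v ** 2 for v in range(10)]})

plt.figure(); sns.lineplot(data=df, x="x", y="y")     # line plot
plt.figure(); sns.barplot(data=df, x="x", y="y")      # bar graph
plt.figure(); sns.heatmap(df.corr(), annot=True)      # heat map
plt.figure(); sns.scatterplot(data=df, x="x", y="y")  # scatterplot
plt.show()
```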

Feature engineering

  • Mutual information - a lot like correlation in that it measures a relationship between two quantities, but with the advantage that it can detect any kind of relationship, while correlation only detects linear ones. Note that scikit-learn's MI functions expect discrete features to be integer-typed.

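    A minimal sketch with synthetic data, assuming scikit-learn's mutual_info_regression; the quadratic feature has almost zero linear correlation with the target, but MI still finds it:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "quadratic": rng.normal(size=200),  # related to y, but not linearly
    "noise": rng.normal(size=200),      # unrelated to y
})
y = X["quadratic"] ** 2 + 0.1 * rng.normal(size=200)

# Higher score = stronger (possibly non-linear) relationship with y
mi = mutual_info_regression(X, y, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```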

  • Clustering with K-means
  • PCA - its primary goal is to reduce the number of features (or dimensions) in a dataset while preserving as much of the original variability as possible (I need to look into this more)
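
    A minimal sketch with synthetic data, assuming scikit-learn's PCA; features should be standardized first, since PCA is sensitive to scale:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(100, 4)), columns=["f1", "f2", "f3", "f4"])

# Standardize, then project 4 features down to 2 principal components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

# Fraction of the original variance each component preserves
print(pca.explained_variance_ratio_)
```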

  • Target encoding with MEstimateEncoder

For the target encoding above, smoothing works better than a plain mean encoding. The idea is to blend the in-category average with the overall average: rare categories get less weight on their category average, while unseen categories just get the overall average. With smoothing parameter m, a category seen n times gets roughly `encoding = (n * category_mean + m * overall_mean) / (n + m)`.

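A minimal sketch, assuming the category_encoders package and toy data; the encoder is fit on a separate split so the target does not leak into training:

```python
import pandas as pd
from category_encoders import MEstimateEncoder

# "Zipcode" stands in for a high-cardinality categorical feature
df = pd.DataFrame({
    "Zipcode": ["a", "a", "a", "b", "b", "c", "d", "d"],
    "Price":   [10, 12, 11, 20, 22, 30, 25, 27],
})

# Fit the encoder on one split, apply it to the other (avoids target leakage)
encode_split = df.sample(frac=0.5, random_state=0)
train_split = df.drop(encode_split.index)

encoder = MEstimateEncoder(cols=["Zipcode"], m=5.0)  # m controls smoothing
encoder.fit(encode_split[["Zipcode"]], encode_split["Price"])
print(encoder.transform(train_split[["Zipcode"]]))
```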

When choosing a value for m, consider how noisy you expect the categories to be: the noisier the category means, the larger m should be.

Use cases for target encoding:

  • High-cardinality features: a feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature's most important property: its relationship with the target.
  • Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature’s true informativeness.

Today I learned about PCA, pipelines, and target encoding, but I have a feeling I need to look into them a bit more.

That is all for today!

See you tomorrow :)

Original post in Korean