Hello :) Today is Day 5!
A quick summary of today:
- found extremely helpful courses for ML on Kaggle
The content of the above courses is as follows:
Intro to Machine Learning
- model creation, training, and prediction
- model validation
- model selection - underfitting and overfitting
- beginner RandomForest model creation (a quick sketch below)
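A minimal sketch of that workflow (create, train, predict, validate) - the data and feature names here are invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Invented data: two made-up features predicting a made-up price
rng = np.random.default_rng(0)
X = pd.DataFrame({"rooms": rng.integers(1, 8, 200),
                  "area": rng.uniform(30, 200, 200)})
y = X["rooms"] * 50_000 + X["area"] * 1_000 + rng.normal(0, 5_000, 200)

# Hold out a validation set to measure out-of-sample error
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

# Create, train, and predict
model = RandomForestRegressor(n_estimators=100, random_state=1)
model.fit(X_train, y_train)
preds = model.predict(X_valid)

# Model validation: mean absolute error on the held-out data
print("Validation MAE:", mean_absolute_error(y_valid, preds))
```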
Intermediate Machine Learning
- dealing with missing data
- deleting the column, or imputation
- handling categorical data
- deleting the column, ordinal encoding, one-hot encoding
- pipelines
- define the preprocessor, define the model, create the pipeline (these steps come together in the sketch after this list)
- cross-validation
- XGBoost
- Data leakage (target leakage and train-test contamination)
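Most of the bullets above fit into one short sketch: impute the numeric columns, one-hot encode the categorical one, wrap both together with an XGBoost model in a pipeline, and score it with cross-validation. The column names and data are invented:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from xgboost import XGBRegressor

# Invented data with missing values and a categorical column
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "area": rng.uniform(30, 200, 300),
    "rooms": rng.integers(1, 8, 300).astype(float),
    "city": rng.choice(["Seoul", "Busan", "Daegu"], 300),
})
X.loc[rng.choice(300, 30, replace=False), "rooms"] = np.nan
y = X["area"] * 1_000 + rng.normal(0, 5_000, 300)

# Step 1: define the preprocessor (imputation + one-hot encoding)
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["area", "rooms"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])

# Step 2: define the model
model = XGBRegressor(n_estimators=300, learning_rate=0.05)

# Step 3: create the pipeline and cross-validate it
pipeline = Pipeline([("preprocessor", preprocessor), ("model", model)])
scores = -cross_val_score(pipeline, X, y, cv=5,
                          scoring="neg_mean_absolute_error")
print("Cross-validated MAE:", scores.mean())
```

Because the preprocessing lives inside the pipeline, the imputer and encoder are re-fit on each training fold during cross-validation, which avoids the train-test contamination kind of leakage.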
Data visualization
Using seaborn
- line plots, bar graphs, heatmaps, scatter plots
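A quick sketch of those four plot types on an invented dataset:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Invented data: daily sales across three made-up stores
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "day": np.arange(30),
    "sales": rng.normal(100, 10, 30).cumsum(),
    "store": rng.choice(["A", "B", "C"], 30),
})

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.lineplot(data=df, x="day", y="sales", ax=axes[0, 0])                  # line plot
sns.barplot(data=df, x="store", y="sales", ax=axes[0, 1])                 # bar graph
sns.heatmap(df[["day", "sales"]].corr(), annot=True, ax=axes[1, 0])       # heatmap
sns.scatterplot(data=df, x="day", y="sales", hue="store", ax=axes[1, 1])  # scatter plot
plt.tight_layout()
plt.show()
```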
Feature engineering
- Mutual information - it is a lot like correlation in that it measures a relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships. In scikit-learn, discrete features need an integer dtype, so categorical features get a label encoding first.
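A minimal sketch with scikit-learn's mutual_info_regression, on invented data: the "quadratic" feature below relates to the target only through its square, so correlation misses it but mutual information does not:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
X = pd.DataFrame({"linear": rng.uniform(0, 10, 500),
                  "quadratic": rng.uniform(-5, 5, 500)})
# "quadratic" affects y only through its square, so its correlation with y is near zero
y = 2 * X["linear"] + X["quadratic"] ** 2 + rng.normal(0, 1, 500)

# Both features here are continuous; integer-encoded categoricals would be
# flagged via discrete_features instead
mi = mutual_info_regression(X, y, discrete_features=False, random_state=0)
print(pd.Series(mi, index=X.columns).sort_values(ascending=False))
```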
- Clustering with K-means
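A sketch of K-means as a feature-engineering step: the cluster label becomes a new categorical feature (the coordinates and cluster count here are arbitrary):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Invented location data
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "latitude": rng.uniform(33, 38, 300),
    "longitude": rng.uniform(126, 130, 300),
})

# The fitted cluster label becomes a new feature for a downstream model
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
X["cluster"] = kmeans.fit_predict(X[["latitude", "longitude"]])
print(X["cluster"].value_counts())
```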
- PCA - its primary goal is to reduce the number of features (or dimensions) in a dataset while preserving as much of the original variability as possible (need to look more into this)
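A sketch of PCA on invented data: standardize, project onto a few components, and check how much of the original variability each component preserves:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Invented data: 6 features, one of which is nearly redundant
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=200)

# Standardize first so no feature dominates just because of its scale
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the original variance each component preserves
print(pca.explained_variance_ratio_)
```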
- Target encoding with MEstimateEncoder
For the encoding above, smoothing works better than plain mean encoding. The idea is to blend the in-category average with the overall average: rare categories get less weight on their category average, while missing categories just get the overall average.
When choosing a value for m, consider how noisy you expect the categories to be. Use cases for target encoding:
- High-cardinality features: A feature with a large number of categories can be troublesome to encode: a one-hot encoding would generate too many features, and alternatives, like a label encoding, might not be appropriate for that feature. A target encoding derives numbers for the categories using the feature’s most important property: its relationship with the target.
- Domain-motivated features: From prior experience, you might suspect that a categorical feature should be important even if it scored poorly with a feature metric. A target encoding can help reveal a feature’s true informativeness.
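A sketch of smoothed target encoding with MEstimateEncoder from the category_encoders package (the zipcode/price data is invented). Fitting the encoder on a separate split is one way to reduce the target leakage it can otherwise cause:

```python
import numpy as np
import pandas as pd
from category_encoders import MEstimateEncoder

# Invented high-cardinality feature: 50 zipcodes driving a made-up price
rng = np.random.default_rng(0)
df = pd.DataFrame({"zipcode": rng.choice([f"Z{i}" for i in range(50)], 1000)})
df["price"] = df["zipcode"].str[1:].astype(int) * 1_000 + rng.normal(0, 500, 1000)

# Fit the encoder on one split, apply it to the rest (avoids target leakage)
encoding_split = df.sample(frac=0.25, random_state=0)
rest = df.drop(encoding_split.index)

# m controls the blending: larger m pulls rare categories toward the overall mean
encoder = MEstimateEncoder(cols=["zipcode"], m=5.0)
encoder.fit(encoding_split[["zipcode"]], encoding_split["price"])
rest_encoded = encoder.transform(rest[["zipcode"]])
print(rest_encoded.head())
```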
Today I learned about PCA, pipelines and target encoding, but I have a feeling I need to look into them a bit more.
That is all for today!
See you tomorrow :)