Trees have the awesome .feature_importances_ attribute.
But tree-based models have a strong tendency to overestimate the importance of continuous numerical or high-cardinality categorical features.
Binary classification on whether employees will leave the company (attrition).
Most important features: MonthlyIncome, Age, WorkingYears.
We'll add a single random continuous variable in the range [100, 200].
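A minimal sketch of this check, assuming the attrition data is already loaded into a dataframe df with numeric features and an 'Attrition' target column (these names and the classifier are assumptions, not the talk's exact code):

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# hypothetical setup: X holds the numeric features, y the attrition labels
rng = np.random.default_rng(0)
X = df.drop('Attrition', axis=1)
y = df['Attrition']

# add a random continuous feature in [100, 200] that cannot carry any signal
X['random_continuous'] = rng.uniform(100, 200, size=len(X))

clf = RandomForestClassifier(random_state=0).fit(X, y)
for name, imp in sorted(zip(X.columns, clf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(f'{name}: {imp:.3f}')
# if the random feature lands near MonthlyIncome or Age,
# the importances should not be taken at face value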

Let's also add a discrete random variable.
It turns out to be much less important than the continuous one.
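The same check with a discrete random feature, reusing X, y and rng from the sketch above (the number of levels is an arbitrary choice):

# a low-cardinality random feature: few distinct values, so few candidate splits
X['random_discrete'] = rng.integers(0, 4, size=len(X))

clf = RandomForestClassifier(random_state=0).fit(X, y)
print(dict(zip(X.columns, clf.feature_importances_)))
# the discrete random feature typically scores far below the continuous one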
Trees favor high-cardinality features because they offer more candidate splits, which leads to overfitting.
What happens with correlated columns?

Techniques typically used score features differently, e.g. by the drop in model performance (.score or a custom metric).
These methods increase complexity with often limited results.
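For instance, a minimal sketch of permutation importance with sklearn's permutation_importance, reusing clf, X, y from the earlier sketch (our illustration; the talk may have other techniques in mind):

from sklearn.inspection import permutation_importance

# shuffle one column at a time and measure how much the score drops
# (uses the estimator's .score by default, or a custom scorer);
# ideally computed on held-out data rather than the training set
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(X.columns, result.importances_mean),
                        key=lambda t: t[1], reverse=True):
    print(f'{name}: {imp:.3f}')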
But there is another intuitive way!
SMOTE (Synthetic Minority Oversampling Technique)

We'll look at a dataset of employees interested in switching jobs.
The target is imbalanced.

We train a classifier with SMOTE, undersampling, oversampling, and class weights, and compare the results.
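A minimal sketch of that comparison, assuming a train/test split already exists and imbalanced-learn is installed; the RandomForestClassifier and F1 score are placeholder choices, not necessarily the talk's:

from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

samplers = {
    'smote': SMOTE(random_state=0),
    'undersampling': RandomUnderSampler(random_state=0),
    'oversampling': RandomOverSampler(random_state=0),
}

scores = {}
for name, sampler in samplers.items():
    # resample the training data only; the test set stays untouched
    x_res, y_res = sampler.fit_resample(x_train, y_train)
    clf = RandomForestClassifier(random_state=0).fit(x_res, y_res)
    scores[name] = f1_score(y_test, clf.predict(x_test))

# class weights need no resampling, only reweighting in the loss
clf = RandomForestClassifier(class_weight='balanced', random_state=0)
scores['class_weights'] = f1_score(y_test, clf.fit(x_train, y_train).predict(x_test))

print(scores)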

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Wrong: the scaler is fit on the full dataset, so the training data
# includes statistics (mean, std) from the future test set
features = StandardScaler().fit_transform(df.drop('target', axis=1))
x_train, x_test, y_train, y_test = train_test_split(
    features,
    df.target,
)
What we should do:
# Split first, then fit the scaler on the training data only
x_train, x_test, y_train, y_test = train_test_split(
    df.drop('target', axis=1),
    df.target,
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)  # learn mean/std from training data only
x_test = scaler.transform(x_test)        # apply the same transform to the test set
The same goes for any other preprocessing step that is fit on the data.
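One way to make this hard to get wrong is to put the preprocessing inside a Pipeline, so fitting only ever sees the training data; a minimal sketch (our illustration, not necessarily the talk's recommendation):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# fitting the pipeline fits the scaler and the model on x_train only;
# predicting on x_test reuses the already-fitted scaler
model = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
model.fit(x_train, y_train)
print(model.score(x_test, y_test))

This also keeps cross-validation leak-free, because each fold refits the scaler on its own training portion.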
Thank you for listening.