Random Forests

Decision Trees

Pros

  • Interpretable
  • Simple
  • Classification or Regression

Cons

  • Poor accuracy
  • High variance

Bias vs Variance

Bias-Variance Trade-off

$\textrm{Error} = \textrm{Variance} + \textrm{Bias}^2 + \textrm{Noise}$

Reduce bias: build a more complex model

Reduce variance: use more data or a simpler model

How do we reduce the error caused by noise? (We can't: the noise term is irreducible.)

Bagging

Averaging a set of observations reduces variance.
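
One way to see this: for $n$ i.i.d. observations $Z_1, \dots, Z_n$, each with variance $\sigma^2$, the variance of their mean is $\mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} Z_i\right) = \frac{\sigma^2}{n}$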

But we only have one training set ... or do we?

Bootstrap

Simulate new datasets:

Draw elements at random, with replacement, from the original training set

Repeat $n$ times to get a bootstrap sample the same size as the original
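
A minimal sketch of this resampling with NumPy (the toy array and the number of bootstrap datasets are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10)    # toy "training set" with n = 10 elements
n = len(X)

for b in range(3):   # build 3 bootstrap datasets
    idx = rng.integers(0, n, size=n)   # n draws with replacement
    print(f"bootstrap sample {b}: {X[idx]}")
```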

Bootstrap visualization

Sample has $n$ elements.

Probability of getting picked: $\frac{1}{n}$

Probability of not getting picked: $1-\frac{1}{n}$

If you sample $n$ elements with replacement, the probability for each element of not getting picked in the sample is: $(1-\frac{1}{n})^n$

As ${n\to\infty}$, this probability approaches $\frac{1}{e}\approx 0.368$

Thus, on average about $63.2\%$ of the original data points show up in a given bootstrap sample (the other $\approx 36.8\%$ are left out of it)
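
A quick empirical check of that figure (a sketch; the values of $n$ and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 200
fractions = []
for _ in range(trials):
    idx = rng.integers(0, n, size=n)            # one bootstrap sample
    fractions.append(len(np.unique(idx)) / n)   # share of distinct points kept

print(np.mean(fractions))   # ~0.632, i.e. 1 - 1/e
```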

Bagging

Train a tree on each bootstrap sample, and average their predictions (Bootstrap Aggregating)
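
A sketch of bagging with scikit-learn's BaggingRegressor (the synthetic data and hyperparameters are illustrative assumptions; its default base estimator is a decision tree):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor

X, y = make_regression(n_samples=500, n_features=10, noise=10.0, random_state=0)

# Each of the 100 trees is trained on its own bootstrap sample;
# predictions are averaged across trees.
bagger = BaggingRegressor(n_estimators=100, bootstrap=True, random_state=0)
bagger.fit(X, y)
print(bagger.predict(X[:5]))
```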

Random Forests

Like bagging, but reduces the correlation among the trees.

At each split, consider only a random subset of the predictors.

Notes:
At each split, random forests typically consider $\sqrt{\textrm{number of features}}$ predictors (classification) or $1/3$ of the predictors (regression); with 10 features that is roughly 3 candidates per split, and the best split among them is chosen.
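
A sketch of a random forest in scikit-learn (the data and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

forest = RandomForestClassifier(
    n_estimators=200,      # number of bootstrapped trees
    max_features="sqrt",   # predictors considered at each split
    random_state=0,
).fit(X, y)

print(forest.score(X, y))
```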

Slide credit G. Calmattes

Random Forest Lab

Other Ensemble Models

Gradient Boosted Decision Trees

Sequential method

Fit each new tree to the current residuals (on the first iteration, the residuals are just the true target values)

Idea: build the model up slowly, one small correction at a time

These trees are NOT independent of each other

GBDT

Step 1: fit a tree to $Y$.

Y    Prediction   Residual
40   35            5
60   67           -7
30   28            2
33   32            1
25   27           -2

Step 2: fit a second tree to the residuals from step 1.

Y (residual)   Prediction   Residual
 5              3            2
-7             -4           -3
 2              3           -1
 1              0            1
-2             -2            0

Combined model: prediction = tree 1 + tree 2.

Y    Prediction
40   38
60   63
30   31
33   32
25   25
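
The same two-step residual fit, sketched with scikit-learn (simplified to two trees with no learning rate; the synthetic data is an illustrative assumption):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5 * X[:, 0] + rng.normal(scale=0.5, size=200)

tree1 = DecisionTreeRegressor(max_depth=2).fit(X, y)
residuals = y - tree1.predict(X)                 # what tree 1 got wrong

tree2 = DecisionTreeRegressor(max_depth=2).fit(X, residuals)

# Combined prediction: first tree plus the correction from the second.
y_hat = tree1.predict(X) + tree2.predict(X)
print(np.mean((y - y_hat) ** 2))                 # training error drops vs. tree 1 alone
```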

Kaggle

A data science and machine learning competition site

XGBoost is one of the libraries most often behind winning solutions

https://www.kaggle.com/
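
A minimal sketch with XGBoost's scikit-learn-style API (the synthetic data and hyperparameters are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from xgboost import XGBClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

model = XGBClassifier(
    n_estimators=300,    # number of boosted trees
    learning_rate=0.1,   # shrink each tree's contribution ("build slowly")
    max_depth=4,
).fit(X, y)

print(model.score(X, y))
```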