# Random Forests

## Decision Trees

Pros

• Interpretable
• Simple
• Classification or Regression

Cons

• Poor accuracy
• High variance

## Bias vs Variance $Error = Variance + Bias^2 + noise$ Reduce bias: Build more complex models

Reduce variance: Use a lot of data or simple model

How to reduce the error caused by noise?

## Bagging

Averaging a set of observations reduces variance.

But we only have one training set ... or do we? ## Bootstrap

Simulate new datasets:

Take samples (with replacement) from original training set

Repeat $n$ times ## Bootstrap visualization

Sample has $n$ elements.

Probability of getting picked: $\frac{1}{n}$

Probability of not getting picked: $1-\frac{1}{n}$

If you sample $n$ elements with replacement, the probability for each element of not getting picked in the sample is: $(1-\frac{1}{n})^n$

As ${n\to\infty}$, this probability approaches $\frac{1}{e}\approx.368$

Thus, $0.632$ of the data points in your original sample show up in the Bootstrap sample (the other $0.368$ won't be present in it)

## Bagging

Train a tree on each bootstrap sample, and average their predictions (Bootstrap Aggregating)

## Random Forests

Like bagging, but removes correlation among trees.

At each split, considers only a subset of predictors. Notes:
Random forests typically consider $\sqrt{\textrm{number of features}}$ or $1/3$ at each split (if 10 features, then it considers ~3 features), and picks the best one.

Slide credit G. Calmattes

## Random Forests Slide credit G. Calmattes

## Other Ensemble Models

Sequential method

Fit trees to residuals (on first iteration, residuals are the true predictions)

Idea: build tree more slowly

These trees are NOT independent of each other

## GBDT

 Y Prediction Residual $40$ $35$ $5$ $60$ $67$ $-7$ $30$ $28$ $2$ $33$ $32$ $1$ $25$ $27$ $-2$
 Y Prediction Residual $5$ $3$ $2$ $-7$ $-4$ $-3$ $2$ $3$ $-1$ $1$ $0$ $1$ $-2$ $-2$ $0$

 Y Prediction $40$ $38$ $60$ $63$ $30$ $31$ $33$ $32$ $25$ $25$

## Kaggle

Data Science and Machine Learning Competition site

XGBoost one of most winning solutions

https://www.kaggle.com/