Ensemble learning

(also known as Forecast Combination)

Eran Raviv

Talk overview

Question: when facing different models, is it advisable to simply choose the best-performing one?

Option 1 = Agree

Option 2 = Disagree

Some targets are easy to forecast

Solar eclipse

Some... not so much

Lorenz attractor

Some... the evidence is mixed

S&P 500

In uncertain or dynamic environments we should use all the help we can get

Combining $P$ different models

Simple enough:

\begin{equation} f^{combined} = \frac{\sum_{i = 1}^P f_i }{P} \end{equation}
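A minimal sketch of this equal-weight average (the toy forecast values below are made up):

```python
# Equal-weight combination: average the P individual forecasts per period.
import numpy as np

forecasts = np.array([[1.0, 1.4, 0.8],
                      [2.1, 1.9, 2.3]])   # toy (T x P): T=2 periods, P=3 models

f_combined = forecasts.mean(axis=1)       # f^combined = sum_i f_i / P
print(f_combined)                         # [1.0667, 2.1]
```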

But should we?

Combining $P$ different models

Yes we should

It works

Rationale

  • Biases
  • Model risk

Different models perform differently in different circumstances and/or at different points in time:

[Figure: relative model performance over time]

Source: Forecasting day-ahead electricity prices: Utilizing hourly prices

The thing is, going forward we don't know which forecasting model will outperform.

So, just as we don't bet on a single horse when investing, we shouldn't bet on a single model here either.

It works in forecasting in the same manner it works in investing

That is the idea, but how do we combine?

Regression based (OLS)

$$ y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t, $$

The combined forecast is then given by:

$$f^{comb} = \widehat{\alpha} + \sum_{i = 1}^P \widehat{\beta}_i f_i,$$
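A minimal sketch of the OLS combination, assuming a training window of realised values and the corresponding individual forecasts (all names and the toy data are illustrative):

```python
# OLS combination: regress realised values on the P individual forecasts,
# then use the fitted intercept and slopes as combination weights.
import numpy as np

rng = np.random.default_rng(0)
T, P = 200, 3
y_train = rng.normal(size=T)                                     # realised target
F_train = y_train[:, None] + rng.normal(scale=0.5, size=(T, P))  # toy forecasts

X = np.column_stack([np.ones(T), F_train])             # intercept + forecasts
beta_hat, *_ = np.linalg.lstsq(X, y_train, rcond=None)

f_new = rng.normal(size=P)                             # next period's forecasts
f_comb = beta_hat[0] + f_new @ beta_hat[1:]            # alpha_hat + sum beta_i f_i
```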

Regression based (LAD)

Estimate the combination weights using:

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$ (as before)

But minimise the absolute loss function $\sum_t |\varepsilon_t|$ instead of the squared loss function $\sum_t \varepsilon_t^2$
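A sketch of the LAD variant, using the fact that median (quantile 0.5) regression minimises the absolute loss; statsmodels' QuantReg is one way to fit it (toy data again):

```python
# LAD combination via median regression: q=0.5 minimises sum_t |eps_t|.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
T, P = 200, 3
y = rng.normal(size=T)
F = y[:, None] + rng.normal(scale=0.5, size=(T, P))   # toy forecasts

X = sm.add_constant(F)                                # intercept + forecasts
lad_fit = sm.QuantReg(y, X).fit(q=0.5)
lad_weights = lad_fit.params                          # alpha_hat, beta_1..beta_P
```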

Regression based (CLS)

Estimate the combination weights using:

$$y_t = {\alpha} + \sum_{i = 1}^P {\beta_i} f_{i,t} +\varepsilon_t,$$

Minimise the squared loss function $\sum_t \varepsilon_t^2$, but under additional constraints: $\beta_i \geq 0 \;\; \forall i$, or $\sum_{i = 1}^P \beta_i = 1$, or both.
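A sketch of CLS with both constraints imposed, via scipy's SLSQP solver (the no-intercept form below is one common convention; data and names are illustrative):

```python
# CLS combination: minimise sum_t eps_t^2 subject to beta_i >= 0 and sum(beta) = 1.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
T, P = 200, 3
y = rng.normal(size=T)
F = y[:, None] + rng.normal(scale=0.5, size=(T, P))   # toy forecasts

sse = lambda b: np.sum((y - F @ b) ** 2)              # squared loss
res = minimize(sse, x0=np.full(P, 1.0 / P),           # start from equal weights
               bounds=[(0, None)] * P,                # beta_i >= 0
               constraints={"type": "eq", "fun": lambda b: b.sum() - 1.0},
               method="SLSQP")
cls_weights = res.x
```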

Accuracy-based (Inverse MSE)

Use some accuracy measure, for example mean squared error (MSE):

$$ \operatorname{MSE}_i = \frac{1}{T} \sum_{t=1}^{T} (f_{i,t} - y_{t})^{2} , $$

and combine the forecasts based on how well each individual is doing:

$$ f^c = \sum_{i = 1}^P \frac{\left(\frac{MSE_i}{\sum_{j = 1}^P MSE_j}\right)^{-1}}{\sum_{j = 1}^P \left(\frac{MSE_j}{\sum_{k = 1}^P MSE_k}\right)^{-1}} f_i = \sum_{i = 1}^P \frac{\frac{1}{MSE_i}}{\sum_{j = 1}^P \frac{1}{MSE_j}} f_i $$
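A sketch of the inverse-MSE weighting (toy data; in practice the MSEs come from a hold-out window):

```python
# Inverse-MSE combination: weight each model by 1/MSE_i, normalised to sum to 1.
import numpy as np

rng = np.random.default_rng(3)
T, P = 200, 3
y = rng.normal(size=T)
F = y[:, None] + rng.normal(scale=0.5, size=(T, P))   # toy forecasts

mse = ((F - y[:, None]) ** 2).mean(axis=0)            # MSE_i per model
w = (1.0 / mse) / (1.0 / mse).sum()                   # weights sum to 1
f_comb = F @ w                                        # combined forecast per period
```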

Best individual (BI)

Basically (ex-post) model selection  

$$ f^c = \sum_{i = 1}^P w_i f_i, \quad \text{where} \quad w_i = \begin{cases} 1 & \text{if } MSE_i \leq MSE_j \;\; \forall j \neq i \\ 0 & \text{otherwise} \end{cases} $$
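In code this is a one-liner (toy MSE values):

```python
# Best individual: put all weight on the model with the lowest past MSE.
import numpy as np

mse = np.array([4.7, 3.3, 3.1])   # toy per-model MSEs
w = np.zeros_like(mse)
w[np.argmin(mse)] = 1.0           # w_i = 1 for the best model, 0 otherwise
```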

Housing price forecasting (Example)

There are 14 attributes in each case of the dataset. They are:

  1. CRIM - per capita crime rate by town
  2. ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS - proportion of non-retail business acres per town.
  4. CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
  5. NOX - nitric oxides concentration (parts per 10 million)
  6. RM - average number of rooms per dwelling
  7. AGE - proportion of owner-occupied units built prior to 1940
  8. DIS - weighted distances to five Boston employment centres
  9. RAD - index of accessibility to radial highways
  10. TAX - full-value property-tax rate per $10,000
  11. PTRATIO - pupil-teacher ratio by town
  12. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT - % lower status of the population
  14. MEDV - Median value of owner-occupied homes in $1000's

Description of the Boston dataset. Source: U.S. Census Service.
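A minimal sketch of this experiment; loading the Boston data from OpenML is an assumption (the dataset was removed from scikit-learn itself), only a few of the listed models are fitted, and the numbers will not match the table below:

```python
# Fit a few individual models on a train split, then compare their test RMSE
# with the simple (equal-weight) combination.
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

boston = fetch_openml(name="boston", version=1, as_frame=True)
X = boston.data.astype(float)     # CHAS/RAD come back categorical; cast them
y = boston.target.astype(float)   # MEDV, in $1000's
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {"linear": LinearRegression(),
          "boosting": GradientBoostingRegressor(random_state=0),
          "random forest": RandomForestRegressor(random_state=0)}
preds = {}
for name, model in models.items():
    preds[name] = model.fit(X_tr, y_tr).predict(X_te)
    print(name, np.sqrt(mean_squared_error(y_te, preds[name])))

f_simple = np.mean(list(preds.values()), axis=0)      # equal-weight combination
print("simple", np.sqrt(mean_squared_error(y_te, f_simple)))
```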

Housing price forecasting

Individual forecasts:

| Model | RMSE |
|---|---|
| Linear | 4.73 |
| Principal component regression | 7.62 |
| Boosting | 3.85 |
| Random forests | 3.26 |
| Support vector machine | 3.06 |
| Neural network | 3.97 |

Forecast combinations:

| Method | RMSE |
|---|---|
| Simple | 3.64 |
| OLS | 2.77 |
| LAD | 2.77 |
| Variance based | 3.2 |
| CLS | 2.95 |
| BI | 3.06 |

Weights

GDP measurements

“The current system emphasizes data on spending, but the bureau also collects data on income. In theory the two should match perfectly - a penny spent is a penny earned by someone else. But estimates of the two measures can diverge widely” [Aruoba et al., 2015]

Some discussion

Many familiar techniques can be cast in terms of averaging:

$$ D_t = (1-\lambda) \sum_{j=0}^{\infty} \lambda^{j} \, \varepsilon_{t-j} \varepsilon^{\prime}_{t-j} = (1-\lambda)\, \varepsilon_{t} \varepsilon^{\prime}_{t} + \lambda D_{t-1} $$
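The EWMA covariance above is exactly such a weighted average; a minimal sketch of the recursion (toy residuals; $\lambda = 0.94$ is a common choice):

```python
# EWMA covariance: D_t = (1 - lambda) * eps_t eps_t' + lambda * D_{t-1}.
import numpy as np

rng = np.random.default_rng(4)
lam = 0.94
eps = rng.normal(size=(500, 2))    # toy residual series for 2 variables

D = np.cov(eps[:30].T)             # initialise with a sample estimate
for e in eps[30:]:
    D = (1 - lam) * np.outer(e, e) + lam * D
```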

Other ideas

  • Different regimes
  • Dynamic model averaging

Why not use it?

  1. Interpretation is lost
  2. Does not always add value (garbage in $\Rightarrow$ garbage out)

Why use it?

  1. Good "hedge" against wrong modelling choices
  2. No consensus on the best approach
  3. Simple average is very robust
  4. Useful in changing environments where structural breaks are likely

Thank you

Questions?