In a previous post on this subject, we related the loadings of the principal components (PC’s) from the singular value decomposition (SVD) to regression coefficients of the PC’s onto the X matrix. This is normal given the fact that the factors are supposed to condense the information in X, and what better way to do that than to minimize the sum of squares between a linear combination of X (the factors) to the X matrix itself. A reader was asking where does principal component regression (PCR) enter. Here we relate the PCR to the usual OLS.
Beauty.. really? well, beauty is in the eye of the beholder.
The other day, Harvey Campbell from Duke University gave a talk where I work. The talk- bearing the exciting name “Backtesting” was based on a paper by the same name.
The authors tackle the important problem of data-snooping; we need to account for the fact that we conducted many trials until we found a strategy (or a variable) that ‘works’. Accessible explanations can be found here and here. In this day and age, the ‘story’ behind what you are doing is more important than ever, given the things you can do using your desktop/laptop.
In the previous post we reviewed a way to handle the problem of inference after model selection. I recently read another related paper which goes about this complicated issues from a different angle. The paper titled ‘A significance test for the lasso’ is a real step forward in this area. The authors develop the asymptotic distribution for the coefficients, accounting for the selection step. A description of the tough problem they successfully tackle can be found here.
The usual way to test if variable (say variable j) adds value to your regression is using the F-test. We once compute the regression excluding variable j, and once including variable j. Then we compare the sum of squared errors and we know what is the distribution of the statistic, it is F, or , depends on your initial assumptions, so F-test or -test. These are by far the most common tests to check if a variable should or should not be included. Problem arises if you search for variable j beforehand.
We are all standing on the shoulders of giants. Bradley Efron is one such giant. With the invention of the bootstrap in 1979 and later with his very influential 2004 paper about the Least Angle Regression (and the accompanied software written in R).
If you have 10 possible independent regressors, and none of which matter, you have a good chance to find at least one is important.
In the past, I wrote about robust regression. This is an important tool which handles outliers in the data. Roger Koenker is a substantial contributor in this area. His website is full of useful information and code so visit when you have time for it. The paper which drew my attention is “Quantile Autoregression” found under his research tab, it is a significant extension to the time series domain. Here you will find short demonstration for stuff you can do with quantile autoregression in R.
Five months ago I generated forecasts for the Eurozone Misery index. I used the built-in “FitAR” package in R. Using different models differing in their memory length (how many lags were considered for each model) 24 months ahead forecasts were generated. Might be interesting to see how accurate are the forecasts. The previous post is updated and few bugs corrected in the code. The updated data is public and can be found here. It is the sum of inflation rate and unemployment rate in the Euro-zone area.
Some knowledge about the bootstrapping procedure is assumed.
In time series analysis, Information Criteria can be found under every green tree. These are function to help you determine when to stop adding explanatory variables to your model.
Bootstrapping in its general form (“ordinary” bootstrap) relies on IID observations which staples the theory backing it. However, time series are a different animal and bootstrapping time series requires somewhat different procedure to preserve dependency structure.