Over Christmas I read two papers that had long been resting in my “to-read” list.

I was pleasantly surprised by the synergy between the two papers, given that they were published almost a decade apart. Here are a couple of excerpts from each.

Paper 1 compares the performance of 17 algorithm families – each family with numerous variants, totaling 179 classifiers – across 121 data sets. They find that

“The classifiers most likely to be the bests are the random forest versions”

and

“the best of which are implemented in R and accessed via caret.”

Below is Figure 2 (right) from the paper which I find relevant to show here:

The x-axis shows the different data sets (121 in total). The blue line is the accuracy obtained by the best classifier. The red line is the accuracy obtained by the random forest algorithm (parallel version). For almost all data sets the accuracy achieved by the random forest is fairly close to the best, or is itself the best. Sure, on some data sets (such as data sets 41 and 70, see the downward spikes) it performs much worse than the best classifier for that data set. Still, the takeaway stands: random forest is **likely** to do very well. Of course, neural networks are no slouches and also appear among the top 10 performers in this exercise.

Paper 2 (by the way, it’s open access so you can download the published version) is essentially a set of shared thoughts from experienced practitioners regarding the M5 competition – forecasting 42,840 time series. From the paper:

“Our central observation is that tree-based methods can effectively and robustly be used as blackbox learners” [emphasis mine]

The winning algorithm in the M5 competition was a tree-based (gradient boosting) variant. The authors make the point, and I agree, that one important difference between tree-based methods and neural networks is the availability of at-the-ready software; tree-based methods enjoy more mature publicly available software. Mature in the sense that proper default values for hyperparameters are established, and that the implementations are easy to configure.

A second point is something you may already know: “The difficulty of tuning these [neural network] models makes published results difficult to reproduce and extend, and makes even the original investigation of such methods more of an art than a science.” (quoted from Algorithms for Hyper-Parameter Optimization). There is no way to sugar-coat it: neural networks are far more sensitive (less robust) to hyperparameters than tree-based methods. Neural networks are also quite sensitive to feature scaling, there is a large number of hyperparameters to tweak, and tweaking many of them substantially alters expected forecasting performance. This lack of robustness makes it hard to promise the same out-of-sample performance as the one observed on the validation set. That said, neural networks are fashionable nevertheless, especially because of their capacity to flex themselves to capture a complex underlying data generating process. A prominent example is the M4 competition, where the winning algorithm relied heavily on neural networks.

So when to use what?

Neural networks are more flexible than tree-based methods; they can approximate any reasonable function to near perfection. But as you read above, tree-based methods are likely to perform well in many cases, and with less effort. Therefore I conclude that if the problem is not overly complex, tree-based algorithms should be your go-to; they have earned a “good-enough” status in my mind. If the problem is complex and highly non-linear, and you have enough data, time to fiddle with hyperparameters, and sufficient computing power to do so (without waiting two weeks for the results, I mean), neural networks are decidedly worth the trouble.

As per usual at this point in time, I check my blog’s traffic analytics to see which were the most popular pieces last year. Without further ado...

First:

Correlation and Correlation Structure (6) – Distance Correlation (08:33 minutes average time on page)

Second:

Similarity and Dissimilarity Metrics – Kernel Distance (11:28 minutes average time on page)

Third:

What is the Kernel Trick? (11:51 minutes average time on page)

My ‘favorite post’ spot this year is occupied by two:

Hyper-Parameter Optimization using Random Search and Understanding Convolutional Neural Networks.

On the left (scroll down) you can find the most popular posts from all previous years.

To my readership: **thank you** for reading, sharing, for your emails, and for your corrections, comments and questions. A happy, **healthy**, and productive 2023!

pic credit: Morvanic Lee


While the meanings of spurious correlation and spurious regression are common knowledge nowadays, much less is understood about spurious factors. This post draws your attention to recent, top-shelf research flagging the risks around spurious factor analysis. While formal solutions are still pending, there are a couple of heuristics we can use to detect possible problems.

Since you know what spurious correlation is, it’s easy to board the train of thought at this station. When two variables, think prices of two stocks, drift upwards or downwards simultaneously, this “drifting” fact alone is enough to spike the correlation, regardless of the actual statistical relation between the two variables. Now, factors are often extracted from cross-sectional data using principal component analysis (PCA). The numerical procedure starts with the computation of the correlation/covariance matrix. Therefore, with spurious entries in the correlation matrix, the extracted factors would over-represent the common variation in the data – sometimes absurdly so.

The problem is serious on at least a couple of levels. First, the deception effect is substantial: the first factor extracted from random walk data without any common factors would falsely claim to explain circa 61% of the variation in the data. Below you can find code which shows how serious a problem this actually is. Second, presently there is no formal way of solving this. But there are a couple of things we *can* do.
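To see the deception in miniature, here is a quick simulation of my own (a numpy sketch, not the paper’s code, which appears further below): generate independent Gaussian random walks with no common factor whatsoever, and look at the share of variance the first principal component claims. The sketch also previews the first-differencing remedy.

```python
import numpy as np

rng = np.random.default_rng(7)
P, T = 128, 710   # cross-section size and series length, as in the paper's simulation

# P independent Gaussian random walks of length T -- no common factor whatsoever
data = rng.normal(size=(P, T)).cumsum(axis=1)
data = data - data.mean(axis=1, keepdims=True)   # demean each series over time

# eigvalsh returns eigenvalues in ascending order; the last one is the largest
eigvals = np.linalg.eigvalsh(data @ data.T / (P * T))
share = eigvals[-1] / eigvals.sum()
print(f"variance 'explained' by the first factor: {share:.0%}")   # hovers around 60%

# First-differencing removes the stochastic trends, and the spurious factor with them
diffed = np.diff(data, axis=1)
eigvals_d = np.linalg.eigvalsh(diffed @ diffed.T / diffed.size)
share_d = eigvals_d[-1] / eigvals_d.sum()
print(f"after first differences: {share_d:.0%}")                  # a few percent at most
```

The data is pure noise, yet the first principal component of the level data routinely claims a large chunk of the total variation; on the differenced data the first component claims next to nothing.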

The paper Spurious Factor Analysis (see references for a working version) suggests a couple of heuristics to cope with spurious factors. The first is to always compare factors estimated from level data with factors estimated from first-differenced data; a mismatch between the two calls for more investigation. A second, less formal strategy is to eyeball a time series plot of the extracted factors and compare it with a plot of completely spurious factors. If the two plots resemble each other, then it’s time to sound the alarm.

The following code replicates the Monte Carlo simulation presented in the aforementioned paper (Table A.II). It generates N i.i.d. Gaussian random walks of length T; change `(P, TT)` in the code below to the data dimensions you wish to simulate. The Matlab code is taken directly from the paper’s supplementary material, and the R code is my own translation (so any bugs are my doing).

```matlab
% This code is directly from the Econometrica paper
% FRED-MD
p=128; T=710; N=p;
CNT=min([sqrt(p);sqrt(T)]);
U=toeplitz([1;zeros(T-1,1)],ones(1,T));
rmax=15;
for i=1:10000
  epsil=randn(p,T);
  DATA=epsil*U;
  DATA=DATA-(mean(DATA'))'*ones(1,T);
  if T<=p
    [UU,D]=eig(DATA'*DATA/(N*T));
  else
    [UU,D]=eig(DATA*DATA'/(N*T));
  end
  d=sort(real(diag(D)));
  V=flipdim(cumsum(d),1);
  pen1=V(2:(rmax+1),1)*(0:rmax)*(T/(4*log(log(T))))*((N+T)/(N*T))*log((N*T/(N+T)));
  pen2=V(2:(rmax+1),1)*(0:rmax)*(T/(4*log(log(T))))*((N+T)/(N*T))*log(CNT^2);
  pen3=V(2:(rmax+1),1)*((0:rmax).*(N+T-(0:rmax))/(N*T))*(T/(4*log(log(T))))*log(N*T);
  pen1=pen1'; pen2=pen2'; pen3=pen3';
  IPC1=V(1:rmax+1,1)*ones(1,rmax)+pen1;
  IPC2=V(1:rmax+1,1)*ones(1,rmax)+pen2;
  IPC3=V(1:rmax+1,1)*ones(1,rmax)+pen3;
  for j=1:rmax
    [min1,khat1]=min(IPC1(1:j+1,j));
    [min2,khat2]=min(IPC2(1:j+1,j));
    [min3,khat3]=min(IPC3(1:j+1,j));
    khat1=khat1-1; khat2=khat2-1; khat3=khat3-1;
    Kh1(i,j)=khat1; Kh2(i,j)=khat2; Kh3(i,j)=khat3;
  end
end
for i=1:rmax
  for t=0:rmax
    Tablek1(rmax+1-t,i)=sum(Kh1(:,i)==t)/100;
    Tablek2(rmax+1-t,i)=sum(Kh2(:,i)==t)/100;
    Tablek3(rmax+1-t,i)=sum(Kh3(:,i)==t)/100;
  end
end
```

```r
# The following function is for determining
# the number of factors according to the IPC criteria
library(magrittr)  # for the %>% pipe used below

ICP <- function(X, rmax) {
  X <- as.matrix(X)  # (was as.matrix(dat), a bug: use the function argument)
  TT = dim(X)[1]
  P = dim(X)[2]
  d <- eigen( (t(X) %*% X) / (TT * P) )$values
  term1 <- TT / (4 * log(log(TT)))
  term2 <- (P + TT) / (P * TT)
  term3 <- log((P * TT / (P + TT)))
  pen1 = d[2:(rmax + 1)] * t(replicate(rmax, c(0:rmax))) * term1 * term2 * term3
  pen2 = d[2:(rmax + 1)] * t(replicate(rmax, c(0:rmax))) * term1 * term2 *
    log( (min(sqrt(TT), sqrt(P)))^2 )
  pen3 = d[2:(rmax + 1)] * t(replicate(rmax, c(0:rmax))) *
    (P + TT - c(0:rmax)) / (TT * P) * term1 * log(P * TT)
  ipc1 <- replicate(rmax, d[1:(rmax + 1)]) + t(pen1)
  ipc2 <- replicate(rmax, d[1:(rmax + 1)]) + t(pen2)
  ipc3 <- replicate(rmax, d[1:(rmax + 1)]) + t(pen3)
  khat1 <- khat2 <- khat3 <- NULL
  for (j in 1:rmax) {
    khat1[j] <- which.min(ipc1[1:(j + 1), j]) - 1
    khat2[j] <- which.min(ipc2[1:(j + 1), j]) - 1
    khat3[j] <- which.min(ipc3[1:(j + 1), j]) - 1
  }
  list(khat1[rmax], khat2[rmax], khat3[rmax])
}

khat <- NULL
ss <- 20    # number of simulations
TT <- 710   # series length
P <- 128    # number of series
sdd <- 1
rmax <- 6
for (i in 1:ss) {
  tmp <- rep(TT, P) %>% lapply(rnorm, 0, sd = sdd)
  tmp2 <- do.call(cbind, tmp)
  x <- rep(1, TT)
  x <- toeplitz(x)
  x[lower.tri(x)] <- 0
  dat <- (t(tmp2) %*% x) %>% t                   # cumulate the noise into random walks
  dat <- dat - t(replicate(TT, colMeans(dat)))   # demean each series
  khat[i] <- ICP(dat, rmax = rmax)[[1]]
}
khat %>% table
```

Googling “CNN” you typically find explanations about the convolution operator which is a defining characteristic of CNN, for example the following animation:

This is quite different from images that you see for general neural networks which are NOT convolutional neural networks:

It is instructive to appreciate the relation between general deep learning models, which are not domain-specific, and CNNs, which are particularly tailored for computer vision.

Why? Because you will better understand frequently mentioned concepts that are rarely explained well:

- Sparsity of connections
- Parameter sharing
- Hierarchical feature engineering

Vectorization will help us here. Vectorize the input matrix, thinking of each pixel as an individual input. The filter convolves over the matrix, creating what is called a feature map. Vectorize that feature map as well, and consider each entry in it as a hidden unit. The example below is as simple as can be, for clarity.

The image matrix is [3 by 3] and the filter is [2 by 2]. Since the output will also be [2 by 2], after vectorization we have a vector of 4 hidden units, denoted {f1, f2, f3, f4}. I enumerated the pixels and colored the weights for clarity. Press play.

While a fully connected layer would have 9 active weights connecting the 9 inputs to each of the hidden units, here we only have 4 connecting weights. Sparse means thinly scattered or distributed; not thick or dense. Since we only allow for 4, rather than the possible 9, we describe this as sparsity of connections.

The filter has 4 parameters. Those are the same parameters for each of the hidden units – in that sense the units share the same values. Hence the somewhat confusing term parameter sharing (sounds like they are sharing pizza). The reason it’s a good idea to share parameters is that if a shape is to be learned, it should be learned irrespective of its exact location in the image. We don’t want to train one set of parameters to recognize a cat on the left side of the image, and another set of parameters to recognize a cat on the right side of the image.
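The sparsity and sharing points can be made concrete with a small numpy sketch of the toy example (my own illustration; the pixel values and filter weights are arbitrary). The convolution is rewritten as a [4 by 9] matrix multiplying the vectorized image, and we then count the weights.

```python
import numpy as np

# 3x3 image, 2x2 filter -> 2x2 feature map (4 hidden units), as in the toy example
image = np.arange(1.0, 10.0).reshape(3, 3)   # pixels enumerated 1..9
filt = np.array([[1.0, 2.0],
                 [3.0, 4.0]])                # the 4 shared parameters

# Direct "valid" convolution (cross-correlation, as in most deep learning libraries)
fmap = np.array([[(image[i:i+2, j:j+2] * filt).sum() for j in range(2)]
                 for i in range(2)])

# The same operation as a matrix multiplying the vectorized image;
# row k of W holds the incoming weights of hidden unit f(k+1)
W = np.zeros((4, 9))
for i in range(2):
    for j in range(2):
        for a in range(2):
            for b in range(2):
                W[2 * i + j, 3 * (i + a) + (j + b)] = filt[a, b]

assert np.allclose(W @ image.ravel(), fmap.ravel())
print("active weights per hidden unit:", np.count_nonzero(W, axis=1))  # 4 of 9: sparsity
print("distinct weight values in W:", np.unique(W[W != 0]).size)       # 4: sharing
```

A fully connected layer here would be a dense [4 by 9] matrix with 36 free parameters; the convolutional layer gets by with 4 values placed in 16 positions.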

With parameter sharing we can do away with estimating a fully connected layer (36 parameters) and instead estimate only 4 parameters. But this is quite restrictive, which is the reason for adding many (say 64) feature maps, to retrieve the much-needed flexibility for a good model. This is a nice bridge to hierarchical feature engineering.

The following is taken from the paper Imagenet classification with deep convolutional neural networks.

Each small square is a matrix, representing weights which have been learned by the model. Read it from top to bottom. The top three rows are weights/filters learned early in the network. The bottom three rows are weights/filters learned thereafter, taking their inputs from the top three, among others.

Parameter sharing is so restrictive that it forces the model to make choices. These choices are directed by the need to minimize the loss function. You can see that weights/filters associated with early layers focus on “primitive” feature engineering. Each (convolutional) layer has so little “weights budget”, due to parameter sharing, that the model first picks up orientation and edges, almost completely ignoring colors. Subsequent layers in the network (the bottom three rows showing the weights/filters), once “primitive” features have been learned by earlier layers, are “allowed” to focus on more sophisticated features involving colors.

Why this pattern? Because more images are recognized by the “primitive” set, so the network starts there. Once the model is happy with one set of weights, it can move on to minimize the loss further by trying to recognize additional, more complex patterns. A similar observation is made in Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.

Footnotes

* We have seen parameter sharing in the past in the context of volatility forecasting. In particular, equation (4) in the paper A Simple Approximate Long-Memory Model of Realized Volatility assigns the same AR coefficient to different volatility lags.
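For readers unfamiliar with that model, the sharing restriction is easy to sketch: rather than estimating 22 free AR coefficients, the lags are grouped into daily, weekly, and monthly averages, each carrying a single shared coefficient. A toy numpy illustration on simulated data (my own sketch, not the paper’s code):

```python
import numpy as np

rng = np.random.default_rng(42)
rv = np.abs(rng.normal(size=1000)) + 0.1   # toy stand-in for a realized volatility series

# HAR-style design matrix: 22 lags compressed into 3 shared-coefficient regressors
rows, y = [], []
for t in range(22, rv.size):
    daily = rv[t - 1]                 # yesterday
    weekly = rv[t - 5:t].mean()       # average over the last 5 days
    monthly = rv[t - 22:t].mean()     # average over the last 22 days
    rows.append([1.0, daily, weekly, monthly])
    y.append(rv[t])
X, y = np.array(rows), np.array(y)

# Ordinary least squares: 4 parameters instead of the 23 of an unrestricted AR(22)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("HAR coefficients (const, daily, weekly, monthly):", beta)
```

All lags within a week (or month) share one coefficient through the averaging, which is exactly the parameter-sharing flavor discussed above.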


I use the `quantile` function for the example. There are many ways to compute the estimate of a quantile, and all those various ways are coded into the one `quantile` function. The function has the default argument `type = 7`, which indicates the particular way we wish to estimate our quantiles. Given that *R* is an open-source language, you can easily find the code for any function, then “fish out” only the lines that you actually need. While the code for the `quantile` function is around 90 lines (given below), the real labor is carried out mainly by lines 49 to 58 – the main workhorse (for the `type = 7` default).

Now, let’s write our own version of the `quantile` function; call it `lean_quantile`. Then we make sure our `lean_quantile` does what it’s meant to do, and compare the execution time.

```r
lean_quantile <- function(x, probs = seq(0, 1, 0.25)) {
  n <- length(x)
  np <- length(probs)
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  x <- sort(x, partial = unique(c(lo, hi)))
  qs <- x[lo]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
  qs
}
```

Check that our `lean_quantile` does what it’s meant to do:

```r
tmpp <- rnorm(10)
all( quantile(tmpp) == lean_quantile(tmpp) )
# [1] TRUE
```

Now we can compare the execution time (more on timing and profiling code):

```r
library(microbenchmark)  # citation("microbenchmark")
bench <- microbenchmark(quantile(rnorm(10)),
                        lean_quantile(rnorm(10)),
                        times = 10^4)
bench
# Unit: microseconds
#                      expr  min   lq mean median   uq  max neval
#       quantile(rnorm(10)) 79.1 84.3 96.3   86.1 89.0 3907 10000
#  lean_quantile(rnorm(10)) 27.9 31.6 36.2   33.4 34.8 4741 10000
```

Execution time is reduced by over 60%, and we did not have to work very hard for it. We could do more, diving further to improve the `sort` function which our `lean_quantile` uses, but you get the idea.

Is it a free lunch? Of course not.

It takes a long time to master efficient programming, and the functions you find in the public domain are probably well scrutinized – before and after they go up there. When you meddle with the internals you risk making a mistake: erasing an important line, creating unintended consequences, or messing up the original behavior. So meticulous checks are in order.

While some functions are written so efficiently that you will find very little value in pulling out just the workhorse, with most functions written for the general public you will certainly be able to squeeze out some time-profit. As you can see, this “get the gist” tip has excellent potential to save a lot of waiting time.

```r
# Part of the R package, https://www.R-project.org
#
# Copyright (C) 1995-2014 The R Core Team
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# https://www.R-project.org/Licenses/

quantile <- function(x, ...) UseMethod("quantile")

quantile.POSIXt <- function(x, ...)
    .POSIXct(quantile(unclass(as.POSIXct(x)), ...), attr(x, "tzone"))

quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
                             names = TRUE, type = 7, ...)
{
    if(is.factor(x)) {
        if(!is.ordered(x) || ! type %in% c(1L, 3L))
            stop("factors are not allowed")
        lx <- levels(x)
    } else lx <- NULL
    if (na.rm)
        x <- x[!is.na(x)]
    else if (anyNA(x))
        stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
    eps <- 100*.Machine$double.eps
    if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
        stop("'probs' outside [0,1]")
    n <- length(x)
    if(na.p <- any(!p.ok)) { # set aside NA & NaN
        o.pr <- probs
        probs <- probs[p.ok]
        probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
    }
    np <- length(probs)
    if (n > 0 && np > 0) {
        if(type == 7) { # be completely back-compatible
            index <- 1 + (n - 1) * probs
            lo <- floor(index)
            hi <- ceiling(index)
            x <- sort(x, partial = unique(c(lo, hi)))
            qs <- x[lo]
            i <- which(index > lo)
            h <- (index - lo)[i] # > 0 by construction
            ## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
            ## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
            qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
        } else {
            if (type <= 3) {
                ## Types 1, 2 and 3 are discontinuous sample qs.
                nppm <- if (type == 3) n * probs - .5 # n * probs + m; m = -0.5
                        else n * probs # m = 0
                j <- floor(nppm)
                h <- switch(type,
                            (nppm > j),                           # type 1
                            ((nppm > j) + 1)/2,                   # type 2
                            (nppm != j) | ((j %% 2L) == 1L))      # type 3
            } else {
                ## Types 4 through 9 are continuous sample qs.
                switch(type - 3,
                       {a <- 0; b <- 1},   # type 4
                       a <- b <- 0.5,      # type 5
                       a <- b <- 0,        # type 6
                       a <- b <- 1,        # type 7 (unused here)
                       a <- b <- 1 / 3,    # type 8
                       a <- b <- 3 / 8)    # type 9
                ## need to watch for rounding errors here
                fuzz <- 4 * .Machine$double.eps
                nppm <- a + probs * (n + 1 - a - b) # n*probs + m
                j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
                h <- nppm - j
                if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
            }
            x <- sort(x, partial =
                      unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n)))
            x <- c(x[1L], x[1L], x, x[n], x[n])
            ## h can be zero or an index where x[0] is not available:
            qs <- x[j+2L]
            qs[h == 1] <- x[j+3L][h == 1]
            other <- (0 < h) & (h < 1)
            if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
        }
    } else {
        qs <- rep(NA_real_, np)
    }
    if(is.character(lx))
        qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)
    if(names && np > 0L) {
        names(qs) <- format_perc(probs)
    }
    if(na.p) { # do this more elegantly (?!)
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr)) # suppress names
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    } else qs
}
```

As a side note, it would be nice to do this in Python also, but the source code for the numpy quantile function is heavily “decorated”. Comment if you know how to create the Python counterpart.
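For what it’s worth, here is a sketch of what such a counterpart could look like: a direct numpy translation of `lean_quantile` for the `type = 7` case (my own translation, so the usual caveats apply):

```python
import numpy as np

def lean_quantile(x, probs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Type-7 quantile estimate, mirroring the R workhorse lines."""
    x = np.sort(np.asarray(x, dtype=float))   # np.partition could mimic R's partial sort
    probs = np.asarray(probs, dtype=float)
    n = x.size
    index = 1 + (n - 1) * probs               # 1-based fractional index, as in R
    lo = np.floor(index).astype(int)
    hi = np.ceil(index).astype(int)
    qs = x[lo - 1]                            # shift to Python's 0-based indexing
    h = index - lo
    i = h > 0
    qs[i] = (1 - h[i]) * qs[i] + h[i] * x[hi[i] - 1]
    return qs
```

numpy’s own `np.quantile` defaults to the same type-7 (“linear”) rule, so `np.allclose(lean_quantile(z), np.quantile(z, [0, 0.25, 0.5, 0.75, 1]))` should hold for any numeric `z`.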
