Googling “CNN”, you typically find explanations of the convolution operator, which is a defining characteristic of CNNs, for example the following animation:

This is quite different from images that you see for general neural networks which are NOT convolutional neural networks:

It is instructive to appreciate the relation between general, domain-agnostic deep learning models and CNNs, which are tailored particularly for computer vision.

Why? Because you will better understand frequently mentioned concepts that are rarely explained well:

- Sparsity of connections
- Parameter sharing
- Hierarchical feature engineering

Vectorization will help us here. Vectorize the input matrix: think of each pixel as an individual input. The filter convolves over the matrix, creating what is called a feature map. Vectorize that feature map as well, and consider each entry in it as a hidden unit. The example below is as simple as could be, for clarity.

The image matrix is [3 by 3] and the filter is [2 by 2]. Since the output will also be [2 by 2], after vectorization we have a vector of 4 hidden units, denoted {f1, f2, f3, f4}. I enumerated the pixels and colored the weights for clarity. Press play.

While a fully connected layer would have 9 active weights connecting the 9 inputs to each of the hidden units, here we only have 4 connecting weights. Sparse means thinly scattered or distributed; not thick or dense. So, since we only allow 4 of the possible 9, we describe it as sparsity of connections.
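To make this concrete, here is a small numpy sketch (my own illustration; the pixel values and filter numbers are made up) showing that the convolution above is just a fully connected layer whose [4 by 9] weight matrix has only 4 non-zero weights per row:

```python
import numpy as np

image = np.arange(1.0, 10.0).reshape(3, 3)   # pixels enumerated 1..9
w = np.array([[1.0, 2.0],
              [3.0, 4.0]])                   # the 4 filter parameters

# Direct convolution: slide the 2x2 filter over the 3x3 image,
# producing the 2x2 feature map {f1, f2, f3, f4}.
fmap = np.array([[(image[i:i + 2, j:j + 2] * w).sum() for j in range(2)]
                 for i in range(2)])

# The same operation as a layer acting on the vectorized image:
# a [4 by 9] weight matrix where each row holds the filter values
# at the positions of its 2x2 patch, and zeros elsewhere.
W = np.zeros((4, 9))
for k, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    for a in range(2):
        for b in range(2):
            W[k, (i + a) * 3 + (j + b)] = w[a, b]

f = W @ image.ravel()   # the 4 hidden units, equal to fmap.ravel()
```

Each row of `W` has only 4 non-zeros out of 9 (sparsity of connections), and the same 4 filter values reappear in every row (parameter sharing).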

The filter has 4 parameters. Those are the same parameters for each of the hidden units – in that sense they share the same value. Hence the somewhat confusing term parameter sharing (sounds like they are sharing pizza). The reason it’s a good idea to share parameters is that if a shape is to be learned, it should be learned irrespective of its exact location in the image. We don’t want to train one set of parameters to recognize a cat on the left side of the image, and train another set of parameters to recognize a cat on the right side of the image.

With parameter sharing we can do away with estimating a fully connected layer (36 parameters). Instead, we only estimate 4 parameters. But this is quite restrictive, which is the reason for adding many (say 64) feature maps to retrieve the much-needed flexibility for a good model. This is a nice bridge to hierarchical feature engineering.

The following is taken from the paper Imagenet classification with deep convolutional neural networks.

Each small square is a matrix, representing weights which have been learned by the model. Read it from top to bottom. The top three rows are weights/filters learned early in the network. The bottom three rows are weights/filters learned thereafter, taking their inputs from the top three, among others.

Parameter sharing is so restrictive that it forces the model to make choices, and these choices are directed by the need to minimize the loss function. You can see that the weights/filters associated with early layers focus on “primitive” feature engineering. Each (convolutional) layer has so little “weights-budget”, due to parameter sharing, that the model first picks up orientation and edges, almost completely ignoring colors. Subsequent layers in the network (the bottom three rows of weights/filters), once the “primitive” features have been learned by earlier layers, are “allowed” to focus on more sophisticated features involving colors.

Why this pattern? Because more images are recognized by the “primitive” set, so the network starts there. Once the model is happy with one set of weights, it can move on to minimize the loss further by trying to recognize additional, more complex patterns. A similar observation is made in Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations.

Footnotes

* We have seen parameter sharing in the past in the context of volatility forecasting. In particular, equation (4) in the paper A Simple Approximate Long-Memory Model of Realized Volatility assigns the same AR coefficient to different volatility lags.


I use the `quantile` function for the example. There are many ways to compute the estimate of a quantile, and all those various ways are coded into the one `quantile` function. The function has the default argument `type = 7`, which indicates the particular way we wish to estimate our quantiles. Given that *R* is an open-source language you can easily find the code for any function, and then you can “fish out” only the lines that you actually need. While the code for the `quantile` function is around 90 lines (given below), the real labor is carried out mainly by lines 49 to 58 – the main workhorse (for the `type = 7` default).

Now, let’s write our own version of the `quantile` function; call it `lean_quantile`. Then we make sure our `lean_quantile` does what it’s meant to do, and compare the execution time.

```r
lean_quantile <- function(x, probs = seq(0, 1, 0.25)) {
  n <- length(x)
  np <- length(probs)
  index <- 1 + (n - 1) * probs
  lo <- floor(index)
  hi <- ceiling(index)
  x <- sort(x, partial = unique(c(lo, hi)))
  qs <- x[lo]
  i <- which(index > lo)
  h <- (index - lo)[i]
  qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
  qs
}
```

Check that our `lean_quantile` does what it’s meant to do:

```r
tmpp <- rnorm(10)
all( quantile(tmpp) == lean_quantile(tmpp) )
# [1] TRUE
```

Now we can compare the execution time (more on timing and profiling code):

```r
library(microbenchmark) # citation("microbenchmark")
bench <- microbenchmark(quantile(rnorm(10)),
                        lean_quantile(rnorm(10)), times = 10^4)
bench
# Unit: microseconds
#                      expr  min   lq mean median   uq  max neval
#       quantile(rnorm(10)) 79.1 84.3 96.3   86.1 89.0 3907 10000
#  lean_quantile(rnorm(10)) 27.9 31.6 36.2   33.4 34.8 4741 10000
```

Execution time is reduced by over 60%, and we did not have to work very hard for it. We could do more, diving further to improve the `sort` function which our `lean_quantile` uses, but you get the idea.

Is it a free lunch? Of course not.

It takes a long time to master efficient programming, and the functions you find in the public domain are probably well scrutinized – before and after they go up there. When you mingle with the internals you risk making a mistake: erasing an important line, creating unintended consequences, or messing up the original behavior. So meticulous checks are good to do.

While some functions are written so efficiently that you will find very little value in pulling out just the workhorse, with most functions written for the general public you will certainly be able to squeeze out some time-profit. As you can see this “get the gist” tip has excellent potential to save a lot of waiting time.

```r
# Part of the R package, https://www.R-project.org
#
# Copyright (C) 1995-2014 The R Core Team
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# A copy of the GNU General Public License is available at
# https://www.R-project.org/Licenses/

quantile <- function(x, ...) UseMethod("quantile")

quantile.POSIXt <- function(x, ...)
    .POSIXct(quantile(unclass(as.POSIXct(x)), ...), attr(x, "tzone"))

quantile.default <- function(x, probs = seq(0, 1, 0.25), na.rm = FALSE,
                             names = TRUE, type = 7, ...)
{
    if(is.factor(x)) {
        if(!is.ordered(x) || ! type %in% c(1L, 3L))
            stop("factors are not allowed")
        lx <- levels(x)
    } else lx <- NULL
    if (na.rm)
        x <- x[!is.na(x)]
    else if (anyNA(x))
        stop("missing values and NaN's not allowed if 'na.rm' is FALSE")
    eps <- 100*.Machine$double.eps
    if (any((p.ok <- !is.na(probs)) & (probs < -eps | probs > 1+eps)))
        stop("'probs' outside [0,1]")
    n <- length(x)
    if(na.p <- any(!p.ok)) { # set aside NA & NaN
        o.pr <- probs
        probs <- probs[p.ok]
        probs <- pmax(0, pmin(1, probs)) # allow for slight overshoot
    }
    np <- length(probs)
    if (n > 0 && np > 0) {
        if(type == 7) { # be completely back-compatible
            index <- 1 + (n - 1) * probs
            lo <- floor(index)
            hi <- ceiling(index)
            x <- sort(x, partial = unique(c(lo, hi)))
            qs <- x[lo]
            i <- which(index > lo)
            h <- (index - lo)[i] # > 0 by construction
            ## qs[i] <- qs[i] + .minus(x[hi[i]], x[lo[i]]) * (index[i] - lo[i])
            ## qs[i] <- ifelse(h == 0, qs[i], (1 - h) * qs[i] + h * x[hi[i]])
            qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
        } else {
            if (type <= 3) {
                ## Types 1, 2 and 3 are discontinuous sample qs.
                nppm <- if (type == 3) n * probs - .5 # n * probs + m; m = -0.5
                        else n * probs # m = 0
                j <- floor(nppm)
                h <- switch(type,
                            (nppm > j),                      # type 1
                            ((nppm > j) + 1)/2,              # type 2
                            (nppm != j) | ((j %% 2L) == 1L)) # type 3
            } else {
                ## Types 4 through 9 are continuous sample qs.
                switch(type - 3,
                       {a <- 0; b <- 1},  # type 4
                       a <- b <- 0.5,     # type 5
                       a <- b <- 0,       # type 6
                       a <- b <- 1,       # type 7 (unused here)
                       a <- b <- 1 / 3,   # type 8
                       a <- b <- 3 / 8)   # type 9
                ## need to watch for rounding errors here
                fuzz <- 4 * .Machine$double.eps
                nppm <- a + probs * (n + 1 - a - b) # n*probs + m
                j <- floor(nppm + fuzz) # m = a + probs*(1 - a - b)
                h <- nppm - j
                if(any(sml <- abs(h) < fuzz)) h[sml] <- 0
            }
            x <- sort(x, partial =
                      unique(c(1, j[j>0L & j<=n], (j+1)[j>0L & j<n], n)))
            x <- c(x[1L], x[1L], x, x[n], x[n])
            qs <- x[j+2L]
            qs[h == 1] <- x[j+3L][h == 1]
            other <- (0 < h) & (h < 1)
            if(any(other)) qs[other] <- ((1-h)*x[j+2L] + h*x[j+3L])[other]
        }
    } else {
        qs <- rep(NA_real_, np)
    }
    if(is.character(lx))
        qs <- factor(qs, levels = seq_along(lx), labels = lx, ordered = TRUE)
    if(names && np > 0L) {
        names(qs) <- format_perc(probs)
    }
    if(na.p) { # do this more elegantly (?!)
        o.pr[p.ok] <- qs
        names(o.pr) <- rep("", length(o.pr)) # suppress names
        names(o.pr)[p.ok] <- names(qs)
        o.pr
    } else qs
}
```

As a side note, it would be nice to do this in Python as well, but the source code for the numpy quantile function is heavily “decorated”. Comment if you know how to create the Python counterpart.
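For what it’s worth, here is one possible numpy sketch of such a counterpart. To be clear, this is my own illustration of the `type = 7` workhorse, not numpy’s actual internals, and the function name `lean_quantile_py` is made up:

```python
import numpy as np

def lean_quantile_py(x, probs=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Type-7 quantile estimate, mirroring the R workhorse above."""
    x = np.sort(np.asarray(x, dtype=float))
    probs = np.asarray(probs, dtype=float)
    index = (x.size - 1) * probs      # 0-based, unlike R's 1-based index
    lo = np.floor(index).astype(int)
    hi = np.ceil(index).astype(int)
    h = index - lo                    # interpolation weight in [0, 1)
    return (1 - h) * x[lo] + h * x[hi]
```

Since the default interpolation of `np.quantile` is also the Hyndman–Fan type 7, the two should agree.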

In this post, number 6 on correlation and correlation structure, I share another dependency measure called *“distance correlation”*. It has been around for a while now (2009, see references). I provide just the intuition, since the math has little to do with the way distance correlation is computed, but rather with the theoretical justification for its practical legitimacy.

Denote the distance correlation as $R$. The following is taken directly from the paper Brownian distance covariance (open access):

Our proposed distance correlation represents an entirely new approach. For all distributions with finite first moments, distance correlation R generalizes the idea of correlation in at least two fundamental ways:

- $R$ is defined for X and Y in arbitrary dimension.
- $R = 0$ characterizes independence of X and Y.

The first point is super useful and far from trivial. You can theoretically calculate distance correlation between two vectors of different lengths (e.g. 12 monthly rates and 250 daily prices)^{*}. The second is a must-have for any aspiring dependence measure. While linear correlation can compute to a very small number **even if the vectors are dependent, or even strongly dependent** (quite easily, mind you), distance correlation is general enough that when it is close to zero, the vectors must be totally independent (linearly and non-linearly).
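As a quick toy illustration of that failure mode (my own example, not from the post): a relation that is fully deterministic, but symmetric, drives the Pearson correlation to essentially zero:

```python
import numpy as np

x = np.linspace(-1, 1, 201)
y = x ** 2                         # y is completely determined by x
pearson = np.corrcoef(x, y)[0, 1]  # essentially zero, by symmetry
```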

Although they don’t present it like this in the paper, the idea is a simple functional extension of a usual probabilistic fact: if two random variables are independent, then their joint distribution is the product of the marginals, $F_{X,Y}(x, y) = F_X(x) F_Y(y)$.

Now, instead of thinking about variables, think about X and Y via functions (in the paper they use characteristic functions): X and Y are independent if and only if $\varphi_{X,Y}(s, t) = \varphi_X(s) \varphi_Y(t)$.

Now you can quantify how far the joint function $\varphi_{X,Y}$ is from the product of the two individual functions $\varphi_X \varphi_Y$. If they are identical, the distance will be zero. Use the complement to get a measure that returns 1 for full dependency and 0 for complete independence. It’s a bit like saying the following: if I observe $E(XY)$, while under independence I would expect $E(X)E(Y)$, then $E(XY) - E(X)E(Y)$ is my measure for how dependent those random variables are. Informally speaking, we compute some sort of “excess dependency over the fully independent case”. We have seen this idea before, talking about asymmetric correlations of equity portfolios.
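Although the theory runs through characteristic functions, the sample statistic is computed from pairwise distances. Here is a minimal numpy sketch of the sample distance correlation (my own illustration of the Székely–Rizzo–Bakirov formula, not code from this post):

```python
import numpy as np

def distance_correlation(x, y):
    """Sample distance correlation between two equal-length vectors."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    y = np.asarray(y, dtype=float).reshape(len(y), -1)
    a = np.linalg.norm(x[:, None] - x[None, :], axis=-1)  # pairwise distances
    b = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
    # Double-centering subtracts the "independent case" from each matrix,
    # in the same spirit as E(XY) - E(X)E(Y).
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()    # squared sample distance covariance
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))
```

For a deterministic but symmetric relation such as y = x², this returns a decidedly positive value even though the Pearson correlation is essentially zero.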

I replicated figure (1) from the previous post on this topic and added the distance correlation measure (denoted here as $R$) for comparison. Afterwards we can say a few words about advantages and disadvantages.

$R$ denotes distance correlation, $\xi$ denotes the “new coefficient of correlation” (as they dub it in the original paper), and $\rho$ denotes the usual Pearson correlation. The distance correlation measure relies on the characteristic functions of the realized vectors (think simply: their probability distribution). Therefore it cares about the profile of dependence rather than the strength of dependence, which is the main takeaway from the figure. By way of contrast, you see that the “new coefficient of correlation” dramatically decreases (from top to bottom) as noise is added to the data. Distance correlation is less sensitive to that. The following figure offers some clarity, I hope:

Regardless of the added noise – from top to bottom in the figure – the relation between the two variables x and y, depicted using the purple smooth line, is very similar. The dependency value reported by the $\xi$ measure is decidedly decreasing as noise is added, because it also considers the noise in the data, while distance correlation focuses only on the underlying dependency structure.

So, you decided to use a dependency measure that captures both linear *and* non-linear dependence.

Should you then use $\xi$, or $R$?

It’s a matter of preference.

If the empirical realization is what matters to you – meaning you would like to account for the noise in the data – the $\xi$ measure is the way to go. If what you care about is the underlying “skeleton profile”, then you should opt for distance covariance/correlation.

Correlation between stocks and bonds is an interesting case in point. These two are undoubtedly correlated, but with a complicated dependence structure which has to do with the economy and anticipated actions by central banks. Let’s see what the three dependency measures report for two relevant tickers: TLT (long-term US bond ETF) and SPY (S&P 500 ETF). I also plot an estimated smoother for the two time series (in green).

Interesting stuff. We observe that:

You may wonder about the sign, but remember that both $\xi$ and distance correlation are tailored to capture also non-linear dependencies, which makes the sign irrelevant (they both range between 0 and 1).

Footnotes

* That said, I have only found implementations that allow for equal vector lengths. But you can code it yourself if you need to.


When I was a PhD candidate my promotor changed the subscript notation I used in my research from $t$ to $h$; I was working with hourly time series data, so the $h$ subscript made more sense. Two things happened at that moment. The first is that the paper became much, much more readable. The second is that I realized the math should be treated as part of the text, rather than “here is the text” and “here are the associated equations”. To all you promotors/supervisors out there, in case you wonder about the impact you make on your protégés: such change-of-notation comments can make a massive difference in their career.

Here are a few “dos and don’ts” which, in my opinion, are worth generalizing; in no particular order:

- If 95% of the time the abbreviation RMSE refers to *Root Mean Squared Error*, then don’t use it for *Relative Mean Squared Error* (as is done here by otherwise superb writers). Something like *RelMSE* makes more sense.
- If you need a running subscript, note that some lower-case letters (say $s$) look very much like their upper case ($S$), while others (say $a$ and $A$) are easy to tell apart; the latter kind is the better choice when you need to make a distinction between lower and upper case.
- Prefer words. You can write something like $f(x, y; \theta)$, which reads ‘a function of the input, output and parameters’, but it is better to write simply *residuals* (or squared residuals) if that is what you aim for. Another recent example is taken from here, where the authors use $a \vee b$ as shorthand notation for the maximum of two numbers. I wonder about the *efficiency vs readability* tradeoff of replacing $\max(a, b)$ with $a \vee b$.
- The notation $|x|$ is mainly used to denote the absolute value of $x$, but it is also used to denote the cardinality of a set (or a determinant, or a norm). The same notation for more than one meaning in the same paper is energy-taxing for the reader. Solve it by using something like $\#x$ to denote the cardinality of x.
- There are enough Greek letters to go around; lose the extravagance by choosing familiar/friendly letters when you can. All else equal, a familiar $\sigma$ or $\theta$ is a better choice than an exotic $\varsigma$ or $\varpi$. Research that uses pompous notation for no reason degrades its own readability, which is a shame.
- If you use the intuitive notation $r_t$ for a financial returns time series, you can use the letter $R$ as the column-collected data matrix. I often see $R$ standing for the correlation matrix, where the letter $C$ would make better sense. The point is that while $X$ is a common notation for the data, we can afford a change based on the context.
- In the same vein, since there is no global math-writing standard just yet, I try to follow the paper Notation in econometrics: a proposal for a standard, which offers an internally consistent framework for notation and abbreviations (and an associated .sty file for LaTeX).

I hope this post contributes, simply, to increasing awareness of the way we incorporate math into text.

For one, there are too many packages out there, and there are imperfect duplicates; you can easily end up downloading an inferior code/package/module compared to an existing alternative. Second, there is the matter of security. I myself try to refrain from downloading relatively new code that is not yet tried-and-true. How do we know if a package is solid?

I recently came across this useful web application* which provides a friendly assist to check the number of downloads of an R package. I was curious about my own packages:

At first glance it doesn’t look too shabby. But compare it with, for example, Frank Harrell’s Hmisc (for Harrell Miscellaneous) package downloads:

20M downloads. Now that is a proper way to **give back** to the community.

You can use this web application to check if a package is trending, if it has matured, or use it in conjunction with the *CRAN Task Views* to compare packages from the same category to help you prioritize your trials.

* There are other such web applications, but less useful, in my opinion.
