The way to automate your workflow startup process is via the command `shell.exec`. Here is how you can use it to open whatever it is you need:

```r
library(magrittr)
a_pdf <- "path to pdf"
shell.exec(a_pdf)
a_tex_file <- "path to tex"
shell.exec(a_tex_file)
shell.exec("Path to your note taking program.exe file")
shell.exec("Path to you-get-the-idea.exe file")
```

I imagine you don’t often move files around once they are saved where they should be saved, so those paths are fairly fixed. You can use a tip given in a previous post to quickly reverse the backslashes before pasting the path into your code editor.

You can open multiple files for the same application (e.g. multiple PDFs). You can also rework the code for a bit more elegance:

```r
library(magrittr)
voila <- list(a_pdf, a_tex_file, "path to application1.exe", "path to application2.exe")
voila %>% lapply(shell.exec)
```

You can also open your default browser with the pages you use most. The few lines below should help you feel comfortable clearing your web cache and the data saved by aggressive browsers; your starting point is here:

```r
url1 <- "https://something something 1"
url2 <- "https://something something 2"
url3 <- "https://take me back to my gmail please"
url_list <- list(url1, url2, url3)
url_list %>% lapply(browseURL)
```

For Python users, the `subprocess` module does the same:

```python
import subprocess
subprocess.call([r'C:\Program Files\Mozilla Firefox\firefox.exe'])
```

A nice illustration for why one would prefer the Bayesian approach over the Frequentist approach is given in the wonderful book Computer age statistical inference, section 3.3.

To make the point quickly: imagine you want to estimate the mean of the weight distribution of 40-year-old males, and you have 10 observations (individuals). Observations are measured such that each guy stands on the scales and reports his weight. The range of results turns out to be between 75 and 95, and the average is 85. If weight is normally distributed (it is), then 85, the average, is the best estimate for the mean. However, if on the next day it turns out that the scales are faulty, specifically that any weight over 100 is reported as 100, then the distribution of the observations is no longer normal (it is capped at 100). If the distribution of the sample is not normal, then 85 is **not** the best estimate for the mean (since the cap can only pull reported weights down, the naive average understates the true mean, and the adjusted estimate should be higher than 85). So, even though we have only the one dataset, which has not changed, mind you, the mere fact that we found out the scales are faulty, and even though we had not a single observation over 100, means our Frequentist statistician now believes she has a biased estimate for the mean. Before we discovered the problem with the scales, that estimate was optimal, perfect. Now, with no new data, only new information, the estimate is sub-optimal and must be adjusted.
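To see the bias concretely, here is a minimal R simulation (made-up numbers, with a wider spread than in the story above so the effect is visible): draw true weights, cap the scale readings at 100, and compare averages.

```r
set.seed(1)
true_weight <- rnorm(1e5, mean = 85, sd = 10)  # hypothetical true weights
reported <- pmin(true_weight, 100)             # faulty scale: caps readings at 100

mean(true_weight)  # close to the true mean, 85
mean(reported)     # systematically lower: the naive average is biased
```

No individual correction is applied to any reading below 100, yet the averages still differ; that is exactly the bias the Frequentist now has to worry about.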

In that sense the Frequentist approach is rightly accused of being internally inconsistent with regards to the data. This is a by-product of the belief that data originates from some underlying real, yet unobserved distribution. In sharp contrast, the Bayesian believes that the one realization is it(!); nothing else could have been seen. If there is no other possible data but the one we observe, then what’s to talk about? The average of the observed data is the mean, period. Aha, but while the Frequentist believes there is the one parameter to be estimated, the Bayesian does not. Rather, there are many possible values which would fit the data we see. The only question is which values make sense for our realized, “single truth” data. The average is indeed the most likely estimate for the mean, but we can consider other possible values via some prior distribution around the mean.

It is this business of adding another distributional layer between the problem solver and the target, be it prediction or estimation, that creates the preference for other methods. Bayesian machinery piles additional hyperparameters, often on top of already existing hyperparameters. I don’t see many, if any, applications of Bayesian methodology to **real** data, still to this day.

As Prof. Efron writes in his 1986 paper “Why Isn’t Everyone a Bayesian?”:

“Bayesian theory requires a great deal of thought about the given situation to apply sensibly… All of this thinking is admirable in principle, but not necessarily in day-to-day practice.”

There are many special cases where there is a one-to-one mapping between the Bayesian and the Frequentist statistician. I came up a Frequentist and so I am perhaps biased towards that paradigm, but I am not the only one who finds Frequentism to be simply more practical. For example, you may not have known that both ridge regression and lasso regression have their Bayesian counterparts. Ridge regression can be cast as a Bayesian regression with a Gaussian prior on the parameters, and lasso regression can be cast as a Bayesian regression with a Laplace (double-exponential) prior on the parameters*. These facts are barely mentioned, probably because most if not all practitioners choose the Frequentist path for estimation, as also mentioned in the Computer Age Statistical Inference book (section 7.3): “Despite the Bayesian provenance, most regularization research is carried out frequentistically” (not a typo..).
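The ridge case is easy to verify numerically. The sketch below (simulated data; the values of `lambda` and `sigma2` are arbitrary) shows that the ridge closed form (X'X + λI)⁻¹X'y equals the posterior mean of β under the prior β ~ N(0, (σ²/λ)I):

```r
set.seed(1)
n <- 50; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X %*% c(1, -1, 0.5) + rnorm(n)
lambda <- 2; sigma2 <- 1

# Frequentist route: the ridge closed form
ridge <- solve(crossprod(X) + lambda * diag(p), crossprod(X, y))

# Bayesian route: posterior mean under beta ~ N(0, (sigma2 / lambda) * I)
post_mean <- solve((lambda / sigma2) * diag(p) + crossprod(X) / sigma2,
                   crossprod(X, y) / sigma2)

all.equal(c(ridge), c(post_mean))  # TRUE: same estimator, two interpretations
```

Multiplying the Bayesian expression through by σ² recovers the ridge formula exactly, which is why the two routes agree to machine precision.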

Finally, I queried the Google Trends API for the search term *Bayesian Statistics*, so as to gauge the level of interest over time (assuming, indeed, that the number of Google searches is a good proxy for “interest”). Here is the result:

The earliest data is from 2004. The data is normalized by Google, and smoothed by yours truly. A telling image.

As I wrote before: don’t be a Bayesian, nor a Frequentist; be an opportunist. I don’t have a dog in this fight, I am only making the point that intellectual curiosity alone does not justify the rivers of ink being spilled over Bayesian methods.

* See for example Dirichlet–Laplace Priors for Optimal Shrinkage

This post is inspired by the most recent, beautiful and stimulating (as usual) paper by Prof. Bradley Efron: *Prediction, Estimation, and Attribution* (linked below).

There is more than one way to calculate a variable importance score. Here is just one common way, called mean decrease in node impurity. Since we will be busy with a classification problem in the empirical example, node impurity will be measured by the Gini criterion (see the appendix for a formal definition).

Once the RF algorithm has run its course, for each (bootstrapped) tree we have a list of splits. Each such split maps one-to-one to the particular variable chosen for it. We can then sum up all the decreases in the Gini criterion over all splits, per variable, and average that number over all the (bootstrapped) trees which were constructed. Normalize those numbers and you have the *importance score* per variable. The rationale behind this calculation is that if the algorithm chooses some variables often, and splits taken on those variables display large “progress” towards a good solution (i.e. a large drop in node impurity), it stands to reason that those variables are “important”. Super intuitive, I admit. I myself have strayed based on this intuition. If you run the code given in the help files for the function `varImpPlot`, you can see a typical variable importance plot like the one I am referring to:

```r
library(randomForest)
set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 100,
                          keep.forest = FALSE, importance = TRUE)
varImpPlot(mtcars.rf, type = 1, pch = 19, main = "")
```

What you see is a ranking of the individual variables according to their importance (measured as explained above). The variable `disp` is the most important variable according to this plot.
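If you want the numbers behind the plot rather than the picture, the `importance` function from the same package returns the raw scores. Continuing the mtcars example (same seed and settings as above):

```r
library(randomForest)

set.seed(4543)
data(mtcars)
mtcars.rf <- randomForest(mpg ~ ., data = mtcars, ntree = 100,
                          keep.forest = FALSE, importance = TRUE)
imp <- importance(mtcars.rf, type = 1)  # the %IncMSE scores behind the plot
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]  # sorted, top first
```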

Prediction is much easier than attribution. The default modus operandi of those pure-prediction algorithms is nonparametric. Those algorithms are permissive in the sense that **you don’t need powerful explanatory variables**. This is key. Pure-prediction algorithms can do very well with what is sometimes referred to as “weak learners” or “inefficient” explanatory variables, as long as you have enough of those. Prediction from a pure-prediction algorithm is accurate not so much because of a specific, particular variable, but **because of the interim non-linear transformation which includes many variables**. Plots which show **relative** “importance measures” (the result of a numerical computation done on random subsets along the algorithm’s path towards prediction) falsely convey a feel of absolute importance. That is misleading. Let’s hammer this further.

Data for this exercise is taken from the `sda` package in R. It contains gene-expression measurements of 6033 genes for 102 people (so the matrix dimension is 102 × 6033). 52 people are cancer patients and 50 are normal controls. The goal is to apply RF to predict whether a patient is healthy or sick based on their microarray measurements.

I split the data into a training set (70%) and a test set (30%). Using the default values of the `randomForest` function (from the package by the same name) I estimate a RF model. The first plot shown here is actually the second plot I generated (the plot I generated first appears a few paragraphs below).

The accuracy achieved is excellent: all test cases are predicted correctly. The plot above depicts the genes which helped the most in achieving this good accuracy. The plot was generated using the `varImpPlot` function from the same package. Gene 281 was the most helpful, followed by the duo 890 and 574, etc. So, are those genes important? Not at all!

This is the first plot I generated:

Here you can see that gene numbers 5568, 1720, 77, etc. are helpful. The accuracy of this model is, again, 100%.

Importantly, the first plot you saw was generated after I removed the 30 genes plotted above (numbers 5568, 1720, 77, etc.) from the data. We obtained a similar plot, with a new set of interesting genes, without any loss in accuracy. You see, it’s *relative* importance: in the second model, fitted after removing the top 30 most helpful variables of the first model, you would find gene number 281 worth looking into, while that same gene did not even appear among the top 30 interesting genes when we used the full microarray set. While I think 30 is enough to make the point, I started experimenting with values of 10 and 20, and stopped at 30. You could even remove a further 30 variables from the RF model and I suspect you would still not suffer any accuracy loss.

The complicated nonlinear transformation of the variables is responsible for the excellent accuracy, not individual variables. While you can certainly rank the variables based on the numerical procedure outlined above and zoom in on those which rank high, we should altogether retire the term “important” when looking at the original variables in that way. In my opinion it was an unfortunate choice of words for those plots.

We have discussed RF in this post. However, in the field of explainable AI there is a research trend which tries to revert back to the original variables. While I fully understand the temptation, whether there are or there aren’t strong explanatory variables, that is not the way to go about it, because of the way those black-box algorithms work internally. For explicability’s sake, at my work I proposed introducing the concept of a “grey box”, which says something about the inner workings of the algorithm, the numerical procedures followed, rather than attempting to backtrack towards the original explanatory variables/features. We need to stop doing that.

```r
library(sda)
library(randomForest)
library(magrittr)

data(singh2002)
dat0 <- singh2002
TT <- length(dat0$y)
set.seed(654654)
in_samp <- sample(1:TT, 0.7 * TT)
out_samp <- (1:TT)[-in_samp]
train_dat <- data.frame(y = dat0$y[in_samp], x = dat0$x[in_samp, ])
test_dat <- data.frame(y = dat0$y[out_samp], x = dat0$x[out_samp, ])
num_import <- 30

model_rf <- randomForest(y ~ ., data = train_dat)
imp_var <- importance(model_rf, type = 2)
tmp1 <- tail(imp_var %>% order, num_import)  # indices of the top variables
tmp2 <- tail(imp_var %>% sort, num_import)   # their importance scores
# barplot(sort(tmp2), horiz = TRUE, names.arg = tmp1, space = 0.05)
varImpPlot(model_rf, main = "", pch = 19, n.var = num_import)
rf_p <- predict(model_rf, newdata = test_dat[, -1], type = "response")
mean(rf_p == test_dat[, 1])  # out-of-sample accuracy

# Remove the "important" variables and refit
train_dat <- data.frame(y = dat0$y[in_samp], x = dat0$x[in_samp, -tmp1])
test_dat <- data.frame(y = dat0$y[out_samp], x = dat0$x[out_samp, -tmp1])
model_rf <- randomForest(y ~ ., data = train_dat)
imp_var <- importance(model_rf, type = 2)
tmp1 <- tail(imp_var %>% order, num_import)
tmp2 <- tail(imp_var %>% sort, num_import)
varImpPlot(model_rf, main = "", pch = 19, n.var = num_import)
rf_p <- predict(model_rf, newdata = test_dat[, -1], type = "response")
mean(rf_p == test_dat[, 1])  # accuracy of the reduced model
```

Appendix: the Gini criterion (designed for classification). For a node, the impurity is

Gini = Σ_k p_k (1 − p_k) = 1 − Σ_k p_k²,

where p_k is the kth entry in the vector of class-estimated probabilities.
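As a sanity check on the definition, here is the criterion as a one-line R function:

```r
# Gini impurity of a node, given its vector of class proportions p:
# sum_k p_k * (1 - p_k), equivalently 1 - sum(p^2)
gini <- function(p) 1 - sum(p^2)

gini(c(1, 0))      # a pure node: impurity 0
gini(c(0.5, 0.5))  # a 50/50 split: impurity 0.5, the two-class maximum
```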

Say you have a file to read into R. The file path is `C:\Users\folder1\folder2\folder3\mydata.csv`. So what do you do? You copy the path, paste it into the editor, and start reversing the backslashes into forward slashes so that R can read your file.

With the help of the `rstudioapi` package, the `readClipboard` function, and the following function:

```r
get_path <- function() {
  x <- readClipboard(raw = FALSE)
  rstudioapi::insertText(paste("#", x, "\n"))
  x
}
```

You can:

1. Simply copy the path `C:\Users\folder1\folder2\folder3\mydata.csv`.
2. Execute `pathh <- get_path()`.
3. Use `pathh`, which is now R-ready.

No more reversing or escaping backslash.
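If you don’t use RStudio, a `gsub` one-liner achieves the same flip. The path below is hard-coded for illustration; in practice you would feed in `readClipboard()` (which, note, is Windows-only):

```r
# The doubled backslashes are just R string escaping for single backslashes
win_path <- "C:\\Users\\folder1\\folder2\\folder3\\mydata.csv"
gsub("\\\\", "/", win_path)  # "C:/Users/folder1/folder2/folder3/mydata.csv"
```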

A new paper titled *“Beta in the tails”* is a showcase application of why we should focus on correlation structure rather than on average correlation. The authors discuss the question: *Do hedge funds hedge?* The reply: no, they don’t!

The paper *“Beta in the tails”* was published in the *Journal of Econometrics*, but you can find a link to a working-paper version below. We start with a figure replicated from the paper, go through its meaning and interpretation, and explain the methods used thereafter.

Hedge funds don’t hedge. Betas estimated using the `rq` function from the `quantreg` package. On the X-axis: the actual quantiles (e.g. 50 means roughly median monthly returns). The data used is taken from lab.credit-suisse.com (registration needed). The figure shows that when market returns are low (lower quantiles on the X-axis), hedge fund returns move more in tandem with them (so they are also low) than when market returns are at the higher end (higher quantiles on the X-axis). If hedge funds actually served us as a hedge, we should have seen exactly the reverse: when markets do poorly for the individual, hedge fund returns would “kick in” to compensate for market losses.

More interesting points from the paper:

- The figure above is typical, and holds true for almost all hedge fund styles. The wording “in the tails” is because the slope is very steep when returns are at the very low quantiles (the left tail of the return distribution). Meaning the hedge, if that is what you tell yourself you are doing, fails exactly when you need it the most; it adds tail risk rather than removing it.
- Two styles are actually not a bad hedge: *managed futures* and *equity market neutral*. It’s nice to see that an analysis I made back in 2012 about the most profitable hedge fund styles actually holds up, as 2 out of the 3 winning styles are indeed market neutral.

I followed a paper written by Dirk Baur, *The structure and degree of dependence: A quantile regression approach*, and used slightly revised code from the post *Correlation and correlation structure (1); quantile regression*. Basically you loop a quantile regression through the different quantiles to estimate the beta at each one. While we could discuss the more general notion of correlation, clearly hedge funds follow the market rather than the other way around, so we can simply look at the beta from that regression. The actual function I used is given below.

```r
library(quantreg)

corquantile <- function(seriesa, seriesb, k = 10) {
  if (length(seriesa) != length(seriesb)) {
    stop("length(seriesa) != length(seriesb)")
  }
  TT <- length(seriesa)
  cofa <- cofb <- NULL
  for (i in k:(100 - k)) {
    # The workhorse:
    lm0 <- summary(rq(seriesa ~ seriesb, tau = i / 100))
    lm1 <- summary(rq(seriesb ~ seriesa, tau = i / 100))
    cofa[i - k + 1] <- lm0$coef[2, 1]
    cofb[i - k + 1] <- lm1$coef[2, 1]
  }
  list(cofa = cofa, cofb = cofb)
}
```
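To see what the loop produces, here is a quick run of the same `rq` workhorse on simulated (hypothetical) series. With linear, homoskedastic data the betas are flat across quantiles; the hedge-fund figure above, by contrast, shows a much steeper beta in the left tail:

```r
library(quantreg)

set.seed(1)
market <- rnorm(300)                        # simulated market returns
fund <- 0.8 * market + rnorm(300, sd = 0.5) # simulated fund: constant beta of 0.8

taus <- seq(0.10, 0.90, by = 0.01)
betas <- sapply(taus, function(tau) coef(rq(fund ~ market, tau = tau))[2])

# Flat line around 0.8, since here the dependence is the same in every quantile
plot(taus * 100, betas, type = "l",
     xlab = "quantile", ylab = "beta of fund on market")
```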

* The first sentence is a quote by Michael Lewis.
