Standard kernel density estimation is by far the most popular approach to density estimation. However, it is biased around the edges of the support. In this post I show what this bias implies and present a simple (though not the only) way to correct for it. Practically, you could present density curves that make sense, rather than apologizing (as I often did) for your estimate making less sense around the edges of the chart, which is what happens when you use a standard software implementation.
For a start, see for yourself. Assume you would like to estimate the density of daily market volatility, say as estimated by the squared daily return of SPY (code at the end of the post).
I removed a few outliers from the chart for better readability. This is how it looks if you simply use the default density function in R. The green line uses the default bandwidth, while the light blue line uses a narrower bandwidth.
Both are quite ugly estimates in that circled region. You can see that the kernel estimate uses a sort of “moving window”, which falls off the left edge. You would expect the highest density to be just above zero, since that is where most of the observations are concentrated (observations are dense there). The “moving window” should not “continue” below zero, because the variance never goes negative, and yet it does. This is an example of the bias I was referring to. Jones (1996) offers a simple solution which is conveniently coded for us in the evmix package. Applying it provides the following, much more reasonable estimates:
Both of those estimates are boundary-bias-corrected, but the window size (the bandwidth) is a bit off. You can use the bw.nrd function to help choose the bandwidth. It is meant for Gaussian kernels, but in my opinion the kernel type is inconsequential here. Using that function, you get the following figure:
See below for good references and the code.
All of Nonparametric Statistics. Currently Amazon reviews average 3.5. But given that no one gave it 3 stars, you can imagine people either love it or hate it. I myself like it. Professor Larry Wasserman is a world-class statistician. He used to have a very influential blog, which he Seinfeldly retired (at its peak) at the end of 2013, unfortunately for all of us lovers of statistics, since he has wonderful insights and impressive clarity of exposition.
Jones, M. C., and P. J. Foster. “A simple nonnegative boundary correction method for kernel density estimation.” Statistica Sinica (1996): 1005-1013. Not an easy read without some prior knowledge.
Data and standard kernel density estimation:
# The following packages must be installed
# before this code can be executed smoothly
library(quantmod)
library(magrittr)
library(evmix)
citation("evmix")

k <- 10 # how many years back?
end <- format(Sys.Date(), "%Y-%m-%d")
start <- format(Sys.Date() - (k * 365), "%Y-%m-%d")
symetf <- c('SPY')
dat0 <- getSymbols(symetf, src = "yahoo", from = start, to = end,
                   auto.assign = F, warnings = FALSE, symbol.lookup = F)
w1 <- (100 * dailyReturn(dat0)) %>% as.numeric %>% "^"(2)
w1 %>% head
# plot(w1)
hist(w1, breaks = 1000, xlim = c(0, 2.5), freq = F,
     ylab = "", main = "", col = "lightgrey")
TT <- NROW(w1)
tmpp <- w1 %>% density()
lines(tmpp, lwd = 2)
tmpp <- w1 %>% density(adjust = 0.5) # half the default bandwidth
lines(tmpp, lwd = 2)
Boundary corrected kernel density estimate:
# ?dbckden # for help on the function
seqq <- seq(0, 2.5, length.out = TT)
e_col <- c("blue", "red", "darkgreen", "darkorange") # a small color palette
bc_dens <- dbckden(x = seqq, kerncentres = w1, bw = 1)
lines(bc_dens ~ seqq, col = e_col[4], lwd = 2)
bc_dens <- dbckden(x = seqq, kerncentres = w1, bw = 0.3)
lines(bc_dens ~ seqq, col = e_col[4], lwd = 2)
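Tying this to the bw.nrd suggestion earlier: here is a minimal sketch of plugging a rule-of-thumb bandwidth into dbckden. The simulated nonnegative data below is only a stand-in for the squared returns, so the snippet runs on its own.

```r
library(evmix)
set.seed(1)
w1_sim  <- rexp(1000)                    # stand-in for squared returns
seqq    <- seq(0, 2.5, length.out = 500) # evaluation grid
bw_auto <- bw.nrd(w1_sim)                # Gaussian rule-of-thumb bandwidth
bc_dens <- dbckden(x = seqq, kerncentres = w1_sim, bw = bw_auto)
plot(bc_dens ~ seqq, type = "l", lwd = 2,
     main = "Boundary-corrected density, bw.nrd bandwidth")
```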
Nowadays even marginally tedious computation is sent to faster, minimum-overhead languages like C++. So it is mainly syntax administration that we insist on. What does it matter if we have this:
xsquare <- function(x){ x^2 }
Or this:
def xsquare(x): return x**2
Besides, I freely admit, I am most proficient in stackoverflow tab-opening (still working on tab-closing..) where I often find the syntax I need.
I am being a bit blatant for the sake of stimulus; there are differences between R and Python, of course. I just don’t believe they warrant the passionate reactions I sometimes encounter. For a good framework for comparison, have a look at Norman Matloff‘s R vs. Python for Data Science page. Not so much for actual decision making, but because it is nice to think about what one should consider when it comes to scripting tools.
You are not married to R, nor are you married to Python. Be opportunistic and use them both for their advantages. Ecosystem goes a long way. Both languages are backed by strong communities which, in my opinion, complement each other. For example, I find voice analytics and computer vision a bit easier in Python, while advanced statistical algorithms are better supported and documented in R. Probably a matter of (ancestral) user culture: R is more academic, while Python has a more operational flavor.
I LOVE RStudio. The company, its employees and leadership, and naturally their products and services. The RStudio IDE is my go-to for all things Rython. Below you can find some help on how to set up the integration.
I assume you have both RStudio and Python installed on your machine. You need to point the reticulate package to your Python executable file. But where is it? First you need the path. Here is one way to find it:
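One way (an illustration; adjust for your own setup) is to ask the system from within R:

```r
# From R: look up a python executable on the PATH
Sys.which("python")

# Or ask Python itself where it lives
system('python -c "import sys; print(sys.executable)"')
```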
Here are some additional help commands:
# install.packages("reticulate") # If not already installed
library(reticulate)
use_python("full path to your python.exe file")
reticulate::py_discover_config()
reticulate::py_config()
You can open a notebook and mix two different kinds of chunks: R chunks and Python chunks:
The IDE will take care of what’s what:
2 + 2
> [1] 4

2 + 3
> 5
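The two chunk types can also share objects. A small sketch using reticulate’s documented bridges (py$ on the R side, r. on the Python side):

```r
library(reticulate)
py_run_string("x_py = [4, 5, 6]")  # create an object on the Python side
py$x_py                            # read it from R
x_r <- c(1, 2, 3)
# In a Python chunk, the same object would be visible as r.x_r
r_to_py(x_r)                       # or convert explicitly
```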
Here you can find more or less everything else you may need, including concise but sufficient examples. Feel free to add additional relevant links in the comments.
Do Rython, and have fun!
Here you can find a list of colors in R. This pdf file is one of the first links you get when searching online. Its drawbacks, in my opinion:
Perhaps a more dynamic approach..
I think 129 colors are enough for most. If you need more, this link is useful.
The actual color-names widget is here.
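As a quick dynamic alternative inside R itself (my own helper, not the widget), you can sample from the built-in colors() vector and draw labeled swatches:

```r
# Draw one labeled swatch per color name
show_colors <- function(cols) {
  n <- length(cols)
  barplot(rep(1, n), col = cols, names.arg = cols,
          las = 2, axes = FALSE, border = NA)
}
set.seed(1)
show_colors(sample(colors(), 12))  # 12 random built-in colors
length(colors())                   # 657 names to choose from
```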
Enjoy.
Methods are functions which are written specifically for a particular class. In the post Show yourself (look “under the hood” of a function in R) we saw how to get the methods that go with a particular class. Now there are more modern, less clunky ways to do this.
Have a look at the sloop package, maintained by Hadley Wickham (that alone is a reason). Use the function s3_methods_generic
to get a nice table with some relevant information:
# install.packages("sloop")
library(sloop)
citation("sloop")
s3_methods_generic("mean")
# s3_methods_generic("as.data.frame")

# A tibble: 10 x 4
   generic class      visible source
 1 mean    Date       TRUE    base
 2 mean    default    TRUE    base
 3 mean    difftime   TRUE    base
 4 mean    POSIXct    TRUE    base
 5 mean    POSIXlt    TRUE    base
 6 mean    quosure    FALSE   registered S3method
 7 mean    vctrs_vctr FALSE   registered S3method
 8 mean    yearmon    FALSE   registered S3method
 9 mean    yearqtr    FALSE   registered S3method
10 mean    zoo        FALSE   registered S3method
You can use the above to check whether a method exists for the class you are working with. If there is one, you can help R by specifying that method directly. Do that and you gain a speed advantage, sometimes meaningfully so. Let’s see how it works in a couple of toy cases: one with a Date class and one with a numeric class.
library(magrittr) # for the piping operator
# install.packages("scales") # we talk about this shortly
library(scales)
# install.packages("microbenchmark")
library(microbenchmark)
citation("microbenchmark")

# Create a sequence of dates
some_dates <- seq(as.Date("2000/1/1"), by = "month", length.out = 60)
bench <- microbenchmark(mean(some_dates), mean.Date(some_dates),
                        times = 10^3) %>% summary
print(bench)
                   expr   min    lq     mean median    uq    max neval
1      mean(some_dates) 6.038 6.642 7.011879  6.642 6.944 14.189  1000
2 mean.Date(some_dates) 4.528 4.831 5.417070  5.133 5.435 51.923  1000
cat("Save", (1 - bench$mean[2] / bench$mean[1]) %>% percent(accuracy = 1))
Save 23%

# Now something more standard
x <- runif(1000) # simulate 1000 draws from the uniform distribution
bench <- microbenchmark(mean(x), mean.default(x), times = 10^3) %>% summary
print(bench)
             expr   min    lq     mean median    uq    max neval
1         mean(x) 4.529 5.133 7.113611  7.548 8.453 44.376  1000
2 mean.default(x) 2.113 2.416 3.148788  3.321 3.623  9.963  1000
cat("Save", (1 - bench$mean[2] / bench$mean[1]) %>% percent(accuracy = 1))
Save 56%
Specifying the exact method (if it is there) also reduces the variance around computational time, which is important for simulation exercises:
In the code snippet above I used the scales package’s percent function, which spares the formatting annoyance.
When I load data, I often want to know how big it is. There is the basic object.size function but it’s, ummm, ugly. Use the aptly named object_size function from the pryr package.
library(pryr)
citation("pryr")
> x <- runif(10^3)
> object_size(x)
8.05 kB
> object.size(x)
8048 bytes
> x <- runif(10^6)
> object_size(x)
8 MB
> object.size(x)
8000048 bytes
> x <- runif(10^8)
> object_size(x)
800 MB
> object.size(x)
800000048 bytes # is this Mega or Giga?
> x <- runif(10^9)
> object_size(x)
8 GB
> object.size(x)
8000000048 bytes # is this Mega or Giga?
Use the gc function; gc stands for garbage collection. It frees up memory by, well, collecting garbage objects from your workspace and trashing them. I, at least, need to do this often.
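A minimal illustration:

```r
x <- runif(10^7)   # allocate roughly 80 MB
rm(x)              # drop the reference...
invisible(gc())    # ...and trigger collection now rather than later
```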
The heta function
I use the head and tail functions a lot, often as the first thing I do. Just eyeballing a few lines helps to get a feel for the data. The default printing parameter for those functions is 6 (lines), which is too much in my opinion. Also, especially with time series data, you often have a bunch of missing values at the start or at the end of the time frame. So that I don’t need to run the two functions separately each time, I combined them into one:
heta <- function(x, k = 3) {
  cat("Head -- ", "\n", "~~~~~", "\n")
  print(head(x, k))
  cat("Tail -- ", "\n", "~~~~~", "\n")
  print(tail(x, k))
}
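Usage looks like this (definition repeated so the snippet runs stand-alone):

```r
heta <- function(x, k = 3) {
  cat("Head -- ", "\n", "~~~~~", "\n")
  print(head(x, k))
  cat("Tail -- ", "\n", "~~~~~", "\n")
  print(tail(x, k))
}
heta(mtcars)          # first and last 3 rows of a built-in data frame
heta(LETTERS, k = 2)  # vectors work too
```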
If you stretch your model enough, you will have to wait until the computation is done. It is nice to get a sound notification when you can continue working. A terrific way to do that is the beepr package.
# install.packages("beepr")
library(beepr)
citation("beepr")
# for (i in 0:forever) {
#   do many tiny calculations and don't ever converge
# }
beep(4) # plays a sound when this line is reached
Enjoy!
If you can’t explain it simply you don’t understand it well enough.
(Albert Einstein)
What is so deep about deep learning? Nothing. There is nothing deep about it. If you read through the excellent Deep Learning book you can see (p. 167 in my copy) that a deep learning model with, say, three layers, omitting the dependency on parameters, could be written as

f(x) = f^(3)( f^(2)( f^(1)(x) ) )
In words, the whole shebang boils down to a highly non-linear transformation of the original variables. The word “deep” is not bad in that it provides a feel for the kind of numerical procedures needed for those models. We don’t have a better word, but it is just a convention used to describe the number of these “chain structures”; it does not carry any real meaning otherwise. So what if deep learning models are highly non-linear, and so what if we apply fanciful optimization methods along the way. Put differently, deep learning models are simply a sub-class of the “usual” non-parametric statistics. This statement does not up- or downgrade this class of models. It just drives the point that machine learning is simply statistics.
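To make the “chain structure” concrete, here is a toy sketch of my own (not from the book): three “layers”, each just an affine map followed by a nonlinearity, composed into one highly non-linear transformation.

```r
set.seed(42)
# One "layer": an affine map followed by a nonlinearity
layer <- function(x, W, b) tanh(W %*% x + b)

x  <- matrix(rnorm(4), ncol = 1)              # input vector
W1 <- matrix(rnorm(12), 3, 4); b1 <- rnorm(3) # layer 1 parameters
W2 <- matrix(rnorm(9),  3, 3); b2 <- rnorm(3) # layer 2 parameters
W3 <- matrix(rnorm(3),  1, 3); b3 <- rnorm(1) # layer 3 parameters

# f(x) = f^(3)( f^(2)( f^(1)(x) ) ): nothing more than a
# non-linear transformation of the original variables
out <- layer(layer(layer(x, W1, b1), W2, b2), W3, b3)
out
```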
In the same Deep Learning book, after you are done reading the pleasantly thorough Machine Learning Basics chapter, which example do you think is first in line? Linear regression! No different from Legendre’s 1805 method of least squares.
Do you think you don’t understand what convolution is? Have you ever applied a moving average to a time series? That is a one-dimensional convolution. You never explained it by saying that you convolved your time series with a box-shaped function, did you? No, you said you used a moving average. Again, I don’t mind the language we use, but whichever way you look, you can always spot mainstream, straightforward statistics camouflaged in terms pulled from computer science or the like.
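To see the equivalence numerically, here is a small base-R sketch comparing a plain moving average with an explicit convolution against a box-shaped kernel:

```r
x   <- c(1, 3, 2, 5, 4, 6, 8)
k   <- 3
box <- rep(1 / k, k)              # box-shaped kernel

# The "moving average" phrasing:
ma <- stats::filter(x, box, sides = 2)

# The "convolution" phrasing; same numbers under a fancier name.
# (stats::convolve with type = "open" convolves x with rev(box);
#  box is symmetric so the reversal is immaterial.)
cv <- stats::convolve(x, rev(box), type = "open")

# Align the full convolution with the centered moving average:
n <- length(x)
all.equal(as.numeric(ma[2:(n - 1)]), cv[3:n])  # TRUE
```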
Victor Chernozhukov is one of those world-class econometricians I try to follow. In a recent talk he gave (min 1:40 in this youtube link) he mentions the “new generation of non-parametric statistical methods, branded as ‘machine learning’”. Around minute 10 of the same video, he makes a joke about the Frisch–Waugh–Lovell theorem, saying those were machine learning researchers working back in the 1930s.
So that was nice for me to see I am not alone with this somewhat less popular view.