Matrix Multiplication as a Linear Transformation

AI algorithms are in the air. Much of their success is attributed to dimension expansion, which makes that aspect worth a closer look.

Matrix multiplication can be usefully viewed as a way to expand the dimension. We begin with a brief discussion of PCA. Since PCA is predominantly used to reduce dimensions, and since you are already familiar with it, it serves as a good springboard by way of a contrasting example to dimension expansion. Afterwards we show some basic algebra via code, and conclude with a citation that provides the intuition for why dimension expansion is so essential.
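
To make the idea concrete before you click through, here is a minimal R sketch of my own (not the code from the post): multiplying an n x 2 data matrix by a 2 x 5 weight matrix moves every observation from 2 dimensions to 5.

    # dimension expansion through matrix multiplication (illustrative sketch)
    set.seed(1)
    X <- matrix(rnorm(100 * 2), nrow = 100, ncol = 2)  # 100 observations, 2 features
    W <- matrix(rnorm(2 * 5), nrow = 2, ncol = 5)      # arbitrary expansion weights
    Z <- X %*% W                                       # the linear transformation
    dim(X)  # 100 x 2
    dim(Z)  # 100 x 5: same observations, now living in 5 dimensions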

Continue reading

Randomized Matrix Multiplication

Matrix multiplication is a fundamental computation in modern statistics. It’s at the heart of all current serious AI applications, and the matrices involved nowadays are gigantic. On a good system it takes around 30 seconds to estimate the covariance of a data matrix with dimensions $X_{10000 \times 2500}$, which is small data by today’s standards, mind you. Need to do it 10000 times? Wait roughly 80 hours. Have larger data? Running time grows steeply. Want a more complex operation than a covariance estimate? Forget it, or get ready to dig deep into your pockets.

We, mere minions who cannot splurge thousands of dollars on high-end G/TPUs, are left unable to work with large matrices because of the massive computational requirements; after all, who wants to wait a few weeks to discover their bug?

This post offers a solution by way of approximation, using randomization. I start with the idea, follow with a short proof, and conclude with some code and a few run-time results.
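
To fix ideas, here is a bare-bones R sketch of one standard randomized scheme: sample columns of A and the matching rows of B with probabilities proportional to their norms, and rescale so the estimate is unbiased. It is my own illustration of the general idea, not necessarily the exact algorithm used in the post.

    # randomized (sampled) matrix multiplication: approximate A %*% B
    set.seed(1)
    n <- 1000; k <- 500; m <- 300
    A <- matrix(rnorm(n * k), n, k)
    B <- matrix(rnorm(k * m), k, m)
    c_samp <- 100                                  # number of sampled columns/rows
    p <- sqrt(colSums(A^2)) * sqrt(rowSums(B^2))   # norm-based sampling weights
    p <- p / sum(p)
    idx <- sample(k, c_samp, replace = TRUE, prob = p)
    # rescale so the sampled outer products average out to A %*% B
    A_s <- A[, idx, drop = FALSE] / rep(sqrt(c_samp * p[idx]), each = n)
    B_s <- B[idx, , drop = FALSE] / sqrt(c_samp * p[idx])
    approx_AB <- A_s %*% B_s
    mean(abs(approx_AB - A %*% B))                 # average absolute error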

Continue reading

Hyper-Parameter Optimization using Random Search

Hyper-parameters are parameters which are not estimated as an integral part of the model. We decide on them ourselves; they are not estimated within the model but set beforehand. Hence the name hyper-parameters, as in “above” the model.

Almost all machine learning algorithms have some hyper-parameters. A data-driven choice of hyper-parameters typically means that you re-estimate the model and check its performance under different hyper-parameter configurations, which adds considerable computational burden. One popular approach is a grid search over possible values, evaluated on the validation set. Faster and simpler ways to intelligently choose hyper-parameter values would go a long way towards keeping the already stretched computational cost at a level you can tolerate.

Enter the paper “Random Search for Hyper-Parameter Optimization” by James Bergstra and Yoshua Bengio, which suggests, with a straight face, not to use grid search but instead to look for good values completely at random. This is very counterintuitive: how can random guesses within some region compete with systematically covering that same region? What’s the story there?
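
A toy illustration of the intuition, written by me rather than taken from the paper: when only one of two hyper-parameters really matters, a 3-by-3 grid tries only three distinct values of the important one, while nine random draws try nine.

    # grid search vs. random search with the same budget of 9 evaluations
    set.seed(1)
    loss <- function(important, unimportant) (important - 0.3)^2 + 0.001 * unimportant
    grid <- expand.grid(important = c(0.1, 0.5, 0.9), unimportant = c(0.1, 0.5, 0.9))
    grid_best <- min(mapply(loss, grid$important, grid$unimportant))
    rand <- data.frame(important = runif(9), unimportant = runif(9))
    rand_best <- min(mapply(loss, rand$important, rand$unimportant))
    c(grid = grid_best, random = rand_best)  # random search usually lands closer to 0.3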

Below I share the message of that paper, along with what I personally believe is actually going on (and the two are very different).

Continue reading

Local Linear Forests

Random forests is one of the most powerful pure-prediction algorithms, and it is immensely popular with modern statisticians. Despite its potent performance, improvements to the basic random forests algorithm are still possible. One such improvement is put forward in a recent paper called Local Linear Forests, which I review in this post. To enjoy the read you need to be familiar with the basic version of random forests already.

Continue reading

Publication in Significance – code

A couple of months ago I published a paper in Significance – a couple of pages describing the essence of deep learning algorithms and why they are so popular. I got a few requests for the code which generated the figures in that paper. This weekend I reviewed my code and was content to see that I had used pseudorandom numbers with a seed (as opposed to unseeded, non-reproducible random numbers), so the figures are exactly reproducible. The actual code to produce the figures, and the figures themselves (e.g. for teaching purposes), are provided below.
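
The reproducibility point in a nutshell, as a generic R sketch rather than the paper’s actual code: fixing the seed makes the “random” numbers, and therefore the figures, identical on every run.

    set.seed(2023)           # any fixed value will do
    x <- rnorm(5)
    set.seed(2023)
    identical(x, rnorm(5))   # TRUE: same seed, same draws, same figures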

Continue reading

A New Parameterization of Correlation Matrices

In volatility modelling, a typical challenge is to keep the covariance matrix estimate valid, meaning (1) symmetric and (2) positive semi-definite*. A new paper published in Econometrica (citing from the paper) “introduces a novel parametrization of the correlation matrix. The reparametrization facilitates modeling of correlation and covariance matrices by an unrestricted vector, where positive definiteness is an innate property” (emphasis mine). Econometrica is known to publish ground-breaking research, and you may wonder: what is the big deal about being able to reparametrize the correlation matrix?
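
To see why this matters, here is a small R illustration of my own (not the paper’s parametrization): a symmetric matrix with ones on the diagonal is not automatically a valid correlation matrix, which is exactly why an unrestricted parametrization with built-in positive definiteness is so convenient.

    # symmetric, unit diagonal, yet not positive semi-definite
    C_bad <- matrix(c( 1.0,  0.9, -0.9,
                       0.9,  1.0,  0.9,
                      -0.9,  0.9,  1.0), 3, 3)
    min(eigen(C_bad)$values)      # negative, so C_bad is not a correlation matrix

    # a classic (Cholesky-style) workaround, different from the paper's proposal:
    # any unrestricted vector theta maps to a valid correlation matrix
    theta <- rnorm(6)
    L <- matrix(0, 3, 3)
    L[lower.tri(L, diag = TRUE)] <- theta
    C <- cov2cor(L %*% t(L))      # positive semi-definite by construction
    min(eigen(C)$values)          # non-negative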

Continue reading

What’s the big idea? Deep learning algorithms

Deep learning algorithms are increasingly featuring in popular news outlets, large-scale media events and academic conferences. But what makes them so popular? Why now?

I recently published what I hope is an easy read for all of you modern-statistics lovers, explaining the thrust behind this class of machine-learning models.

You can download the two-pager from Significance, specifically here (subscription required).

Continue reading

Random forest importance measures are NOT important

Random Forests (RF from here onwards) is a widely used pure-prediction algorithm. This post assumes good familiarity with RF; if you are not familiar with the algorithm, stop here and see the first reference below for an easy tutorial. If you have used RF before, you have probably encountered those “importance of the variables” plots. We start with a brief explanation of those plots and of how the importance scores are calculated. The main takeaway from the post: don’t use those importance plots, because they are simply misleading. They are a wrong turn taken by our human tendency to look for a reason, whether or not one is there.
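
For context, this is roughly how those plots are produced with the randomForest package in R (shown only so you know what the post is arguing against):

    library(randomForest)
    set.seed(1)
    fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
    importance(fit)    # permutation- and impurity-based importance scores
    varImpPlot(fit)    # the "importance of the variables" plot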

Continue reading

R tips and tricks – Timing and profiling code

Modern statistical methods rely on simulation: generating different scenarios and repeating them thousands of times over. At that scale, even trivial operations become a burden on computational speed.

In the words of my favorite statistician Bradley Efron:

“There is some sort of law working here, whereby statistical methodology always expands to strain the current limits of computation.”

In addition to the need for faster computation, the richness of the open-source ecosystem means that you often encounter different functions doing the same thing, sometimes even under the same name. This post explains how to measure the computational efficiency of a function so you know which one to use, with a couple of actual examples of reducing computational time.
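
As a small taste, timing in R can be as simple as base system.time(), or the microbenchmark package for finer-grained comparisons (whether the post uses these particular tools is my assumption):

    x <- rnorm(1e6)
    system.time(for (i in 1:100) mean(x))
    system.time(for (i in 1:100) sum(x) / length(x))

    # repeated, finer-grained comparison of the two equivalent computations
    library(microbenchmark)
    microbenchmark(mean(x), sum(x) / length(x), times = 100)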

Continue reading

Understanding Spectral Clustering

Some problems are linear, but some problems are non-linear. I presume that you started your education discussing and solving linear problems, which is a natural starting point. For non-linear problems, solutions often involve an initial processing step. The aim of that step is to transform the problem so that it has, again, a linear flavor.

A textbook example is logistic regression, a tried-and-true recipe for finding the best linear boundary between two classes. In a standard neural network model you will find logistic regression (or multinomial regression for multi-class output) applied to transformed data. A few preceding layers are “devoted” to transforming a non-separable input space into something linear methods can handle, allowing the logistic regression to solve the problem with relative ease.

The same rationale holds for spectral clustering. Rather than working with the original inputs, work first with transformed data that makes the problem easier to solve, and then link the result back to your original inputs.

Spectral clustering is an important and up-and-coming variant of some fairly standard clustering algorithms, and a powerful tool to have in your modern-statistics tool cabinet. It includes a processing step that helps recast non-linear problems so they can be solved with the linear algorithms we are so fond of, for example the undeniably popular K-means.
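
A bare-bones R sketch of that processing step, my simplification rather than the post’s exact recipe: build an affinity matrix, take eigenvectors of the graph Laplacian, and hand those to K-means.

    # two clusters K-means alone cannot separate: a ring around a blob
    set.seed(1)
    angle <- runif(200, 0, 2 * pi)
    X <- rbind(cbind(cos(angle), sin(angle)),           # outer ring
               matrix(rnorm(100, sd = 0.2), ncol = 2))  # inner blob
    W <- exp(-as.matrix(dist(X))^2 / (2 * 0.2^2))       # Gaussian affinity, bandwidth 0.2
    L <- diag(rowSums(W)) - W                           # unnormalized graph Laplacian
    U <- eigen(L, symmetric = TRUE)$vectors[, (nrow(W) - 1):nrow(W)]  # two smallest eigenvectors
    clusters <- kmeans(U, centers = 2)$cluster
    plot(X, col = clusters)                             # ring and blob now separated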

Continue reading

Understanding K-Means Clustering

Introduction

Google “K-means clustering”, and you usually find ugly explanations and math-heavy sensational formulas*. It is my opinion that you can only understand those explanations if you don’t need them, meaning you are already familiar with the topic. Therefore, this is a more gentle introduction to K-means clustering. Here you will find out what the K-means clustering algorithm actually does. You will get only the basics, but in this particular topic the extensions are not wildly different.
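
If you want to run the algorithm while reading, base R already has it; a tiny example on the iris data (my choice of data, purely for illustration):

    set.seed(1)
    fit <- kmeans(iris[, 1:4], centers = 3)
    table(fit$cluster, iris$Species)   # how the clusters line up with the species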

Continue reading