Spike-and-slab priors

In variable selection problems where the number of predictors is large relative to the number of data points, the optimization of penalized objective functions represents a convenient approach to regularization. The \(L_2\)-penalty (ridge regression), and \(L_1\)-penalty (Lasso regression) are most common. There is even a penalized regression called elasticnet regression, that combines the \(L_1\) and \(L_2\) penalties as a convex sum. Looked at from a Bayesian lens, the \(L_2\)-penalized regression solution is the mode of the posterior distribution if a Gaussian prior is placed on each of the linear regression coefficients, while the Lasso solution is the mode of the posterior if a Laplacian distribution is used as the prior instead. The Laplacian distribution, being non-differentiable at 0, is peakier than the Gaussian distribution, and as a result it results in a sparse solution.

The Lasso or \(L_1\)-penalty is really a relaxation to the original problem of variable selection, which requires a \(L_0\)-penalty (the \(L_0\) norm of the coefficient vector is the number of non-zero coefficients). However, appending the \(L_0\) norm to the standard linear regression objective function destroys the convexity of the optimization problem. The \(L_1\)-norm is the smallest norm that preserves convexity in the penalized objective function, while serving as a convenient relaxation to the \(L_0\) norm.

A spike-and-slab distribution is a mixture of two Gaussians, one that is very peaky (i.e. low variance - this is the spike) and another that is very broad (i.e. high variance - this is the slab).


With a spike of zero variance (a Dirac Delta function), the spike and slab prior perfectly expresses the original variable selection criterion of either accepting or rejecting a variable. However, with this prior, there is no closed form penalty function that can simply be appended to the original objective function and the result minimized. The formulation requires the computation of a full posterior distribution and in the absence of a convenient conjugacy relationship, the only way out is to do MCMC sampling (another approach might be variational inference, but this requires crafting a variational inference procedure tailored to this problem, and I haven't seen this published yet). The idea of a Spike-and-Slab prior, originally proposed in the 80s and refined in the 90s, is older than Lasso. It is seeing renewed interest from practitioners - for e.g. this work out of Google, sets up a variable selection problem in terms of a spike-and-slab prior with MCMC for inference. 

Do genetics influence Olympic performance?

During the recent 2016 summer olympics in Brazil, I found myself grumbling along with some friends about the historically poor performance of India at the Olympics. During the discussion, it was suggested that one of the reasons for India's poor performance at the Olympics might be that Indians are genetically disadvantaged at competitive sport. This was an intriguing hypothesis, but one that was difficult to settle.

A few weeks later, I came across a Priceonomics article about the average heights of people in each country over time. This seemed to provide an opening to test the genetics hypothesis. The Pricenomics article suggested that the mean height of a country was reflective of access to nutrition enjoyed by a country's population during developmental years (and hence reflective of income & resource equality) and contained information that is orthogonal to a country's GDP. I hypothesized that the medal count of a country might have a relationship with it's mean height even after controlling for the country's GDP (it was highly plausible that there is a link between GDP and medal count). If it were so, then it would mean that performance at the Olympics was linked to the portion of the variance in mean height that cannot be explained by GDP, and which must be a consequence of either (1) differences in inequality and/or (2) genetic predisposition. If we could eliminate the dependence of height on inequality, that would ostensibly leave only genetic predisposition. 

This suggested therefore, that if a regression is performed with olympic medal count as the response, and with two predictors: (1) GDP and (2) a version of mean height per country from which economic effects had been residualized, ostensibly leaving behind only genetic factors, then the debate could be settled. If this residualized height variable were to be found statistically significant after controlling for GDP, it would suggest that genetic factors are statistically significant in determining Olympic performance. If we this residualized height were found to not be predictive of medals however, we would only have shown that genetics does not influence Olympic performance via height -- it might still influence performance independently of height. 

Fortunately, I was able to find an Olympic medal count dataset courtesy of The Guardian, a clean dataset on mean height by country broken down by gender and year linked from this study linked from the Priceonomics article, and datasets on GDP and GINI coefficient from the World bank. What follows is an exploration of the relationships between these variables after performing the necessary joins, en route to the hypothesis test crafted above.  The final joined dataset contained data for 156 countries with the following variables: GDP, GDP per capita, GINI coefficient, mean_height for men and women, and counts of medals ever won, by year for each country by medal type (Bronze, Silver, Gold). We take the average of the mean height of men and women to be a single mean height for the country. Further, we compute a single medals score = 3 * # of Golds + 2 * # of Silvers + # of bronzes as a representative scalar of a country's performance at the Olympics. Here is the file with all the data for 156 coutries. We begin with descriptive statistics of the variables in the joined dataset, followed by the simplest univariate relationships.

In each of the regression studies below, we pay special attention to a few countries of interest: USA, China, India and the Netherlands using the colors blue, red, green and cyan respectively. The Netherlands is in the list because it is the country with the largest mean height, while USA and China are in the list because of their Olympic performance and interesting economic variables. 

Univariate regression studies

Medal count and height: The total medal count of a country is indeed related to its mean height. USA appears as a large outlier, suggesting that it's olympic medal count is not driven by its mean height but something else.

Medal count and GDP: shows a strong relationship, much stronger than the relationship between height and medals. This is not too surprising given that richer countries can afford to spend more resources in training their athletes, and that a higher per capita GDP is likely to encourage greater participation in sport. We also tested the GDP per capita against medals and the log10 versions for both GDP and GDP per capita, but found the raw GDP below to have the higher r-quared with medals. USA is no longer an outlier in this regression fit, suggesting that its olympic performance might be driven less by its mean height and more by its GDP. It is worth acknowledging that we cannot make claims about causality from these results. USA is a country of immigrants and it attracts talent in many fields, including sports. This also probably plays a role in its outsized Olympic medal count, even though this is probably related to its high GDP too.

Medal count and GINI coefficient: The hypothesis here is that countries with more equitable distribution of wealth should have a higher medal count because they make resources & nutrition more broadly to their population. We do see a weakly negative relationship but it falls just short of statistical significance of |tstat| > 2.

Height and GDP: Not surprisingly, there is a positive relationship between height and GDP.

However, the relationship is much stronger if we consider GDP per capita instead of raw GDP

The scatterplot above is strongly suggestive that there is a log-like relationship between height and GDP per capita and indeed we see this after transforming GDP per capita to the log domain. The plot below is worth dwelling over. It shows a strikingly good fit between mean height and log 10 of per capita GDP, with an r-squared of 52%! Netherlands does better in mean height and India does worse, than the linear fit from log10 per capita GDP would suggest. 

Height and GINI coefficient: Interestingly, there is a fairly strong negative relationship between height and GINI coefficient, which supports the hypothesis that an equitable distribution leads to a greater mean height. 

Multiple linear regressions

Height vs. log10 of GDP_pcap and GINI: We discovered above that height has a strong positive relationship with log10 of per capita GDP which represents the amount of resources available per unit of population without taking into account the distribution of resources within a country. It is therefore worth examining the effect of adding on GINI as a predictor, which is meant to capture just this, and which we have seen to have a negative relationship with height. We find that adding GINI does indeed improve the r-squared and GINI is found to be a statistically significant predictor. This is evidence that between countries with the same per capita GDP, greater equality leads to a larger mean height. 

Medals vs GDP and height: Does height improve the predictability of medals beyond what is already captured by GDP? We find that this is indeed the case. That is, height contains information relevant to the determining the performance at olympics that is not already captured by a country's GDP. 

Medals vs GDP and GINI: As per our hypothesis, countries with the same GDP should earn more medals if they have a more equitable distribution of wealth because of better access to resources and nutrition. We do find this to be the case. Adding GINI as a predictor to the regression between medals and GDP improves the r-squared and shows GINI to be statistically significant in the presence of GDP.  

Do genetic factors influence Olympic medal counts?

Finally, we arrive at the question we sought to answer: do genetic factors influence olympic medal count? And is genetic disadvantage the primary excuse for India's poor performance at the Olympics? We take a shot at studying this by first removing the effects of GDP per capita (mean resources per person) and GINI coefficient (inequality) from the mean height of a country. This is done by simply regressing mean height against those two predictors and retaining the residual and intercept. 

\[\textrm{Mean height} =  \beta_1 * (\log_{10}{(\textrm{GDP per capita})}) + \beta_2 * (\textrm{GINI}) + \textrm{residual}\]

The residualized height can be scaled and shifted such that it has the same mean and variance as the original mean height variable. Then the residualized heigh can be thought of as what the mean heights of countries would have been if there were no economic factors at play, only genetic factors (and other factors we may not have thought of). Plotting the residualized height against the original mean height shows that USA, China and Netherlands have a lower residualized height than their mean heights, while India has a higher residualized height than its mean height. In other words, removing economic factors influencing height causes the height of India to increase and the heights of China, USA and Netherlands to decrease. 

Having removed the linear projection onto these two economic variables, we take the residual to be determined largely by genetic factors. This is then used as a predictor against medal count. 

The residualized height shows no statistically significant predictive power against medal count in a univariate regression, where we have not controlled for the effect of GDP on medals. Therefore the univariate relationship between mean height and medals that we saw above disappears if we remove the economic factors that influence height, which says that height influences medals directly only because of the economic factors that influence height. Now, let us control for the effect of GDP on medals and examine the effect of residualized height on medals after GDP has been controlled for. 

Thus, we find that among countries with similar GDP, residualized height does appear to be a statistically significant predictor of medals, indicating that genetics have an impact on Olympic performance if we control for the effect of GDP on medals. 

Darwin and Mixture Distributions

Although Charles Darwin understood that each individual of a species resembled its parents, he did not know about the genetic basis for life, i.e. each trait of an individual comes from a gene, which comes from exactly one of either the mother or the father. Instead, he imagined that each trait comes from a blending of the traits of the parents. In Richard Dawkins' book The Greatest Show on Earth, The Evidence for Evolution, he writes about Darwin:

He was aware of course that characteristics tend to run in families, that offspring tend to resemble their parents and siblings. Heredity was a central plank of his theory of natural selection. But a gene pool is something else. A gene is an all or nothing entity. When you were conceived, what you received form your father was not a substance, to be mixed with what you received from your mother, as if mixing blue paint with red paint to make purple. 

This is an interesting distinction because it amounts to the distinction between the average of two random variables

\[X = \frac{1}{2}(X_1 + X_2)\]

whose distribution is given by a convolution of the two probability density functions, and a mixture distribution,

\[Z \sim U\{+1,-1\}\]

\[X \sim p_z\]

where a latent Bernoulli random variable \(Z\) decides which random variable to sample from (father/mother), and then depending upon the value of \(Z\), one of two distributions is sampled. This two step process is then repeated for each gene. The averaging processes, over successive generations, would lead to a Gaussian distribution by the central limit theorem, implying that after several generations, all members of a species would look very similar with smooth variation across members. On the other hand, the mixture distribution would lead to the formation clusters. 


David MacKay

I was deeply saddened to learn last week that Sir David MacKay has passed away. Over the last nine months, he struggled with cancer, and documented this journey in great detail on a personal blog. He is survived by his wife and two kids, aged one and four. He was just a few days short of his 49th birthday. 

I first met David MacKay at Princeton in 2006 when he gave a talk about a piece of HCI software called Dasher that he had designed. I immediately loved his presentation style, and I loved how with Dasher, he had blended theory with practical impact. He had a way to infuse a serious technical talk with wit and humor. Dasher was born out his realization that natural language can be compressed and that this compression can be presented visually as a form of text entry. From what I later read, Dasher went on to prove tremendously useful to the disabled. Soon after I got hold of a copy of David's book Information Theory, Inference and Learning Algorithms, considering it good enough to own in print, despite the fact that had made it available online for free. The amount of detail in the book in the way of examples is truly astonishing. It is a great book to learn ideas in the information sciences from first principles. His book was influential in giving me the confidence to study machine learning seriously towards the end of my PhD, because it served as a bridge from information and coding theory, which I was comfortable with. Around 2013 I learnt that dasher was available as an iOS app and I immediately installed it. A few years later I came across David's work on Gaussian processes and Bayesian learning. I remember enjoying watching this video on videolectures.net in which he explains Gaussian processes in his characteristic style -- very clear on technicals, witty and humorous, and loaded with demos and code snippets in Matlab. A while later I learnt of his book on climate change and remember browsing through the free copy he had put online.

Although I did not get too many opportunities to interact with him, David's work was, and continues to be, hugely inspiring. The world is a lesser place without him.

Deep learning and Bayesian Nonparametrics

Deep learning thrives on problems where automatic feature learning is crucial and lots of data are available. Large amounts of data are needed to optimize the large number of free parameters in a deep network. In many domains, that would benefit from automatic feature learning, large amounts of data are not available, or not available initially. This means that models with low capacity must be used with hand crafted features, and if large amounts of data become available at a later time, it may be worthwhile to learn a deep network on the problem. If deep learning really reflects the way the human brain works, then I imagine that the network being trained by a baby must not be very deep, and must have few parameters than networks that are trained later in life, as more data becomes available. This is reminiscent of Bayesian Nonparametric models, wherein the model capacity is allowed to grow with increasing amounts of data. 

Aesthetic pleasure requires partial familiarity

The toy plot below is inspired by Jurgen Schmiduber’s thesis on what we as humans seek when we seek aesthetic pleasure from information objects (visual art, music, movies, stories, etc.).

Partial familiarity is necessary in order to enjoy aesthetic pleasure from an object. At some level, this is pretty obvious. For e.g, a Mathematician may experience aesthetic elegance when reading a new theorem that a layman cannot, because the layman just isn’t familiar enough to be able to interpret it. At the other extreme, residents of even the most beautiful places on the planet cease to be overcome by the beauty of their surroundings because they grow accustomed to it. I think aesthetic pleasure therefore, is simply a state where the mind understands part of what it is shown, just enough, not too much, not too little.

  1. This explains why designers (fashion designers, graphic designers, industrial designers like those who decide what the next model of a car should look like, etc.) need to constantly innovate. People eventually get bored with designs that have been around.
  2. This is also the reason that retro designs sometimes make a comeback. Bell bottoms eventually fade from memory, and aren’t even familiar to those born in the 90s. A design is called ‘retro’ if has faded just enough from public memory to be partially familiar. Something that has completely faded from public memory is unlikely to make a comeback.
  3. One somewhat depressing consequence that the thesis above suggests is that we are constantly driven by a desire to find and consume new things that we are partially familiar with. That is, we want to move from the right edge of the plot to the middle. We cannot keep watching the same kind of movie, or TV shows, or wearing the same kind of clothes, etc. This suggests that the pursuit of aesthetic pleasure seems somewhat meaningless by itself. I told you it was depressing.
  4. There is some interesting work on why we find jokes funny, or, what makes a joke funny. Several jokes take two seemingly unrelated things and develop a new connection between them, for e.g. a pun. Here, the listener is familiar with part of the context in the joke, but there is something new. Once the joke has been heard, it looses its humor value, because it slowly moves towards the right edge of the plot above. Edward de Bono suggests about jokes:

The mind is a pattern-matching machine, and that it works by recognising stories and behaviour and putting them into familiar patterns. When a familiar connection is disrupted and an alternative unexpected new link is made in the brain via a different route than expected, then laughter occurs as the new connection is made.

Ratio of unbiased mean estimates

Suppose \(X\) and \(Y\) are two quantities and we are interested in estimating the ratio \(X/Y\) but we need to estimate \(X\) and \(Y\) using samples (e.g. survey samples). For example, \(X\) may be the the amount of money spent by the population on leisure in year \(T\) and \(Y\) may be the amount of money spent in year \(T+1\). It is tempting to estimate \(X\) and \(Y\) separately and divide the estimates to get an estimate of \(X/Y\). If \(\hat{\mu}_x\) and \(\hat{\mu}_y\) are empirical means of \(X\) and \(Y\) as computed from the samples, the ratio \(r = \hat{\mu}_x/\hat{\mu}_y\) is not an unbiased estimate of \(X/Y\). This can be easily seen as follows: $$E\left[\frac{X}{Y}\right] = E\left[X\right] E\left[\frac{1}{Y}\right] \geq \frac{E\left[X\right]}{E\left[Y\right]},$$where the equality follows the independence of \(X\) and \(Y\), and the inequality follows from Jensen's inequality, noting that \(1/Y\) is a convex function of \(Y\).

If samples of \(X\) and \(Y\) are available, an unbiased estimate of \(X/Y\) can be obtained by generating samples of \(X/Y\) via statistical bootstrap. 

Incidentally, if \(X\) and \(Y\) are independent zero mean Gaussian random variables, then their ratio has a Cauchy distribution. However, if \(X\) and \(Y\) are not zero mean, then their ratio had a distribution that isn't well known and has a pretty complex distribution. 

10th Annual NYAS Machine Learning Symposium

On March 4, I attended the 10th annual Machine Learning Symposium organized by the New York Academy of Sciences. This was the third time I attended. It is a nice day-long event with a 3-4 keynote talks and several posters and 5-minute talks by selected poster presenters. It was at this venue that I first heard David Blei talk about topic modeling and LDA in 2010. Below are the talks I enjoyed most this time. 


Alex Graves, Google DeepMind
This was the best talk of the day in my opinion. Alex presented a set of results on the role of specifying which parts of the input to pay attention to in deep neural networks. Deep networks naturally learn where to focus their attention for a particular task (e.g. which parts of an image are responsible for detecting the desired output), but being able to provide some side information to the network pertaining to where to focus can improve learning tremendously. The mathematical development of this idea makes use of smooth, differentiable operators for specifying where to focus attention, so that these operators can be included in gradient descent learning. Alex is famous for his work on handwriting recognition using Recurrent Neural Networks, and has a cool widget on his website where you can translate typed text into cursive handwriting in a chosen style. 

Sasha Rakhlin, UPenn
This keynote was more theoretical than Alex’s talk. The setting for Sasha’s talk was predicting attributes of a node in a graph (e.g. a social network), when the input predictors include the graph structure as well as non-graph features of the node, and data about nodes arrives in an online fashion. Training a machine in the offline setting (e.g. a classifier for whether a user in a social network will click on a given ad) using conventional methods runs into a combinatorial explosion. Sasha’s work shows that  the online setting allows for a relaxation that allows for polynomial time learning. 

Notable Posters: 

Multitask Matrix Completion for Learning Protein Interactions across Diseases
Meghana Kshirsagar, IBM Research
The matrix completion problem, exemplified by the Netflix Challenge is that a large matrix with a low rank has very few entries available and the task is to estimate the missing entries. The problem also arises in the context of protein-protein interactions between the proteins of disease causing germs and proteins of a host organism. While there are several known methods for matrix completion (e.g. nuclear norm minimization, Candes et al. ), this setting can also benefit from multi-task learning because often several diseases have similar responsible proteins. Multi-task learning, which is a general idea (not specific to matrix completion) that basically proposes that when multiple problems have portion of the signal in common, then using a combination of a shared representation and a specific representation can lead to better generalization. 

Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS
Boxiang Wang, University of Minnesota
Distance Weighted Discrimination was proposed as a superior alternative to the SVM, because unlike the SVM, the DWD cares about the distances of *all* data points from the separating hyperplane, rather than just the points closest to it. However, the DWD hasn’t yet enjoyed the popularity of the SVM because it require solving a SOCP rather than a quadratic program (SVM). This paper extends theoretical results on DWD (established Bayes risk consistency) and proposes an algorithm faster than second order cone programming. 


Comparing Cybersecurity and Trading in Financial Markets

  1. Both fields benefit from rigorous quantitative modeling for tasks such as detecting patterns and anomalies.

  2. Both financial trading and security operate in an adversarial environment. In finance, the adversary is the ‘market’, i.e. the rest of the participants. In security, the adversary is the set of attackers.

  3. Because of the adversarial environments, simple models do not work for very long, because the adversary evolves in response to actions taken by the other party, and it is hard for a model to capture the direction in which the adversary will evolve. Therefore both fields require that models be updated. Models are built using training data . The challenge therefore is to use the right mix of recent data and longer term historical data. How to do this is not obvious in both fields..

  4. Game theory is a useful tool for modeling problems in both fields because of the adversarial nature of the field.

  5. Modern financial instruments, apart from serving as investment vehicles, are designed for mitigating risk. The business of selling information security products or services is also one of mitigating risk — risk of information loss. There is no such thing as perfect security. Instead, organizations allocate an amount of money they consider appropriate to meet their risk appetite.

  6. One difference is that trading in financial markets often has the ‘self fulfilling prophesy’ property, i.e. it can act as a positive feedback system (if a certain fraction of people with opinion that the price of a security is likely to fall, sell their holdings, the rest start thinking its probably a good idea to sell, and the end result is that the price does fall), whereas there doesn’t seem to be an equivalent in cybersecurity.

A probabilistic Escher waterfall

If a fair coin is repeatedly tossed till either the sequence HHHHH or HTHHT occurs, which of the two sequences is likely to occur first? If you thought the two sequences are equally likely, you’d be wrong. The correct answer is HTHHT. A quick way to see this is that HHHHH requires five H’s to occur in succession and the occurrence of a single T destroys any progress made towards obtaining the sequence. On the other hand, for HTHHT, if after obtaining HTH, the 4th toss results in a T instead of an H, then the HT formed by the 3rd and 4th tosses can serve as the starting ‘HT’ of the same sequence.  

[Taken from:  Konold, Clifford. 1995. Confessions of a coin flipper and would-be instructor, American Statistician 49: 203–209]

One can analyze such problems graphically using state diagrams. For example, suppose we repeatedly toss a fair coin till either HHH or THH appear. Which is likely to appear first? The state transition diagram below represents the ‘race’ between the two sequences

It can be seen why THH is more likely to occur first: once the sequence gets to the state ‘T’, there is no path that leads to the state HHH.

Walter Penney, in his article in the Journal of Recreational Mathematics (October 1969, p. 241) showed a curious property of sequences of three (or more) tosses of fair coins: for any sequence of length three, one can find another sequence of length three, that has a higher probability of occurring first in repeated tosses of a fiar coin. This is fascinating, and not at all intuitive, because it leads to circular loops like the following: THH is less likely to occur before TTH (2 to 1), which is less likely than HTT (3 to 1), which is less likely than HHT (2 to 1), which is less likely than THH (3 to 1)!

Random walks in Manhattan

"Random Walks in Manhattan", pixels in Python and NumPy.

The black curve is a pure random walk in two dimensions with independent zero mean Gaussian increments of equal variance in each dimension. The red curve is an Ornstein Uhlenbeck process in two dimensions with mean at (0,0). Unlike a pure random walk, an Ornstein Uhlenbeck process experiences a pull back towards the origin whenever it wanders away from it in any direction, and therefore the OU process does not travel as far and wide as the random walk. Each image shows a trajectory that is a subset of the previous, showing \(2 \times 10^n\) steps, with \(n = 1,\ldots,6\). Notice how, when the number of steps is small, the RW and the OU processes look very similar, but with a larger number of steps, the mean reverting property of the OU process makes its trajectory look very different. 


"Gaussian script"

The piece above represents a new random alphabet. It was drawn by generating a sequence of 20 sample paths of a Gaussian Process in two dimensions with covariance kernel given by a squared exponential \(e^{-d^2/\lambda}\) where d is the distance in input space and \(\lambda\) is a relaxation constant. Note that the Ornstein Uhlenbeck process is also a Gaussian Process, with a covariance kernel given by the related absolute value exponential \(e^{-|d|/\lambda}\).


An afternoon with Geostatistics

Problem: Given a spatial field in 2 dimensions and time in one dimension, measurements are made at a few points in the three dimensions, and values at certain other points in the spatio-temporal field are required. 

Spatial Kriging. The 2-dimensional spatial version of the problem arises in several contexts. An oil company drills a few holes in the Earth’s crust for oil exploration and acquires a random variable that indicates the drilling potential at that spot. Each hole drilled costs a large sum of money. How best can the company estimate the drilling potential at other nearby spots without any further drilling? One way is to employ Gaussian Process Regression, where one places as Gaussian Process with a selected covariance kernel as a prior over the 2D space, adds the information gained by drilling at the few spots, and then computes a posterior distribution over the entire 2D field. This idea is termed kriging in geostatistics, after Daniel G. Krige, a South African mining engineer, who used similar techniques (weighted averaging, not GP regression) for measuring Gold grades in mines in South Africa, and whose work was taken up and formalized later by the French Mathematician Georges Matheron. 

The squared exponential Kernel is a popular choice for modeling covariance that depends only upon distance between two points and decays quickly. $$k(x_1, x_2) = \exp{\frac{(x_1-x_2)^2}{h^2}}$$

Spatio-temporal Kriging. The original problem above applies to a time varying spatial field. The idea of spatial kriging is naturally extensible to include the time dimension, if care is taken to ensure that the covariance kernel of the Gaussian Process degrades at an appropriate rate in the time dimension, which is chosen carefully, and separately from the the spatial dimension. One convenient alteration is to simply alter the bandwidth parameter of the temporal dimension. 

$$\exp{\frac{(x_1 - x_2)^2 + \alpha(t_1 - t_2)^2}{h^2}}$$ which of course turns out to be equivalent to simply using a product of two independent covariance kernels, one for space and one for time: $$\exp{\frac{(x_1 - x_2)^2}{h_x^2}} * \exp{\frac{(t_1 - t_2)^2}{h_t^2}}$$There is the question of whether such a kernel that is a product of two kernels, is guaranteed to product a positive-semidefinite matrix as the covariance metrix when arbitray values of \(x_1, x_2, t_1, t_2\) are plugged in. It is easy to show that the product of two valid kernel functions \(k = k_xk_t\) is also a valid positive-semidefinite kernel. This follows from observing that the covariance matrix for the kernel \(k\) is the Hadamard product (element-by-element product) of the covariance matrices resulting from the two component kernels. If \(K_x\) is the covariance matrix of a random vector \((x_1, x_2, \ldots, x_n)\) and \(K_t\) is the covariance matrix of a random vector \((t_1, t_2, \ldots, t_n)\), then the Hadamard product of \(K_x\) and \(K_t\) must be the covariance matrix of the random vector \((x_1t_1, \ldots, x_nt_n)\), and must therefore be positive semidefinite and hence \(k = k_x k_t\) must be a valid kernel.

In fact the product form of the kernel need not be limited to both kernels being of the squared exponential type. The temporal kernel can for instance be taken to be one where longer term correlations at the annual, monthly or weekly level can be exploited by the GP regression. This can be achieved via a sinusoidal temporal kernel or a product of a sinusoid with a slowly decaying exponential. Here is a cookbook of kernel functions from David Duvenaud, and there are several others in chapter 4 of the GPML book of Rasmussen & Williams. 

Notes on non-parametric regression & smoothing

Ordinary least squares regression analysis makes a few strong assumptions about the data:

• A linear relationship of y to the x’s, so that $$y | x = f(x) + \epsilon$$ • That the conditional distribution of y is, except for its mean, everywhere the same, and that this distribution is a normal distribution $$y \sim \mathcal{N}(f(x), \sigma)$$ • That observations are sampled independently, so the \(y_i\) and \(y_j\) are independent for \(i \neq j\).

There are a number of ways in this things can go wrong, for example:
• The errors may not be independent (common in time series data)
• The conditional variance of the residual error may not be constant in \(x\)
• The conditional distribution of \(y\) may be very non-Gaussian.

There are a few ways out, centering around non-parametric methods.  

Kernel Estimation: This is really a non-parametric smoothing method. The y value estimated for a given input x depends upon the weight of each labelled data point which in turn depends upon the distance between the input x and the labelled data points. 

$$\hat{y} = \frac{\sum_i w_i y_i}{\sum_i w_i},$$ where \(w_i = K(\frac{x - x_i}{h})\). \(K(\cdot)\) is a Kernel function, commonly taken to be the Gaussian Kernel, and \(h\) is a bandwidth parameter that dictates the extent to which labelled data points exert their influence in the weighted average. The weighted estimator is also called a Nadaraya-Watson Estimator. Unlike LOESS however, there is no fitting of parameters involved. The y value is simply the weighted average of the y values of all other labelled data points.

LOWESS / LOWESS: A non-parametric method (i.e. the mount of data one needs to keep around for a model specification grows linearly with the amount of training data) where, for every new input x for which the value y is desired, one computes a weighted version of the quadratic loss used in OLS linear regression. A weight is attached to each training data point, and the weight of a training point depends upon its the distance from the new input x.

SVM regression: Transform data to a higher dimension using the kernel trick and then perform linear regression using a specific type of loss function (epsilon - insensitive) rather than the usual quadratic (squared loss) function. The epsilon-insensitive loss function encourages the support vectors to be sparse. Imagine doing polynomial regression in the original space by including polynomial terms of the data. The SVM regression achieves a similar effect via the non-linear kernel transform. 

Gaussian Process regression: Also a non-parametric method, somewhat similar to LOESS in spirit in that all the data points need to be kept around (as in any non-parametric method), but this is a Bayesian method in that it allows not just a point estimate of y for a new input x, but a complete posterior distribution. This is achieved by placing a prior distribution on the values of a Guassian Process (A GP is simply a collection of random variables, any subset of which have a joint Gaussian distribution) via a covariance kernel function that only specifies the covariance between two Guassian Random variables in the GP given their x values (typically, only the distance between their x values matters). The posterior distribution of Y values for a given set of input X values also turns out to be a Gaussian distribution with an easily computable mean and covariance matrix. 

Unemployment Rate: The Devil is in the Details

The recent recession seems to have had a revealing effect on the economy. While the unemployment rate as of Dec 2014 is back to what it was in June 2008 (5.6%), the employed to population ratio as of Dec 2014 (59.2) is not what is was in June 2008 (62.2). Also, wile the former has dropped by about 5% since the official end of the recession, the latter ha barely moved up by 1%.

What explains the discrepancy? I think it reflects the manner in which the commonly reported unemployment rate is measured -- i.e. based on the people seeking work. If the number of people seeking work decreases, then the unemployment rate decreases, even if those people chose to remain unemployed, but the employed-to-population ratio still falls. 

Detecting Anomalies

In several fields, it is often required to detect the occurrence of an anomalous event while monitoring a stream of data. I first came across this problem in the context of network security during my time at AT&T Labs. In network security, the goal is to quickly identify suspected malicious activity, but it is difficult to know in advance what a problematic pattern will look like, especially when dealing with high dimensional data. On the other hand, it is typically easy to characterize non-abnormal data. 

I've found people sometimes mistakenly treat this problem as a 2-class classification problem. The 2-class approach is unsuitable because data corresponding to anomalous events may not arise from the same distribution each time an anomalous event occurs, and/or one is unlikely to have sufficient labelled samples of anomalous events to estimate the distribution of anomalous data properly. Therefore if one tries to apply a traditional classifier to this problem, one is likely to experience an unacceptably high missed detection rate (think of a SVM margin-maximizing hyperplane that positions itself between the nearest points from the non-anomalous class and a few points from the non-anomalous class. Since points from the anomalous class are few and/or they are drawn from different distributions, the hyperplane resulting from the SVM optimization is positioned closer to the anomalous points than it should be, and hence, an anomalous point may be wrongly classified as an anomalous point).

One approach to dealing with this problem is to focus on the distribution of the non-anomalous data. The problem then boils down to determining whether a given new test data point is drawn from the distribution of non-anomalous data or not. Except for the simplest of problems, the distribution of non-anomalous data may well be multi-modal, and we may not know the number of modes. The one-class SVM comes in handy here. The 1-class SVM basically attempts to estimate the density of the non-anomalous data non-parametrically by discovering a region in feature space that captures most (a user-supplied fraction) of the probability mass. It does this by finding a hyperplane (which is curved in the original space but linear in the original space -- thanks to the standard SVM kernel trick) that forces the specified fraction of data points to lie inside closed regions formed by the hyperplane in the original space. Happily, the 1-class SVM does this non-parametrically and can discover multi modal densities. The anomaly detection problem then boils down to determining whether a given new test data point lies inside or outside the regions of high-density of the non-anomalous data points.

The folks at scikit-learn have a good implementation of the 1-class SVM.  

Constructing a good anomaly detector is still a bit of an art because it requires proper selection of features to monitor. In a very high dimensional space, the curse of dimensionality is likely to make it difficult to build an anomaly detector with a small enough false positive rate. 

Religion, GDP and Happiness

The percentage of a country's population that feels religion plays an important role in their lives (x-axis) is negatively correlated with the GDP per capita (Y-axis, in log-scale):

Here are the heatmaps, first for religion:

and per capita GDP (PPP):

From the heat maps above, several countries in Africa and Asia stand out as having low per capita GDP and high importance of religion, while all of Scandinavia and Australia stand out as places with high per capita GDP and low importance of religion. The US is somewhat of an outlier with with highest per capita GDP in the world of ~$35k and about 65% of the population reporting that religion plays an important role in their lives.

For completeness sake, here is the heat map for life satisfaction, or long-term happiness:

Combining data on religion and GDP with data with data on long term happiness, we get roughly the following picture:

  1. Long term happiness correlates positively with per capita GDP.
  2. Long term happiness correlates negatively with importance of religion.
  3. Importance of religion correlates negatively with per capita GDP.

My guess is that the causal structure between the 3 variables: Religion (R), Happiness (H) and per capita GDP is the following:

When per capita  GDP is high, the basic needs of life are met, people are relatively comfortable, and therefore long term happiness is high, but for the same reason, very few people feel compelled to ask fundamental questions of the type that people turn to religion for. This has the effect of creating a negative correlation between Religion and Happiness. In the causal structure above, religion and happiness are related, but given per capita GDP, they are independent.

Creativity is Just Connecting Things

"Creativity is just connecting things. When you ask creative people how they did something, they feel a little guilty because they didn’t really do it, they just saw something. It seemed obvious to them after a while. That’s because they were able to connect experiences they’ve had and synthesize new things. And the reason they were able to do that was that they’ve had more experiences or they have thought more about their experiences than other people. (...)

Unfortunately, that’s too rare a commodity. A lot of people in our industry haven’t had very diverse experiences. So they don’t have enough dots to connect, and they end up with very linear solutions without a broad perspective on the problem. The broader one’s understanding of the human experience, the better design we will have."

- Steve Jobs

Lack of price discovery in the Healthcare Market

Price discovery is

... the process of determining the price of an asset in the marketplace through the interactions of buyers and sellers.

Buyers and sellers of a good or service set the price by expressing how much they are willing to buy or sell for. When a very large number of buyers and sellers interact in this way, the final price of the good or service settles to one that incorporates all this information. However, in the healthcare market, this mechanism is absent.

When you are sick and you visit a doctor, you do not get to ask up front what you will be charged for seeing the doctor! Instead, you present your health insurance card to the receptionist, thereby disclosing information about your ability to pay, and creating information assymetry. The other party (i.e. the healthcare provider) can now select a price , taking into account the information you have just disclosed, and that too, after you have made use of their services, rather than before. This is akin to getting to know the price of a haircut in a barber shop, after you have had the haircut. There is no price discovery mechanism in the healthcare industry in US. I recommend:

  1. Providers should not be allowed to see the insurance coverage a patient has. Each patient should simply be treated. Claims and payments should be handled by a third party, such as a government agency, that acts as a mediator between the hospital and the patient.
  2. Providers should be mandated to disclose a menu of charges, like a restaurant or a barber.
  3. Providers should charge the same fee, irrespective of whether a patient has insurance or not, or how much insurance coverage she has.

Ostensibly, providers check a patient's insurance card in order to make sure that the patient can pay for the services she is about to purchase. Instead, this function should be done by a 'lookup' over the internet on a server run by a government agency: Input = 'name of the medical procedure', Output = 'Covered or not'. This will prevent the provider from learning about the extent of coverages available to the patient. Not learning about this will force providers to more honestly assess fees, and compete with each other more realistically, bringing down costs borne by the patients. I think this will also prevent providers from passing on large unnecessary fees to insurance companies.

A formal mechanism design formulation of the problem would be interesting to tackle. 4 players: patients, providers, insurance companies, government.