Gradient descent

Gradient descent is used across a bewildering range of supervised machine learning problems as an optimization method to 'fit' a model to training data. Its use is natural when the objective function being optimized (often some kind of average error over the training set) is convex in the variables of optimization (usually the free parameters of the model being fit). However, gradient descent is (somewhat surprisingly) also used when the objective surface is not convex! For example, in training a neural network using the back-propagation algorithm, the free parameters of the network are updated using stochastic gradient descent over the error surface, which is non-convex because of the non-linearities introduced by the sigmoid (or other non-linear) units. Surprisingly, gradient descent seems to work fine. The reason is that with stochastic gradient descent, the weight updates are affected by noise in individual data points (noise that tends to get averaged out in conventional batch gradient descent), and this noise provides an opportunity to escape from local optima wells in the objective-function surface. Due to the non-linear sigmoid units, the error surface typically contains many local minima wells, as in the figure below, and these minima tend to be close to one another in value. Therefore, even if one is ultimately stuck in a local minimum, it is often not much worse than the global minimum.
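
To make the batch-versus-stochastic distinction concrete, here is a minimal NumPy sketch for a toy linear least-squares model; the function names, learning rate, and data are my own illustrative choices, not from any particular library.

```python
import numpy as np

def batch_gradient_step(w, X, y, lr=0.01):
    """One batch gradient descent step for the mean-squared-error loss."""
    # Gradient averaged over the whole training set -- noise cancels out
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def sgd_step(w, X, y, lr=0.01):
    """One stochastic gradient descent step using a single random example."""
    i = np.random.randint(len(y))      # pick one data point
    xi, yi = X[i], y[i]
    grad = 2.0 * xi * (xi @ w - yi)    # noisy estimate of the full gradient
    return w - lr * grad

# Toy usage: recover w from noisy observations y = X w + noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
for _ in range(1000):
    w = sgd_step(w, X, y)
print(w)   # should end up close to true_w
```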

There are a bunch of necessary pre-processing steps one must take when employing gradient descent in order to avoid crazy convergence times (a sketch of these steps follows the list).

  1. Normalize each input feature to have a mean of zero.
  2. Normalize the dataset so that there are no linear correlations between features. This is done by projecting each point onto an orthonormal basis -- effectively a rotation, which can be accomplished by multiplying the input data by a matrix containing the eigenvectors of its covariance matrix.
  3. Normalize each input feature so that it has the same variance. Doing this allows one to use the same learning rate for each free parameter in the model. If the features had different variances and one chose the same learning rate for each free parameter, then convergence in some directions would be very slow, while convergence in other directions would be quick.
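
A minimal NumPy sketch of these three steps (mean-centering, decorrelation via the eigenvectors of the covariance matrix, and variance equalization -- together essentially PCA whitening) might look like the following; the epsilon term and the function name are my own illustrative choices.

```python
import numpy as np

def whiten(X, eps=1e-8):
    """Pre-process data for gradient descent: zero-mean, decorrelated,
    equal-variance features (PCA whitening)."""
    # 1. Zero-mean each feature
    X = X - X.mean(axis=0)

    # 2. Rotate onto the eigenvectors of the covariance matrix,
    #    which removes linear correlations between features
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    X = X @ eigvecs

    # 3. Rescale each rotated feature to unit variance
    return X / np.sqrt(eigvals + eps)

# Example: correlated 2-D data becomes uncorrelated, unit-variance data
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [1.5, 0.5]])
clean = whiten(raw)
print(np.cov(clean, rowvar=False))   # approximately the identity matrix
```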

Can we de-parameterize Neural Networks?

During the last 3 months, I have been constantly surprised and impressed by the recent achievements of neural networks. In particular, convolutional neural networks (see the work of Yann LeCun) are currently the only technique that allows real-time object recognition from video input with reasonable performance. Another piece of neural network magic that has delighted me recently is autoencoders. In particular, a 2006 paper [PDF] in the journal Science by Geoffrey Hinton and Ruslan Salakhutdinov shows how multi-layer autoencoders (also called 'deep autoencoders') can be used to learn non-linear, low-dimensional representations of data. Yet another cool use of neural networks that I recently came across is paraphrase detection on long sentences using recursive autoencoders -- the problem here is to detect, given two sentences, whether or not they are the same (or close enough) in meaning. Here is the paper by Richard Socher, whose talk at NYU is where I first heard about this work.

However, I have been disappointed by the fact that there is no systematic way of determining the right number of layers, and the size of each hidden layer, to use in a neural network. Intuitively, there seems to be a relationship between the number of layers needed and the 'amount of non-linearity' in the problem. For example, suppose one is required to find the classification boundary in N-dimensional space between points belonging to two classes. If this boundary is 'highly non-linear', meaning that it is very contorted, then a deeper network is probably a good idea, because each additional layer gives the network greater capacity for learning a non-linear function: the non-linear units in the hidden layers act as building blocks for the non-linear surface. The issue is the standard one of model complexity versus data complexity -- a more complex model (e.g. a deeper neural network) is better able to capture complex data if the data really is that complex to begin with. If, however, the data isn't that complex, then an overly complex model can perform poorly, especially when data is scarce.
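
To make the 'architecture' question concrete, here is a minimal sketch of a deep autoencoder in PyTorch; the layer sizes are my own illustrative guesses, not those from the Hinton and Salakhutdinov paper, and the number and width of the layers are precisely the quantities that currently have to be chosen by hand.

```python
import torch
import torch.nn as nn

class DeepAutoencoder(nn.Module):
    """A deep (multi-layer) autoencoder: the encoder compresses the input
    through successively narrower non-linear layers; the decoder mirrors it."""
    def __init__(self, input_dim=784, code_dim=30):
        super().__init__()
        # These layer sizes are illustrative guesses -- exactly the kind of
        # architectural choice for which there is no systematic recipe.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.Sigmoid(),
            nn.Linear(256, 64), nn.Sigmoid(),
            nn.Linear(64, code_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(code_dim, 64), nn.Sigmoid(),
            nn.Linear(64, 256), nn.Sigmoid(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Usage: train the network to reconstruct its input, e.g. with squared error
model = DeepAutoencoder()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.rand(32, 784)                       # a dummy mini-batch
loss = nn.functional.mse_loss(model(x), x)    # reconstruction error
loss.backward()
optimizer.step()
```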

In a way, the fact that one needs to have some idea of what neural network architecture to use is a bit like the problem with parametric Bayesian methods. The recent development of Bayesian non-parametrics (see, for example, the work of Michael Jordan, Yee Whye Teh, David Blei and Zoubin Ghahramani) is an effort to fix that problem. Just as parametric Bayesian methods require the specification of a parametric family of distributions, thereby limiting the complexity that the model can capture, neural networks require the specification of an architecture, which, if not well tuned to the data at hand, can underfit or overfit the data. The parameters in the case of neural networks are the number, size and type of layers in the network.

It appears to me that a similar effort is needed in the domain of neural networks -- i.e. to come up with a systematic approach for estimating how much 'non-linearity' or complexity is present in data, so as to arrive at a neural network architecture suitable for the problem at hand. At present, practitioners seem to mostly employ trial-and-error to guide the selection of an appropriate architecture (a sketch of what that looks like in practice follows). I, for one, would start using neural networks much more frequently in my research were there a method to estimate a good architecture.
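
For illustration, here is roughly what that trial-and-error looks like with scikit-learn: a cross-validated grid search over a handful of candidate hidden-layer configurations. The candidate architectures and the toy dataset below are arbitrary choices of mine, not a recommended recipe.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# A small, non-linearly separable toy dataset
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Trial-and-error over architectures: each tuple lists the hidden layer widths
candidate_architectures = [(8,), (32,), (32, 16), (64, 32, 16)]

search = GridSearchCV(
    MLPClassifier(max_iter=2000, random_state=0),
    param_grid={"hidden_layer_sizes": candidate_architectures},
    cv=5,
)
search.fit(X, y)

print(search.best_params_)   # the architecture that cross-validated best
print(search.best_score_)    # its mean validation accuracy
```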

Intelligence is overrated.

Peter Norvig is, by now, well known to have stressed in 'The Unreasonable Effectiveness of Data' that "a decent learning algorithm with tons of data can do much better than a very advanced learning algorithm with little data". Thomas Edison is famously reported to have said "Genius is 1% inspiration, and 99% perspiration".

I see a kind of parallel between the two trains of thought above. In particular, I have noticed that a lot of people build their careers on a reasonable amount of intelligence plus lots of hard work, while some others build their careers on sheer intelligence. Invariably, the ones with careers based on hard work are monetarily more successful. In contrast, the ones with careers based on intelligence tend to get lazy after a while. Examples of the intelligence category are tenured professors, and examples of the hard work category are people in sales, consulting, and general management. Notice that the latter category also requires intelligence, but it is the hard work that is directly tied to the payoff. On the other hand, publishing papers from the ivory tower is much heavier on raw intelligence than on hard work.

I feel this observation explains a lot of the difference in (monetary) success between the two types of people. It should serve as a relief to people who overrate the value of intelligence.

The best case, of course, is to have lots of intelligence *and* be very hardworking. Just like a super cool learning algorithm showered with tons of data. However, a very good alternative is to have reasonable intelligence and a ton of hard work!

Mobicom paper accepted!

My Mobicom paper has officially been accepted - now that several rounds of changes and formatting issues have been sorted out (I was both surprised and amazed at the painstaking detail with which ACM perused the paper; IEEE standards pale in comparison).

The paper is titled 'Radio-telepathy: Extracting Secret Bits from an Unauthenticated Wireless Channel'. In this paper, I teamed up with some folks over at InterDigital to build a system that can use the received signal from 802.11 cards to 'extract' bits at the two ends of a wireless channel, in such a way that a third user cannot infer any useful information about the bits that were extracted. The idea is to enable generation of identical bit sequences at the two ends, which can then be used as cryptographic keys for encrypting future communications. What is more, these keys can be refreshed at regular intervals using channel information that the 802.11 system can extract from regular received packets.

The system we proposed provides an analog of quantum cryptography for 'everyday wireless channels'. While QC relies on the quantum physics of photons, our proposed system relies on two simple, somewhat surprising but verifiable properties of the wireless medium: (1) the channel decorrelates in space very quickly - over distances of the order of a wavelength (a few cm for 802.11), and (2) the wireless channel is, to a good extent, reciprocal. This means that although it is continuously changing in time, at a fixed instant of time and at a fixed frequency, the channel behaves in exactly the same way irrespective of whether Alice transmits and Bob receives, or vice-versa. In practice, it is hard to have both users transmit and receive simultaneously, so there is a small time delay between transmissions in the two directions. However, if this delay is small, then the channel only has a chance to change by a small amount and is still heavily correlated.
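
As a toy illustration of the idea (not the algorithm from the paper), one can simulate two reciprocal but noisy channel measurements and quantize them into bits using a guard band around the median; the noise levels and thresholds below are arbitrary assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(1)

# Reciprocity: Alice and Bob observe the same underlying channel gain,
# each corrupted by a little independent measurement noise.
true_channel = rng.normal(size=200)               # channel gain over 200 probes
alice = true_channel + 0.05 * rng.normal(size=200)
bob   = true_channel + 0.05 * rng.normal(size=200)

# Spatial decorrelation: an eavesdropper a wavelength or more away
# sees an essentially independent channel.
eve = rng.normal(size=200)

def quantize(samples, guard=0.3):
    """Map each sample to a bit (above/below the median); mark samples
    inside a guard band around the median as unreliable."""
    med = np.median(samples)
    bits = (samples > med).astype(int)
    reliable = np.abs(samples - med) > guard      # guard band reduces mismatches
    return bits, reliable

bits_a, ok_a = quantize(alice)
bits_b, ok_b = quantize(bob)
bits_e, _    = quantize(eve)

# Both sides keep only the probe indices each found reliable; exchanging these
# indices publicly reveals nothing about the bit values themselves.
common = ok_a & ok_b
print("Alice/Bob agreement:", (bits_a[common] == bits_b[common]).mean())  # ~1.0
print("Alice/Eve agreement:", (bits_a[common] == bits_e[common]).mean())  # ~0.5
```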

The most notable point about our algorithm is that it provides information theoretic secrecy. This means that the secrecy of the keys extracted is unconditional - it does not depend upon the assumption of a computationally bounded adversary or the computational hardness of a mathematical problem.

I'm looking forward to attending the conference in September. There are a number of other very interesting papers in wireless and otherwise.