Deep learning thrives on problems where automatic feature learning is crucial and lots of data are available. Large amounts of data are needed to optimize the large number of free parameters in a deep network. In many domains that would benefit from automatic feature learning, large amounts of data are not available, or are not available initially. This means that models with low capacity must be used with hand-crafted features, and if large amounts of data become available at a later time, it may be worthwhile to learn a deep network on the problem. If deep learning really reflects the way the human brain works, then I imagine that the network being trained by a baby must not be very deep, and must have fewer parameters than networks that are trained later in life, as more data becomes available. This is reminiscent of Bayesian nonparametric models, wherein the model capacity is allowed to grow with increasing amounts of data.
On March 4, I attended the 10th annual Machine Learning Symposium organized by the New York Academy of Sciences. This was the third time I attended. It is a nice day-long event with 3-4 keynote talks and several posters and 5-minute talks by selected poster presenters. It was at this venue that I first heard David Blei talk about topic modeling and LDA in 2010. Below are the talks I enjoyed most this time.
Alex Graves, Google DeepMind
This was the best talk of the day in my opinion. Alex presented a set of results on the role of specifying which parts of the input to pay attention to in deep neural networks. Deep networks naturally learn where to focus their attention for a particular task (e.g. which parts of an image are responsible for detecting the desired output), but being able to provide some side information to the network pertaining to where to focus can improve learning tremendously. The mathematical development of this idea makes use of smooth, differentiable operators for specifying where to focus attention, so that these operators can be included in gradient descent learning. Alex is famous for his work on handwriting recognition using Recurrent Neural Networks, and has a cool widget on his website where you can translate typed text into cursive handwriting in a chosen style.
Sasha Rakhlin, UPenn
This keynote was more theoretical than Alex's talk. The setting for Sasha's talk was predicting attributes of a node in a graph (e.g. a social network), when the input predictors include the graph structure as well as non-graph features of the node, and data about nodes arrives in an online fashion. Training a machine in the offline setting (e.g. a classifier for whether a user in a social network will click on a given ad) using conventional methods runs into a combinatorial explosion. Sasha's work shows that the online setting admits a relaxation that enables polynomial-time learning.
Multitask Matrix Completion for Learning Protein Interactions across Diseases
Meghana Kshirsagar, IBM Research
The matrix completion problem, exemplified by the Netflix Challenge, is to estimate the missing entries of a large, low-rank matrix of which only a few entries are observed. The problem also arises in the context of protein-protein interactions between the proteins of disease-causing germs and the proteins of a host organism. While there are several known methods for matrix completion (e.g. nuclear norm minimization, Candes et al.), this setting can also benefit from multi-task learning, because several diseases often have similar responsible proteins. Multi-task learning is a general idea (not specific to matrix completion): when multiple problems have a portion of their signal in common, using a combination of a shared representation and problem-specific representations can lead to better generalization.
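As an aside, the factorization view of matrix completion is easy to sketch in code. The toy example below is my own illustration (not the method from the talk, and not nuclear norm minimization): it recovers the one hidden entry of a rank-1 matrix by gradient descent on a two-factor model, fitting only the observed entries.

```python
def complete_rank1(M, observed, steps=3000, lr=0.05):
    """Fill in a matrix assumed to be rank 1, by gradient descent on the
    factorization u * v^T using only the observed entries."""
    n = len(M)
    u = [0.1 * (i + 1) for i in range(n)]   # small, asymmetric deterministic init
    v = [0.1 * (n - i) for i in range(n)]
    for _ in range(steps):
        du = [0.0] * n
        dv = [0.0] * n
        for i in range(n):
            for j in range(n):
                if observed[i][j]:
                    err = u[i] * v[j] - M[i][j]   # error on observed cells only
                    du[i] += err * v[j]
                    dv[j] += err * u[i]
        u = [u[i] - lr * du[i] for i in range(n)]
        v = [v[j] - lr * dv[j] for j in range(n)]
    return [[u[i] * v[j] for j in range(n)] for i in range(n)]

M = [[1.0, 2.0], [2.0, 4.0]]               # rank-1 ground truth
observed = [[True, True], [True, False]]   # bottom-right entry is hidden
M_hat = complete_rank1(M, observed)
```

Since the three observed entries pin down the rank-1 structure, the hidden entry should come out close to 4.0.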
Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS
Boxiang Wang, University of Minnesota
Distance Weighted Discrimination (DWD) was proposed as a superior alternative to the SVM because, unlike the SVM, DWD cares about the distances of *all* data points from the separating hyperplane, rather than just the points closest to it. However, DWD hasn't yet enjoyed the popularity of the SVM because it requires solving a second-order cone program (SOCP) rather than a quadratic program (as the SVM does). This paper extends the theoretical results on DWD (establishing Bayes risk consistency) and proposes an algorithm faster than second-order cone programming.
... the process of determining the price of an asset in the marketplace through the interactions of buyers and sellers.
Buyers and sellers of a good or service set the price by expressing how much they are willing to buy or sell for. When a very large number of buyers and sellers interact in this way, the final price of the good or service settles to one that incorporates all this information. However, in the healthcare market, this mechanism is absent.
When you are sick and you visit a doctor, you do not get to ask up front what you will be charged for seeing the doctor! Instead, you present your health insurance card to the receptionist, thereby disclosing information about your ability to pay and creating information asymmetry. The other party (i.e. the healthcare provider) can now select a price taking into account the information you have just disclosed, and it does so after you have made use of its services, rather than before. This is akin to learning the price of a haircut in a barber shop after you have had the haircut. There is no price discovery mechanism in the US healthcare industry. I recommend:
- Providers should not be allowed to see the insurance coverage a patient has. Each patient should simply be treated. Claims and payments should be handled by a third party, such as a government agency, that acts as a mediator between the hospital and the patient.
- Providers should be mandated to disclose a menu of charges, like a restaurant or a barber.
- Providers should charge the same fee, irrespective of whether a patient has insurance or not, or how much insurance coverage she has.
Ostensibly, providers check a patient's insurance card in order to make sure that the patient can pay for the services she is about to purchase. Instead, this function should be done by a 'lookup' over the internet on a server run by a government agency: Input = 'name of the medical procedure', Output = 'Covered or not'. This will prevent the provider from learning about the extent of coverage available to the patient. Not learning this will force providers to assess fees more honestly and compete with each other more realistically, bringing down the costs borne by patients. I think this will also prevent providers from passing large unnecessary fees on to insurance companies.
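The proposed lookup service is simple enough to sketch. Everything below is hypothetical: the patient IDs, procedure names, and the in-memory 'database' are placeholders for whatever the government-run server would actually hold.

```python
# Hypothetical stand-in for the government agency's coverage database.
COVERAGE_DB = {
    ("patient-123", "appendectomy"): True,
    ("patient-123", "cosmetic-rhinoplasty"): False,
}

def is_covered(patient_id, procedure):
    """Answers only 'covered or not' -- never the extent of coverage --
    so the provider learns nothing about the patient's ability to pay."""
    return COVERAGE_DB.get((patient_id, procedure), False)
```

The key design point is the boolean return type: the interface is deliberately too narrow to leak pricing-relevant information to the provider.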
A formal mechanism design formulation of the problem would be interesting to tackle. 4 players: patients, providers, insurance companies, government.
Conventional wisdom says that the marginal gains accrued per unit increase in effort are diminishing. In some cases this does not hold, and the gains seem to be super-linear, rather than sublinear, in the amount of effort. Consider for a moment that you are driving in your car, let's call it Car 1, down a long straight road that has a series of traffic lights. A second car, let's call it Car 2, is driving near you at a speed that is only very slightly greater than yours, inching ahead of your car very slowly. At time t, the distance by which Car 2 is ahead of you is d(t).
As you near the next traffic signal, the light turns amber and you begin to slow down. But the driver of the other car doesn't need to slow down, and makes it through the light without stopping, while you wait at the traffic signal. The gap between the two cars has suddenly increased. If the two cars were to maintain the same small difference in speeds throughout the road, slowing and stopping only at traffic lights, then at time t, the distance d(t) between them will be a non-linear function, quite different from the linear function it would have been had there been no traffic lights. What do you think d(t) would look like?
It's not hard to reason out the shape of d(t). At any given time, either both cars are moving, in which case they have a relative velocity of v2-v1, or Car 1 is stopped at a signal (relative velocity of v2), or Car 2 is stopped at a signal (relative velocity of -v1). So the plot of d(t) versus t would be piecewise linear, with one of 3 possible slopes in any segment. In the figure below, each segment is labelled with the car(s) moving during that segment.
Over a long enough time t, d(t) will be significantly above the dotted line, i.e. the value of d(t) that one would expect if there were no traffic lights. Therefore, in the long term, Car 2, which started out with a small velocity advantage, ends up far ahead of Car 1, thanks to the non-linearities introduced by the traffic lights.
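A quick simulation makes this concrete. This is only a sketch: the speeds, light spacing, and signal timings below are made-up values, and all lights are assumed to share one synchronized green/red cycle.

```python
def position(v, t_max, gap=500, green=30, red=20):
    """Position (metres) of a car travelling at v m/s after t_max seconds,
    on a road with a traffic light every `gap` metres, all lights sharing
    a cycle of `green` seconds green followed by `red` seconds red."""
    cycle = green + red
    pos = 0.0
    for t in range(t_max):
        nxt = pos + v
        light = (int(pos // gap) + 1) * gap     # next light strictly ahead
        if nxt >= light and (t % cycle) >= green:
            pos = light - 0.01                  # light is red: wait at the line
        else:
            pos = nxt
    return pos
```

With two speeds v1 < v2, the gap is d(t) = position(v2, t) - position(v1, t): with no red phase it grows linearly at v2-v1, while the lights turn it into the piecewise-linear curve described above, with occasional large jumps when one car clears a light the other misses.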
I have often mused about the similarity between the scenario above and other aspects of life, such as career growth. Just like the traffic lights, there are a number of thresholds in real life, e.g. cut-off marks for the A+ grade in exams, a threshold for being accepted at a prestigious university or a job interview, or the difference between getting and not getting an award. Two individuals that differ by only a small amount will, over the course of a long time interval, find themselves on different sides of these thresholds often enough. Thus, small differences in skill level and/or determination between very similar workers get amplified, leading to drastic differences in their achievements over a lifetime. Natural processes around us, including interactions between humans, are highly non-linear, and my suspicion is that humans do not accurately perceive the extent of the non-linearities. The fact that long-term benefit can be superlinear in effort is a significant realization, because it means that putting in just a little bit of extra effort -- e.g. an extra hour at work daily, an extra skill, extra effort in networking or maintaining relationships -- can have a disproportionately positive effect in the long term. I'll end with a passage from Richard Hamming's 1986 lecture at Bell Labs titled 'You and Your Research' that touches upon this:
... Now for the matter of drive. You observe that most great scientists have tremendous drive. I worked for ten years with John Tukey at Bell Labs. He had tremendous drive. One day about three or four years after I joined, I discovered that John Tukey was slightly younger than I was. John was a genius and I clearly was not. Well I went storming into Bode's office and said, "How can anybody my age know as much as John Tukey does?" He leaned back in his chair, put his hands behind his head, grinned slightly, and said, "You would be surprised Hamming, how much you would know if you worked as hard as he did that many years." I simply slunk out of the office!
What Bode was saying was this: "Knowledge and productivity are like compound interest." Given two people of approximately the same ability and one person who works ten percent more than the other, the latter will more than twice outproduce the former. The more you know, the more you learn; the more you learn, the more you can do; the more you can do, the more the opportunity - it is very much like compound interest. I don't want to give you a rate, but it is a very high rate. Given two people with exactly the same ability, the one person who manages day in and day out to get in one more hour of thinking will be tremendously more productive over a lifetime. I took Bode's remark to heart; I spent a good deal more of my time for some years trying to work a bit harder and I found, in fact, I could get more work done.
1 hour extra per day turns out to equal an extra ~1.5 months of 8-hour work days per year!
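Here is the arithmetic behind that claim, plus a toy version of Bode's compounding argument. The exponential-growth model is my own assumption for illustration, not anything Hamming specified.

```python
import math

# Bode's compounding claim as a toy model: assume knowledge grows in
# proportion to (effort x current knowledge), so k(t) = k0 * exp(r * e * t).
# Working 10% harder multiplies lifetime output by exp(0.1 * r * t), which
# exceeds 2x once r * t > ln(2) / 0.1, i.e. about 7 'compounding units'.
def output_ratio(extra_effort, rate_times_years):
    return math.exp(extra_effort * rate_times_years)

# The post's arithmetic: 1 extra hour/day over ~250 working days a year.
extra_days = 250 * 1 / 8        # extra 8-hour days per year
extra_months = extra_days / 21  # months of ~21 workdays each
```

The 250 extra hours come to 31.25 extra 8-hour days, i.e. roughly 1.5 working months per year.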
This post builds upon an earlier post on modeling habit formation. To follow this post, please read that one first. Go on, I'll wait. The urn model in that post is not really a complete description of habit formation, because the distribution of the limiting fraction of white balls (after a very large number of draws) is completely determined by the initial configuration of the urn (i.e., the number of white and black balls). In reality, habits are formed from repeated actions, and our actions are based on not just our inclinations but also the conscious choices that we make, which can sometimes be against our inclinations. Free will is missing from this model. There is no way to break a bad habit if one starts out with one.
Therefore, let us alter the basic urn model slightly to capture choice: At each step, before picking a ball, we explicitly declare which color we are interested in for that pick: a white ball ( = follow the desirable habit) or a black ball ( = follow the undesirable habit). Suppose we choose to aim for a white ball. A ball is then sampled from the urn at random, and if a white ball does indeed come up, then:
- A payoff of $W is received (We wanted white and we got white, so we are happy to the extent of $W).
- The white ball is placed back into the urn, along with an extra white ball (Reinforcement).
But if we decide to pick white and a black ball comes up, then there is no payoff, and the black ball is placed back in the urn without any extra ball. However, if we choose black, a black ball is guaranteed to show up, we get a payoff of #B (where # is a different currency from $), and the black ball is placed back along with an extra black. The extra black ball makes it harder to pick a white ball later, when a white is desired.
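The rules above fit in a few lines of code. A minimal sketch (the payoff amounts W and B are parameters, not values fixed by the model):

```python
import random

def play_round(nwhite, nblack, choose_white, W=1.0, B=1.0):
    """One round of the modified urn game.
    Returns (nwhite, nblack, dollars_earned, hashes_earned)."""
    if choose_white:
        if random.random() < nwhite / (nwhite + nblack):
            return nwhite + 1, nblack, W, 0.0   # got white: $W, reinforce white
        return nwhite, nblack, 0.0, 0.0          # got black: no payoff, no change
    return nwhite, nblack + 1, 0.0, B            # chose black: #B guaranteed, reinforce black
```

Note the asymmetry: aiming for white is a gamble whose odds are the current white fraction, while choosing black always pays off and always makes the urn worse.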
Consider the implications of this model:
- Suppose there are very few white balls compared to black. A white ball would rarely show up. With the decision to pick a white, most of the time there will be no payoff and no reinforcement. But with the decision to pick a black ball, there is a guaranteed payoff of #B units, at the cost of making it harder to pick a white later on.
- As the bad habit is reinforced, the chance of reverting to the good habit never dies away completely, but diminishes as the number of black balls grows in comparison to the whites. ('It's never too late, but it does get harder').
- The purpose of the different currencies is to model the fact that the pleasure obtained from following a bad habit is qualitatively different from the one obtained from following a good one. Since most habits are based on short-term pleasure at the expense of long-term benefit (e.g. cigarette smoking, binge eating/drinking, putting off exercise), currency # may correspond to lots of short-term kicks, while $ may correspond to long-term well-being.
I believe human behavior is fundamentally driven by a desire to be happy. However, different individuals can have very different ideas about what will make them happy. Since most bad habits result from a focus on short-term happiness at the expense of long-term happiness, we would do well to make the following tweak: # and $ can be added to give a total payoff, but each unit of # currency is worth only $latex \epsilon < 1$ units one iteration after it was earned. The other currency, $, does not decay with time. The question is: what behavior emerges when the objective is simply maximization of total payoff? It is intuitive that the answer will depend upon whether the game has a finite or an infinite horizon. Since players have finite lifetimes, it is appropriate to use a finite time horizon, but since a person does not know when he/she will die, the horizon must be a life expectancy with some uncertainty (variance). In other words, nobody knows when they will die, but everybody knows they will die. The only time it makes logical sense to pick a black ball is if the player believes his remaining lifetime is so short that the expected cumulative gain from deciding to pick whites is smaller than the expected cumulative gain (including the decay effect) from picking blacks. Of course, this depends upon how he spent his past -- i.e. whether or not he built up a high fraction of whites.
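The decaying-currency tweak is easy to state in code. A sketch, assuming # units decay geometrically, losing a factor of $latex \epsilon$ per iteration after being earned:

```python
def total_value(dollar_times, hash_times, now, eps=0.5):
    """Total payoff at time `now`: $ units hold their value, while a # unit
    earned at time t is worth eps**(now - t), decaying once per iteration."""
    return len(dollar_times) + sum(eps ** (now - t) for t in hash_times)
```

For example, three $ units plus # units earned at times 1, 2 and 3, valued at time 3 with eps = 0.5, total 3 + 0.25 + 0.5 + 1 = 4.75: the # earnings are already worth much less than their face value.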
Breaking a bad habit
Suppose a person starts out with a predominant inclination towards an undesirable habit and wishes to switch to the good habit. It is clear that if she always chooses the white ball, then she will eventually succeed in creating a larger number of white balls and a good $ balance. But she will have to go through many tries with no payoff before that happens. In practice, this requires determination, and she may be tempted to pick the black ball because it is so easily available.
Suppose $latex D \in [0,1]$ is the fraction of times that she will decide to aim for a white ball, i.e. the player's determination. It is intuitive that a larger value of $latex D$ would help her switch to the good habit in fewer iterations. It would be interesting to see if there is a threshold value of $latex D$ below which the bad habit is never broken. In particular, I expect there to be a threshold value $latex D(w,N-w)$, below which the bad habit is never broken, w.p. 1, where $latex w$ is the number of whites in the urn and $latex N-w$ is the number of blacks.
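One could probe for such a threshold empirically. A minimal sketch (the iteration count and starting urn are arbitrary; this explores the question numerically rather than deriving anything):

```python
import random

def final_white_fraction(w, b, D, niters=5000, seed=0):
    """White fraction after niters rounds when the player aims for white
    a fraction D of the time and gives in to black otherwise."""
    rng = random.Random(seed)
    for _ in range(niters):
        if rng.random() < D:                  # determined: aim for white
            if rng.random() < w / (w + b):
                w += 1                        # drew white: reinforce it
            # drew black while aiming white: urn unchanged, no payoff
        else:                                 # give in: black is guaranteed
            b += 1
    return w / (w + b)
```

Even starting from 1 white and 10 black, full determination (D = 1) drives the white fraction towards 1, while D = 0 drives it towards 0; sweeping D between the extremes, for various starting urns, would show whether a sharp threshold $latex D(w,N-w)$ exists.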
Game over when wealth < 1 unit
Now let us further suppose that, in the course of playing the game, a player's wealth can never fall below 1 currency unit. In other words, the game stops if the player's total currency falls below 1 unit. All other rules are the same: 1 unit earned from a white ball, 1 unit from a black ball, but units earned from black balls decay to $latex \epsilon$ of their value every time slot. Deciding to pick a black guarantees a black, and an extra black goes in. Deciding to pick a white provides a white with probability equal to the fraction of whites in the urn, and an extra white goes in. With the new rule stipulating 'game over' when the player's wealth falls below 1, the player is incentivized to pick a black if she ever approaches 1 unit of wealth. This is meant to model the inclination of individuals with low self-worth and happiness levels to engage in activity that yields short-term happiness (alcohol, drugs, binge eating, etc.) at the expense of long-term prospects.
The problem with conventional text documents is that you cannot easily query them. Suppose I have a 20-page piece of text explaining various things, perhaps with mathematical equations, plots, etc. In order to figure something out, I need to patiently read through the document till I have my answer. When time is short, and usually time is short, I will skip some parts in a hurry to get to what I need. The skipped parts might contain intermediate facts or definitions, making it impossible for me to understand what I am looking for even when I do find it. This happens because information is presented in the document in an order decided by the author. Perhaps this is the best order. But for a specific query (e.g. what is the main result of this paper?), it would be nice to have a way to do better than patiently reading the whole document. The best we have right now is Ctrl+F on words and phrases, which just doesn't cut it for queries of reasonable complexity (e.g. 'Why did the GDP of Germany double so quickly?' while reading an article on the history of the German economy).
The computer doesn't understand the document :-(
It would be pretty useful to somehow model documents as databases that can support complex queries. Especially when the documents are large, like books. Perhaps I am asking for the holy grail of NLP? Surely, if IBM can build a Jeopardy-winning Watson, it should be possible to build a question-answering service for individual documents? Outside information is permitted.
Tasks that require a high cognitive load, such as thinking about a problem, reading a research paper or writing C++ code, are very sensitive to interruptions. I find that when I am interrupted during such activity, say by a phone call, or by a co-worker asking me something, the net cost of the interruption is not just the amount of time I spend attending to the call or person. It's as if I have a short-term RAM in my head, and I have to re-load the entire context of what I was doing when I get back to it. And this can take a lot of time. What I like are large chunks of uninterrupted time, not several small chunks, even if they add up to the same total. However, shutting oneself off from all interactions with others at one's workplace is not the solution, because one runs the risk of not interacting enough with colleagues. Richard Hamming said about Bell Labs:
I notice that if you have the door to your office closed, you get more work done today and tomorrow, and you are more productive than most. But 10 years later somehow you don't quite know what problems are worth working on; all the hard work you do is sort of tangential in importance.
I have personally found that conversations over coffee and lunch have led to much more interesting directions in my work than solitary thinking. While Hamming was speaking about a research lab environment, I suspect his statement holds true in other creative domains as well.
The apparently contradictory requirements above can be resolved by planning out both solitary time and interaction time during the work week. Both are necessary. I feel creative work requires a combination of two distinct phases: one that allows focused concentration on a task at hand, and another for free discussions, exchange of ideas, and making oneself available to others as a sounding board. Exploit and Explore. An ideal workspace should provide for both. It is said of the original Bell Labs building that its designer deliberately created long corridors, so that researchers would be forced to encounter one another. Steven Johnson explains in his book 'Where Good Ideas Come From' that coffee houses played a crucial role, because they were breeding grounds for collective thinking.
During my travels on the Internets, I have come across a number of writings on the impact of interruptions on productivity that I can attest to from my own experience:
Paul Graham writes about meetings:
I find one meeting can sometimes affect a whole day. A meeting commonly blows at least half a day, by breaking up a morning or afternoon. But in addition there's sometimes a cascading effect. If I know the afternoon is going to be broken up, I'm slightly less likely to start something ambitious in the morning. I know this may sound oversensitive, but if you're a maker, think of your own case. Don't your spirits rise at the thought of having an entire day free to work, with no appointments at all? Well, that means your spirits are correspondingly depressed when you don't. And ambitious projects are by definition close to the limits of your capacity. A small decrease in morale is enough to kill them off.
Joel Spolsky writes about context switching between different projects.
Some interruptions are self-created: compulsive checking of email, for instance, or deciding to multitask, or trying to handle too many disconnected projects. I find I can do one major thing per day; if I try to do 2 or more major things, I risk accomplishing none. Having a smartphone set to beep when an email arrives doesn't help. I switched my phone to not auto-check email long ago.
An urn contains balls of two colors, white and black. A ball is drawn at random, and then put back in the urn along with a new ball of the same color. This process is repeated many times. What is the ratio of white balls to black balls in the urn after $latex n$ tries? What is the ratio as $latex n \rightarrow \infty$? The answer depends upon the initial number of white and black balls of course. This is the Polya Urn problem, and I find it to be an interesting way to think about the process of habit building. Imagine that the two types of balls represent two opposing choices that we can make about a certain aspect or activity in our lives, e.g. waking up early or not waking up early, doing regular exercise or not, smoking a cigarette after dinner, or not, completing homework on time every evening vs. playing video games during that time, etc.
[sourcecode language="python"]
# Polya urn simulation in Python
import random
import matplotlib.pyplot as plt

nruns = niter = 200
whitenum = blacknum = 1      # initial balls of each color
for i in range(nruns):
    frac = []
    nwhite, nblack = whitenum, blacknum
    for j in range(niter):
        if random.random() < nwhite / float(nwhite + nblack):
            nwhite += 1
        else:
            nblack += 1
        frac.append(nwhite / float(nwhite + nblack))
    plt.plot(frac)
plt.show()
[/sourcecode]
Let us say that picking a white ball corresponds to the desirable activity in these examples, and picking a black ball corresponds to the undesirable one. The fraction of white balls in the urn at any given time represents our proclivity to pick the desirable activity at that time.
This model has some interesting properties. First, if the number of balls of each type to begin with are equal, then the fraction of white balls at any step, and in particular at $latex n \rightarrow \infty$ is a random variable that behaves as follows:
Above: Histogram of the fraction of white balls after 20k iterations
Note that extreme outcomes are possible as $latex n \rightarrow \infty$ if appropriate choices are enforced early on. Also, the final value is attained after only a few iterations. Early iterations provide the opportunity for large swings in the final outcome, but there are hardly any swings after about 50 iterations. Similarly, the longer a habit has had to cement itself, the harder it is to change.
But the model above converges to an almost uniform distribution on the fraction of white balls after a large number of tries. Perhaps starting with 1 white and 1 black isn't quite right, because we do not necessarily begin life with equal tendencies for picking each side of a binary choice. If we set the number of white balls to be greater than the number of black balls at the start, then the distribution of the fraction of white balls after many tries is, as expected, skewed in favor of white balls. Again, the fraction of white balls usually settles down to a fairly stable value that is typically determined largely by the actions in the first few iterations.
Above: Fraction of whites, starting with 1 black and 15 white balls
Distribution of fraction of whites after 20k iterations starting with 5 whites, 1 black
If the initial number of balls is allowed to be fractional, then the limit distribution has most of its mass near 0 and 1. The plot below shows 500 iterations starting with 0.2 white and 0.2 black balls. Perhaps this better models habit formation in some people.
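The fractional-start claim is easy to check empirically. A sketch (200 runs of 2000 draws each; the 0.1/0.9 cutoffs defining an 'extreme' outcome are arbitrary): with initial counts of 0.2 and 0.2, the limiting white fraction follows a U-shaped Beta(0.2, 0.2) law, so most runs should end near 0 or near 1.

```python
import random

def final_fraction(w, b, niters=2000, seed=0):
    """White fraction after niters Polya-urn draws, allowing fractional
    initial 'ball counts' w and b."""
    rng = random.Random(seed)
    for _ in range(niters):
        if rng.random() < w / (w + b):
            w += 1
        else:
            b += 1
    return w / (w + b)

# 200 independent runs starting from 0.2 white / 0.2 black
finals = [final_fraction(0.2, 0.2, seed=s) for s in range(200)]
extreme_share = sum(f < 0.1 or f > 0.9 for f in finals) / len(finals)
```

Well over half the runs should land in the extreme bins, in contrast to a 1-and-1 start, whose uniform limit puts only about 20% of runs there.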
Habit building is not the only thing this model can describe. There are a number of phenomena in which reinforcement plays a role:
- Popularity of a brand: Among competing brands of similar quality, one that is slightly more popular may get picked more by customers, making it even more popular ("So many people choose this brand, so it must be good"). I'm sure marketing majors know this well.
- Rich get richer: It is easier to make $X when you have $100, compared to if you have only $10 to begin with.
- Short term market fluctuation in the price of a security on the stock exchange can sometimes have this property. A large number of buy orders will signal to the market a positive outlook on the security, resulting in even more buy orders. Ditto for sell orders. Until the market stabilizes. Similar to (1) above.
- Markets with network effects have a rather obvious reinforcement property: e.g., the greater the number of users that facebook has, the greater the number of new users it is likely to attract, compared to, say, a competing network like Google+, because the marginal utility for a new user is greater when she joins the larger network.
- Tragedy of the commons [PDF]: Defined by Wikipedia as: 'The depletion of a shared resource by individuals, acting independently and rationally according to each one's self-interest, despite their understanding that depleting the common resource is contrary to their long-term best interests'. Example: A grassy hill is shared by several farmers to graze their cows. If one farmer overgrazes, others have an incentive to quickly overgraze too, and eventually the utility of the hill is destroyed for all farmers.
- Hiring in a new company/group: If the organization has a good number of great/well-known people, it is easier to attract other good employees. Similar to (1).
- Thinking: Sometimes people make a choice about thinking about something in a certain way (e.g. so-and-so is a mean person), and this colors their interpretation of future observations, in effect reinforcing their prior notion.
1. In the form above, the urn model displays positive feedback. How can we change the urn model to display negative instead of positive feedback? Simple: put in an extra black ball if a white ball is picked and vice versa. This forces the fraction of white and black balls to remain close to 1/2.
2. Learning by rewards: There has been work in the psychology community on reinforcement learning in humans and animals, that is: given a set of choices and associated rewards, how does an animal learn which choice to pick? The Law of Effect is the hypothesis that the probability of picking a choice is proportional to the total reward accumulated from picking that choice in the past. Given a reward scheme, this translates to an urn model with the colors representing rewards from the various possible actions. For a nice survey of reinforcement models, see: Robin Pemantle, A survey of random processes with reinforcement, Probability Surveys, Vol. 4 (2007).
3. The model above is the simplest urn model. There are several variations that have proved useful in modeling and learning tasks other than reinforcement, e.g. using more than 2 colors, using multiple urns, or allowing for the introduction of new colors. There are a number of interesting distributions for which these Polya urn variations act as generating processes (Dirichlet-multinomial, Dirichlet Process, Chinese Restaurant Process), but that's another post.
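The negative-feedback modification in (1) above can be sketched in a few lines; it is the same simulation as before with the reinforcement inverted:

```python
import random

def negative_feedback_fraction(w=1, b=1, niters=20000, seed=0):
    """Urn where each drawn ball is returned with an extra ball of the
    OPPOSITE color, pushing the white fraction back towards 1/2."""
    rng = random.Random(seed)
    for _ in range(niters):
        if rng.random() < w / (w + b):
            b += 1          # drew white: add a black
        else:
            w += 1          # drew black: add a white
    return w / (w + b)
```

Unlike the reinforcing urn, whose limit is random, this variant concentrates tightly: after many draws the white fraction sits very close to 1/2 regardless of the starting counts.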
A word is not the same as the thing it represents; it is only a representation of it, meant to convey information about it. For example, the word 'Castle' is not the same as a castle. This is obvious. Similarly, a map of a place is not the same as the place itself; it is only a map of it. This is also obvious. An academic paper published at a conference or journal is not scholarship! It is a presentation of the scholarship. The scholarship itself is the research work. Somehow, this is not so obvious to a lot of people.
- Both fields benefit from rigorous quantitative modeling for tasks such as detecting patterns and anomalies.
- Both financial trading and security operate in an adversarial environment. In finance, the adversary is the 'market', i.e. the rest of the participants. In security, the adversary is the set of attackers.
- Because of the adversarial environment, simple models do not work for very long: the adversary evolves in response to actions taken by the other party, and it is hard for a model to capture the direction in which the adversary will evolve. Therefore both fields require that models be updated. Models are built using training data. The challenge therefore is to use the right mix of recent data and longer-term historical data. How to do this is not obvious in either field.
- Game theory is a useful tool for modeling problems in both fields because of their adversarial nature.
- Modern financial instruments, apart from serving as investment vehicles, are designed for mitigating risk. The business of selling information security products or services is also one of mitigating risk -- risk of information loss. There is no such thing as perfect security. Instead, organizations allocate an amount of money they consider appropriate to meet their risk appetite.
- One difference is that trading in financial markets often has the 'self-fulfilling prophecy' property, i.e. it can act as a positive feedback system (if a certain fraction of people who believe the price of a security is likely to fall sell their holdings, the rest start thinking it's probably a good idea to sell, and the end result is that the price does fall), whereas there doesn't seem to be an equivalent in cybersecurity.
I have been thinking quite a bit about the best language to write machine learning algorithms in. While this depends on the exact algorithm in question, there are certain constructs and tasks that show up in many algorithms. As I work through this, I'll publish my notes here on this blog as I go along. There seems to be a fairly clear tradeoff between the speed of computation and the speed of development. For example, the speed of computation is best with low level languages like C/C++, but writing code can be cumbersome and time consuming. On the other hand, high level languages like R and Python offer very quick development times and have a good number of libraries/packages developed by the community, but have very poor computational performance. For this reason most professional environments seem to adopt the approach of first hacking in a high level language like R and then writing production quality code in C or C++. I have been hearing this quite a bit recently, both within AT&T and from friends doing machine learning work in other industries, and most recently at the NY Linux user group's meeting about R. The tradeoff between performance and development time is important for me at AT&T because I often need to deal with very large volumes of data, both in terms of sample size and dimensionality.
Programming languages are known to be good at specific things and not necessarily generalizable to all types of tasks. I will use this post as a working document to summarize best practices in picking the right languages/features for writing and running machine learning algorithms, and why. Often, the type of computation needed by a class of machine learning algorithms (e.g. MCMC requires repeated sampling in a sequential manner, making parallelization difficult) suggests the type of language features that would be useful. Other times, the mapping is not so obvious.
Matlab/Octave: Expensive for individual users. Quite powerful; the many toolboxes that are available (for a price) make it worthwhile. Fast development time. Octave is the free open source version of Matlab, but does not include all of its functionality. I don't know if special toolboxes for Octave exist.
Java: I have no real experience with writing machine learning algorithms or using other people's ML code in Java.
R: Open source, big community, large number of packages, slower than Python and Java, similar in speed to Octave. The large number of packages available makes it very attractive as the language for initial hacking on a learning problem. R also has very good built-in visualization. All these factors make R very useful for interactive work with data.
Python: Python is my personal favorite language to write code in, so I may be biased, but there are good technical reasons for that bias. Python is open source, it has a huge user community, and a number of good packages for machine learning. Matplotlib/pylab are packages that offer good solid visualization, IMO comparable to R's visualization. The design of Python stresses good code readability. It is a full fledged language offering simple object-oriented design. The only downside to Python is its execution speed. In many cases, this can be overcome by careful optimization of the data structures. Various efforts have been made to improve the speed of Python. PyPy is a just-in-time compiler for Python that can sometimes speed up your Python code, depending on what it does. Cython allows for low level optimization using C by first translating Python code to equivalent C code.
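As a small illustration of the kind of data-structure optimization I mean, here is a sketch (helper names are mine) comparing a pure-Python distance computation with a NumPy-vectorized version that pushes the inner loop into optimized C code:

```python
import numpy as np

def dists_loop(points, q):
    """Squared Euclidean distances from query q to each point,
    computed with pure-Python loops (slow for large inputs)."""
    return [sum((x - y) ** 2 for x, y in zip(p, q)) for p in points]

def dists_numpy(points, q):
    """Same computation, vectorized: one array subtraction,
    one elementwise square, one sum along the feature axis."""
    points = np.asarray(points, dtype=float)
    q = np.asarray(q, dtype=float)
    return np.sum((points - q) ** 2, axis=1)
```

On a few points the two are indistinguishable, but on millions of rows the vectorized version is typically orders of magnitude faster, which is often enough to avoid dropping down to C at all.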
C/C++: Best in performance. Use only if (i) other approaches prove too slow for the problem at hand, (ii) a well-tested library is already available (there are lots of half-baked, buggy libraries out there), or (iii) the code needs to go into production or a real-time system in which performance is important (e.g. finance). Some good libraries: SVMlight, liblinear. The Boost set of libraries usually proves to be pretty useful, especially when dealing with a learning algorithm that affords parallelism. C and C++ are often clubbed together in discussions on programming languages, but there is a lot of difference between the two. Personally, I have spent more time with C than C++. C is cleaner than C++. In terms of ease of coding and readability, C++ is an ugly language. Here is what Linus Torvalds had to say about C++.
Torch 7 is a recently (NIPS, Dec 2011) released library in C++/GPL for writing ML algorithms that interfaces with a scripting language called Lua. In their words, it provides "a Matlab-like environment for state-of-the-art machine learning algorithms. It is easy to use and provides a very efficient implementation, thanks to an easy and fast scripting language (Lua) and an underlying C implementation." I am told that Torch7 + Lua is a good combination for writing neural networks, but I haven't verified this myself yet. I wish they had interfaced Torch 7 with Python instead of Lua, but in their words:
For neural networks, I have used R's nnet package (very mature, included in the base distribution), and toyed briefly with the lisp-like language Lush (also mature, but has a steep learning curve if you haven't played with functional languages like lisp/scheme before), developed by Yann LeCun.
Vowpal Wabbit is another C++ library (with Boost as a dependency) with a focus on online learning, which I haven't yet tried but have heard about quite a bit. Claims by the authors of VW include high speed and scalability. Will post an update here after I try it.
The dream language for machine learning: Fast development, very good performance. Close to English (like Python) but compiled and optimized (like C) rather than interpreted. It sounds like these qualities have little to do with machine learning, but are desirable for most tasks.
Writeups by other people on the topic of best programming languages for machine learning (I will continue to add to this list as I find them):
- Hoyt Koepke on why Python rocks for use in machine learning (and scientific computing in general).
- Two writeups by Darren Wilkinson on empirical evaluations of R, Java, Python, Scala and C for MCMC based learning, in particular for Gibbs Sampling (a type of MCMC algorithm).
- John Langford has a nice writeup discussing various aspects important to the choice of language for machine learning.
One of the problems for practitioners (as well as for researchers) of machine learning is that there is no well defined 'frontier' of machine learning algorithms that perform 'best' in a given problem domain. Even though there are some guidelines as to what is likely to work well in a given setting, this is not guaranteed, and often other approaches outperform well known approaches. In fact, a large number of papers at machine learning conferences seem to be based on this kind of surprise ('look, I tried an autoencoder on this problem and it outperforms the algorithms people have been using for decades!'). It is hard to have a universal frontier of this type because much of the ability of a machine learning algorithm depends upon the structure of the data it is run on. For this reason I was very happy to come across MLcomp.org, a website that allows people to upload algorithms (coded up in any language) and datasets, and compare the performance of their algorithm on all the datasets in the system, as well as the performance of any of the algorithms in the system on one's own dataset. This is immensely useful because it serves as an empirical frontier of the performance of machine learning algorithms on different types of datasets. MLcomp connects to computational resources hosted on a standard Amazon EC2 Linux instance, allowing users to test algorithms on datasets. And it is free. There are at least a couple of reasons why something like MLcomp should be adopted by more people (current usage seems low, given how useful I think it can be). First, the utility of such a platform is proportional to the number of algorithms and datasets hosted on it, i.e. proportional to the number of people using and contributing to it. I feel that if a certain threshold number of people begin using it actively, its utility as a platform will take off rapidly after that.
Second, it can prove to be an invaluable resource for ensemble based learning: Upload your dataset, run a number of algorithms on it, download the code of the algorithms that seem to do well, and use them in an ensemble to improve the performance of your supervised classifier / regressor.
Getting one's hands dirty with applying machine learning theory to a real world dataset is really quite different from simply understanding the theory behind machine learning algorithms. Often, the experience of learning a regressor or a classifier from a real world dataset goes something like this: You have a nice big dataset. You have cleaned it up, normalized it, and are ready to apply your chosen algorithm to the dataset in your chosen programming language/statistical computing framework. You find that the generalization error of the model you have learnt is really poor. What do you do next? You can choose a different learning algorithm, one that is more complex or one that is less complex. Or, you can try to supply more training data to the same algorithm in the hope that the generalization error improves. What you do next largely determines whether or not you are going to succeed in learning well from the dataset.
It turns out that there is a principled approach to figuring out how to navigate the set of available choices (more/less complex algorithm, more training data). Computational learning theory provides insights into what must be done next. Andrew Ng has a really nice section in his Machine Learning class in which he goes over the most problematic parts of applied machine learning research. According to Andrew, a lot of inexperienced practitioners and startups using machine learning are unable to answer this question, and go about making ad-hoc choices, wasting precious time in the process. I completely agree. I suspect that understanding this aspect of machine learning is also what separates the winners from the rest in machine learning competitions.
The plot on the left shows what happens as the number of training samples is increased with a fixed model. The error on the training set actually increases, while the generalization error, i.e. the error on an unseen test sample, decreases. The two converge to an asymptote. Naturally, models with different complexity have asymptotes at different error values. In the plot above, model 1 converges to a higher error rate than model 2 because model 1 is not complex enough to capture the structure of the dataset. However, if the amount of training data available is less than a certain threshold, then the less complex model 1 wins. This regime is important because often the amount of training data is fixed -- it is just not possible to obtain any more training data.
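These learning curves are easy to reproduce on a toy problem. The sketch below (my own setup: a cubic signal with noise, fit by polynomials of two different degrees standing in for models of different complexity) computes training and generalization error as the training set grows:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Toy regression problem: cubic signal plus a little noise.
    x = rng.uniform(-1, 1, n)
    y = x**3 - x + 0.1 * rng.normal(size=n)
    return x, y

def errors(degree, n_train, n_test=500):
    """Fit a polynomial of the given degree on n_train points and
    return (training MSE, generalization MSE on fresh test data)."""
    xtr, ytr = make_data(n_train)
    xte, yte = make_data(n_test)
    coeffs = np.polyfit(xtr, ytr, degree)
    tr = np.mean((np.polyval(coeffs, xtr) - ytr) ** 2)
    te = np.mean((np.polyval(coeffs, xte) - yte) ** 2)
    return tr, te

# As n_train grows, training and test error converge toward an
# asymptote; the too-simple degree-1 model levels off at a higher
# error than the degree-3 model matched to the data.
for n in (10, 30, 100, 300):
    print(n, errors(1, n), errors(3, n))
```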
The plot on the right shows the change in training error and generalization error as the model complexity is increased, and the size of the training set is held constant. The training set error decreases monotonically, but the generalization error first falls and then increases. This occurs because by choosing a progressively more complex model, at some point, the model begins to overfit the training data. That is, given the larger number of degrees of freedom in a more complex model, the model begins to adapt to the noise present in the dataset. This hurts generalization error. One way to get around this is to regularize the model -- i.e. somehow constrain the model parameters, which implicitly reduces the degrees of freedom. If the learning problem can be framed as an optimization problem (e.g. find minimum mean square error), then one way to regularize is to alter the objective function by adding on a term involving the model parameters. Regularization is a flexible means to control the model's capacity so as to avoid overfitting or underfitting.
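Ridge regression is the standard example of this "add a term involving the model parameters" idea: the mean square error objective is augmented with an L2 penalty on the weights. A minimal sketch (function name and closed-form solution per the usual textbook derivation):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression: minimize ||Xw - y||^2 + lam * ||w||^2.
    The penalized least squares objective has the closed form
    w = (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

As lam grows, the weights are shrunk toward zero, which reduces the model's effective degrees of freedom; lam = 0 recovers ordinary least squares.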
If application of an ML algorithm shows that training and generalization error rates have leveled out, close to their asymptotic value, then adding additional training samples is not going to help! The only way to improve generalization error would be to use a different, more complex model. On the other hand, if all the available training samples have been used, and increasing the model's complexity seems to be increasing the generalization error, then this points to overfitting. In this case, increasing the model complexity is not going to help. Generalization error is usually measured by keeping aside part of the training data (say, 30%) and using only the remaining data for training. A more robust technique, termed k-fold cross validation, is to partition the data into k disjoint folds, hold out each fold in turn as test data while training on the rest, and average the k error estimates.
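The k-fold procedure is simple enough to sketch directly. Here `fit` and `predict` are placeholders for whatever learning algorithm is being evaluated (names and signature are my own):

```python
import numpy as np

def kfold_score(X, y, fit, predict, k=5, seed=0):
    """Plain k-fold cross validation: shuffle the indices, split them
    into k folds, hold out each fold once as test data, train on the
    rest, and return the average test MSE."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, k)
    errs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])
        errs.append(np.mean((predict(model, X[test]) - y[test]) ** 2))
    return float(np.mean(errs))
```

Because every sample serves as test data exactly once, the estimate wastes less data than a single held-out split, at the cost of training k models.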
I spent the day at the annual machine learning symposium organized by the NY Academy of Sciences. The keynotes were by Léon Bottou, Yoav Freund, and Stephen Boyd. In addition, there were ~30 posters and a number of short talks by some of the poster presenters. I enjoyed Stephen Boyd's keynote the most. He talked about a method for decoupling convex optimization problems for performing distributed machine learning with datasets that have a very large number of features and/or training samples. This is particularly relevant to the cluster/parallel computing paradigm. Although I used his textbook in grad school, this was the first time I heard him speak. His talk was lucid, entertaining, and seemed effortless. The poster talks that caught my attention were:
- A Reliable, Effective Terascale Linear Learning System - Miroslav Dudik (Yahoo! Research). I'd heard a longer description of this work by John Langford at one of the NY ML meetups where John described a system they've developed called 'AllReduce'.
- Online Learning for Mixed Membership Network Models - Prem Gopalan (Princeton University). I enjoyed seeing this mainly because of my interest in Latent Dirichlet Allocation and because of a close parallel with some work I've trying to do with network data.
- Planning in Reward Rich Domains via PAC Bandits - Sergiu Goschin (Rutgers University). Sergiu drove home the point of his work beautifully by showing a demo of his idea applied to the old Super Mario Bros. Nintendo videogame. The video shows Mario navigating deftly through a level with an artificially high number of obstacles (ducks, and other little creatures that Mario isn't supposed to come into contact with). This pushed me to go back and take a look at their work.
- Preserving Proximity Relations and Minimizing Edge-crossings in Graph Embeddings - Amina Shabbeer (Rensselaer Polytechnic Institute). Amina's work displays high dimensional graph data (each vertex of the graph is a point in a high dimensional metric space) in a lower dimensional space, preserving the distance properties as far as possible (a standard dimension reduction problem; PCA, LLE, and MDS all seem fair game, and she used MDS), while trying to prevent edge crossings (she addresses this by creating a small SVM-style linear separability problem for each pair of edges and doing a global optimization). This work was interesting because it is a problem I encounter often when trying to visualize large scale data at AT&T as a graph. SFDP, an algorithm developed by Yifan Hu of AT&T Labs, addresses the same problem.
A research challenge on semi-supervised feature learning, posted by D. Sculley of Google Research and hosted on Kaggle, just ended this past Monday. The goal of the competition was to learn a collection of 100 features from a small amount of labelled data (50K points) and a large amount of unlabeled data (1M points) in a 1M dimensional space. The data in the original 1M-dimensional space was very sparse. The winning entry didn't use the unlabeled data at all! This was a surprise to me because I was convinced that it would be possible to do better by learning a low dimensional feature representation. The winning strategy is described by one of its authors here. Although I didn't participate, I wanted to try building a deep autoencoder to lower the dimensionality. This Science 2006 paper by Hinton and Salakhutdinov addresses this very problem. The paper came up with a method for better training of deep autoencoder architectures, by training each RBM layer-by-layer instead of doing backprop on the entire network end to end. The paper has a number of neat examples of high dimensional data with low inherent dimensionality, and the authors show how nicely their autoencoders manage to recover the low dimensional features better than other popular methods such as PCA, locally linear embedding, and multi-dimensional scaling.
I think one reason the deep autoencoder approach didn't feature among the top entries in the Google/Kaggle challenge is that it is not obvious how to pick the right size and number of layers in the autoencoder. In general, there seems to be no solid work on this issue; Hinton and Salakhutdinov do not address it in their 2006 Science paper. A second reason I suspect it didn't show up in the challenge results is that the challenge ran for just about 3 weeks, and it takes significant time to build and fine-tune autoencoder architectures for a specific problem.
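To make the autoencoder idea concrete, here is a toy single-hidden-layer linear autoencoder trained by gradient descent on reconstruction error. This is only a minimal sketch of the objective, not the RBM-pretrained deep architecture from the paper; the function name and hyperparameters are my own.

```python
import numpy as np

def train_autoencoder(X, n_hidden, lr=0.01, epochs=200, seed=0):
    """Toy linear autoencoder: encode n-by-d data X into n_hidden
    features via W1, decode via W2, and do gradient descent on the
    mean squared reconstruction error ||X W1 W2 - X||^2 / n.
    Returns the encoder and decoder weight matrices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(scale=0.1, size=(d, n_hidden))
    W2 = rng.normal(scale=0.1, size=(n_hidden, d))
    for _ in range(epochs):
        H = X @ W1               # encode to n_hidden features
        Xhat = H @ W2            # decode back to d dimensions
        G = (2.0 / n) * (Xhat - X)   # gradient of the MSE wrt Xhat
        gW2 = H.T @ G            # chain rule through the decoder
        gW1 = X.T @ G @ W2.T     # chain rule through the encoder
        W2 -= lr * gW2
        W1 -= lr * gW1
    return W1, W2
```

A linear autoencoder can only recover a PCA-like subspace; the point of the deep, nonlinear versions in the paper is precisely to capture structure that such a linear bottleneck misses.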