10th Annual NYAS Machine Learning Symposium

On March 4, I attended the 10th annual Machine Learning Symposium organized by the New York Academy of Sciences. This was the third time I attended. It is a nice day-long event with 3-4 keynote talks, several posters, and 5-minute talks by selected poster presenters. It was at this venue that I first heard David Blei talk about topic modeling and LDA in 2010. Below are the talks I enjoyed most this time.

Keynotes:

Alex Graves, Google DeepMind
This was the best talk of the day in my opinion. Alex presented a set of results on the role of attention in deep neural networks, i.e., on specifying which parts of the input the network should focus on. Deep networks naturally learn where to focus their attention for a particular task (e.g. which parts of an image are responsible for producing the desired output), but providing the network with some side information about where to focus can improve learning tremendously. The mathematical development of this idea uses smooth, differentiable operators for specifying where to focus attention, so that these operators can be included in gradient-descent learning. Alex is famous for his work on handwriting recognition using Recurrent Neural Networks, and has a cool widget on his website that translates typed text into cursive handwriting in a chosen style.
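Back to attention: the core trick, as I understand it, is to replace a hard selection of where to look with a softmax-weighted average, so that "where to look" becomes differentiable and gradients can flow through it. A minimal toy sketch of such a soft read (my own illustration of the general idea, not Alex's model):

[sourcecode language="python"]
import numpy as np

def softmax(z):
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

memory = np.random.randn(5, 4)   # 5 input items, each a 4-dim vector
query = np.random.randn(4)       # what the network is currently looking for

scores = memory @ query          # how well each item matches the query
weights = softmax(scores)        # smooth, differentiable attention weights
read = weights @ memory          # soft read: a weighted average of all items
[/sourcecode]

A hard argmax over the scores would select a single item and kill the gradient; the softmax version attends mostly to the best-matching item while remaining trainable end to end.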

Sasha Rakhlin, UPenn
This keynote was more theoretical than Alex's talk. The setting for Sasha's talk was predicting attributes of a node in a graph (e.g. a social network), when the input predictors include the graph structure as well as non-graph features of the node, and data about nodes arrives in an online fashion. Training a machine in the offline setting (e.g. a classifier for whether a user in a social network will click on a given ad) using conventional methods runs into a combinatorial explosion. Sasha's work shows that the online setting admits a relaxation that permits polynomial-time learning.


Notable Posters: 

Multitask Matrix Completion for Learning Protein Interactions across Diseases
Meghana Kshirsagar, IBM Research
The matrix completion problem, exemplified by the Netflix Challenge, is this: given a large low-rank matrix of which only a few entries are observed, estimate the missing entries. The problem also arises in the context of protein-protein interactions between the proteins of disease-causing germs and the proteins of a host organism. While there are several known methods for matrix completion (e.g. nuclear norm minimization, Candes et al.), this setting can also benefit from multi-task learning, because several diseases often have similar responsible proteins. Multi-task learning is a general idea (not specific to matrix completion): when multiple problems have a portion of the signal in common, using a combination of a shared representation and a task-specific representation can lead to better generalization.
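As a concrete baseline (this is plain single-task completion by alternating least squares, not the nuclear-norm method cited above or the multi-task method from the poster; sizes and parameters are made up):

[sourcecode language="python"]
import numpy as np

rng = np.random.default_rng(0)
n, m, r = 50, 40, 3
M = rng.standard_normal((n, r)) @ rng.standard_normal((r, m))  # true rank-3 matrix
mask = rng.random((n, m)) < 0.3                                # ~30% entries observed

U = rng.standard_normal((n, r))
V = rng.standard_normal((m, r))
lam = 0.1                                   # ridge term keeps the solves well-posed
for _ in range(30):
    for i in range(n):                      # update row i of U from observed entries
        cols = mask[i]
        A = V[cols].T @ V[cols] + lam * np.eye(r)
        U[i] = np.linalg.solve(A, V[cols].T @ M[i, cols])
    for j in range(m):                      # update row j of V symmetrically
        rows = mask[:, j]
        A = U[rows].T @ U[rows] + lam * np.eye(r)
        V[j] = np.linalg.solve(A, U[rows].T @ M[rows, j])

err = np.linalg.norm((U @ V.T - M)[~mask]) / np.linalg.norm(M[~mask])
print("relative error on the missing entries: %.3f" % err)
[/sourcecode]

The multi-task version would go one step further: factor each disease's interaction matrix into a component shared across diseases plus a disease-specific component, so that sparsely observed diseases borrow strength from better-observed ones.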

Another Look at DWD: Thrifty Algorithm and Bayes Risk Consistency in RKHS
Boxiang Wang, University of Minnesota
Distance Weighted Discrimination was proposed as a superior alternative to the SVM because, unlike the SVM, DWD cares about the distances of *all* data points from the separating hyperplane, rather than just the points closest to it. However, DWD hasn't yet enjoyed the popularity of the SVM, because it requires solving a second-order cone program (SOCP) rather than a quadratic program (as the SVM does). This paper extends the theoretical results on DWD (establishing Bayes risk consistency) and proposes an algorithm faster than second-order cone programming.

 

In the eye of the Beholder

Why is the picture on the right more appealing than the one on the left?

What is it that we find more interesting about the picture on the right, compared to the one on the left? The picture on the left contains more information, so we are certainly not looking for more information. One might say we don't know how to interpret the image on the left as anything familiar, though we do recognize it as television static. A more precise answer is given by Jürgen Schmidhuber, who argues convincingly that:

Artists (and observers of art) get rewarded for making (and observing) novel patterns: data that is neither arbitrary (like incompressible random white noise) nor regular in an already known way, but regular in way that is new with respect to the observer's current knowledge, yet learnable (that is, after learning fewer bits are needed to encode the data).

This explains the pictures on top. The picture on the left is not compressible because it is a matrix of uniformly random 0/1 pixels. The Monet on the right evokes familiar feelings, and yet adds something new. I think what Schmidhuber is saying is that the amount of compressibility should be neither too little nor too much. If something is not very compressible, it is too unfamiliar. If something is too compressible, it is basically boring. In other words, the pleasure derived first increases and then decreases with compressibility, not unlike this binary entropy curve.
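A quick-and-dirty way to put numbers on "compressibility" is to use an off-the-shelf compressor as a crude stand-in for the observer's encoder (my own toy experiment, not Schmidhuber's formulation):

[sourcecode language="python"]
import os
import zlib

def ratio(data):
    # compressed size / original size: close to 1.0 means incompressible
    return len(zlib.compress(data)) / float(len(data))

noise = os.urandom(10000)       # like TV static: no pattern to exploit
boring = b"ab" * 5000           # perfectly regular: compresses to almost nothing
blend = b"".join(b"ab" if b < 128 else bytes([b, b ^ 1]) for b in os.urandom(5000))

for name, data in [("noise", noise), ("boring", boring), ("blend", blend)]:
    print(name, round(ratio(data), 3))
[/sourcecode]

The noise barely compresses, the regular string compresses almost completely, and the blend sits in between -- which, per the argument above, is where the interesting objects live.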

Let us ask the same question again for the following pair of images (you have to pick one over the other):

My guess is that most people will find the image on the right more appealing (it is for me at least). Please drop me a comment with a reason if you differ. When I look at the image on the right, it feels a little more familiar, there are some experiences in my mind that I can relate to the image - for example looking straight up at the sky through a canopy of trees (white = sky, black=tree leaves), or a splatter of semisolid food in the kitchen.

In order for an object to be appealing, the beholder must have some side information about, or familiarity with, the object beforehand. I learnt this lesson the hard way. About 2 years ago, I gave a talk at a premier research institution in the New York area. Even though I had received compliments when I'd given this talk at other venues, to my surprise, this time the audience almost slept through my talk. I learnt later that I had made the following mistake: in the abstract I'd sent to the talk's organizer, I had failed to signal that my work would likely appeal to an audience of information theorists and signal processing researchers. My audience had ended up being a bunch of systems researchers. The reason they dozed through my talk was that they had just a bit less than the required background to connect the dots I was showing them.

It is the same with cultural side information or context -- the familiar portion of the object allows the observer to latch on. The extra portion is the fun. Without the familiar, there is nothing to latch on to. The following phrases suddenly take on a precise quantifiable meaning:

  • "Beauty lies in the eyes of the beholder": the beholder carries a codebook that allows her to compress the object she is observing. Each beholder has a different codebook, and this explains 'subjective taste'.
  • "Ahead of its time": Something is ahead of its time if it is very good but does not have enough of a familiar portion to it, to be appreciated by the majority of observers.

I can think of lots of examples of art forms that deliberately incorporate partial familiarity into them -- e.g. music remixes, Bollywood story lines. Even classical music first establishes a base pattern and then builds on top of it. In this TED talk, Kirby Ferguson argues that all successful creative activity is a type of remix, meaning that it builds upon something familiar.

Takeaways:

  1. When writing a paper or giving a talk, always make sure the audience has something familiar to latch on to first. Otherwise, even a breakthrough result will appear uninteresting.
  2. Ditto for telling a story or a joke, DJing music at a party, or building a consumer-facing startup: the audience needs something familiar to latch on to.
  3. In some situations it may be possible to gauge the codebook of the audience (e.g. having a dinner party conversation with a person you just met), to make sure you seem neither too familiar, nor too obscure.

Climate change and The Burglar's Stopping Problem

The climate change hypothesis is that global changes in climate, leading to a significantly higher number of severe weather events, are predominantly man-made; in particular, the release of greenhouse gases such as carbon dioxide into the atmosphere is a leading cause. After conveniently escaping the national spotlight in the US during the presidential campaigns, climate change has once again appeared in the news, thanks to Hurricane Sandy. Munich Re, the reinsurance giant, released a report, somewhat presciently, on Oct 17, which says:

Nowhere in the world is the rising number of natural catastrophes more evident than in North America. The study shows a nearly quintupled number of weather-related loss events in North America for the past three decades, compared with an increase factor of 4 in Asia, 2.5 in Africa, 2 in Europe and 1.5 in South America.

Unambiguously proving that man-made climate change has a role to play in a specific event such as Sandy is more an ideological debate than a statistical exercise. And so there are many, many people who can boldly claim that man-made climate change is fiction.

A few industrialized nations are responsible for the bulk of CO2 emissions.

Some of these nations have refused to ratify the Kyoto Protocol, which calls for reduction in CO2 emissions. No points for guessing which colors in the map below denote countries that have not ratified the treaty.

(Brown = No intention to ratify. Red = Countries which have withdrawn from the Protocol. Source: Wikipedia)

Most of the world apparently believes in man-made climate change. When will these other countries wake up? I can't help but think of the following stopping problem:

Taken from G. Haggstrom (1966) “Optimal stopping and experimental design”, Ann. Math. Statist. 37, 7-29.

A burglar contemplates a series of burglaries. He may accumulate his larcenous earnings as long as he is not caught, but if he is caught during a burglary, he loses everything including his initial fortune, if any, and he is forced to retire. He wants to retire before he is caught. Assume that returns for each burglary are i.i.d. and independent of the event that he is caught, which is, on each trial, equally probable. He wants to retire with a maximum expected fortune. When should the burglar stop?
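For concreteness (the numbers here are my own assumptions; the quoted problem only says the catch probability is the same on every trial): suppose the burglar is caught on each job with probability $latex p$ and the loot per job has mean $latex \mu$. Stopping with fortune $latex x$ beats doing exactly one more job when $latex x > (1-p)(x + \mu)$, i.e. when $latex x > \mu(1-p)/p$. Since the fortune only grows while the burglar stays free, the stopping region is absorbing, so this one-step-lookahead rule is the optimal policy. A quick simulation:

[sourcecode language="python"]
import random

p, mu = 0.5, 1.0                     # catch probability per job; mean loot (assumed)
star = mu * (1 - p) / p              # one-step-lookahead retirement threshold

def expected_fortune(threshold, trials=200000):
    total = 0.0
    for _ in range(trials):
        fortune = 0.0
        while fortune < threshold:
            if random.random() < p:                   # caught: lose everything
                fortune = 0.0
                break
            fortune += random.expovariate(1.0 / mu)   # exponential loot, mean mu
        total += fortune
    return total / trials

for t in (0.5 * star, star, 2.0 * star):
    print("retire at %.2f -> expected fortune %.3f" % (t, expected_fortune(t)))
[/sourcecode]

Retiring at the threshold should beat both the greedier and the more timid policies.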

Traffic Lights: Regulation vs. free-markets

In the aftermath of Hurricane Sandy, many parts of the NY/NJ area have sustained power outages, and as a result, traffic lights in these areas are not functional. This requires drivers to treat a traffic junction as a multi-way stop sign. This got me thinking: what if, in place of traffic lights, we had just stop signs everywhere, and the rule was: the next car to go is the car at the head of the longest queue? I believe this is an optimal scheduling policy in a certain sense (it provides an optimal throughput x delay product -- that is, for a given average delay at the intersection, it provides the highest rate of cars going through [1]). In this policy, each driver is trusted to follow the scheduling rule faithfully. For argument's sake, I am ignoring (1) the time spent by each driver figuring out which queue is the longest at each step, (2) how the driver at the head of each queue gets information about the length of each queue, and (3) the loss in efficiency incurred by slowing down and starting.

Compared to this self-enforced scheduling policy, traffic lights can be very suboptimal. You know this if you have ever sat at a red light while the street with the green signal has no traffic. Why then do we have traffic lights? The problem is that under the self-enforced scheduling policy, some drivers will free-load, i.e., they will not obey the rule and will simply take the turn for themselves, even if the turn belongs to someone else according to the scheduling rule. Further, when this happens, it will often result in collisions between the free-loader and the rightful owner of the turn. This is why traffic lights are necessary, even though they come at the expense of reduced overall efficiency.
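A toy discrete-time simulation of the two schemes (my own sketch, with arbitrary parameters, and ignoring caveats (1)-(3) above):

[sourcecode language="python"]
import random
from collections import deque

def simulate(policy, lanes=4, arrival_p=0.2, steps=20000, cycle=20):
    # one car can cross per step; returns the mean wait at the intersection
    random.seed(0)
    queues = [deque() for _ in range(lanes)]
    waits, green = [], 0
    for t in range(steps):
        for q in queues:
            if random.random() < arrival_p:
                q.append(t)                        # record each car's arrival time
        if policy == "lights" and t % cycle == 0:
            green = (green + 1) % lanes            # fixed rotation, like a light
        q = max(queues, key=len) if policy == "longest" else queues[green]
        if q:
            waits.append(t - q.popleft())          # the car at the head crosses
    return sum(waits) / float(len(waits))

for policy in ("longest", "lights"):
    print(policy, round(simulate(policy), 1))
[/sourcecode]

The fixed cycle wastes service on empty lanes -- exactly the red-light-at-an-empty-intersection effect -- so the longest-queue rule comes out well ahead.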

There is a nice lesson embedded here that speaks to the need for government regulation by way of analogy: Regulation is necessary to enforce fairness and safety by preventing freeloaders and accidents, even though a free market might provide higher overall benefit if everyone was guaranteed to behave properly. Therefore regulation is the price we must pay, in the form of reduced overall benefit, to counter the fact that all market participants do not behave as per the rules if left to themselves.

EDIT 1: The loss in overall utility when all participants are allowed to act selfishly, compared to the state where each participant acts for the overall good of the set of all participants, is called the price of anarchy. This is different from (but related to) the loss in overall utility from the imposition of regulations. A simple 2-player prisoner's dilemma can exhibit the price of anarchy: both participants are worse off when allowed to act selfishly, compared to the overall optimum for the 2 players. In the traffic light example, when players act selfishly, they create unfairness and also end up endangering everyone (including themselves, though perhaps they don't realize this bit). Hence the utility derived by each participant is lower than if they all cooperated perfectly.

EDIT 2: Regulation can be thought of simply as a mechanism designed to improve the utility received by players beyond what it would be in anarchy, by changing the (rules of the) game a little. Regulation typically doesn't take the system to the overall optimal (which corresponds to perfectly cooperating players in the original game) of the original game. The 'price of regulation' ( = utility of overall optimum - that achieved by regulation) should be less than the price of anarchy (= overall optimum - state achieved by anarchy). Modern day regulators need to be really good at mechanism design!

EDIT 3: Perfect cooperation can be unstable against defection by free loaders [2] because the utility a player derives by unilaterally defecting is greater than that obtained by cooperating. If everyone is well aware of the risk of an accident upon defecting, then this can serve as a disincentive to defecting because the utility from defecting, after factoring in the probability of an accident may no longer make defecting worthwhile. This suggests that simply increasing awareness of the risks posed by misbehavior upon the misbehaving player, might improve the overall equilibrium a bit. Of course, this requires that the defector bear extra personal risk.

___

[1] I know this because it holds true for scheduling packet transmissions in a class of communication networks [citation required].

[2] I experienced free loaders first hand during the last few days after Sandy in 2 different contexts: people going out of turn at road intersections, and people trying to break into the long line at a gas station.

Stopping Problems

Optimal stopping problems frequently pop up in real life decision making. I have been collecting a bunch of stopping problems, with the aim of clustering similar problems together, and finding general solution strategies to specific flavors of problems. Here is my (growing) list:

  1. Hiring, Marriage, Parking, Home selling: You are to hire an employee from a pool of N candidates. Each candidate has a fixed integer rank, but the rank becomes known only after the interview. After each interview, you must decide whether to hire that candidate or to interview the next one. On moving to the next candidate, you lose the opportunity to hire the one you just interviewed. If you hire none of the first N-1 candidates, you must hire the Nth. When should you stop interviewing? Other versions of the same problem: Marriage: Replace candidates with potential mates. Home selling: The goal is to sell a house at maximum profit. Parking: Look for a parking spot so as to park your car nearest to the entrance of a supermarket; you can either take an empty spot or pass it up in the hope of getting a better one. Further variations: (i) N is potentially infinite, but there is a cost proportional to the amount of time you take to decide. (ii) In the home-selling problem, instead of ranks, one observes a real number (the offer) as the score of a candidate, and the goal is to stop at the candidate with as high a score as possible.
  2. Bug finding, proof-reading: Each review of a piece of code or writing costs $C. The number of bugs is a random number with some known distribution. An undetected bug costs $B. Each time a review is done, each bug present is independently detected (and fixed) with some known probability p. When should you stop reviewing?
  3. Gaussian process excursions: You are given a Gaussian process X(n), n = 0, 1, 2, ..., such that each X(n) is Gaussian with mean zero, and the covariance function K(i,j) between X(i) and X(j) is known. You are allowed to observe a maximum of N successive samples of X(n), and at each step you decide whether to stop or continue. The reward upon stopping at $latex t_1 \leq N$ is $latex X(t_1)$. This problem generalizes the score version of problem #1 by allowing observations to be correlated rather than IID.
  4. A drunkard can walk either 1 step west or 1 step east at each time step, with equal probability. When must she stop, in order to maximize her distance from where she started? What if the number of steps is limited to N? Another version of the same problem: You have a shuffled deck of cards and you must draw cards from it, one at a time. A red card gives you $1 and a black card makes you $1 poorer. When should you stop picking cards in order to maximize profit?
  5. You are to toss an unbiased coin repeatedly. When you stop tossing, your reward is the fraction of tosses that came up heads. When do you stop tossing? What if you are allowed only N tosses?
  6. You are to throw an unbiased die a maximum of N times. When you stop throwing, your reward is the number on the upturned face of the die. How do you decide when to stop? This is a bit like the version of the Hiring problem in #1 above in which each candidate has a score rather than just a rank.
  7. Two players A and B play a game. Player A writes down two different integers chosen from the set {1, 2, ..., 100} on two separate pieces of paper and places them face down on a table. Player B's task is to pick one of the pieces of paper, and either stop if she believes it is the larger of the two numbers, or pick the 2nd one if she believes the 2nd is larger. What should player B's strategy be? What should player A's strategy be if her goal is to make player B lose?
  8. A discrete-time stochastic process produces iid random variables $latex X_1, X_2, ...$ from a known distribution, and at a certain time T switches to a different distribution. The task is to detect the change as quickly as possible after it has occurred, assuming that the cost of detection is proportional to the time since the change and that there is a fixed cost for a false alarm.
  9. [One armed Bandit] As a doctor, you have two possible treatments for a certain disease. Treatment A is known to work with probability p_A, while treatment B has an unknown probability of working, with some prior distribution over that probability. Given N patients to be treated sequentially, how do you decide which treatment to use, if the goal is to maximize the number of patients cured? This is called a 1-armed bandit problem because of the following equivalent problem: A casino slot machine has two levers. Pulling a lever provides the player with either $1 or $0. The lever on the left is known to have a probability p_left of paying $1, while the lever on the right has an unknown probability. How do you go about picking levers if you have N tries? If you have infinitely many tries? These are stopping problems because nothing new is learned from trials of treatment A, whose success probability is already known; so once it becomes optimal to switch from treatment B to treatment A, it remains optimal to stay with A thenceforth. The problem therefore boils down to finding the time at which to stop experimenting with treatment B and switch to treatment A.

Finite number of steps

It turns out that stopping problems can be non-trivial to solve when the number of tries is infinite (e.g. Question #2 above). But when the number of tries is finite, dynamic programming is often useful in arriving at an optimal strategy.

For a simple example of how DP can help find an optimal stopping strategy, let us take Question #6 with N=3. A finite horizon provides a way to reason about the problem at each step by working backwards. After having thrown the die 2 times, it makes sense to throw a 3rd time only if the expected outcome of the 3rd throw is better than the value shown on the 2nd. Since throws are IID, the expected outcome of the 3rd throw is 3.5, and therefore it makes sense to go into the 3rd throw only if the 2nd throw shows a 1, 2, or 3, but not if it shows a 4, 5, or 6. Applying the same logic, working backwards one more stage: after the first throw, it makes sense to throw a 2nd time only if the expected reward from continuing the game is higher than the present reward. The expected reward from going into the 2nd throw can be computed by considering two disjoint cases: the game stops with the 2nd throw (it shows 4, 5, or 6), or it continues into the 3rd throw. This gives (3/6)·(4+5+6)/3 + (3/6)·(3.5) = 2.5 + 1.75 = 4.25. Therefore the 1st throw is good enough to end the game only if it exceeds 4.25, that is, if it is a 5 or a 6.
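The same backward induction is easy to code for any horizon N; a small sketch that reproduces the numbers above:

[sourcecode language="python"]
from fractions import Fraction

def game_values(N):
    # values[k] = expected reward with k+1 throws remaining, under optimal play
    v = Fraction(7, 2)                            # last throw must be accepted: 3.5
    values = [v]
    for _ in range(N - 1):
        # stop on face x only if x beats the value of continuing
        v = sum(max(Fraction(x), v) for x in range(1, 7)) / 6
        values.append(v)
    return values

for k, v in enumerate(game_values(3), start=1):
    print("%d throw(s) remaining: value %s = %.4f" % (k, v, float(v)))
# 1 throw(s) remaining: value 7/2 = 3.5000
# 2 throw(s) remaining: value 17/4 = 4.2500
# 3 throw(s) remaining: value 14/3 = 4.6667
[/sourcecode]

So with three throws allowed: stop on a first throw of 5 or 6, stop on a second throw of 4, 5, or 6, and always accept the third.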

Problem #1 has been shown to have a relatively simple solution, which was pretty surprising when I first came across it. The solution, for reasonably large values of N, is: observe the first ~37% of the candidates without stopping, and then pick the first one that beats all the ones before it. The intuition is that observing some of the candidates without stopping provides information about the quality of the candidate pool, which provides a baseline for making a decision. Stopping too soon is suboptimal because you do not have sufficient information to make an informed decision; stopping too late is suboptimal because you do not have much choice left. The 37% rule is provably optimal, and it finds the best candidate about 37% (i.e., 1/e) of the time in repeated trials. A proof is readily available by googling.
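The ~37% figure is easy to check by simulation (a sanity check, not a proof):

[sourcecode language="python"]
import math
import random

def trial(N, k):
    ranks = list(range(N))                  # rank 0 is the best candidate
    random.shuffle(ranks)
    best_seen = min(ranks[:k])              # observe the first k without stopping
    for r in ranks[k:]:
        if r < best_seen:
            return r == 0                   # hire the first record-setter
    return ranks[-1] == 0                   # no record-setter: forced to take the last

N, trials = 100, 100000
k = int(round(N / math.e))                  # skip ~37% of the candidates
wins = sum(trial(N, k) for _ in range(trials))
print("found the best candidate in %.1f%% of trials" % (100.0 * wins / trials))
[/sourcecode]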

Habit Formation, Part 2

This post builds upon an earlier post on modeling habit formation. To follow this post, please read that one first. Go on, I'll wait. The urn model in that post is not really a complete description of habit formation, because the limit distribution of the system (i.e., the fraction of white balls after a very large number of draws) is completely determined by the initial configuration of the urn (i.e., the number of white and black balls). In reality, habits are formed from repeated actions, and our actions are based not just on our inclinations but also on the conscious choices that we make, which can sometimes go against our inclinations. Free will is missing from this model. There is no way to break a bad habit if one starts out with one.

Therefore, let us alter the basic urn model slightly to capture choice: at each step, before picking a ball, we explicitly declare which color we are aiming for on that pick: a white ball ( = follow the desirable habit) or a black ball ( = follow the undesirable habit). Suppose we choose to aim for a white ball. A ball is then sampled from the urn at random, and if a white ball does indeed come up, then:

  1. A payoff of $W is received (We wanted white and we got white, so we are happy to the extent of $W).
  2. The white ball is placed back into the urn, along with an extra white ball (Reinforcement).

But if we decide to pick white and a black ball comes up, then there is no payoff, and the black ball is placed back in the urn without any extra ball. However, if we choose black, a black ball is guaranteed to show up; we get a payoff of #B, where # is a currency different from $, and the black ball is placed back along with an extra black. The extra black ball makes picking a white ball harder the next time a white is desired.
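In code, one round of the modified urn looks like this (a sketch of the rules above, with $W and #B both set to 1 and a player who aims for white 70% of the time; both choices are arbitrary):

[sourcecode language="python"]
import random

def step(white, black, aim_white, W=1.0, B=1.0):
    # one draw; returns the updated urn counts and the ($, #) payoff for the round
    if aim_white:
        if random.random() < white / float(white + black):
            return white + 1, black, W, 0.0   # white drawn: payoff and reinforcement
        return white, black, 0.0, 0.0         # black drawn: no payoff, no extra ball
    return white, black + 1, 0.0, B           # aiming for black always succeeds

white, black, dollars, kicks = 2, 10, 0.0, 0.0
for _ in range(1000):
    white, black, d, k = step(white, black, aim_white=(random.random() < 0.7))
    dollars += d
    kicks += k
print("white fraction %.2f, $%d, #%d" % (white / float(white + black), dollars, kicks))
[/sourcecode]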

Consider the implications of this model:

  1. Suppose there are very few white balls compared to black. A white ball would rarely show up. With the decision to pick a white, most of the time there will be no payoff and no reinforcement. But with the decision to pick a black ball, there is a guaranteed payoff of #B units, at the cost of making it harder to pick a white later on.
  2. As the bad habit is reinforced, the chance of reverting to the good habit never dies away completely, but diminishes as the number of black balls grows in comparison to the whites. ('It's never too late, but it does get harder').
  3. The purpose of the different currencies is to model the fact that the pleasure obtained from following a bad habit is qualitatively different from the one obtained from following a good one. Since most bad habits are based on short-term pleasure at the expense of long-term benefit (e.g. cigarette smoking, binge eating/drinking, putting off exercise), currency # may correspond to lots of short-term kicks, while $ may correspond to long-term well-being.

Short-term thrills

I believe human behavior is fundamentally driven by a desire to be happy. However, different individuals can have very different ideas about what will make them happy. Since most bad habits result from a focus on short term happiness, at the expense of long term happiness, we would do well to make the following tweak: # and $ can be added to give a total payoff, but each unit of # currency is worth only $latex \epsilon < 1$ unit, 1 iteration after it was earned. The other currency, $, however, does not decay with time. The question is what behavior emerges when the objective is simply maximization of total payoff? It is intuitive that the answer will depend upon whether the game has a finite horizon or an infinite horizon. Since players have finite lifetimes, it is appropriate to use a finite time horizon, but since a person does not know when he/she will die, the horizon must be a life expectancy with some uncertainty (variance). In other words, nobody knows when they will die, but everybody knows they will die. The only time it makes logical sense to pick a black ball is if the player believes his remaining lifetime is so small that the expected cumulative gain from deciding to pick whites is smaller than the expected cumulative gain (including the decay effect) from picking blacks. Of course this depends upon how he spent his past -- i.e. whether or not he built up a high fraction of whites.

Breaking a bad habit

Suppose a person starts out with a predominant inclination towards an undesirable habit and wishes to switch to the good habit. It is clear that if she always chooses the white ball, then she will eventually succeed in building up a larger number of white balls and a good $ balance. But she will have to go through many tries with no payoff before that happens. In practice, this requires determination, and she may be tempted to pick the black ball because it is so easily available.

Suppose $latex D \in [0,1]$ is the fraction of times that she decides to aim for a white ball, i.e., the player's determination. It is intuitive that a larger value of $latex D$ would help her switch to the good habit in fewer iterations. It would be interesting to see if there is a threshold value of $latex D$ below which the bad habit is never broken. In particular, I expect there to be a threshold value $latex D(w, N-w)$, below which the bad habit is never broken, w.p. 1, where $latex w$ is the number of whites in the urn and $latex N-w$ is the number of blacks.
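A crude simulation can probe this (my own sketch; "broken" is arbitrarily defined here as the white fraction reaching 1/2 within a fixed horizon):

[sourcecode language="python"]
import random

def breaks_habit(w, b, D, horizon=5000):
    for _ in range(horizon):
        if random.random() < D:                # determination: aim for a white
            if random.random() < w / float(w + b):
                w += 1                         # got one: reinforce the good habit
        else:                                  # gave in: a black is guaranteed
            b += 1
        if w >= b:
            return True                        # white fraction reached 1/2
    return False

for D in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
    runs = 500
    rate = sum(breaks_habit(2, 10, D) for _ in range(runs)) / float(runs)
    print("D = %.1f: habit broken in %.0f%% of runs" % (D, 100 * rate))
[/sourcecode]

Whether this curve shows a sharp threshold in $latex D$ or a smooth rise is exactly the question posed above.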

Game over when wealth < 1 unit

Now let us further suppose that, in the course of playing the game, a player can never have less than 1 currency unit. In other words, the game stops if the total currency held by a player reaches zero. All other rules are the same: 1 unit is earned from a white ball and 1 unit from a black ball, but units earned from black balls decay to $latex \epsilon$ of their value every time slot. Deciding to pick a black guarantees a black, and an extra black goes in. Deciding to pick a white provides a white with probability equal to the fraction of whites in the urn, and in that case an extra white goes in. With the new rule stipulating 'game over' when the player's wealth falls below 1, the player is incentivized to pick a black whenever she approaches 1 unit of wealth. This is meant to model the inclination of individuals with low self-worth and happiness levels to engage in activity that yields short-term happiness (alcohol, drugs, binge eating, etc.) at the expense of long-term prospects.

Prime Factorization Trees

A positive integer can be visualized as a tree, structured by its prime factors. I wrote a little program which takes a positive integer as input, and creates a tree layout that shows its prime factors visually. Here's the tree for 70:

Screen shot 2012-10-07 at 9.55.37 AM

The tree is drawn recursively, starting with a root node in the center of a circular layout. The degree of the root node is the largest prime factor. Then, more prime factors are added recursively (including multiplicities of the same prime factor) in decreasing order of factor, by growing the tree. Finally, the leaf nodes are displayed as filled black circles. As seen from the tree above, the prime factorization of 70 is 7*5*2.

Here is the layout for 700 = 7 * 5 * 5 * 2 * 2:

Screen shot 2012-10-07 at 11.17.22 AM

This method of displaying numbers has been proposed and implemented by others on the internet, with much prettier results. I couldn't resist quickly trying it for myself using a graph visualization library, since the library does all the heavy lifting. I used Graphviz (developed at AT&T Labs), which I often use for visualizing network data. I also wanted to see if the underlying tree structure revealed by the edges in the graph makes the prime factorization more evident.

Here are a few more trees:

243 = 3^5:

Screen shot 2012-10-07 at 11.44.01 AM

43 (a prime):

Screen shot 2012-10-07 at 10.04.39 AM

128 = 2^7:

Screen shot 2012-10-07 at 11.04.56 AM

121 = 11 * 11:

Screen shot 2012-10-07 at 10.55.17 AM

625 = 5^4:

Screen shot 2012-10-07 at 11.42.04 AM

2310 = 11 * 7 * 5 * 3 * 2:

Screen shot 2012-10-07 at 11.09.27 AM

Code gist (without the prime factorization implementation and the graph library):

[sourcecode language="python"]
import random

# n and factorization() come from the prime factorization helper linked below
nodenames = set()
whitenode = ' [fixedsize=True, width=0.1, height=0.1, color = white, fontcolor = white, shape = circle]'
blacknode = ' [color = black, style = filled, fontcolor = black, shape = circle]'

def createnode(x):
    # x == 1: support (internal) node, drawn white; x == 0: real (leaf) node, drawn black
    name = random.random()
    while name in nodenames:
        name = random.random()
    nodenames.add(name)
    style = whitenode if x == 1 else blacknode
    print('"' + str(name) + '"' + style)
    return '"' + str(name) + '"'

def createedge(x, y, t):
    # x and y are nodes; t is the stage type (0 = last stage of the tree, 1 = all others)
    color = 'grey' if t == 0 else '"#DDDDDD"'
    print(x + '--' + y + ' [color=' + color + ']')
    return x + '--' + y

# flatten the factorization into a list of prime factors with multiplicity
pf = factorization(n)  # keys are prime factors, values are their multiplicities
factors = []
for fac in pf:
    for multiplicity in range(0, pf[fac]):
        factors.append(fac)
factors = sorted(factors, reverse=True)

print('graph G{')
print('overlap = false')
edgeset = set()
rootnode = createnode(1)
currentset = {rootnode}
for i in range(0, len(factors)):
    f = factors[i]
    nodetype = 0 if i == len(factors) - 1 else 1   # leaf nodes only at the last stage
    newnodeset = set()
    for eachnode in currentset:
        for j in range(0, f):                      # branch once per unit of this factor
            newnode = createnode(nodetype)
            newnodeset.add(newnode)
            edgeset.add(createedge(eachnode, newnode, nodetype))
    currentset = newnodeset
print('}')
[/sourcecode]

For prime factorization, I used this implementation.

Learning using Puzzles

Mathematical and logic puzzles are a great way to learn. I think they should be used to a greater extent in school and college education than they are, because they:

  1. Create an excitement in the mind of the student/solver.
  2. Force the solver to think about what method to use to attack the problem. This is much closer to real-world situations, where one is not told what area a problem is in. Tech companies recognize this when they use puzzles while interviewing job candidates as a way to observe how the candidate thinks.
  3. Over time, allow students to recognize patterns in problems and the best ways to attack them, so that when they are faced with a new, real-world problem, they have a good idea of what to try first.

I am certainly not the first one to note the educational value of puzzles. Martin Gardner, whose puzzles I have tremendously enjoyed, has himself advocated puzzle-based learning.

Each puzzle is usually based on one (or sometimes two, but very rarely more than two) fundamental areas of math: probability, number theory, geometry, critical thinking, etc. For example, take the following puzzle (try these puzzles before reading the answers at the end of the post):

[1] By moving just one of the digits in the following equation, make the equation true:

$latex 62 - 63 = 1$

It should be clear that this puzzle is part math and part logic.

Often there is more than one way to arrive at the solution -- e.g., using a geometric visualization for an algebraic problem in 2 or 3 variables. For example, take the following puzzle:

[2] A stick is broken up into 3 parts at random. What is the probability that the three parts form a triangle?

One may choose to solve the above problem using algebra, but it can be visualized geometrically too.

Speaking of geometric methods, there is also something to be said about solving puzzles inside one's head, rather than using pen and paper. I feel it adds to the thrill because it can be more challenging. It is also fun to visualize objects and their relationships in mathematical terms, completely inside one's head. The following is an example of a pretty hard puzzle, which was solved entirely inside one mathematician's head:

[3] Peter is given the product of two integers, both of which are between 2 and 99 inclusive. Sally is given only their sum. The following conversation takes place between them:

P: I cannot figure out the two numbers from the product alone.

S: I already knew that you couldn't have figured them out.

P: Oh, now that you say that, I know what the two numbers are.

S: I see. Now that you say you know what they are, I also know what they are.

What are the two integers?

___

ANSWERS:

[1] $latex 2^6 - 63 = 1$

[2] Without loss of generality, assume that the stick is of unit length. Since the three lengths add up to 1, and are each $latex \in [0,1]$, they can be treated as a random tuple $latex (x,y,z)$ in 3D space, lying on the 2-dimensional triangular simplex with vertices at (1,0,0), (0,1,0) and (0,0,1). The problem now boils down to determining what fraction of the surface of this triangle corresponds to tuples that can form a triangle. From the triangle inequality, we need $latex x + y > z$, and similarly for the other two pairs; equivalently, no piece may have length $latex \geq \frac{1}{2}$. A little mental visualization reveals that the feasible region is a smaller equilateral triangle inscribed within the simplex, with one-fourth the area of the simplex. The answer therefore must be $latex \frac{1}{4}$.
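The $latex \frac{1}{4}$ answer is also easy to sanity-check by simulation:

[sourcecode language="python"]
import random

def forms_triangle():
    u, v = sorted((random.random(), random.random()))  # two uniform cut points
    a, b, c = u, v - u, 1 - v                          # the three pieces
    return max(a, b, c) < 0.5    # triangle iff no piece is half the stick or more

trials = 1000000
hits = sum(forms_triangle() for _ in range(trials))
print(float(hits) / trials)      # comes out around 0.25
[/sourcecode]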

[3] The answer can be found in this PDF of a note written by EW Dijkstra, who solved the problem in his head.

Fresh stock of puzzles

The past few weeknights have been spent solving puzzles from Martin Gardner's books, which I really enjoy. I just came across a book by Peter Winkler, whose name I had heard before in talks by Muthu. After reading a few of the puzzles in Amazon's look-inside preview, I found the book too hard to resist buying. Also, here [PDF] is a beautiful set of 7 puzzles compiled by Peter Winkler on the occasion of a remembrance for Martin Gardner in 2006. Other places I have found good puzzles recently include this page maintained by Gurmeet Singh Manku. Puzzles + coffee + weekend afternoon = awesomeness.

Visualizing the IPv4 space using Hilbert Curves

The IPv4-heatmap is an awesome little program created by Duane Wessels. It allows the visualization of the entire 32-bit IPv4 address space in 2 dimensions, using the idea of a Hilbert curve, a type of space-filling curve; space-filling curves are fascinating constructs in and of themselves. Here is an image from Wikipedia showing an iterative process that creates higher-order Hilbert curves from lower-order ones.

Due to the properties of Hilbert curves, the IPv4 heatmap represents IP addresses that are numerically close to one another as physically proximate pixels. Think of each IP address as a 32-bit integer. Then 123.25.63.231 is just 1 larger than 123.25.63.230 in value, and the two IP addresses probably belong to the same organization. The program takes in a raw text file containing IP addresses and creates an image like the one below. The particular map below shows the intensity of peer-to-peer traffic that I measured during one five-minute segment. Since large blocks of IP addresses have well-known organizations as their holders, the map also displays this information as an overlay.
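The index-to-coordinate conversion at the heart of such a map is remarkably short. Below is the standard Hilbert d2xy routine (adapted from the public-domain pseudocode on the Wikipedia page; ipv4-heatmap itself is written in C, so this Python version is only to show the idea). On a 4096 x 4096 map, i.e. an order-12 curve, each pixel is one /24 block:

[sourcecode language="python"]
def d2xy(order, d):
    # map position d along a Hilbert curve of the given order to (x, y)
    x = y = 0
    s = 1
    while s < (1 << order):
        rx = 1 & (d // 2)
        ry = 1 & (d ^ rx)
        if ry == 0:                  # rotate the sub-square when needed
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        d //= 4
        s *= 2
    return x, y

def ip_to_pixel(ip):
    # place an IPv4 address on a 4096 x 4096 map: one pixel per /24 block
    a, b, c, _ = (int(octet) for octet in ip.split('.'))
    return d2xy(12, (a << 16) | (b << 8) | c)

print(ip_to_pixel("123.25.63.230"))
print(ip_to_pixel("123.25.63.231"))  # same /24, hence the very same pixel
[/sourcecode]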

Paper at ACM Mobisys 2011

My paper titled 'ProxiMate: Proximity-based Secure Pairing using Ambient Wireless Signals' has been accepted at the 9th Annual International Conference on Mobile Systems, Applications and Services (ACM MobiSys), to be held in Washington, D.C. The paper is about how existing wireless signals, such as FM radio and TV signals, can be used by radios in close proximity to generate a shared secret, and then use that secret to communicate securely. By building a shared key in this manner, it is hoped that the problems of identity spoofing and eavesdropping on wireless links can be addressed. The principles at work in ProxiMate are: (i) small-scale fading: the wireless channel between a distant RF source (such as an FM radio tower) and a receiver behaves as a random process, thereby serving as a source of randomness, and (ii) spatial correlation: receivers within half a wavelength of each other observe highly correlated (but not identical) versions of these random processes, giving wireless devices in proximity access to a shared source of common randomness that is not available to any sufficiently distant adversary. ProxiMate is a proof-of-concept system built on software-defined radio and evaluated using ambient FM and TV signals. It is intended to be information-theoretically secure -- meaning the key generation phase does not need to assume a computationally bounded adversary -- as opposed to, say, Diffie-Hellman, which must assume a computationally bounded adversary in order to guarantee that the observations available to the adversary will not allow it to compute the key.

On a side note, D.C. is one of the more boring cities to have a conference in once you have done the museums circuit. Anyhow, looking forward to it!
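Back to principle (ii): here is a toy numerical illustration of it (this is not the ProxiMate protocol -- the real system needs information reconciliation and privacy amplification on top -- and the noise levels are made up):

[sourcecode language="python"]
import numpy as np

rng = np.random.default_rng(1)
n = 1000
h = rng.standard_normal(n)                 # ambient fading process near Alice and Bob
alice = h + 0.1 * rng.standard_normal(n)   # Alice's correlated, noisy observation
bob = h + 0.1 * rng.standard_normal(n)     # Bob's, within half a wavelength
eve = rng.standard_normal(n)               # Eve is farther away: independent fading

bits_a, bits_b, bits_e = alice > 0, bob > 0, eve > 0          # 1-bit quantizer
print("Alice-Bob bit agreement:", (bits_a == bits_b).mean())  # high, but not perfect
print("Alice-Eve bit agreement:", (bits_a == bits_e).mean())  # ~0.5: a coin flip
[/sourcecode]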


Information asymmetry in the healthcare system

I suspect one of the problems in the health care system in America is information asymmetry. The moment a customer hands over an insurance ID to a doctor or hospital, the clinic knows a significant amount of information about the patient: his health history, his financial situation, and the type of insurance plan he has. The average patient knows almost nothing about the internal workings of the hospital. Try to think about how many bits of information the clinic possesses about the patient, and how many bits the patient possesses about the clinic. From a purely rational point of view, it may be in the best interest of the clinic to overcharge the patient for services he doesn't really need, knowing that his insurance will cover it. Even if the clinic gets caught doing this once in a while, the odds of getting caught may be low enough that the clinic may decide to pursue such a policy in the long run.

One simple change that could be made: do not provide insurance information about individual patients to the healthcare provider. This could be done by placing a government body as a liaison between the healthcare provider and the insurance companies. In this model, the hospital submits the bill to the government body, and this body follows up with the insurance company; the hospital/clinic and the insurance company never interact directly. Another way to disincentivize misbehavior on the part of clinics might be to create a public, review-supported ratings service, a la Yelp, for medical clinics and individual doctors. The problem with this is the potential for false reviews and sybil attacks -- which makes me wonder why apartmentratings.com seems to have so many fake reviews, while Yelp seems relatively cleaner.

Update: A large amount of medical records data has just been released by the Heritage Health Prize via Kaggle. While the intended purpose of this data is to host a competition to predict which patients are admitted to a hospital within 1 year of the time frame of the data, I wonder whether it could be used for other interesting investigations and verification of theories, such as the one above.


Comparing the communication problem with machine learning

In the problem of communicating information from point A to point B, we are faced with the task of dealing with a noisy channel that sits between A and B and corrupts or distorts the signal fed into it. A and B can be points in time instead of points in space -- e.g., recording data on some medium, such as a compact disc. Our goal is to take information produced by a source and map it to a signal that can be carried by our intended channel, in such a way that when this signal is fed into the channel at point A, the distorted version produced by the channel at point B can be used to faithfully recover the original information with little or no change in meaning. In solving this problem, we first try to understand the nature of the distortion that the channel creates, and based on this knowledge, we design an appropriate mapping between the original information emitted by the source and the signal that is fed into the channel. This process is called encoding. Using the knowledge of the type of distortion produced by the channel, we can then take the distorted signal at B and map it to our guess of the original information emitted by the source. This process is called decoding. The study of how to create efficient encoding and decoding strategies for specific types of channels is called coding theory.

In machine learning, we are placed at point B, and at point A is a source of information that is not under our control. There is no entity sitting at point A mapping the original information to a signal designed to withstand the distortions the channel will produce. The channel between A and B is 'nature'. We are faced with the task of figuring out the original information at the source by observing the signal emitted by the channel at B. To do this, we attempt to learn a function that correctly maps the signal output by the channel at B to the original information at A. In terms of the communication problem, we are attempting to learn the decoding rule, even though we do not have prior knowledge of the nature of the distortion produced by the channel. In supervised learning, we are given several input-output pairs of the channel, and we attempt to learn the function that correctly maps the output to the input. Once we have estimated this function, we apply it to estimate the original information for instances where only the output of the channel is known.
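To make the analogy concrete, here is a toy sketch in which the 'channel' is binary signaling plus Gaussian noise, and point B learns the decoding rule purely from labeled input-output pairs (a nearest-mean decoder, chosen for brevity; any classifier would do):

[sourcecode language="python"]
import numpy as np

rng = np.random.default_rng(0)
n = 5000
bits = rng.integers(0, 2, n)                      # information emitted at point A
sent = 2.0 * bits - 1.0                           # encoder: map {0,1} to {-1,+1}
received = sent + 0.8 * rng.standard_normal(n)    # the channel distorts the signal

# supervised learning at point B: estimate the decoder from labeled pairs
train, test = slice(0, 4000), slice(4000, n)
mean0 = received[train][bits[train] == 0].mean()
mean1 = received[train][bits[train] == 1].mean()
threshold = (mean0 + mean1) / 2.0                 # learned decision boundary

decoded = (received[test] > threshold).astype(int)
print("decoding accuracy: %.3f" % (decoded == bits[test]).mean())
[/sourcecode]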

Big data at Foursquare

Attended a meeting of the NY ML meetup. The speakers presented on the big data work happening at Foursquare. I noticed that the speakers as well as the audience were heavily tilted towards programming rather than statistics/ML. Most questions were about Hadoop, Hive and MongoDB; hardly any were about the statistical aspects. This could be because the audience understood the math behind recommendation systems quite well. One exception was the 'cold start' problem. One thing that surprised me was that Foursquare doesn't use matrix factorization, just simpler nearest-neighbor methods! (I asked.) Bell, Koren and Volinsky showed in the Netflix Prize that latent models based on matrix factorization are better than nearest neighbors.

Some of the audience questions contained interesting ideas. For example: collecting implicit feedback by noticing what a user does after being offered recommendations. I thought this was a really neat idea for tuning the recommendations for a user -- someone in the audience asked if they were doing it, and they replied that they aren't, but are thinking about it.


Voice morphing

http://mi.eng.cam.ac.uk/~hy216/VoiceMorphingPrj.html

Why this is so cool (and dangerous): Imagine that I have a large enough sample of your voice (easy if I have access to some of your phone messages or voice chats). I can then build a model of your voice. Using the technique in the papers above, I can then transform my voice into your voice *in real time*. This means that I may be able to defeat a speaker-recognition/speaker-verification system, thereby breaking biometric systems that rely on voice for authentication. For example, if the biometric system asks me to speak a particular random phrase or sentence, I speak into a computer, which modifies my voice to sound like your voice in software and then plays it back to the biometric system. Biometric systems should use 2- or 3-factor authentication -- biometrics alone are broken!