# Notes on non-parametric regression & smoothing

Ordinary least squares regression analysis makes a few strong assumptions about the data:

• A linear relationship of $$y$$ to the $$x$$'s, so that $$y \mid x = f(x) + \epsilon$$ with $$f$$ linear in its parameters.
• That the conditional distribution of $$y$$ is, except for its mean, everywhere the same, and that this distribution is normal: $$y \sim \mathcal{N}(f(x), \sigma^2)$$
• That observations are sampled independently, so that $$y_i$$ and $$y_j$$ are independent for $$i \neq j$$.

There are a number of ways in which these assumptions can fail, for example:
• The errors may not be independent (common in time series data)
• The conditional variance of the residual error may not be constant in $$x$$
• The conditional distribution of $$y$$ may be very non-Gaussian.

There are a few ways out, centering around non-parametric methods.

Kernel Estimation: This is really a non-parametric smoothing method. The y value estimated for a given input x is a weighted average of the labelled data points, where the weight of each point depends upon its distance from the input x.

$$\hat{y} = \frac{\sum_i w_i y_i}{\sum_i w_i},$$ where $$w_i = K\left(\frac{x - x_i}{h}\right)$$. $$K(\cdot)$$ is a kernel function, commonly taken to be the Gaussian kernel, and $$h$$ is a bandwidth parameter that dictates the extent to which labelled data points exert their influence in the weighted average. This weighted estimator is also called the Nadaraya-Watson estimator. Unlike LOESS, however, there is no fitting of parameters involved: the y value is simply the weighted average of the y values of all the labelled data points.
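As a sketch, the estimator above is only a few lines of NumPy. The function name and the noisy-sine data are illustrative choices, not part of any library:

```python
import numpy as np

def nadaraya_watson(x_query, x_train, y_train, h):
    """Nadaraya-Watson estimate at x_query: a kernel-weighted average
    of y_train, with Gaussian kernel K(u) = exp(-u^2 / 2)."""
    w = np.exp(-0.5 * ((x_query - x_train) / h) ** 2)  # w_i = K((x - x_i) / h)
    return np.sum(w * y_train) / np.sum(w)

# Illustrative data: noisy samples of a sine curve
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 200)
y = np.sin(x) + rng.normal(scale=0.1, size=x.size)

y_hat = nadaraya_watson(np.pi / 2, x, y, h=0.3)  # should land near sin(pi/2) = 1
```

Note that the bandwidth h does all the work here: a tiny h reproduces the nearest training point, while a huge h returns the global mean of y.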

LOESS / LOWESS: A non-parametric method (i.e. the amount of data one needs to keep around for the model specification grows linearly with the amount of training data) where, for every new input x for which a value of y is desired, one minimizes a weighted version of the quadratic loss used in OLS linear regression. A weight is attached to each training data point, and the weight of a training point depends upon its distance from the new input x; the locally fitted polynomial is then evaluated at x.
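A minimal sketch of the idea, fitting a weighted straight line around each query point. Gaussian distance weights are used here for brevity; classical LOWESS uses tricube weights over a nearest-neighbour span, and the function name is our own:

```python
import numpy as np

def loess_point(x0, x_train, y_train, h):
    """Fit a distance-weighted line a*x + b around x0, then evaluate it at x0."""
    w = np.exp(-0.5 * ((x0 - x_train) / h) ** 2)           # weight by distance to x0
    X = np.column_stack([x_train, np.ones_like(x_train)])  # local design matrix
    W = np.diag(w)
    # Weighted least squares: beta = (X' W X)^{-1} X' W y
    a, b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    return a * x0 + b
```

Unlike the Nadaraya-Watson average, this fits (and throws away) a small regression for every query point, which is what lets it track local slopes instead of flattening them.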

SVM regression: Implicitly transform the data to a higher-dimensional space using the kernel trick and then perform linear regression there, using a specific type of loss function (epsilon-insensitive) rather than the usual quadratic (squared) loss. The epsilon-insensitive loss encourages sparsity: training points whose residuals fall inside the epsilon-tube incur no loss and do not become support vectors. Imagine doing polynomial regression in the original space by including polynomial terms of the data; SVM regression achieves a similar effect via the non-linear kernel transform.
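To see why the epsilon-insensitive loss produces sparse support vectors, compare it with the squared loss on a couple of toy residuals (the eps value and residuals here are arbitrary):

```python
import numpy as np

def squared_loss(residual):
    """The usual OLS loss: every residual, however small, is penalised."""
    return residual ** 2

def epsilon_insensitive_loss(residual, eps):
    """SVM regression loss: residuals inside the eps-tube cost nothing,
    so the corresponding training points drop out of the solution."""
    return np.maximum(0.0, np.abs(residual) - eps)

inside = epsilon_insensitive_loss(0.05, eps=0.1)   # inside the tube: zero loss
outside = epsilon_insensitive_loss(0.30, eps=0.1)  # outside: linear, not quadratic
```

The linear (rather than quadratic) growth outside the tube also makes the fit less sensitive to outliers than squared loss.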

Gaussian Process regression: Also a non-parametric method, somewhat similar to LOESS in spirit in that all the data points need to be kept around (as in any non-parametric method), but this is a Bayesian method: it yields not just a point estimate of y for a new input x, but a complete posterior distribution. This is achieved by placing a prior on functions via a Gaussian Process (a GP is simply a collection of random variables, any finite subset of which has a joint Gaussian distribution), specified by a covariance kernel function that gives the covariance between two of the GP's random variables given their x values (typically, only the distance between the x values matters). The posterior distribution of y values for a given set of input x values also turns out to be Gaussian, with an easily computable mean and covariance matrix.
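The "easily computable" posterior can be sketched directly with a squared-exponential (RBF) covariance kernel. The function names, length scale, and noise level below are our own illustrative choices:

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2)):
    covariance depends only on the distance between the x values."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)

def gp_posterior(x_train, y_train, x_test, length_scale=1.0, noise=1e-4):
    """Posterior mean and covariance of a zero-mean GP at the test inputs."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)   # train/test covariances
    K_ss = rbf_kernel(x_test, x_test, length_scale)   # test/test covariances
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov
```

The diagonal of the returned covariance gives a per-point predictive variance, which is exactly the extra information a point-estimate method like LOESS does not provide.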